Staff Site Reliability Engineer (Remote)

US, Remote

The ideal candidate is a technically strong, security‑minded problem‑solver who operates production systems with a calm, data‑driven approach, proactively improves tooling, and communicates effectively across teams while continually leveling up their own skills and those of the organization. This person must be able to drive change at the organizational level, collaborate across functions, influence stakeholders, and navigate complex organizational dynamics.

We need a seasoned engineer who is meticulous, self-motivated and thrives in a fast-paced environment. As a key member of our tech team, your expertise will be vital in shaping the future of our systems as we scale. This role is perfect for someone energetic, talented, and deeply committed to our mission of revolutionizing health technology.

Applicants seeking an easy job, a big corporation, a slow pace, or predictable 9-to-5 hours need not apply. This role requires energy, talent, and a genuine passion for our mission.

Primary Responsibilities:

As a Staff SRE, you will play a pivotal role in our DevOps/SRE department. Key responsibilities include:

Automate Software Delivery

Build and maintain robust CI/CD pipelines (e.g., GitHub Actions, Jenkins, Argo CD) that integrate automated testing, security scanning, and one‑click rollback to accelerate safe releases.

Operate Production Infrastructure

Provision, configure, and manage secure, highly‑available cloud (AWS, GCP, Azure) and on‑prem environments with Infrastructure as Code (Terraform, Pulumi, or CloudFormation).

Observe, Troubleshoot & Remediate

Instrument systems with metrics, logs, and traces (Prometheus, Grafana, Datadog, OpenTelemetry); own the on‑call rotation, rapidly diagnose incidents, and drive blameless post‑mortems.

Optimize Performance & Cost

Continuously assess latency, capacity, and cloud spend; tune applications and scale containerized workloads (Docker, Kubernetes) to meet SLAs while controlling costs.

Continuously Improve Tooling & Process

Research, evaluate, and standardize new tools or practices that boost reliability, security, or developer velocity; automate toil wherever it appears.

Collaborate & Coach

Partner with software, QA, and security teams to embed DevOps/SRE best practices; create clear documentation and share operational knowledge.

Troubleshoot Complex Distributed Systems

Investigate and resolve hard-to-diagnose issues in microservices architectures, data pipelines, and event-driven systems. Leverage deep knowledge of system internals, dependency graphs, and failure modes to trace problems across services, regions, and layers. Use distributed tracing (for example, with OpenTelemetry) and log correlation to pinpoint root causes and restore service health quickly.

Must‑Have Skills & Experience:

Experience: 12+ years in production SRE/DevOps or related software‑engineering roles, with a significant portion in a leadership capacity at a startup, high-growth company, or big tech company. This should include hands-on experience alongside team management.
Cloud: Deep hands‑on expertise with at least one major provider (AWS, GCP, or Azure), covering networking, IAM, and managed services
Infrastructure as Code: Proficient with Terraform (preferred) or similar IaC tooling; experienced in module design, remote state, and policy‑as‑code
CI/CD & Automation: Proven ability to design declarative pipelines and automate build, test, deploy, and rollback workflows
Containers & Orchestration: Strong knowledge of Docker image design and Kubernetes operations (Helm, controllers, service meshes)
Observability & Incident Response: Practical use of monitoring, logging, and tracing stacks; comfortable leading incident bridges and post‑incident analysis
Version Control & Collaboration: Fluency with Git workflows, code review culture, and clear written/verbal communication
Programming & Scripting: Proficiency in at least one language (Python, Go, or Bash) to automate tasks and build small services
Mindset: Self‑starter who is proactive, curious, and relentlessly focused on eliminating manual toil
Large-Scale System Development: Experience building and maintaining large-scale systems of significant complexity.
Passion for Healthcare Innovation: A deep-seated passion for leveraging technology to enhance healthcare and empower individuals in managing their health. Your commitment to making a difference in the healthcare sector is vital.

Nice-to-Have Qualifications:

Security or compliance experience (SOC 2, ISO 27001, FedRAMP)
Database operations at scale (PostgreSQL, MySQL, Redis, or MongoDB)
Observability platform tuning (Grafana Loki, Elastic, Honeycomb)
Relevant certifications (CKA/CKAD, AWS SA‑Pro, Terraform Associate)

To be a strong fit you also need:

Technical Excellence

Deep systems thinking—understands how code, infrastructure, networking, and data layers interact.
Writes clean, maintainable automation code (Python/Go/Bash) with tests and documentation.
Treats infrastructure as software: uses design patterns, code reviews, and CI for IaC.
Security‑first mindset—incorporates least‑privilege IAM, secret management, and shift‑left security scanning.

Operational Mindset

Bias for reliability—designs for failure, embraces SLOs/error budgets, and practices chaos testing.
Data‑driven troubleshooting—uses metrics, logs, and traces to form hypotheses and verify fixes.
Calm under pressure—follows incident‑response playbooks, communicates clearly, and learns from post‑mortems.

Collaboration & Leadership

Strong communicator—translates complex infra topics into clear language for engineers and executives.
Mentors peers—shares best practices, conducts brown‑bag sessions, and raises the team’s DevOps maturity.
Empathy for developers—builds self‑service tooling that removes friction from the dev workflow.

Growth & Innovation

Continuous learner—tracks emerging cloud‑native tools, attends meetups, and experiments in lab environments.
Pragmatic—balances new tech adoption with long‑term maintainability and business value.
Proactive problem‑solver—identifies toil or performance bottlenecks before they become incidents.

Personal Qualities

Ownership mentality—takes responsibility end‑to‑end, from design through production support.
Resilient & adaptable—thrives in fast‑growing, ambiguous environments and adjusts priorities quickly.
Customer‑centric—frames reliability work in terms of user impact and business outcomes.
