We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Staff Site Reliability Engineer
Location: United States
Posted: 1 day ago
Salary: Not specified
Job Description
Role Description
This is a senior, hands-on role within a small, high-leverage SRE team, responsible for ensuring the reliability, scalability, and security of a high-growth digital financial platform. The Staff SRE will architect, automate, and optimize cloud infrastructure, focusing on operational excellence and system resilience. You will collaborate closely with engineering, product, and security teams to embed reliability into every layer of the platform while mentoring fellow engineers and shaping long-term infrastructure strategy. This role provides the opportunity to directly impact platform performance, member trust, and product velocity through robust monitoring, incident prevention, and automation. You will lead initiatives across GCP environments, cloud networking, Kubernetes, and IaC, while exploring innovative automation solutions, including LLM-driven tooling, to reduce toil and improve operational efficiency. This position is ideal for a systems thinker who thrives in ambiguous, high-impact environments and wants to build resilient, scalable services for millions of users.
Accountabilities
- Lead architecture and automation across cloud infrastructure, ensuring reliability, scalability, security, and cost-effectiveness.
- Define and operate SLIs, SLOs, and error budgets, translating reliability goals into measurable business outcomes.
- Design and optimize multi-region, disaster recovery, and capacity planning strategies to support platform growth.
- Manage and optimize cloud networking, including VPC architecture, ingress/egress, Cloud Armor, VPN, and DNS.
- Drive infrastructure-as-code and GitOps practices using Terraform, Kubernetes, Helm, and ArgoCD to enable repeatable, predictable deployments.
- Mentor SREs and infrastructure engineers through hands-on collaboration, design reviews, and incident retrospectives.
- Partner with cross-functional teams to align platform decisions with product velocity, security, and long-term durability.
Qualifications
- 8+ years of experience in software, infrastructure, or site reliability engineering.
- 5+ years of hands-on experience operating production systems in GCP (compute, networking, storage, IAM, observability).
- Deep experience with Kubernetes (GKE), Helm, containerization, Terraform (IaC), and ArgoCD.
- Strong programming skills in Python, Go, or TypeScript/JavaScript for automation and internal tooling.
- Proven ability to define and operate against SLIs, SLOs, and error budgets.
- Strong knowledge of relational and distributed databases (e.g., MySQL, Cloud SQL, Cloud Spanner, Redis) including performance tuning and HA strategies.
- Experience leading incident response, root cause analysis, and systemic remediation.
- Bonus: Experience in fintech or regulated environments, CI tooling familiarity, and high-growth startup experience.
Benefits
- Competitive compensation and benefits package.
- Premium Medical, Dental, and Vision Insurance plans.
- 401(k) savings plan with matching contributions.
- Flexible PTO and generous company holidays, including Juneteenth and Winter Break.
- Paid parental and caregiver leave.
- Flexible hours with a virtual-first work culture and home office stipend.
- Opportunities for professional growth, mentorship, and impactful work on a high-growth platform.
- Company-sponsored in-person and virtual events for team connection.