Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Team 10,001+ | Since 1986 | H1B No Sponsor

Location

Arizona + 3 more

Posted

145 days ago

Salary

Not specified

Bachelor Degree | 5 yrs exp | English | AWS | Cloud | Kubernetes | OpenShift | OpenStack | Prometheus | Splunk | VMware

Job Description

• Collaborate with Technology Infrastructure teams to build and operate reusable, cloud-native platforms that abstract complexity and accelerate delivery while incorporating reliability from design through operations.
• Work with business units and technical teams to improve application availability, observability, and reliability as our business applications are migrated to the Private Cloud.
• Enhance platform reliability through automatic problem detection, self-healing systems, and well-architected notification and escalation protocols.
• Use SLOs, SLIs, and KPIs to guide prioritization, measure impact, and drive continuous improvement.
• Eliminate toil using intelligent automation and agentic workflows.
• Conduct blameless retrospectives and share learnings across the organization.
• Foster a culture of ownership, positive thinking, and continuous learning while remaining grounded in practicality, experimentation, and engineering excellence.
• Integrate DevSecOps, zero-trust principles, and policy-as-code into every pipeline.
• Produce and promote Architecture Decision Records (ADRs) and Cloud Well-Architected Frameworks that our business units can adopt to improve our technology standardization.
• Maintain 24x5 active coverage with seamless regional handoffs and weekend escalation protocols.

Job Requirements

5+ years of professional experience in an SRE role.
Minimum Bachelor’s degree in Computer Science, Engineering, or a related field.
Proven expertise in architecting, designing, and operating private cloud environments (e.g., VMware, OpenStack, OpenShift Virtualization) and Kubernetes clusters at scales ranging from micro to global.
Hands-on experience with building, deploying, and operating infrastructure as code platforms, CI/CD pipelines, and observability platforms (e.g., Prometheus, Splunk).
Strong understanding of modern systems reliability standards and practices, including establishing KPIs, monitoring and reporting on SLAs and SLOs, and sorting through the noise to establish actionable insights.
Familiarity with various financial services regulatory frameworks and their impact on infrastructure design and operations.
Familiarity with structured naming conventions and asset management for global infrastructure.
Experience with financial-grade network segmentation, micro-segmentation, and zero-trust architecture.
Certifications such as TOGAF, AWS Certified Solutions Architect, VMware VCP, or Red Hat Certified Architect are a plus.
Familiarity with ISO 27001, NIST 800-53, and other security frameworks is a plus.