Elevating Autism & IDD Care through Technology
Senior Site Reliability Engineer
Location: United States
Posted: 1 day ago
Salary: $160K - $180K / year
Job Description
Role Description
As a Sr. SRE, you will work closely with the key stakeholders in Software Engineering to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership.
- Own availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; and create dashboards.
- Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs.
- Manage site stability, performance, reliability, and maintain uptime for production environments.
- Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on the usage patterns.
- Strive for automation to reduce toil and increase development velocity.
- Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
- Use a data-driven approach to identify product architecture changes that improve reliability, performance, and availability.
- Document resolution run books and standard operating procedures.
- Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
- Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana).
- Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture.
Qualifications
- Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
- Solid experience with monitoring/APM/observability tools (Splunk, New Relic, etc.).
- Experience implementing observability plans around logs, metrics, and traces.
- Experience in an agile development team developing software.
- Experience with cloud infrastructure environments, preferably AWS, and Infrastructure as code (Terraform, CloudFormation).
- Extensive experience with Docker, Kubernetes, Helm, CI/CD and config management tools like Ansible, Chef.
- Strong experience with containerization technology and/or Kubernetes.
- Experience with Release automation, system administration, configuration management.
- Experience with programming languages (Java, Python, Go, etc.).
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
- Strong interpersonal and teaming skills - ability to set and enforce process and influence engineers who are not direct reports.
- Strong analytical and programming skills (Python, Go, Java etc.).
- Deep understanding of best practices for modern cloud security.
- Proven experience building observability for security concerns, such as privilege escalations and bot detection.
Requirements
- Location: Hybrid from Holmdel, New Jersey or Fort Lauderdale, Florida; remote candidates located in other U.S. states will be considered for the right individual.
- In-person interview or face-to-face meeting required for fully remote roles prior to the first day of employment.
Benefits
- Competitive compensation.
- Comprehensive health benefits.
- Generous PTO.
- 401(k) matching.
- Paid parental leave for full-time employees.
- Hybrid work schedules.
- Career development support.
- Wellness programs.
- Opportunities to give back through CR Cares™, our community engagement initiative.