MixMode
Automated threat detection, unparalleled network visibility, & deep guided investigation powered by Self-Supervised AI.
Senior Software Reliability Engineer – AI
Location: California
Posted: 62 days ago
Salary: Not specified
Bachelor Degree, 7 yrs exp, English, Distributed Systems, Java, Kafka, Kotlin, Kubernetes, MySQL, Postgres, Python, Scala, Spark
Job Description
• Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
• Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
• Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
• Design and build monitoring, alerting, and debugging tools for high-availability services.
• Partner with researchers and ML engineers to productionize models at scale.
• Establish best practices for testing, deployment, capacity planning, and incident response.
• Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.
Job Requirements
- 7+ years of professional software engineering experience
- Strong proficiency in Python and at least one JVM language (Java, Scala, or Kotlin preferred)
- Proven experience designing, building, and operating distributed systems in production
- Strong understanding of service architecture, concurrency, resource management, and distributed failure modes
- Prior experience with streaming data pipelines (e.g., Spark Streaming, Flink, Kafka)
- Hands-on experience running production services on Kubernetes, including pod lifecycle management and fault tolerance
- Strong experience with relational databases (e.g., PostgreSQL, MySQL), including query performance analysis, indexing, and connection management
- Demonstrated ability to diagnose and resolve performance, scalability, and reliability issues across application, database, and infrastructure layers
- Experience implementing automated testing (unit, integration, end-to-end) and production observability (logging, metrics, tracing)
- Experience collaborating with ML or data science teams to productionize predictive systems (ML expertise is not required)
- Ability to improve system architecture and engineering practices over time through design, code review, and mentorship
Benefits
- Remote-First Work Culture
- Healthcare (Medical, Dental, Vision, Accident)
- Basic & Voluntary Life and AD&D
- Flexible Spending Account (FSA)
- 401(k) with Employer Match
- Paid Holidays & Flexible Paid Time Off (PTO)