Automated threat detection, unparalleled network visibility, & deep guided investigation powered by Self-Supervised AI.

Senior Software Reliability Engineer – AI

DevOps Engineer | Full Time | Remote | Team 11-50 | H1B No Sponsor

Location

California

Posted

62 days ago

Salary

Not specified

Bachelor Degree | 7 yrs exp | English | Distributed Systems | Java | Kafka | Kotlin | Kubernetes | MySQL | Postgres | Python | Scala | Spark

Job Description

• Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
• Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
• Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
• Design and build monitoring, alerting, and debugging tools for high-availability services.
• Partner with researchers and ML engineers to productionize models at scale.
• Establish best practices for testing, deployment, capacity planning, and incident response.
• Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.

Job Requirements

7+ years of professional software engineering experience
Strong proficiency in Python and at least one JVM language (Java, Scala, or Kotlin preferred)
Proven experience designing, building, and operating distributed systems in production
Strong understanding of service architecture, concurrency, resource management, and distributed failure modes
Prior experience with streaming data pipelines (e.g., Spark Streaming, Flink, Kafka)
Hands-on experience running production services on Kubernetes, including pod lifecycle management and fault tolerance
Strong experience with relational databases (e.g., PostgreSQL, MySQL), including query performance analysis, indexing, and connection management
Demonstrated ability to diagnose and resolve performance, scalability, and reliability issues across application, database, and infrastructure layers
Experience implementing automated testing (unit, integration, end-to-end) and production observability (logging, metrics, tracing)
Experience collaborating with ML or data science teams to productionize predictive systems (ML expertise is not required)
Ability to improve system architecture and engineering practices over time through design, code review, and mentorship