Deepgram

Building foundational AI for speech transcription and understanding.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

DevOps Engineer | Full Time | Remote | Team 51-200 | Since 2015 | H1B Sponsor

Location

United States

Posted

6 days ago

Salary

$150K - $220K / year

Bachelor Degree | 5 yrs exp | English | AWS | Kubernetes | Python | Terraform | Go

Job Description

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the life cycle of single-tenant, managed deployments.

Job Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 Paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions
