Deepgram

Building foundational AI for speech transcription and understanding.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, Terraform

DevOps Engineer, Full Time, Remote, Team 51-200, Since 2015, H1B Sponsor

Location

United States

Posted

25 days ago

Salary

$160K - $220K / year

5 yrs exp, English, AWS, Kubernetes, Python, Terraform, Go

Job Description

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.

Job Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 Paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions
