The Voleon Group

Applying statistical machine learning to investment management.

Senior Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Team 51-200, Since 2007, H1B No Sponsor

Location

California

Posted

163 days ago

Salary

$205K - $235K / year

Bachelor Degree, 5 yrs exp, English, Ansible, AWS, Cloud, Google Cloud Platform, Grafana, Prometheus, Python, Ruby, Terraform

Job Description

  • Help scale the research compute cluster to meet growing needs.
  • Leverage engineering skills to ensure high degrees of uptime, reliability, and robustness.
  • Keep research clusters available and performant.
  • Provide a world-class HPC platform for researchers focusing on machine learning problems at scale.
  • Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
  • Collaborate with engineering teams to develop monitoring and telemetry improvements.
  • Design and oversee operational frameworks to ensure cluster operations meet SLAs.

Job Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
  • Experience with cloud infrastructure (AWS or GCP).
  • Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
  • Experience with distributed storage technologies (Lustre, Ceph, S3).
  • Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
  • Bachelor's degree in computer science or equivalent experience.

Benefits

  • medical, dental and vision coverage
  • life and AD&D insurance
  • 20 days of paid time off
  • 9 sick days
  • 401(k) plan with a company match
  • “Friends of Voleon” Candidate Referral Program
