The Voleon Group
Applying statistical machine learning to investment management.
Senior Site Reliability Engineer
DevOps Engineer, Full Time, Remote, Team 51-200, Since 2007, H1B No Sponsor
Location
California
Posted
163 days ago
Salary
$205K - $235K / year
Bachelor Degree, 5 yrs exp, English, Ansible, AWS, Cloud, Google Cloud Platform, Grafana, Prometheus, Python, Ruby, Terraform
Job Description
• Help scale the research compute cluster to meet growing needs.
• Leverage engineering skills to ensure high degrees of uptime, reliability, and robustness.
• Keep research clusters available and performant.
• Provide a world-class HPC platform for researchers focusing on machine learning problems at scale.
• Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
• Collaborate with engineering teams to develop monitoring and telemetry improvements.
• Design and oversee operational frameworks to ensure cluster operations meet SLAs.
Job Requirements
- 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
- Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
- Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
- Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
- Experience with cloud infrastructure (AWS or GCP).
- Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
- Experience with distributed storage technologies (Lustre, Ceph, S3).
- Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
- Bachelor's degree in computer science or equivalent experience.
Benefits
- medical, dental and vision coverage
- life and AD&D insurance
- 20 days of paid time off
- 9 sick days
- 401(k) plan with a company match
- “Friends of Voleon” Candidate Referral Program