Hydra Host

A distributed marketplace for compute

Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Team 11-50 | Since 2021 | H1B No Sponsor

Location

Florida

Posted

136 days ago

Salary

$140K - $200K / year

5 yrs exp, English, Cloud, Grafana, Kubernetes, Prometheus, Python, Go

Job Description

  • Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments
  • Evaluate and integrate monitoring and QA tools to find the right tools for the job
  • Create a unified monitoring platform and processes that datacenter and device teams will integrate with to monitor their components (live servers, lifecycle, networks, power, etc.)
  • Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms
  • Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments
  • Integrate all of the aforementioned systems to create holistic platform health reporting
  • Design disaster-recovery processes in collaboration with DevOps
  • Ensure we are meeting uptime SLAs across all platform deployments
  • Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs
  • Establish observability standards across the stack (logs, metrics, traces, and alerts), along with actionable on-call playbooks
  • Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability
  • Drive incident response, root cause analysis, and post-mortems
  • Translate incident follow-ups into tooling and process improvements
  • Establish the monitoring infrastructure and dashboards that enable everyone, from engineers to execs, to know what's going on
  • Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets.

Job Requirements

  • 5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments
  • Deep familiarity with monitoring and observability tooling: you've implemented and managed systems, esp. Prometheus, Grafana, and Zabbix
  • Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)
  • Track record of managing production system uptime and SLAs and building tools to support them
  • Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and process
  • Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks
  • Strong proficiency with infrastructure as code and DevOps workflows
  • Experience with distributed tracing, log aggregation, and alert tuning
  • Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently
  • Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders.

Benefits

  • Competitive compensation: base salary + performance bonus + equity
  • Exposure to high-performance computing and state-of-the-art GPU environments
  • A core role in ensuring our systems are reliable, observable, and meet customer SLAs
  • Remote work environment with a strong culture of ownership and autonomy
  • No red tape: find the right solution, work with the team, get feedback, and get the job done.

More DevOps Engineer Jobs

Staff/Principal Site Reliability Engineer

Veza

The data security platform built on the power of authorization.

DevOps Engineer | 136 days ago
Full Time | Remote | Team 51-200 | Since 2020 | H1B Sponsor

Staff/Principal Site Reliability Engineer leading infrastructure initiatives at Veza Technologies

AWS, Cloud, Distributed Systems, EC2, Grafana, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform, Go
United States
$184K - $240K / year
DevOps Engineer | 139 days ago
Full Time | Remote | Team 501-1,000 | Since 1999 | H1B Sponsor

Senior Site Reliability Engineer managing cloud services and infrastructure

Ansible, AWS, Azure, Chef, Cloud, ElasticSearch, Java, Linux, Logstash, Puppet, Python, Ruby, Unix, Go
United States
$80K - $100K / year

Senior Site Reliability Engineer

Circle

The all-in-one community platform for creators and brands. https://circle.so/

DevOps Engineer | 140 days ago
Full Time | Remote | Team 51-200 | Since 2019 | H1B Sponsor

Senior Site Reliability Engineer ensuring fast, reliable, and secure systems for Circle’s platform

AWS, Kubernetes, MySQL, Postgres, Redis
California
$130K - $140K / year

Intermediate DevOps Engineer

AbacusNext

Cloud-based tech provider for legal and accounting firms. AbacusLaw, Amicus Attorney, Amicus Cloud, OfficeTools, HotDocs

DevOps Engineer | 141 days ago
Full Time | Remote | Team 201-500 | Since 1983 | H1B No Sponsor

DevOps Engineer designing and implementing processes at CARET

Ansible, AWS, Azure, Cloud, DNS, Docker, Kubernetes, MongoDB, Python, Redis, SQL, Terraform
California
$90K - $110K / year