Hydra Host
A distributed marketplace for compute
Site Reliability Engineer
Location
Florida
Posted
136 days ago
Salary
$140K - $200K / year
5 yrs exp, English, Cloud, Grafana, Kubernetes, Prometheus, Python, Go
Job Description
• Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments
• Evaluate monitoring and QA tools and integrate the ones best suited to the job
• Create a unified monitoring platform and processes that datacenter and device teams will integrate to monitor their components (live servers, lifecycle, networks, power, etc.)
• Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms
• Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments
• Integrate all of the aforementioned systems to produce holistic platform health reporting
• Design disaster-recovery processes in collaboration with DevOps
• Ensure we are meeting uptime SLAs across all platform deployments
• Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs
• Establish observability standards across the stack: logs, metrics, traces, alerts, and actionable on-call playbooks
• Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability
• Drive incident response, root cause analysis, and post‑mortems
• Turn incident findings into tooling and process improvements
• Establish the monitoring infrastructure and dashboards that enable everyone — from engineers to execs — to know what’s going on
• Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets.
Job Requirements
- 5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments
- Deep familiarity with monitoring and observability tooling: you've implemented and managed such systems, especially Prometheus, Grafana, and Zabbix
- Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)
- Track record of managing production system uptime and SLAs and building tools to support them
- Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and process
- Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks
- Strong proficiency with infrastructure as code and DevOps workflows
- Experience with distributed tracing, log aggregation, and alert tuning
- Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently
- Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders.
Benefits
- Competitive compensation: base salary + performance bonus + equity
- Exposure to high-performance computing and state-of-the-art GPU environments
- A core role in ensuring our systems are reliable, observable, and meet customer SLAs
- Remote work environment with a strong culture of ownership and autonomy
- No red tape: find the right solution, work with the team, get feedback, and get the job done.