A distributed marketplace for compute

Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Team 11-50 | Since 2021 | H1B No Sponsor

Location

Florida

Posted

136 days ago

Salary

$140K - $200K / year

5 yrs exp | English | Cloud | Grafana | Kubernetes | Prometheus | Python | Go

Job Description

• Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments
• Evaluate and integrate monitoring and QA tools to find the right tools for the job
• Create a unified monitoring platform and processes that datacenter and device teams will integrate with to monitor their components (live servers, lifecycle, networks, power, etc.)
• Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms
• Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments
• Integrate all of the aforementioned systems to produce holistic platform health statistics and reporting
• Design disaster-recovery processes in collaboration with DevOps
• Ensure we are meeting uptime SLAs across all platform deployments
• Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs
• Establish observability standards across the stack: logs, metrics, traces, alerts, and actionable on-call playbooks
• Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability
• Drive incident response, root cause analysis, and post-mortems
• Turn incident findings into tooling and process improvements
• Establish the monitoring infrastructure and dashboards that enable everyone, from engineers to execs, to know what's going on
• Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets

Job Requirements

5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments
Deep familiarity with monitoring and observability tooling: you've implemented and managed such systems, especially Prometheus, Grafana, and Zabbix
Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases)
Track record of managing production system uptime and SLAs and building tools to support them
Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and process
Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks
Strong proficiency with infrastructure as code and DevOps workflows
Experience with distributed tracing, log aggregation, and alert tuning
Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently
Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders.

Benefits

Competitive compensation: base salary + performance bonus + equity
Exposure to high-performance computing and state-of-the-art GPU environments
A core role in ensuring our systems are reliable, observable, and meet customer SLAs
Remote work environment with a strong culture of ownership and autonomy
No red tape: find the right solution, work with the team, get feedback, and get the job done.