Site Reliability Engineer - AI Infrastructure
Location: United States
Posted: 7 days ago
Salary: Not specified
Job Description
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster — but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute — a marketplace that moves the infrastructure and workloads powering AGI much like capital flows through the world's financial markets.
We are expanding to new frontiers to find the brightest minds working in AI infrastructure, research, and engineering.
What You’ll Do
Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
Build automation and tooling to streamline cluster deployments and integrations.
Debug customer issues across networking, storage, scheduling, and system layers.
Improve reliability and scalability of both training and inference infrastructure.
Design and implement monitoring, alerting, and observability for critical systems.
Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
Participate in on-call and incident response, leading postmortems and reliability improvements.
What We’re Looking For
5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Strong Linux systems and networking fundamentals.
Deep experience with Kubernetes and container orchestration at scale.
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
Strong automation and scripting skills (Python, Go, or Bash).
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
Track record of operating production systems and leading incident response.
Nice to Have
Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
Customer-facing support or consulting experience.
Why You’ll Love It Here
This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.