Senior Site Reliability Engineer, AI Factory

DevOps Engineer | Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California

Posted

20 days ago

Salary

$176K - $333.5K / year

Bachelor Degree | 10 yrs exp | English | Packer

Job Description

  • Running commissioning and provisioning for GPU systems
  • Running the firmware versions of equipment and components, and communicating the supported versions across the organization
  • Through Day-2 operations, keeping tight SLOs around efficiency, performance, and availability
  • Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users attain peak performance constantly
  • Triaging the HW break-fix issues and making constant improvements using open-source break-fix solutions
  • Collaborate with programming and technical divisions to define and implement repeatable procedures
  • Develop and implement operations strategy & processes, maintaining consistency with SLAs across critically important infrastructure
  • Develop and apply procedures for minimal downtime and quality controls to strive to achieve continuous uptime
  • Feeding requirements to software and hardware teams
  • Creation of documentation that the ecosystem can use to run its own AI Data Centers

Job Requirements

  • BS or MS degree in Computer Engineering/Science, or related field (or equivalent experience) with 10+ overall years of meaningful work experience
  • Experience managing GPU Fleets
  • 10+ years of expertise in improving data center operations or critical infrastructure
  • Expertise in BMS & Power management
  • Background in working with Provisioning, Commissioning, and Config Management solutions
  • Experience working with Packer and developing QCOW2 images
  • Background in coordinating with remote hands
  • Experience working with Datacenter Inventory Management Systems like Netbox, Nautilus, or others
  • Proven track record of working with multiple teams to achieve operational excellence for an organization
  • Experience driving reliability with robust processes, rapid field response, and recovery

Benefits

  • equity
  • benefits
