Senior DGX Cloud AI Infrastructure Software Engineer

LLM Engineer | Machine Learning Engineer | Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California, Oregon, Texas, Washington

Posted

39 days ago

Salary

$184K - $287.5K / year

Bachelor's Degree | 8 yrs exp | English | Distributed Systems | Prometheus | Python

Job Description

  • Develop infrastructure software and tools for large-scale pre-training, post-training, and inference.
  • Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency.
  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
  • Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.
  • Apply problem-solving, root cause analysis, and optimization skills.
  • Root-cause, analyze, and triage failures from the application level to the hardware level.

Job Requirements

  • 8+ years of experience developing software infrastructure for large-scale AI systems.
  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
  • Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
  • Experience with observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
  • Proven track record in building and scaling large-scale distributed systems.
  • Experience with AI training and inferencing infrastructure services.
  • Proficiency in programming languages such as Python and C/C++, as well as scripting languages.
  • Experience in quality software engineering practices, including test development, defensive programming, version control, and CI.
  • Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential.

Benefits

  • equity
  • benefits
