Senior DGX Cloud AI Infrastructure Software Engineer
LLM Engineer, Machine Learning Engineer, Full Time, Remote, Team 10,001+, Since 1993, H1B Sponsor
Location
California, Oregon, Texas, Washington
Posted
39 days ago
Salary
$184K - $287.5K / year
Bachelor's Degree, 8 yrs exp, English, Distributed Systems, Prometheus, Python
Job Description
• Develop infrastructure software and tools for large-scale pre-training, post-training, and inference.
• Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Enhance infrastructure and products underpinning NVIDIA's AI platforms.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Apply problem-solving, root cause analysis, and optimization skills.
• Root-cause, analyze, and triage failures from the application level to the hardware level.
Job Requirements
- 8+ years of experience developing software infrastructure for large-scale AI systems.
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
- Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
- Experience with observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
- Proven track record in building and scaling large-scale distributed systems.
- Experience with AI training and inferencing infrastructure services.
- Proficiency in programming languages such as Python, C/C++, and scripting languages.
- Experience in quality software engineering practices, including test development, defensive programming, version control, and CI.
- Excellent communication and collaboration skills, and a culture of diversity, intellectual curiosity, problem solving, and openness are essential.
Benefits
- equity
- benefits