Staff Software Engineer, ML Platform
Location: United States
Posted: 61 days ago
Salary: Not specified
10 yrs exp, Experience accepted, English, AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Terraform, TypeScript, Go
Job Description
• Build Enterprise-Scale Infrastructure
  • Leverage infrastructure-as-code to manage complex cloud environments supporting critical ML and AI initiatives.
  • Design Kubernetes-native systems, including controllers/operators where appropriate.
  • Improve platform networking, security, and observability.
• Sustain Platform Health and Performance
  • Own critical systems in production, including reliability, scalability, security, and cost efficiency.
  • Identify and proactively address technical debt, operational risk, and platform bottlenecks.
  • "Learn by doing": quickly ramp up on a complex tech stack (Terraform, Kubernetes, Istio, Crossplane, Go, TypeScript).
• Enable Teams and Customers to Move Faster
  • Create abstractions and tooling that make it easier for teams and customers to deploy, run, and scale AI/ML workloads.
  • Collaborate directly with customers to understand their ML infrastructure challenges and translate them into platform improvements.
  • Balance speed and rigor, shipping quickly while maintaining a high bar for quality and safety.
• Lead Through Influence
  • Act as a technical leader and mentor across the engineering organization.
  • Write clear documentation and design proposals that align stakeholders and drive decisions.
  • Partner closely with product and leadership to shape platform direction and priorities.
Job Requirements
- 10+ years of engineering experience, with significant time spent on infrastructure, platform, or distributed systems.
- Deep hands-on experience with Kubernetes in production environments.
- Strong cloud experience across AWS, GCP, and/or Azure.
- Proven track record of building and operating secure, scalable MLOps platforms.
- Deep understanding of infrastructure-as-code (e.g., Terraform, Pulumi, CDK).
- Strong programming skills in at least one backend language (Go preferred; TypeScript also welcome).
- Experience diagnosing and debugging complex production issues.
- Familiarity with modern CI/CD, test-driven development, and DevSecOps practices.
- Bonus: experience building Kubernetes operators and/or working with service meshes (e.g., Istio).
- Comfortable owning large, ambiguous problems from inception to production.
- Excellent communicator, able to clearly explain complex systems to both technical and non-technical audiences.
- Experience working directly with customers and incorporating feedback into technical decisions.
- Ability to operate autonomously while keeping stakeholders informed and aligned.
- Customer-first and product-oriented.
- Curious, adaptable, and eager to learn new systems and domains.
- Collaborative, respectful, and willing to lean into hard conversations.
- Energized by fast-paced environments and meaningful responsibility.
Benefits
- Competitive cash compensation alongside above-market equity upside
- Top-tier fully covered medical, dental, and vision insurance
- Life insurance
- 401(k) program
- Unlimited PTO
- Monthly half day
- Citi Bike membership
- Monthly wellness stipend
- Office equipment stipend, including reimbursement for approved disability-related accommodations
- Investment in employee learning and growth opportunities