Red Hat
The leading provider of enterprise open source solutions.
Forward Deployed Engineer, AI Inference, vLLM, Kubernetes
Location
California, New York, Massachusetts, Washington
Posted
14 days ago
Salary
$184.9K - $305.1K / year
8 yrs exp · English · Cloud · Kubernetes · Python · Terraform · Go
Job Description
• Orchestrate Distributed Inference: Deploy and configure LLM-D and vLLM on Kubernetes clusters.
• Optimize for Production: Go beyond standard deployments by running performance benchmarks, tuning vLLM parameters, and configuring intelligent inference routing policies to meet SLOs for latency and throughput.
• Code Side-by-Side: Work directly with customer engineers to write production-quality code (Python/Go/YAML) that integrates our inference engine into their existing Kubernetes ecosystem.
• Solve the "Unsolvable": Debug complex interaction effects between specific model architectures (e.g., MoE, large context windows), hardware accelerators (NVIDIA GPUs, AMD GPUs, TPUs), and Kubernetes networking (Envoy/Istio).
• Feedback Loop: Act as the "Customer Zero" for our core engineering teams. You will channel field learnings back to product development, influencing the roadmap for LLM-D and vLLM features.
• Travel only as needed to customers to present, demo, or help execute proof-of-concepts.
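The SLO tuning described above usually starts from latency percentiles measured under load. As a minimal, vendor-neutral sketch (the function name and the synthetic latencies are illustrative, not part of any vLLM API), here is how benchmark results might be checked against a p99 latency target:

```python
import statistics

def latency_slo_report(latencies_ms, slo_p99_ms):
    """Summarize per-request latencies and check them against a p99 SLO.

    latencies_ms: end-to-end request latencies in milliseconds.
    slo_p99_ms:   the p99 latency target from the service-level objective.
    """
    # statistics.quantiles with n=100 returns the 1st..99th percentiles.
    percentiles = statistics.quantiles(latencies_ms, n=100)
    p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p99_ms": p99,
        "meets_slo": p99 <= slo_p99_ms,
    }

if __name__ == "__main__":
    # Synthetic latencies standing in for real benchmark measurements.
    samples = [20.0 + i * 0.5 for i in range(200)]
    print(latency_slo_report(samples, slo_p99_ms=120.0))
```

In practice the raw latencies would come from a load generator hitting the inference endpoint; the percentile math and SLO check stay the same.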
Job Requirements
- 8+ Years of Engineering Experience: You have 8+ years of hands-on experience in backend systems, SRE, or infrastructure engineering.
- Customer Fluency: You can translate between systems-engineering detail and business value when working with customer stakeholders.
- Bias for Action: You prefer rapid prototyping and iteration over theoretical perfection. You are comfortable operating in ambiguity and taking ownership of the outcome.
- Deep Kubernetes Expertise: You are fluent in K8s primitives, from defining custom resources (CRDs, Operators, Controllers) to configuring modern ingress via the Gateway API.
- AI Inference Proficiency: You understand how an LLM forward pass works. You know what KV Caching is, why prefill/decode disaggregation matters, why context length impacts performance, and how continuous batching works in vLLM.
- Systems Programming: Proficiency in Python (for model interfaces) and Go (for Kubernetes controllers/scheduler logic).
- Infrastructure as Code: Experience with Helm, Terraform, or similar tools for reproducible deployments.
- Cloud & GPU Hardware Fluency: You are comfortable provisioning GPU capacity and deploying LLMs on both bare-metal and hyperscaler Kubernetes clusters.
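The continuous batching mentioned in the requirements can be illustrated with a toy step-level scheduler (this is a simplified sketch, not vLLM's actual implementation): new requests join the running batch at each decode step as slots free up, instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    tokens_to_generate: int
    generated: int = 0

def continuous_batching(arrivals, max_batch_size):
    """Toy step-level scheduler illustrating continuous batching.

    arrivals: dict mapping step index -> list of Requests arriving then.
    Returns a dict mapping request id -> step at which it finished.
    """
    waiting = deque()
    running = []
    finished_at = {}
    step = 0
    while waiting or running or any(s >= step for s in arrivals):
        # Admit newly arrived requests to the waiting queue.
        waiting.extend(arrivals.get(step, []))
        # Continuous batching: refill the running batch every step,
        # rather than only when the whole batch has finished.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        for req in running:
            req.generated += 1
        # Retire completed requests, freeing their slots immediately.
        still_running = []
        for req in running:
            if req.generated >= req.tokens_to_generate:
                finished_at[req.rid] = step
            else:
                still_running.append(req)
        running = still_running
        step += 1
    return finished_at
```

With static batching, a short request arriving mid-batch would wait for the longest in-flight request to finish; here it is admitted as soon as any slot frees, which is the core throughput and latency win the requirement alludes to.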
Benefits
- Comprehensive medical, dental, and vision coverage
- Flexible Spending Account - healthcare and dependent care
- Health Savings Account - high deductible medical plan
- Retirement 401(k) with employer match
- Paid time off and holidays
- Paid parental leave plans for all new parents
- Leave benefits including disability, paid family medical leave, and paid military leave
- Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!