Moonlite is building a cloud-native experience on-prem. Our software provides the control and customization enterprises need for AI.
Build Faster with Moonlite
Instantly download and deploy NIMS from NVIDIA or build your own applications with Hugging Face. Customize and deploy AI agents in one click or integrate your own with ease.
Total Control Over Your AI
Obtain the highest level of security by design for your private environments. Moonlite provides total visibility into all your resources, applications, and users.
Find Value with Your Use Case
Allocate resources in real-time as needed in your environment. Use the models that best align with your use cases. When a new model is released, test it out and power your applications with it.
Sr. Site Reliability Engineer (SRE)
Location
Indiana, Illinois
Posted
38 days ago
Salary
$165K - $225K / year
Job Description
Job Requirements
- Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
- Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
- Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
- Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
- Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
- Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
- Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
- Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
- Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
- Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
- Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.