[Remote] Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

Work from home | Full-time role | Hiring

Note: This is a remote role open to candidates in the USA.

Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice agents at scale. The Site Reliability Engineer will build and operate the hybrid infrastructure foundation for AI/ML research and product development, focusing on creating a robust environment using Kubernetes, AWS, and Infrastructure-as-Code (Terraform).

Responsibilities

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle
  • Automate the lifecycle of single-tenant, managed deployments

Skills

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)
  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling
  • Familiarity with FinOps principles and cloud cost optimization strategies
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions
  • Experience in a multi-region or hybrid cloud environment

Benefits

  • Medical, dental, and vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions
  • For candidates outside of the US, we use an Employer of Record model in many countries, which means benefits are administered locally and governed by country-specific regulations. Because of this, benefits will differ by region — in some cases international employees receive benefits US employees do not, and vice versa. As we scale, we will continue to evaluate where we can create more alignment, but a 1:1 global benefits structure is not always legally or operationally possible.

Company Overview

  • Deepgram provides a voice artificial intelligence platform for speech-to-text, text-to-speech, and voice applications. It was founded in 2015, and is headquartered in San Francisco, California, USA, with a workforce of 51-200 employees. Its website is https://deepgram.com.

Company H1B Sponsorship

  • Deepgram has a track record of offering H1B sponsorships, with 2 in 2025, 1 in 2024, and 1 in 2022. Please note that this does not guarantee sponsorship for this specific role.
