Staff DevOps Engineer designing and architecting secure, scalable infrastructure for AI workloads at webAI. Leading technical initiatives and mentoring engineers on cloud architecture and reliability best practices.
Responsibilities
Design and architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
Lead incident response and reliability initiatives, participate in the on-call rotation, conduct post-mortems, and drive continuous improvement in system reliability
Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
Drive technical documentation and knowledge sharing including runbooks, architecture decision records (ADRs), and infrastructure standards
Requirements
7+ years of hands-on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering with a proven track record of architecting production systems
Expert-level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud-native technologies in production environments
5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large-scale environments (50+ cloud resources)
Deep experience with cloud platforms (AWS, Azure, or GCP) including compute, networking, storage, and managed services
Proven experience building and scaling CI/CD pipelines with integrated security controls (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Strong programming skills in Python (preferred for automation), Bash, or Go for infrastructure tooling and automation
Production experience with observability and monitoring tools: Prometheus, Grafana, ELK, CloudWatch, Datadog, or similar
Experience with MLOps workflows: model deployment automation, versioning, and lifecycle management
Demonstrated experience with GitOps methodologies and declarative infrastructure management
Strong understanding of security best practices: encryption, secrets management, identity and access management (IAM), network security
Excellent written and verbal communication skills for technical documentation and cross-functional collaboration
Benefits
Competitive salary and performance-based incentives
Comprehensive health, dental, and vision benefits package
401(k) match (US-based only)
$200/month Health and Wellness Stipend
$400/year Continuing Education Credit
$500/year Function Health subscription (US-based only)