About the role

As Lead Site Reliability Engineer at Mistral AI, you will build cloud-agnostic, highly available infrastructure and lead the SRE team.

Responsibilities

Drive the infrastructure team, reporting to the Head of Engineering.
Empower and supervise the SRE team: remove obstacles, hire and onboard engineers, raise team performance, and handle project planning and task allocation.
Collaborate with stakeholders across engineering, science, and product management.
Design, build, and maintain scalable, highly available, fault-tolerant infrastructure for web services and ML workloads.
Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
Operate production systems: troubleshooting, on-call response, user administration, data extraction, and infrastructure scaling; perform root cause analyses.
Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
Implement and maintain CI/CD, containerization, orchestration, monitoring, logging, and alerting workflows for client-facing APIs and large training runs.
Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.

Requirements

10+ years of experience in a DevOps/SRE role.
Experience building and leading high-performing teams.
Experience with cloud computing and highly available distributed systems.
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
Experience working against reliability KPIs (observability, alerting, SLAs).
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
Experience with infrastructure-as-code tools (Terraform, CloudFormation).
Proficiency in scripting languages (Python, Go, Bash).
Understanding of networking, security, and system administration concepts.
Excellent problem-solving and communication skills.
Self-motivated and able to work well in a fast-paced startup environment.
Willingness to reside in or relocate to Paris or London (candidates based in France or the UK may be considered for remote work but must visit the office during onboarding and once a month thereafter).