SRE Team Lead in charge of reliability strategy and operational maturity for a cybersecurity SaaS platform. Leading a specialized team to enhance system performance and incident management.
Responsibilities
Leading and mentoring an experienced SRE team while defining and driving the reliability strategy of a production-grade cybersecurity SaaS platform.
Designing and evolving multi-cluster Kubernetes environments across cloud providers while owning availability, performance, and incident management processes.
Establishing and enforcing SLOs/SLAs, error budgets, and production standards.
Driving infrastructure as code and automation standards (Terraform, CI/CD) while improving observability, monitoring, and operational visibility across the system.
Performing and leading root cause analysis for complex production incidents.
Partnering with R&D, Security, and Product to align reliability with rapid delivery.
Shaping architectural decisions at the platform level.
Requirements
Have 2+ years leading or managing infrastructure/SRE teams.
Have solid hands-on Kubernetes production experience.
Have experience operating cloud environments (GCP, AWS, Azure, or similar) with a good understanding of reliability engineering principles (SLOs, SLAs, error budgets).
Have experience with infrastructure as code and automation (Terraform, Ansible).
Have software engineering experience with exceptional Linux and networking troubleshooting skills.
Have proven experience handling production incidents and conducting root cause analysis, and the ability to drive technical standards across teams.
Demonstrate clear and structured communication skills.
DevOps and Build Engineer for NVIDIA developing and maintaining CI/CD pipelines. Collaborating with teams to enhance compiler technologies and optimize build performance in a diverse environment.
Senior AWS DevOps Developer responsible for managing AWS infrastructure for enterprise public budgeting software at Euna Solutions. Collaborating on cloud projects and enhancing system reliability and performance.
Principal AI Site Reliability Engineer driving operational excellence for critical contact center applications at Fidelity. Leading automation and observability initiatives to improve reliability and efficiency.
Data Transport Infrastructure DevOps Engineer at Leidos modernizing global-scale multi-cloud environments for USAF missions. Involves developing cloud-native solutions and ensuring security best practices.
DevOps Engineer responsible for building and optimizing AWS-based infrastructure and backend systems at Allguth GmbH. Part of a team focused on innovative mobility solutions in the Munich region.
(Senior) DevOps Engineer specializing in ML solutions implementation and management in Germany. Focused on CI/CD pipelines, automation, and cloud services.
Specialist DevSecOps joining Periferia IT Group, a leader in digital transformation. Work in a dynamic environment with continuous learning and professional development opportunities.
Join Zinkworks as a Senior Platform Engineer designing scalable IaC-driven cloud platforms for a large-scale enterprise contact centre. Focused on automation, reliability, and platform ownership in a hybrid work environment.
Asset Reliability Engineer providing maintenance advice and service innovations. Join Sensorfact, the leading smart monitoring platform, to modernize the industrial sector.