Site Reliability Engineer responsible for system reliability and performance at a support organization. Collaborating with development and operations to enhance system architecture and incident management.
Responsibilities
Partner with development, infrastructure, and operations teams to design highly available, fault-tolerant, and disaster recovery–ready systems.
Implement Infrastructure-as-Code (e.g., Terraform) to automate provisioning, scaling, and management of cloud services (AWS, Azure).
Lead and support incident triage, resolution, and recovery efforts during critical events.
Provide advanced troubleshooting expertise and guide teams during outages.
Conduct detailed postmortems, document lessons learned, and drive improvements to reduce Mean Time to Recovery (MTTR).
Collaborate with developers, QA, and product teams to embed reliability principles throughout the software development lifecycle.
Mentor peers on observability tools, performance optimization, and SRE best practices.
Identify opportunities for continuous improvement in reliability, performance, and cost efficiency.
Evaluate and recommend emerging technologies to enhance scalability and resilience.
Contribute to internal documentation, ensuring best practices are accessible across the organization.
Requirements
4+ years of experience in DevOps, Site Reliability Engineering, or a related role.
Proven track record as a technical lead or subject matter expert (no direct people management required).
Hands-on expertise with cloud platforms (AWS, Azure) and Infrastructure-as-Code (Terraform preferred).
Strong understanding of systems architecture, high availability, fault tolerance, and disaster recovery.
Experience leading incident response and conducting root cause analysis.
Familiarity with observability tools and performance optimization practices.
Strong collaboration and communication skills with the ability to mentor peers and influence best practices.
DevOps Engineer in the Security Data and AI Lab at Lloyds Banking Group, using data and cloud infrastructure to improve product operations and customer service.
Senior Platform DevOps Engineer at Code Metal designing and implementing cloud and hybrid infrastructure to support customer deployments and internal platforms. Collaborating with software and security teams for reliable delivery.
DevOps Platform Intern managing cloud infrastructure and deployment pipelines for AI-native software delivery. Partnering with a Product Development Intern to set up and manage containerized applications on Azure Kubernetes Service.
UNIX DevOps Engineer managing AIX and Solaris server operations for a Swiss telecom company. Focusing on automation, optimization, and 24/7 monitoring responsibilities across multiple locations.
Staff Site Reliability Engineer designing tools for Threat Protection Pro and NordLynx protocol. Working on globally distributed backend services for NordVPN with a focus on security and privacy.
Senior Site Reliability Engineer managing VPN and DNS services to ensure performance and reliability. Collaborating with application teams to maintain security and quality across global infrastructure operations.
Senior Site Reliability Engineer managing globally distributed VPN and DNS services. Optimizing service performance and handling security posture in a hybrid work environment.
Senior Site Reliability Engineer focused on observability for NordVPN. Designing monitoring systems and collaborating with data teams on anomaly detection.
Senior Site Reliability Engineer ensuring content accessibility across global edge infrastructure for NordVPN. Designing and troubleshooting systems critical to internet traffic management.
Staff Site Reliability Engineer designing and building backend services for NordVPN. High-ownership role focusing on system architecture and operational excellence.