Senior Site Reliability Engineer managing GCP and SRE technologies at Lloyds Banking Group. Collaborating with engineering teams to enhance service reliability and troubleshoot issues in critical banking services.
Responsibilities
Delivering against GCP and SRE Public Cloud technology roadmaps
Collaborating with engineering teams to release and evolve enterprise-class solutions
Managing operations of critical banking services, including 24x7 coverage via on-call rota
Enhancing resiliency and reliability of customer-facing services
Troubleshooting and diagnosing issues with an engineering mindset
Building tooling to support service reliability and code quality
Working across multiple labs and signature projects in the Digital space
Leading Chaos Engineering initiatives to stress test services
Requirements
Strong understanding of SRE & DevOps, including experience of Infrastructure as Code and CI/CD pipelines using tools such as Azure DevOps, Terraform, or Jenkins.
Proficiency with Incident Management software (ie ServiceNow)
Proficient in Dynatrace, Splunk, SRE GCP & Cloud Observability.
Demonstrable experience in using orchestrations tools such as Harness.
Knowledge of GCP and Azure cloud platforms.
Experience in identifying toil and design automated solutions to remove it.
Reliability & Performance Management: Design, implement and own the SLOs for critical platform services. Monitor system health, manage error budgets, and drive improvements in Mean Time to Failure (MTTF) and Mean Time to Recovery (MTTR).
Incident & Problem Management: Lead incident response and post-mortem analysis. Ensure root cause identification and long-term remediation strategies are implemented.
Platform Advocacy & Collaboration: Champion SRE principles across Segments & Propositions Lab. Collaborate with Lab Product Owners, Engineering Leads, and application teams to embed reliability into design and delivery.
Technical Leadership: Provide technical oversight across cloud infrastructure, CI/CD pipelines, observability tooling, and automation frameworks. Guide engineers in adopting scalable and resilient solutions.
Continuous Improvement: Identify and implement improvements in deployment, monitoring, and alerting processes. Drive automation to reduce toil and improve operational efficiency.
Governance & Compliance: Ensure platform services adhere to internal risk, security, and compliance standards. Support audit and regulatory reporting requirements.
Benefits
A generous pension contribution of up to 15%
An annual performance-related bonus
Share schemes including free shares
Benefits you can adapt to your lifestyle, such as discounted shopping
30 days’ holiday, with bank holidays on top
A range of wellbeing initiatives and generous parental leave policies
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Build tools for incident response and enhance performance in fast - paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross - functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T required to design scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high - growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owns operational excellence across software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi - cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.
Site Reliability Engineer improving reliability of cloud infrastructure for an AI - specialized company. Taking ownership of monitoring and incident response processes in hybrid - working style.
DevOps Engineer leading automation for sophisticated release/deployment pipelines at Securonix. Focused on Python, Ansible, and cloud services to enhance security operations.