Site Reliability Engineer contributing to platform reliability at Trainline, Europe's leading rail ticketing platform. Collaborating with product engineering to ensure operational readiness and incident response.
Responsibilities
Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
Taking part in the SRE on-call rotation
Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and improving mean time to detection (MTTD)
Ensuring relevant operational data is surfaced quickly and clearly during live incidents
Making informed tooling and technology choices using SRE principles, balancing team and business needs
Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
Advising on reliability and resilience practices
Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
Prioritising work effectively and collaborating using agile processes to deliver against team and business goals
Requirements
Experience with SRE concepts such as SLIs, SLOs, and error budgets
Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana, or similar
Experience working with cloud providers (preferably AWS)
Experience troubleshooting Linux operating systems
Experience scripting in at least one language (preferably Python)
Understanding of load balancing and reverse proxy concepts: upstream configuration, upstream health checks, and worker and data flow
Data Transport Infrastructure DevOps Engineer at Leidos modernizing global-scale multi-cloud environments for USAF missions. Involves developing cloud-native solutions and ensuring security best practices.
DevOps Engineer responsible for building and optimizing AWS-based infrastructure and backend systems at Allguth GmbH. Part of a team focused on innovative mobility solutions in the Munich region.
(Senior) DevOps Engineer specializing in ML solutions implementation and management in Germany. Focused on CI/CD pipelines, automation, and cloud services.
Specialist DevSecOps joining Periferia IT Group, a leader in digital transformation. Work in a dynamic environment with continuous learning and professional development opportunities.
Join Zinkworks as a Senior Platform Engineer designing scalable IaC-driven cloud platforms for a large-scale enterprise contact centre. Focused on automation, reliability, and platform ownership in a hybrid work environment.
Asset Reliability Engineer providing maintenance advice and service innovations. Join Sensorfact, the leading smart monitoring platform, to modernize the industrial sector.
Cloud Operations Engineer responsible for securing AWS infrastructure at Avalon Healthcare Solutions. Collaborating on SRE best practices and ensuring system reliability and performance.
Design Release Engineer designing, developing, and releasing seat systems for Ford vehicles. Ensuring engineering deliverables meet quality, cost, and timing targets while collaborating with cross-functional teams.
DevOps Engineer responsible for maintaining FME infrastructure and development pipelines at Safe Software. Collaborate in an agile team focused on constant improvement and automation.
Lead Site Reliability Engineer responsible for GCP cloud infrastructure and SRE practices. Join a fintech platform making real estate investment accessible globally.