Senior Site Reliability Engineer driving observability and reliability for business-critical systems at Incedo. Collaborating with engineering teams to enhance system resilience and performance.
Responsibilities
Design, implement, and maintain observability solutions across distributed systems
Build and optimize logging, metrics, and tracing pipelines using tools like Dynatrace, Datadog, Splunk, ELK, Grafana, and OpenTelemetry
Enable end-to-end transaction tracing across microservices and APIs
Develop dashboards and alerting strategies for proactive issue detection
Own service reliability, uptime, and operational performance for critical systems
Lead incident response, root cause analysis (RCA), and postmortems
Reduce MTTD and MTTR through automation and improved observability
Create and maintain runbooks and incident response playbooks
Monitor and optimize system performance (latency, throughput, error rates)
Partner with application and database teams to troubleshoot bottlenecks
Use distributed tracing and telemetry data to identify and resolve issues
Implement performance testing and tuning strategies
Build and maintain fault-tolerant, highly available systems
DevOps Manager responsible for managing a team for multi-cloud solutions supporting the USAF Cloud One project. Focus on scalable cloud-native solutions and CI/CD practices.
Lead Site Reliability Engineer overseeing SRE practices across Azure and GCP platforms. Driving reliability improvements and leading a team at Lloyds Banking Group.
DevOps Engineer responsible for managing Microsoft Intune operations at Bundesdruckerei GmbH. Focused on ensuring secure digital solutions for identity and data protection in Berlin.
DevSecOps Specialist securing the software development lifecycle at Vanguard. Collaborating with teams to improve application security tooling and processes, and provide development guidance.
Site Reliability Engineer automating infrastructure deployment for Scaleway's sovereign cloud products. Collaborating with product teams to enhance observability and reliability of the platform.
Reliability Engineer responsible for equipment reliability and safety using data-driven analysis for Wood in Aberdeen. Focus on proactive maintenance and operational efficiency.
Principal Safety and Reliability Engineer developing and supporting safety design for mission-critical aerospace systems. Engaging in design reviews and ensuring compliance with requirements.
Cloud DevOps Engineer playing a pivotal role in developing migration plans for Coast Guard Cloud Architecture. Collaborating with teams to ensure effectiveness and best practices in cloud implementation.
Reliability Engineer III at Daimler Truck developing propulsion technology solutions for electrified and conventional axle components. Leading testing and validation for complex powertrain systems.