Site Reliability Engineer ensuring reliability and availability of critical gaming platforms at Flutter Entertainment. Collaborating with teams to implement monitoring and incident response procedures.
Responsibilities
Ensure the reliability, availability, and performance of critical gaming and betting platforms across global operations
Maintain 24/7/365 service availability for millions of customers worldwide
Implement automation, monitoring, and incident response procedures
Design and implement monitoring, alerting, and observability solutions using tools such as Grafana, Splunk, and CloudWatch
Conduct capacity planning and performance optimization
Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Support ProdOps and Service Management teams during P1/P2 incident response
Collaborate on post-incident reviews and contribute technical insights
Assist in developing and maintaining comprehensive runbooks and incident response procedures
Design, deploy, and maintain Grafana dashboards for real-time system visibility
Create custom Grafana panels and dashboards for business metrics
Requirements
Advanced experience with AWS, Azure, or Google Cloud Platform services and architecture
Proficiency with Docker and Kubernetes for container orchestration and management
Strong scripting abilities in Python, Go, Bash, or PowerShell; familiarity with Java or .NET advantageous
Hands-on experience with Prometheus, Grafana, ELK stack, or similar monitoring solutions
Proficiency with Jenkins, GitLab CI, Azure DevOps, or similar continuous integration tools
Working knowledge of SQL databases (PostgreSQL, MySQL) and NoSQL solutions
Understanding of load balancers, CDNs, DNS, and network security principles
SRE responsible for ensuring reliability and performance of IT systems at a digital transformation company specializing in public sector efficiency. Collaborating on system health, incident response, and automation tasks.
Senior DevOps role at Beyond Soluções managing CI/CD for .NET and Kubernetes applications. Collaborating on cloud solutions while fostering a culture of innovation and quality.
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Building tools for incident response and enhancing performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T designing scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owning operational excellence across the software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi-cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.