Principal Site Reliability Engineer enhancing Walmart's customer service platforms for operational excellence. Leading automation and reliability strategies in a large-scale tech environment.
Responsibilities
Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
Requirements
10+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Benefits
Health benefits include medical, vision, and dental coverage.
Financial benefits include 401(k), stock purchase, and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
Walmart-paid education benefit program for full-time and part-time associates, covering tuition, books, and fees.