Site Reliability Engineer ensuring high availability and performance for digital platforms in retail. Collaborating with engineering teams on automation and observability practices.
Responsibilities
Ensure availability, performance, and scalability of applications
Work on automation of CI/CD pipelines
Implement and evolve observability practices
Participate in incident response
Define and monitor SLIs, SLOs, and SLAs
Support development teams in building resilient applications
Manage and optimize cloud infrastructure
Promote an Infrastructure as Code and automation culture
Work with DevOps and Reliability Engineering practices
Requirements
Experience as an SRE, DevOps engineer, or in related areas
Strong knowledge of GitHub Actions
Solid experience with Amazon Web Services (AWS)
Experience with Node.js applications
Knowledge of TypeScript
Experience with containers and orchestration (Kubernetes is a plus)
Knowledge of monitoring
Familiarity with logging tools
Experience with automation and scripting
Experience in retail/e-commerce environments (preferred)
Benefits
Hybrid environment with remote work options
Exposure to retail and e-commerce environments
Social initiatives and programs that promote development