Site Reliability Engineer responsible for reliability and availability, collaborating with development teams on scalable systems. Applying software engineering practices to improve production operations.
Responsibilities
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, performance, and scalability of production systems by applying software engineering practices to infrastructure and operations.
Partners with development teams to build resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Requirements
8 years of experience in systems engineering, DevOps, or site reliability engineering roles
8 years of strong experience with Linux/Unix systems and system internals
8 years of proficiency in one or more programming/scripting languages (Python, Go, Java, Bash)
8 years of experience designing and operating highly available, distributed systems
8 years of strong knowledge of cloud platforms (AWS or GCP) and cloud-native services
8 years of experience with containerization and orchestration (Docker, Kubernetes)
8 years of strong understanding of monitoring, alerting, and logging concepts
8 years of experience defining and managing SLIs, SLOs, and error budgets
8 years of familiarity with incident management, root cause analysis (RCA), and postmortems
8 years of experience integrating security and compliance into operational workflows
4 years of familiarity with observability tools (Prometheus, Grafana, Application Insights, Datadog, Splunk)
4 years of experience operating 24x7 production environments with on-call rotations
4 years of experience with chaos engineering and resiliency testing
4 years of experience with feature flags, canary deployments, and progressive delivery
4 years of strong documentation skills for runbooks, dashboards, and operational standards
DevOps Engineer automating and configuring network monitoring and automation solutions for Telia’s telecom operations in Finland. Ensuring performance, resilience, and high observability of critical platforms.
Client Services Consultant specializing in DevOps Mainframe Operations with experience in automation best practices. Analyzing Life Cycle Management data needs and evaluating solutions for Endevor-related operations.
Senior AWS DevOps Engineer at LexisNexis shaping global CI/CD platform. Collaborating with teams to deliver secure, reliable, and scalable delivery pipelines.
Cloud Engineer at MetroStar focusing on building and securing cloud-native systems. Managing Kubernetes workloads and CI/CD pipelines in Agile teams with an emphasis on security.
Senior Engineer Cloud Engineering role focused on AWS migration and automation. Collaborating with teams to innovate cloud patterns and infrastructure best practices.
Senior Operations Engineer driving efficiency and reliability in NVIDIA's global business operations. Collaborating with IT subsystems and automating operational workflows for organizational impact.
Lead or Senior DevOps Developer joining Boeing Defense, Space and Security for advanced technology missions. Involves CI/CD, cloud systems design, and collaboration with government customers.
Site Reliability Engineer ensuring high availability and performance for digital platforms in retail. Collaborating with engineering teams for automation and observability practices.
Associate Site Reliability Engineer supporting the reliability and performance of global IT infrastructure at Exegy. Engaging with senior engineers and learning foundational systems engineering skills.
Site Reliability Engineer driving innovation and growth for Banking Solutions, Payments, and Capital Markets business. Responsible for application reliability and incident response in a hybrid work environment.