Senior Site Reliability Engineer at ABBYY, working on critical production service designs and reliability improvements on Azure cloud applications
Responsibilities
Co-own critical production service designs to ensure high reliability is achievable and measurable
Drive reliability and observability improvements in the services within the engineering verticals
Using monitoring and telemetry data, help teams make informed decisions on where reliability challenges may exist and help design and build solutions to improve them
Build SRE dashboards from SLIs to measure SLO adherence
Support production applications hosted in Azure cloud
Build and improve internal tools and automation software to make maintaining production services easier and safer
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others
Develop Infrastructure as Code
Define (from design to implementation details) necessary auto-healing and fault-tolerant systems
Serve as the point of contact for production application issues, working closely with engineering leadership
Requirements
7-10 years of IT experience
Proven experience with at least one cloud platform (Azure or AWS), preferably Azure
Proficient in Kubernetes, AKS, Azure Functions, Storage accounts, and others
Proven experience with Microsoft technologies, Windows Server, IIS (preferred)
Distributed monitoring experience in Grafana: logging, metrics, tracing, etc.
Years of experience matching the level in an infrastructure, SRE, DevOps, or CloudOps role
Experience working in an SRE team in a dynamic, fast-paced environment
Experience programming in one or more of the following: C#, Java, Python, .NET, Node.js, Go
Experience with Terraform, Ansible, or a similar Infrastructure as Code tool
Experience with cloud-performant microservices and event-driven architectures
Experience with Kubernetes administration is an added advantage.
DevOps Engineer designing and maintaining Azure-based hybrid cloud infrastructure for a company specializing in nature-based smart city solutions. Leading cloud architecture and mentoring engineers as part of a high-impact team.
SRE responsible for ensuring reliability and performance of IT systems at a digital transformation company specializing in public sector efficiency. Collaborating on system health, incident response, and automation tasks.
DevOps Senior role at Beyond Soluções managing CI/CD for .NET and Kubernetes applications. Collaborating on cloud solutions while fostering a culture of innovation and quality.
Senior Software Engineer at PayPal managing cloud infrastructure and DevOps solutions. Delivering complete SDLC solutions and guiding engineering teams for scalable and reliable services.
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Building tools for incident response and enhancing performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T required to design scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owns operational excellence across software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.