Principal AI Site Reliability Engineer driving operational excellence for critical contact center applications at Fidelity. Leading automation and observability initiatives to improve reliability and efficiency.
Responsibilities
Drive operational excellence, observability, and intelligent automation for mission-critical contact center applications
Lead initiatives to advance observability, automation, and operational efficiency
Collaborate with engineering and business leaders to prioritize and resolve issues impacting associate experience
Implement automation and self-service capabilities to reduce manual intervention and improve reliability
Establish and track SLIs/SLOs to measure and optimize system performance
Communicate progress, outcomes, and technical concepts clearly to senior leadership and stakeholders
Requirements
10+ years in technology operations, systems engineering, or production support leadership
Deep expertise in IT Service Management (ITSM), incident/problem management, and operational process optimization
Advanced knowledge of observability and monitoring tools (OpenTelemetry, Splunk, Datadog, Prometheus, Grafana)
Experience leveraging AI and automation to drive efficiency and reliability
Proficiency in scripting and automation (Python, Bash, PowerShell, or similar)
Strong understanding of On-Prem and Public Cloud (AWS/Azure/GCP) environments
Familiarity with networking, load balancing, and security fundamentals
Agile and DevOps mindset with experience in CI/CD and operational automation
Senior Site Reliability Engineer at Diligent leading reliability, automation, and observability across cloud infrastructure. Building tools for incident response and enhancing performance in fast-paced environments.
Perception Deployment Engineer deploying deep learning models on embedded systems at Caterpillar. Collaborating with cross-functional teams for integration and optimization of perception modules in vehicles.
Principal Site Reliability Engineer at AT&T designing scalable solutions for critical operations with minimal downtime. Collaborating with teams to monitor and improve system performance in cloud environments.
DevOps Engineer managing AI SaaS infrastructure at a high-growth European company. Supporting AI model deployment and ensuring platform security and compliance with multiple systems integration.
Engineering Manager leading teams for observability platforms at LexisNexis. Owning operational excellence across the software delivery lifecycle in Raleigh, NC.
Reliability Engineer optimizing site facility infrastructure and utility systems at Roche. Conducting root cause analyses and developing maintenance plans to enhance reliability and efficiency.
DevOps SME designing, implementing, and operating multi-cloud platforms for The Missing Link. Collaborating with engineering, security, and operations teams while embedding DevOps best practices.
Site Reliability Engineer improving reliability of cloud infrastructure for an AI-specialized company. Taking ownership of monitoring and incident response processes in a hybrid working arrangement.
DevOps Engineer leading automation for sophisticated release/deployment pipelines at Securonix. Focused on Python, Ansible, and cloud services to enhance security operations.
Senior Analyst on Data Platform DevOps at AIMCo, responsible for building data operations and collaborating with teams on innovative solutions. Focused on ensuring data quality and integrity across technologies.