Senior Principal SRE at Northern Trust, ensuring reliability and performance of global systems. Leading observability and automation initiatives while collaborating across teams.
Responsibilities
Reliability‑Focused System Design & Architecture: Lead the design and evolution of highly reliable, scalable, and performant distributed systems, applying SRE principles across infrastructure and application layers.
SRE Operations & Automation: Drive an automation‑first approach by designing and developing tools, scripts, and platforms that reduce manual effort, operational toil, and human error.
Incident Management & Root Cause Analysis: Participate in and lead incident response for production systems, ensuring timely mitigation and minimal customer or business impact.
Monitoring, Alerting & Observability: Architect and implement end‑to‑end observability across systems using metrics, logs, and traces to enable rapid diagnosis and proactive issue detection.
Continuous Reliability Improvement: Identify reliability gaps through data analysis, failure reviews, and resilience testing, driving targeted improvement initiatives.
Documentation & Knowledge Sharing: Create and maintain clear, accurate, and actionable documentation including system architectures, runbooks, operational standards, and incident playbooks.
Cross‑Functional Collaboration & Influence: Work closely with product, development, platform, security, and operations teams to embed SRE principles into roadmap planning and delivery.
Project & Initiative Leadership: Manage and prioritize multiple reliability‑focused initiatives, balancing short‑term operational needs with long‑term system health.
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related discipline, or equivalent practical experience demonstrating advanced technical and leadership capabilities.
15+ years of progressive experience in systems engineering with a strong emphasis on site reliability, large‑scale systems operations, and software engineering in complex enterprise or cloud environments.
7+ years of experience in a technical leadership role (Team Lead or Hands‑on Technical Manager), with a proven track record of driving cross‑functional initiatives and delivering complex projects to successful completion.
Strong proficiency in one or more modern programming languages such as Python, Go, Java, Ruby, or equivalent, with a software‑engineering mindset applied to operational challenges.
Demonstrated experience operating and supporting systems across hybrid environments, including both on‑premises infrastructure and public/private cloud platforms.
Hands‑on experience with containerization and container orchestration technologies, enabling scalable, resilient, and repeatable deployments.
Proven ability to design and implement observability solutions, including metrics, logs, traces, dashboards, and alerts that provide actionable insights into system health and performance.
Deep understanding of distributed systems, networking fundamentals, failure modes, and modern software architectures, with the ability to reason about complex system behaviors under load or failure conditions.
Exceptional problem‑solving skills with the ability to diagnose, mitigate, and permanently resolve complex, high‑impact technical issues.
Strong customer and stakeholder orientation, with excellent communication skills and the ability to articulate complex reliability strategies clearly and persuasively to both technical and non‑technical audiences.
Prior experience designing and delivering Infrastructure as Code (IaC) through automated CI/CD pipelines, ensuring consistency, scalability, and reliability of infrastructure changes.
Demonstrated success in mentoring, coaching, and developing high‑performing technical teams, fostering a culture of engineering excellence, ownership, and continuous improvement.
Hands‑on expertise in implementing automated remediation and corrective actions driven by observability signals and reliability metrics.
Practical experience working within Agile and DevOps environments, collaborating closely with product and engineering teams to balance reliability, velocity, and innovation.
Benefits
Flexible working arrangements
Professional development opportunities
Job title
Senior Principal Infrastructure Services – SRE Practice