Director for SRE supporting Fidelity’s growing public cloud presence and delivering reliable runtimes for business-critical workloads. Leading diverse technical teams to enhance cloud management capabilities and customer value.
Responsibilities
The Fidelity Enterprise Infrastructure (EI) Production Support team is seeking a Director to help scale our growing public cloud presence.
Fidelity’s Site Reliability Engineers work with our cloud platform teams to deliver reliable runtimes for Fidelity’s business-critical workloads.
This team is responsible for cross-cutting cloud management capabilities, and its members are the experts on the state of Fidelity’s cloud platforms at any moment.
The team comes from diverse technical backgrounds, and the responsibilities offer a variety of challenges spanning both software and systems work.
Ideal candidates will have a background in either software engineering or systems engineering and a desire to learn the other, or previous experience as an SRE.
The Director for SRE will lead engineering and systems operational support for Business Unit aligned functions including Application Support, Cloud Enablement, Helpdesk, Environment Management, Mid-tier & Web Operations, and Platform Engineering.
By demonstrating and promoting Fidelity and agile leadership behaviors, you will evolve and sustain an innovative agile culture.
Our ever-evolving technology stack ensures a phenomenal learning culture in the team.
We are always exploring new technologies and new ways to continually provide value to our customers.
This team has a direct and positive impact on Fidelity’s customers.
Requirements
Ability to automate with various scripting languages (Python, Shell scripting, etc.)
Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Hands-on Kubernetes skills and knowledge.
Hands-on experience with cloud services on AWS and Azure
Experience building resiliency through Chaos Engineering practices
Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
Experience with instrumentation, and systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale.
Proven experience in maintaining the scalability and resiliency of complex environments.
Proven experience in implementing advanced observability practices and techniques at scale.
Demonstrated ability to use modern monitoring tools (Datadog, Prometheus, Splunk)
Ability to triage, execute root cause analysis, and be decisive under pressure.
Experience managing and interpreting large datasets using query languages and visualization tools.
Strong communication skills, with the ability to reach both technical and non-technical audiences.
Ability to learn new software, methods, and practices and bring them to our developers.
Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner and build and maintain effective relationships.
Bridges the gap between lofty architecture ideas and development of feasible solutions.
Facilitates discussions among component owners to improve end-to-end understanding of transaction paths.
Provides consulting to architects and developers on common patterns and tactical, reusable solutions.
Influences adoption of stability principles by presenting facts and data.
Drives operational readiness discussions and reviews of new solutions and products.
Develops frameworks for self-assessment of applications on various stability and dependability pillars.
Participates, even unsolicited, in discussions and decisions that impact customer experience.
Selectively preserves and shares the collective memory and successes of the past.
Mindset of continuous learning and experimentation.
Instinctive urge to improve current state by finding problems and recommending feasible solutions.
Benefits
Most roles at Fidelity are Hybrid, requiring associates to work onsite every other week (all business days, M-F) in a Fidelity office. This does not apply to Remote or fully Onsite roles.