Principal Site Reliability Engineer at Zefr enhancing cloud infrastructure and driving reliability practices in a leading technology company.
Responsibilities
Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
Deploy and support a multi-cloud microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
Debug code at the application and infrastructure level.
Mature our CI/CD workflows and release process.
Maintain a forward-thinking approach, actively researching and proposing new solutions.
Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices.
Requirements
10+ years of experience designing, managing, deploying, and supporting cloud infrastructure in production environments on major public cloud providers (GCP experience is a huge bonus)
Experience in Advertising or AdTech
Demonstrated technical leadership experience, including mentoring engineers, driving cross-functional projects, and influencing architectural decisions at an organizational level.
Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
Advanced proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
Deep production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning.
Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems.
Strong knowledge of cloud networking (mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies.
Exceptional written and verbal communication skills; ability to translate complex technical concepts for diverse audiences and build consensus across teams.
Experience authoring technical strategy documents, RFCs, and architectural proposals.
Benefits
Flexible PTO
Medical, dental, and vision insurance with FSA options
Company-paid life insurance
Paid parental leave
401(k) with company match
Professional development opportunities
13 paid holidays
Summer Fridays (we leave early)
In-office, hybrid, and fully-remote work options available
In-office lunches and lots of free food
Optional in-person and virtual events (we like to celebrate!)