Platform System Reliability Engineer focused on operating the EKS Kubernetes environment for GE Vernova's SaaS grid products. Responsible for the full lifecycle of production clusters, from performance tuning to securing infrastructure.
Responsibilities
Provisioning & Infrastructure Hardening
Kubernetes Cluster Orchestration: Help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches.
Implement "Policy as Code" guardrails and secure Electronic Security Perimeters (ESPs) in alignment with NERC CIP and IEC 62443 standards.
Standardize the runbooks and operating processes required to run critical infrastructure with the highest reliability.
Define and enforce Kubernetes resource quotas, limit ranges, and Pod Priority classes to ensure mission-critical services receive prioritized compute resources.
Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
Lead platform-level smoke testing, load testing, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).
Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
Proactively identify and automate repetitive operational tasks, such as cluster upgrades and OS patching, so that the team spends at least 50% of its time on engineering improvements.
Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.
Requirements
5 years of experience operating production-grade Kubernetes clusters at scale.
Expert-level knowledge of multi-cluster management and performance tuning, plus experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, or Datadog.
Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation, and in deployment tools such as ArgoCD or Flux.
Strong understanding of, and hands-on experience with, cloud networking concepts such as VPCs, routing, and load balancing, and security configurations such as encryption and certificate management.
Bachelor's degree in Computer Science or another STEM major (Science, Technology, Engineering, and Math) with advanced experience.
6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments.
Proven track record as a structured incident responder who can handle production-down and break-glass scenarios in mission-critical applications.
Practical knowledge of NERC CIP, SOC 2, ISO 27001, or IEC 62443 compliance standards in a SaaS context.
DevOps Engineer in the Security Data and AI Lab at Lloyds Banking Group driving data and cloud infrastructure's influence on product operations and customer service improvements.
Senior Platform DevOps Engineer at Code Metal designing and implementing cloud and hybrid infrastructure to support customer deployments and internal platforms. Collaborating with software and security teams for reliable delivery.
DevOps Platform Intern managing cloud infrastructure and deployment pipelines for AI-native software delivery. Partnering with a Product Development Intern to set up and manage containerized applications on Azure Kubernetes Service.
UNIX DevOps Engineer managing AIX and Solaris server operations for a Swiss telecom company. Focusing on automation, optimization, and 24/7 monitoring responsibilities across multiple locations.
Staff Site Reliability Engineer designing and building backend services for NordVPN. High-ownership role focusing on system architecture and operational excellence.
Senior Site Reliability Engineer managing VPN and DNS services to ensure performance and reliability. Collaborating with application teams to maintain security and quality across global infrastructure operations.
Senior Site Reliability Engineer managing globally distributed VPN and DNS services. Optimizing service performance and handling security posture in a hybrid work environment.
Senior Site Reliability Engineer focused on observability for NordVPN. Designing monitoring systems and collaborating with data teams on anomaly detection.
Senior Site Reliability Engineer ensuring content accessibility across global edge infrastructure for NordVPN. Designing and troubleshooting systems critical to internet traffic management.
Staff Site Reliability Engineer designing tools for Threat Protection Pro and the NordLynx protocol. Working on globally distributed backend services for NordVPN with a focus on security and privacy.