Site Reliability Engineer managing Kubernetes platforms at epay, focusing on reliability and scalability. Collaborating with product teams to ensure fast, resilient, and observable services.
Responsibilities
Operate and harden SUSE Harvester environments: lifecycle management, upgrades, node/cluster health, HA, capacity planning, and incident response.
Administer Longhorn storage for Kubernetes: performance tuning, disaster‑recovery design, backup/restore validation, and troubleshooting volume issues.
Manage Kubernetes clusters (multi‑cluster, multi‑tenant), including cluster creation, upgrades, admission control, API server health, and etcd maintenance.
Own CNI operations with Antrea: policy design, network performance, and east‑west traffic observability.
Run KubeVirt for VM workloads on Kubernetes: plan migrations, right‑size resources, and build reliable pipelines for VM lifecycle.
Use Rancher to standardize cluster fleet management: provisioning (CAPI), templates, RBAC, and centralized policy/upgrade orchestration.
Implement GitOps with FluxCD: define release pipelines, drift detection, progressive delivery, and automated rollbacks.
Provision cloud/on‑prem resources with Crossplane: compose abstractions, manage providers, and enforce guardrails for day‑2 operations.
Build and maintain SLOs/SLIs: availability, latency, error budgets; automate alerts and runbooks tied to service health.
Reduce toil through automation: scripting, operators, controllers, and self‑service tooling for developers.
Participate in on‑call rotations, post‑incident reviews, and reliability roadmaps; drive corrective actions and platform improvements.
Requirements
3+ years in SRE/Platform/Systems Engineering (or equivalent) supporting production Kubernetes.
Hands‑on experience with SUSE Harvester and Longhorn or comparable HCI + distributed block storage.
Practical knowledge of Antrea CNI, KubeVirt, and Rancher fleet management.
Proficiency with FluxCD (GitOps patterns, Kustomize/Helm) and Crossplane (Compositions, Providers, RBAC).
Strong Linux administration (networking, filesystems, performance), observability (logs/metrics/traces), and scripting (Bash/Python).
Senior DevOps Engineer building infrastructure and tools at Metrc, LLC. Implementing change and improving processes in a fast-growing technology company.
Middleware & DevOps Engineer at Smile Group managing WSO2 and Java integration projects. Responsible for designing, developing, and maintaining critical exchange flows as part of a dynamic team in Casablanca.
Software Engineer focused on mobile DevOps at T-Mobile, designing scalable software solutions for CI/CD environments. Collaborating with teams to deliver mobile applications with high reliability and performance.
Junior DevOps Engineer building and maintaining analytics platforms at Rabobank. Collaborating with experienced engineers using Azure, Cloud, Databricks, and Terraform.
Senior DevOps Engineer designing and improving Zscaler-based services for secure internet access. Collaborating with global teams and working in a complex IT environment for Rabobank.
Medior Java Developer responsible for the Global Client Data System in a cloud environment. Collaborating in international teams to enhance data services at Rabobank.
DevOps Engineer operating and improving Kafka infrastructure on private cloud for Telia. Collaborating on advanced messaging solutions and driving DevSecOps practices with open-source platforms.
DevOps Engineer at Helpshift responsible for GCP infrastructure and AI deployment pipelines. Ensuring production monitoring, security, and CI/CD excellence with a hybrid work model.
DevOps Engineer in a digital venture building a technology platform for a B2B marketplace. Collaborating with teams to improve delivery speed and code quality and to automate processes.
Senior Cloud Site Reliability Engineer responsible for daily operations of Solace Cloud services across cloud platforms. Ensuring reliability and efficiency in a hybrid work environment.