Kubernetes Platform Engineer working with self-managed clusters and AI infrastructure. Collaborating with a team to design and operate Kubernetes solutions and automate operational tasks.
Responsibilities
Design, build, and operate self-managed Kubernetes clusters (OpenShift / Anthos)
Manage and maintain etcd (backup, restore, quorum management, defragmentation)
Perform control plane upgrades and lifecycle management
Tune API server, scheduler, and controller manager for performance and reliability
Debug node-level and control-plane issues across large clusters
Implement networking (CNI), storage (CSI), and ingress integrations
Implement and extend runbook automation frameworks to reduce operational toil
Integrate AI agents that monitor cluster telemetry, detect anomalies, and trigger automated workflows
Apply statistical or ML-based models on operational data to predict failures, capacity saturation, or workload misbehavior
Build self-healing controllers and automated remediation pipelines
Implement predictive capacity planning and intelligent alert suppression workflows
Build Kubernetes controllers and operators (Go + controller-runtime)
Develop CRDs and admission webhooks to extend platform functionality
Automate cluster lifecycle and multi-cluster operations
Implement policies for workload isolation, governance, and compliance
Enable GPU and high-performance infrastructure for AI/ML workloads
Optimize scheduler and resource allocation for memory- and compute-intensive workloads
Support orchestration of AI/ML pipelines
Requirements
5+ years of software engineering experience
3+ years operating Kubernetes in production with hands-on control plane experience
Experience managing etcd (backup, restore, recovery) and performing control plane upgrades
Strong Go programming skills
Experience building Kubernetes operators/controllers and developing CRDs/webhooks
Deep understanding of scheduler, API server, controller loops, and reconciliation
Experience debugging and troubleshooting large-scale distributed systems
Candidates without on-prem or self-managed Kubernetes control plane experience will not be considered.
Benefits
Medical, dental and vision insurance
401(k) plan with a Cisco matching contribution
Paid parental leave
Short- and long-term disability coverage
Basic life insurance
10 paid holidays per full calendar year
1 floating holiday for non-exempt employees
1 paid day off for employee’s birthday
Paid year-end holiday shutdown
4 paid days off for personal wellness
16 days of paid vacation time per full calendar year
Flexible vacation time off program
80 hours of sick time off provided on hire date and each January 1st thereafter
Additional paid time away may be requested
10 paid days per full calendar year to volunteer
Potential grants of Cisco restricted stock units
Job title
Kubernetes Platform Engineer – Control Plane, AI Infrastructure