Software Engineer building and operating compute infrastructure powering OpenAI’s AI research. Optimizing Kubernetes clusters and ensuring reliability in supercomputing environments for advanced AI workloads.
Responsibilities
Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics by reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Requirements
Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Bonus: background with GPU workloads, firmware management, or high-performance computing
Benefits
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible
Relocation support for eligible employees
Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.
Senior Full Stack Developer at E-INFOSOL developing cloud applications and supporting Java solutions. Collaborating with teams and managing cloud infrastructure in a secure environment.
AI/ML Software Engineer Intern defining the AI/ML infrastructure at Nirmata's Policy Management platform. Collaborating on AI-powered features within a fast-moving startup.
Senior Software Developer developing web and mobile applications for NIH researchers at Guidehouse. Collaborate with scientists and support complex scientific data workflows in a hybrid work environment.
Staff Engineer at GEICO responsible for API-first design and microservices architecture. Leading technical strategy and collaborating across engineering teams to deliver quality software solutions.
Lead Software Engineer at Tails.com, delivering scalable software and leading engineering teams. Join a fast-growing dog food subscription company changing the world of pet food for good.
Senior Software Development Engineer designing and developing low-level drivers for Broadcom PHY chip sets. Involves code maintenance, customer requirement conversions, and working closely with development and application teams.
Senior Fullstack Engineer building AI-driven financial products for Nexus Frontier Tech. Collaborating with clients and delivering robust applications in a hybrid workplace.
Senior Principal Engineer leading full-stack development initiatives using Microsoft technologies at Ingram Micro. Focusing on production system stabilization and self-serve platform design.
Launch Vehicle Ground Software Engineer developing and maintaining software for aerospace launch operations. Collaborating with propulsion, avionics, and test teams to ensure reliability and efficiency.
Designing high-reliability flight software for Firefly Launch Vehicles and spacecraft. Collaborating with engineering teams and providing technical leadership in a fast-paced environment.