Capacity Ops Engineer securing and managing GPU clusters for AI workloads at Baseten. Leading initiatives that ensure 99.9% uptime across multi-cloud environments.
Responsibilities
Lead Specialized Pods: Act as the lead for specific GPU pods (e.g., H100 or B200), managing the full lifecycle of acquisition, air traffic control, and maintenance for those assets.
Advanced Orchestration: Execute complex workload migrations and "sticky" deployment drains, ensuring deployment scheduling rules meet strict regional and compliance requirements.
Build for Scalability: Design and implement the "next version" of Baseten’s capacity management system to handle a 10x increase in GPU volume.
Financial Modeling: Leverage your understanding of unit economics to build ROI models for GPU spend, ensuring Baseten scales profitably.
Cross-Team Collaboration: Partner with SRE, Infra, and FDE teams to take discrete operational tasks off their plates and verify "last mile" follow-through on infrastructure changes.
Incident Response: Lead capacity-crunch response by rapidly untainting and re-coordinating workloads during high-pressure outages.
Requirements
Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
5+ years of professional work experience in a high-growth environment, preferably at a hyperscaler (GCP, AWS, Azure) or a specialized GPU provider
Deep expertise in Kubernetes, including hands-on experience with taints, cordons, node draining, and custom operators
Demonstrated experience with Go or Python in a production-level environment
Strong financial literacy and the ability to model complex trade-offs between capacity reliability and cost
High tenacity and collaborative mindset
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy including company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Ticket Operations lead for 160over90, managing premium ticketing for iconic events like the Super Bowl. Delivering seamless experiences for partners and VIP guests while optimizing ticket operations.
Senior Operations Manager overseeing Shared Services Associates and Managers at Elevance Health. Leading MaaS Business Development and mentoring junior managers.
Psychiatrist role promoting the well-being of higher-education students. Participating in a multidisciplinary team and preparing treatment and rehabilitation plans in Töölö, Helsinki.
Leads back-office teams in insurance service and sales support for Personal Lines Property and Casualty. Focuses on coaching, process management, and collaboration for enhanced operations and service quality.
Director of Operational Quality & Governance at Zelis overseeing a team to ensure operational processes are effective and continuously improving with AI capabilities.
Global Operations Director responsible for shaping strategies driving operational excellence. Leading multi-region operations while enhancing scalable processes across a global organization.
Business Analysis Manager at Capital One solving major company challenges through strategic analysis and product development. Collaborating with teams to drive growth and profitability in financial services.
Principal Associate Process Manager managing and improving customer processes at Capital One. Seeking a dedicated professional to enhance service delivery and customer experience during transformational changes.
Operations Production Principal Coordinator handling legal orders for Capital One’s Levies & Garnishments team. Ensuring productivity and quality metrics while mentoring peers in a hybrid work environment.