Engineering Manager leading Support Engineers to enhance observability and operational practices in AI production environments. Overseeing runtime debugging and incident resolution while fostering a customer-first mindset.
Responsibilities
Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer-first mindset.
Serve as a player-coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high-value enterprise customers.
Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling, and concurrency management.
Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus.
Own critical incidents end-to-end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance.
Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root-cause analysis.
Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability.
Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments.
Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s.
Requirements
Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI/ML or distributed systems environments.
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability-driven diagnostics.
Hands-on experience managing distributed systems or AI products at scale — optimizing GPU/CPU utilization, batch sizing, concurrency, and memory efficiency.
Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices.
Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high-stakes situations.
Demonstrated project management and organizational skills, capable of orchestrating multi-stakeholder efforts from incident triage through resolution and RCA.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Job title
Engineering Manager, Support – Customer Engineering