Primary post-sales technical owner ensuring the reliability of ML workloads for strategic customers at an AI company. Collaborating with teams to drive technical success and product improvements.
Responsibilities
Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Debug infrastructure issues across Kubernetes (pods, controllers), networking, observability, and alerting systems.
Lead incident response during outages or escalations, managing coordination between Product, FDE, Sales, and Engineering.
Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations.
Identify common failure modes and translate user feedback into roadmap signals, product improvements, and updates to our internal runbooks, knowledge bases, and diagnostic best practices.
Own project coordination end-to-end: scoping, execution, communication, and stakeholder alignment across technical and non-technical teams, for work ranging from feature requests and new deployments to operational debugging issues.
Requirements
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tooling such as Grafana, Loki, and Prometheus.
Strong infrastructure debugging ability across container orchestration, networking, and service dependencies, with hands-on experience supporting production-grade clusters.
Experience managing high-severity incidents with major customers, including SLAs, post-incident reviews, and clear communication throughout escalations.
Proven project management and organizational skills with an ownership mindset, able to manage multiple complex, multi-stakeholder initiatives in parallel — including issue resolution, root-cause analysis, and feature delivery.
Ability to translate recurring technical pain points into roadmap-level insights, documentation improvements, or product enhancements.
Strong communication skills and executive presence during high-visibility situations, ensuring technical clarity and customer confidence.
3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy including company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.