Lead ML Ops Engineer for a fast-growing AI startup focused on scalable infrastructure. Drive hands-on execution across the entire model lifecycle in a collaborative environment.
Responsibilities
Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows).
Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost).
Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models.
Collaborate with researchers to productionize models and accelerate training/inference pipelines.
Establish ML Ops best practices, internal standards, and cross-team tooling.
Mentor engineers and influence architectural direction across the entire AI platform.
Requirements
Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected).
Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
Proficiency with Python and familiarity with TypeScript or Go for platform integration.
Expertise in ML frameworks and tooling: PyTorch, Transformers, vLLM, LLaMA-Factory, and Megatron-LM, plus a practical understanding of CUDA / GPU acceleration.
Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries.
Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments.
Machine Learning Engineer designing and deploying advanced training capabilities to support U.S. Navy operational readiness. Collaborate on machine learning models to enhance combat system training environments.
Cloud MLOps Engineer supporting Data Science and Engineering teams by automating CI/CD pipelines and managing multi-cloud infrastructure for ML production.
Lead development of Agentic AI capabilities and LLM applications for multiple mission management applications. Mentor teams to implement ML algorithms addressing customer challenges.
Staff AI/ML Engineer at CACI responsible for developing AI/ML algorithms and analyzing datasets. Join a high-performing team supporting national safety missions.
AI/ML Engineer at CACI developing machine learning algorithms for multiple applications. Collaborating with a research team to implement cutting-edge AI/ML solutions for customer missions.
Senior Computer Vision AI/ML Engineer leading a team in AI/ML algorithm implementation for remote sensing solutions. Responsibilities include training models and analyzing datasets with a focus on defense and commercial applications.
MLOps Engineer working on ML processes and robust workflows at Kensho. Collaborating with engineers to enhance tooling, services, and frameworks for machine learning.
Senior Scientist II leading innovative AI and machine learning projects in oncology at Tempus. Collaborating with teams to advance predictive modeling and drug R&D initiatives.