MLOps Engineer managing AI pipelines for computer vision models. The role involves streamlining the end-to-end model lifecycle in a hybrid work environment.
Responsibilities
Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
Introduce and integrate monitoring/telemetry for:
job health and failure analysis (retry, backoff, alerts),
data/feature drift and model performance (precision/recall, latency, throughput),
infra metrics (GPU utilization, memory, I/O, cost).
Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
Requirements
Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
Strong Docker and Docker Compose skills in local and cloud environments; solid understanding of image build optimization and artifact caching.
Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.