DevOps Engineer responsible for building and maintaining scalable AI systems on the Azure cloud. Collaborating with teams to ensure operational excellence for enterprise-grade AI solutions.
Responsibilities
Design, implement, and maintain MLOps pipelines for AI/ML assets deployed as managed online endpoints in Azure Machine Learning.
Implement CI/CD workflows for AI solutions using Azure DevOps and Azure CLI.
Ensure compliance, security, and scalability of AI systems across environments.
Build monitoring dashboards to track performance, data drift, and system health, and implement alerting and remediation strategies.
Manage promotion of AI/ML assets (models, apps, containers) across development, staging, and production environments.
Automate deployment, monitoring, and lifecycle management of AI systems, including Azure Speech Services and OpenAI models.
Assist in deploying and maintaining Flask-based applications that consume Azure ML endpoints.
Requirements
5+ years of experience in MLOps or related roles with a focus on the Azure cloud platform
Strong proficiency in Azure AI services (Azure Machine Learning, Cognitive Services, Azure Speech Services, Azure OpenAI)
Hands-on experience with CI/CD pipelines using Azure DevOps and Azure CLI
Proficiency in Python and experience with Azure SDKs for AI services
Solid understanding of containerization (Docker) and orchestration (Kubernetes, AKS)
Experience with monitoring and logging solutions for AI systems (Azure Monitor, Application Insights) and building dashboards
Knowledge of identity and access management in Azure (Managed Identity, RBAC)
Strong understanding of AI/ML asset lifecycle management, including promotion across environments
Benefits
25 days' holiday, increasing with length of service, with the option to buy or sell days
Bupa health insurance as a benefit in kind
An enhanced pension plan and life insurance
Onsite gyms or local discounts where no onsite gym is available