Senior ML Platform Engineer developing and maintaining scalable ML infrastructure at GEICO. Focused on Large Language Models and collaborating with data science and engineering teams.
Responsibilities
Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and, where appropriate, implement hybrid cloud solutions with AWS/GCP as a backup or for specialized use cases
Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
Design and deliver technical onboarding programs for new team members joining the ML platform team
Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
5+ years of software engineering experience with a focus on infrastructure, platform engineering, or MLOps
2+ years of hands-on experience with machine learning infrastructure and deployment at scale
1+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar
Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
Proficiency with Terraform, ARM templates, Pulumi, or CloudFormation
Deep understanding of Docker, container optimization, and multi-stage builds
Experience with Prometheus, Grafana, ELK stack, Azure Monitor, and distributed tracing
Knowledge of both SQL and NoSQL databases, data warehousing, and vector databases
Benefits
Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being.
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance.
Access to additional benefits like mental healthcare as well as fertility and adoption assistance.
Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year.