Senior ML Platform Engineer at Mistplay researching and developing machine learning solutions. Collaborating with teams to solve complex business problems and enhance the mobile gaming experience.
Responsibilities
Design, build, and operate standardized training-to-deployment pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.
Own real-time and batch inference on SageMaker: multi-model endpoints (MME), serverless inference where appropriate, blue/green and canary deployment strategies, autoscaling policies, and cost controls (spot strategies, instance sizing).
Implement very low-latency serving patterns using Redis/Valkey: feature caching, online feature retrieval, request-level state, model response caching, and rate limiting/backpressure for bursty traffic.
Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configurations, ECR/ECS/EKS resources, network endpoints/VPCs, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDK, cookie-cutter repositories, and CI/CD pipelines that move models from notebooks to production predictably.
Establish and manage model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage and audit trails integrated with Airflow runs and Terraform state.
Implement end-to-end observability: data/feature freshness checks, drift/quality controls, model performance/latency SLOs, infrastructure health dashboards, tracing and alerts, plus incident response and postmortems.
Collaborate with security, SRE, and data engineering teams on private networks, policy-as-code, handling of PII, least-privilege IAM, and cost-effective architectures across environments.
Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, service gateways); lead migrations with clear change management and minimal downtime.
Requirements
5+ years of experience building and operating production-grade ML/data platforms focused on serving, reliability, and developer experience.
Strong software engineering skills in Python, Go, or Java; experience building resilient services, APIs, and automation tools with high test coverage.
Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, trade-offs between serverless and real-time, MME, A/B and canary releases.
Expertise using Redis/Valkey as an online feature store in ML serving contexts.
Proven Terraform experience for end-to-end ML and data infrastructure management: modules, workspaces, drift detection, change review, and safe rollbacks; familiarity with GitOps patterns.
Large-scale Airflow orchestration: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.
Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform integration perspective to support diverse runtimes and containers.
Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.
Excellent cross-functional communication and collaboration with data science, data engineering, DevOps, and backend teams.
Data Transport Infrastructure Engineer at Leidos supporting U.S. Air Force Cloud One Architecture. Involves developing scalable cloud-native solutions and mentorship roles in a hybrid remote setting.
Principal Software Engineer on Walmart's AI Security team analyzing threats and implementing robust security architectures. Collaborate across domains and mentor on AI safety and secure engineering practices.
Data Center Infrastructure Architect designing scalable and resilient optical cabling for hyperscale data centers. Implementing physical solutions and automating fiber mapping for efficiency.
Systems and Infrastructure Engineer managing technology infrastructure and providing DevOps support for system reliability. Collaborating with development teams to implement solutions and enhance system performance.
Infrastructure Engineer managing IT infrastructure projects and operational tasks for the MHRA. Collaborating with teams to ensure service stability and performance in the Digital and Technology group.
AI Infrastructure Engineer at Xsolla designing AI/ML solutions for multi-cloud infrastructure. Collaborating on automation workflows and observability systems for improved infrastructure management.
AI Infrastructure Engineer designing and implementing AI/ML solutions for infrastructure use cases at Xsolla. Collaborating with teams to enhance the security posture of infrastructure systems.
Cloud Infrastructure Engineer managing Azure environments and supporting cloud infrastructure processes in a credit market servicing organization. Collaborating with DevOps teams and ensuring compliance with security standards.
Cloud Infrastructure Architect managing AWS and Azure environments for fintech clients. Leading architectural governance and security compliance in a hybrid infrastructure setup.
Infrastructure Engineer responsible for managing GCP infrastructure and supporting cloud operations. Seeking skills in Terraform, Kubernetes, Ansible, and incident response in enterprise settings.