Senior ML Infrastructure Engineer developing cloud and compute foundations at Ellison Institute of Technology. Focused on high-performance ML compute clusters to accelerate scientific breakthroughs.
Responsibilities
Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Senior Cloud Infrastructure Engineer at InfoTrack executing cloud strategy. Designing, building, and optimizing secure, scalable infrastructure while collaborating with global teams.
Principal Engineer leading design and implementation of secure architectures for Walmart’s AI Security Team. Responsibilities include risk management, capacity planning, and cross-team collaboration.
Communications Desk Infrastructure Engineer responsible for maintaining and troubleshooting APS communication systems. Supporting critical operational and public safety communication needs across Arizona.
Student Assistant in IT Infrastructure Engineering at Liebherr - Hamburg. Supporting network solutions, system configurations and project management tasks.
Infrastructure Architect required for designing a next-gen hosting platform in Kubernetes at Enova Consulting. Collaborating closely with engineers and partners for a hybrid infrastructure solution.
Cloud Infrastructure Engineer ensuring AWS service reliability and performance at Perlego. Collaborating with teams and managing infrastructure in a hybrid working environment.
Senior Infrastructure Engineer designing and building hybrid networks for ICEYE’s satellite operations. Ensuring high-throughput and reliability between ground stations and cloud environments.
AI Infrastructure Engineer designing and implementing AI solutions for Xsolla's infrastructure tasks across GCP and multi-cloud environments. Collaborating with senior engineers to execute AI strategy.
Data Transport Infrastructure Engineer at Leidos supporting U.S. Air Force Cloud One Architecture. Developing scalable cloud-native solutions and mentoring others in a hybrid remote setting.
Principal Software Engineer on Walmart's AI Security team analyzing threats and implementing robust security architectures. Collaborating across domains and mentoring on AI safety and secure engineering practices.