Senior Machine Learning Infrastructure Engineer designing and building scalable systems for data modeling and analytics. Collaborating in a dynamic startup focused on brain-computer interface technologies.
Responsibilities
Create flexible and performant ML infrastructure
Design and build cloud infrastructure for ML systems to enable massive-scale modeling and analytics
Support diverse model exploration, hyperparameter optimization, pretraining, fine-tuning, and evaluation processes
Design and optimize scalable distributed training pipelines, with support for features such as model sharding, cross-GPU communication, and real-time training monitoring
Create, operate, and maintain robust ML platforms and services across the model lifecycle
Make informed architecture decisions that balance performance, cost, reliability, and scalability
Build diverse and scalable data platforms
Design, build, and optimize massive-scale databases and data pipelines for scalable, flexible, and reliable data access
Explore research-driven, tailored data solutions using existing and simulated data, comparing performance and efficiency across solutions for typical data-access patterns
Create infrastructure and pipelines for ingesting internal and external datasets with varied shapes, formats, and associated metadata
Design and assess custom data formats for efficient storage and slicing of high-dimensional time-series data
Enable efficient data movement, preprocessing, and artifact management for data lineage and modeling reproducibility
Meet company standards for delivered solutions
Establish best practices for reliability, observability, reproducibility, and operational excellence across the ML ecosystem
Make informed and collaborative decisions with domain experts across the software & ML teams
Foster visibility and reproducibility within the company by maintaining extensive documentation of design decisions, evaluations of viable alternatives for selected solutions, pipeline assessments, etc.
Support ML R&D operations while preparing for eventual incorporation into product pipelines
Requirements
Bachelor's degree in Computer Science, Electrical Engineering, or a related technical discipline
5+ years of industry experience in software engineering, large-scale data infrastructure, or systems ML
Extensive proficiency in Python
Familiarity with PyTorch
Experience designing, building, and maintaining high-throughput data pipelines for large and diverse datasets
Experience working with distributed-training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM, Ray)
Experience building or optimizing ML training pipelines for transformers or other large neural-network models
Demonstrated ability to partner closely with research and modeling teams to productionize workflows
Excellent communication and collaboration skills to work effectively on cross-functional and interdisciplinary teams
Experience with technical ownership of at least one successfully delivered collaborative project
Benefits
Competitive compensation, including stock options.