Technical Staff designing and optimizing distributed training systems for GPU clusters. Aiming to reduce convergence time through efficient coding and infrastructure optimization.
Responsibilities
Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels
Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization
Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks
Optimize workloads for hardware efficiency: CPU/GPU compute balance, memory management, data throughput, and networking
Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures
Requirements
Deep experience in distributed systems, ML infrastructure, or high-performance computing (8+ years)
Production-grade expertise in Python
Low-level performance mastery: CUDA/cuDNN/Triton, CPU–GPU interactions, data movement, and kernel optimization
Scaling at the frontier: experience with PyTorch and training jobs using data, context, pipeline, and model parallelism
System-level mindset with a track record of tuning hardware–software interactions for maximum utilization
Vehicle Integration Supervisor overseeing prototype vehicle development and test operations for Ford Racing products. Leading a team to ensure timely delivery and quality in engineering execution.
Bilingual Director, Software Development guiding our document management services and leading software teams. Focused on strategy, platform evolution, and delivery of customer-driven solutions at scale.
Engineering Technologist solving technical problems and supporting business objectives at Duke Energy. Collaborating on engineering studies and reporting while fostering growth and independence.
Associate Data & Access Specialist ensuring compliance with Data and Access Security policies for Flutter International. Engaging with teams to onboard applications and improve Data Security controls.
Software Developer for desktop applications with React Native at Novotec Medical. Focused on developing applications for fitness studios and innovative medical technology.
Senior iOS Developer embedded in a product team for an international software solutions company. Working with Swift, SwiftUI, and the Composable Architecture on a native iOS application.
Programmer Analyst overseeing user-reported incidents in an ERP system and maintaining user communications. Validating issues and generating KPIs for tech support teams.
Technical Intern assisting traffic and intelligent transportation group in Akron, OH. Engaging in transport studies and collaborating on signal timing and traffic analysis.
Instrumentation & Control Systems Engineer joining Arcadis's Water Division for SCADA, Instrumentation, and Controls projects in Massachusetts and Connecticut. Contributing technical expertise for water/wastewater treatment facilities.
Senior Project Engineer leading structural design in precast concrete for major construction projects. Overseeing engineering calculations, drawings, and project management from start to finish.