Senior Performance and Development Engineer at NVIDIA focusing on optimizing AI workloads and developing scalable AI infrastructure tools. Collaborating with a diverse team to enhance Deep Learning applications.
Responsibilities
Build AI models, tools, and frameworks that provide real-time application performance metrics that can be correlated with system metrics.
Develop automation frameworks that enable applications to predict and overcome system and infrastructure failures, ensuring fault tolerance.
Collaborate with software teams to pinpoint performance bottlenecks.
Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.
Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.
Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.
Requirements
BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.
12+ years of proven experience analyzing and improving the performance of training applications using PyTorch or a similar framework.
Experience building distributed software applications using collective communication libraries such as MPI, NCCL, or UCC.
Experience constructing storage solutions for Deep Learning applications.
Experience building automated, fault-tolerant distributed applications.
Experience building tools for bottleneck analysis and automation of fault tolerance in distributed environments.
Strong background in parallel programming and distributed systems.
Experience analyzing and optimizing large-scale distributed applications.
Excellent verbal and written communication skills.
Senior Environmental Engineer leading hazardous building materials assessments and Phase I/II Environmental Site Assessments in environmental consulting. Collaborating with local teams in Nova Scotia and Atlantic Canada.
Sr. Principal PIC Development Engineer overseeing the design and development of Infinera’s Photonic Integrated Circuits products. Collaborating cross-functionally to manage project budgets and schedules.
Care Engineer delivering technical support for Nokia's NPC and CSD systems, ensuring reliability and performance. Collaborating with customers and global teams for problem resolution.
Business Engineer developing client relationships for ABGi Technology while overseeing a team of consultants. Focused on sales growth and client engagement.
Commercial Refrigeration Engineer troubleshooting and resolving issues with clients' refrigeration systems. Providing service and maintenance while developing knowledge of industrial systems in a mobile work structure.
Wintel Engineer responsible for maintaining the reliability and performance of hybrid Windows Server platforms. Ensuring consistency and security for business-critical applications in Glasgow.
Broadcast Engineer providing first and second tier technical support for live shows and offline productions. Troubleshooting broadcast equipment and ensuring quality operation.
Complaint Handling Engineer managing quality issues in Digital Solutions for medical software. Timely handling of complaints and communication with country organizations for resolution.
Senior Middleware Engineer managing Kubernetes and Docker applications for Ameriprise India LLP. Leading incident response and collaborating with teams on middleware technologies in a hybrid work environment.