Senior Engineering Manager leading Data Center telemetry solutions at NVIDIA, driving architecture, development, and deployment for AI supercomputing platforms. Recruiting and managing top talent to optimize data center performance.
Responsibilities
Own the end-to-end architecture and delivery for telemetry solutions, including fleet health monitoring, fault remediation, and data visualization at scale
Own OOB telemetry solution and data validation for telemetry from each underlying device
Recruit, develop, and motivate a high-performing engineering team focused on platform telemetry, RAS, and observability
Continuously improve software development processes for optimal productivity and quality
Work across teams to ensure seamless integration of telemetry solutions with platform firmware, server architecture, and data center management
Drive product life cycles with QA teams, ensuring robust testing, productization, and delivery
Conduct performance reviews, foster a culture of excellence, and ensure high productivity
Requirements
12+ years of overall relevant experience
5+ years of managing systems/platform software teams
BS, MS, or PhD in EE/CS or related field (or equivalent experience)
Strong knowledge of DMTF/PLDM for OOB telemetry collection
Experience with time series databases (e.g., InfluxDB, Prometheus) and REST APIs (e.g., Redfish)
Deep understanding of server and firmware architecture and optimization for low-latency APIs
Proven track record of delivering scalable server products and telemetry solutions
Experience with SCM (Git, Perforce) and project management tools (Jira)
Hands-on experience with x86/ARM system architecture and coding (C/C++, Python)
Familiarity with Confidential Compute and notification systems
Demonstrated ability to analyze algorithms for time/space complexity and system resource requirements
Benefits
Equity
Benefits
Job title
Senior Manager, Engineering – Data Center Telemetry, RAS