Principal Systems At-Scale Engineer deploying strategies to improve large-scale data center clusters. Collaborating with visionary professionals to optimize systems in AI and GPU computing.
Responsibilities
Deploy strategies to collect and analyze debugging and anomaly signals from large fleets of clusters to improve quality and experience.
Build and expand debugging tools to identify, diagnose, and recover out-of-service systems, growing customer-available capacity.
Author and deploy "fault signatures" and automated recovery rules.
Lead cross-team task forces to address undefined failure modes in high-value AI/GPU systems, cutting backlogs through data-driven isolation.
Leverage AI, analytics, and efficiency tools to scale debug efforts, turning manual triage into productized, automated code.
Act as a technical leader and cultural anchor.
Mentor junior and senior engineers.
Encourage organizational health initiatives.
Promote innovation through hackathons and sharing sessions.
Requirements
15+ years of experience debugging systems at scale, including components of large fleets.
BS/MS in Computer Science or a related field (or equivalent experience).
Proven understanding of performance clusters, infrastructure, and workload patterns.
Knowledge of and experience with telemetry and at-scale analytics for large platforms.
Experience installing and operating fleets of Linux-based server platforms.
Tooling Engineer developing best practices for automotive flooring systems at Auria. Collaborating with engineering and procurement teams on tooling specifications and supplier management.
Senior Databricks DWH Engineer responsible for designing ETL data pipelines. Collaborating with teams to deliver scalable solutions in the banking sector.
Technician Engineer supporting Roads Network Management Services at Fife Council. Investigating solutions for roads problems and managing budgets for projects.
System and Monitoring Engineer providing technical support for the ODM platform at TalentHackers. Ensuring system performance and stability while monitoring integrations with core banking systems.
Senior Analog Design Engineer responsible for designing and validating high-performance analog circuitry. Collaborating with cross-functional teams to translate requirements into robust, manufacturable designs.
Advanced Control Engineer optimizing turbine and combustion control strategies for efficient power plant operations at Emerson. Leading DCS commissioning and innovative control solutions for enhanced performance.
CAE Engineer responsible for process modeling and optimization in machining assembly for Powertrain programs. Collaborating with cross-functional teams to ensure feasibility and technical support.
Engineer - BESS at Aula Energy developing battery energy storage systems for renewable projects. Supporting design, construction, and operations of energy storage across Australia.
Senior Manufacturing Engineer managing manufacturing processes for Argen, the largest dental zirconia manufacturer in North America. Supporting project execution and driving process improvements with a focus on quality and efficiency.
Configurator Engineer contributing to tailoring solutions for banking and insurance AI. Collaborating with teams to optimize configurations and validate solutions for business requirements.