AI Trace Generation Engineer designing and implementing trace collection systems for LLM workloads. Analyzing distributed AI workload behavior across multi-GPU and multi-node setups.
Responsibilities
Design and implement a trace collection system for distributed LLM workloads
Validate that collected traces accurately reflect real workload behavior
Integrate with and instrument major LLM frameworks to extract meaningful execution data
Use collected traces as input to discrete-event simulations
Analyze trace data to surface bottlenecks and inefficiencies across the stack
Requirements
3+ years of experience in AI systems, ML infrastructure, or a closely related area
Hands-on experience with at least one major LLM serving or training framework
Strong proficiency in Python and C++
Solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
Solid understanding of distributed communication
Familiarity with parallelism strategies and how they shape execution behavior across large clusters
Open-source contributions or published research in relevant areas are a plus
Previous startup experience is a plus
Benefits
Competitive compensation with a performance-based incentive
Subsidized Deutschlandticket
Access to a discount portal
Flexible hours with hybrid and remote-friendly options