Senior AI Engineer responsible for end-to-end benchmarks and evaluations at Aleph Alpha Research in Heidelberg. The role focuses on evaluating ML models, including their German-language capabilities, with full ownership of the work in a hybrid working environment.
Responsibilities
Own benchmarks end-to-end: select, implement, and maintain the evaluation suite used during pre-training — from dataset curation to scoring infrastructure to result analysis.
Build evaluation infrastructure: develop and optimize the pipelines that run evaluations against training checkpoints, ensuring speed, reliability, and reproducibility.
Design aggregation and reporting: define how benchmark results translate into training decisions, and build the tooling that makes results interpretable.
Close capability gaps: work with product and post-training teams to identify where our models fall short, then create or integrate benchmarks that measure progress.
Own German evaluation: ensure rigorous assessment of German language capabilities — this is core to our value proposition, not an afterthought.
Correlate signals: establish which pre-training metrics actually predict downstream and system-level performance.
Requirements
Experience with LLM evaluation, benchmark design, and evaluation dataset curation.
Familiarity with statistical methods for evaluation and experimental design.
Track record of shipping impactful technical work — whether that's research, infrastructure, or both.
Strong Python skills and comfort with ML tooling (PyTorch, evaluation frameworks, distributed systems).
Ability to reason about what an evaluation measures and whether it matters — not just run benchmarks, but understand them.
Ownership mentality: you see problems through from diagnosis to solution to deployment.
Willingness to relocate to Heidelberg or travel regularly (potentially weekly).
Benefits
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through nilo.health
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours and a hybrid working model for better work-life balance