Senior AI Engineer responsible for data preparation for foundation model pre-training, serving industries across the German-speaking market. You will collaborate on data quality and processing to enhance model capabilities.
Responsibilities
Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
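To give a flavour of the quality-filtering work described above, here is a toy sketch of heuristic document filters of the kind used when curating pre-training corpora. The specific thresholds and rules are illustrative assumptions, not the team's actual production values.

```python
import re

def mean_word_length(text: str) -> float:
    """Average token length; gibberish and markup skew this metric."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def symbol_to_word_ratio(text: str) -> float:
    """Fraction of markup-like symbols per word (illustrative symbol set)."""
    words = text.split()
    symbols = len(re.findall(r"[#@{}<>|\\\\]", text))
    return symbols / len(words) if words else 0.0

def passes_quality_heuristics(text: str) -> bool:
    """Keep a document only if it clears simple quality heuristics.
    Thresholds are placeholders for illustration."""
    words = text.split()
    if len(words) < 5:                                   # too short to be useful
        return False
    if not (2.0 <= mean_word_length(text) <= 12.0):      # likely gibberish or markup
        return False
    if symbol_to_word_ratio(text) > 0.1:                 # likely template/boilerplate residue
        return False
    return True

docs = [
    "Die Heidelberger Altstadt liegt am Neckar und ist ein beliebtes Reiseziel.",
    "{{#if}} <td>|||</td> {{/if}}",
    "ok",
]
kept = [d for d in docs if passes_quality_heuristics(d)]
```

In a real pipeline these cheap heuristics typically run first, with classifier-based and perplexity-based filters applied to the survivors.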
Requirements
Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
Ownership mentality: you see problems through from diagnosis to solution to deployment.
Willingness to relocate to Heidelberg or travel at least fortnightly.
Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
Rust proficiency (parts of our data pipeline are performance-critical).
Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
Bonus: German language proficiency is helpful for curating and assessing German-language data.
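As a minimal illustration of the deduplication experience listed above, the sketch below shows an exact-match pass using content hashing with light normalisation. This is an assumption-laden toy, not the team's pipeline; production systems usually add fuzzy methods such as MinHash/LSH on top.

```python
import hashlib

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants collide.
    return " ".join(text.lower().split())

def dedupe_exact(docs: list[str]) -> list[str]:
    """Drop exact duplicates (after normalisation), keeping first occurrence."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "Foundation models are trained on large corpora.",
    "Foundation  models are trained on LARGE corpora.",  # whitespace/case variant
    "German-language coverage is a core requirement.",
]
deduped = dedupe_exact(corpus)
```

Hashing the normalised text rather than the raw string is a deliberate choice: it catches trivial variants at the cost of keeping only one casing of each document.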
Benefits
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through nilo.health
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours and a hybrid working model for better work-life balance