Hybrid Senior AI Engineer – Pre-training Data

Posted 4 days ago

About the role

  • As a Senior AI Engineer, you will be responsible for data preparation in foundation model pre-training for various German-speaking industries, collaborating on data quality and processing to enhance model capabilities.

Responsibilities

  • Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
  • Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
  • Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
  • Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
  • Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
  • Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
  • Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
  • Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.

Requirements

  • Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
  • Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
  • Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
  • Ownership mentality: you see problems through from diagnosis to solution to deployment.
  • Willingness to relocate to Heidelberg or travel there at least fortnightly.
  • Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
  • Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
  • Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
  • Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
  • Rust proficiency (parts of our data pipeline are performance-critical).
  • Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
  • PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
  • German language proficiency (a bonus, not a requirement): helpful for curating and assessing German-language data.

Benefits

  • 30 days of paid vacation
  • Access to a variety of fitness & wellness offerings via Wellhub
  • Mental health support through nilo.health
  • Substantially subsidized company pension plan for your future security
  • Subsidized Germany-wide transportation ticket
  • Budget for additional technical equipment
  • Flexible working hours and a hybrid working model for better work-life balance
  • Virtual Stock Option Plan
  • JobRad® Bike Lease

Job title

Senior AI Engineer – Pre-training Data

Experience level

Senior

Salary

Not specified

Degree requirement

No Education Requirement
