Data Engineer building data infrastructure at Aldea, a multi-modal AI company. Designing and scaling data pipelines for language and speech domains at trillion+ token scale.
Responsibilities
Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Requirements
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100GB-10TB+) and understanding of storage systems and data formats
Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones
Senior Data Engineer designing and optimizing data platforms for clients using Microsoft Azure, Microsoft Fabric, Power BI, and Databricks. Working closely with clients to deliver scalable solutions.
Data Engineer providing technical expertise on the mission-critical NAVSUP OIS program. Work involves data architecture and database management in AWS GovCloud environments.
Senior Data Engineer focusing on data infrastructure for an AI-driven insurtech startup based in Nepal. Collaborating with teams to optimize data models and maintain data quality.
Senior Professional Consultant leading architecture and design for SAP BW and SAC solutions at Freudenberg. Collaborating with stakeholders and optimizing performance of data landscapes.
Senior Data Engineer designing and managing data architectures to transform large-scale data into insights for Humana. Involves leading technical discussions and implementing data best practices.
Data Engineer II at Early Warning Services developing data science tools and infrastructure. Collaborating on software enhancements and mentoring interns in a hybrid work environment.
Senior Data Architect responsible for optimizing data architecture and supporting data-driven business decisions at TruStage. Providing technical guidance on data architecture and leading cross-functional team collaboration.
Senior Data Architect developing data architecture plans at The Hartford, collaborating with internal teams to align data standards and practices. Leading complex solutions with a focus on operational effectiveness.
Senior Solution Architect defining architecture framework for SA‑CCR in regulatory risk. Collaborating with stakeholders to ensure compliance and efficient data governance.