DataDecide: How to Predict Best Pretraining Data with Small Experiments
Abstract
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
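To make the "~80% of comparisons correct" figure concrete, here is a minimal sketch of pairwise decision accuracy: the fraction of corpus pairs whose ordering at a small scale agrees with their ordering at the 1B target scale. The corpus names, scores, and function below are illustrative assumptions, not the released DataDecide code.

```python
from itertools import combinations

def decision_accuracy(small_scores: dict, target_scores: dict) -> float:
    """Fraction of corpus pairs ranked the same way at the small scale
    and at the target (1B) scale.

    Both arguments map corpus name -> scalar benchmark score (higher is
    better). Names and structure are illustrative, not from DataDecide.
    """
    corpora = sorted(set(small_scores) & set(target_scores))
    agree = total = 0
    for a, b in combinations(corpora, 2):
        small_diff = small_scores[a] - small_scores[b]
        target_diff = target_scores[a] - target_scores[b]
        if small_diff == 0 or target_diff == 0:
            continue  # skip exact ties in this toy version
        total += 1
        if (small_diff > 0) == (target_diff > 0):
            agree += 1
    return agree / total if total else float("nan")

# Toy example with made-up scores for three corpora:
small = {"c4": 0.42, "dclm": 0.47, "fineweb": 0.45}
target = {"c4": 0.55, "dclm": 0.63, "fineweb": 0.60}
print(decision_accuracy(small, target))  # 1.0 -- the small scale orders every pair correctly
```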
Community
Behind the scenes, every AI lab creates and experiments with many small models and pretraining datasets as part of developing its language models. If made public, these models and datasets could be rich sources of insight into important questions, such as how developers decide which dataset to use for pretraining, or which benchmarks to hill-climb on.
As part of Ai2’s commitment to openness, and to empower open exploration of these questions, today we release DataDecide: a suite of models pretrained on 25 corpora with differing sources, deduplication, and filtering, with up to 100B tokens, across 14 model sizes ranging from 4M to 1B parameters (more than 30k model checkpoints in total). We evaluate all models on a suite of 10 downstream tasks and measure how accurately small models predict that one pretraining corpus will lead to better performance than another for our largest models. Our conclusions provide recommendations about the best and most cost-effective benchmarks, prediction methods, and metrics for making these decisions.
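The abstract reports that continuous likelihood metrics make several benchmarks far more predictable from small models than discrete accuracy. Below is a minimal sketch of one such metric, the average per-token probability a model assigns to the gold answer continuation, using Hugging Face transformers; the placeholder checkpoint and the assumption that the prompt tokenization is a prefix of the full-sequence tokenization are ours, not part of the paper's evaluation pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_likelihood(model, tokenizer, prompt: str, answer: str) -> float:
    """Average per-token probability of the gold answer continuation.

    One possible 'continuous' proxy for accuracy-style metrics. Assumes
    tokenizing `prompt` yields a prefix of tokenizing `prompt + answer`,
    which may not hold for every tokenizer.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    # The token at position i is predicted from the logits at position i - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    answer_log_probs = log_probs[prompt_len - 1:].gather(
        1, targets[prompt_len - 1:, None]
    ).squeeze(1)
    return answer_log_probs.mean().exp().item()

# Placeholder checkpoint; substitute any small causal LM you want to score.
# tokenizer = AutoTokenizer.from_pretrained("<small-checkpoint>")
# model = AutoModelForCausalLM.from_pretrained("<small-checkpoint>").eval()
# print(answer_likelihood(model, tokenizer, "Q: 2+2=? A:", " 4"))
```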
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions (2025)
- Predictable Scale: Part I - Optimal Hyperparameter Scaling Law in Large Language Model Pretraining (2025)
- Compute Optimal Scaling of Skills: Knowledge vs Reasoning (2025)
- LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws (2025)
- Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo (2025)
- Predictive Data Selection: The Data That Predicts Is the Data That Teaches (2025)
- Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective (2025)