---
license: apache-2.0
datasets:
- allenai/datadecide
language:
- en
---
|
 |
|
|
|
More than one training run goes into making a large language model, but developers rarely release the small models and datasets they experiment with during the development process. How do they decide what dataset to use for pretraining or which benchmarks to hill climb on? To empower open exploration of these questions, we release [DataDecide](https://allenai.org/paper/datadecide), a suite of models we pretrain on 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, over 14 different model sizes ranging from 4M parameters up to 1B parameters (more than 30k model checkpoints in total).
|
|
|
## 350 Models over Differences in Data and Scale
|
For each of our 25 datasets and 14 model sizes, we train a model, linked below. Each has intermediate checkpoints (uploading after initial release) and runs over 3 random seeds. All models finish training at a token-to-parameter ratio of 100 (e.g., 1B parameters -> 100B tokens).
|
| | | | | | | | | | | | | | | | |
|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|-----| |
|
| Dolma1.7 | [4M](https://huggingface.co/allenai/DataDecide-dolma1_7-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_7-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_7-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_7-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_7-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_7-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_7-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_7-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_7-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_7-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_7-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_7-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_7-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_7-1B) | |
|
| Dolma1.7 (no code) | [4M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_7-no-code-1B) | |
|
| Dolma1.7 (no math, code) | [4M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_7-no-math-code-1B) | |
|
| Dolma1.7 (no Reddit) | [4M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_7-no-reddit-1B) | |
|
| Dolma1.7 (no Flan) | [4M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_7-no-flan-1B) | |
|
| Dolma1.6++ | [4M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-4M) | [6M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-6M) | [8M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-8M) | [10M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-10M) | [14M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-14M) | [16M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-16M) | [20M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-20M) | [60M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-60M) | [90M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-90M) | [150M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-150M) | [300M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-300M) | [530M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-530M) | [750M](https://huggingface.co/allenai/DataDecide-dolma1_6plus-750M) | [1B](https://huggingface.co/allenai/DataDecide-dolma1_6plus-1B) | |
|
| C4 | [4M](https://huggingface.co/allenai/DataDecide-c4-4M) | [6M](https://huggingface.co/allenai/DataDecide-c4-6M) | [8M](https://huggingface.co/allenai/DataDecide-c4-8M) | [10M](https://huggingface.co/allenai/DataDecide-c4-10M) | [14M](https://huggingface.co/allenai/DataDecide-c4-14M) | [16M](https://huggingface.co/allenai/DataDecide-c4-16M) | [20M](https://huggingface.co/allenai/DataDecide-c4-20M) | [60M](https://huggingface.co/allenai/DataDecide-c4-60M) | [90M](https://huggingface.co/allenai/DataDecide-c4-90M) | [150M](https://huggingface.co/allenai/DataDecide-c4-150M) | [300M](https://huggingface.co/allenai/DataDecide-c4-300M) | [530M](https://huggingface.co/allenai/DataDecide-c4-530M) | [750M](https://huggingface.co/allenai/DataDecide-c4-750M) | [1B](https://huggingface.co/allenai/DataDecide-c4-1B) | |
|
| FineWeb-Pro | [4M](https://huggingface.co/allenai/DataDecide-fineweb-pro-4M) | [6M](https://huggingface.co/allenai/DataDecide-fineweb-pro-6M) | [8M](https://huggingface.co/allenai/DataDecide-fineweb-pro-8M) | [10M](https://huggingface.co/allenai/DataDecide-fineweb-pro-10M) | [14M](https://huggingface.co/allenai/DataDecide-fineweb-pro-14M) | [16M](https://huggingface.co/allenai/DataDecide-fineweb-pro-16M) | [20M](https://huggingface.co/allenai/DataDecide-fineweb-pro-20M) | [60M](https://huggingface.co/allenai/DataDecide-fineweb-pro-60M) | [90M](https://huggingface.co/allenai/DataDecide-fineweb-pro-90M) | [150M](https://huggingface.co/allenai/DataDecide-fineweb-pro-150M) | [300M](https://huggingface.co/allenai/DataDecide-fineweb-pro-300M) | [530M](https://huggingface.co/allenai/DataDecide-fineweb-pro-530M) | [750M](https://huggingface.co/allenai/DataDecide-fineweb-pro-750M) | [1B](https://huggingface.co/allenai/DataDecide-fineweb-pro-1B) | |
|
| FineWeb-Edu | [4M](https://huggingface.co/allenai/DataDecide-fineweb-edu-4M) | [6M](https://huggingface.co/allenai/DataDecide-fineweb-edu-6M) | [8M](https://huggingface.co/allenai/DataDecide-fineweb-edu-8M) | [10M](https://huggingface.co/allenai/DataDecide-fineweb-edu-10M) | [14M](https://huggingface.co/allenai/DataDecide-fineweb-edu-14M) | [16M](https://huggingface.co/allenai/DataDecide-fineweb-edu-16M) | [20M](https://huggingface.co/allenai/DataDecide-fineweb-edu-20M) | [60M](https://huggingface.co/allenai/DataDecide-fineweb-edu-60M) | [90M](https://huggingface.co/allenai/DataDecide-fineweb-edu-90M) | [150M](https://huggingface.co/allenai/DataDecide-fineweb-edu-150M) | [300M](https://huggingface.co/allenai/DataDecide-fineweb-edu-300M) | [530M](https://huggingface.co/allenai/DataDecide-fineweb-edu-530M) | [750M](https://huggingface.co/allenai/DataDecide-fineweb-edu-750M) | [1B](https://huggingface.co/allenai/DataDecide-fineweb-edu-1B) | |
|
| Falcon | [4M](https://huggingface.co/allenai/DataDecide-falcon-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-1B) | |
|
| Falcon+CC | [4M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-and-cc-1B) | |
|
| Falcon+CC (QC 10%) | [4M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-10p-1B) | |
|
| Falcon+CC (QC 20%) | [4M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-20p-1B) | |
|
| Falcon+CC (QC Orig 10%) | [4M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-orig-10p-1B) | |
|
| Falcon+CC (QC Tulu 10%) | [4M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-4M) | [6M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-6M) | [8M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-8M) | [10M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-10M) | [14M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-14M) | [16M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-16M) | [20M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-20M) | [60M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-60M) | [90M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-90M) | [150M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-150M) | [300M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-300M) | [530M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-530M) | [750M](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-750M) | [1B](https://huggingface.co/allenai/DataDecide-falcon-and-cc-qc-tulu-10p-1B) | |
|
| DCLM-Baseline | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-1B) | |
|
| DCLM-Baseline (QC 7%, FW2) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw2-1B) | |
|
| DCLM-Baseline (QC 7%, FW3) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-7p-fw3-1B) | |
|
| DCLM-Baseline (QC FW 3%) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-3p-1B) | |
|
| DCLM-Baseline (QC FW 10%) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-fw-10p-1B) | |
|
| DCLM-Baseline (QC 10%) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-10p-1B) | |
|
| DCLM-Baseline (QC 20%) | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-qc-20p-1B) | |
|
| DCLM-Baseline 25% / Dolma 75% | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-25p-dolma1.7-75p-1B) | |
|
| DCLM-Baseline 50% / Dolma 50% | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-50p-dolma1.7-50p-1B) | |
|
| DCLM-Baseline 75% / Dolma 25% | [4M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-4M) | [6M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-6M) | [8M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-8M) | [10M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-10M) | [14M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-14M) | [16M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-16M) | [20M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-20M) | [60M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-60M) | [90M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-90M) | [150M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-150M) | [300M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-300M) | [530M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-530M) | [750M](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-750M) | [1B](https://huggingface.co/allenai/DataDecide-dclm-baseline-75p-dolma1.7-25p-1B) | |
|
|
|
## Load a Model |
|
|
|
To load a specific model with Hugging Face:
|
|
|
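```python
# Requires the OLMo package: pip install ai2-olmo
from hf_olmo import OLMoForCausalLM

# `revision` selects one checkpoint; here, step 69,369 (the final 1B step) of the default seed.
olmo = OLMoForCausalLM.from_pretrained(
    "allenai/DataDecide-dolma1_7-1B",
    revision="step69369-seed-default",
)
```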
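The repository names in the table above and the revision string in the loading snippet appear to follow regular patterns, so they can be built programmatically. The sketch below is illustrative only: the helper names (`repo_id`, `revision_name`, `parse_revision`, `list_checkpoints`) are hypothetical, the `step<N>-seed-<name>` pattern is inferred from the single example above, and it assumes that checkpoints are published as branches of each model repository.

```python
import re

ORG = "allenai"

def repo_id(recipe_slug: str, size: str) -> str:
    """Build a repo ID in the same shape as the links in the model table."""
    return f"{ORG}/DataDecide-{recipe_slug}-{size}"

def revision_name(step: int, seed: str = "default") -> str:
    """Format a revision string (pattern assumed from one documented example)."""
    return f"step{step}-seed-{seed}"

# Assumed revision pattern: "step<digits>-seed-<seed name>".
_REVISION = re.compile(r"^step(\d+)-seed-(.+)$")

def parse_revision(name: str):
    """Return (step, seed) for a matching revision name, or None otherwise."""
    match = _REVISION.match(name)
    return (int(match.group(1)), match.group(2)) if match else None

def list_checkpoints(repo: str):
    """List (step, seed) pairs for branches matching the assumed pattern.

    Needs network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo)
    parsed = (parse_revision(branch.name) for branch in refs.branches)
    return sorted(p for p in parsed if p is not None)
```

For example, `repo_id("dolma1_7", "1B")` and `revision_name(69369)` reproduce the two strings passed to `from_pretrained` above. Calling `list_checkpoints` on a repository is one way to discover which revisions actually exist rather than guessing step numbers or seed names.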
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
|
|
- **Developed by:** Allen Institute for AI (Ai2) |
|
- **Model type:** A Transformer-style autoregressive language model.
|
- **Language(s) (NLP):** English |
|
- **License:** The code and model are released under Apache 2.0. |
|
- **Contact:** Technical inquiries: `ianmag@cs.washington.edu`. Press: `press@allenai.org` |
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** [https://github.com/allenai/DataDecide](https://github.com/allenai/DataDecide) |
|
- **Paper:** [https://allenai.org/paper/datadecide](https://allenai.org/paper/datadecide)
|
- **Data:** [https://huggingface.co/datasets/allenai/datadecide](https://huggingface.co/datasets/allenai/datadecide) |
|
|
|
## Data |
|
|
|
| Source / Recipe | Description |
|----------------------------------------|-------------|
| **Dolma1.7** *Original, No code, No math/code, No Reddit, No Flan* | A 2.3T-token corpus (Dolma 1.7; [Soldaini et al., 2024](https://arxiv.org/abs/2402.00159)) sampling common LM sources for open research. We ablate code, math/code, Reddit, or Flan subsets. |
| **Dolma1.6++** *Original* | Dolma 1.6 plus additional sources from Dolma 1.7: RedPajama’s arXiv subset, OpenWebMath, Algebraic Stack, Flan, StarCoder, and Falcon. |
| **C4** *Original* | The C4 dataset ([Raffel et al., 2019](https://arxiv.org/abs/1910.10683)) as prepared in Dolma 1.7, heuristically filtered from the April 2019 Common Crawl. |
| **FineWeb-Pro** *Original* | The FineWeb Pro corpus ([Zhou et al., 2024](https://arxiv.org/abs/2409.17115)), featuring model-driven data cleaning on FineWeb. |
| **FineWeb-Edu** *Original* | The deduplicated FineWeb-Edu subset of SmolLM-Corpus ([Ben Allal et al., 2024](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)), focused on educational web pages. |
| **Falcon** *Original* | The Falcon RefinedWeb corpus ([Penedo et al., 2023](https://api.semanticscholar.org/CorpusID:259063761)) in Dolma 1.7, derived from Common Crawl through June 2023 and more aggressively filtered/deduplicated than C4. |
| **Falcon+CC** *Original, QC 10%, QC 20%, QC Orig 10%, QC Tulu 10%* | Falcon combined with Dolma 1.7’s Common Crawl. We quality filter to the top 10% or 20% of documents using either a reproduced or the original [Li et al. (2024)](https://arxiv.org/abs/2406.11794) filter, or a filter retrained on a pre-release version of Tulu-v3 ([Lambert et al., 2024](https://arxiv.org/abs/2411.15124)). |
| **DCLM-Baseline** *Original, QC 7% FW2, QC 7% FW3, QC FW 3%, QC FW 10%, QC 10%, QC 20%* | A SOTA Common Crawl corpus using the best ablated deduplication, cleaning heuristics, and quality filter. We quality filter to the top 7% of DCLM-classified documents and further keep those scoring 2+ or 3+ with the FineWeb-Edu classifier; or filter to the top 3% or 10% with the FineWeb-Edu classifier; or take the top 10% or 20% with a reproduced DCLM classifier. |
| *λ* **DCLM-Baseline** *+ (1 – λ)* **Dolma1.7** | Fractional combinations of Dolma1.7 and DCLM-Baseline, mixing the two datasets in different proportions for λ ∈ {25%, 50%, 75%}. |
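Several recipes above keep only the top-scoring share of documents under a quality classifier (for example, "QC 10%"). The following is a minimal sketch of that kind of top-percent selection, not the pipeline used to build these corpora: the function name `keep_top_percent`, the toy documents, and the scores are all hypothetical, and the actual classifiers and thresholds are described in the cited papers.

```python
def keep_top_percent(docs, scores, percent):
    """Keep the `percent` highest-scoring documents, preserving input order.

    `scores[i]` is a (hypothetical) quality-classifier score for `docs[i]`;
    higher means better. `percent` is an integer between 1 and 100.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")

    # Integer ceiling division avoids floating-point edge cases.
    n_keep = (len(docs) * percent + 99) // 100

    # Rank indices by score, take the best n_keep, then restore input order.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [docs[i] for i in kept]


# Toy data: ten documents with made-up scores.
docs = [f"doc{i}" for i in range(10)]
scores = [0.12, 0.95, 0.40, 0.33, 0.81, 0.05, 0.67, 0.50, 0.29, 0.74]

top10 = keep_top_percent(docs, scores, 10)  # one document survives
top20 = keep_top_percent(docs, scores, 20)  # two documents survive
```

Ranking by score and cutting at a fixed share, rather than at a fixed score threshold, keeps the retained fraction constant even when different classifiers produce differently scaled scores.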
|
|
|
|
|
## Evaluation |
|
|
|
We evaluate all checkpoints on the OLMES suite of 10 multiple-choice question-answering benchmarks ([Gu et al., 2024](https://arxiv.org/abs/2406.08446)):
|
|
|
- [MMLU (Hendrycks et al., 2021)](https://arxiv.org/abs/2009.03300) |
|
- [HellaSwag (Zellers et al., 2019)](https://arxiv.org/abs/1905.07830) |
|
- [ARC-Challenge (Clark et al., 2018)](https://arxiv.org/abs/1803.05457) |
|
- [ARC-Easy (Clark et al., 2018)](https://arxiv.org/abs/1803.05457) |
|
- [PIQA (Bisk et al., 2020)](https://arxiv.org/abs/1911.11641) |
|
- [CommonsenseQA (Talmor et al., 2019)](https://arxiv.org/abs/1811.00937) |
|
- [Social IQa (Sap et al., 2019)](https://arxiv.org/abs/1904.09728) |
|
- [OpenBookQA (Mihaylov et al., 2018)](https://arxiv.org/abs/1809.02789) |
|
- [BoolQ (Clark et al., 2019)](https://arxiv.org/abs/1905.10044) |
|
- [Winogrande (Sakaguchi et al., 2020)](https://arxiv.org/abs/1907.10641) |
|
|
|
We release all these evaluations: |
|
- for task-level metric results: [https://huggingface.co/datasets/allenai/DataDecide-eval-results](https://huggingface.co/datasets/allenai/DataDecide-eval-results) |
|
- for instance-level results: [https://huggingface.co/datasets/allenai/DataDecide-eval-instances](https://huggingface.co/datasets/allenai/DataDecide-eval-instances) |
|
|
|
|
|
## Hyperparameters |
|
|
|
| Name | Batch Size | Hidden Dim. | LR | Model size | Heads | Layers | Training steps | Tokens trained |
|---|---|---|---|---|---|---|---|---|
| 4M | 32 | 64 | 1.4e-02 | 3.7M | 8 | 8 | 5,725 | 0.4B |
| 6M | 32 | 96 | 1.2e-02 | 6.0M | 8 | 8 | 9,182 | 0.6B |
| 8M | 32 | 128 | 1.1e-02 | 8.5M | 8 | 8 | 13,039 | 0.9B |
| 10M | 32 | 144 | 1.0e-02 | 9.9M | 8 | 8 | 15,117 | 1.0B |
| 14M | 32 | 192 | 9.2e-03 | 14.4M | 8 | 8 | 21,953 | 1.4B |
| 16M | 32 | 208 | 8.9e-03 | 16.0M | 8 | 8 | 24,432 | 1.6B |
| 20M | 64 | 192 | 8.4e-03 | 19.1M | 8 | 16 | 14,584 | 1.9B |
| 60M | 96 | 384 | 5.8e-03 | 57.1M | 12 | 16 | 29,042 | 5.7B |
| 90M | 160 | 528 | 4.9e-03 | 97.9M | 12 | 16 | 29,901 | 9.8B |
| 150M | 192 | 768 | 4.2e-03 | 151.9M | 12 | 12 | 38,157 | 15.0B |
| 300M | 320 | 1,024 | 3.3e-03 | 320.0M | 16 | 16 | 45,787 | 30.0B |
| 530M | 448 | 1,344 | 2.8e-03 | 530.1M | 16 | 16 | 57,786 | 53.0B |
| 750M | 576 | 1,536 | 2.5e-03 | 681.3M | 16 | 16 | 63,589 | 75.0B |
| 1B | 704 | 2,048 | 2.1e-03 | 1176.8M | 16 | 16 | 69,369 | 100.0B |
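The batch size, training steps, and tokens trained columns can be cross-checked against one another. The sketch below assumes each sequence is 2,048 tokens long; that value is not stated in this card and is chosen only because it reproduces the reported totals. The names `SEQ_LEN`, `RUNS`, and `tokens_trained` are illustrative.

```python
# ASSUMPTION: sequence length is not given in this card; 2,048 is a guess
# that happens to reproduce the "Tokens trained" column.
SEQ_LEN = 2048

# (name, batch size, training steps, reported tokens in billions), copied from the table.
RUNS = [
    ("4M", 32, 5_725, 0.4),
    ("6M", 32, 9_182, 0.6),
    ("8M", 32, 13_039, 0.9),
    ("10M", 32, 15_117, 1.0),
    ("14M", 32, 21_953, 1.4),
    ("16M", 32, 24_432, 1.6),
    ("20M", 64, 14_584, 1.9),
    ("60M", 96, 29_042, 5.7),
    ("90M", 160, 29_901, 9.8),
    ("150M", 192, 38_157, 15.0),
    ("300M", 320, 45_787, 30.0),
    ("530M", 448, 57_786, 53.0),
    ("750M", 576, 63_589, 75.0),
    ("1B", 704, 69_369, 100.0),
]


def tokens_trained(batch_size: int, steps: int, seq_len: int = SEQ_LEN) -> int:
    """Total tokens seen: sequences per step x steps x tokens per sequence."""
    return batch_size * steps * seq_len


for name, batch_size, steps, reported in RUNS:
    estimate = tokens_trained(batch_size, steps) / 1e9
    print(f"{name:>5}: {estimate:8.3f}B estimated vs {reported}B reported")
```

Under this assumption, every estimate rounds to the reported value at one decimal place, which suggests the token counts follow from steps and batch size rather than being set independently.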
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Like any base or fine-tuned language model, these models can be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. Additionally, statements from any LLM are often inaccurate, so facts should be verified.
|
|
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```
@article{MagnussonDataDecide2025,
  title={{DataDecide: How to Predict Best Pretraining Data with Small Experiments}},
  author={Ian Magnusson and Nguyen Tai and Ben Bogin and David Heineman and Jena Hwang and Luca Soldaini and Akshita Bhagia and Jiacheng Liu and Dirk Groeneveld and Oyvind Tafjord and Noah A. Smith and Pang Wei Koh and Jesse Dodge},
  year={2025},
  journal={arXiv preprint},
}
```
|
|
|
## Model Card Contact |
|
|
|
For errors in this model card, contact `ianmag@cs.washington.edu`.