lennart-finke committed
Commit 5a01d4a · verified · 1 Parent(s): a5004f3

Update README.md

Files changed (1):
  1. README.md +42 -59
README.md CHANGED
@@ -11,62 +11,15 @@ tags:
  - efficient-nlp
  - distilled-models
  ---
- # SimpleStories
- SimpleStories is a large synthetic story dataset comprising 2 million stories designed for efficient NLP research. Created to improve upon TinyStories, it offers greater syntactic and semantic diversity through parameterized prompt generation while maintaining simple language. The dataset features stories annotated with high-level concepts like theme, topic, style, and narrative features, making it ideal for training small language models and studying language understanding.
-
- # SimpleStories 35M
- SimpleStories-35M is a 35 million parameter language model trained on the SimpleStories dataset. This model is the largest in the SimpleStories model family,
- offering the best performance across all evaluation metrics. It is part of the family of small language models trained on the [SimpleStories dataset](https://huggingface.co/datasets/lennart-finke/SimpleStories).
- The models range from 1.25M to 35M parameters, offering a spectrum of capabilities while maintaining efficiency. The model training and evaluation code can be found here: https://github.com/danbraunai/simple_stories_train/tree/main/simple_stories_train
-
- ## Model Variants
-
- | Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
- |------------|----------|----------|---------|---------|-------|---------|
- | SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4096 |
- | SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4096 |
- | SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4096 |
- | SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4096 |
- | SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4096 |
-
- ## Performance Comparison
-
- Our models demonstrate strong performance across various evaluation metrics, as shown in the chart below. The trained models are scored using a model-as-a-judge evaluation framework.
-
- <p align="center">
-     <img width="80%" src="figures/simplestories_comparison.png">
- </p>
-
- - **Originality**: Measures the uniqueness and creativity of generated content
- - **Coherence**: Evaluates the logical flow and consistency of generated stories
- - **Grammar**: Assesses grammatical correctness and linguistic quality
- - **Quality**: Holistic evaluation of overall text generation quality
-
- The larger models (35M, 30M) achieve the best performance, particularly in coherence and grammar, while even our smallest 1.25M parameter model produces readable and coherent content. As shown in the visualization, our SimpleStories-35M model achieves scores of 90.8 in Grammar, 85.7 in Coherence, 81.5 in Quality, and 72.5 in Originality.
-
- ## Dataset
-
- The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
-
- - Story annotation with high-level concepts: theme, topic, style, etc.
- - Higher semantic and syntactic diversity through seeded story generation
- - Generated by 2024 models
- - Several NLP metrics pre-computed to aid filtering
- - ASCII-only guarantee for the English dataset
-
- ## Tokenizer
-
- We have trained a custom WordPiece tokenizer with a small vocabulary size of 4096. We conducted morphological analysis and coverage-gain analysis on the dataset to build a small tokenizer without compromising on the quality of generation.
-
- ## Installation
-
- Follow the steps at https://github.com/danbraunai/simple_stories_train to install the simple_stories_train package.
+ # SimpleStories Model Family
+ The SimpleStories models are a tiny model family created for interpretability research, trained on the [SimpleStories dataset](https://huggingface.co/datasets/lennart-finke/SimpleStories).
 
  ## Usage
 
- Here's how to use any model in the SimpleStories family:
+ ```bash
+ pip install simple_stories_train
+ ```
 
  ```python
  from transformers import AutoTokenizer
@@ -84,7 +37,8 @@ model_config = MODEL_CONFIGS[model_size]
  # Load appropriate model
  model_path = f"SimpleStories/SimpleStories-{model_size}"
  model = Llama.from_pretrained(model_path, model_config)
- model.to("cuda")
+ device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
+ model.to(device)
  model.eval()
 
  # Load tokenizer
@@ -94,15 +48,14 @@ tokenizer = AutoTokenizer.from_pretrained(model_path)
  prompt = "The curious cat looked at the"
 
  inputs = tokenizer(prompt, return_tensors="pt")
- input_ids = inputs.input_ids.to("cuda")
-
+ input_ids = inputs.input_ids.to(device)
 
  # Generate text
  with torch.no_grad():
      output_ids = model.generate(
          idx=input_ids,
-         max_new_tokens=800,
-         temperature=0.7,
+         max_new_tokens=50,
+         temperature=0.0,
          top_k=40,
          eos_token_id=tokenizer.eos_token_id
      )
@@ -113,6 +66,36 @@ print(f"Generated text:\n{output_text}")
 
  ```
 
- ## Acknowledgements
+ ## Model Variants
+
+ | Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
+ |------------|----------|----------|---------|---------|-------|---------|
+ | SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4096 |
+ | SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4096 |
+ | SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4096 |
+ | SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4096 |
+ | SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4096 |
+
+ ## Performance Comparison
+ Model-evaluated generation quality metrics:
+ <p align="center">
+     <img width="80%" src="figures/simplestories_comparison.png">
+ </p>
+
+ ## Tokenizer
+
+ We use a custom WordPiece tokenizer with a small vocabulary size of 4096. We conducted morphological analysis and coverage-gain analysis on the dataset to build a small tokenizer without compromising generation quality.
+
+ ## Dataset
+
+ The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
+
+ - Story annotation with high-level concepts: theme, topic, style, etc.
+ - Higher semantic and syntactic diversity through seeded story generation
+ - Generated by 2024 models
+ - Several NLP metrics pre-computed to aid filtering
+ - ASCII-only guarantee for the English dataset
 
- These models build upon the work done in the TinyStories project by Eldan and Li, with the SimpleStories dataset created by Lennart Finke and the training code created by Dan Braun.
+ Read the dataset paper on [arXiv](https://arxiv.org/abs/2504.09184).
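For convenience, here is the updated Usage section assembled into one runnable script. This is a sketch rather than canonical project code: the diff does not show the unchanged README lines between hunks, so the `simple_stories_train` import paths, the `model_size` value, and the final decode step are assumptions filled in around the visible context.

```python
# Sketch of the new README's usage flow. Assumed (not shown in the diff):
# the import paths for Llama and MODEL_CONFIGS, the model_size key, and
# the decode step before the final print.
import torch
from transformers import AutoTokenizer
from simple_stories_train.models.llama import Llama  # assumed path
from simple_stories_train.models.model_configs import MODEL_CONFIGS  # assumed path

model_size = "35M"  # assumed key, matching the SimpleStories-{model_size} repo names
model_config = MODEL_CONFIGS[model_size]

# Load appropriate model
model_path = f"SimpleStories/SimpleStories-{model_size}"
model = Llama.from_pretrained(model_path, model_config)
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "The curious cat looked at the"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids.to(device)

# Generate text with the settings the commit switches to (temperature=0.0)
with torch.no_grad():
    output_ids = model.generate(
        idx=input_ids,
        max_new_tokens=50,
        temperature=0.0,
        top_k=40,
        eos_token_id=tokenizer.eos_token_id,
    )

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)  # assumed decode step
print(f"Generated text:\n{output_text}")
```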
 
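The new Tokenizer section claims a 4096-entry WordPiece vocabulary. A quick check against a published checkpoint, assuming (as the usage snippet implies) that the tokenizer files are bundled in each model repo:

```python
# Minimal sketch: inspect the custom WordPiece tokenizer shipped with a
# SimpleStories checkpoint. Assumes the tokenizer is hosted alongside the
# model weights, as the README's usage snippet implies.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SimpleStories/SimpleStories-35M")
print(tokenizer.vocab_size)  # expected: 4096, per the README
print(tokenizer.tokenize("The curious cat looked at the moon."))
```

Any of the listed model sizes should serve equally here, since every variant in the table shares d_vocab = 4096.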
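The dataset itself can be pulled from the Hub for inspection. A minimal sketch with the `datasets` library; the repo id comes from the link in the README, while the split name and exact column layout are assumptions to verify via `ds.features`:

```python
# Minimal sketch: load the SimpleStories dataset for inspection.
# The repo id is taken from the README link; the "train" split and the
# annotation column names (theme, topic, style, ...) are assumptions.
from datasets import load_dataset

ds = load_dataset("lennart-finke/SimpleStories", split="train")
print(ds.features)  # story text plus high-level annotation columns
print(ds[0])        # one annotated story record
```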