bbunzeck committed on
Commit 0f24982 · verified · 1 Parent(s): cb9c915

Create README.md

Files changed (1)
  1. README.md +31 -0
README.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ language:
+ - en
+ ---
+
+ **lexdec-medium-char** is a small autoregressive Llama model with character-level tokenization, trained on the 2024/2025 [BabyLM dataset](https://osf.io/ryjfm/). The *checkpoints* branch contains 19 checkpoints: 10 across the first 10% of pretraining and 9 more across the remaining 90%.
+
+ We used this model to trace the development of linguistic knowledge (word-level, syntax) across pretraining and to compare it to both larger character-level models and comparable subword models:
+
+ | | [small-char](https://huggingface.co/bbunzeck/lexdec-small-char) | [medium-char](https://huggingface.co/bbunzeck/lexdec-medium-char) | [large-char](https://huggingface.co/bbunzeck/lexdec-large-char) | [small-bpe](https://huggingface.co/bbunzeck/lexdec-small-bpe) | [medium-bpe](https://huggingface.co/bbunzeck/lexdec-medium-bpe) | [large-bpe](https://huggingface.co/bbunzeck/lexdec-large-bpe) |
+ |---|---:|---:|---:|---:|---:|---:|
+ | Embedding size | 128 | 256 | 512 | 128 | 256 | 512 |
+ | Hidden size | 128 | 256 | 512 | 128 | 256 | 512 |
+ | Layers | 4 | 8 | 12 | 4 | 8 | 12 |
+ | Attention heads | 4 | 8 | 12 | 4 | 8 | 12 |
+ | Context size | 128 | 128 | 128 | 128 | 128 | 128 |
+ | Vocab. size | 102 | 102 | 102 | 8,002 | 8,002 | 8,002 |
+ | Parameters | 486,016 | 3,726,592 | 21,940,736 | 2,508,416 | 7,771,392 | 30,030,336 |
+
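The parameter counts in the table can be reproduced from the other rows. The following is a minimal sketch, not taken from the model code: the function name `llama_param_count` is hypothetical, and it assumes a bias-free Llama-style decoder with untied input and output embeddings, an MLP intermediate size equal to the hidden size, and a per-head dimension of `hidden // heads` rounded down. None of these assumptions is stated in the README; they are inferred only because they match the reported numbers.

```python
def llama_param_count(hidden, layers, heads, vocab):
    """Count parameters of a bias-free Llama-style decoder.

    Assumptions (inferred, not stated in the README):
    - MLP intermediate size equals the hidden size
    - input embeddings and output head are separate (untied) matrices
    - per-head dimension is hidden // heads, rounded down
    """
    head_dim = hidden // heads
    attn_width = head_dim * heads        # 504 rather than 512 for 512 // 12
    attention = 4 * hidden * attn_width  # q, k, v and o projections
    mlp = 3 * hidden * hidden            # gate, up and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    embeddings = 2 * vocab * hidden      # token embeddings + output head
    final_norm = hidden
    return embeddings + layers * (attention + mlp + norms) + final_norm


# (hidden size, layers, attention heads, vocab size, reported parameters),
# copied from the table above.
TABLE = {
    "small-char": (128, 4, 4, 102, 486_016),
    "medium-char": (256, 8, 8, 102, 3_726_592),
    "large-char": (512, 12, 12, 102, 21_940_736),
    "small-bpe": (128, 4, 4, 8_002, 2_508_416),
    "medium-bpe": (256, 8, 8, 8_002, 7_771_392),
    "large-bpe": (512, 12, 12, 8_002, 30_030_336),
}

for name, (hidden, layers, heads, vocab, reported) in TABLE.items():
    computed = llama_param_count(hidden, layers, heads, vocab)
    print(f"{name:12s} computed={computed:>10,} reported={reported:>10,}")
```

Under these assumptions, the character-level and BPE models of the same size differ only in their embedding matrices, which is why the vocabulary size accounts for most of the parameters in the small BPE model.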
+ If you use this model, please cite the following preprint (the final version will be added as soon as it is published):
+
+ ```
+ @misc{bunzeck2025subwordmodelsstruggleword,
+ title={Subword models struggle with word learning, but surprisal hides it},
+ author={Bastian Bunzeck and Sina Zarrieß},
+ year={2025},
+ eprint={2502.12835},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2502.12835},}
+ ```