Text Classification
fastText
Tristan nielsr HF Staff committed on
Commit c2dad33 · verified · 1 Parent(s): 98a1a50

Improve model card (#1)


- Improve model card (e78dc6645ffdca83fd2034cdfa072fcd31f2f445)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -1,7 +1,11 @@
  ---
  license: mit
+ pipeline_tag: text-classification
+ library_name: fasttext
  ---

- This is the fastText pretraining data filter targeting
- the ARC Easy task, discussed in the main text of the Perplexity
- Correlations paper: https://arxiv.org/abs/2409.05816
+ This fastText model is a pretraining data filter, targeting the ARC Easy task. It's designed to select high-quality pretraining data using perplexity correlations, as described in [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). The model classifies text as either "include" or "exclude" for use in pretraining a language model. It does *not* itself represent a pretrained language model.
+
+ The filter was created using a method that leverages correlations between LLM losses on various texts and downstream benchmark performance. By selecting texts with high correlation, this model aims to improve the efficiency of the data selection process for pretraining LLMs.
+
+ Code: https://github.com/TristanThrush/perplexity-correlations
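
The updated card says the model labels text as "include" or "exclude" but does not show how to call it. The following is a minimal sketch of how such a filter might be applied with the `fasttext` Python package, not the authors' own pipeline. Several names are assumptions rather than facts from the card: the label string `__label__include` (fastText's default label prefix plus the card's wording), the 0.5 threshold, and the helper names `load_filter`, `include_probability`, and `keep_document`. Check `model.labels` on the loaded model before relying on the label string.

```python
"""Sketch: apply a fastText include/exclude filter to candidate documents."""


def load_filter(model_path):
    # Imported lazily so the helpers below work without fastText installed.
    import fasttext  # pip install fasttext

    return fasttext.load_model(model_path)


def include_probability(labels, probs, include_label="__label__include"):
    """Return the probability assigned to the include label (0.0 if absent).

    `include_label` is an assumed label string, not confirmed by the card.
    """
    for label, prob in zip(labels, probs):
        if label == include_label:
            return float(prob)
    return 0.0


def keep_document(model, text, threshold=0.5):
    # fastText predicts one line at a time, so collapse all whitespace
    # (including newlines) into single spaces before calling predict.
    single_line = " ".join(text.split())
    # k=-1 asks fastText for every label with its probability.
    labels, probs = model.predict(single_line, k=-1)
    return include_probability(labels, probs) >= threshold
```

With a locally downloaded model file (the path is whatever you saved it as), `keep_document(load_filter(path), document_text)` would return `True` for documents the filter scores at or above the threshold. The threshold is a tuning knob in this sketch; the card does not state which cutoff the authors used.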