---
license: other
---

# NemoCurator FineWeb Mixtral Edu Classifier

## Model Overview
This is a text classification model designed to determine the educational value of a piece of text (scored 0-5, from low to high). It is similar to the [FineWeb-Edu classifier](https://arxiv.org/abs/2406.17557) and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct; in contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct. These classifiers were used as part of a classifier ensemble in the creation of the [Nemotron-CC](https://arxiv.org/abs/2412.02595) dataset. The models were fine-tuned starting from the [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) model.

## License
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

## Model Architecture
The model architecture is [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m).

## How To Use in NeMo Curator
NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.

The inference code for this model is available through the NeMo Curator GitHub repository. Check out this [example notebook](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) to get started.

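As a rough illustration of that pipeline, below is a minimal sketch of classifying a JSONL dataset with NeMo Curator's distributed data classification utilities. The `FineWebMixtralEduClassifier` class name, the `input_data/` and `output_data/` paths, and the `cluster_type="gpu"` client setup are assumptions for illustration; refer to the example notebook above for the exact, up-to-date interface.

```python
# Minimal sketch (not an official example) of distributed classification with NeMo Curator.
# Assumes a GPU-enabled install (Dask + RAPIDS) and that FineWebMixtralEduClassifier is
# exposed in nemo_curator.classifiers; see the tutorial notebook for the authoritative API.
from nemo_curator.classifiers import FineWebMixtralEduClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

if __name__ == "__main__":
    # Start a local Dask GPU cluster for distributed inference.
    client = get_client(cluster_type="gpu")

    # Each JSONL record is expected to contain a "text" field.
    dataset = DocumentDataset.read_json("input_data/", backend="cudf")

    # Score every document and append the classifier's score/label columns.
    classifier = FineWebMixtralEduClassifier()
    result = classifier(dataset)

    result.to_json("output_data/")
    client.close()
```
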
## How To Use in Transformers
To use the FineWeb Mixtral Edu Classifier, please follow this example code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

texts = ["To make lemonade, you will need lemon juice, water, and sugar."]

# Load the classifier and tokenizer from the Hub.
model = AutoModelForSequenceClassification.from_pretrained(
    "nvidia/nemocurator-fineweb-mixtral-edu-classifier",
    torch_dtype=torch.bfloat16,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/nemocurator-fineweb-mixtral-edu-classifier"
)

# Tokenize with the model's 512-token limit.
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="longest",
    truncation=True,
    max_length=512,
).to(device)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits.squeeze(-1).float().cpu().numpy()

# One continuous educational-value score per input document.
float_score = logits.tolist()
# Clip to the 0-5 annotation range and round to an integer score.
int_score = [int(round(max(0, min(score, 5)))) for score in logits]
# A 2.5 cutoff maps the continuous score to a binary quality label.
pred_labels = ["high_quality" if score >= 2.5 else "low_quality" for score in logits]

print("Score:", float_score)
print("Rounded score:", int_score)
print("Predicted label:", pred_labels)
# Score: [1.09375]
# Rounded score: [1]
# Predicted label: ['low_quality']
```

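Note that the tokenizer truncates each document to the model's 512-token limit, so longer documents are scored on their first 512 tokens only. The 2.5 cutoff in this example simply maps the continuous 0-5 score to a binary quality label.
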
## Input & Output
### Input
- Input Type: Text
- Input Format: String
- Input Parameters: 1D
- Other Properties Related to Input: Token limit of 512 tokens

### Output
- Output Type: Classification Score
- Output Format: Float
- Output Parameters: 1D
- Other Properties Related to Output: None

## Software Integration

### Training, Testing, and Evaluation Dataset
The model was trained on the text of this dataset: [https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) (a 467k-document subset of the FineWeb dataset), with annotations coming from Mixtral 8x22B-Instruct.

### Evaluation
The models were shown to be useful in classifying high-quality content for LLM pretraining as part of an ensemble in the [Nemotron-CC](https://arxiv.org/abs/2412.02595) paper; see Table 9 of that paper.

## Inference
- Engine: PyTorch

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## References
- [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
- [Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset](https://arxiv.org/abs/2412.02595)
- [Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models](https://arxiv.org/abs/2405.05374)