Safetensors · bert
sarahyurick committed (verified) · Commit 24d5903 · 1 Parent(s): 26ddae3

Update README.md

Files changed (1): README.md (+34 -10)
README.md CHANGED
@@ -5,13 +5,19 @@ license: other
 # NemoCurator FineWeb Mixtral Edu Classifier
 
 ## Model Overview
-This is a text classification model designed to determine the educational value (score 0-5 from low to high). It is similar to the [FineWeb-Edu classifier](https://arxiv.org/abs/2406.17557) and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct. These classifiers were used as part of a classifier ensemble in the creation of the [Nemotron-CC](https://arxiv.org/abs/2412.02595) dataset. The models were finetuned starting from the [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) model.
+This is a text classification model designed to determine the educational value of a piece of text (score 0-5 from low to high). It is similar to the [FineWeb-Edu classifier](https://arxiv.org/abs/2406.17557) and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct. The NeMo Curator FineWeb Mixtral Edu classifier was used as part of a classifier ensemble in the creation of the [Nemotron-CC](https://arxiv.org/abs/2412.02595) dataset. The models were finetuned starting from the [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) model.
 
 ## License
 This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
 
+## References
+- [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
+- [Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset](https://arxiv.org/abs/2412.02595)
+- [Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models](https://arxiv.org/abs/2405.05374)
+
 ## Model Architecture
-The model architecture is [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m).
+- Architecture type: Transformer (BERT)
+- Network architecture: [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m)
 
 ## How To Use in NeMo Curator
 NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.
@@ -74,25 +80,43 @@ print("Predicted label:", pred_labels)
 - Output Type: Classification Score
 - Output Format: Float
 - Output Parameters: 1D
-- Other Properties Related to Output: None
+- Other Properties Related to Output: The output range is 0-5, representing low to high educational value.
 
 ## Software Integration
+**Runtime Engine(s):**
+* Python 3.10 and NeMo Curator <br>
+
+**Supported Hardware Microarchitecture Compatibility:** <br>
+* NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 (or above) <br>
+
+**Operating System(s):** <br>
+* Ubuntu 22.04/20.04 <br>
+
+## Model Version(s): <br>
+* 1.0 <br>
 
 ### Training, Testing, and Evaluation Dataset
 The model was trained on the text of this dataset: [https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations) (a 467k document subset of the FineWeb dataset), with annotations coming from Mixtral 8x22B-Instruct.
 
-### Evaluation
+#### Training Dataset:
+**Link:** https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations
+
+**Data Collection Method by dataset** <br>
+* Automated <br>
+
+**Labeling Method by dataset** <br>
+* Synthetic <br>
+
+**Properties:** The model was trained on the text of the fineweb-edu-llama3-annotations dataset, but with annotations coming from Mixtral 8x22B-Instruct instead of the provided annotations from Llama 3.1 70B. The dataset is a randomly sampled 467k document subset of the FineWeb dataset, which contains filtered documents crawled from the web. Please see https://arxiv.org/abs/2406.17557 for more details. <br>
+
+### Evaluation Results
 The models were shown to be useful in classifying high-quality content for LLM pretraining as part of an ensemble in the [Nemotron-CC](https://arxiv.org/abs/2412.02595) paper. See Table 9.
 
 ## Inference
-- Engine: PyTorch
+- Engine: Python 3.10 and PyTorch
+- Test Hardware: NVIDIA H100
 
 ## Ethical Considerations
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
-
-## References
-- [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
-- [Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset](https://arxiv.org/abs/2412.02595)
-- [Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models](https://arxiv.org/abs/2405.05374)
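The "How To Use in NeMo Curator" snippet that sits unchanged between these two hunks ends with `print("Predicted label:", pred_labels)`, as the second hunk header shows. For orientation, here is a minimal sketch of driving this classifier through NeMo Curator's distributed data-classifier interface; the `FineWebMixtralEduClassifier` class name, the file paths, and the added-column behavior are assumptions, not text from this commit:

```python
# Minimal sketch, assuming a NeMo Curator build that ships a
# FineWebMixtralEduClassifier (class name assumed, not taken from this diff).
from nemo_curator.classifiers import FineWebMixtralEduClassifier
from nemo_curator.datasets import DocumentDataset

# Load JSONL documents into a GPU-backed (cuDF) dataset; "input_data/" is a placeholder.
input_dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# Score every document. The classifier appends the float educational-value score
# (0-5, per the output spec in the diff above) as a new column on the dataset.
classifier = FineWebMixtralEduClassifier()
result_dataset = classifier(dataset=input_dataset)

# Write the scored documents back out as JSONL; "output_data/" is a placeholder.
result_dataset.to_json("output_data/")
```

This mirrors the call pattern NeMo Curator documents for its other distributed classifiers, and the cuDF backend is consistent with the Volta-or-higher, CUDA 12 requirements listed under Software Integration.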
 
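The diff also pins down the output contract: a 1D float classification score in the 0-5 range, with Python 3.10 and PyTorch as the inference engine. Below is a generic, hedged sketch of scoring a single document with Hugging Face transformers and discretizing the score; the repo id and the single-logit sequence-classification head are assumptions, and the README's own snippet (whose last line appears in the hunk header) may differ:

```python
# Hedged sketch, assuming the checkpoint loads as a standard single-logit
# sequence-classification (regression) head; the repo id below is assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "nvidia/nemocurator-fineweb-mixtral-edu-classifier"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The Pythagorean theorem relates the sides of a right triangle."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # A single float score, roughly in the 0-5 range described above.
    score = model(**inputs).logits.squeeze().item()

# Round and clamp the continuous score onto the 0-5 integer scale.
pred_label = min(5, max(0, round(score)))
print("Predicted score:", score)
print("Predicted label:", pred_label)
```

Rounding and clamping is how FineWeb-Edu-style classifiers are commonly discretized; the exact discretization and thresholds used in the Nemotron-CC ensemble are not specified in this diff.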