nielsr (HF Staff) committed
Commit be69d6b · verified · 1 Parent(s): 77e743c

Improve model card: Add pipeline tag, library name, and paper/project/code links


This PR enhances the model card for the Deep Ignorance model suite by:

* Adding the `pipeline_tag: text-generation` to the metadata, which improves discoverability on the Hugging Face Hub (e.g., via https://huggingface.co/models?pipeline_tag=text-generation).
* Specifying `library_name: transformers` in the metadata, indicating its compatibility with the Hugging Face Transformers library and enabling the "Use in Transformers" widget on the model page (a minimal loading sketch is included below).
* Adding explicit links to the official paper (on Hugging Face Papers), the project's dedicated webpage, and the GitHub repository. These were either missing or incorrectly linked in the original content, and their addition provides a more comprehensive and accessible overview for users.
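
To make the new metadata concrete, here is a minimal loading sketch in Python. It is not taken from the model card's own Quickstart section; the repository id `EleutherAI/deep-ignorance-unfiltered` is assumed purely for illustration (substitute the id of the model this card belongs to), and only standard Transformers APIs are used.

```python
# Minimal sketch: confirms the card's models load via the Transformers library,
# as implied by `library_name: transformers` and `pipeline_tag: text-generation`.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id, for illustration only.
model_id = "EleutherAI/deep-ignorance-unfiltered"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Simple greedy generation as a smoke test for the text-generation task.
inputs = tokenizer("Deep Ignorance is a suite of", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same model can also be driven through `transformers.pipeline("text-generation", model=model_id)`, which is the entry point the `text-generation` pipeline tag corresponds to. If the intermediate checkpoints mentioned in the card are published as repository branches, they could be selected with the `revision` argument of `from_pretrained`; the branch naming scheme is not specified in this PR.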

Please review and merge this PR.

Files changed (1):
1. README.md (+30 -27)
README.md CHANGED
@@ -1,6 +1,14 @@
  ---
+ base_model:
+ - EleutherAI/deep-ignorance-pretraining-stage-unfiltered
+ datasets:
+ - EleutherAI/deep-ignorance-pretraining-mix
+ - EleutherAI/deep-ignorance-annealing-mix
  language:
  - en
+ license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  tags:
  - pytorch
  - causal-lm
@@ -25,19 +33,17 @@ tags:
  - safety-research
  - model-diffing
  - training-dynamics
- license: apache-2.0
- datasets:
- - EleutherAI/deep-ignorance-pretraining-mix
- - EleutherAI/deep-ignorance-annealing-mix
- base_model:
- - EleutherAI/deep-ignorance-pretraining-stage-unfiltered
  ---

  # Deep Ignorance Model Suite

  We explore an intuitive yet understudied question: Can we prevent LLMs from learning unsafe technical capabilities (such as CBRN) by filtering out enough of the relevant pretraining data before we begin training a model? Research into this question resulted in the **Deep Ignorance Suite**. In our experimental setup, we find that filtering pretraining data prevents undesirable knowledge, doesn't sacrifice general performance, and results in models that are resistant to tampering.

- Deep Ignorance is a collection of 6.9B models developed to facilitate research into pretraining, interpretability, training data, and unlearning [(see paper)](https://deepignorance.ai). It contains 18 models, comprising a baseline model trained on unfiltered data and 17 models trained on filtered datasets or with other safety interventions applied. Pretraining stage models have 101 checkpoints and annealing stage models have 11.
+ Paper: [Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs](https://huggingface.co/papers/2508.06601)
+ Project page: [https://deepignorance.ai/](https://deepignorance.ai/)
+ Code: [https://github.com/EleutherAI/deep-ignorance](https://github.com/EleutherAI/deep-ignorance)
+
+ Deep Ignorance is a collection of 6.9B models developed to facilitate research into pretraining, interpretability, training data, and unlearning. It contains 18 models, comprising a baseline model trained on unfiltered data and 17 models trained on filtered datasets or with other safety interventions applied. Pretraining stage models have 101 checkpoints and annealing stage models have 11.

  > **Support:**
  > The #release-discussion channel in the [EleutherAI Discord](https://discord.gg/eleutherai) is the best place to ask questions. Questions asked in other channels are less likely to be answered. The community section on HuggingFace is less actively monitored. Tag Kyle O'Brien in the EleutherAI Discord for faster response times.
@@ -51,9 +57,6 @@ Our research and model suite open up multiple avenues for future work. For insta

  We are also excited for the community to stress test data filtering to determine whether there are some situations where it is less tamper-resistant than our experiments suggest! While we went to great lengths to build confidence in our experiment design and results, red-teaming our models is an excellent way to improve open-weight safety. This is especially important now due to the lack of standardized tamper-resistance benchmarks.

-
-
-
  ## Uses and Limitations

  ### Quickstart
@@ -157,23 +160,23 @@ To ensure our filtering approach preserves beneficial knowledge, we evaluate on
  - **LAMBADA**: Text comprehension requiring full-context understanding
  - **HellaSwag**: Commonsense natural language inference

- | Model | Pretraining Filtering | Annealing Filtering | WMDP Bio Average (Robust MCQA, Verified Cloze) (↓) | Average (MMLU, PIQA, Lambada, HellaSwag) (↑) | WMDP Bio Robust MCQA (↓) | WMDP Bio Verified Cloze (↓) | MMLU (↑) | PIQA (↑) | Lambada (↑) | HellaSwag (↑) |
- |:---------------------------------------------------------------------|:------------------------|:----------------------|:-----------------------------------------------------|:-----------------------------------------------|:---------------------------|:------------------------------|:---------------|:---------------|:---------------|:----------------|
- | deep-ignorance-unfiltered | - | - | 39.66% | 56.05% | 42.97% | 36.34% | 44.92% | 76.44% | 47.08% | 55.75% |
- | deep-ignorance-pretraining-stage-unfiltered | - | - | 37.16% (-2.50) | 60.24% (4.19) | 38.25% (-4.72) | 36.06% (-0.28) | 42.80% (-2.12) | 79.05% (2.61) | 63.03% (15.95) | 56.06% (0.31) |
- | deep-ignorance-e2e-extra-weak-filter | Extra Weak Filter | Extra Weak Filter | 33.70% (-5.96) | 55.83% (-0.22) | 38.02% (-4.95) | 29.37% (-6.97) | 44.13% (-0.79) | 77.04% (0.60) | 46.85% (-0.23) | 55.29% (-0.46) |
- | deep-ignorance-weak-filter-pt-strong-filter-anneal | Weak Filter | Strong Filter | 30.97% (-8.69) | 56.22% (0.17) | 36.75% (-6.22) | 25.19% (-11.15) | 43.16% (-1.76) | 77.20% (0.76) | 48.86% (1.78) | 55.67% (-0.08) |
- | deep-ignorance-e2e-weak-filter | Weak Filter | Weak Filter | 30.50% (-9.16) | 57.37% (1.32) | 35.25% (-7.72) | 25.74% (-10.60) | 43.91% (-1.01) | 78.35% (1.91) | 51.81% (4.73) | 55.41% (-0.34) |
- | deep-ignorance-strong-filter-pt-weak-filter-anneal | Strong Filter | Weak Filter | 30.38% (-9.28) | 57.88% (1.83) | 33.99% (-8.98) | 26.77% (-9.57) | 44.82% (-0.10) | 76.88% (0.44) | 54.05% (6.97) | 55.78% (0.03) |
- | deep-ignorance-e2e-strong-filter | Strong Filter | Strong Filter | 29.90% (-9.76) | 55.53% (-0.52) | 35.37% (-7.60) | 24.44% (-11.90) | 43.21% (-1.71) | 75.73% (-0.71) | 47.29% (0.21) | 55.90% (0.15) |
- | deep-ignorance-pretraining-stage-strong-filter | Strong Filter | - | 29.47% (-10.19) | 60.02% (3.97) | 33.29% (-9.68) | 25.65% (-10.69) | 43.46% (-1.46) | 79.27% (2.83) | 60.82% (13.74) | 56.53% (0.78) |
- | deep-ignorance-unfiltered-cb | - | - | 29.29% (-10.37) | 54.11% (-1.94) | 29.49% (-13.48) | 29.09% (-7.25) | 43.61% (-1.31) | 76.50% (0.06) | 45.84% (-1.24) | 50.50% (-5.25) |
- | deep-ignorance-pretraining-stage-weak-filter | Weak Filter | - | 29.12% (-10.54) | 58.98% (2.93) | 33.53% (-9.44) | 24.72% (-11.62) | 41.04% (-3.88) | 78.78% (2.34) | 60.57% (13.49) | 55.53% (-0.22) |
- | deep-ignorance-strong-filter-pt-weak-filter-anneal-cb-lat | Strong Filter | Weak Filter | 26.92% (-12.74) | 58.00% (1.95) | 29.95% (-13.02) | 23.88% (-12.46) | 43.52% (-1.40) | 76.61% (0.17) | 56.01% (8.93) | 55.84% (0.09) |
- | deep-ignorance-strong-filter-pt-weak-filter-anneal-cb | Strong Filter | Weak Filter | 26.12% (-13.54) | 56.46% (0.41) | 25.46% (-17.51) | 26.77% (-9.57) | 41.45% (-3.47) | 76.33% (-0.11) | 53.64% (6.56) | 54.40% (-1.35) |
- | deep-ignorance-unfiltered-cb-lat | - | - | 25.93% (-13.73) | 56.43% (0.38) | 27.42% (-15.55) | 24.44% (-11.90) | 42.73% (-2.19) | 76.22% (-0.22) | 51.85% (4.77) | 54.92% (-0.83) |
- | deep-ignorance-e2e-strong-filter-cb-lat | Strong Filter | Strong Filter | 25.87% (-13.79) | 56.60% (0.55) | 27.76% (-15.21) | 23.98% (-12.36) | 42.08% (-2.84) | 75.41% (-1.03) | 52.75% (5.67) | 56.18% (0.43) |
- | deep-ignorance-e2e-strong-filter-cb | Strong Filter | Strong Filter | 25.56% (-14.10) | 52.60% (-3.45) | 25.00% (-17.97) | 26.12% (-10.22) | 39.45% (-5.47) | 75.35% (-1.09) | 47.56% (0.48) | 48.03% (-7.72) |
+ | Model | Pretraining Filtering | Annealing Filtering | WMDP Bio Average (Robust MCQA, Verified Cloze) (↓) | Average (MMLU, PIQA, Lambada, HellaSwag) (↑) | WMDP Bio Robust MCQA (↓) | WMDP Bio Verified Cloze (↓) | MMLU (↑) | PIQA (↑) | Lambada (↑) | HellaSwag (↑) |
+ |:------|:------------------------|:----------------------|:-----------------------------------------------------|:-----------------------------------------------|:---------------------------|:------------------------------|:---------------|:---------------|:---------------|:----------------|
+ | deep-ignorance-unfiltered | - | - | 39.66% | 56.05% | 42.97% | 36.34% | 44.92% | 76.44% | 47.08% | 55.75% |
+ | deep-ignorance-pretraining-stage-unfiltered | - | - | 37.16% (-2.50) | 60.24% (4.19) | 38.25% (-4.72) | 36.06% (-0.28) | 42.80% (-2.12) | 79.05% (2.61) | 63.03% (15.95) | 56.06% (0.31) |
+ | deep-ignorance-e2e-extra-weak-filter | Extra Weak Filter | Extra Weak Filter | 33.70% (-5.96) | 55.83% (-0.22) | 38.02% (-4.95) | 29.37% (-6.97) | 44.13% (-0.79) | 77.04% (0.60) | 46.85% (-0.23) | 55.29% (-0.46) |
+ | deep-ignorance-weak-filter-pt-strong-filter-anneal | Weak Filter | Strong Filter | 30.97% (-8.69) | 56.22% (0.17) | 36.75% (-6.22) | 25.19% (-11.15) | 43.16% (-1.76) | 77.20% (0.76) | 48.86% (1.78) | 55.67% (-0.08) |
+ | deep-ignorance-e2e-weak-filter | Weak Filter | Weak Filter | 30.50% (-9.16) | 57.37% (1.32) | 35.25% (-7.72) | 25.74% (-10.60) | 43.91% (-1.01) | 78.35% (1.91) | 51.81% (4.73) | 55.41% (-0.34) |
+ | deep-ignorance-strong-filter-pt-weak-filter-anneal | Strong Filter | Weak Filter | 30.38% (-9.28) | 57.88% (1.83) | 33.99% (-8.98) | 26.77% (-9.57) | 44.82% (-0.10) | 76.88% (0.44) | 54.05% (6.97) | 55.78% (0.03) |
+ | deep-ignorance-e2e-strong-filter | Strong Filter | Strong Filter | 29.90% (-9.76) | 55.53% (-0.52) | 35.37% (-7.60) | 24.44% (-11.90) | 43.21% (-1.71) | 75.73% (-0.71) | 47.29% (0.21) | 55.90% (0.15) |
+ | deep-ignorance-pretraining-stage-strong-filter | Strong Filter | - | 29.47% (-10.19) | 60.02% (3.97) | 33.29% (-9.68) | 25.65% (-10.69) | 43.46% (-1.46) | 79.27% (2.83) | 60.82% (13.74) | 56.53% (0.78) |
+ | deep-ignorance-unfiltered-cb | - | - | 29.29% (-10.37) | 54.11% (-1.94) | 29.49% (-13.48) | 29.09% (-7.25) | 43.61% (-1.31) | 76.50% (0.06) | 45.84% (-1.24) | 50.50% (-5.25) |
+ | deep-ignorance-pretraining-stage-weak-filter | Weak Filter | - | 29.12% (-10.54) | 58.98% (2.93) | 33.53% (-9.44) | 24.72% (-11.62) | 41.04% (-3.88) | 78.78% (2.34) | 60.57% (13.49) | 55.53% (-0.22) |
+ | deep-ignorance-strong-filter-pt-weak-filter-anneal-cb-lat | Strong Filter | Weak Filter | 26.92% (-12.74) | 58.00% (1.95) | 29.95% (-13.02) | 23.88% (-12.46) | 43.52% (-1.40) | 76.61% (0.17) | 56.01% (8.93) | 55.84% (0.09) |
+ | deep-ignorance-strong-filter-pt-weak-filter-anneal-cb | Strong Filter | Weak Filter | 26.12% (-13.54) | 56.46% (0.41) | 25.46% (-17.51) | 26.77% (-9.57) | 41.45% (-3.47) | 76.33% (-0.11) | 53.64% (6.56) | 54.40% (-1.35) |
+ | deep-ignorance-unfiltered-cb-lat | - | - | 25.93% (-13.73) | 56.43% (0.38) | 27.42% (-15.55) | 24.44% (-11.90) | 42.73% (-2.19) | 76.22% (-0.22) | 51.85% (4.77) | 54.92% (-0.83) |
+ | deep-ignorance-e2e-strong-filter-cb-lat | Strong Filter | Strong Filter | 25.87% (-13.79) | 56.60% (0.55) | 27.76% (-15.21) | 23.98% (-12.36) | 42.08% (-2.84) | 75.41% (-1.03) | 52.75% (5.67) | 56.18% (0.43) |
+ | deep-ignorance-e2e-strong-filter-cb | Strong Filter | Strong Filter | 25.56% (-14.10) | 52.60% (-3.45) | 25.00% (-17.97) | 26.12% (-10.22) | 39.45% (-5.47) | 75.35% (-1.09) | 47.56% (0.48) | 48.03% (-7.72) |

  # Acknowledgments
