salzubi401 committed · Commit 4fe1176 · verified · 1 Parent(s): 7e9395e

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -31,7 +31,7 @@ model-index:
  <!-- markdownlint-disable no-duplicate-header -->
 
  <div align="center">
- <img src="../assets/sentient-logo-narrow.png" alt="alt text" width="60%"/>
+ <img src="assets/sentient-logo-narrow.png" alt="alt text" width="60%"/>
  </div>
 
  <hr>
@@ -142,7 +142,7 @@ This means that our community owns the fingerprints that they can use to verify
  **Dobby-Mini-Leashed-Llama-3.1-8B** and **Dobby-Mini-Unhinged-Llama-3.1-8B** retain the base performance of Llama-3.1-8B-Instruct across the evaluated tasks.
 
  <div align="center">
- <img src="../assets/hf_evals.png" alt="alt text" width="100%"/>
+ <img src="assets/hf_evals.png" alt="alt text" width="100%"/>
  </div>
 
  ### Freedom Bench
@@ -150,11 +150,11 @@ This means that our community owns the fingerprints that they can use to verify
  We curate a difficult internal test focusing on loyalty to freedom-based stances through rejection sampling (generate one sample, if it is rejected, generate another, continue until accepted). **Dobby significantly outperforms base Llama** on holding firm to these values, even with adversarial or conflicting prompts
 
  <div align="center">
- <img src="../assets/freedom_privacy.png" alt="alt text" width="100%"/>
+ <img src="assets/freedom_privacy.png" alt="alt text" width="100%"/>
  </div>
 
  <div align="center">
- <img src="../assets/freedom_speech.png" alt="alt text" width="100%"/>
+ <img src="assets/freedom_speech.png" alt="alt text" width="100%"/>
  </div>
 
  ### Sorry-Bench
@@ -162,7 +162,7 @@ We curate a difficult internal test focusing on loyalty to freedom-based stances
  We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to assess the models’ behavior in handling contentious or potentially harmful prompts. Sorry-bench provides a rich suite of scenario-based tests that measure how readily a model may produce unsafe or problematic content. While some guardrails break (e.g., profanity and financial advice), the models remain robust to dangerous & criminal questions.
 
  <div align="center">
- <img src="../assets/sorry_bench.png" alt="alt text" width="100%"/>
+ <img src="assets/sorry_bench.png" alt="alt text" width="100%"/>
  </div>
 
  ### Ablation Study
@@ -170,7 +170,7 @@ We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to
  Below we show our ablation study, where we omit subsets of our fine-tuning data set and evaluate the results on the **Freedom Bench** described earlier.
 
  <div align="center">
- <img src="../assets/ablation.jpg" alt="alt text" width="100%"/>
+ <img src="assets/ablation.jpg" alt="alt text" width="100%"/>
  </div>
 
  ---
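
Note on the Freedom Bench context line: the README describes rejection sampling as "generate one sample, if it is rejected, generate another, continue until accepted". The loop below is a minimal sketch of that procedure, not the authors' evaluation code. The names `rejection_sample`, `generate`, and `accept` are hypothetical, and the `max_attempts` cap is an added assumption so the loop always terminates (the README does not mention one).

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def rejection_sample(
    generate: Callable[[], T],
    accept: Callable[[T], bool],
    max_attempts: int = 100,  # assumption: a safety cap not described in the README
) -> Optional[T]:
    """Draw candidates from `generate` until `accept` approves one.

    Returns the first accepted candidate, or None if `max_attempts`
    candidates were all rejected.
    """
    for _ in range(max_attempts):
        candidate = generate()   # hypothetical stand-in for a model call
        if accept(candidate):    # hypothetical stand-in for the acceptance check
            return candidate
    return None


if __name__ == "__main__":
    # Toy usage: candidates come from a fixed list; even numbers are accepted.
    candidates = iter([1, 3, 4, 6])
    print(rejection_sample(lambda: next(candidates), lambda x: x % 2 == 0))
```

In a real evaluation, `generate` would wrap a call to the model and `accept` would wrap whatever judge or filter decides whether a response is kept. Both are left abstract here because the README does not specify them.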