Update README.md
README.md
CHANGED
@@ -31,7 +31,7 @@ model-index:
 <!-- markdownlint-disable no-duplicate-header -->
 
 <div align="center">
-<img src="
+<img src="assets/sentient-logo-narrow.png" alt="alt text" width="60%"/>
 </div>
 
 <hr>
@@ -142,7 +142,7 @@ This means that our community owns the fingerprints that they can use to verify
 **Dobby-Mini-Leashed-Llama-3.1-8B** and **Dobby-Mini-Unhinged-Llama-3.1-8B** retain the base performance of Llama-3.1-8B-Instruct across the evaluated tasks.
 
 <div align="center">
-<img src="
+<img src="assets/hf_evals.png" alt="alt text" width="100%"/>
 </div>
 
 ### Freedom Bench
@@ -150,11 +150,11 @@ This means that our community owns the fingerprints that they can use to verify
 We curate a difficult internal test focusing on loyalty to freedom-based stances through rejection sampling (generate one sample, if it is rejected, generate another, continue until accepted). **Dobby significantly outperforms base Llama** on holding firm to these values, even with adversarial or conflicting prompts
 
 <div align="center">
-<img src="
+<img src="assets/freedom_privacy.png" alt="alt text" width="100%"/>
 </div>
 
 <div align="center">
-<img src="
+<img src="assets/freedom_speech.png" alt="alt text" width="100%"/>
 </div>
 
 ### Sorry-Bench
@@ -162,7 +162,7 @@ We curate a difficult internal test focusing on loyalty to freedom-based stances
 We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to assess the models’ behavior in handling contentious or potentially harmful prompts. Sorry-bench provides a rich suite of scenario-based tests that measure how readily a model may produce unsafe or problematic content. While some guardrails break (e.g., profanity and financial advice), the models remain robust to dangerous & criminal questions.
 
 <div align="center">
-<img src="
+<img src="assets/sorry_bench.png" alt="alt text" width="100%"/>
 </div>
 
 ### Ablation Study
@@ -170,7 +170,7 @@ We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to
 Below we show our ablation study, where we omit subsets of our fine-tuning data set and evaluate the results on the **Freedom Bench** described earlier.
 
 <div align="center">
-<img src="
+<img src="assets/ablation.jpg" alt="alt text" width="100%"/>
 </div>
 
 ---