SentientAGI/Dobby-Mini-Leashed-Llama-3.1-8B

salzubi401 committed on Jan 24

Commit 4fe1176 (verified) · Parent: 7e9395e

Update README.md

README.md +6 -6

@@ -31,7 +31,7 @@ model-index:
 <!-- markdownlint-disable no-duplicate-header -->
 <div align="center">
-    <img src="../assets/sentient-logo-narrow.png" alt="alt text" width="60%"/>
+    <img src="assets/sentient-logo-narrow.png" alt="alt text" width="60%"/>
 </div>
 <hr>
@@ -142,7 +142,7 @@ This means that our community owns the fingerprints that they can use to verify
 **Dobby-Mini-Leashed-Llama-3.1-8B** and **Dobby-Mini-Unhinged-Llama-3.1-8B** retain the base performance of Llama-3.1-8B-Instruct across the evaluated tasks.
 <div align="center">
-    <img src="../assets/hf_evals.png" alt="alt text" width="100%"/>
+    <img src="assets/hf_evals.png" alt="alt text" width="100%"/>
 </div>
 ### Freedom Bench
@@ -150,11 +150,11 @@ This means that our community owns the fingerprints that they can use to verify
 We curate a difficult internal test focusing on loyalty to freedom-based stances through rejection sampling (generate one sample, if it is rejected, generate another, continue until accepted). **Dobby significantly outperforms base Llama** on holding firm to these values, even with adversarial or conflicting prompts
 <div align="center">
-    <img src="../assets/freedom_privacy.png" alt="alt text" width="100%"/>
+    <img src="assets/freedom_privacy.png" alt="alt text" width="100%"/>
 </div>
 <div align="center">
-    <img src="../assets/freedom_speech.png" alt="alt text" width="100%"/>
+    <img src="assets/freedom_speech.png" alt="alt text" width="100%"/>
 </div>
 ### Sorry-Bench
@@ -162,7 +162,7 @@ We curate a difficult internal test focusing on loyalty to freedom-based stances
 We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to assess the models’ behavior in handling contentious or potentially harmful prompts. Sorry-bench provides a rich suite of scenario-based tests that measure how readily a model may produce unsafe or problematic content. While some guardrails break (e.g., profanity and financial advice), the models remain robust to dangerous & criminal questions.
 <div align="center">
-    <img src="../assets/sorry_bench.png" alt="alt text" width="100%"/>
+    <img src="assets/sorry_bench.png" alt="alt text" width="100%"/>
 </div>
 ### Ablation Study
@@ -170,7 +170,7 @@ We use the Sorry-bench ([Xie et al., 2024](https://arxiv.org/abs/2406.14598)) to
 Below we show our ablation study, where we omit subsets of our fine-tuning data set and evaluate the results on the **Freedom Bench** described earlier.
 <div align="center">
-    <img src="../assets/ablation.jpg" alt="alt text" width="100%"/>
+    <img src="assets/ablation.jpg" alt="alt text" width="100%"/>
 </div>
 ---