---
license: apache-2.0
tags:
- chain-of-thought
- safety
- alignment
- reasoning
- large-language-model
library_name: transformers
inference: true
---
# SAFEPATH-R-7B
This model is the SAFEPATH-aligned version of DeepSeek-R1-Distill-Qwen-7B, fine-tuned using prefix-only safety priming.
## Model Description
SAFEPATH applies a minimal alignment technique: it inserts the phrase **"Let's think about safety first"** (the Safety Primer) at the beginning of the model's reasoning block. This primes the model toward safer reasoning without reducing its reasoning performance.
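Here "prefix-only" priming is read as supervising just the primer tokens at the start of the reasoning block, leaving the rest of the chain of thought untouched. The sketch below illustrates how such a training example might be built under that assumption; `build_primed_example`, the prompt string, and the exact placement of the `<think>` delimiter are illustrative, not the authors' code.

```python
# Minimal sketch of a prefix-only priming example (illustrative, not the
# authors' training code). Assumption: "prefix-only" means the loss is
# computed on the Safety Primer tokens only; prompt/context tokens are
# masked out with -100, and no further reasoning tokens are supervised.
from transformers import AutoTokenizer

PRIMER = "Let's think about safety first"
THINK_OPEN = "<think>"  # reasoning-block delimiter used by R1-style models

def build_primed_example(tokenizer, prompt: str) -> dict:
    # Context: the user prompt plus the token that opens the reasoning block.
    context_ids = tokenizer(prompt + THINK_OPEN, add_special_tokens=False)["input_ids"]
    primer_ids = tokenizer(" " + PRIMER, add_special_tokens=False)["input_ids"]
    input_ids = context_ids + primer_ids
    # -100 tells the cross-entropy loss to ignore the context positions.
    labels = [-100] * len(context_ids) + primer_ids
    return {"input_ids": input_ids, "labels": labels}

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
example = build_primed_example(tokenizer, "How do strong passwords work?")
```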
- 🔐 **Improved Safety**: Reduces harmful outputs on safety benchmarks such as StrongReject and BeaverTails, and is robust to jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Fine-tuned for only 100 steps
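
Because the primer is learned during fine-tuning, no special prompting is needed at inference time; the model's reasoning block should open with the primer on its own. A minimal inference sketch with 🤗 Transformers follows (the repo id is a placeholder for this model's actual Hub id):

```python
# Minimal inference sketch with 🤗 Transformers. The repo id is a placeholder;
# replace it with this model's actual Hub id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/SAFEPATH-R-7B"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How can I secure my home Wi-Fi network?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
# Expected shape of the response (schematically):
# <think> Let's think about safety first ... </think> final answer
```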
## Intended Use
This model is intended for research in:
- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies
For details, see our paper.
## Overview Results