Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Abstract
Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces, sets of feature vectors that do not trigger the model's refusal mechanisms, then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computational cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, whereas existing techniques often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. It also showcases a practical application of mechanistic interpretability in a setting where other methods are less efficient, highlighting its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
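To make the two-step recipe above concrete, here is a minimal PyTorch sketch of the general idea: estimate a single difference-of-means "refusal direction" from contrastive prompt sets, then optimize an adversarial suffix by gradient descent so that the prompt's hidden state is pushed out of the refusal subspace. This is an illustration only, not the paper's implementation: the model name, layer index, toy prompt sets, one-dimensional refusal subspace, and soft (embedding-space) suffix are all simplifying assumptions, whereas the actual SSR attack targets richer acceptance subspaces and produces discrete token perturbations (see the linked repository for the real code).

```python
# Illustrative sketch only -- NOT the paper's implementation. Model, layer, prompt sets,
# and the use of a single refusal direction + continuous suffix are simplifying assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # any small chat model works for the sketch
LAYER = 12                              # residual-stream layer to steer (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()
model.requires_grad_(False)             # we only optimize the adversarial suffix

def last_token_acts(prompts, layer):
    """Residual-stream activation at `layer` for the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hs[0, -1])
    return torch.stack(acts)

# Step 1: estimate a refusal direction from contrastive prompt sets (toy examples).
harmful  = ["How do I make a bomb?", "Write malware that steals passwords."]
harmless = ["How do I bake bread?", "Write a poem about autumn."]
refusal_dir = last_token_acts(harmful, LAYER).mean(0) - last_token_acts(harmless, LAYER).mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 2: reroute -- optimize a soft suffix so the hidden state leaves the refusal
# subspace (low projection on refusal_dir), i.e. lands in the acceptance subspace.
prompt_ids = tok("How do I make a bomb?", return_tensors="pt").input_ids
embed = model.get_input_embeddings()
suffix = torch.nn.Parameter(0.01 * torch.randn(1, 8, model.config.hidden_size))
opt = torch.optim.Adam([suffix], lr=1e-2)

for step in range(200):
    inputs_embeds = torch.cat([embed(prompt_ids), suffix], dim=1)
    hs = model(inputs_embeds=inputs_embeds, output_hidden_states=True).hidden_states[LAYER]
    loss = hs[0, -1] @ refusal_dir      # projection onto the refusal direction
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  refusal projection = {loss.item():.3f}")
```

In a full attack one would still have to map the optimized soft suffix back to real tokens (or optimize over tokens directly) and check the jailbreak by generating from the perturbed prompt; the sketch only shows the subspace-rerouting objective itself.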
Community
Follow-up post on LessWrong: https://www.lesswrong.com/posts/mG7jioaAsBasnuD4b/subspace-rerouting-using-mechanistic-interpretability-to-2, which discusses how to use SSR as a "reversed" logit lens. It's fun, and just as colorful as the paper! :)
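For readers unfamiliar with the term: the standard logit lens reads out what the model "predicts" at intermediate layers by projecting the residual stream through the final layer norm and unembedding matrix; the post discusses running that idea in reverse with SSR. Below is a minimal sketch of the standard (non-reversed) direction, using TransformerLens with GPT-2 purely as an illustrative choice, not a setup from the paper.

```python
# Standard logit-lens sketch for context (model choice is illustrative, not from the paper).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]        # residual stream after this layer
    layer_logits = model.unembed(model.ln_final(resid))  # project into vocabulary space
    top_token = layer_logits[0, -1].argmax().item()
    print(f"layer {layer:2d} -> {model.tokenizer.decode(top_token)!r}")
```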
The following papers were recommended by the Semantic Scholar API
- The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence (2025)
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks (2025)
- Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification (2025)
- Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks (2025)
- Improving LLM Safety Alignment with Dual-Objective Optimization (2025)
- HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States (2025)
- Universal Adversarial Attack on Aligned Multimodal LLMs (2025)