Abstract
Interpretability methods like logit lens, linear probing, and activation patching are applied to ASR to uncover internal dynamics, repetition hallucinations, and semantic biases, enhancing model transparency and robustness.
Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods, including logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.
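To make the logit-lens idea concrete: the technique projects intermediate hidden states through the model's final unembedding matrix, showing which token the model would predict at each depth. The sketch below is a minimal, toy illustration with random matrices (the actual ASR model, dimensions, and unembedding are not from the paper):

```python
import numpy as np

# Toy logit-lens sketch (illustrative only, not the paper's exact setup):
# project each layer's hidden state for one decoder position through a
# hypothetical unembedding matrix W_U to get per-layer token distributions.

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 8, 16, 4

# Hypothetical per-layer hidden states at a single decoder position.
hidden_states = rng.normal(size=(n_layers, d_model))
W_U = rng.normal(size=(d_model, vocab_size))  # unembedding matrix

def logit_lens(h, W_U):
    """Early-decode a hidden state into a vocabulary distribution."""
    logits = h @ W_U
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

for layer, h in enumerate(hidden_states):
    p = logit_lens(h, W_U)
    print(f"layer {layer}: top token id = {p.argmax()}, prob = {p.max():.2f}")
```

In a real ASR decoder one would capture residual-stream activations with forward hooks and reuse the model's own output projection; the loop above only shows the projection-and-inspect pattern.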
Community
We adapt and apply mechanistic interpretability methods to systematically analyze how ASR models process and transform acoustic and linguistic information across layers. Our findings uncover internal encoder–decoder dynamics that underlie model biases and hallucinations. Excited to hear thoughts on interpretability in speech systems!
Very cool paper @netag! In Section 4.2 "Hallucination Prediction from Decoder Residual Stream", you noted "...with zero WER and the 200 samples with highest WER values..." when curating a dataset for hallucination linear probing. For hallucinations, wouldn't it be better aligned with the task to rank samples by the insertion error component only (as opposed to the deletion and substitution errors that also make up the overall WER)? Either way, the experiment is valuable and the results are very encouraging.
Thanks for the interest!
You’re right that insertions are a better signal of hallucinations. Here we used overall WER as a straightforward choice. The main goal was to highlight the potential link between the decoder residual stream and hallucinations, and to suggest directions for more targeted analyses in future work.
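For readers unfamiliar with the probing setup being discussed: a linear probe here is just a logistic-regression classifier trained on residual-stream features to separate clean from hallucinated samples. The sketch below uses entirely synthetic features with a hypothetical "hallucination direction" (feature extraction from a real ASR decoder is assumed to happen upstream, and the dimensions are made up):

```python
import numpy as np

# Sketch of a linear probe trained to separate zero-WER samples from
# high-WER samples using (here, synthetic) decoder residual-stream features.

rng = np.random.default_rng(1)
d = 16
n_per_class = 200  # mirrors the 200-sample split mentioned above

# Synthetic features: "hallucinated" samples are shifted along a
# hypothetical hallucination direction in the residual stream.
direction = rng.normal(size=d)
clean = rng.normal(size=(n_per_class, d))
halluc = rng.normal(size=(n_per_class, d)) + 1.5 * direction
X = np.vstack([clean, halluc])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Train the probe (plain logistic regression via gradient descent).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

acc = (((X @ w + b) > 0) == y).mean()
print(f"probe train accuracy: {acc:.2f}")
```

Ranking samples by insertion errors instead of overall WER, as suggested above, would only change how the positive class is curated; the probe itself stays the same.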
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches (2025)
- DIFFA: Large Language Diffusion Models Can Listen and Understand (2025)
- CarelessWhisper: Turning Whisper into a Causal Streaming Model (2025)
- MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing (2025)
- Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM (2025)
- LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers (2025)
- Dissecting Persona-Driven Reasoning in Language Models via Activation Patching (2025)