---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---
# Whisper Tiny Czech (Knowledge Distillation from MLM)
This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.
## Model Description
During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.
- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but also encouraged to align its token distribution with the one predicted by the MLM via a KL-divergence loss (see the sketch after this list).
- **Tokenizer**: Same byte pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
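To make the distillation step concrete, here is a minimal sketch of how a teacher distribution could be obtained from the MLM at each decoding step. The `mlm` interface, `mask_id`, and variable names are illustrative placeholders, not the exact training code.

```python
import torch
import torch.nn.functional as F

def mlm_teacher_distribution(mlm, token_ids, step, mask_id):
    """Distribution the MLM teacher predicts for position `step`, given the
    ground-truth transcript with that position masked out.

    Assumes `mlm` maps (batch, seq_len) token ids to (batch, seq_len, vocab)
    logits and that `mask_id` is its mask token; both are placeholders here.
    """
    masked = token_ids.clone()
    masked[:, step] = mask_id          # hide the token Whisper is about to predict
    with torch.no_grad():              # the teacher stays frozen
        logits = mlm(masked)
    return F.softmax(logits[:, step, :], dim=-1)
```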
### Loss Function
The training loss combined the standard ASR loss with KD loss:
$$
L_t = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
$$
where $\lambda_{lm}$ balances the two components.
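A minimal sketch of this combined per-step loss, assuming the ASR logits and the teacher distribution share Whisper's BPE vocabulary (variable names are illustrative, not taken from the training code):

```python
import torch.nn.functional as F

def distillation_loss(asr_logits, target_ids, mlm_probs, lambda_lm=1e-3):
    """Per-step loss following the formula above:
    lambda_lm * CE(asr, true token) + (1 - lambda_lm) * KL(asr || mlm)."""
    ce = F.cross_entropy(asr_logits, target_ids)
    asr_probs = F.softmax(asr_logits, dim=-1)
    # F.kl_div expects log-probabilities as the first argument and probabilities
    # as the second; with these arguments it computes KL(asr || mlm).
    kld = F.kl_div(mlm_probs.clamp_min(1e-8).log(), asr_probs, reduction="batchmean")
    return lambda_lm * ce + (1.0 - lambda_lm) * kld
```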
### Hyperparameters
| Model | Learning Rate | KD Lambda | Batch Size |
|--------------------|---------------|-----------|------------|
| Tiny Baseline | 5e-4 | - | 8 |
| Tiny Adapted (KD) | 1e-4 | 1e-3 | 8 |
The learning rates differ because each was optimised separately for its setup.
### Results on CommonVoice Czech
| Model | Validation Loss | WER | CER |
|--------------------|------------------|------|------|
| Tiny Baseline | 1.236 | 0.447| 0.031|
| Tiny Adapted (KD) | 0.636 | 0.345| 0.023|
**WER reduced by ~23%, CER reduced by ~25%**
This shows that even light knowledge distillation from a small MLM teacher substantially improves the language-modelling ability of Whisper Tiny for Czech.
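For reference, the reported WER and CER can be computed with a standard metric library such as `jiwer`; the snippet below is only a minimal sketch with illustrative strings, not the actual evaluation pipeline.

```python
import jiwer

references = ["dobrý den jak se máte"]   # ground-truth transcripts
hypotheses = ["dobrý den jak se mate"]   # model outputs

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```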
---
## Intended Use
This model is intended for research and applications in Czech ASR where a lightweight, efficient model is needed but a stronger grasp of the language is crucial.
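A minimal usage sketch with the Hugging Face Transformers pipeline; the repository id `Hobit2002/whisper_tiny_cs` is assumed from this model page and may need adjusting.

```python
from transformers import pipeline

# Repository id assumed from this model page; adjust if it differs.
asr = pipeline(
    "automatic-speech-recognition",
    model="Hobit2002/whisper_tiny_cs",
)

print(asr("czech_sample.wav")["text"])
```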
## Limitations
- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).
## Acknowledgments
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))
## Citation
If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):
```
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```