# Whisper Tiny Czech (Knowledge Distillation from MLM)
This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using knowledge distillation (KD) from a masked language model (MLM).
## Model Description
During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.
- Teacher Model: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset.
- Distillation Approach: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but also encouraged to align its token distribution with the one predicted by the MLM, via a KL-divergence loss (see the sketch after this list).
- Tokenizer: Same byte pair encoding (BPE) as Whisper.
- Training Data: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
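The interface between the MLM teacher and Whisper is described above only at a high level. Below is a minimal PyTorch sketch of how a BiLSTM MLM could supply a per-step teacher distribution: the class name `CzechMaskedLM`, the helper `teacher_distribution`, the layer sizes, and the `mask_id` token are all illustrative assumptions, not the thesis code.

```python
import torch
import torch.nn as nn


class CzechMaskedLM(nn.Module):
    """Hypothetical BiLSTM masked language model over Whisper's BPE vocabulary."""

    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); positions to predict carry the mask id
        hidden_states, _ = self.bilstm(self.embed(token_ids))
        return self.head(hidden_states)  # (batch, seq_len, vocab)


@torch.no_grad()
def teacher_distribution(mlm: CzechMaskedLM,
                         prefix_ids: torch.Tensor,
                         mask_id: int) -> torch.Tensor:
    """Teacher distribution over the next token given the ground-truth prefix.

    The next position is filled with the mask token and predicted by the MLM.
    """
    mask_col = torch.full_like(prefix_ids[:, :1], mask_id)
    masked = torch.cat([prefix_ids, mask_col], dim=1)
    logits = mlm(masked)[:, -1, :]  # logits at the masked (next) position
    return torch.softmax(logits, dim=-1)
```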
## Loss Function
The training loss combined the standard ASR cross-entropy loss with the KD loss:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_{lm} \, \mathcal{L}_{\text{KD}}$$

where $\mathcal{L}_{\text{KD}}$ is the KL divergence between Whisper's per-step token distribution and the distribution predicted by the MLM, and $\lambda_{lm}$ balances the two components.
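A minimal PyTorch sketch of this objective is shown below. The function name `distillation_loss` and the tensor shapes are assumptions for illustration; the thesis implementation may differ in detail.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_probs, target_ids, lambda_lm=1e-3):
    """Combined ASR + KD loss, as a sketch of the objective above.

    student_logits : (batch, seq_len, vocab) Whisper decoder logits
    teacher_probs  : (batch, seq_len, vocab) MLM distributions at each step
    target_ids     : (batch, seq_len) ground-truth token ids
    """
    # Standard next-token cross-entropy on the ground-truth transcript.
    ce = F.cross_entropy(student_logits.transpose(1, 2), target_ids)

    # KL divergence pulling Whisper's distribution towards the MLM's.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    kd = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")

    return ce + lambda_lm * kd
```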
## Hyperparameters
| Model | Learning Rate | KD Lambda | Batch Size |
|---|---|---|---|
| Tiny Baseline | 5e-4 | - | 8 |
| Tiny Adapted (KD) | 1e-4 | 1e-3 | 8 |
The learning rates differ because they were optimised for each configuration separately.
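For readers who want to set up a comparable run with the Hugging Face `Seq2SeqTrainingArguments` API, a hedged configuration sketch follows. Only the learning rate and batch size come from the table above; the output directory and remaining flags are assumptions, and the KD lambda would be handled by a custom training loop rather than by these arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Values taken from the table above; everything else is an illustrative assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-cs-kd",   # assumed name
    learning_rate=1e-4,                # Tiny Adapted (KD)
    per_device_train_batch_size=8,
    predict_with_generate=True,
    fp16=True,
)

# lambda_lm = 1e-3 would be passed to a custom trainer that adds the KD term.
```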
## Results on CommonVoice Czech
| Model | Validation Loss | WER | CER |
|---|---|---|---|
| Tiny Baseline | 1.236 | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636 | 0.345 | 0.023 |
✅ CER reduced by ~25%
✅ WER reduced by ~23%
This shows that even lightweight knowledge distillation from a small MLM substantially improves the language-modelling behaviour of Whisper Tiny for Czech.
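The WER and CER figures above can be reproduced with the Hugging Face `evaluate` library, as sketched below. Whether the original evaluation applied any text normalisation (casing, punctuation) is not stated here, so treat this as an illustrative recipe rather than the exact evaluation script.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Illustrative transcripts; in practice these come from model decoding
# and the CommonVoice Czech reference sentences.
predictions = ["dobrý den jak se máte"]
references = ["dobrý den, jak se máte?"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```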
## Intended Use
This model is well suited to research and applications in Czech ASR where lightweight, efficient models are needed but a good grasp of the language is still crucial.
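A minimal inference sketch using the standard `transformers` ASR pipeline is shown below; replace `MODEL_ID` with this repository's identifier and point the call at your own audio file.

```python
from transformers import pipeline

# MODEL_ID and the audio path are placeholders; the pipeline call itself is standard.
asr = pipeline("automatic-speech-recognition", model="MODEL_ID")
result = asr("sample_czech_audio.wav")
print(result["text"])
```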
## Limitations
- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).
## Acknowledgments
- Knowledge distillation (Hinton et al., 2015)
- Whisper model family (OpenAI, 2022)
- CommonVoice dataset (Mozilla, 2020)
- CC100 dataset (Conneau et al., 2020)
## Citation
If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):
```bibtex
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```