---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper's fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210 MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but was also encouraged to align its token distribution with the one predicted by the MLM (via a KL-divergence loss).
- **Tokenizer**: The same byte pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 for speech; CC100-Czech for language modelling.

### Loss Function

The training loss combines the standard ASR loss with the KD loss:

$$
L_t = (1 - \lambda_{lm}) \, \mathrm{CE}(\text{ASR logits}, \text{true token}) + \lambda_{lm} \, \mathrm{KLD}(\text{ASR distribution} \,\|\, \text{MLM prediction})
$$

where $\lambda_{lm}$ is the weight of the distillation term (the "KD Lambda" in the table below).

### Hyperparameters

| Model             | Learning Rate | KD Lambda ($\lambda_{lm}$) | Batch Size |
|-------------------|---------------|----------------------------|------------|
| Tiny Baseline     | 5e-4          | -                          | 8          |
| Tiny Adapted (KD) | 1e-4          | 1e-3                       | 8          |

The learning rates differ because they were optimised separately for each model.

### Results on CommonVoice Czech

| Model             | Validation Loss | WER   | CER   |
|-------------------|-----------------|-------|-------|
| Tiny Baseline     | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636           | 0.345 | 0.023 |

✅ **CER reduced by ~25%**
✅ **WER reduced by ~23%**

This shows that even very light knowledge distillation from a lightweight MLM substantially improves the language modelling capabilities of Whisper Tiny for Czech.

---

## Intended Use

This model is intended for research and applications in Czech ASR that require lightweight, efficient models but still depend on a solid grasp of the language.

## Limitations

- The MLM teacher was trained on a relatively small subset (210 MB) of CC100-Czech due to computational constraints.
- Optimised for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments

- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite the thesis (yes, its main topic was indeed assistive ASR):

```
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```
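
## Distillation Loss Sketch

For readers who want to see the objective from the Loss Function section in code, below is a minimal PyTorch-style sketch of the combined per-step loss. The function name `kd_step_loss`, the tensor shapes, and the omission of padding masking are illustrative assumptions, not the exact implementation used in the thesis.

```python
import torch
import torch.nn.functional as F

def kd_step_loss(asr_logits, mlm_logits, target_ids, lambda_lm=1e-3):
    """Combined ASR + KD loss for a batch of decoding steps (illustrative sketch).

    asr_logits:  (batch, seq_len, vocab) raw logits from Whisper's decoder
    mlm_logits:  (batch, seq_len, vocab) teacher MLM logits for the same positions
    target_ids:  (batch, seq_len) ground-truth token ids
    lambda_lm:   weight of the distillation (KL-divergence) term
    """
    vocab = asr_logits.size(-1)

    # Standard next-token cross-entropy against the reference transcript
    # (padding positions would normally be masked with ignore_index; omitted here).
    ce = F.cross_entropy(asr_logits.reshape(-1, vocab), target_ids.reshape(-1))

    # KL divergence between the student's and the teacher's token distributions.
    kld = F.kl_div(
        F.log_softmax(asr_logits, dim=-1),
        F.softmax(mlm_logits, dim=-1),
        reduction="batchmean",
    )

    # Convex combination from the Loss Function section.
    return (1.0 - lambda_lm) * ce + lambda_lm * kld
```

With the `lambda_lm = 1e-3` value from the hyperparameter table, the cross-entropy term dominates and the KL term acts only as a light regulariser pulling the decoder towards token sequences the Czech MLM considers plausible.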