|
--- |
|
language: cs |
|
datasets: |
|
- CommonVoice |
|
- CC100 |
|
tags: |
|
- automatic-speech-recognition |
|
- whisper |
|
- knowledge-distillation |
|
- czech |
|
license: mit |
|
--- |
|
|
|
# Whisper Tiny Czech (Knowledge Distillation from MLM) |
|
|
|
This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**. |
|
|
|
## Model Description |
|
|
|
During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning. |
|
|
|
- **Teacher Model**: a BiLSTM-based masked language model (60M parameters) trained on a 210 MB subset of the CC100-Czech dataset (a structural sketch follows this list).
|
- **Distillation Approach**: At each decoding step, Whisper was trained not only with standard cross-entropy loss on the next token but also encouraged to align its token distribution with that predicted by the MLM (via KL-divergence loss). |
|
- **Tokenizer**: the same byte-pair encoding (BPE) tokenizer as Whisper.
|
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling. |
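
The teacher described above can be pictured roughly as in the minimal PyTorch sketch below. The class name `BiLSTMMaskedLM`, the layer sizes, the placeholder vocabulary and mask ids, and the way the masked position is queried are all illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class BiLSTMMaskedLM(nn.Module):
    """Toy bidirectional-LSTM masked LM: predicts a vocabulary distribution
    for every position of a token sequence that contains a mask token."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); the position to predict carries the mask id
        hidden, _ = self.lstm(self.embedding(token_ids))
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits

# Illustrative query: ground-truth prefix with the next position masked
vocab_size, mask_id = 50_000, 0                  # placeholder values
teacher = BiLSTMMaskedLM(vocab_size)
prefix = torch.tensor([[15, 27, 42, mask_id]])   # placeholder token ids
teacher_log_probs = torch.log_softmax(teacher(prefix)[:, -1, :], dim=-1)
```

During Whisper fine-tuning, log-probabilities obtained this way can serve as the teacher signal in the KD loss described in the next section.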
|
|
|
### Loss Function |
|
|
|
The training loss combined the standard ASR cross-entropy loss with the KD loss:
|
|
|
$$
L_t = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
$$
|
|
|
where $\lambda_{lm}$ balances the two components. |
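
As a rough illustration, the snippet below sketches how such a per-step objective can be computed in PyTorch, mirroring the equation above. The tensor names, the orientation of the KL term, and the `batchmean` reduction are assumptions made for the sketch, not the exact training code.

```python
import torch
import torch.nn.functional as F

def kd_step_loss(asr_logits, mlm_log_probs, target_ids, lambda_lm):
    """Per-step loss mirroring the equation above.

    asr_logits:    (batch, vocab) raw Whisper decoder logits at step t
    mlm_log_probs: (batch, vocab) log-probabilities from the MLM teacher at step t
    target_ids:    (batch,) ground-truth token ids at step t
    lambda_lm:     scalar weight balancing the two terms
    """
    ce = F.cross_entropy(asr_logits, target_ids)
    asr_log_probs = F.log_softmax(asr_logits, dim=-1)
    # KL term pulling the ASR distribution towards the teacher
    # (the exact orientation of the divergence is an assumption of this sketch)
    kld = F.kl_div(asr_log_probs, mlm_log_probs, log_target=True, reduction="batchmean")
    return lambda_lm * ce + (1.0 - lambda_lm) * kld

# Tiny smoke test with random tensors (batch of 2, vocabulary of 10)
logits = torch.randn(2, 10)
teacher = torch.log_softmax(torch.randn(2, 10), dim=-1)
targets = torch.tensor([3, 7])
loss = kd_step_loss(logits, teacher, targets, lambda_lm=1e-3)
```

In the full fine-tuning loop, this per-step loss would be accumulated over all decoding positions of a batch.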
|
|
|
### Hyperparameters |
|
|
|
| Model              | Learning Rate | KD Lambda | Batch Size |
|--------------------|---------------|-----------|------------|
| Tiny Baseline      | 5e-4          | -         | 8          |
| Tiny Adapted (KD)  | 1e-4          | 1e-3      | 8          |
|
|
|
The learning rates differ because each was optimised separately for its configuration.
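
For orientation only, the rows of the table can be captured as plain configuration dictionaries; the key names below are assumptions, while the numeric values come from the table.

```python
# Key names are illustrative; values are taken from the hyperparameter table.
configs = {
    "tiny_baseline":   {"learning_rate": 5e-4, "kd_lambda": None, "batch_size": 8},
    "tiny_adapted_kd": {"learning_rate": 1e-4, "kd_lambda": 1e-3, "batch_size": 8},
}
```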
|
|
|
### Results on CommonVoice Czech |
|
|
|
| Model              | Validation Loss | WER   | CER   |
|--------------------|-----------------|-------|-------|
| Tiny Baseline      | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD)  | 0.636           | 0.345 | 0.023 |
|
|
|
✅ **WER reduced by ~23% (relative)**

✅ **CER reduced by ~25% (relative)**
|
|
|
This shows that even light knowledge distillation from a small MLM substantially improves the language-modelling ability of Whisper Tiny for Czech.
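
For reference, such scores can be recomputed with the Hugging Face `evaluate` library (metric ids `wer` and `cer`); the prediction and reference lists below are placeholders, and the relative reductions quoted above follow from the table values shown in the trailing comment.

```python
import evaluate

wer_metric = evaluate.load("wer")   # word error rate
cer_metric = evaluate.load("cer")   # character error rate (requires jiwer)

predictions = ["dobrý den jak se máte"]    # placeholder model transcripts
references  = ["dobrý den, jak se máte?"]  # placeholder ground-truth transcripts

wer = wer_metric.compute(predictions=predictions, references=references)
cer = cer_metric.compute(predictions=predictions, references=references)

# Relative reductions from the results table:
#   WER: (0.447 - 0.345) / 0.447 ≈ 0.228
#   CER: (0.031 - 0.023) / 0.031 ≈ 0.258
```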
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
This model is intended for research and applications in Czech ASR where lightweight, efficient models are needed but a good grasp of the language remains crucial.
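
A minimal transcription sketch using the `transformers` ASR pipeline is shown below; the model identifier is a placeholder for wherever this checkpoint is hosted, and the audio path is illustrative.

```python
from transformers import pipeline

# Placeholder repository id: replace with the actual location of this checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="your-namespace/whisper-tiny-czech-kd",
    generate_kwargs={"language": "czech", "task": "transcribe"},
)

print(asr("sample_czech_audio.wav")["text"])
```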
|
|
|
## Limitations |
|
|
|
- The MLM teacher was trained on a relatively small (210 MB) subset of CC100-Czech due to computational constraints.
|
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data). |
|
|
|
## Acknowledgments |
|
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531)) |
|
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper)) |
|
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org)) |
|
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116)) |
|
|
|
## Citation |
|
|
|
If you use this model, please cite the thesis it was developed for (its main topic was indeed assistive ASR):
|
|
|
```bibtex
|
@misc{nadrchal_2025, |
|
title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study}, |
|
author={David Nadrchal}, |
|
year={2025}, |
|
note={Bachelor Thesis}, |
|
url={https://github.com/Hobit2002/TracheoSpeech_ASR} |
|
} |
|
``` |