---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper's fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210 MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but was also encouraged to align its token distribution with the one predicted by the MLM (via a KL-divergence loss).
- **Tokenizer**: The same byte pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 for speech; CC100-Czech for language modelling.

### Loss Function

The training loss combines the standard ASR loss with the KD loss:

$$
L_t = (1 - \lambda_{lm}) \, \mathrm{CE}(\text{ASR logits}, \text{true token}) + \lambda_{lm} \, \mathrm{KLD}(\text{ASR distribution} \,\|\, \text{MLM prediction})
$$

where $\lambda_{lm}$ is the weight of the distillation term (the "KD Lambda" in the table below).

### Hyperparameters

| Model             | Learning Rate | KD Lambda ($\lambda_{lm}$) | Batch Size |
|-------------------|---------------|----------------------------|------------|
| Tiny Baseline     | 5e-4          | -                          | 8          |
| Tiny Adapted (KD) | 1e-4          | 1e-3                       | 8          |

The learning rates differ because they were optimised separately for each model.

### Results on CommonVoice Czech

| Model             | Validation Loss | WER   | CER   |
|-------------------|-----------------|-------|-------|
| Tiny Baseline     | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636           | 0.345 | 0.023 |

✅ **CER reduced by ~25%**
✅ **WER reduced by ~23%**

This shows that even very light knowledge distillation from a lightweight MLM substantially improves the language modelling capabilities of Whisper Tiny for Czech.

---

## Intended Use

This model is intended for research and applications in Czech ASR that require lightweight, efficient models but still depend on a solid grasp of the language.

## Limitations

- The MLM teacher was trained on a relatively small subset (210 MB) of CC100-Czech due to computational constraints.
- Optimised for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments

- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite the thesis (yes, its main topic was indeed assistive ASR):

```
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```
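
## Distillation Loss Sketch

For readers who want to see the objective from the Loss Function section in code, below is a minimal PyTorch-style sketch of the combined per-step loss. The function name `kd_step_loss`, the tensor shapes, and the omission of padding masking are illustrative assumptions, not the exact implementation used in the thesis.

```python
import torch
import torch.nn.functional as F

def kd_step_loss(asr_logits, mlm_logits, target_ids, lambda_lm=1e-3):
    """Combined ASR + KD loss for a batch of decoding steps (illustrative sketch).

    asr_logits:  (batch, seq_len, vocab) raw logits from Whisper's decoder
    mlm_logits:  (batch, seq_len, vocab) teacher MLM logits for the same positions
    target_ids:  (batch, seq_len) ground-truth token ids
    lambda_lm:   weight of the distillation (KL-divergence) term
    """
    vocab = asr_logits.size(-1)

    # Standard next-token cross-entropy against the reference transcript
    # (padding positions would normally be masked with ignore_index; omitted here).
    ce = F.cross_entropy(asr_logits.reshape(-1, vocab), target_ids.reshape(-1))

    # KL divergence between the student's and the teacher's token distributions.
    kld = F.kl_div(
        F.log_softmax(asr_logits, dim=-1),
        F.softmax(mlm_logits, dim=-1),
        reduction="batchmean",
    )

    # Convex combination from the Loss Function section.
    return (1.0 - lambda_lm) * ce + lambda_lm * kld
```

With the `lambda_lm = 1e-3` value from the hyperparameter table, the cross-entropy term dominates and the KL term acts only as a light regulariser pulling the decoder towards token sequences the Czech MLM considers plausible.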