Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a version of Whisper Tiny fine-tuned for Czech automatic speech recognition (ASR) using knowledge distillation (KD) from a masked language model (MLM).

Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.

  • Teacher Model: a BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset; a minimal architecture sketch follows this list.
  • Distillation Approach: At each decoding step, Whisper was trained not only with standard cross-entropy loss on the next token but also encouraged to align its token distribution with that predicted by the MLM (via KL-divergence loss).
  • Tokenizer: Same byte pair encoding (BPE) as Whisper.
  • Training Data: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
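
To make the teacher concrete, here is a minimal BiLSTM masked-LM sketch in PyTorch. The class name, layer sizes, and layer count are illustrative assumptions, not the actual 60M-parameter configuration used in the thesis.

```python
import torch.nn as nn

class BiLSTMMaskedLM(nn.Module):
    """Illustrative BiLSTM masked LM: given a token sequence containing a
    mask id at the position being decoded, it returns vocabulary logits for
    every position. All sizes below are assumptions, not the real config."""

    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 hidden_dim: int = 768, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); the masked position is scored like
        # any other position, so the caller picks out the logits it needs.
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        x, _ = self.lstm(x)             # (batch, seq_len, 2 * hidden_dim)
        return self.head(x)             # (batch, seq_len, vocab_size)
```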

Loss Function

The training loss combined the standard ASR cross-entropy loss with the KD loss:

$$L_t = (1 - \lambda_{lm}) \, \mathrm{CE}(\text{asr}, \text{true token}) + \lambda_{lm} \, \mathrm{KLD}(\text{asr distribution} \,\|\, \text{mlm prediction})$$

where $\lambda_{lm}$ weights the distillation term (the KD Lambda in the hyperparameter table below).
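
A minimal PyTorch sketch of this objective, assuming the student's and teacher's logits have already been aligned per decoding step (the function and argument names are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(asr_logits, target_ids, mlm_logits, lambda_lm=1e-3):
    """Combined objective: cross-entropy against the ground-truth token plus
    a KL term pulling the ASR distribution toward the MLM's prediction.
    lambda_lm = 1e-3 matches the KD Lambda reported below."""
    vocab = asr_logits.size(-1)
    # Standard ASR loss on the true next token.
    ce = F.cross_entropy(asr_logits.view(-1, vocab), target_ids.view(-1))
    # KL divergence between student (log-probs) and teacher (probs).
    kld = F.kl_div(
        F.log_softmax(asr_logits, dim=-1).view(-1, vocab),
        F.softmax(mlm_logits, dim=-1).view(-1, vocab),
        reduction="batchmean",
    )
    return (1 - lambda_lm) * ce + lambda_lm * kld
```

Here the teacher's prediction for each position would come from masking that position and conditioning on the surrounding transcript tokens, as described in the Model Description above.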

Hyperparameters

| Model | Learning Rate | KD Lambda | Batch Size |
|-------|---------------|-----------|------------|
| Tiny Baseline | 5e-4 | – | 8 |
| Tiny Adapted (KD) | 1e-4 | 1e-3 | 8 |

The learning rates differ because each was optimised separately for its own setup.
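
For orientation, the KD row of the table would translate into a Hugging Face training configuration roughly like the following; the output directory, epoch count, and all arguments not in the table are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the "Tiny Adapted (KD)" row above; everything not in the
# table (output dir, epochs, ...) is an assumption, not a reported value.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-cs-kd",   # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=10,               # assumption, not reported
    predict_with_generate=True,
)
```

Since the KD term is non-standard, the actual training presumably used a custom loss (e.g., by overriding `Trainer.compute_loss`) rather than the stock seq2seq objective.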

Results on CommonVoice Czech

| Model | Validation Loss | WER | CER |
|-------|-----------------|-----|-----|
| Tiny Baseline | 1.236 | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636 | 0.345 | 0.023 |

✅ CER reduced by ~25%
✅ WER reduced by ~23%

This shows that even very light knowledge distillation from a lightweight MLM significantly improves language modelling capabilities in Whisper Tiny for Czech.
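
WER and CER of this kind can be computed with the `evaluate` library; the snippet below is a sketch of the metric computation, not the thesis's evaluation script, and the example strings are placeholders:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcripts; in practice these come from model decoding
# and the CommonVoice reference annotations.
predictions = ["dobrý den, jak se máte"]
references = ["dobrý den, jak se máte"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```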


Intended Use

This model is suited to research and applications in Czech ASR where a lightweight, efficient model is needed but a stronger grasp of the language is crucial.
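
A minimal transcription example using the `transformers` pipeline; the model id below is a placeholder, so substitute the actual repository id of this model:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Hobit2002/whisper-tiny-czech-kd",  # placeholder repo id
)

# Any audio file containing Czech speech (resampled to 16 kHz internally).
print(asr("sample_czech_audio.wav")["text"])
```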

Limitations

  • Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
  • Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).

Citation

If you use this model, please cite the thesis below (yes, its main topic was indeed assistive ASR):

@misc{nadrchal_2025,
    title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
    author={David Nadrchal},
    year={2025},
    note={Bachelor Thesis},
    url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}