Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a version of Whisper Tiny fine-tuned for Czech automatic speech recognition (ASR) using knowledge distillation (KD) from a masked language model (MLM).

Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.

  • Teacher Model: a BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset; a minimal architecture sketch follows this list.
  • Distillation Approach: At each decoding step, Whisper was trained not only with standard cross-entropy loss on the next token but also encouraged to align its token distribution with that predicted by the MLM (via KL-divergence loss).
  • Tokenizer: Same byte pair encoding (BPE) as Whisper.
  • Training Data: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
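
To make the teacher concrete, here is a minimal BiLSTM masked-LM sketch in PyTorch. The class name, layer sizes, and layer count are illustrative assumptions, not the actual 60M-parameter configuration used in the thesis.

```python
import torch.nn as nn

class BiLSTMMaskedLM(nn.Module):
    """Illustrative BiLSTM masked LM: given a token sequence containing a
    mask id at the position being decoded, it returns vocabulary logits for
    every position. All sizes below are assumptions, not the real config."""

    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 hidden_dim: int = 768, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); the masked position is scored like
        # any other position, so the caller picks out the logits it needs.
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        x, _ = self.lstm(x)             # (batch, seq_len, 2 * hidden_dim)
        return self.head(x)             # (batch, seq_len, vocab_size)
```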

Loss Function

The training loss combined the standard ASR cross-entropy loss with the KD loss:

$$L_t = (1 - \lambda_{lm}) \, \mathrm{CE}(\text{asr}, \text{true token}) + \lambda_{lm} \, \mathrm{KLD}(\text{asr distribution} \,\|\, \text{mlm prediction})$$

where $\lambda_{lm}$ weights the distillation term (the KD Lambda in the hyperparameter table below).
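
A minimal PyTorch sketch of this objective, assuming the student's and teacher's logits have already been aligned per decoding step (the function and argument names are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(asr_logits, target_ids, mlm_logits, lambda_lm=1e-3):
    """Combined objective: cross-entropy against the ground-truth token plus
    a KL term pulling the ASR distribution toward the MLM's prediction.
    lambda_lm = 1e-3 matches the KD Lambda reported below."""
    vocab = asr_logits.size(-1)
    # Standard ASR loss on the true next token.
    ce = F.cross_entropy(asr_logits.view(-1, vocab), target_ids.view(-1))
    # KL divergence between student (log-probs) and teacher (probs).
    kld = F.kl_div(
        F.log_softmax(asr_logits, dim=-1).view(-1, vocab),
        F.softmax(mlm_logits, dim=-1).view(-1, vocab),
        reduction="batchmean",
    )
    return (1 - lambda_lm) * ce + lambda_lm * kld
```

Here the teacher's prediction for each position would come from masking that position and conditioning on the surrounding transcript tokens, as described in the Model Description above.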

Hyperparameters

| Model | Learning Rate | KD Lambda | Batch Size |
|-------|---------------|-----------|------------|
| Tiny Baseline | 5e-4 | – | 8 |
| Tiny Adapted (KD) | 1e-4 | 1e-3 | 8 |

The learning rates differ because each was optimised separately for its own setup.
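
For orientation, the KD row of the table would translate into a Hugging Face training configuration roughly like the following; the output directory, epoch count, and all arguments not in the table are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the "Tiny Adapted (KD)" row above; everything not in the
# table (output dir, epochs, ...) is an assumption, not a reported value.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-cs-kd",   # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=10,               # assumption, not reported
    predict_with_generate=True,
)
```

Since the KD term is non-standard, the actual training presumably used a custom loss (e.g., by overriding `Trainer.compute_loss`) rather than the stock seq2seq objective.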

Results on CommonVoice Czech

| Model | Validation Loss | WER | CER |
|-------|-----------------|-----|-----|
| Tiny Baseline | 1.236 | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636 | 0.345 | 0.023 |

✅ CER reduced by ~25%
✅ WER reduced by ~23%

This shows that even very light knowledge distillation from a lightweight MLM significantly improves language modelling capabilities in Whisper Tiny for Czech.
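
WER and CER of this kind can be computed with the `evaluate` library; the snippet below is a sketch of the metric computation, not the thesis's evaluation script, and the example strings are placeholders:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcripts; in practice these come from model decoding
# and the CommonVoice reference annotations.
predictions = ["dobrý den, jak se máte"]
references = ["dobrý den, jak se máte"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```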


Intended Use

This model is suited to research and applications in Czech ASR where a lightweight, efficient model is needed but a stronger grasp of the language is crucial.
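
A minimal transcription example using the `transformers` pipeline; the model id below is a placeholder, so substitute the actual repository id of this model:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Hobit2002/whisper-tiny-czech-kd",  # placeholder repo id
)

# Any audio file containing Czech speech (resampled to 16 kHz internally).
print(asr("sample_czech_audio.wav")["text"])
```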

Limitations

  • Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
  • Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).

Citation

If you use this model, please cite the thesis below (yes, its main topic was indeed assistive ASR):

@misc{nadrchal_2025,
    title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
    author={David Nadrchal},
    year={2025},
    note={Bachelor Thesis},
    url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}