---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---
# Whisper Tiny Czech (Knowledge Distillation from MLM)
This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.
## Model Description
During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.
- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but also encouraged to align its token distribution with the one predicted by the MLM via a KL-divergence loss (see the sketch after this list).
- **Tokenizer**: Same byte pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
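To make the distillation step concrete, here is a minimal sketch of how a teacher distribution could be obtained from the MLM at each decoding step. The `mlm` interface, `mask_id`, and variable names are illustrative placeholders, not the exact training code.

```python
import torch
import torch.nn.functional as F

def mlm_teacher_distribution(mlm, token_ids, step, mask_id):
    """Distribution the MLM teacher predicts for position `step`, given the
    ground-truth transcript with that position masked out.

    Assumes `mlm` maps (batch, seq_len) token ids to (batch, seq_len, vocab)
    logits and that `mask_id` is its mask token; both are placeholders here.
    """
    masked = token_ids.clone()
    masked[:, step] = mask_id          # hide the token Whisper is about to predict
    with torch.no_grad():              # the teacher stays frozen
        logits = mlm(masked)
    return F.softmax(logits[:, step, :], dim=-1)
```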
### Loss Function
The training loss combined the standard ASR loss with KD loss:
$$
L_t = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
$$
where $\lambda_{lm}$ balances the two components.
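A minimal sketch of this combined per-step loss, assuming the ASR logits and the teacher distribution share Whisper's BPE vocabulary (variable names are illustrative, not taken from the training code):

```python
import torch.nn.functional as F

def distillation_loss(asr_logits, target_ids, mlm_probs, lambda_lm=1e-3):
    """Per-step loss following the formula above:
    lambda_lm * CE(asr, true token) + (1 - lambda_lm) * KL(asr || mlm)."""
    ce = F.cross_entropy(asr_logits, target_ids)
    asr_probs = F.softmax(asr_logits, dim=-1)
    # F.kl_div expects log-probabilities as the first argument and probabilities
    # as the second; with these arguments it computes KL(asr || mlm).
    kld = F.kl_div(mlm_probs.clamp_min(1e-8).log(), asr_probs, reduction="batchmean")
    return lambda_lm * ce + (1.0 - lambda_lm) * kld
```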
### Hyperparameters
| Model | Learning Rate | KD Lambda | Batch Size |
|--------------------|---------------|-----------|------------|
| Tiny Baseline | 5e-4 | - | 8 |
| Tiny Adapted (KD) | 1e-4 | 1e-3 | 8 |
The learning rates differ because each was optimised separately for its setup.
### Results on CommonVoice Czech
| Model | Validation Loss | WER | CER |
|--------------------|------------------|------|------|
| Tiny Baseline | 1.236 | 0.447| 0.031|
| Tiny Adapted (KD) | 0.636 | 0.345| 0.023|
**WER reduced by ~23%, CER reduced by ~25%**
This shows that even light knowledge distillation from a small MLM teacher substantially improves the language-modelling ability of Whisper Tiny for Czech.
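For reference, the reported WER and CER can be computed with a standard metric library such as `jiwer`; the snippet below is only a minimal sketch with illustrative strings, not the actual evaluation pipeline.

```python
import jiwer

references = ["dobrý den jak se máte"]   # ground-truth transcripts
hypotheses = ["dobrý den jak se mate"]   # model outputs

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```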
---
## Intended Use
This model is intended for research and applications in Czech ASR where a lightweight, efficient model is needed but a stronger grasp of the language is crucial.
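A minimal usage sketch with the Hugging Face Transformers pipeline; the repository id `Hobit2002/whisper_tiny_cs` is assumed from this model page and may need adjusting.

```python
from transformers import pipeline

# Repository id assumed from this model page; adjust if it differs.
asr = pipeline(
    "automatic-speech-recognition",
    model="Hobit2002/whisper_tiny_cs",
)

print(asr("czech_sample.wav")["text"])
```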
## Limitations
- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).
## Acknowledgments
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))
## Citation
If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):
```
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```