---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210MB subset of the CC100-Czech dataset (see the sketch after this list).
- **Distillation Approach**: At each decoding step, Whisper was trained with the standard cross-entropy loss on the next token and was additionally encouraged to align its token distribution with the MLM's prediction via a KL-divergence loss.
- **Tokenizer**: Same byte pair encoding (BPE) as Whisper.
- **Training Data**: CommonVoice Czech 19.0 dataset for speech; CC100-Czech for language modeling.
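
As an illustration of the teacher described above, here is a minimal PyTorch sketch of a BiLSTM masked language model. The layer sizes are hypothetical and chosen only to convey the structure, not the exact 60M-parameter configuration used in training.

```python
import torch
import torch.nn as nn

class BiLSTMMaskedLM(nn.Module):
    """Illustrative BiLSTM masked language model over Whisper's BPE vocabulary."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=768, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), with masked positions replaced by a mask id.
        x = self.embed(token_ids)
        x, _ = self.lstm(x)
        return self.head(x)  # (batch, seq_len, vocab) logits per position
```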

### Loss Function

The training loss combined the standard ASR loss with KD loss:

$$
L_t = (1 - \lambda_{lm}) \, \text{CE}(\text{ASR output}, \text{true token}) + \lambda_{lm} \, \text{KLD}(\text{ASR distribution} \,\|\, \text{MLM prediction})
$$

where $\lambda_{lm}$ is the weight of the distillation term (the "KD Lambda" reported below).
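
A minimal PyTorch sketch of how such a combined loss can be computed, assuming the ASR and MLM logits have already been aligned over the same token positions. The function name and tensor shapes are illustrative, not the thesis code.

```python
import torch.nn.functional as F

def kd_loss(asr_logits, mlm_logits, target_ids, lambda_lm=1e-3):
    """Combine the standard ASR cross-entropy with a KL term toward the MLM teacher.

    asr_logits: (batch, seq_len, vocab) logits from Whisper's decoder
    mlm_logits: (batch, seq_len, vocab) teacher logits for the same positions
    target_ids: (batch, seq_len) ground-truth token ids
    lambda_lm:  weight of the distillation term (the "KD Lambda" above)
    """
    ce = F.cross_entropy(
        asr_logits.reshape(-1, asr_logits.size(-1)),
        target_ids.reshape(-1),
    )
    kld = F.kl_div(
        F.log_softmax(asr_logits, dim=-1),
        F.softmax(mlm_logits, dim=-1),
        reduction="batchmean",
    )
    return (1.0 - lambda_lm) * ce + lambda_lm * kld
```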

### Hyperparameters

| Model              | Learning Rate | KD Lambda | Batch Size |
|--------------------|---------------|-----------|------------|
| Tiny Baseline      | 5e-4           | -         | 8          |
| Tiny Adapted (KD)  | 1e-4           | 1e-3      | 8          |

The learning rates differ because each was optimised separately for its setting.

### Results on CommonVoice Czech

| Model              | Validation Loss | WER  | CER  |
|--------------------|------------------|------|------|
| Tiny Baseline      | 1.236             | 0.447| 0.031|
| Tiny Adapted (KD)  | 0.636             | 0.345| 0.023|

✅ **WER reduced by ~23%**, **CER reduced by ~25%**

This shows that knowledge distillation from even a lightweight MLM significantly improves the language-modelling capabilities of Whisper Tiny for Czech.

---

## Intended Use

This model is intended for research and applications in Czech ASR where lightweight, efficient models are needed but a stronger grasp of the language is still crucial.
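
A minimal usage sketch with the Hugging Face `transformers` ASR pipeline; the repository id and audio file name below are placeholders, not verified paths.

```python
from transformers import pipeline

# Placeholder repo id: replace with this model's actual Hub path.
asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-tiny-czech-kd",
)

# Transcribe a Czech recording (16 kHz mono audio works best with Whisper).
result = asr("sample_czech.wav")
print(result["text"])
```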

## Limitations

- Trained on a relatively small subset (210MB) of CC100-Czech due to computational constraints.
- Optimized for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments
- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite the thesis below (yes, its main topic was assistive ASR):

```
@misc{nadrchal_2025,
    title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
    author={David Nadrchal},
    year={2025},
    note={Bachelor Thesis},
    url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```