Hobit2002 committed · Commit 8e01df4 (verified) · 1 Parent(s): 34ff9ba

Update README.md

Files changed (1): README.md (+87 −3)

---
language: cs
datasets:
- CommonVoice
- CC100
tags:
- automatic-speech-recognition
- whisper
- knowledge-distillation
- czech
license: mit
---

# Whisper Tiny Czech (Knowledge Distillation from MLM)

This model is a fine-tuned version of Whisper Tiny adapted for Czech automatic speech recognition (ASR) using **knowledge distillation (KD)** from a **masked language model (MLM)**.

## Model Description

During early experiments, we observed that Whisper Tiny often produced invalid or unpronounceable Czech words even when given ground-truth context. To address this, we trained a Czech MLM to act as a language teacher during Whisper’s fine-tuning.

- **Teacher Model**: BiLSTM-based masked language model (60M parameters) trained on a 210 MB subset of the CC100-Czech dataset.
- **Distillation Approach**: At each decoding step, Whisper was trained not only with the standard cross-entropy loss on the next token but also encouraged to align its token distribution with the one predicted by the MLM (via a KL-divergence loss); see the sketch after this list.
- **Tokenizer**: The same byte-pair encoding (BPE) tokenizer as Whisper.
- **Training Data**: CommonVoice Czech 19.0 for speech; CC100-Czech for language modelling.
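
As a purely illustrative sketch (not the actual teacher code), the following shows how a BiLSTM masked language model can be queried for a distribution over the token at the current decoding position; the layer sizes and `mask_token_id` are assumed values, not the configuration of the 60M-parameter teacher.

```python
# Hypothetical BiLSTM MLM teacher queried at one decoding position.
# Dimensions and mask_token_id are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMMaskedLM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> per-position logits over the vocabulary
        hidden, _ = self.bilstm(self.embed(token_ids))
        return self.head(hidden)

def teacher_distribution(mlm: BiLSTMMaskedLM, context: torch.Tensor,
                         position: int, mask_token_id: int) -> torch.Tensor:
    """Mask the token at `position` and return the MLM's distribution over it."""
    masked = context.clone()
    masked[:, position] = mask_token_id
    with torch.no_grad():
        logits = mlm(masked)
    return logits[:, position].softmax(dim=-1)  # (batch, vocab_size)
```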

### Loss Function

The training loss combined the standard ASR loss with the KD loss:

\[
L_{t} = \lambda_{lm} \, \text{CE}(\text{asr}, \text{true token}) + (1 - \lambda_{lm}) \, \text{KLD}(\text{asr distribution}, \text{mlm prediction})
\]

where \(\lambda_{lm}\) balances the two components.
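
A minimal sketch of how this per-step objective could be computed, assuming `asr_logits` are Whisper's logits at the current position and `mlm_probs` is the teacher distribution (e.g. from the sketch above); it mirrors the formula but is not the actual training code.

```python
import torch
import torch.nn.functional as F

def kd_step_loss(asr_logits: torch.Tensor,   # (batch, vocab) Whisper logits at step t
                 true_token: torch.Tensor,   # (batch,) ground-truth token ids
                 mlm_probs: torch.Tensor,    # (batch, vocab) teacher distribution
                 lambda_lm: float) -> torch.Tensor:
    """L_t = lambda_lm * CE(asr, true token) + (1 - lambda_lm) * KLD(asr, mlm)."""
    ce = F.cross_entropy(asr_logits, true_token)
    # F.kl_div(log_q, p) computes KL(p || q); here p is the MLM prediction
    # and q is the ASR distribution.
    kld = F.kl_div(F.log_softmax(asr_logits, dim=-1), mlm_probs, reduction="batchmean")
    return lambda_lm * ce + (1 - lambda_lm) * kld
```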

### Hyperparameters

| Model             | Learning Rate | KD Lambda | Batch Size |
|-------------------|---------------|-----------|------------|
| Tiny Baseline     | 5e-4          | -         | 8          |
| Tiny Adapted (KD) | 1e-4          | 1e-3      | 8          |

The learning rates differ because they were optimised separately for each configuration.
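
For orientation, this is roughly how the reported values could be wired into a fine-tuning setup; the optimizer choice (AdamW) is an assumption, as the card only reports the learning rate, KD lambda, and batch size.

```python
import torch
from transformers import WhisperForConditionalGeneration

# Hypothetical setup using the "Tiny Adapted (KD)" row of the table above.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate
batch_size = 8                                              # batch size
lambda_lm = 1e-3                                            # KD lambda
```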

### Results on CommonVoice Czech

| Model             | Validation Loss | WER   | CER   |
|-------------------|-----------------|-------|-------|
| Tiny Baseline     | 1.236           | 0.447 | 0.031 |
| Tiny Adapted (KD) | 0.636           | 0.345 | 0.023 |

✅ **CER reduced by ~25%**
✅ **WER reduced by ~23%**

This shows that even very light knowledge distillation from a lightweight MLM significantly improves the language-modelling capabilities of Whisper Tiny for Czech.
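
To compute comparable WER/CER numbers on your own transcripts, a metric computation along these lines can be used (the `jiwer` package is an assumption here; any WER/CER implementation will do):

```python
import jiwer

references = ["dobrý den jak se máte"]   # ground-truth transcripts
hypotheses = ["dobrý den jak se mátě"]   # model outputs

wer = jiwer.wer(references, hypotheses)  # word error rate
cer = jiwer.cer(references, hypotheses)  # character error rate
print(f"WER: {wer:.3f}, CER: {cer:.3f}")
```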

---

## Intended Use

This model is intended for research and applications in Czech ASR where lightweight, efficient models are needed but a strong grasp of the language remains crucial.
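
Typical inference goes through the Hugging Face Transformers ASR pipeline; the model identifier below is a placeholder, since this card does not state the repository id explicitly.

```python
from transformers import pipeline

# Replace the placeholder with this repository's actual model id on the Hub.
asr = pipeline("automatic-speech-recognition", model="Hobit2002/<this-model-repo>")

result = asr("sample_czech_audio.wav")
print(result["text"])
```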

## Limitations

- Trained on a relatively small subset (210 MB) of CC100-Czech due to computational constraints.
- Optimised for clean, non-code-switched Czech speech (based on CommonVoice data).

## Acknowledgments

- Knowledge distillation ([Hinton et al., 2015](https://arxiv.org/abs/1503.02531))
- Whisper model family ([OpenAI, 2022](https://openai.com/research/whisper))
- CommonVoice dataset ([Mozilla, 2020](https://commonvoice.mozilla.org))
- CC100 dataset ([Conneau et al., 2020](https://arxiv.org/abs/1911.02116))

## Citation

If you use this model, please cite (yes, the main topic of the thesis was indeed assistive ASR):

```bibtex
@misc{nadrchal_2025,
  title={Deep-Learning ASR for a Patient with Permanent Tracheostomy: A Case Study},
  author={David Nadrchal},
  year={2025},
  note={Bachelor's Thesis},
  url={https://github.com/Hobit2002/TracheoSpeech_ASR}
}
```