---
library_name: transformers
tags: []
---

# Speech-to-text model for Uzbek

### Model Description

The Whisper model was fine-tuned with LoRA (Low-Rank Adaptation) to reduce training time and make efficient use of compute resources (GPU/CPU).

- Base model: [whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) (over 1.5 billion parameters)
- LoRA fine-tuned model: [whisper-large-lora-uz](https://huggingface.co/ShakhzoDavronov/whisper-large-lora-uz) (around 15 million parameters)

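To give a feel for where a figure of roughly 15 million adapter parameters can come from, here is a back-of-the-envelope sketch. The LoRA rank (`32`) and the choice of adapting only the query and value projections are assumptions made for illustration and are not stated in this card; the layer counts and hidden size are those of the Whisper large architecture.

```python
def lora_param_count(d_model, rank, encoder_layers, decoder_layers, targets_per_attention):
    """Count LoRA parameters added to square (d_model x d_model) attention projections."""
    # Each adapted projection gets two low-rank matrices: A (rank x d_model) and B (d_model x rank).
    per_module = rank * (d_model + d_model)
    encoder_attention_blocks = encoder_layers * 1  # self-attention only
    decoder_attention_blocks = decoder_layers * 2  # self-attention and cross-attention
    total_blocks = encoder_attention_blocks + decoder_attention_blocks
    return total_blocks * targets_per_attention * per_module


# Whisper large: hidden size 1280, 32 encoder layers, 32 decoder layers.
# ASSUMPTION: rank 32, adapters on the query and value projections only.
trainable = lora_param_count(
    d_model=1280,
    rank=32,
    encoder_layers=32,
    decoder_layers=32,
    targets_per_attention=2,
)
print(f"{trainable:,}")  # 15,728,640
```

Under these assumptions the adapter holds about 1% as many parameters as the base model, which is what makes LoRA fine-tuning cheaper than full fine-tuning.
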
### Datasets

The model was trained on the Uzbek subset of [Common Voice 13.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0/viewer/uz).

### Testing Model

The code below can be used to test the model:

```python
import torch
from transformers import AutomaticSpeechRecognitionPipeline
from transformers import WhisperTokenizer, WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel, PeftConfig

stt_model_id = "ShakhzoDavronov/whisper-large-lora-uz"
language = "Uzbek"
task = "transcribe"

# Load the base Whisper model in 8-bit (requires bitsandbytes), then attach the LoRA adapter.
stt_config = PeftConfig.from_pretrained(stt_model_id)
stt_model = WhisperForConditionalGeneration.from_pretrained(
    stt_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
stt_model = PeftModel.from_pretrained(stt_model, stt_model_id)

# The tokenizer and feature extractor come from the base model.
stt_tokenizer = WhisperTokenizer.from_pretrained(stt_config.base_model_name_or_path, language=language, task=task)
stt_processor = WhisperProcessor.from_pretrained(stt_config.base_model_name_or_path, language=language, task=task)
stt_feature_extractor = stt_processor.feature_extractor

# Force the decoder to transcribe in Uzbek.
forced_decoder_ids = stt_processor.get_decoder_prompt_ids(language=language, task=task)
stt_pipe = AutomaticSpeechRecognitionPipeline(model=stt_model, tokenizer=stt_tokenizer, feature_extractor=stt_feature_extractor)


def transcribe(audio):
    """Transcribe audio (for example, a path to an audio file) and return the text."""
    with torch.cuda.amp.autocast():
        text = stt_pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
    return text
```
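
```python
# `test_audio` (the audio input) and `ner_pipe` (presumably a named-entity
# recognition pipeline) are not defined in this card and must be provided by the user.
extracted_text = transcribe(test_audio)
ner_labels = ner_pipe(extracted_text)
for ner in ner_labels:
    print(ner)
```
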
Results:

```python
Soon
```

### Training Metrics

* WER: ~46.0
* Normalized WER: ~33.0

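The gap between the two numbers above comes from text normalization: differences in casing and punctuation count as errors in the raw WER but not in the normalized one. The sketch below illustrates the idea in plain Python, assuming the metrics are percentages. The `normalize` function is a simplified stand-in written for this example (lowercasing and stripping ASCII punctuation), not the normalizer used to produce the reported scores, and the sample sentences are made up.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (a simplified, assumed normalizer)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Illustrative sentences only, not taken from the evaluation data.
reference = "Salom, dunyo! Bugun havo yaxshi."
hypothesis = "salom dunyo bugun havo yomon"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(raw_wer, normalized_wer)  # 80.0 20.0
```

A production normalizer for Uzbek would need more care than this, for example with apostrophe-like characters that are part of letters such as oʻ and gʻ.
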