Speech-to-text model for Uzbek
Model Description
The Whisper model was fine-tuned with LoRA (Low-Rank Adaptation) to reduce training time and make efficient use of compute resources (GPU/CPU).
- Base model: whisper-large-v2 (about 1.5 billion parameters)
- LoRA fine-tuned model: whisper-large-lora-uz (around 15 million trainable parameters)
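The adapter size quoted above can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for Whisper large (hidden size 1280, 32 encoder and 32 decoder layers). The rank of 32 and the choice of q_proj and v_proj as adapted matrices are assumptions, since this card does not state the LoRA configuration; under those assumptions the count comes out at roughly 15.7 million.

```python
# Back-of-the-envelope count of LoRA trainable parameters for Whisper large.
# ASSUMPTIONS (not stated in this model card): rank r = 32, adapters attached
# to q_proj and v_proj in every attention block.

D_MODEL = 1280             # hidden size of Whisper large
ENCODER_LAYERS = 32        # encoder layers in Whisper large
DECODER_LAYERS = 32        # decoder layers in Whisper large
RANK = 32                  # assumed LoRA rank
TARGETS_PER_ATTENTION = 2  # assumed: q_proj and v_proj


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


def whisper_lora_trainable_params() -> int:
    # Each encoder layer has one attention block (self-attention);
    # each decoder layer has two (self-attention and cross-attention).
    attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
    adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION
    return adapted_matrices * lora_params_per_matrix(D_MODEL, D_MODEL, RANK)


trainable = whisper_lora_trainable_params()
print(f"{trainable:,}")  # 15,728,640
```

With a different rank or a different set of target modules the total changes proportionally, so the figure is only a plausibility check, not a description of the actual training setup.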
Datasets
The widely used Common Voice 13.0 dataset was used for fine-tuning.
Testing Model
The code below can be used to test the model's performance:
import torch
from transformers import AutomaticSpeechRecognitionPipeline
from transformers import WhisperTokenizer, WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel, PeftConfig

stt_model_id = "ShakhzoDavronov/whisper-large-lora-uz"
language = "Uzbek"
task = "transcribe"

# Load the adapter config to find the base model, then load the base model in 8-bit
# (load_in_8bit requires the bitsandbytes package).
stt_config = PeftConfig.from_pretrained(stt_model_id)
stt_model = WhisperForConditionalGeneration.from_pretrained(
    stt_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
# Attach the LoRA adapter weights to the base model.
stt_model = PeftModel.from_pretrained(stt_model, stt_model_id)

stt_tokenizer = WhisperTokenizer.from_pretrained(stt_config.base_model_name_or_path, language=language, task=task)
stt_processor = WhisperProcessor.from_pretrained(stt_config.base_model_name_or_path, language=language, task=task)
stt_feature_extractor = stt_processor.feature_extractor
# Force the decoder to transcribe in Uzbek instead of auto-detecting the language.
forced_decoder_ids = stt_processor.get_decoder_prompt_ids(language=language, task=task)
stt_pipe = AutomaticSpeechRecognitionPipeline(model=stt_model, tokenizer=stt_tokenizer, feature_extractor=stt_feature_extractor)

def transcribe(audio):
    # Run inference under mixed precision.
    with torch.cuda.amp.autocast():
        text = stt_pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
    return text
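
# test_audio is a placeholder: set it to the path of your own audio file.
test_audio = "path/to/audio.wav"
extracted_text = transcribe(test_audio)
print(extracted_text)

# Optional: pass the transcript to a downstream NER pipeline.
# ner_pipe is not defined in this snippet and must be created separately.
# ner_labels = ner_pipe(extracted_text)
# for ner in ner_labels:
#     print(ner)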
Results:
Soon
Training Metrics
- WER: ~46.0
- Normalized WER: ~33.0
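For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and a normalized WER can be computed, assuming the figures above are percentages. The normalizer here (lowercasing and stripping punctuation while keeping apostrophes, which Uzbek Latin script uses in letters such as o' and g') is a simplified stand-in; the card does not say which normalizer was used during evaluation.

```python
import re


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simple_normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation, keep apostrophes."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


reference = "Salom, dunyo!"
hypothesis = "salom dunyo"
print(wer(reference, hypothesis))                                      # 100.0
print(wer(simple_normalize(reference), simple_normalize(hypothesis)))  # 0.0
```

The example shows why the normalized figure is lower: casing and punctuation differences count as word errors in the raw score but disappear after normalization.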