---
datasets:
- cobrayyxx/COVOST2_ID-EN
language:
- id
- en
metrics:
- wer
- bleu
- chrf
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.

## Intended uses & limitations

This model predicts the English translation of Indonesian audio.

## How to Use

This is how to use the model with Faster-Whisper.

1. Convert the model into the CTranslate2 format with float16 quantization:

```
!ct2-transformers-converter \
  --model cobrayyxx/whisper_translation_ID-EN \
  --output_dir ct2-whisper-translation-finetuned \
  --quantization float16 \
  --copy_files tokenizer_config.json
```

2. Load the converted model with the `faster_whisper` library:

```
from faster_whisper import WhisperModel

model_name = "ct2-whisper-translation-finetuned"  # converted model (after fine-tuning)

# Run on GPU with FP16
model = WhisperModel(model_name, device="cuda", compute_type="float16")
```

3. Now the loaded model can be used for translation:

```
audio_path = "audio.wav"  # path to your Indonesian audio file

tgt_lang = "en"
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    language=tgt_lang,
    vad_filter=True,
)
translation = " ".join(segment.text.strip() for segment in segments)
```

Note: If you face a kernel error every time you run the code above, you have to install `nvidia-cublas` and `nvidia-cudnn`:

```
apt update
apt install libcudnn9-cuda-12
```

and install the libraries using pip.
[Read the documentation for more.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)

```
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.

# Training Procedure

## Training Results

| Epoch | Training Loss | Validation Loss | WER       |
|-------|---------------|-----------------|-----------|
| 1     | 0.757300      | 0.763333        | 49.192132 |
| 2     | 0.351300      | 0.778579        | 49.297506 |
| 3     | 0.156600      | 0.828453        | 49.174570 |
| 4     | 0.066600      | 0.894528        | 50.087812 |
| 5     | 0.027600      | 0.944322        | 49.947313 |
| 6     | 0.013600      | 0.976878        | 49.964875 |
| 7     | 0.005900      | 1.012044        | 50.544433 |
| 8     | 0.003300      | 1.050839        | 50.526870 |
| 9     | 0.002800      | 1.063206        | 50.684932 |
| 10    | 0.002400      | 1.067140        | 50.807868 |

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model shows a clear improvement over the baseline.

| Model      |  BLEU | chrF++ |
|------------|------:|-------:|
| Baseline   | 25.87 |  43.79 |
| Fine-Tuned | 37.02 |  56.04 |

### Evaluation details

- BLEU: measures the n-gram overlap between the predicted and reference text.
- chrF++: uses character (and word) n-grams for evaluation, making it particularly suitable for morphologically rich languages.

## Framework Versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.0
- Tokenizers 0.21.0

# Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.
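As a reference for the WER figures reported in the training table, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a minimal pure-Python illustration, not the exact scorer used during training (that typically comes from a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Two deleted words out of six reference words -> WER of 2/6
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that WER above 100% is possible when the hypothesis contains many insertions, which is why the values in the table can hover around 50 even for a reasonable model.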