---
datasets:
- cobrayyxx/COVOST2_ID-EN
language:
- id
- en
metrics:
- wer
- bleu
- chrf
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

## Model description

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.

## Intended uses & limitations

This model predicts the English translation of Indonesian audio.

## How to Use

This is how to use the model with Faster-Whisper.

1. Convert the model into the CTranslate2 format with float16 quantization:

```
!ct2-transformers-converter \
  --model cobrayyxx/whisper_translation_ID-EN \
  --output_dir ct2-whisper-translation-finetuned \
  --quantization float16 \
  --copy_files tokenizer_config.json
```

2. Load the converted model with the `faster_whisper` library:

```
from faster_whisper import WhisperModel

model_name = "ct2-whisper-translation-finetuned"  # converted model (after fine-tuning)

# Run on GPU with FP16
model = WhisperModel(model_name, device="cuda", compute_type="float16")
```

3. Now the loaded model can be used for translation:

```
audio_path = "audio.wav"  # path to your Indonesian audio file

tgt_lang = "en"
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    language=tgt_lang,
    vad_filter=True,
)
translation = " ".join(segment.text.strip() for segment in segments)
```

Note: If you face a kernel error every time you run the code above, you have to install `nvidia-cublas` and `nvidia-cudnn`:

```
apt update
apt install libcudnn9-cuda-12
```

and install the libraries using pip.
[Read the documentation for more.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)

```
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.

# Training Procedure

## Training Results

| Epoch | Training Loss | Validation Loss | WER       |
|-------|---------------|-----------------|-----------|
| 1     | 0.757300      | 0.763333        | 49.192132 |
| 2     | 0.351300      | 0.778579        | 49.297506 |
| 3     | 0.156600      | 0.828453        | 49.174570 |
| 4     | 0.066600      | 0.894528        | 50.087812 |
| 5     | 0.027600      | 0.944322        | 49.947313 |
| 6     | 0.013600      | 0.976878        | 49.964875 |
| 7     | 0.005900      | 1.012044        | 50.544433 |
| 8     | 0.003300      | 1.050839        | 50.526870 |
| 9     | 0.002800      | 1.063206        | 50.684932 |
| 10    | 0.002400      | 1.067140        | 50.807868 |

## Model Evaluation

The baseline and fine-tuned models were evaluated with the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model shows a clear improvement over the baseline.

| Model      |  BLEU | chrF++ |
|------------|------:|-------:|
| Baseline   | 25.87 |  43.79 |
| Fine-Tuned | 37.02 |  56.04 |

### Evaluation details

- BLEU: measures the n-gram overlap between the predicted and reference text.
- chrF++: uses character (and word) n-grams for evaluation, making it particularly suitable for morphologically rich languages.

## Framework Versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.0
- Tokenizers 0.21.0

# Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.
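As a reference for the WER figures reported in the training table, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a minimal pure-Python illustration, not the exact scorer used during training (that typically comes from a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Two deleted words out of six reference words -> WER of 2/6
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that WER above 100% is possible when the hypothesis contains many insertions, which is why the values in the table can hover around 50 even for a reasonable model.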