dangtr0408 committed
Commit 8dec2ab · 1 Parent(s): d57f9a7

fix 1AM bug, add gradio UI
Modules/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (186 Bytes)
 
Modules/__pycache__/hifigan.cpython-311.pyc DELETED
Binary file (30.1 kB)
 
Modules/__pycache__/utils.cpython-311.pyc DELETED
Binary file (1.19 kB)
 
README.md CHANGED
@@ -1,131 +1,105 @@
1
- <<<<<<< HEAD
2
- ---
3
- license: cc-by-nc-sa-4.0
4
- ---
5
- =======
6
- # StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
 
 
 
 
 
7
 
8
- ### Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
9
 
10
- > In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
11
 
12
- Paper: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
13
 
14
- Audio samples: [https://styletts2.github.io/](https://styletts2.github.io/)
 
 
15
 
16
- Online demo: [Hugging Face](https://huggingface.co/spaces/styletts2/styletts2) (thank [@fakerybakery](https://github.com/fakerybakery) for the wonderful online demo)
17
 
18
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/) [![Discord](https://img.shields.io/discord/1197679063150637117?logo=discord&logoColor=white&label=Join%20our%20Community)](https://discord.gg/ha8sxdG2K4)
19
 
20
- ## TODO
21
- - [x] Training and inference demo code for single-speaker models (LJSpeech)
22
- - [x] Test training code for multi-speaker models (VCTK and LibriTTS)
23
- - [x] Finish demo code for multispeaker model and upload pre-trained models
24
- - [x] Add a finetuning script for new speakers with base pre-trained multispeaker models
25
- - [ ] Fix DDP (accelerator) for `train_second.py` **(I have tried everything I could to fix this but had no success, so if you are willing to help, please see [#7](https://github.com/yl4579/StyleTTS2/issues/7))**
 
 
 
26
 
27
- ## Pre-requisites
28
- 1. Python >= 3.7
29
- 2. Clone this repository:
30
- ```bash
31
- git clone https://github.com/yl4579/StyleTTS2.git
32
- cd StyleTTS2
33
- ```
34
- 3. Install python requirements:
35
- ```bash
36
- pip install -r requirements.txt
37
- ```
38
- On Windows add:
39
- ```bash
40
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U
41
- ```
42
- Also install phonemizer and espeak if you want to run the demo:
43
- ```bash
44
- pip install phonemizer
45
- sudo apt-get install espeak-ng
46
- ```
47
- 4. Download and extract the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/), unzip to the data folder and upsample the data to 24 kHz. The text aligner and pitch extractor are pre-trained on 24 kHz data, but you can easily change the preprocessing and re-train them using your own preprocessing.
48
- For LibriTTS, you will need to combine train-clean-360 with train-clean-100 and rename the folder train-clean-460 (see [val_list_libritts.txt](https://github.com/yl4579/StyleTTS/blob/main/Data/val_list_libritts.txt) as an example).
49
 
50
- ## Training
51
- First stage training:
52
- ```bash
53
- accelerate launch train_first.py --config_path ./Configs/config.yml
54
- ```
55
- Second stage training **(DDP version not working, so the current version uses DP, again see [#7](https://github.com/yl4579/StyleTTS2/issues/7) if you want to help)**:
56
- ```bash
57
- python train_second.py --config_path ./Configs/config.yml
58
- ```
59
- You can run both consecutively and it will train both the first and second stages. The model will be saved in the format "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at `log_dir`.
60
-
61
- The data list format needs to be `filename.wav|transcription|speaker`, see [val_list.txt](https://github.com/yl4579/StyleTTS2/blob/main/Data/val_list.txt) as an example. The speaker labels are needed for multi-speaker models because we need to sample reference audio for style diffusion model training.
62
-
63
- ### Important Configurations
64
- In [config.yml](https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml), there are a few important configurations to take care of:
65
- - `OOD_data`: The path for out-of-distribution texts for SLM adversarial training. The format should be `text|anything`.
66
- - `min_length`: Minimum length of OOD texts for training. This is to make sure the synthesized speech has a minimum length.
67
- - `max_len`: Maximum length of audio for training. The unit is frame. Since the default hop size is 300, one frame is approximately `300 / 24000` (0.0125) second. Lowering this if you encounter the out-of-memory issue.
68
- - `multispeaker`: Set to true if you want to train a multispeaker model. This is needed because the architecture of the denoiser is different for single and multispeaker models.
69
- - `batch_percentage`: This is to make sure during SLM adversarial training there are no out-of-memory (OOM) issues. If you encounter OOM problem, please set a lower number for this.
70
-
71
- ### Pre-trained modules
72
- In [Utils](https://github.com/yl4579/StyleTTS2/tree/main/Utils) folder, there are three pre-trained models:
73
- - **[ASR](https://github.com/yl4579/StyleTTS2/tree/main/Utils/ASR) folder**: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: [yl4579/AuxiliaryASR](https://github.com/yl4579/AuxiliaryASR).
74
- - **[JDC](https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC) folder**: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: [yl4579/PitchExtractor](https://github.com/yl4579/PitchExtractor).
75
- - **[PLBERT](https://github.com/yl4579/StyleTTS2/tree/main/Utils/PLBERT) folder**: It contains the pre-trained [PL-BERT](https://arxiv.org/abs/2301.08810) model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: [yl4579/PL-BERT](https://github.com/yl4579/PL-BERT). You can also use the [multilingual PL-BERT](https://huggingface.co/papercup-ai/multilingual-pl-bert) which supports 14 languages.
76
-
77
- ### Common Issues
78
- - **Loss becomes NaN**: If it is the first stage, please make sure you do not use mixed precision, as it can cause loss becoming NaN for some particular datasets when the batch size is not set properly (need to be more than 16 to work well). For the second stage, please also experiment with different batch sizes, with higher batch sizes being more likely to cause NaN loss values. We recommend the batch size to be 16. You can refer to issues [#10](https://github.com/yl4579/StyleTTS2/issues/10) and [#11](https://github.com/yl4579/StyleTTS2/issues/11) for more details.
79
- - **Out of memory**: Please either use lower `batch_size` or `max_len`. You may refer to issue [#10](https://github.com/yl4579/StyleTTS2/issues/10) for more information.
80
- - **Non-English dataset**: You can train on any language you want, but you will need to use a pre-trained PL-BERT model for that language. We have a pre-trained [multilingual PL-BERT](https://huggingface.co/papercup-ai/multilingual-pl-bert) that supports 14 languages. You may refer to [yl4579/StyleTTS#10](https://github.com/yl4579/StyleTTS/issues/10) and [#70](https://github.com/yl4579/StyleTTS2/issues/70) for some examples to train on Chinese datasets.
81
-
82
- ## Finetuning
83
- The script is modified from `train_second.py` which uses DP, as DDP does not work for `train_second.py`. Please see the bold section above if you are willing to help with this problem.
84
  ```bash
85
- python train_finetune.py --config_path ./Configs/config_ft.yml
 
 
 
 
86
  ```
87
- Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration `config_ft.yml` finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVidia A100. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than LJSpeech model trained from scratch with 24 hours of speech data, which took around 2.5 days to finish on four A100. The samples can be found at [#65 (comment)](https://github.com/yl4579/StyleTTS2/discussions/65#discussioncomment-7668393).
88
 
89
- If you are using a **single GPU** (because the script doesn't work with DDP) and want to save training speed and VRAM, you can do (thank [@korakoe](https://github.com/korakoe) for making the script at [#100](https://github.com/yl4579/StyleTTS2/pull/100)):
 
90
  ```bash
91
- accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
 
 
92
  ```
93
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb)
94
 
95
- ### Common Issues
96
- [@Kreevoz](https://github.com/Kreevoz) has made detailed notes on common issues in finetuning, with suggestions in maximizing audio quality: [#81](https://github.com/yl4579/StyleTTS2/discussions/81). Some of these also apply to training from scratch. [@IIEleven11](https://github.com/IIEleven11) has also made a guideline for fine-tuning: [#128](https://github.com/yl4579/StyleTTS2/discussions/128).
 
97
 
98
- - **Out of memory after `joint_epoch`**: This is likely because your GPU RAM is not big enough for SLM adversarial training run. You may skip that but the quality could be worse. Setting `joint_epoch` a larger number than `epochs` could skip the SLM advesariral training.
99
 
100
- ## Inference
101
- Please refer to [Inference_LJSpeech.ipynb](https://github.com/yl4579/StyleTTS2/blob/main/Demo/Inference_LJSpeech.ipynb) (single-speaker) and [Inference_LibriTTS.ipynb](https://github.com/yl4579/StyleTTS2/blob/main/Demo/Inference_LibriTTS.ipynb) (multi-speaker) for details. For LibriTTS, you will also need to download [reference_audio.zip](https://huggingface.co/yl4579/StyleTTS2-LibriTTS/resolve/main/reference_audio.zip) and unzip it under the `demo` before running the demo.
102
 
103
- - The pretrained StyleTTS 2 on LJSpeech corpus in 24 kHz can be downloaded at [https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main](https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main).
104
 
105
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Demo_LJSpeech.ipynb)
106
 
107
- - The pretrained StyleTTS 2 model on LibriTTS can be downloaded at [https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main](https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main).
108
 
109
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Demo_LibriTTS.ipynb)
110
 
 
111
 
112
- You can import StyleTTS 2 and run it in your own code. However, the inference depends on a GPL-licensed package, so it is not included directly in this repository. A [GPL-licensed fork](https://github.com/NeuralVox/StyleTTS2) has an importable script, as well as an experimental streaming API, etc. A [fully MIT-licensed package](https://pypi.org/project/styletts2/) that uses gruut (albeit lower quality due to mismatch between phonemizer and gruut) is also available.
113
 
114
- ***Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.***
115
 
116
- ### Common Issues
117
- - **High-pitched background noise**: This is caused by numerical float differences in older GPUs. For more details, please refer to issue [#13](https://github.com/yl4579/StyleTTS2/issues/13). Basically, you will need to use more modern GPUs or do inference on CPUs.
118
- - **Pre-trained model license**: You only need to abide by the above rules if you use **the pre-trained models** and the voices are **NOT** in the training set, i.e., your reference speakers are not from any open access dataset. For more details of rules to use the pre-trained models, please see [#37](https://github.com/yl4579/StyleTTS2/issues/37).
119
 
120
  ## References
121
- - [archinetai/audio-diffusion-pytorch](https://github.com/archinetai/audio-diffusion-pytorch)
 
 
122
  - [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
123
- - [rishikksh20/iSTFTNet-pytorch](https://github.com/rishikksh20/iSTFTNet-pytorch)
124
- - [nii-yamagishilab/project-NN-Pytorch-scripts/project/01-nsf](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf)
125
 
126
  ## License
127
 
128
- Code: MIT License
129
 
130
- Pre-Trained Models: Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.
131
- >>>>>>> 062910b (first commit)
 
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - capleaf/viVoice
5
+ language:
6
+ - vi
7
+ - en
8
+ base_model:
9
+ - yl4579/StyleTTS2-LibriTTS
10
+ pipeline_tag: text-to-speech
11
+ ---
12
 
13
+ # StyleTTS 2 - lite
14
 
 
15
 
16
+ > A lightweight, efficient variation of the StyleTTS 2 text‐to‐speech model, optimized for rapid integration into your applications. With a compact 90 million parameter footprint and built‐in speaker‑and‑language tagging, you can seamlessly switch voices and languages even within a single sentence.
17
 
18
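+ Speaker tags (`[id_1]`, `[id_2]`, …) select which reference voice reads the following text, and language tags such as `[en-us]{...}` force the phonemizer language for the wrapped words. A short illustration, taken from the bundled Gradio demo (roughly: "With StyleTTS2-lite you can use a language tag to make the model read English reliably, and a speaker tag to switch quickly between voices"):
+
+ ```
+ [id_1] Với [en-us]{StyleTTS2-lite} bạn có thể sử dụng [en-us]{language tag} để mô hình chắc chắn đọc bằng tiếng Anh, [id_2]cũng như sử dụng [en-us]{speaker tag} để chuyển đổi nhanh giữa các giọng đọc.
+ ```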
+ ## Online Demo
19
+ Explore the model on Hugging Face Spaces:
20
+ https://huggingface.co/spaces/dangtr0408/StyleTTS2-lite-vi-space
21
 
 
22
 
23
+ ## Training Details
24
 
25
+ 1. **Base Checkpoint:**
26
+ - Initialized from the official StyleTTS 2 LibriTTS weights.
27
+ 2. **Token Extension:**
28
+ - Expanded the token set to 189 symbols to ensure full Vietnamese IPA compatibility.
29
+ 3. **Training Data:**
30
+ - **FonosVietnam** (extracted from the viVoice corpus)
31
+ - **VoizFM** (extracted from the viVoice corpus)
32
+ 4. **Training Schedule:**
33
+ - Trained for 120,000 steps.
34
 
35
+ ## Model Architecture
36
+
37
+ | Component | Parameters |
38
+ | -------------- | ------------- |
39
+ | Decoder | 54,289,492 |
40
+ | Predictor | 16,194,612 |
41
+ | Text Encoder | 56,120,320 |
42
+ | Style Encoder | 13,845,440 |
43
+ | **Total** | **89,941,576** |
44
+
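+ The split above can be double-checked against the released weights. A minimal sketch, assuming the `StyleTTS2` wrapper from `inference.py` registers these components as ordinary PyTorch submodules (it subclasses `torch.nn.Module`, but the exact child names depend on the implementation):
+
+ ```python
+ from inference import StyleTTS2
+
+ model = StyleTTS2("Models/config.yml", "Models/model.pth")
+
+ # Parameter count per registered submodule, plus the overall total.
+ for name, module in model.named_children():
+     print(f"{name}: {sum(p.numel() for p in module.parameters()):,}")
+ print(f"Total: {sum(p.numel() for p in model.parameters()):,}")
+ ```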
45
+
46
+ ## Prerequisites
47
+
48
+ - **Python:** Version 3.7 or higher
49
+ - **Git:** To clone the repository
50
+
51
+ ## Installation & Setup
52
+
53
+ 1. Clone the repository:
 
 
 
54
55
  ```bash
56
+
57
+ git clone https://huggingface.co/dangtr0408/StyleTTS2-lite-vi
58
+
59
+ cd StyleTTS2-lite-vi
60
+
61
  ```
 
62
 
63
+ 2. Install dependencies:
64
+
65
  ```bash
66
+
67
+ pip install -r requirements.txt
68
+
69
  ```
 
70
 
71
+
72
+
73
+ 3. On **Linux**, manually install eSpeak NG:
74
 
75
+ ```bash
76
 
77
+ sudo apt-get install espeak-ng
 
78
 
79
+ ```
80
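+ To confirm the phonemizer backend is visible (optional check):
+
+ ```bash
+ espeak-ng --version
+ ```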
 
81
+ ## Usage Example
82
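+ A minimal local-inference sketch, mirroring the bundled `app.py` (the paths, the denoise strength of 0.6, the 18 diffusion steps, and the positional argument order of `get_styles` / `generate` all follow that demo script):
+
+ ```python
+ import os
+ import numpy as np
+ import soundfile as sf
+ import torch
+
+ from inference import StyleTTS2
+
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ model = StyleTTS2(os.path.join("Models", "config.yml"),
+                   os.path.join("Models", "model.pth")).eval().to(device)
+
+ # Each entry maps a speaker tag to a reference clip; the key is referenced as [id_1], [id_2], ... in the text.
+ speakers = {
+     "id_1": {"path": "reference_audio/vn_1.wav", "lang": "vi", "speed": 1.0},
+ }
+
+ text = "Chỉ với khoảng 90 triệu tham số, [en-us]{StyleTTS2-lite} có thể dễ dàng tạo giọng nói với tốc độ cao."
+
+ with torch.no_grad():
+     styles = model.get_styles(speakers, 0.6, True)            # (speakers, denoise, avg_style)
+     audio = model.generate(text, styles, True, 18, "[id_1]")  # (text, styles, stabilize, steps, default speaker)
+
+ audio = audio / np.abs(audio).max()              # peak-normalize
+ sf.write("output.wav", audio, samplerate=24000)  # the demo writes 24 kHz audio
+ ```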
 
 
83
 
84
+ ## Fine-tune
85
 
86
+ Coming soon (the fine-tuning code is being cleaned up).
87
 
88
+ ## Disclaimer
89
 
90
+ ***Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.***
91
 
 
 
 
92
 
93
  ## References
94
+
95
+ - [yl4579/StyleTTS2](https://arxiv.org/abs/2306.07691)
96
+
97
  - [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
98
+
99
+ - [capleaf/viVoice](https://huggingface.co/datasets/capleaf/viVoice)
100
 
101
  ## License
102
 
103
+ **Code: MIT License**
104
 
105
+ **Model: CC-BY-NC-SA-4.0**
 
app.py ADDED
@@ -0,0 +1,126 @@
1
+ import gradio as gr
2
+ import os
3
+ import soundfile as sf
4
+ import numpy as np
5
+ import torch
6
+ import traceback
7
+ from inference import StyleTTS2
8
+ repo_dir = './'
9
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
10
+ config_path = os.path.join(repo_dir, "Models", "config.yml")
11
+ models_path = os.path.join(repo_dir, "Models", "model.pth")
12
+ model = StyleTTS2(config_path, models_path).eval().to(device)
13
+ voice_path = os.path.join(repo_dir, "reference_audio")
14
+ eg_voices = [os.path.join(voice_path,"vn_1.wav"), os.path.join(voice_path,"vn_2.wav")]
15
+ eg_texts = [
16
+ "Chỉ với khoảng 90 triệu tham số, [en-us]{StyleTTS2-lite} có thể dễ dàng tạo giọng nói với tốc độ cao.",
17
+ "[id_1] Với [en-us]{StyleTTS2-lite} bạn có thể sử dụng [en-us]{language tag} để mô hình chắc chắn đọc bằng tiếng Anh, [id_2]cũng như sử dụng [en-us]{speaker tag} để chuyển đổi nhanh giữa các giọng đọc.",
18
+ ]
19
+
20
+
21
+ # Core inference function
22
+ def main(reference_paths, text_prompt, denoise, avg_style, stabilize):
23
+ try:
24
+ speakers = {}
25
+ for i, path in enumerate(reference_paths, 1):
26
+ speaker_id = f"id_{i}"
27
+ speakers[speaker_id] = {
28
+ "path": path,
29
+ "lang": "vi",
30
+ "speed": 1.0
31
+ }
32
+
33
+ with torch.no_grad():
34
+ styles = model.get_styles(speakers, denoise, avg_style)
35
+ r = model.generate(text_prompt, styles, stabilize, 18, "[id_1]")
36
+ r = r / np.abs(r).max()
37
+
38
+ sf.write("output.wav", r, samplerate=24000)
39
+ return "output.wav", "Audio generated successfully!"
40
+
41
+ except Exception as e:
42
+ error_message = traceback.format_exc()
43
+ return None, error_message
44
+
45
+ def on_file_upload(file_list):
46
+ if not file_list:
47
+ return None, "No file uploaded yet."
48
+
49
+ unique_files = {}
50
+ for file_path in file_list:
51
+ file_name = os.path.basename(file_path)
52
+ unique_files[file_name] = file_path #update and remove duplicate
53
+
54
+ uploaded_infos = []
55
+ uploaded_file_names = list(unique_files.keys())
56
+ for i in range(len(uploaded_file_names)):
57
+ uploaded_infos.append(f"[id_{i+1}]: {uploaded_file_names[i]}")
58
+
59
+ summary = "\n".join(uploaded_infos)
60
+ return list(unique_files.values()), f"Current reference audios:\n{summary}"
61
+
62
+ def gen_example(reference_paths, text_prompt):
63
+ output, status = main(reference_paths, text_prompt, 0.6, True, True)
64
+ return output, reference_paths, status
65
+
66
+
67
+ # Gradio UI
68
+ with gr.Blocks() as demo:
69
+ gr.HTML("<h1 style='text-align: center;'>StyleTTS2‑Lite Demo</h1>")
70
+ gr.Markdown(
71
+ "Download the local inference package from Hugging Face: "
72
+ "[StyleTTS2‑Lite (Vietnamese)]"
73
+ "(https://huggingface.co/dangtr0408/StyleTTS2-lite-vi/)."
74
+ )
75
+ gr.Markdown(
76
+ "Please specify a language tag in your inputs if the word is not Vietnamese, e.g., [en-us]{ } for English. For more information, see "
77
+ "[eSpeakNG docs]"
78
+ "(https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)"
79
+ )
80
+
81
+ with gr.Row(equal_height=True):
82
+ with gr.Column(scale=1):
83
+ text_prompt = gr.Textbox(label="Text Prompt", placeholder="Enter your text here...", lines=4)
84
+ with gr.Column(scale=1):
85
+ avg_style = gr.Checkbox(label="Use Average Styles", value=True)
86
+ stabilize = gr.Checkbox(label="Stabilize Speaking Speed", value=True)
87
+ denoise = gr.Slider(0.0, 1.0, step=0.1, value=0.6, label="Denoise Strength")
88
+
89
+ with gr.Row(equal_height=True):
90
+ with gr.Column(scale=1):
91
+ reference_audios = gr.File(label="Reference Audios", file_types=[".wav", ".mp3"], file_count="multiple", height=150)
92
+ gen_button = gr.Button("Generate")
93
+ with gr.Column(scale=1):
94
+ synthesized_audio = gr.Audio(label="Generated Audio", type="filepath")
95
+
96
+ status = gr.Textbox(label="Status", interactive=False, lines=3)
97
+
98
+ reference_audios.change(
99
+ on_file_upload,
100
+ inputs=[reference_audios],
101
+ outputs=[reference_audios, status]
102
+ )
103
+
104
+ gen_button.click(
105
+ fn=main,
106
+ inputs=[
107
+ reference_audios,
108
+ text_prompt,
109
+ denoise,
110
+ avg_style,
111
+ stabilize
112
+ ],
113
+ outputs=[synthesized_audio, status]
114
+ )
115
+
116
+ gr.Examples(
117
+ examples=[[[eg_voices[0]], eg_texts[0]], [eg_voices, eg_texts[1]]],
118
+ inputs=[reference_audios, text_prompt],
119
+ outputs=[synthesized_audio, reference_audios, status],
120
+ fn=gen_example,
121
+ cache_examples=False,
122
+ label="Examples",
123
+ run_on_click=True
124
+ )
125
+
126
+ demo.launch()
inference.py CHANGED
@@ -323,7 +323,7 @@ class StyleTTS2(torch.nn.Module):
323
  text = re.sub(lang_pattern, replacement_func, text)
324
 
325
  texts = re.split(r'(\[id_\d+\])', text) #split the text by speaker ids while keeping the ids.
326
- if len(texts) <= 1 or bool(re.match(r'(\[id_\d+\])', texts[0])): #Add a default speaker
327
  texts.insert(0, default_speaker)
328
  curr_id = None
329
  for i in range(len(texts)): #remove consecutive ids
 
323
  text = re.sub(lang_pattern, replacement_func, text)
324
 
325
  texts = re.split(r'(\[id_\d+\])', text) #split the text by speaker ids while keeping the ids.
326
+ if len(texts) <= 1 or not re.match(r'(\[id_\d+\])', texts[0]): #Add a default speaker
327
  texts.insert(0, default_speaker)
328
  curr_id = None
329
  for i in range(len(texts)): #remove consecutive ids
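
For context on the condition above: `re.split` with a capturing group keeps the `[id_*]` delimiters in its output, and the first element is whatever precedes the first tag (an empty string when the text starts with one). A quick illustration of the split behaviour the check operates on:

```python
import re

re.split(r'(\[id_\d+\])', "[id_1] xin chào")  # -> ['', '[id_1]', ' xin chào']
re.split(r'(\[id_\d+\])', "xin chào")         # -> ['xin chào']
```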