The HelloSippyRT model is designed to adapt Microsoft's SpeechT5 Text-to-Speech model for real-time use.
## Problem Statement

The original vocoder performs optimally only when provided with almost the full Mel sequence produced from a single
text input at once. This is not ideal for real-time applications, where we aim to begin audio output quickly.
Using smaller chunks results in "clicking" distortions between adjacent audio frames. Fine-tuning attempts on
Microsoft's HiFiGAN vocoder were unsuccessful.

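For illustration, here is a minimal sketch of the naive chunked decoding described above, assuming the public
`microsoft/speecht5_hifigan` checkpoint from Hugging Face `transformers` (the helper name and the 8-frame chunk
size are ours):

```python
import torch
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
vocoder.eval()

def vocode_chunked(mel: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """mel: (num_frames, 80) log-Mel spectrogram; returns a waveform
    assembled from independently vocoded chunks."""
    pieces = []
    with torch.no_grad():
        for i in range(0, mel.size(0), chunk):
            # Each chunk is decoded with no knowledge of its neighbours,
            # so audible clicks appear at the joins between pieces.
            pieces.append(vocoder(mel[i:i + chunk]))
    return torch.cat(pieces)
```
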
## Solution

Our approach involves a smaller model that takes a fixed chunk of 8 Mel frames, plus two pre-frames and two
post-frames. These frames are processed along with the original vocoder's 12 audio frames of 256 samples each.
The model employs convolutional input layers for both the audio and Mel frames to generate hidden dimensions,
followed by two linear layers and a final convolution layer. The output is then multiplied with the original
8 audio frames to produce corrected frames.

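A minimal PyTorch sketch of this layout; the class name, hidden size, kernel widths, and the use of a transposed
convolution for the output layer are illustrative assumptions rather than the trained configuration:

```python
import torch
import torch.nn as nn

HOP = 256   # audio samples per Mel frame
CHUNK = 8   # Mel frames corrected per step
CTX = 2     # pre-/post-context frames on each side

class PostVocoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Convolutional input layers for the Mel and audio streams.
        self.mel_in = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.audio_in = nn.Conv1d(1, hidden, kernel_size=HOP, stride=HOP)
        self.lin1 = nn.Linear(2 * hidden, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        # Final convolution expands back to one correction per sample.
        self.out = nn.ConvTranspose1d(hidden, 1, kernel_size=HOP, stride=HOP)

    def forward(self, mel: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # mel:   (B, 80, 12)      -- 2 pre + 8 chunk + 2 post frames
        # audio: (B, 1, 12 * HOP) -- raw vocoder output for the same frames
        h = torch.cat([self.mel_in(mel), self.audio_in(audio)], dim=1)
        h = self.lin2(torch.relu(self.lin1(h.transpose(1, 2)))).transpose(1, 2)
        gain = self.out(torch.relu(h))            # (B, 1, 12 * HOP)
        mid = slice(CTX * HOP, (CTX + CHUNK) * HOP)
        # Multiply with the original 8 audio frames to get corrected frames.
        return audio[:, :, mid] * gain[:, :, mid]
```
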
## Training Details

We trained the model using a subset of 3,000 audio utterances from the `LJSpeech-1.1` dataset. SpeechT5's
Speech-To-Speech module was employed to replace the voice in each utterance with the voice of a speaker randomly
selected from the `Matthijs/cmu-arctic-xvectors` dataset. The reference Mel spectra produced this way were fed to
the vocoder and post-vocoder in chunks. The FFT of the reference waveform generated in "continuous" mode served as
the basis for the loss-function calculation.

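A sketch of that re-voicing step, assuming the public `microsoft/speecht5_vc` Speech-To-Speech checkpoint (the
helper and its exact call pattern are illustrative; the real pipeline iterates over all 3,000 utterances):

```python
import random
import torch
from datasets import load_dataset
from transformers import (SpeechT5ForSpeechToSpeech, SpeechT5HifiGan,
                          SpeechT5Processor)

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
s2s = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

def make_reference(waveform, sampling_rate=16000):
    """Re-voice one LJSpeech utterance with a randomly chosen speaker."""
    speaker = torch.tensor(random.choice(xvectors)["xvector"]).unsqueeze(0)
    inputs = processor(audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    # The generated spectrogram is the reference Mel fed to the vocoder and
    # post-vocoder in chunks; decoding it in one continuous pass yields the
    # waveform whose FFT anchors the loss.
    mel = s2s.generate_speech(inputs["input_values"], speaker)
    return mel, vocoder(mel)
```
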

During training, the original vocoder was frozen; only our model was trained to mimic the original vocoder as
closely as possible in continuous mode.

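A sketch of that setup; the STFT parameters and the magnitude-difference form of the loss are our assumptions,
since the README does not pin down the exact formulation:

```python
import torch
import torch.nn as nn

def fft_loss(corrected: torch.Tensor, reference: torch.Tensor,
             n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Magnitude-spectrum distance between the chunk-corrected waveform
    and the continuous-mode reference (n_fft/hop are assumptions)."""
    window = torch.hann_window(n_fft, device=corrected.device)
    ref = torch.stft(reference, n_fft, hop, window=window, return_complex=True)
    out = torch.stft(corrected, n_fft, hop, window=window, return_complex=True)
    return torch.mean(torch.abs(out.abs() - ref.abs()))

def freeze(module: nn.Module) -> nn.Module:
    """Lock the original vocoder so only the post-vocoder receives updates."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()

# e.g. freeze(vocoder); torch.optim.Adam(post_vocoder.parameters(), lr=1e-4)
# with vocoder / post_vocoder as in the sketches above.
```
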
## Evaluation

The model has been evaluated by producing TTS output from pure text input, using quotes from "Futurama",
"The Matrix", and "2001: A Space Odyssey" retrieved from the Wikiquote site, with purely random speaker vectors
as well as vectors from the `Matthijs/cmu-arctic-xvectors` dataset. The quality of the output has been found
satisfactory for our particular use.

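A minimal example of that evaluation path with a purely random speaker vector (the quote and the unit
normalization of the random 512-dimensional x-vector are illustrative):

```python
import torch
from transformers import (SpeechT5ForTextToSpeech, SpeechT5HifiGan,
                          SpeechT5Processor)

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="I'm sorry, Dave. I'm afraid I can't do that.",
                   return_tensors="pt")
# A purely random speaker vector in place of a cmu-arctic-xvectors entry.
speaker = torch.nn.functional.normalize(torch.randn(1, 512))
speech = tts.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
```
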
## Source Code & Links