sobomax committed on
Commit 91e99b3 · 1 Parent(s): 2c75a52

Add basic readme.

Files changed (1)
  1. README.md +34 -0
README.md CHANGED
@@ -1,3 +1,37 @@
  ---
  license: bsd-2-clause
+ tags:
+ - tts
+ - real-time
+ - vocoder
+ library_name: transformers
  ---
+
+ # HelloSippyRT PostVocoder
+
+ ## Introduction
+
+ The HelloSippyRT model is designed to adapt Microsoft's SpeechT5 Text-to-Speech (TTS) for real-time scenarios.
+
+ ## Problem Statement
+
+ The original vocoder performs well only when given almost the full Text-to-Mel ("TTM") sequence at once. This is not ideal for real-time applications, where audio output should begin as soon as possible. Feeding the vocoder smaller chunks produces "clicking" distortions between adjacent audio frames. Our attempts to fine-tune Microsoft's HiFi-GAN vocoder were unsuccessful.
+
+ ## Solution
+
+ Our approach adds a smaller model that operates on a fixed-size chunk: 8 Mel frames plus two pre-frames and two post-frames of context, 12 frames in total. These Mel frames are processed together with the 12 corresponding audio frames produced by the original vocoder, 256 bytes each. The model uses convolutional input layers for both the audio and the Mel frames to produce hidden representations, followed by two linear layers and a final convolutional layer. Its output is then multiplied by the original 8 audio frames to produce the corrected frames.
+ ![HelloSippyRT Model Architecture](https://docs.google.com/drawings/d/e/2PACX-1vTiWxGbEB2MbvHpTJHS22abWNrSt2pHv6XijEDmnQFjAqBewMJyZBQ_5Y9k1P9INQPQmuq56MpLDzJt/pub?w=960&h=720)
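
The chunk layout described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the model's actual code: the helper names (`chunk_windows`, `core_audio_slice`, `apply_correction`) are hypothetical, edge frames are handled by clamping indices, the "256 bytes" per audio frame is treated as 256 values, and the final multiplication is assumed to be element-wise.

```python
CHUNK = 8         # core Mel frames per chunk (from the README)
PRE = 2           # context frames before the chunk (from the README)
POST = 2          # context frames after the chunk (from the README)
FRAME_SIZE = 256  # assumption: one audio frame holds 256 values


def chunk_windows(n_frames):
    """Yield (window, core) lists of Mel-frame indices for each chunk.

    `window` always holds PRE + CHUNK + POST indices, clamped to the valid
    range (this edge handling is an assumption). `core` holds the indices of
    the frames whose audio is actually emitted.
    """
    for start in range(0, n_frames, CHUNK):
        core = list(range(start, min(start + CHUNK, n_frames)))
        window = [min(max(i, 0), n_frames - 1)
                  for i in range(start - PRE, start + CHUNK + POST)]
        yield window, core


def core_audio_slice():
    """Return the range of the 8 core frames inside the 12-frame audio window."""
    return PRE * FRAME_SIZE, (PRE + CHUNK) * FRAME_SIZE


def apply_correction(window_audio, gains):
    """Multiply the core part of the vocoder audio by the model output.

    `gains` stands in for the output of the corrective model, one value per
    core audio value (assumed element-wise multiplication).
    """
    lo, hi = core_audio_slice()
    core = window_audio[lo:hi]
    if len(gains) != len(core):
        raise ValueError("expected one gain per core audio value")
    return [sample * gain for sample, gain in zip(core, gains)]
```

For a 20-frame Mel sequence, `chunk_windows(20)` yields three windows of 12 indices each. Under these assumptions the audio for each window spans `12 * 256` values, of which only the middle `8 * 256` are corrected and emitted.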
+
+ ## Training Details
+
+ We trained the model on a subset of 3,000 utterances from the `LJSpeech-1.1` dataset. The original SpeechT5 TTS model synthesized the speech using speaker embeddings selected at random from the `Matthijs/cmu-arctic-xvectors` dataset. During training, the original vocoder was frozen; only our model was trained, to mimic the original vocoder as closely as possible in continuous mode.
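
The data selection described above could be reproduced along these lines. This is a minimal sketch, not the authors' training code: `build_training_pairs` and its parameters are hypothetical names, and the seeding scheme is an assumption made so the selection is repeatable.

```python
import random


def build_training_pairs(utterance_ids, n_speakers, subset_size=3000, seed=0):
    """Pick a random subset of utterances and pair each with a random speaker.

    Returns a list of (utterance_id, speaker_index) tuples. The speaker index
    is meant to select one embedding from a speaker-embedding dataset.
    """
    rng = random.Random(seed)  # local generator keeps the selection repeatable
    subset = rng.sample(list(utterance_ids), subset_size)
    return [(utt, rng.randrange(n_speakers)) for utt in subset]
```

For example, `build_training_pairs(range(13100), n_speakers=100)` draws 3,000 of the 13,100 LJSpeech-1.1 clips; the value `100` is a placeholder, not the actual number of available speaker embeddings.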
+
+ ## Source Code & Links
+
+ * [HelloSippyRT on GitHub](https://github.com/sippy/Infernos.git)
+ * [Training Code Repository](https://github.com/sobomax/hifi-gan-lsr-rt.git)
+
+ ---
+
+ **License**: BSD-2-Clause
+ **Library**: Transformers