The model sometimes outputs roughly 50 seconds or more of blank audio.

#14
by kelangyang - opened

The model has certain limitations:

  1. The output is sometimes empty or contains only noise.
  2. Errors can occur; see the other discussions for specific issues.
  3. Audio may clip, distort, or cut off unexpectedly.
  4. The generated audio may deviate from the target text or skip content.
  5. Emotional tone cannot be adjusted manually.

Solutions:

  1. Generate multiple outputs and check their duration; regenerate if a result is empty.
  2. Use noise feature extraction to detect problematic outputs and regenerate them.
  3. Reduce the input duration to under 7 seconds and experiment with different prompts.
  4. Tune the temperature until the output is satisfactory; counterintuitively, higher values sometimes work better.
  5. Fine-tuning could help, though the author hasn't released fine-tuning code yet. Even so, the model architecture is a strong reference for adaptation.
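
For anyone who wants to automate solutions 1, 2 and 4, here is a rough sketch of a check-and-regenerate loop. It is not the model's own API: `synthesize` is a hypothetical callable you would wrap around your own inference call, and the thresholds (`min_rms`, `max_zcr`, `max_seconds_per_char`) are guesses that need tuning on real outputs. Zero-crossing rate is used here as one simple noise feature; other features may work better.

```python
import math
from typing import Callable, List, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a mono signal with samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs (a crude noise feature)."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(samples) - 1)


def looks_bad(
    samples: Sequence[float],
    sample_rate: int,
    text: str,
    *,
    min_rms: float = 0.01,              # assumed silence threshold
    max_zcr: float = 0.35,              # assumed noise threshold
    max_seconds_per_char: float = 0.5,  # assumed upper bound on speaking rate
) -> bool:
    """Return True if the clip is empty, near-silent, noise-like, or far too long."""
    duration = len(samples) / sample_rate
    if duration == 0.0 or rms(samples) < min_rms:
        return True
    if zero_crossing_rate(samples) > max_zcr:
        return True
    return duration > max_seconds_per_char * max(len(text), 1)


def generate_with_retry(
    synthesize: Callable[[str, float], List[float]],  # hypothetical: (text, temperature) -> samples
    text: str,
    sample_rate: int,
    temperatures: Sequence[float] = (0.7, 0.9, 1.1),
) -> Optional[List[float]]:
    """Try each temperature in order and return the first clip that passes the checks."""
    for temperature in temperatures:
        samples = synthesize(text, temperature)
        if not looks_bad(samples, sample_rate, text):
            return samples
    return None  # every attempt failed; the caller decides what to do next
```

The loop moves to a higher temperature on each retry because, as noted above, higher values sometimes help. The duration bound scales with the text length, so a long stretch of blank audio for a short sentence is rejected even when it is not perfectly silent.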