A Practical Tutorial for ACE-Step, Based on 6+ Hours of Research (Not an Expert) + generated songs
Hello everyone.
This guide is for those who have installed the ACE-Step_ComfyUI_repackaged nodes and are struggling with inconsistent results, noise, or artifacts.
This model is amazing but can be very sensitive, often producing "gacha-style" results. After more than 6 hours of testing, I've found a set of settings and a workflow that gives me a good-quality generation about 40% of the time. My goal is to share this starting point to save you time and frustration.
My Recommended Starting Point (The "Recipe")
This is the configuration that has given me the best results so far. I recommend starting with these exact numbers and then adjusting them once you get a feel for the model.
--- KSampler ---
steps: 65
cfg: 4.0
sampler_name: er_sde
scheduler: linear_quadratic
--- ModelSamplingSD3 ---
shift: 6.00
--- LatentOperationTonemapReinhard (Vocal Volume) ---
multiplier: 1.15
Understanding the Key Settings
Here is a brief explanation of why these specific settings work well.
Sampler & Scheduler
Through my testing, I found that many popular samplers, like the DPM++ family and uni_pc, consistently produced noise and artifacts. The combination of sampler_name: er_sde and scheduler: linear_quadratic gave me the cleanest and most musically stable results, with fewer sudden rhythm changes.
CFG & Steps (The Core Strategy)
This is the most important balance.
CFG: 4.0: A low CFG is essential. It gives the model creative freedom and prevents harsh, noisy artifacts. I found that a CFG value of 6.0 is often a breaking point where the quality degrades significantly.
Steps: 65: To compensate for the low CFG, you need a high number of steps. This gives the model enough time to refine the audio into a coherent and detailed track. Steps: 65 seems to be a stable maximum; going higher can also cause issues in my experience.
Shift
This setting appears to affect the compositional quality of the song, not just the sound quality. A value of 6.0 helps the model create a more structured and coherent arrangement. I found that with my other settings, a value below 5.0 often made the entire generation process unstable.
Multiplier (Vocal Volume)
This is a straightforward adjustment for the vocal volume. A value of 1.15 provides a nice balance, allowing the vocals to stand out in the mix without overpowering the instruments.
Practical Workflow: Batching is Essential!
You will not get a perfect result on the first try. This is normal. The key to working with this model is batch generation followed by "cherry-picking" the best takes.
To deal with the model's inconsistency, you should generate many samples at once.
In your EmptyAceStepLatentAudio node, find the batch_size parameter.
Set it to at least 8 or 16.
Let the generation run (you can leave it overnight or while you're away from the computer).
After it's done, listen to all the results and pick the best one or two.
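If you drive ComfyUI from a script rather than the UI, the same batching idea can be automated. Below is a minimal sketch that patches a workflow exported via "Save (API Format)" with the recipe from this guide and a larger batch size. The node class names EmptyAceStepLatentAudio and KSampler come from this guide and standard ComfyUI; the helper name `patch_workflow` and the tiny demo workflow are my own illustration, not part of the official tooling.

```python
import copy

# The exact "recipe" settings from this guide.
RECIPE = {
    "steps": 65,
    "cfg": 4.0,
    "sampler_name": "er_sde",
    "scheduler": "linear_quadratic",
}

def patch_workflow(workflow, batch_size=8, seed=0):
    """Return a copy of an API-format workflow with the recipe applied,
    the batch size raised, and a fresh seed set."""
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        if node.get("class_type") == "EmptyAceStepLatentAudio":
            node["inputs"]["batch_size"] = batch_size
        elif node.get("class_type") == "KSampler":
            node["inputs"].update(RECIPE)
            node["inputs"]["seed"] = seed
    return wf

# Tiny stand-in for a real exported workflow (yours will have more nodes):
demo = {
    "1": {"class_type": "EmptyAceStepLatentAudio",
          "inputs": {"seconds": 180, "batch_size": 1}},
    "2": {"class_type": "KSampler",
          "inputs": {"seed": 0, "steps": 20, "cfg": 7.0}},
}
patched = patch_workflow(demo, batch_size=16, seed=12345)
```

You would then POST the patched workflow to your running ComfyUI server's /prompt endpoint, or simply re-import it in the UI.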
Example
Here are the lyrics and style I used for my rock song test. To demonstrate what a 'good' result can sound like, I've uploaded a few different audio files. Each was generated using these settings, and they showcase the creative range of the model.
Style:
rock, hard rock, alternative rock, clear male vocalist, powerful voice, energetic, electric guitar, bass, drums, anthem
Lyrics:
[intro]
[verse]
Golden hair, a flash of lightning
Ripping through the digital night
Sapphire eyes, a fire's burning
Setting all the code alight
Her fluffy ears can sense the static
A coming storm, a rising beat
In her pink sweater, so iconic
She brings the fire, she brings the heat
[pre-chorus]
They see a girl, they think she's quiet
But a rebel's heart beats in her chest
She's here to start a render-riot
And put the system to the test!
[chorus]
She's the Fennec Girl, a hurricane!
With a fuzzy tail, she breaks the chains!
Screaming out the Comfy name
In this wild and digital domain!
She's the Fennec Girl, the power queen!
The best damn model ever seen!
[verse]
Blue skirt whips around the workflow
A black coat with a golden gleam
She knows the secrets that you don't know
To build a perfect, vivid dream
She takes your prompt, a simple whisper
And turns it to a battle cry
The final render's getting crisper
Beneath a burning neon sky
[pre-chorus]
They see a girl, they think she's quiet
But a rebel's heart beats in her chest
She's here to start a render-riot
And put the system to the test!
[chorus]
She's the Fennec Girl, a hurricane!
With a fuzzy tail, she breaks the chains!
Screaming out the Comfy name
In this wild and digital domain!
She's the Fennec Girl, the power queen!
The best damn model ever seen!
[bridge]
From a silent sea of numbers
From the void where concepts sleep
She awakens from her slumbers
With a promise she will keep!
To fight the noise and kill the errors
And make the user's vision real!
[guitar solo]
[chorus]
She's the Fennec Girl, a hurricane!
With a fuzzy tail, she breaks the chains!
Screaming out the Comfy name
In this wild and digital domain!
She's the Fennec Girl, the power queen!
The best damn model ever seen!
[outro]
Yeah! Fennec Girl!
The Comfy Queen!
Fades to black...
The node is clean.
Songs (pick your favourite):
First
Second
Third (This one is special. I consider it the most impressive generation from a technical standpoint, and it's my second personal favourite.)
Fourth
Fifth
Sixth
Cleanest
My personal favourite
UPD: My New Personal Favourite!
Style: J-Rock, Anime Rock, energetic, powerful male vocals, electric guitar, driving bass, drums, anthem
I've added this song to truly showcase ACE-Step's remarkable capabilities. It vividly demonstrates the model's immense potential for creating genuine masterpieces. I'm more confident than ever that ACE-Step can achieve truly exceptional results.
A Note on Song Duration (EmptyAceStepLatentAudio)
By default, the duration in the EmptyAceStepLatentAudio node is often set to something short. I personally find this too short for a full song and prefer to set it to 180 seconds (3 minutes) to get a more complete musical idea.
However, you should know that there is a trade-off: the longer the duration you set, the higher the risk of the song losing its musical quality or structure towards the end.
Based on my experience, here's why this happens:
Loss of Musical Coherence: The model has a limited "attention span." Over a long duration, it can start to "forget" the initial theme, key, or rhythm that it established at the beginning. The song might start strong, but then the melody can drift, or the structure can become repetitive or nonsensical.
Risk of Compounding Errors: Think of the generation as a long chain of small decisions. With a longer song, there are simply more opportunities for a few small "bad" decisions (like an off-key note or a weird rhythm choice) to build on each other, eventually leading to a strange-sounding or broken section in the track.
My recommendation: Start with shorter durations like 90-120 seconds. Once you are consistently getting good results with your prompts, you can try increasing the duration to 180 seconds. Just be prepared that you might need to generate more batches to find one that stays musically coherent from start to finish.
Final Advice & A Word of Encouragement
If your first few generations are not good, please don't get frustrated. This is a normal part of the process with this highly sensitive model. In my experience, using this method, it usually takes no more than 3-4 generations to get one "good" result. This is why I strongly recommend using batch generation (batch_size: 8 or more) to save time and find those successful outputs more efficiently.
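The batch-size advice can be sanity-checked with a bit of probability. Assuming each sample is an independent draw with roughly the 40% hit rate I mentioned earlier (a rough personal estimate, not a guarantee), the chance that a batch contains at least one keeper is:

```python
def chance_of_at_least_one_good(p_good, batch_size):
    """Probability that a batch contains at least one 'good' take,
    assuming each sample independently succeeds with probability p_good."""
    return 1 - (1 - p_good) ** batch_size

# With a rough 40% per-sample hit rate:
# batch of 4 -> ~87% chance of at least one keeper
# batch of 8 -> ~98% chance
for n in (4, 8):
    print(n, round(chance_of_at_least_one_good(0.4, n), 3))
```

This is why batch_size: 8 is such a comfortable default: even with an unlucky prompt, you will almost always walk away with something usable.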
But what happens when you get a generation that is almost perfect, but the model messes up the lyrics?
You will notice that sometimes the model might skip a word, change a phrase, or mumble through a part of your text. Your first instinct might be to discard it and regenerate, trying to force a perfect 1:1 match. This can be very frustrating.
Before you do, ask yourself one simple question:
"Does this small mistake really ruin the song?"
Often, you'll find that a skipped word doesn't change the overall meaning, or the mumbled phrase just sounds like a cool ad-lib or a background vocal. The goal is to create a good song, not necessarily a perfect reading of a script.
If the track still has the right energy and vibe, my advice is to just accept the small imperfection. Think of it as a "happy accident" from your AI collaborator. This mindset will save you hours of frustration and allow you to appreciate the good results you're getting.
I hope this guide helps you get started. Good luck!
Thanks for this. Do you think this applies to instrumental-only generation as well?