Hunyuan Custom - A small single subject study
Since its release, I've seen virtually no one talk about Hunyuan Custom. Sure, it got caught up in the Wan VACE craze, but is it inferior in all aspects? I've been meaning to find out a bit about its capabilities and finally decided to put some cycles into figuring it out. I will not put forward any claims in this article, but I will show some things it can do. Consider it a 'Get started with Hunyuan Custom' type article.
I used the default ComfyUI Hunyuan Wrapper workflow and started with one of my trusty images I've been using lately. If you have ever wondered what Gimli played by Sean Bean in an 80s version of LotR would look like, I think this will give you some hints.
The workflow is for the single-image-reference-to-video model that was released first, not the newer video and audio reactive model.
I'll skip a lot of the boring iteration and just tell you that it took me quite some time to get from here:
which looks nothing like what I want, to here:
which at least resembles my reference image, but still looks quite bad. Same prompt, but different steps, flow_shift and cfg. The model seemed really unstable, for lack of a better word, possibly due to the low resolution I use for inference: a slight change in those values would completely change the output. Luckily, some other generations fared better. I eventually ended up with something quite decent (albeit low res):
Steps: 30 (anything above would lose the resemblance to the input)
Flow_shift: 16.55 (again, deviating much from this lost the resemblance). Going much higher gives a darker image.
Cfg: 9.50 (around 10 seemed like a nice threshold)
Here I also "improved" the prompt by adding "high quality" and "cinematic", after looking at the example prompts. It felt quite 2023. This generation also used my 80s fantasy LoRA, but I'm not sure it affected the result much.
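Since small changes in flow_shift and cfg changed the output so much, a systematic sweep around a known-good setting can save some guesswork. The following is only a minimal sketch in plain Python that lists the combinations to try; it does not talk to ComfyUI at all. The `sweep` helper and the step size of 1.0 are my own assumptions for illustration, and the values would still have to be entered into the workflow by hand or by whatever automation you already use.

```python
from itertools import product

# Baseline taken from the settings listed above.
BASE = {"steps": 30, "flow_shift": 16.55, "cfg": 9.50}


def sweep(base, flow_shift_deltas=(-1.0, 0.0, 1.0), cfg_deltas=(-1.0, 0.0, 1.0)):
    """Yield one settings dict per (flow_shift, cfg) combination around the baseline.

    The delta values are hypothetical step sizes, not recommendations.
    Steps are kept fixed at the baseline value.
    """
    for d_shift, d_cfg in product(flow_shift_deltas, cfg_deltas):
        yield {
            "steps": base["steps"],
            "flow_shift": round(base["flow_shift"] + d_shift, 2),
            "cfg": round(base["cfg"] + d_cfg, 2),
        }


grid = list(sweep(BASE))
for settings in grid:
    print(settings)
```

Keeping the prompt (and, if your setup allows it, the seed) fixed while going through the grid makes it easier to tell which parameter caused a change.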
After this I decided to shift focus a bit and see how well it would carry style over to different scenarios. These examples proved the strength of Custom better, imo.
"the man is sitting on a swing on a playground, staring blankly into the distance. cinematic. high quality"
Steps: 30
Flow_shift: 18.50
Cfg: 8.50
"the man is buying groceries at the supermarket. he inspects a carrot. cinematic. realistic."
Steps: 30
Flow_shift: 15.49
Cfg: 11.77
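To see how far apart the working settings actually were, the three runs above can be put side by side. This is just a small bookkeeping sketch; the labels and the `value_range` helper are names I made up for illustration, while the numbers are the ones listed in this article.

```python
# The three settings reported in this article; labels are illustrative only.
RUNS = [
    {"label": "reference look", "steps": 30, "flow_shift": 16.55, "cfg": 9.50},
    {"label": "swing", "steps": 30, "flow_shift": 18.50, "cfg": 8.50},
    {"label": "supermarket", "steps": 30, "flow_shift": 15.49, "cfg": 11.77},
]


def value_range(runs, key):
    """Return the (lowest, highest) value of one setting across all runs."""
    values = [run[key] for run in runs]
    return min(values), max(values)


ranges = {key: value_range(RUNS, key) for key in ("steps", "flow_shift", "cfg")}
for key, (low, high) in ranges.items():
    print(f"{key}: {low} to {high}")
```

Three runs are far too few to draw conclusions from, but a table like this gives a rough starting window for the next reference image.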
I tried some other references and characters with various results, but here are some general findings:
- Works well with portrait style images -> portrait / up close shots. Other shot types may work the same way (same -> same), but I don't have enough examples to prove this point.
- Not so good for background transfer. I tried to use a setting image and insert some character or action into it, but the image wobbled and warped.
- Carries over the style of the image fairly well. This should be evident from my examples.
- Hunyuan LoRAs "work", but less so than for Framepack. The model has probably been finetuned further away from the original. I hope to do some LoRA experimentation in the near future.
In summary, it has potential, but feels a bit raw and non-responsive sometimes. I really liked some of the results I got, while it sometimes took frustratingly long to get there.