Blank Results from multiple quants

#2
by Able2 - opened

I have been getting blank images from the undistilled q4_k_s quants. Tested with both Comfy's native example workflow and your own, but images turn black at around 30% into the sampling process (with 10 steps it is step 3, with 20 steps it is step 6). Tried with and without CFG, with and without sage attention, using a blank prompt or a zeroed-out positive prompt as the negative prompt, 1024x1024 and 1328x1328 generation sizes, and both Comfy's test prompt and your own, but the results are the same. Comfy and all plugins should be up to date (updated everything before running the workflow; nothing seems out of order). I'm curious whether anyone has bumped into the same issue, and whether it is a quantization-related or a setup-related problem.
As a side note, when decoding the images there was a warning saying that illegal values were clipped. Perhaps the values exploded somewhere during sampling.
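
For anyone who wants to confirm the same thing on their side, here is a minimal sanity check for spotting exploding values in the latents right before the VAE decode (the function and variable names are mine, not from any node pack):

```python
import torch

def check_latent(latent: torch.Tensor, step: int) -> None:
    """Print basic statistics so NaN/Inf blow-ups show up immediately."""
    non_finite = (~torch.isfinite(latent)).sum().item()
    print(f"step {step}: min={latent.min().item():.3f} "
          f"max={latent.max().item():.3f} non-finite={non_finite}")

# call this from a debugger breakpoint or a small custom node inside the sampler loop
```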

Tested the distilled model (q4_0 and q5_0 quants) and the same behavior persists. This time it happened at step 5/15 (again the step right after the 30% mark). I can verify that it is a model-specific problem, as Flux Krea works fine. Perhaps some GGUF-related code is causing the problem?

Able2 changed discussion title from Blank Results from q4_k_s quants to Blank Results from multiple quants

have you tested it with gguf-connector? it works fine, even with the smallest quant

Screenshot 2025-08-10 080719.png

and the distilled model (disable cfg; steps=15) does work 20-30% faster than the original model

Screenshot 2025-08-10 080809.png

oh, q4_k_s; ok, will test it right away

Screenshot 2025-08-10 083353.png

seems no problem

Screenshot 2025-08-10 083426.png

works on comfy as well

Screenshot 2025-08-10 085835.png

please check the settings; are they similar to yours or not? test it with different engines, i.e., diffusers, comfyui, gguf-connector, etc., before jumping to conclusions

Generation parameters seem to be the same. Double-checked ComfyUI and the gguf plugin and both are up to date. Will try with gguf-connector. A retry with ComfyUI still yields the same result.

try the portable pack here; see whether it helps or not

The given portable pack runs fine. However, the main environment I am using still only produces blank images, even when using city96's gguf node. Below is the version info for that environment; perhaps something there is just wrong and causing the problem.

```
pytorch version: 2.7.0+cu128
xformers version: 0.0.30
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 Laptop GPU : cudaMallocAsync
Using sage attention
Python version: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
ComfyUI version: 0.3.49
ComfyUI frontend version: 1.24.4
```

the portable pack is an isolated env; you could try disabling sage attention, since the portable pack doesn't need it

Going back to xformers attention solves the problem, and bringing back sage attention with KJ's patcher node reintroduces it. Not sure what the issue is between sage attention and Qwen-Image. However, that is a significant step down in speed (from 10 minutes up to 30 minutes) due to lackluster hardware.
On the other hand, Qwen-Image-Lightning just dropped (it can generate decent images in 4~8 steps). I wonder if you are going to release it in quantized form so that us GPU poors can still enjoy it at a decent speed. The hardware on my side is just too weak to quantize this monstrosity.

Qwen-Image-Lightning is a lora only; you need to run the original model plus the lora, so theoretically it won't be fast; not sure the lora can really override the main model's settings

actually, we can just adjust the number of steps; we don't necessarily need to add an extra lora. it does run faster, but fewer steps will affect the sampling/sampler and therefore the output quality

Screenshot 2025-08-11 112042.png

this is a kind of trade-off

Screenshot 2025-08-11 112110.png

alright, we've set the number of steps free to adjust; everybody can make their own trade-off; no need for the lora, since that lora is not small, and adding it on top won't be a good deal

Screenshot 2025-08-11 115541.png

8 steps looks a lot better than 4 steps above; usable

Screenshot 2025-08-11 113359.png
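
For readers following along outside ComfyUI, the two knobs being traded off here (step count and CFG) map roughly to the pipeline arguments below. This is only a sketch: "Qwen/Qwen-Image" and the `true_cfg_scale` argument reflect my reading of the upstream diffusers pipeline, so double-check them against the model card.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    prompt="a cat reading a newspaper",
    negative_prompt=" ",
    num_inference_steps=8,   # fewer steps = faster, at the cost of quality
    true_cfg_scale=1.0,      # assumed argument name; 1.0 effectively disables CFG
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("qwen_image_8_steps.png")
```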

> Qwen-Image-Lightning is a lora only; you need to run the original model plus the lora, so theoretically it won't be fast; not sure the lora can really override the main model's settings

Someone did try to merge the original model and the lora together. Speed probably won't be a problem after merging. Also, since this lora is designed to run with 4 or 8 steps, it produces much better results than using the distilled checkpoints with reduced steps (the distilled ones still require 15 steps). See the merge sketch below.
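
For anyone curious what "merging" means concretely, here is a generic sketch of folding a lora into the base weights (W += scale * up @ down) with safetensors. The file names and the lora_up/lora_down key convention are assumptions; the actual Lightning lora may use different key names and may also contain layers that need extra handling.

```python
import torch
from safetensors.torch import load_file, save_file

base = load_file("qwen_image_base.safetensors")             # hypothetical file name
lora = load_file("qwen_image_lightning_8step.safetensors")  # hypothetical file name

merged = dict(base)
for key in lora:
    if not key.endswith(".lora_down.weight"):
        continue
    prefix = key[: -len(".lora_down.weight")]
    down = lora[key].float()
    up = lora[prefix + ".lora_up.weight"].float()
    rank = down.shape[0]
    alpha = float(lora.get(prefix + ".alpha", torch.tensor(float(rank))))
    target = prefix + ".weight"            # assumed mapping onto the base checkpoint key
    if target in merged:
        w = merged[target].float() + (alpha / rank) * (up @ down)
        merged[target] = w.to(merged[target].dtype)

save_file(merged, "qwen_image_lightning_merged.safetensors")
```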

Update: Editing the metadata of OzzyGT's quant to show pig instead of qwen made it run with your nodes, and the result is definitely better than running the original models with reduced steps. It would be even better if there were 4-step versions.
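
If you want to verify what that metadata edit actually changed, a rough way to read the architecture field with the `gguf` pip package is shown below; the field-access idiom is from memory, so treat it as an assumption and adjust if the API differs.

```python
from gguf import GGUFReader

reader = GGUFReader("qwen-image-lightning-q4_0.gguf")  # hypothetical file name
field = reader.fields["general.architecture"]
# for string fields, the raw value bytes are in the last part of the field
print(str(bytes(field.parts[-1]), encoding="utf-8"))   # expected: "pig" after the edit
```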

> Qwen-Image-Lightning is a lora only; you need to run the original model plus the lora, so theoretically it won't be fast; not sure the lora can really override the main model's settings

> Someone did try to merge the original model and the lora together. Speed probably won't be a problem after merging. Also, since this lora is designed to run with 4 or 8 steps, it produces much better results than using the distilled checkpoints with reduced steps (the distilled ones still require 15 steps).

interesting

the faster part is due to this:

```python
# quoted from OzzyGT's diffusers setup; imports added here for context
import torch
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration

# model_id and torch_dtype are assumed to be defined earlier in the original script
# (e.g. the Qwen-Image repo id and torch.bfloat16)

# quantize the Qwen2.5-VL text encoder to 4-bit NF4 via bitsandbytes
quantization_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
text_encoder = text_encoder.to("cpu")
```

turn the text encoder into 4-bit; let's test it first

> turn the text encoder into 4-bit; let's test it first

I'm mostly using ComfyUI at the moment. I don't think it supports bitsandbytes for now. OzzyGT did that probably because they're using diffusers, but I think that in ComfyUI the q4 quant of Qwen2.5 VL 7B serves the same purpose and would probably do the same thing.
I am more interested in using the 4-step variant of the Lightning finetunes. However, since no one has converted that lora to work with ComfyUI, there is currently no way to test it (the lora just won't load). I'm currently using the quants OzzyGT provided (with modified metadata for compatibility with your ComfyUI nodes). That quant uses the 8-step variant, so it works best with 8 steps. However, if we create quants for the 4-step variant we can theoretically double the generation speed. That would be >5x compared to the original model (20 steps with CFG), >3x compared to the distilled model (15 steps, no CFG), and 2x compared to the 8-step variant (8 steps, no CFG). Of course you can lower the steps for the other models too, but they will lose detail and prompt following, depending on how aggressively you lower them.
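
As a back-of-the-envelope check of those ratios, counting sampler steps only (CFG roughly doubles the per-step cost, which is why the original-model figure is quoted as ">5x" rather than exactly 5x):

```python
steps = {"original (20, CFG)": 20, "distilled (15)": 15, "8-step": 8, "4-step": 4}

for name, n in steps.items():
    print(f"{name}: {n / steps['4-step']:.2f}x the steps of the 4-step variant")
# original (20, CFG): 5.00x, distilled (15): 3.75x, 8-step: 2.00x, 4-step: 1.00x
```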

gguf-node was updated; it supports "qwen", "qwen_image", etc. now, so you don't need to change anything and it still works; for the lora issue, you could use gguf-connector's lora swapper to convert it to unet format so it can be used in comfyui: simply execute ggc la and select the lora file in the current directory to complete the process, then it should work

> for the lora issue, you could use gguf-connector's lora swapper to convert it to unet format so it can be used in comfyui: simply execute ggc la and select the lora file in the current directory to complete the process, then it should work

That doesn't seem to do the trick. ggc la reports nothing found. I checked the lora keys and compared them with ggc's logic, and it seems that all the keys in the lora are indeed in unet format, but it just won't load.
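
For reference, a quick way to dump the keys of a .safetensors lora and compare them against what the loader expects (the file name here is hypothetical):

```python
from safetensors import safe_open

with safe_open("Qwen-Image-Lightning-8steps.safetensors", framework="pt") as f:
    for key in f.keys():
        print(key)
```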

Anyway, for those who just popped in because of blank images: DO NOT USE Sage Attention!! For some reason it is not compatible and will make the model produce out-of-bound values. On my side xformers seems to be slower than native PyTorch attention, so you might also want to test which one is better for you.
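
If you want to see the numerical problem directly, here is a rough side-by-side of sage attention versus plain PyTorch SDPA on random tensors; the `sageattn` call and the default (batch, heads, seq_len, head_dim) layout are assumptions based on my reading of the sageattention package, so adjust as needed.

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn  # assumed import path

q = torch.randn(1, 24, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

ref = F.scaled_dot_product_attention(q, k, v)  # baseline attention
out = sageattn(q, k, v)                        # sage attention, default layout assumed

print("max |diff|:", (ref - out).abs().max().item())
print("non-finite values in sage output:", (~torch.isfinite(out)).sum().item())
```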

tested; awesome! high speed and good quality as well

Screenshot 2025-08-12 002617.png

saves 50% of the time indeed

Screenshot 2025-08-12 003115.png

the lora seems to sharpen all the lines in the picture and make it clearer

Screenshot 2025-08-12 002636.png

upgraded the gguf-connector; the distilled mode adopts a 4-bit encoder for fast loading; the original one can serve all three of them, and both the step count and the cfg scale can be customized right away; cooking the gguf files

Qwen Image works fine on its own in ComfyUI, but as soon as SageAttention 2 and Triton are applied, I get black images. It works fine with all other models, but with Qwen Image it's always black (whether it's GGUF or FP8).

might be an issue rooted in the tensor structure; there was no standard from day one, and by the time there was a standard, the high-growth/developmental or golden stage was already over
just disable those libraries and make it work first

> Qwen Image works fine on its own in ComfyUI, but as soon as SageAttention 2 and Triton are applied, I get black images. It works fine with all other models, but with Qwen Image it's always black (whether it's GGUF or FP8).

This seems to match my observation. However, it is weird that the turning point (from a plausible image to pure blank) always seems to happen at around 30% into the process, no matter which model is used, suggesting that it is probably a systematic bug rather than an edge case.

I noticed this too. I see an image being rendered and then suddenly it just turns black as the final result. It's too bad, because these larger models really need SageAttention or it's not worth my time.
