embeddings table FP16 vs. Q6_K - test results (stduhpf vs. this)

#7
by cymaphore - opened

Hi,

I tested this model somewhat systematically against stduhpf's modification with the Q6_K embeddings table.

My results:

  • This model and the one from stduhpf perform almost identically under normal or relatively low cognitive load and complexity.
  • When a complex combination of pressures is introduced, the stduhpf model shows significant and heavy signs of problems, up to the point of unusability, while the model with the FP16 table makes only minor mistakes.
  • A multi-faceted scenario was required to reliably provoke these issues, probably due to the mixed nature of the quantisation in the QAT model. I want to share it in case it is helpful for your further optimizations.

Anyway, thanks for your great work and this exceptionally good model. I didn't think I'd ever be able to run something like this locally on my GPU :-)

My environment:

  • ThinkPad P16v, 64GB RAM, AMD Ryzen 9 PRO 7940HS, NVidia RTX A2000 (8GB VRAM)
  • ollama 0.6.5
  • Model parameters normally used: top_p: 0.95, min_p: 0.0, num_ctx: 25000, repeat_penalty: 1.0, temperature: 1.0, top_k: 64
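For illustration, a minimal sketch of how these options can be passed to ollama's REST API from Python (the model tag and prompt below are placeholders, not my exact setup):

```python
# Minimal sketch: send a chat request to a local ollama instance with the
# sampler options listed above. The model tag and prompt are placeholders.
import requests

options = {
    "temperature": 1.0,
    "top_k": 64,
    "top_p": 0.95,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
    "num_ctx": 25000,
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3-qat",  # placeholder tag for the GGUF imported into ollama
        "messages": [{"role": "user", "content": "Summarise the attached article."}],
        "options": options,
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```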

On my setup, your model infers at ~10.5 tokens/s, the one from stduhpf at ~8.5 tokens/s.

While I tried to be precise, my tests are not exactly scientific, but I hope this information is helpful.

Best regards,
Martin

Additional Detail:

Just for fun, I dumped some of the chat logs and details into Gemini 2.5 and asked it to write a summary of the findings (Model 1 == this one here, Model 2 == stduhpf):

Key Findings:

  • Performance: As expected, Model 2 (Q6_K) showed lower memory usage and higher token generation speed (~10.5 t/s vs. ~8.5 t/s on my hardware).
  • Robustness & Accuracy: This is where a significant difference emerged.
    • Model 1 (FP16 Embeddings) demonstrated high stability and robustness, even under significant "cognitive load" (complex tasks, long contexts, thematic shifts, self-analysis prompts).
    • Model 2 (Q6_K Embeddings) showed a marked increase in errors under similar high-load conditions. This included:
      • Significantly more typos/spelling mistakes.
      • Higher tendency for misinterpretation of inputs.
      • Noticeably greater difficulty handling complex tasks involving multiple distinct text inputs.
      • General instability and context confusion in demanding scenarios.
  • Low Load Behavior: Under simple tasks or low cognitive load, both models performed nearly identically well. The weaknesses of Model 2 only became apparent under stress.
  • Testing: Differences were revealed using complex prompts (up to ~25k token contexts in some tests) and specifically designed stress-testing methods targeting context management, alignment processing under ambiguity, and multi-source information handling.

Conclusion:

While the Q6_K quantization of the embedding table in Model 2 successfully reduces memory footprint and increases inference speed, it comes at a significant cost to model robustness and accuracy under demanding conditions. For my use cases involving complex interactions, the performance gains do not outweigh the observed degradation in quality and reliability compared to the original FP16 embedding table version (Model 1). This specific optimization appears detrimental to handling complex cognitive tasks effectively.

Thank you for testing this; it is very interesting. @stduhpf

First off, could you please share what you've tested exactly?

Secondly, you are using non-deterministic sampler settings and an inference backend that is known for adding a lot of noise. Due to the randomness, your test results are likely not very reliable. I'd suggest testing with deterministic sampler settings and raw llama.cpp.
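As a rough illustration, a deterministic request against a plain llama-server instance could look something like this (the port, prompt, and exact values are assumptions, not a prescribed setup):

```python
# Minimal sketch: query a plain llama.cpp server (e.g. started with
# `llama-server -m model.gguf --port 8080`) using deterministic sampling.
# The port, prompt, and exact parameter values are illustrative assumptions.
import requests

payload = {
    "prompt": "Test prompt goes here.",
    "temperature": 0.0,     # greedy decoding
    "top_k": 1,
    "seed": 42,             # fixed seed removes any remaining randomness
    "repeat_penalty": 1.0,
    "n_predict": 256,
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
print(resp.json()["content"])
```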

Lastly, make sure to use the latest fixed one, with the correctly set tokens (either on my HF page or stduhpf's).

Thanks for your feedback!

I attempted to go by the recommended settings as far as I could find them... So far I have been using ollama, for simplicity, in combination with open-webui, and I implemented a couple of Python tools for myself. That's how I discovered the issue (I use those tools to browse and filter the news, add context from Wikipedia articles, etc.).

But I could of course also just run that frontend against the llama.cpp server; I was unaware that there is a significant difference between those backends.

I'll also attempt to create a test case that is completely independent of my tools (raw prompts / a log? I'm not sure what's best, as I'm not an expert on LLMs) so that it's easier to share and reproduce.

Do you have any recommendations about what parameters to use?

I would love to see stduhpf's otherwise nice model run trouble-free, as it is more performant.

For a good comparison between the two, I suggest running temperature 0.01, top_k 0, repeat_penalty 1.0, top_p 0.95, min_p 0. Those settings should give deterministic results!
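As a sketch of how such a comparison could be scripted, assuming two llama-server instances (one per quant) on local ports, the same prompts can be run through both with exactly these settings and the outputs compared (ports, prompts, and server layout are placeholders):

```python
# Rough illustration: run the same prompts with the suggested settings against
# two llama.cpp server instances (one per quant) and flag any divergence.
# Ports, prompts, and the endpoint layout are assumptions for this sketch.
import requests

SETTINGS = {
    "temperature": 0.01,
    "top_k": 0,
    "top_p": 0.95,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
    "n_predict": 512,
    "seed": 42,
}

SERVERS = {
    "fp16-embeddings": "http://localhost:8080",  # this model
    "q6k-embeddings": "http://localhost:8081",   # stduhpf's model
}

prompts = ["Prompt 1 ...", "Prompt 2 ..."]  # replace with the actual stress-test prompts

for prompt in prompts:
    outputs = {}
    for name, base_url in SERVERS.items():
        resp = requests.post(f"{base_url}/completion",
                             json={"prompt": prompt, **SETTINGS}, timeout=600)
        outputs[name] = resp.json()["content"]
    identical = outputs["fp16-embeddings"] == outputs["q6k-embeddings"]
    print(f"{'same' if identical else 'DIFFERENT'}: {prompt[:40]!r}")
```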
