mradermacher/model_requests · https://huggingface.co/Undi95/dbrx-base

17 days ago

Hi, I'd like to request quants for https://huggingface.co/Undi95/dbrx-base

For reference, you can find the instruct quant discussion here; a comment there also mentions that you might need to confirm the correct vocab base is selected in convert_hf_to_gguf.py.

Thank you! Always appreciate your work.

nicoboss

17 days ago

edited 17 days ago

Hello @treehugg3

Unfortunately, there are some issues with this model, some of which you already noticed yourself under https://huggingface.co/mradermacher/dbrx-base-GGUF/discussions/1.

First, it needs the following patch to convert_hf_to_gguf.py to convert at all:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 1a768c20..da1b09a4 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -618,12 +618,12 @@ class TextModel(ModelBase):
 
         from transformers import AutoTokenizer
         tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
-        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
-        assert max(tokenizer.vocab.values()) < vocab_size
+        vocab_size = self.hparams.get("vocab_size", tokenizer.vocab_size)
+        assert max(tokenizer.get_vocab().values()) < vocab_size
 
         tokpre = self.get_vocab_base_pre(tokenizer)
 
-        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
+        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.get_vocab().items()}
         added_vocab = tokenizer.get_added_vocab()
 
         added_tokens_decoder = tokenizer.added_tokens_decoder

Second, and far worse, the converted file ends up without the tokenizer merges, which prevents imatrix computation or actually running the model. This is very strange given that the instruct version worked in the past.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 6229 (106ed444) with gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23109 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 323 tensors from /tmp/dbrx-base.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = dbrx
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Dbrx Base
llama_model_loader: - kv   3:                         general.size_label str              = 16x13B
llama_model_loader: - kv   4:                            general.license str              = other
llama_model_loader: - kv   5:                       general.license.name str              = databricks-open-model-license
llama_model_loader: - kv   6:                       general.license.link str              = https://www.databricks.com/legal/open...
llama_model_loader: - kv   7:                           dbrx.block_count u32              = 40
llama_model_loader: - kv   8:                        dbrx.context_length u32              = 32768
llama_model_loader: - kv   9:                      dbrx.embedding_length u32              = 6144
llama_model_loader: - kv  10:                   dbrx.feed_forward_length u32              = 10752
llama_model_loader: - kv  11:                  dbrx.attention.head_count u32              = 48
llama_model_loader: - kv  12:               dbrx.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                        dbrx.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:                   dbrx.attention.clamp_kqv f32              = 8.000000
llama_model_loader: - kv  15:                          dbrx.expert_count u32              = 16
llama_model_loader: - kv  16:                     dbrx.expert_used_count u32              = 4
llama_model_loader: - kv  17:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  18:                          general.file_type u32              = 1025
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type bf16:  242 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16 (guessed)
print_info: file size   = 245.12 GiB (16.00 BPW) 
llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file

llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/tmp/dbrx-base.gguf'
main : failed to init

Because the converted model does not contain a complete vocabulary, I have paused all quantization tasks for the time being. We need to find a solution for this, or the model will have to get nuked for being unusable. I have quite a big emotional connection to DBRX, as it's the model that got me to start contributing to team mradermacher, so it would be really nice if we could somehow quantize it.
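
A quick pre-flight check along these lines can catch the problem before a 245 GiB conversion is written. This is only a sketch: it assumes the converter takes its BPE merges from the `model.merges` list in tokenizer.json or from a GPT-2 style merges.txt, and `has_bpe_merges` is a hypothetical helper name, not part of llama.cpp.

```python
import json
from pathlib import Path


def has_bpe_merges(model_dir: str) -> bool:
    """Return True if tokenizer.json or merges.txt in model_dir provides BPE merges."""
    root = Path(model_dir)

    tok_json = root / "tokenizer.json"
    if tok_json.is_file():
        data = json.loads(tok_json.read_text(encoding="utf-8"))
        # In the Hugging Face tokenizers JSON layout, BPE merges live under model.merges.
        merges = (data.get("model") or {}).get("merges")
        if merges:
            return True

    # Older GPT-2 style repositories ship the merges as a separate text file.
    merges_txt = root / "merges.txt"
    if merges_txt.is_file():
        lines = merges_txt.read_text(encoding="utf-8").splitlines()
        # Skip the optional "#version" header line if present.
        if lines and lines[0].startswith("#version"):
            lines = lines[1:]
        return any(line.strip() for line in lines)

    return False
```

If this returns False for the model directory, the resulting GGUF would be expected to fail with the same "cannot find tokenizer merges" error shown in the log.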

treehugg3

17 days ago

Thank you so much for the progress update!

I am currently trying to quantize by using the config files from https://huggingface.co/LnL-AI/dbrx-base-tokenizer on top of the repository contents. As mentioned there, this model's tokenizer is actually compatible with GPT-2. There is some extensive discussion in https://github.com/ggml-org/llama.cpp/pull/6515 which I haven't had time to read completely.

treehugg3

17 days ago

I added extra items to complete the vocabulary, and this allows the model to be run successfully: https://huggingface.co/treehugg3/dbrx-base-tokenizer-llamacpp

You can just copy these .json files into the model directory and quantize as usual, and it works! The quality is not Kimi-K2, but it's not terrible for what I was looking to do with it.
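
For anyone scripting this step, a minimal sketch of the copy is below. The directory arguments are placeholders for wherever the tokenizer repository and the model were downloaded, and `overlay_tokenizer_files` is a hypothetical helper name. Note that it overwrites same-named files in the model directory, so it is worth checking the returned list.

```python
import shutil
from pathlib import Path
from typing import List


def overlay_tokenizer_files(tokenizer_dir: str, model_dir: str) -> List[str]:
    """Copy every .json file from tokenizer_dir into model_dir, overwriting same-named files."""
    src, dst = Path(tokenizer_dir), Path(model_dir)
    if not dst.is_dir():
        raise NotADirectoryError(f"model directory not found: {dst}")

    copied = []
    for path in sorted(src.glob("*.json")):
        shutil.copy2(path, dst / path.name)  # copy2 also preserves timestamps
        copied.append(path.name)
    return copied
```

After the copy, conversion and quantization proceed as usual on the model directory.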

nicoboss

17 days ago

Awesome, thanks a lot, I will try. Should something similar be done for the DBRX Instruct model, or is the tokenizer good there?

nicoboss

17 days ago

It's queued! :D

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#dbrx-base-GGUF for quants to appear.

I mirrored the changes I made locally to https://huggingface.co/nicoboss/dbrx-base so we have a proper base model to link in the model card, and in case we ever have to requant DBRX base in the future.

I added extra items to complete the vocabulary, and this allows the model to be run successfully: https://huggingface.co/treehugg3/dbrx-base-tokenizer-llamacpp
You can just copy these .json files into the model directory and quantize as usual, and it works!

Thanks a lot for spending the time and effort required to fix this. I highly appreciate it, especially because it is a base model, and those usually see almost no love.

The quality is not Kimi-K2, but it's not terrible for what I was looking to do with it.

This was one of the first models of such massive size. They spent $10 million to train it. It's a 16-expert model with 4 experts active per token by default. Nothing stops you from activating all experts for each token to unlock the model's full potential if you don't care that much about inference speed. In fact, that's how I used this model most of the time, because in those older MoE models the router is not that great, depending on your use case. If you like DBRX, another model you might love is Snowflake Arctic. It's from their competitors, who likely made it in response to DBRX, and it is quite insane but relatively resource-intensive to run. DBRX, Snowflake Arctic, Llama 65B and Falcon 180B all have this non-overfitted feel to them where they show a ton of knowledge. The only modern-day model that comes close to this is Falcon-H1-34B. I use them all quite a lot due to their uniqueness and because they are simply different and give a second opinion compared to the models popular today.
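
To make the "4 of 16 experts" versus "all experts" point concrete, here is a generic top-k gating sketch. It is not DBRX's actual router code: `route` is a hypothetical function, and the softmax over only the selected logits is an assumption about how the weights are normalized.

```python
import math


def route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Indices of the k largest logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]

    # Subtract the peak logit for numerical stability before exponentiating.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


logits = [0.1 * i for i in range(16)]   # one made-up token, 16 experts
default = route(logits, 4)              # 4 experts contribute
everyone = route(logits, 16)            # every expert contributes, at about 4x the expert compute
```

With k equal to the number of experts, the router no longer excludes anything and only sets the mixing weights. In llama.cpp the count comes from the `dbrx.expert_used_count` metadata key shown in the log earlier in this thread; the `--override-kv` option is one way to change such keys at load time, but check the exact syntax against your build.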

treehugg3

16 days ago

edited 16 days ago

Awesome, thanks a lot, I will try. Should something similar be done for the DBRX Instruct model, or is the tokenizer good there?

The tokenizer looks good in the instruct model quants!

I appreciate the background and especially the recommendations! I am definitely going to try Falcon-180B and possibly Snowflake-Arctic if there are quants available for that one

nicoboss

16 days ago

I appreciate the background and especially the recommendations! I am definitely going to try Falcon-180B and possibly Snowflake-Arctic if there are quants available for that one

Falcon 180B: https://huggingface.co/mradermacher/falcon-180B-i1-GGUF
Falcon 180B Chat: https://huggingface.co/mradermacher/falcon-180B-chat-i1-GGUF
Snowflake Arctic Base: https://huggingface.co/mradermacher/snowflake-arctic-base-i1-GGUF
Snowflake Arctic Instruct: https://huggingface.co/mradermacher/snowflake-arctic-instruct-i1-GGUF

treehugg3

16 days ago

edited 16 days ago

Sadly the Arctic Base IQ3_XXS quant produces gibberish, even at low temps:

 a good stories and then gets those things like in want more that stuff up makes me feel beauty and how my am a beautiful personality. is true people skills." feels the look really work well to don will do not works for an example should work, have funing, or doing it with you all day and' means business, takes me by creating and works and and looks good

Edit: this was using a context length larger than the model supports (4,096 for the base model). I will try again.

Yes, it looks better now with the correct settings. But Snowflake Arctic Base has some slop in it, definitely not as bad as Llama 3 or GLM-4.5 though, for the story starters I prompted it with.

So far the breakdown is:

LLaMA 65b, Falcon 180B: no slop but too dumb to be all that useful without excessive filtering
dbrx-base, snowflake-arctic-base: much smarter, but has some slop that impacts good storywriting

I'm still on the hunt for something in the middle... Kimi-K2 is by far the best I've seen, beating ChatGPT and others in my opinion, but no one has distilled it down to a nice 70B yet...

mradermacher

Owner 13 days ago

Kimi-K2 is by far the best I've seen, beating ChatGPT and others in my opinion, but no one has distilled it down to a nice 70B yet...

If you want things done well, do them yourself, as the saying goes. I'll cheer for you :-)

treehugg3

13 days ago

I would do it in a heartbeat if only I had the budget or hardware. Since I don't want it to be overfit, the lower training requirements should help reduce the cost a little.

I guess what needs to happen is I need to:

See what is already out there. Read every pretraining paper, see what people have done
Produce something in the right direction with the hardware and resources I do have that might not be spectacular but gets people thinking and wins against other stuff in at least one direction
Get enough interest in that to partner with people who have money and the same vision who can fund the end result

mradermacher

Owner 13 days ago

Yeah, if it was trivial, it would have been done already. Maybe somebody will see this and also get motivated.

nicoboss

13 days ago

edited 13 days ago

@treehugg3 Kimi-K2 is an MoE model with 384 experts, so the only reasonable and affordable way to make it small is to throw away the experts you don't need, which is better known as pruning. You can make it a 4B model by selecting a single expert, or a 32B model by selecting 8 experts, which is recommended because the router is trained to select 8 experts. Ideally, make it at least around a 64B model by selecting, for example, the best 16 experts for your use case. That way you can still use the router to choose the best experts. I saw many such models for DeepSeek V3/R1, and they worked quite well, such as https://huggingface.co/huihui-ai/DeepSeek-V3-Pruned-Coder-411B, https://huggingface.co/huihui-ai/DeepSeek-R1-Pruned-Coder-411B and https://huggingface.co/huihui-ai/DeepSeek-V3-0324-Pruned-Coder-411B.
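
One simple way to decide which experts count as "the best for your use case" is to count how often the router picks each expert on a calibration set and keep the most-used ones. The sketch below shows only that selection step for a single layer; both function names are hypothetical, the frequency criterion is an assumption rather than what the linked pruned models used, and real pruning would also have to slice the expert tensors and router weights accordingly.

```python
from collections import Counter


def select_experts(routing_trace, keep):
    """Pick the `keep` most frequently routed experts from a calibration trace.

    routing_trace: iterable of per-token lists of selected expert indices.
    Returns the kept expert indices in ascending order.
    """
    counts = Counter(idx for token_experts in routing_trace for idx in token_experts)
    # Most-used first; ties are broken by the lower expert index.
    ranked = sorted(counts, key=lambda idx: (-counts[idx], idx))
    return sorted(ranked[:keep])


def remap_router_rows(kept):
    """Map old expert index -> new contiguous index for the pruned router."""
    return {old: new for new, old in enumerate(kept)}


# Toy trace: 4 tokens, 2 experts selected per token.
trace = [[0, 5], [5, 7], [5, 0], [2, 7]]
kept = select_experts(trace, keep=2)
mapping = remap_router_rows(kept)
```

Running the selection separately per layer and per domain (for example on coding prompts only) would give a domain-specific expert set.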


Architecture	Mixture-of-Experts (MoE)
Total Parameters	1T
Activated Parameters	32B
Number of Layers (Dense layer included)	61
Number of Dense Layers	1
Attention Hidden Dimension	7168
MoE Hidden Dimension (per Expert)	2048
Number of Attention Heads	64
Number of Experts	384
Selected Experts per Token	8
Number of Shared Experts	1
Vocabulary Size	160K
Context Length	128K
Attention Mechanism	MLA
Activation Function	SwiGLU
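
As a rough cross-check of the sizes discussed above, the routed-expert parameter count can be estimated from the table. This sketch assumes each SwiGLU expert consists of three bias-free projection matrices (gate, up, down) of size 7168 x 2048, which the table does not state explicitly, and it ignores attention, the shared expert, the dense layer and the embeddings.

```python
HIDDEN = 7168          # Attention Hidden Dimension (from the table)
EXPERT_HIDDEN = 2048   # MoE Hidden Dimension per expert (from the table)
LAYERS = 61            # Number of Layers, dense layer included
DENSE_LAYERS = 1       # Number of Dense Layers
EXPERTS = 384          # Number of Experts

# Assumption: three projection matrices per SwiGLU expert, no biases.
PARAMS_PER_EXPERT_PER_LAYER = 3 * HIDDEN * EXPERT_HIDDEN
MOE_LAYERS = LAYERS - DENSE_LAYERS


def routed_expert_params(experts_kept: int) -> int:
    """Routed-expert parameters when `experts_kept` experts remain in every MoE layer."""
    return experts_kept * MOE_LAYERS * PARAMS_PER_EXPERT_PER_LAYER


for kept in (1, 8, 16, EXPERTS):
    billions = routed_expert_params(kept) / 1e9
    print(f"{kept:>3} experts per layer: {billions:.1f}B routed-expert parameters")
```

With all 384 experts this comes to about 1.01T, which is consistent with the 1T total in the table. The smaller configurations count routed experts only, so a real pruned model would be larger once the non-expert weights are added; the 4B, 32B and 64B figures above are best read as rough estimates.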

treehugg3

13 days ago

edited 13 days ago

@nicoboss That's right, thank you for reminding me about the architecture and pointing me to pruning. I would definitely prefer to experiment with cost-effective techniques that wouldn't require training a new model, despite my interest in pretraining datasets.

I'll have to look at existing publications on fingerprinting models and see if there are any pruning tools that could be used on Kimi-K2. Interesting fingerprint visualization for gpt-oss: https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/index.html

Maybe this could work as a pipeline: https://github.com/CASE-Lab-UMD/Unified-MoE-Compression