https://huggingface.co/Undi95/dbrx-base
Hi, I'd like to request quants for https://huggingface.co/Undi95/dbrx-base
For reference, you can find the instruct quant discussion here, and a comment there mentions that you might need to confirm the correct vocab base is selected in convert_hf_to_gguf.py.
Thank you! Always appreciate your work.
Hello @treehugg3
Unfortunately, there are some issues with this model, some of which you already noticed yourself under https://huggingface.co/mradermacher/dbrx-base-GGUF/discussions/1.
First, it needs the following patch to convert_hf_to_gguf.py to convert at all:
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 1a768c20..da1b09a4 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -618,12 +618,12 @@ class TextModel(ModelBase):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
- vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
- assert max(tokenizer.vocab.values()) < vocab_size
+ vocab_size = self.hparams.get("vocab_size", tokenizer.vocab_size)
+ assert max(tokenizer.get_vocab().values()) < vocab_size
tokpre = self.get_vocab_base_pre(tokenizer)
- reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
+ reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.get_vocab().items()}
added_vocab = tokenizer.get_added_vocab()
added_tokens_decoder = tokenizer.added_tokens_decoder
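As a rough illustration of what the patched lines do: they read the vocabulary through `get_vocab()` rather than the `vocab` attribute, presumably because this tokenizer does not expose `vocab`. The sketch below uses a hypothetical `StubTokenizer` stand-in (not the real DBRX tokenizer) and a hypothetical helper `build_reverse_vocab`, so it runs without the model files.

```python
class StubTokenizer:
    """Hypothetical stand-in exposing only the calls the patched lines use."""

    def __init__(self, base, added):
        self._base = dict(base)
        self._added = dict(added)

    @property
    def vocab_size(self):
        # Base vocabulary only; added tokens are not counted here.
        return len(self._base)

    def get_vocab(self):
        # Base vocabulary merged with the added tokens.
        return {**self._base, **self._added}


def build_reverse_vocab(tokenizer, hparams):
    """Mirror the patched lines: size check, then an id -> token map."""
    vocab_size = hparams.get("vocab_size", tokenizer.vocab_size)
    vocab = tokenizer.get_vocab()
    assert max(vocab.values()) < vocab_size
    return vocab_size, {id_: tok for tok, id_ in vocab.items()}


tok = StubTokenizer({"a": 0, "b": 1}, {"<|endoftext|>": 2})
vocab_size, reverse_vocab = build_reverse_vocab(tok, {"vocab_size": 4})
print(vocab_size, reverse_vocab[2])
```

Here `vocab_size` comes from the hyperparameters when present, so the fallback to `tokenizer.vocab_size` only matters for models whose config lacks that field.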
Second, and far worse, the tokenizer merges are not making it into the converted file, which prevents imatrix computation or actually running the model. This is very strange given that the instruct version worked in the past.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 6229 (106ed444) with gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23109 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 323 tensors from /tmp/dbrx-base.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = dbrx
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Dbrx Base
llama_model_loader: - kv 3: general.size_label str = 16x13B
llama_model_loader: - kv 4: general.license str = other
llama_model_loader: - kv 5: general.license.name str = databricks-open-model-license
llama_model_loader: - kv 6: general.license.link str = https://www.databricks.com/legal/open...
llama_model_loader: - kv 7: dbrx.block_count u32 = 40
llama_model_loader: - kv 8: dbrx.context_length u32 = 32768
llama_model_loader: - kv 9: dbrx.embedding_length u32 = 6144
llama_model_loader: - kv 10: dbrx.feed_forward_length u32 = 10752
llama_model_loader: - kv 11: dbrx.attention.head_count u32 = 48
llama_model_loader: - kv 12: dbrx.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: dbrx.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: dbrx.attention.clamp_kqv f32 = 8.000000
llama_model_loader: - kv 15: dbrx.expert_count u32 = 16
llama_model_loader: - kv 16: dbrx.expert_used_count u32 = 4
llama_model_loader: - kv 17: dbrx.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 18: general.file_type u32 = 1025
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type bf16: 242 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16 (guessed)
print_info: file size = 245.12 GiB (16.00 BPW)
llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/tmp/dbrx-base.gguf'
main : failed to init
Because the model file does not contain the complete vocabulary, I paused all quantization tasks for the time being. We need to find a solution for this or the model will have to get nuked for being unusable. I have quite a big emotional connection to DBRX, as it's the model that got me to start contributing to team mradermacher, so it would be really nice if we could somehow quantize it.
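The failure can be read straight off the metadata dump: the file has `tokenizer.ggml.tokens` but no `tokenizer.ggml.merges`, which the loader asks for with a `gpt2` (BPE) tokenizer. A minimal sketch of that check follows, using the tokenizer keys copied from the log above; the helper name `missing_bpe_keys` and the two-key requirement list are assumptions made for illustration.

```python
# Keys a BPE ("gpt2") vocabulary is assumed to need for this check.
REQUIRED_BPE_KEYS = ("tokenizer.ggml.tokens", "tokenizer.ggml.merges")


def missing_bpe_keys(metadata_keys):
    """Return the required tokenizer keys absent from a GGUF key listing."""
    present = set(metadata_keys)
    return [key for key in REQUIRED_BPE_KEYS if key not in present]


# Tokenizer-related keys as they appear in the log (kv 20 to kv 25).
logged_keys = [
    "tokenizer.ggml.model",
    "tokenizer.ggml.pre",
    "tokenizer.ggml.tokens",
    "tokenizer.ggml.token_type",
    "tokenizer.ggml.add_bos_token",
    "tokenizer.ggml.add_eos_token",
]

print(missing_bpe_keys(logged_keys))  # ['tokenizer.ggml.merges']
```

With the `gguf` Python package installed, the key listing could presumably be taken from `GGUFReader(path).fields` instead of being typed by hand.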
Thank you so much for the progress update!
I am currently trying to quantize by using the config files from https://huggingface.co/LnL-AI/dbrx-base-tokenizer on top of the repository contents. As mentioned there, this model's tokenizer is actually compatible with GPT-2. There is some extensive discussion in https://github.com/ggml-org/llama.cpp/pull/6515 which I haven't had time to read completely.
I added extra items to complete the vocabulary, and this allows the model to be run successfully: https://huggingface.co/treehugg3/dbrx-base-tokenizer-llamacpp
You can just copy these .json files into the model directory and quantize as usual, and it works! The quality is not Kimi-K2, but it's not terrible for what I was looking to do with it.
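The copy step can be sketched as below. The function name `overlay_tokenizer_files` is hypothetical, and the demo runs on throwaway directories; in practice the two arguments would be your local checkouts of the tokenizer repository and of the model. Note that files with the same name in the model directory are overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def overlay_tokenizer_files(tokenizer_dir, model_dir):
    """Copy every top-level .json file from tokenizer_dir into model_dir.

    Existing files with the same name are overwritten.
    Returns the sorted list of copied file names.
    """
    copied = []
    for src in sorted(Path(tokenizer_dir).glob("*.json")):
        shutil.copy2(src, Path(model_dir) / src.name)
        copied.append(src.name)
    return copied


# Demo on temporary directories so nothing real is touched.
with tempfile.TemporaryDirectory() as tok_dir, tempfile.TemporaryDirectory() as model_dir:
    Path(tok_dir, "tokenizer.json").write_text("{}")
    Path(tok_dir, "README.md").write_text("not copied")
    print(overlay_tokenizer_files(tok_dir, model_dir))  # ['tokenizer.json']
```

After the copy, conversion and quantization run against the model directory as usual.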
Awesome, thanks a lot, I will try. Should something similar be done for the DBRX Instruct model, or is the tokenizer good there?
It's queued! :D
You can check for progress at http://hf.tst.eu/status.html or regularly check the model summary page at https://hf.tst.eu/model#dbrx-base-GGUF for quants to appear.
I applied the same changes I made locally to https://huggingface.co/nicoboss/dbrx-base so that we have a proper base model to link in the model card, and in case we ever have to requant DBRX base in the future.
> I added extra items to complete the vocabulary, and this allows the model to be run successfully: https://huggingface.co/treehugg3/dbrx-base-tokenizer-llamacpp
> You can just copy these .json files into the model directory and quantize as usual, and it works!
Thanks a lot for spending the time and effort required to fix this. I highly appreciate it, especially because it is a base model, and base models usually see almost no love.
> The quality is not Kimi-K2, but it's not terrible for what I was looking to do with it.
This was one of the first models of such massive size. They spent $10 million to train it. It's a 16-expert model with 4 experts active per token by default. Nothing stops you from activating all experts for each token to unlock the model's full potential if you don't care that much about inference speed. In fact, that is how I used this model most of the time, because the router in those older MoE models is not that great, depending on your use case. If you like DBRX, another model you might love is Snowflake Arctic. It's from their competitors, who likely made it in response to DBRX; it is quite insane but relatively resource-intensive to run. DBRX, Snowflake Arctic, Llama 65B and Falcon 180B all have this non-overfitted feel to them and show a ton of knowledge. The only modern-day model that comes close to this is Falcon-H1-34B. I use them all quite a lot because they are unique and simply different, giving a second opinion compared to the models popular today.
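To make the "4 of 16 experts" versus "all 16 experts" distinction concrete, here is a minimal top-k routing sketch in plain Python. It assumes a softmax over the router logits followed by renormalisation over the selected experts; whether DBRX normalises exactly this way is an assumption, and the function name `route` is made up for the example.

```python
import math


def route(router_logits, top_k):
    """Return {expert_index: weight} for the top_k experts of one token.

    Weights are softmax probabilities renormalised to sum to 1
    over the selected experts.
    """
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts and renormalise their weights.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


logits = [0.1 * i for i in range(16)]   # made-up router scores for 16 experts
default = route(logits, top_k=4)        # the default: 4 experts per token
everything = route(logits, top_k=16)    # every expert contributes

print(sorted(default))    # [12, 13, 14, 15]
print(len(everything))    # 16
```

In the GGUF file the expert count per token is stored as `dbrx.expert_used_count` (4 in the log above). llama.cpp's `--override-kv` option should allow changing it at load time, for example `--override-kv dbrx.expert_used_count=int:16`, but verify that against your build.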
> Awesome, thanks a lot, I will try. Should something similar be done for the DBRX Instruct model, or is the tokenizer good there?
The tokenizer looks good in the instruct model quants!
I appreciate the background and especially the recommendations! I am definitely going to try Falcon-180B and possibly Snowflake-Arctic if there are quants available for that one
> I appreciate the background and especially the recommendations! I am definitely going to try Falcon-180B and possibly Snowflake-Arctic if there are quants available for that one
- Falcon 180B: https://huggingface.co/mradermacher/falcon-180B-i1-GGUF
- Falcon 180B Chat: https://huggingface.co/mradermacher/falcon-180B-chat-i1-GGUF
- Snowflake Arctic Base: https://huggingface.co/mradermacher/snowflake-arctic-base-i1-GGUF
- Snowflake Arctic Instruct: https://huggingface.co/mradermacher/snowflake-arctic-instruct-i1-GGUF
Sadly the Arctic Base IQ3_XXS quant produces gibberish, even at low temps:
a good stories and then gets those things like in want more that stuff up makes me feel beauty and how my am a beautiful personality. is true people skills." feels the look really work well to don will do not works for an example should work, have funing, or doing it with you all day and' means business, takes me by creating and works and and looks good
Edit: this was using a context larger than the model supports (4,096 for the base model). I will try again.
Yes, it looks better now with the correct settings. Snowflake Arctic Base does have some slop in it for the story starters I prompted it with, though definitely not as bad as Llama 3 or GLM-4.5.
So far the breakdown is:
- LLaMA 65b, Falcon 180B: no slop but too dumb to be all that useful without excessive filtering
- dbrx-base, snowflake-arctic-base: much smarter, but has some slop that impacts good storywriting
I'm still on the hunt for something in the middle... Kimi-K2 is by far the best I've seen, beating ChatGPT and others in my opinion, but no one has distilled it down to a nice 70B yet...
> Kimi-K2 is by far the best I've seen, beating ChatGPT and others in my opinion, but no one has distilled it down to a nice 70B yet...
If you want things done well, do them yourself, as the saying goes. I'll cheer for you :-)
I would do it in a heartbeat if only I had the budget or hardware. Since I don't want it to be overfit, the lower training requirements should help reduce the cost a little.
I guess what I need to do is:
- See what is already out there. Read every pretraining paper, see what people have done
- Produce something in the right direction with the hardware and resources I do have that might not be spectacular but gets people thinking and wins against other stuff in at least one direction
- Get enough interest in that to partner with people who have money and the same vision who can fund the end result
Yeah, if it was trivial, it would have been done already. Maybe somebody will see this and also get motivated.
@treehugg3 Kimi-K2 is an MoE model with 384 experts, so the only reasonable and affordable way to make it small is to just throw away the experts you don't need, which is better known as pruning. You can make it a 4B model by selecting a single expert, a 32B model by selecting 8 experts (recommended, as the router is trained to select 8 experts), or ideally make it at least around a 64B model and select, for example, the best 16 experts for your use case. That way you can still use the router to choose the best experts. I saw many such models for DeepSeek V3/R1 and they worked quite well, such as https://huggingface.co/huihui-ai/DeepSeek-V3-Pruned-Coder-411B, https://huggingface.co/huihui-ai/DeepSeek-R1-Pruned-Coder-411B and https://huggingface.co/huihui-ai/DeepSeek-V3-0324-Pruned-Coder-411B.
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Experts | 384 |
| Number of Attention Heads | 64 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
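A back-of-the-envelope estimate of pruned sizes can be derived from the table. This sketch assumes that the "Attention Hidden Dimension" is the model width, that each SwiGLU expert consists of three weight matrices (gate, up, down) without biases, and that everything not in a routed expert is always active. Under those assumptions the routed experts alone account for about 1.01T parameters, which agrees with the stated total. The sizes quoted in the post above are rounder figures obtained by dividing the 32B activated parameters by 8.

```python
# Values copied from the table above.
HIDDEN = 7168            # "Attention Hidden Dimension", assumed to be the model width
EXPERT_HIDDEN = 2048     # MoE hidden dimension per expert
LAYERS = 61
DENSE_LAYERS = 1
EXPERTS = 384
ACTIVE_EXPERTS = 8
ACTIVATED_PARAMS = 32e9  # "32B", a rounded figure

MOE_LAYERS = LAYERS - DENSE_LAYERS

# Assumption: a SwiGLU expert is three HIDDEN x EXPERT_HIDDEN matrices, no biases.
per_expert_per_layer = 3 * HIDDEN * EXPERT_HIDDEN
per_expert = per_expert_per_layer * MOE_LAYERS   # one expert slot across all MoE layers
all_routed = per_expert * EXPERTS

# Everything that is active regardless of routing (attention, shared expert, ...),
# estimated from the rounded activated-parameter count.
always_on = ACTIVATED_PARAMS - ACTIVE_EXPERTS * per_expert


def pruned_size(experts_kept):
    """Estimated parameter count when keeping `experts_kept` experts per layer."""
    return always_on + experts_kept * per_expert


print(f"per expert: {per_expert / 1e9:.2f}B")
print(f"all routed experts: {all_routed / 1e12:.3f}T")
for kept in (1, 8, 16):
    print(f"{kept:>2} experts kept: {pruned_size(kept) / 1e9:.1f}B")
```

By this estimate, keeping 16 experts per layer lands at roughly 53B parameters, because the always-active part does not shrink with the number of experts.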
@nicoboss That's right, thank you for reminding me about the architecture and pointing me to pruning. I would definitely prefer to experiment with cost-effective techniques that wouldn't require training a new model, despite my interest in pretraining datasets.
I'll have to look at existing publications on fingerprinting models and see if there are any pruning tools that could be used on Kimi-K2. Interesting fingerprint visualization for gpt-oss: https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/index.html
Maybe this could work as a pipeline: https://github.com/CASE-Lab-UMD/Unified-MoE-Compression
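One simple way to decide which experts to keep is to count how often the router picks each expert on text from the target use case and keep the most frequent ones. The sketch below is only that frequency heuristic, not the method of the linked tools; `experts_to_keep` and the toy routing trace are hypothetical.

```python
from collections import Counter


def experts_to_keep(routing_trace, keep):
    """Pick the `keep` most frequently selected experts for one layer.

    routing_trace: iterable of per-token lists of selected expert indices.
    Returns the chosen expert indices in ascending order.
    """
    counts = Counter()
    for selected in routing_trace:
        counts.update(selected)
    return sorted(index for index, _ in counts.most_common(keep))


# Toy trace: four tokens, two experts selected per token.
trace = [[3, 7], [3, 1], [7, 3], [5, 3]]
print(experts_to_keep(trace, keep=2))  # [3, 7]
```

A real pipeline would collect such a trace per layer, since each MoE layer has its own router and its own set of experts.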