Improve model card: Add metadata, links, and usage example
This PR enhances the model card for this `GuidedQuant` model by:
- Adding `pipeline_tag: text-generation` and `library_name: transformers` to the metadata, improving discoverability and enabling Hub features.
- Correcting the `license` to `mit` based on the official GitHub repository's explicit declaration.
- Adding `tags` (`quantization`, `text-generation`, `llm`, `qwen`, `3bit`) to further categorize the model.
- Providing a concise overview based on the paper's abstract.
- Including explicit links to the Hugging Face Papers page, the original arXiv paper, the official project page, and the GitHub repository for comprehensive information.
- Integrating a practical Python code snippet for quick inference, allowing users to easily get started with the model using the `AnyPrecisionForCausalLM` class.
- Adding the full "Acknowledgement" and "Citation" information from the project's GitHub README.
These updates aim to make the model card more accessible, informative, and consistent with Hugging Face Hub conventions. The updated model card reads as follows:
---
base_model:
- Qwen/Qwen3-32B
license: mit
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
tags:
- quantization
- text-generation
- llm
- qwen
- 3bit
---

# GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

**GuidedQuant** is a post-training quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. This addresses two limitations of existing methods: ignoring the varying importance of hidden features to the end loss, and discarding crucial interactions between weights. GuidedQuant consistently boosts the performance of state-of-the-art methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. The project also introduces LNQ, a non-uniform scalar quantization algorithm that is guaranteed to monotonically decrease the quantization objective value and outperforms existing methods in this category.
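
As a rough sketch of what the "guidance" means here (our notation, distilled from the abstract rather than copied from the paper), the layer-wise reconstruction objective is weighted by end-loss gradient information, with one block per output channel:

$$
\min_{\widehat{w}_c}\; (\widehat{w}_c - w_c)^\top H_c\, (\widehat{w}_c - w_c),
\qquad
H_c \;\approx\; \sum_{n} g_{n,c}^{2}\, x_n x_n^{\top},
$$

where $w_c$ is the row of weights producing output channel $c$, $x_n$ are calibration inputs to the layer, and $g_{n,c}$ is the gradient of the end loss with respect to that channel's output on sample $n$. This is only a schematic reading; see the paper for the exact formulation.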

* **Paper (Hugging Face Papers)**: [GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)
* **Paper (arXiv)**: [https://arxiv.org/abs/2505.07004](https://arxiv.org/abs/2505.07004)
* **Project Page**: [https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)
* **Code (GitHub)**: [https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)

---

## Model Card Details

This model is a 3-bit quantized version of `Qwen/Qwen3-32B`, produced with the SqueezeLLM method from the GuidedQuant project; a rough memory estimate follows the list below.

- Base model: `Qwen/Qwen3-32B`
- Quantization method: SqueezeLLM
- Target bit-width: 3
- Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
- Calibration data: RedPajama (1024 sentences / 4096 tokens)
- Calibration objective: Next-token prediction
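
For intuition on why a 3-bit 32B model fits on a single 24 GB GPU, here is a back-of-envelope estimate (our approximation; it ignores embeddings, per-channel scales/lookup tables, and runtime KV-cache/activation memory):

```python
# Rough weight-memory estimate for 3-bit quantization (approximation only).
n_params = 32.8e9        # approximate parameter count of Qwen3-32B
bits_per_weight = 3

quantized_gb = n_params * bits_per_weight / 8 / 1e9
bf16_gb = n_params * 16 / 8 / 1e9

print(f"3-bit weights: ~{quantized_gb:.1f} GB")   # ~12.3 GB -> fits in 24 GB
print(f"bf16 weights:  ~{bf16_gb:.1f} GB")        # ~65.6 GB -> needs multi-GPU
```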

## Quick Start / How to run

You can load and test this quantized model using the `AnyPrecisionForCausalLM` class from the `any_precision` library, which integrates with Hugging Face `transformers` components. The example below runs efficiently on a single RTX 3090.

First, install the necessary libraries, including `transformers` and the `ap-gemv` kernel. Detailed installation instructions are available in the [GitHub repository](https://github.com/snu-mllab/GuidedQuant).

```bash
pip install transformers torch accelerate
# Install the ap-gemv kernel (e.g., for CUDA 12.4)
pip install ap-gemv -i https://jusjinuk.me/whl/cu124
```

Then, use the following Python code snippet for inference:

```python
import torch
from any_precision.modules.AnyPrecisionForCausalLM import AnyPrecisionForCausalLM
from transformers import AutoTokenizer, TextStreamer

# This model is Qwen-based, so bfloat16 is typically used.
quantized_model_name = "jusjinuk/Qwen3-32B-3bit-GuidedQuant-SqueezeLLM"
dtype = torch.bfloat16

model = AnyPrecisionForCausalLM.from_quantized(quantized_model_name, torch_dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_name)
streamer = TextStreamer(tokenizer)

prompt = "Write me a short and concise story about a cat who loves to read.\n"
chat = [
    {"role": "system", "content": "You are a helpful assistant.\n"},
    {"role": "user", "content": prompt},
]

inputs = tokenizer.apply_chat_template(
    chat, tokenize=True, return_tensors="pt", add_generation_prompt=True
).to(model.device)

model.generate(
    inputs,
    max_new_tokens=200,
    do_sample=False,
    temperature=1.0,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)
```
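
If you want to sanity-check decoding speed and memory on your own hardware, a minimal timing wrapper around the same `generate` call works (standard PyTorch utilities; this sketch assumes the model sits on a CUDA device, and the numbers will vary with GPU and driver):

```python
import time

torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
out = model.generate(inputs, max_new_tokens=200, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, not the prompt.
new_tokens = out.shape[-1] - inputs.shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s, "
      f"peak GPU memory {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```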

For more detailed instructions, including installation, inference speed-up, and the various quantization methods, please refer to the [GitHub repository](https://github.com/snu-mllab/GuidedQuant).

## Acknowledgement

This code is heavily based on the following repositories:

- [Any-Precision-LLM](https://github.com/SNU-ARC/any-precision-llm)
- [QTIP](https://github.com/Cornell-RelaxML/qtip)
- [SpinQuant](https://github.com/facebookresearch/SpinQuant)
- [AQLM](https://github.com/Vahe1994/AQLM)
- [Fast Hadamard Transform](https://github.com/Dao-AILab/fast-hadamard-transform)
- [gpt-fast](https://github.com/pytorch-labs/gpt-fast)

We thank the authors for their open-source implementations and contributions to the community.

## Citation

Please cite our paper if you find our work useful:

```bibtex
@inproceedings{kim2025guidedquant,
  title={GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance},
  author={Jinuk Kim and Marwa El Halabi and Wonpyo Park and Clemens JS Schaefer and Deokjae Lee and Yeonhong Park and Jae W. Lee and Hyun Oh Song},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025},
}
```