nielsr (HF Staff) committed
Commit e8f2a47 · verified · 1 Parent(s): 00303d4

Improve model card: Add metadata, links, and usage example


This PR significantly enhances the model card for the `GuidedQuant` model by:
- Adding `pipeline_tag: text-generation` and `library_name: transformers` to the metadata, improving discoverability and enabling Hub features.
- Correcting the `license` to `mit` based on the official GitHub repository's explicit declaration.
- Adding a `tags: [quantization]` entry to further categorize the model.
- Providing a concise overview based on the paper's abstract.
- Including explicit links to the Hugging Face Papers page, the original arXiv paper, the official project page, and the GitHub repository for comprehensive information.
- Integrating a practical Python code snippet for quick inference, allowing users to easily get started with the model using the `AnyPrecisionForCausalLM` class.
- Adding the full "Acknowledgement" and "Citation" information from the project's GitHub README.

These updates aim to make the model card more accessible, informative, and useful to the Hugging Face community.

Files changed (1)

README.md +97 -12
README.md CHANGED
@@ -1,21 +1,106 @@
---
base_model:
- Qwen/Qwen3-32B
+ license: mit
base_model_relation: quantized
- license: bigscience-openrail-m
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - quantization
+ - text-generation
+ - llm
+ - qwen
+ - 3bit
---
- # Model Card

- - Base model: `Qwen/Qwen3-32B`
- - Quantization method: SqueezeLLM
- - Target bit-width: 3
- - Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
- - Calibration data: RedPajama (1024 sentences / 4096 tokens)
- - Calibration objective: Next-token prediction

- # How to run
- - Follow the instruction in https://github.com/snu-mllab/GuidedQuant.

- # References
- - [Model Paper](https://arxiv.org/abs/2505.07004)
+
+ # GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
+
+ **GuidedQuant** introduces a novel post-training quantization approach that integrates gradient information from the end loss into the quantization objective while preserving critical cross-weight dependencies within output channels. This method addresses limitations of existing techniques by accounting for the varying importance of hidden features and preserving crucial weight interactions. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, the paper introduces LNQ, a non-uniform scalar quantization algorithm that is guaranteed to monotonically decrease the quantization objective value and outperforms existing methods in this category.
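+
+ Schematically (notation mine, reconstructed from the abstract rather than the paper's exact formulation), the guided objective replaces the plain layer-wise reconstruction error with an end-loss-weighted one, keeping one full Hessian block per output channel so that dependencies among the weights within a channel are preserved:
+
+ $$
+ \min_{\hat{W}} \; \sum_{i} \Delta w_i^{\top} H_i \, \Delta w_i,
+ \qquad
+ H_i \approx \frac{1}{N} \sum_{n=1}^{N} g_{i,n}^{2} \, x_n x_n^{\top},
+ $$
+
+ where $\Delta w_i = w_i - \hat{w}_i$ is the quantization error of output channel $i$, $x_n$ ranges over calibration inputs, and $g_{i,n}$ is the gradient of the end loss with respect to that channel's output.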
+
+ * **Paper (Hugging Face Papers)**: [GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance](https://huggingface.co/papers/2505.07004)
+ * **Paper (arXiv)**: [https://arxiv.org/abs/2505.07004](https://arxiv.org/abs/2505.07004)
+ * **Project Page**: [https://jusjinuk.me/blog/guidedquant/](https://jusjinuk.me/blog/guidedquant/)
+ * **Code (GitHub)**: [https://github.com/snu-mllab/GuidedQuant](https://github.com/snu-mllab/GuidedQuant)
+
+ ---
+
+ ## Model Card Details
+
+ This model is a 3-bit quantized version of `Qwen/Qwen3-32B`, produced with SqueezeLLM as part of the GuidedQuant project.
+
+ - Base model: `Qwen/Qwen3-32B`
+ - Quantization method: SqueezeLLM
+ - Target bit-width: 3
+ - Backend kernel: Any-Precision-LLM kernel (`ap-gemv`)
+ - Calibration data: RedPajama (1024 sentences / 4096 tokens)
+ - Calibration objective: Next-token prediction
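+
+ If you want to confirm these settings yourself, the repository's `config.json` can be inspected without downloading the full weights (a minimal sketch using the standard `huggingface_hub` API; the exact quantization fields stored in the config are repo-specific):
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Fetch only the config file, not the multi-GB weight shards.
+ path = hf_hub_download("jusjinuk/Qwen3-32B-3bit-GuidedQuant-SqueezeLLM", "config.json")
+ with open(path) as f:
+     config = json.load(f)
+
+ # Print the config; look for quantization-related entries.
+ print(json.dumps(config, indent=2)[:2000])
+ ```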
+
+ ## Quick Start / How to run
+
+ You can easily load and test this quantized model using the `AnyPrecisionForCausalLM` class from the `any_precision` library, which integrates with Hugging Face `transformers` components. The example below runs efficiently on a single RTX 3090.
+
+ First, ensure you have the necessary libraries installed, including `transformers` and the `ap-gemv` kernel. You can find detailed installation instructions in the [GitHub repository](https://github.com/snu-mllab/GuidedQuant).
+
+ ```bash
+ pip install transformers torch accelerate
+ # Install ap-gemv kernel (e.g., for CUDA 12.4)
+ pip install ap-gemv -i https://jusjinuk.me/whl/cu124
+ ```
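+
+ The wheel index in the second command is CUDA-version-specific. A quick way to check which CUDA build your local PyTorch uses before choosing the index (`cu124` corresponds to CUDA 12.4):
+
+ ```python
+ import torch
+ # Prints the CUDA version PyTorch was built against, e.g. '12.4'.
+ print(torch.version.cuda)
+ ```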
+
+ Then, use the following Python code snippet for inference:
+
+ ```python
+ from any_precision.modules.AnyPrecisionForCausalLM import AnyPrecisionForCausalLM
+ from transformers import AutoTokenizer, TextStreamer
+ import torch
+
+ # This model is a Qwen-based model, so bfloat16 is typically used.
+ quantized_model_name = "jusjinuk/Qwen3-32B-3bit-GuidedQuant-SqueezeLLM"
+ dtype = torch.bfloat16
+
+ model = AnyPrecisionForCausalLM.from_quantized(quantized_model_name, torch_dtype=dtype)
+ tokenizer = AutoTokenizer.from_pretrained(quantized_model_name)
+ streamer = TextStreamer(tokenizer)
+
+ prompt = "Write me a short and concise story about a cat who loves to read.\n"
+ chat = [
+     {"role": "system", "content": "You are a helpful assistant.\n"},
+     {"role": "user", "content": prompt},
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     chat, tokenize=True, return_tensors="pt", add_generation_prompt=True
+ ).to(model.device)
+
+ model.generate(
+     inputs,
+     max_new_tokens=200,
+     do_sample=False,
+     temperature=1.0,
+     streamer=streamer,
+     pad_token_id=tokenizer.eos_token_id,
+ )
+ ```
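+
+ If you would rather capture the generated text than stream it, here is a minimal variant (assuming `generate` returns token ids exactly as in standard `transformers`):
+
+ ```python
+ output_ids = model.generate(
+     inputs, max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.eos_token_id
+ )
+ # Strip the prompt tokens and decode only the newly generated ones.
+ print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
+ ```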
+
+ For more detailed instructions, including installation, inference speed-up, and various quantization methods, please refer to the [GitHub repository](https://github.com/snu-mllab/GuidedQuant).
83
+
84
+ ## Acknowledgement
85
+ This code is heavily based on the following repositories:
86
+ - [Any-Precision-LLM](https://github.com/SNU-ARC/any-precision-llm)
87
+ - [QTIP](https://github.com/Cornell-RelaxML/qtip)
88
+ - [SpinQuant](https://github.com/facebookresearch/SpinQuant)
89
+ - [AQLM](https://github.com/Vahe1994/AQLM)
90
+ - [Fast Hadamard Transform](https://github.com/Dao-AILab/fast-hadamard-transform)
91
+ - [gpt-fast](https://github.com/pytorch-labs/gpt-fast)
92
+
93
+ We thank the authors for their open-source implementations and contributions to the community.
+
+ ## Citation
+
+ Please cite our paper if you find our work useful:
+
+ ```bibtex
+ @inproceedings{kim2025guidedquant,
+     title={GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance},
+     author={Jinuk Kim and Marwa El Halabi and Wonpyo Park and Clemens JS Schaefer and Deokjae Lee and Yeonhong Park and Jae W. Lee and Hyun Oh Song},
+     booktitle={International Conference on Machine Learning (ICML)},
+     year={2025},
+ }
+ ```