nielsr (HF Staff) committed · verified
Commit 90af402 · 1 Parent(s): 7e0244a

Improve model card: Add pipeline tag, library, paper, code, and usage


This PR enhances the model card for `HRWKV7-hxa079-Qwen3-8B` by:

* **Adding Metadata:** Included `pipeline_tag: text-generation` to ensure better discoverability on the Hugging Face Hub, and `library_name: transformers` to enable the "how to use" widget, as the model is compatible with the Transformers library via `trust_remote_code=True`.
* **Prominent Links:** Added direct links to the associated paper ([RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005)) and the main GitHub repository ([https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)) at the top of the card. The existing code links within the "Thank you" and "Training Code" sections have been clarified.
* **Sample Usage:** Provided a correct `transformers`-based code snippet for text generation, guiding users on how to run inference with the model.
* **Citation:** Added the BibTeX citation from the paper's repository.

These changes improve the model's visibility and usability, and provide more comprehensive information for users and researchers.

Files changed (1)
  1. README.md +82 -39

README.md CHANGED
@@ -1,29 +1,35 @@
---
license: apache-2.0
---
# HRWKV7-hxa079-Qwen3-8B

### Model Description

HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Qwen3-8B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

- - **Developed by:** OpenMOSE
- - **Model type:** Hybrid Linear-Attention Language Model
- - **Language(s):** Multilingual (inherited from Qwen3-8B)
- - **License:** Apache-2.0
- - **Base Model:** Qwen3-8B
- - **Year:** 2025

### Architecture Specifications

- - **Architecture:** RWKV v7 based "hxa079" Architecture + Group Query Attention Hybrid
- - **Total Layers:** 36 layers (L36D4096)
- - 32 RWKV layers (with Rope)
- - 4 GQA layers (No Rope, No Position Embeddings)
- - **Hidden Dimension:** 4096
- - **Training Context Window:** 4096 tokens
- - **Inference Context Window** 16384+
- - **Training Strategy** Following RADLADS method based knowledge distillation

## Technical Innovation

@@ -31,54 +37,91 @@ HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v

The model implements several key improvements over original RWKV architectures:

- 1. **Token Shift Removal**: In order to effectively inherit the teacher model weights, we removed the residual connection one token ago.
- 2. **GroupNorm Removal**: Helps improve training stability issues
- 3. **k_first Introduction**: Experimentally adopted the approach of residually connecting k layers in layer 0.

### Hybrid Design Benefits

- - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KVCache to 1/9 of full GQA.
- - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models
- - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to GQA layers, suggesting that RWKV blocks provide implicit positional encoding capabilities

## Intended Use

This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

- - Research into efficient attention mechanisms
- - Benchmarking hybrid architecture performance
- - Exploring linear attention limitations and solutions
- - Academic and industrial R&D purposes

## Limitations

- - **Experimental Status**: This model is in experimental stages and may exhibit unexpected behaviors
- - **Context Window**: Limited to 4096 tokens during training, though RWKV architecture theoretically supports longer sequences
- - **Performance Variability**: As a hybrid model, performance may vary significantly across different task types

## Training Details

- - **Training Context Window:** 4096 tokens
- - **Training GPU** AMD MI300X x 1(takes 80hrs) Runpod
- - **Training Strategy** 8bit MLP Quant, frozen emb,mlp,head, Deepspeed Stage1, Stage1 100M, Stage2 360M
- - **Base Model Initialization:** Weights initialized from Qwen3-8B
- - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers

## Evaluation

Performance evaluation is ongoing. The model shows promising results in:
- - Maintaining base model capabilities while achieving linear attention efficiency
- - Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- - Competitive performance on standard language modeling benchmarks

## Thank you for Big help :)
- - SmerkyG Inspired by RADLADS (https://arxiv.org/abs/2505.03005)
- - https://github.com/recursal/RADLADS-paper

## Training Code
- - https://github.com/OpenMOSE/RWKVInside (still buggy)
-

## Model Card Contact
 
---
license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
---
+
# HRWKV7-hxa079-Qwen3-8B

+ **Paper:** [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005)
+ **Code:** [https://github.com/recursal/RADLADS](https://github.com/recursal/RADLADS)
+
### Model Description

HRWKV7-Qwen3-8N-Preview is an RNN hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Qwen3-8B foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

+ - **Developed by:** OpenMOSE
+ - **Model type:** Hybrid Linear-Attention Language Model
+ - **Language(s):** Multilingual (inherited from Qwen3-8B)
+ - **License:** Apache-2.0
+ - **Base Model:** Qwen3-8B
+ - **Year:** 2025

### Architecture Specifications

+ - **Architecture:** RWKV v7-based "hxa079" architecture + Group Query Attention hybrid (see the sketch after this list)
+ - **Total Layers:** 36 layers (L36D4096)
+   - 32 RWKV layers (with RoPE)
+   - 4 GQA layers (no RoPE, no position embeddings)
+ - **Hidden Dimension:** 4096
+ - **Training Context Window:** 4096 tokens
+ - **Inference Context Window:** 16384+
+ - **Training Strategy:** Knowledge distillation following the RADLADS method

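The layer layout can be pictured as a simple per-layer type list. The sketch below is illustrative only: the card does not state which of the 36 layers keep GQA, so the indices used here are an assumption (evenly spaced), not the actual placement.

```python
# Illustrative sketch of the hybrid layer layout described above (36 layers, hidden dim 4096).
# NOTE: the actual positions of the 4 GQA layers are not documented in this card;
# the evenly spaced indices below are a hypothetical placement for illustration only.

NUM_LAYERS = 36
ASSUMED_GQA_LAYER_IDS = {8, 17, 26, 35}  # hypothetical placement

layer_types = [
    "gqa" if i in ASSUMED_GQA_LAYER_IDS else "rwkv7"
    for i in range(NUM_LAYERS)
]

# Matches the counts stated in the spec: 32 RWKV layers + 4 GQA layers.
assert layer_types.count("rwkv7") == 32
assert layer_types.count("gqa") == 4

# Print a compact map: R = RWKV block, G = GQA block.
print("".join("G" if t == "gqa" else "R" for t in layer_types))
```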
 
## Technical Innovation

The model implements several key improvements over original RWKV architectures:

+ 1. **Token Shift Removal**: The previous-token residual connection (token shift) is removed so that the teacher model's weights can be inherited more directly.
+ 2. **GroupNorm Removal**: Removing GroupNorm helps mitigate training stability issues.
+ 3. **k_first Introduction**: Experimentally adopts a residual connection from the first layer's k values (k_first).

### Hybrid Design Benefits

+ - **Linear Attention Inference**: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to about 1/9 of a full-GQA model (see the sketch below).
+ - **Enhanced Needle Tasks**: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models.
+ - **Implicit Position Encoding**: Interestingly, the model achieves better performance when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional encoding.

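As a rough sanity check of the 1/9 figure above, the sketch below compares per-token KV-cache size when all 36 layers use GQA versus only the 4 GQA layers of this hybrid. Only the layer counts come from this card; the KV-head count, head dimension, and bf16 cache dtype are assumed values chosen to match typical Qwen3-8B settings.

```python
# Back-of-the-envelope KV-cache comparison for the hybrid design described above.
# ASSUMPTIONS (not stated in this card): 8 KV heads, head_dim 128, bf16 (2 bytes) cache.
# Only the layer counts (36 total, 4 GQA) are taken from the card.

TOTAL_LAYERS = 36
GQA_LAYERS = 4
KV_HEADS = 8        # assumed
HEAD_DIM = 128      # assumed
BYTES_PER_ELEM = 2  # bf16

def kv_cache_bytes_per_token(num_attention_layers: int) -> int:
    # Factor of 2 accounts for storing both keys and values.
    return num_attention_layers * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

full_gqa = kv_cache_bytes_per_token(TOTAL_LAYERS)
hybrid = kv_cache_bytes_per_token(GQA_LAYERS)

print(f"full GQA : {full_gqa / 1024:.1f} KiB per token")
print(f"hybrid   : {hybrid / 1024:.1f} KiB per token")
print(f"ratio    : {hybrid / full_gqa:.3f} (= {GQA_LAYERS}/{TOTAL_LAYERS}, about 1/9)")
# The RWKV layers keep a fixed-size recurrent state instead, so their memory
# cost does not grow with sequence length.
```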
 
## Intended Use

This is an **experimental research model** designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

+ - Research into efficient attention mechanisms
+ - Benchmarking hybrid architecture performance
+ - Exploring linear attention limitations and solutions
+ - Academic and industrial R&D purposes

## Limitations

+ - **Experimental Status**: This model is in an experimental stage and may exhibit unexpected behaviors.
+ - **Context Window**: Limited to 4096 tokens during training, though the RWKV architecture theoretically supports longer sequences.
+ - **Performance Variability**: As a hybrid model, performance may vary significantly across different task types.

## Training Details

+ - **Training Context Window:** 4096 tokens
+ - **Training GPU:** 1x AMD MI300X (about 80 hours) on RunPod
+ - **Training Strategy:** 8-bit MLP quantization; frozen embeddings, MLP, and head (see the sketch below); DeepSpeed Stage 1; Stage 1: 100M, Stage 2: 360M
+ - **Base Model Initialization:** Weights initialized from Qwen3-8B
+ - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for the strategically placed GQA layers

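A minimal sketch of the "frozen embeddings, MLP, and head" part of the strategy, assuming a standard `transformers` causal-LM parameter naming scheme (`embed_tokens`, `mlp`, `lm_head`); the actual training code lives in the RWKVInside repository linked below.

```python
import torch.nn as nn

def freeze_inherited_modules(model: nn.Module) -> None:
    """Freeze teacher-inherited embeddings, MLPs, and the LM head so that only
    the newly inserted mixing blocks receive gradient updates.

    NOTE: the substrings below assume Qwen3-style parameter names
    (embed_tokens, mlp, lm_head); adjust them to match the real training code.
    """
    frozen_markers = ("embed_tokens", ".mlp.", "lm_head")
    for name, param in model.named_parameters():
        if any(marker in name for marker in frozen_markers):
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")
```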
 
## Evaluation

Performance evaluation is ongoing. The model shows promising results in:
+ - Maintaining base model capabilities while achieving linear attention efficiency
+ - Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
+ - Competitive performance on standard language modeling benchmarks
+
+ ## Sample Usage
+
+ You can use this model with the Hugging Face `transformers` library. Ensure you set `trust_remote_code=True`, as the model uses a custom architecture.

+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_name = "OpenMOSE/HRWKV7-hxa079-Qwen3-8B"  # the name of this model
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,  # or torch.float16, depending on your hardware
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+ prompt = "Tell me a short story about a brave knight named Sir Reginald."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```

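Because the model is converted from Qwen3-8B, the tokenizer will usually carry over Qwen3's chat template. If it does, chat-style prompts can be built with `apply_chat_template`; the snippet below is a sketch that reuses `model` and `tokenizer` from the example above and assumes the template is present.

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain in two sentences what a hybrid RWKV/GQA model is."},
]

# Build input ids with the (assumed) Qwen3 chat template.
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=200)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```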
 
## Thank you for Big help :)
+ - SmerkyG, whose RADLADS work inspired this model ([https://arxiv.org/abs/2505.03005](https://arxiv.org/abs/2505.03005))
+ - [https://github.com/recursal/RADLADS-paper](https://github.com/recursal/RADLADS-paper) (the primary repository for the RADLADS paper's code)

## Training Code
+ - [https://github.com/OpenMOSE/RWKVInside](https://github.com/OpenMOSE/RWKVInside) (training code for this model variant; still buggy)
+
+ ## Citation
+
+ If you use this model or find our work valuable, please consider citing the RADLADS paper:
+
+ ```bibtex
+ @misc{goldstein2025radladsrapidattentiondistillation,
+   title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
+   author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
+   year={2025},
+   eprint={2505.03005},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2505.03005},
+ }
+ ```

## Model Card Contact