HRWKV7-hxa079-Qwen3-8B

Model Description

HRWKV7-Qwen3-8B-Preview is a hybrid RNN model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built on Qwen3-8B, it replaces most Transformer attention blocks with RWKV blocks while keeping a small number of GQA layers to preserve performance on tasks where pure linear attention struggles, such as retrieval.

Developed by: OpenMOSE
Model type: Hybrid Linear-Attention Language Model
Language(s): Multilingual (inherited from Qwen3-8B)
License: Apache-2.0
Base Model: Qwen3-8B
Year: 2025

Architecture Specifications

Architecture: RWKV v7 based "hxa079" Architecture + Group Query Attention Hybrid
Total Layers: 36 layers (L36D4096)
- 32 RWKV layers (with RoPE)
- 4 GQA layers (no RoPE, no position embeddings)
Hidden Dimension: 4096
Training Context Window: 4096 tokens
Inference Context Window: 16384+ tokens
Training Strategy: Knowledge distillation following the RADLADS method
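
As a rough illustration of the layer split listed above, the following sketch builds a 36-layer plan with 32 RWKV blocks and 4 GQA blocks. The even spacing of the GQA layers and the names `build_layer_plan`, `"rwkv"`, and `"gqa"` are assumptions made for illustration; the actual positions of the GQA layers are not stated in this model card.

```python
from collections import Counter

TOTAL_LAYERS = 36   # from the specification above
NUM_GQA_LAYERS = 4  # from the specification above


def build_layer_plan(total_layers: int, num_gqa: int) -> list[str]:
    """Return one block type per layer.

    ASSUMPTION: GQA layers are spread evenly, each one closing a group of
    total_layers // num_gqa layers. The real model may place them differently.
    """
    if num_gqa <= 0 or total_layers % num_gqa != 0:
        raise ValueError("this sketch needs total_layers divisible by num_gqa")
    stride = total_layers // num_gqa
    return [
        "gqa" if (i + 1) % stride == 0 else "rwkv"
        for i in range(total_layers)
    ]


layer_plan = build_layer_plan(TOTAL_LAYERS, NUM_GQA_LAYERS)
counts = Counter(layer_plan)
print(counts["rwkv"], counts["gqa"])  # 32 4
```

Whatever the real placement is, the counts are what matter for the memory figures: only the GQA entries in such a plan need a KV cache.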

Technical Innovation

RWKV "hxa079" Architecture

The model implements several key improvements over original RWKV architectures:

Token Shift Removal: To inherit the teacher model's weights effectively, we removed token shift, the mixing of each token's input with the previous token's input.
GroupNorm Removal: Mitigates training stability issues.
k_first Introduction: As an experiment, the k (key) from layer 0 is carried to later layers through a residual connection.
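
To make the token shift change concrete, here is a minimal sketch of what token shift does and what removing it means. It uses a single scalar mixing weight `mu` for readability, whereas RWKV uses learned per-channel weights; the function names are hypothetical and this is not the model's actual code.

```python
def token_shift(xs: list[list[float]], mu: float) -> list[list[float]]:
    """Blend each token vector with the previous one: x_t + mu * (x_{t-1} - x_t).

    ASSUMPTION: a scalar mu and a zero vector before the first token,
    chosen only to keep the example small.
    """
    out = []
    prev = [0.0] * len(xs[0])  # stands in for the token before position 0
    for x in xs:
        out.append([xi + mu * (pi - xi) for xi, pi in zip(x, prev)])
        prev = x
    return out


def no_token_shift(xs: list[list[float]]) -> list[list[float]]:
    """With token shift removed, each projection sees only the current token."""
    return [list(x) for x in xs]


tokens = [[1.0, 2.0], [3.0, 4.0]]
print(token_shift(tokens, 0.5))  # [[0.5, 1.0], [2.0, 3.0]]
print(no_token_shift(tokens))    # [[1.0, 2.0], [3.0, 4.0]]
```

One plausible reading of the motivation: without the shift, the RWKV projections receive the same per-token input that the Transformer teacher's attention projections do, so the inherited weights operate on a familiar distribution.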

Hybrid Design Benefits

Linear Attention Inference: RWKV blocks keep a fixed-size state during inference, so their memory use is O(1) in sequence length. Only the 4 GQA layers (out of 36) keep a KV cache, which reduces the KV cache to 1/9 of a full-GQA model's.
Enhanced Needle Tasks: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models.
Implicit Position Encoding: Interestingly, the model performs better when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional information.
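
The 1/9 figure follows from the layer counts alone, and a short calculation shows what it could mean in bytes. The layer counts and context length come from this model card; the number of KV heads (8), head dimension (128), and 2-byte cache values are assumptions about the Qwen3-8B configuration and cache dtype, not figures stated here.

```python
from fractions import Fraction

# From this model card
TOTAL_LAYERS = 36
GQA_LAYERS = 4
CONTEXT_TOKENS = 16384

# ASSUMPTIONS (not stated in this model card)
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # e.g. bf16 or fp16


def kv_cache_bytes(num_attention_layers: int, tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    per_token_per_layer = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return num_attention_layers * tokens * per_token_per_layer


full = kv_cache_bytes(TOTAL_LAYERS, CONTEXT_TOKENS)
hybrid = kv_cache_bytes(GQA_LAYERS, CONTEXT_TOKENS)
ratio = Fraction(hybrid, full)
print(full / 2**30, hybrid / 2**30, ratio)  # 2.25 0.25 1/9
```

The ratio does not depend on the assumed head sizes, only on 4 of 36 layers keeping a cache. The fixed-size recurrent state of the 32 RWKV layers is not counted in this estimate.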

Intended Use

This is an experimental research model designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

Research into efficient attention mechanisms
Benchmarking hybrid architecture performance
Exploring linear attention limitations and solutions
Academic and industrial R&D purposes

Limitations

Experimental Status: This model is in experimental stages and may exhibit unexpected behaviors
Context Window: Limited to 4096 tokens during training, though RWKV architecture theoretically supports longer sequences
Performance Variability: As a hybrid model, performance may vary significantly across different task types

Training Details

Training Context Window: 4096 tokens
Training GPU: 1 x AMD MI300X on Runpod (80 hours)
Training Strategy: 8-bit MLP quantization; frozen embedding, MLP, and head weights; DeepSpeed ZeRO Stage 1; training Stage 1: 100M, Stage 2: 360M
Base Model Initialization: Weights initialized from Qwen3-8B
Architecture Conversion: Transformer attention blocks systematically replaced with RWKV blocks, except for 4 strategically placed GQA layers
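
For readers unfamiliar with knowledge distillation, the sketch below shows a generic logit-matching loss: the KL divergence from the teacher's next-token distribution to the student's. This model card does not state the exact losses used, so this is a textbook formulation under that assumption and not necessarily the RADLADS objective; `kd_loss` and the example logits are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: list[float], student_logits: list[float]) -> float:
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up logits over a three-token vocabulary
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
loss = kd_loss(teacher, student)
print(loss >= 0.0)  # True; the loss is zero only when the distributions match
```

In a setup like the one described above, only the unfrozen parts of the student (the new RWKV blocks) would receive gradients from such a loss, while the embedding, MLP, and head weights stay fixed.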

Evaluation

Performance evaluation is ongoing. The model shows promising results in:

Maintaining base model capabilities while achieving linear attention efficiency
Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
Competitive performance on standard language modeling benchmarks

Thank you for the big help :)

SmerkyG: this model is inspired by RADLADS (https://arxiv.org/abs/2505.03005)
https://github.com/recursal/RADLADS-paper

Training Code

https://github.com/OpenMOSE/RWKVInside (still buggy)

Model Card Contact

OpenMOSE - 2025

Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.