RWKV-Reka-3.1-Flash

I'm simply exploring the possibility of linearizing existing Transformer models. It's still far from perfect, but I hope you'll bear with me as I continue this journey. :)
Model Description
RWKV-Reka-3.1-Flash is a hybrid RNN/Transformer model that combines the RWKV v7 linear attention mechanism with Group Query Attention (GQA) layers. Built on the Reka-flash3.1 21B foundation, it replaces most Transformer attention blocks with RWKV blocks while strategically retaining a few GQA layers to improve performance on specific tasks.
- Developed by: OpenMOSE
- Model type: Hybrid Linear-Attention Language Model
- Language(s): Multilingual (inherited from Reka-flash3.1 21B)
- License: Apache-2.0
- Base Model: Reka-flash3.1 21B (https://huggingface.co/RekaAI/reka-flash-3.1)
- Year: 2025
Architecture Specifications
- Architecture: RWKV v7-based "hxa079" architecture + Group Query Attention hybrid (the layer layout is sketched after this list)
- Total Layers: 44 layers (L44D6144)
- 38 RWKV layers (with RoPE)
- 6 GQA layers (no RoPE, no position embeddings)
- Hidden Dimension: 6144
- Training Context Window: 8192 tokens
- Inference Context Window: 40,000+ tokens
- Training Strategy: knowledge distillation following the RADLADS method
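
For orientation, the layer layout can be pictured as a small configuration object. In the sketch below, only the 44-layer / 38+6 split, the 6144 hidden dimension, and the 8192-token training context come from this card; the specific GQA layer indices and all names are illustrative assumptions.

```python
# Illustrative layer-layout sketch (GQA layer indices are assumptions,
# not the released placement; only the 38/6 split and dimensions are from the card).
from dataclasses import dataclass

@dataclass
class HybridLayoutSketch:
    n_layer: int = 44                 # L44
    hidden_size: int = 6144           # D6144
    train_ctx: int = 8192             # training context window
    gqa_layers: tuple = (6, 13, 20, 27, 34, 41)  # hypothetical placement of the 6 GQA blocks

    def block_type(self, layer_id: int) -> str:
        return "gqa_nope" if layer_id in self.gqa_layers else "rwkv_hxa079"

cfg = HybridLayoutSketch()
layout = [cfg.block_type(i) for i in range(cfg.n_layer)]
print(layout.count("rwkv_hxa079"), "RWKV blocks,", layout.count("gqa_nope"), "GQA blocks")
# 38 RWKV blocks, 6 GQA blocks
```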
Technical Innovation
RWKV "hxa079" Architecture
The model implements several key improvements over original RWKV architectures:
- Token Shift Removal: the token shift (mixing with the previous token's hidden state) is removed so that the teacher model's weights can be inherited more directly.
- GroupNorm Removal: removing GroupNorm helps with training stability.
- k_first Introduction: experimentally adds a residual connection that carries the key (k) from layer 0 into later layers. A schematic sketch of these changes follows this list.
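
To make the three changes above concrete, here is a schematic (and heavily simplified) time-mix block. The module, parameter, and gate names are hypothetical and the RWKV v7 recurrence is replaced by a placeholder, so treat this as an illustration of where the modifications sit, not as the actual hxa079 code.

```python
# Schematic illustration of the hxa079 changes (hypothetical module, not the
# actual implementation). x has shape (batch, seq, hidden).
import torch
import torch.nn as nn

class HXA079TimeMixSketch(nn.Module):
    def __init__(self, hidden: int, layer_id: int):
        super().__init__()
        self.layer_id = layer_id
        self.r_proj = nn.Linear(hidden, hidden, bias=False)
        self.k_proj = nn.Linear(hidden, hidden, bias=False)
        self.v_proj = nn.Linear(hidden, hidden, bias=False)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)
        # "k_first": learned gate for residually mixing in layer 0's key.
        self.k_first_gate = nn.Parameter(torch.zeros(hidden))

    def forward(self, x, k_first=None):
        # 1) Token shift removed: r/k/v are projected from x directly, with no
        #    interpolation toward the previous token's hidden state, so the
        #    teacher's attention projections carry over more cleanly.
        r, k, v = self.r_proj(x), self.k_proj(x), self.v_proj(x)

        # 3) k_first: layer 0 exports its key; later layers residually blend it in.
        if self.layer_id == 0:
            k_first = k
        else:
            k = k + self.k_first_gate * k_first

        # ... the RWKV v7 linear-attention recurrence over (r, k, v) goes here ...
        wkv = v  # placeholder standing in for the recurrence output

        # 2) GroupNorm removed: the output projection is applied directly,
        #    without a per-head GroupNorm on the recurrence output.
        return self.o_proj(r * wkv), k_first

# Usage: layer 0 produces k_first, later layers consume it.
x = torch.randn(1, 8, 64)
layer0 = HXA079TimeMixSketch(64, layer_id=0)
layer1 = HXA079TimeMixSketch(64, layer_id=1)
y0, k_first = layer0(x)
y1, _ = layer1(x, k_first=k_first)
```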
Hybrid Design Benefits
- Linear Attention Inference: RWKV blocks enable O(1) memory complexity during inference, and the hybrid approach reduces the KV cache to roughly 1/7 of a full-GQA model (a quick arithmetic check follows this list).
- Enhanced Needle Tasks: Strategic placement of GQA layers significantly improves performance on needle-in-a-haystack retrieval tasks, addressing a known limitation of pure linear attention models.
- Implicit Position Encoding: Interestingly, the model performs better when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional encoding.
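
As a rough check of the KV-cache figure: only the 6 GQA layers keep a cache that grows with sequence length, while the 38 RWKV layers keep a fixed-size recurrent state, so the growing cache is about 6/44 ≈ 1/7.3 of a comparable all-GQA stack (assuming equal per-layer KV sizes).

```python
# Back-of-the-envelope KV-cache comparison (assumes equal per-layer KV size).
total_layers = 44
gqa_layers = 6                      # only these keep a length-dependent KV cache
full_gqa_cache = total_layers       # relative units: one cache per attention layer
hybrid_cache = gqa_layers
print(f"hybrid / full-GQA KV cache: {hybrid_cache / full_gqa_cache:.3f} (~1/{full_gqa_cache / hybrid_cache:.1f})")
# hybrid / full-GQA KV cache: 0.136 (~1/7.3)
```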
Intended Use
This is an experimental research model designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:
- Research into efficient attention mechanisms
- Benchmarking hybrid architecture performance
- Exploring linear attention limitations and solutions
- Academic and industrial R&D purposes
Limitations
- Experimental Status: This model is in experimental stages and may exhibit unexpected behaviors
- Context Window: Limited to 8192 tokens during training, though the RWKV architecture theoretically supports longer sequences
- Performance Variability: As a hybrid model, performance may vary significantly across different task types
Training Details
- Training Context Window: 8192 tokens
- Training Hardware: AMD Instinct MI300X (about 290 hours) on the AMD Developer Cloud (thank you for the credit support)
- Training Strategy: 8-bit MLP quantization; frozen embedding, MLP, and head weights; DeepSpeed ZeRO Stage 1; distillation stages: Stage 1: 100M, Stage 2 (ctx 4096): 360M, Stage 3 (ctx 8192): 300M
- Training Stage: Stage 3 uses knowledge distillation with stepped temperature reduction (final Stage 3 temperature = 0.7); a sketch of the distillation loss follows this list
- Base Model Initialization: Weights initialized from Reka-flash3.1 21B
- Architecture Conversion: Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
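
As an illustration of the Stage 3 objective, the sketch below shows a standard temperature-scaled distillation loss with a stepped temperature schedule ending at 0.7. Only the final temperature comes from this card; the intermediate values, shapes, and function names are assumptions.

```python
# Sketch of temperature-scaled knowledge distillation with a stepped schedule.
# Only the final temperature (0.7) is from the card; the rest is assumed.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float):
    """KL divergence between temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def stepped_temperature(step: int, total_steps: int) -> float:
    # Hypothetical stepped schedule: 1.0 -> 0.9 -> 0.8 -> 0.7 (final value from the card).
    schedule = [1.0, 0.9, 0.8, 0.7]
    idx = min(step * len(schedule) // max(total_steps, 1), len(schedule) - 1)
    return schedule[idx]

# Usage with dummy logits (shapes are illustrative):
student = torch.randn(2, 16, 32000)   # (batch, seq, vocab)
teacher = torch.randn_like(student)
loss = kd_loss(student, teacher, stepped_temperature(step=900, total_steps=1000))
print(float(loss))
```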
Evaluation
Performance evaluation is ongoing. The model shows promising results in:
- Maintaining base model capabilities while achieving linear attention efficiency
- Significantly improved needle-in-haystack task performance compared to pure RWKV architectures
- Competitive performance on standard language modeling benchmarks
Usage with RWKV-Infer
- RWKV-Infer is a Triton-based hybrid RWKV inference engine; instructions for running hxa079 models can be found at: https://github.com/OpenMOSE/RWKV-Infer/wiki/How-to-Running-RWKV-hxa079-models%3F (a hedged request example is sketched below)
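
Assuming RWKV-Infer is serving the model behind an OpenAI-compatible HTTP endpoint (an assumption here; see the wiki page above for the actual loading and serving steps), a request might look like the following. The port, path, and model name are illustrative.

```python
# Hedged example: querying a locally running RWKV-Infer server, assuming it exposes
# an OpenAI-compatible chat endpoint (endpoint path, port, and model name are assumptions).
import requests

resp = requests.post(
    "http://127.0.0.1:9000/v1/chat/completions",
    json={
        "model": "RWKV-Reka-3.1-Flash",  # illustrative model name
        "messages": [{"role": "user", "content": "Summarize the hxa079 architecture."}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```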
Usage with Hugging Face Transformers
Currently in development. Stay tuned :)
Code Repositories
- RADLADS Project Code: The main codebase for the RADLADS paper, including conversion scripts and model code, can be found at: https://github.com/recursal/RADLADS
- ARWKV Project Code: The original ARWKV training code can be found at: https://github.com/yynil/RWKVInside
- Specific Training Code (OpenMOSE): The training code for this particular RWKV-Reka-3.1-Flash model is available at: https://github.com/OpenMOSE/RWKVInside (Note: this repository is still under development and may contain bugs.)
Model Card Contact
OpenMOSE - 2025