HRWKV7-hxa079-Qwen3-14B

Model Description

HRWKV7-hxa079-Qwen3-14B is an experimental hybrid architecture model that combines RWKV v7's linear attention mechanism with Group Query Attention (GQA) layers. Built upon the Qwen3-14B Instruct foundation, this model replaces most Transformer attention blocks with RWKV blocks while strategically maintaining some GQA layers to enhance performance on specific tasks.

  • Developed by: OpenMOSE
  • Model type: Hybrid Linear-Attention Language Model
  • Language(s): Multilingual (inherited from Qwen3-14B)
  • License: Apache-2.0
  • Base Model: Qwen3-14B Instruct
  • Year: 2025

Architecture Specifications

  • Architecture: RWKV v7-based "hxa079" architecture + Group Query Attention (GQA) hybrid
  • Total Layers: 40 layers (L40D5120); an illustrative layout sketch follows this list
    • 34 RWKV layers (with RoPE)
    • 6 GQA layers (no RoPE, no position embeddings)
  • Hidden Dimension: 5120
  • Training Context Window: 4096 tokens
  • Inference Context Window: 16384 tokens (100% NIAH)
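
As a reading aid, here is a minimal sketch of the layer mix described above. The total layer count and the 34/6 split come from this card; the specific GQA layer indices are placeholders chosen for illustration, since the actual positions are not published here.

```python
# Hypothetical layer-layout sketch. Counts come from this card; the GQA indices
# below are illustrative placeholders, not the real configuration.
NUM_LAYERS = 40
HIDDEN_DIM = 5120
GQA_LAYER_IDS = {5, 11, 17, 23, 29, 35}  # assumed: spread evenly through the stack

layer_types = ["gqa" if i in GQA_LAYER_IDS else "rwkv7" for i in range(NUM_LAYERS)]

assert layer_types.count("rwkv7") == 34
assert layer_types.count("gqa") == 6
```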

Technical Innovation

RWKV "hxa079" Architecture

The model implements several key improvements over standard RWKV architectures:

  1. Token Shift Removal: Unlike traditional RWKV, the hxa079 variant removes the token-shift mechanism (sketched below)
  2. GroupNorm Removal: Removes GroupNorm layers to improve training stability
  3. k_first Introduction: Adds a novel k_first mechanism optimized for attention conversion
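
To make item 1 concrete, the sketch below contrasts a classic RWKV-style token shift with the hxa079 variant, which feeds the raw hidden state into the block. This is a toy illustration only; the full RWKV-7 time-mix and the k_first mechanism are not reproduced here.

```python
import torch
import torch.nn.functional as F

def rwkv_token_shift_mix(x, mu):
    """Classic RWKV-style token shift: blend each token with its predecessor."""
    x_prev = F.pad(x, (0, 0, 1, -1))  # shift the sequence right by one position
    return x * mu + x_prev * (1.0 - mu)

def hxa079_mix(x):
    """hxa079 sketch: token shift removed, the block consumes the raw hidden state."""
    return x

x = torch.randn(1, 8, 5120)  # [batch, seq, hidden]
mu = torch.rand(5120)        # per-channel mixing coefficients
assert rwkv_token_shift_mix(x, mu).shape == hxa079_mix(x).shape
```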

Hybrid Design Benefits

  • Linear Attention Inference: RWKV blocks keep memory constant (O(1) in sequence length) during inference; see the sketch after this list
  • Enhanced Needle Tasks: Strategic placement of GQA layers significantly improves performance on needle-in-haystack retrieval tasks, addressing a known limitation of pure linear attention models
  • Implicit Position Encoding: The model performs better when RoPE (Rotary Position Embedding) is not applied to the GQA layers, suggesting that the RWKV blocks provide implicit positional information
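
The toy decoding loop below illustrates the first bullet: RWKV layers carry a fixed-size recurrent state, while the retained GQA layers keep a KV cache that grows with the sequence. The head count, head dimension, and simplified state update (RWKV-7's decay and bonus terms are omitted) are illustrative assumptions, not the model's actual hyperparameters.

```python
import torch

N_HEADS, D_HEAD = 40, 128  # illustrative sizes, not confirmed model hyperparameters

rwkv_state = torch.zeros(N_HEADS, D_HEAD, D_HEAD)  # fixed-size recurrent state
kv_keys, kv_values = [], []                        # GQA cache grows per token

def decode_step(k, v):
    global rwkv_state
    # Highly simplified linear-attention update; RWKV-7's decay/bonus terms omitted.
    rwkv_state = rwkv_state + torch.einsum("hd,he->hde", k, v)
    kv_keys.append(k)
    kv_values.append(v)

for _ in range(16):
    decode_step(torch.randn(N_HEADS, D_HEAD), torch.randn(N_HEADS, D_HEAD))

print(rwkv_state.shape)  # torch.Size([40, 128, 128]): constant in sequence length
print(len(kv_keys))      # 16: grows with the number of decoded tokens
```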

Intended Use

This is an experimental research model designed to explore hybrid architectures combining linear and quadratic attention mechanisms. It is intended for:

  • Research into efficient attention mechanisms
  • Benchmarking hybrid architecture performance
  • Exploring linear attention limitations and solutions
  • Academic and industrial R&D purposes
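
A minimal usage sketch for research evaluation follows. It assumes the checkpoint is hosted on the Hugging Face Hub as OpenMOSE/HRWKV7-hxa079-Qwen3-14B and ships custom modeling code loadable with trust_remote_code; both the repo id and that flag are assumptions rather than facts stated in this card.

```python
# Minimal usage sketch. The repo id and the need for trust_remote_code are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "OpenMOSE/HRWKV7-hxa079-Qwen3-14B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize linear attention in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```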

Limitations

  • Experimental Status: This model is in experimental stages and may exhibit unexpected behaviors
  • Context Window: Limited to 4096 tokens during training, though the RWKV architecture theoretically supports longer sequences
  • Performance Variability: As a hybrid model, performance may vary significantly across different task types

Training Details

  • Training Context Window: 4096 tokens
  • Base Model Initialization: Weights initialized from Qwen3-14B Instruct
  • Architecture Conversion: Transformer attention blocks systematically replaced with RWKV blocks, except for the 6 strategically placed GQA layers; a rough sketch of the weight-reuse step follows
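
The conversion recipe itself is not documented in this card; the following is a rough, hypothetical sketch of the weight-reuse step. Parameter names follow the usual Qwen/Llama state-dict layout, and the rule of dropping attention weights only for layers that become RWKV blocks is an assumption.

```python
import re

def build_hybrid_state_dict(qwen_state_dict, gqa_layer_ids):
    """Keep every non-attention weight; keep attention weights only for layers
    that remain GQA. Layers converted to RWKV get their parameters initialized
    by separate conversion code, which is not shown here."""
    hybrid = {}
    for name, tensor in qwen_state_dict.items():
        match = re.search(r"layers\.(\d+)\.self_attn\.", name)
        if match and int(match.group(1)) not in gqa_layer_ids:
            continue  # attention weights of a layer being replaced by an RWKV block
        hybrid[name] = tensor
    return hybrid
```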

Evaluation

Performance evaluation is ongoing. The model shows promising results in:

  • Maintaining base model capabilities while achieving linear attention efficiency
  • Significantly improved needle-in-haystack task performance compared to pure RWKV architectures (an illustrative NIAH probe is sketched below)
  • Competitive performance on standard language modeling benchmarks
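
For context on the needle-in-haystack results, here is an illustrative probe of the kind such evaluations typically use. The filler text, the needle, and the pass criterion are assumptions, not the authors' evaluation harness.

```python
import random

def build_niah_prompt(n_filler_sentences=1800, seed=0):
    """Hide one 'needle' fact inside long filler text and ask the model to recall it."""
    random.seed(seed)
    sentences = ["The quick brown fox jumps over the lazy dog"] * n_filler_sentences
    needle = "The secret passcode is 7r2-X9"
    sentences.insert(random.randrange(len(sentences)), needle)
    context = ". ".join(sentences) + "."
    question = "What is the secret passcode mentioned in the text above?"
    return f"{context}\n\n{question}", "7r2-X9"

prompt, expected = build_niah_prompt()
# A generation counts as a retrieval hit if `expected` appears in the model output.
```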

Model Card Contact

OpenMOSE - 2025


Note: This is an experimental model. Performance characteristics and behaviors may differ from both pure RWKV and standard Transformer architectures. Users should thoroughly evaluate the model for their specific use cases.
