These results show that FlexiDepth-Llama-3-8B-Instruct maintains comparable or improved performance on most benchmarks while using fewer layers on average.

## Training Details

FlexiDepth-Llama-3-8B-Instruct was built by applying the FlexiDepth method to the pre-trained Llama-3-8B-Instruct model, which has 32 Transformer layers. The latter 16 layers were modified into FlexiDepth layers, each equipped with a router and an adapter (a minimal code sketch follows the list below):

- **Router**: Uses a bottleneck dimension \( d_r = \frac{1}{16}d \) (where \( d \) is the hidden dimension) and a SparseMixer gating function for differentiability.
- **Adapter**: Retains the structure of the original feed-forward network (FFN) but reduces the intermediate dimension by a factor of 16.
- **Training Dataset**: Tulu-v2
- **Training Epochs**: 3
- **Optimizer**: AdamW with a learning rate of \( 1 \times 10^{-4} \), \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), and \( \epsilon = 1 \times 10^{-8} \)
- **Warmup Ratio**: 0.03
- **Global Batch Size**: 64
- **Training Time**: Approximately 7 hours on 8 NVIDIA A100-PCIE-40GB GPUs
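
To make the router and adapter description concrete, here is a minimal PyTorch sketch of a FlexiDepth-style layer. It is an illustration under stated assumptions, not the released implementation: the names (`FlexiDepthLayer`, `BottleneckAdapter`, `block`, `d_ffn`) are invented for this example, a plain sigmoid gate with a 0.5 threshold stands in for the SparseMixer gating used during training, the wrapped block is treated as a simple callable on hidden states, and both paths are computed densely instead of gathering only the routed tokens.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Adapter that mirrors Llama's gated FFN with a 16x smaller intermediate size."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        d_mid = d_ffn // 16                        # intermediate dimension reduced by 16
        self.gate_proj = nn.Linear(d_model, d_mid, bias=False)
        self.up_proj = nn.Linear(d_model, d_mid, bias=False)
        self.down_proj = nn.Linear(d_mid, d_model, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))


class FlexiDepthLayer(nn.Module):
    """FlexiDepth-style layer: a per-token router chooses between the full
    pre-trained block and a lightweight adapter path."""

    def __init__(self, block: nn.Module, d_model: int, d_ffn: int):
        super().__init__()
        # For Llama-3-8B-Instruct: d_model = 4096, d_ffn = 14336.
        self.block = block                         # original pre-trained transformer layer
        d_r = d_model // 16                        # router bottleneck, d_r = d / 16
        self.router = nn.Sequential(               # sigmoid stands in for SparseMixer
            nn.Linear(d_model, d_r, bias=False),
            nn.GELU(),
            nn.Linear(d_r, 1, bias=False),
            nn.Sigmoid(),
        )
        self.adapter = BottleneckAdapter(d_model, d_ffn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        gate = self.router(x)                      # (batch, seq_len, 1), values in [0, 1]
        keep = (gate > 0.5).to(x.dtype)            # hard per-token routing decision
        full_out = self.block(x)                   # full attention + FFN path
        skip_out = x + self.adapter(x)             # cheap residual path for skipped tokens
        # Dense mixing for clarity; an efficient implementation would gather only
        # the tokens routed to each path before running it.
        return keep * full_out + (1.0 - keep) * skip_out
```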

The training loss included a regularization term with coefficient \( \alpha = 1 \times 10^{-3} \) that encourages layer skipping, enabling an average of 8 layers to be skipped during generation while preserving performance.
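
For reference, one plausible way to write the training objective is shown below. The exact form of the skipping regularizer is not given in this card, so the penalty on expected layer usage through the router gates \( g_\ell \) is an assumption; only the coefficient \( \alpha \) comes from the setup above:

\[
\mathcal{L} = \mathcal{L}_{\text{LM}} + \alpha \sum_{\ell \in \mathcal{F}} \mathbb{E}\left[ g_\ell(x) \right], \qquad \alpha = 1 \times 10^{-3},
\]

where \( \mathcal{F} \) denotes the 16 FlexiDepth layers and \( g_\ell(x) \in [0, 1] \) is the router's probability of executing layer \( \ell \) for token \( x \).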
## Model Card Authors
Xuan Luo, Weizhi Wang, Xifeng Yan