Tags: Text Generation · Transformers · Safetensors · English · ddllama · conversational · custom_code
Commit e7981ed (verified) by xuan luo · 1 Parent(s): 0b482d6

Update README.md

Files changed (1)
  1. README.md +0 -15
README.md CHANGED
@@ -55,21 +55,6 @@ The performance of FlexiDepth-Llama-3-8B-Instruct was evaluated using the `lm_ev
 
 These results show that FlexiDepth-Llama-3-8B-Instruct maintains comparable or improved performance on most benchmarks while using fewer layers on average.
 
-## Training Details
-
-FlexiDepth-Llama-3-8B-Instruct was built by applying the FlexiDepth method to the pre-trained Llama-3-8B-Instruct model, which has 32 Transformer layers. The latter 16 layers were modified into FlexiDepth layers, each equipped with a router and an adapter:
-
-- **Router**: Uses a bottleneck dimension \( d_r = \frac{1}{16}d \) (where \( d \) is the hidden dimension) and a SparseMixer gating function for differentiability.
-- **Adapter**: Retains the structure of the original feed-forward network (FFN) but reduces the intermediate dimension by a factor of 16.
-- **Training Dataset**: Tulu-v2 dataset
-- **Training Epochs**: 3
-- **Optimizer**: AdamW with a learning rate of \( 1 \times 10^{-4} \), \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), and \( \epsilon = 1 \times 10^{-8} \)
-- **Warmup Ratio**: 0.03
-- **Global Batch Size**: 64
-- **Training Time**: Approximately 7 hours on 8 NVIDIA A100-PCIE-40GB GPUs
-
-The loss function included a coefficient \( \alpha = 1 \times 10^{-3} \), enabling an average of 8 layers to be skipped during generation while preserving performance.
-
 ## Model Card Authors
 
 Xuan Luo, Weizhi Wang, Xifeng Yan
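For reference, the removed Training Details section describes each FlexiDepth layer as a pre-trained decoder layer augmented with a small router (bottleneck dimension \( d_r = \frac{1}{16}d \), SparseMixer gating for differentiability) and an adapter that keeps the FFN structure with its intermediate dimension reduced by a factor of 16. The PyTorch sketch below only illustrates that layout; the class name `FlexiDepthLayer`, the hard top-1 gate, and the assumption that `decoder_layer` maps hidden states directly to hidden states are placeholders, not the model's `custom_code` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexiDepthLayer(nn.Module):
    """Illustrative sketch of a FlexiDepth layer: a small router decides, per
    token, whether to run the full pre-trained decoder layer or a lightweight
    adapter path. Names and the hard top-1 gate are assumptions; the model
    card specifies a SparseMixer gate to keep routing differentiable."""

    def __init__(self, decoder_layer: nn.Module, hidden_dim: int, ffn_dim: int):
        super().__init__()
        # Assumed to map hidden states to hidden states; a real Llama decoder
        # layer also takes attention arguments and returns a tuple.
        self.decoder_layer = decoder_layer
        bottleneck = hidden_dim // 16                 # router bottleneck d_r = d / 16
        self.router = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, 2),                 # logits: [skip, process]
        )
        # Adapter: same two-projection shape as the FFN, intermediate dim / 16.
        self.adapter = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim // 16),
            nn.SiLU(),
            nn.Linear(ffn_dim // 16, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        gate = F.softmax(self.router(hidden_states), dim=-1)
        process = gate.argmax(dim=-1, keepdim=True).float()        # 1.0 -> run the layer
        full_path = self.decoder_layer(hidden_states)              # attention + FFN
        light_path = hidden_states + self.adapter(hidden_states)   # skip path
        return process * full_path + (1.0 - process) * light_path
```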
 
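The optimizer settings listed in the removed section (AdamW with learning rate \( 1 \times 10^{-4} \), \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \epsilon = 1 \times 10^{-8} \), warmup ratio 0.03) correspond to a setup along the following lines. The helper name `build_optimizer` and the linear-warmup-then-constant schedule are assumptions; the card states only the warmup ratio, not the schedule shape.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model: torch.nn.Module, total_steps: int):
    """Hypothetical helper mirroring the hyperparameters in the model card:
    AdamW (lr 1e-4, betas 0.9/0.999, eps 1e-8) with 3% warmup."""
    optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
    warmup_steps = max(1, int(0.03 * total_steps))   # warmup ratio 0.03

    def lr_lambda(step: int) -> float:
        # Linear warmup, then hold the learning rate constant (assumed shape).
        return min(1.0, (step + 1) / warmup_steps)

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

With the reported global batch size of 64 over 3 epochs of Tulu-v2, `total_steps` would be roughly three times the number of training samples divided by 64.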