These results show that FlexiDepth-Llama-3-8B-Instruct maintains comparable or improved performance on most benchmarks while using fewer layers on average.

## Training Details

FlexiDepth-Llama-3-8B-Instruct was built by applying the FlexiDepth method to the pre-trained Llama-3-8B-Instruct model, which has 32 Transformer layers. The latter 16 layers were modified into FlexiDepth layers, each equipped with a router and an adapter (a minimal code sketch follows the list below):

- **Router**: Uses a bottleneck dimension \( d_r = \frac{1}{16}d \) (where \( d \) is the hidden dimension) and a SparseMixer gating function for differentiability.
- **Adapter**: Retains the structure of the original feed-forward network (FFN) but reduces the intermediate dimension by a factor of 16.
- **Training Dataset**: Tulu-v2
- **Training Epochs**: 3
- **Optimizer**: AdamW with a learning rate of \( 1 \times 10^{-4} \), \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), and \( \epsilon = 1 \times 10^{-8} \)
- **Warmup Ratio**: 0.03
- **Global Batch Size**: 64
- **Training Time**: Approximately 7 hours on 8 NVIDIA A100-PCIE-40GB GPUs
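
To make the router and adapter description concrete, here is a minimal PyTorch sketch of a FlexiDepth-style layer. It is an illustration under stated assumptions, not the released implementation: the names (`FlexiDepthLayer`, `BottleneckAdapter`, `block`, `d_ffn`) are invented for this example, a plain sigmoid gate with a 0.5 threshold stands in for the SparseMixer gating used during training, the wrapped block is treated as a simple callable on hidden states, and both paths are computed densely instead of gathering only the routed tokens.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Adapter that mirrors Llama's gated FFN with a 16x smaller intermediate size."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        d_mid = d_ffn // 16                        # intermediate dimension reduced by 16
        self.gate_proj = nn.Linear(d_model, d_mid, bias=False)
        self.up_proj = nn.Linear(d_model, d_mid, bias=False)
        self.down_proj = nn.Linear(d_mid, d_model, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))


class FlexiDepthLayer(nn.Module):
    """FlexiDepth-style layer: a per-token router chooses between the full
    pre-trained block and a lightweight adapter path."""

    def __init__(self, block: nn.Module, d_model: int, d_ffn: int):
        super().__init__()
        # For Llama-3-8B-Instruct: d_model = 4096, d_ffn = 14336.
        self.block = block                         # original pre-trained transformer layer
        d_r = d_model // 16                        # router bottleneck, d_r = d / 16
        self.router = nn.Sequential(               # sigmoid stands in for SparseMixer
            nn.Linear(d_model, d_r, bias=False),
            nn.GELU(),
            nn.Linear(d_r, 1, bias=False),
            nn.Sigmoid(),
        )
        self.adapter = BottleneckAdapter(d_model, d_ffn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        gate = self.router(x)                      # (batch, seq_len, 1), values in [0, 1]
        keep = (gate > 0.5).to(x.dtype)            # hard per-token routing decision
        full_out = self.block(x)                   # full attention + FFN path
        skip_out = x + self.adapter(x)             # cheap residual path for skipped tokens
        # Dense mixing for clarity; an efficient implementation would gather only
        # the tokens routed to each path before running it.
        return keep * full_out + (1.0 - keep) * skip_out
```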

The training loss included a regularization term with coefficient \( \alpha = 1 \times 10^{-3} \) that encourages layer skipping, enabling an average of 8 layers to be skipped during generation while preserving performance.
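
For reference, one plausible way to write the training objective is shown below. The exact form of the skipping regularizer is not given in this card, so the penalty on expected layer usage through the router gates \( g_\ell \) is an assumption; only the coefficient \( \alpha \) comes from the setup above:

\[
\mathcal{L} = \mathcal{L}_{\text{LM}} + \alpha \sum_{\ell \in \mathcal{F}} \mathbb{E}\left[ g_\ell(x) \right], \qquad \alpha = 1 \times 10^{-3},
\]

where \( \mathcal{F} \) denotes the 16 FlexiDepth layers and \( g_\ell(x) \in [0, 1] \) is the router's probability of executing layer \( \ell \) for token \( x \).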
## Model Card Authors
Xuan Luo, Weizhi Wang, Xifeng Yan