Jinawei committed
Commit 1c9fed9 · 1 Parent(s): 1fbf003

Update README.md

Files changed (1)
  1. README.md +25 -27
README.md CHANGED
@@ -6,41 +6,39 @@ tags:
  license: mit
  ---

- # XtremeDistilTransformers for Distilling Massive Neural Networks

- XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).

- We leverage task transfer combined with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://proceedings.neurips.cc/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) with the following [Github code](https://github.com/microsoft/xtreme-distil-transformers).

- This l6-h384 checkpoint with **6** layers, **384** hidden size, **12** attention heads corresponds to **22 million** parameters with **5.3x** speedup over BERT-base.
-
- Other available checkpoints: [xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) and [xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)
-
- The following table shows the results on GLUE dev set and SQuAD-v2.
-
- | Models | #Params | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
  |----------------|--------|---------|------|------|------|------|------|------|--------|-------|
- | BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
- | DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
- | TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
- | MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
- | MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
- | XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
- | XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
- | XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
-
- Tested with `tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0`

  If you use this checkpoint in your work, please cite:

  ``` latex
- @misc{mukherjee2021xtremedistiltransformers,
- title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
- author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
- year={2021},
- eprint={2106.04563},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
  }
  ```
 
  license: mit
  ---

+ # AutoDisProxyT-COLA for Distilling Massive Neural Networks

+ AutoDisProxyT is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper [Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2022/file/b7c12689a89e98a61bcaa65285a41b7c-Paper-Conference.pdf).
+ This AutoDisProxyT checkpoint, with **7** layers, a hidden size of **160**, and **10** attention heads, has **6.88 million** parameters and **0.27G** FLOPs.

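As a quick sanity check on these numbers, the checkpoint can be loaded with `transformers` and its parameters counted directly. The sketch below is illustrative only: the repo id is a placeholder for this model's actual Hub id, and it assumes the weights load through the standard Auto classes.

```python
# Minimal sketch (not an official usage example): load the distilled encoder
# and verify its size. The repo id is a placeholder / assumption.
from transformers import AutoModel, AutoTokenizer

repo_id = "Jinawei/AutoDisProxyT-CoLA"  # hypothetical Hub id for this checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

# Should land near the 6.88M parameters quoted above.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.2f}M parameters")

# Run one sentence through the encoder; the hidden size should be 160.
inputs = tokenizer("AutoDisProxyT is a compact distilled transformer.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 160)
```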
+ The following table shows results on the GLUE dev set.

+ | Models | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
  |----------------|--------|---------|------|------|------|------|------|------|--------|-------|
+ | BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
+ | BERT<sub>SMALL</sub> | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
+ | TruncatedBERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
+ | DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
+ | TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
+ | MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
+ | AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
+ | DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 43.7 | 76.4 |
+ | NAS-BERT<sub>10</sub> | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
+ | AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
+ | NAS-BERT<sub>5</sub> | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
+ | **AutoDisProxyT** | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |
+
+ Tested with `torch 1.6.0`

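Because this checkpoint is named for CoLA, the natural end-to-end test is scoring a sentence for linguistic acceptability. The sketch below is an assumption-laden illustration, not documented usage: the repo id is again a placeholder, it presumes the upload includes a CoLA-fine-tuned sequence-classification head, and it assumes the usual GLUE label order (0 = unacceptable, 1 = acceptable).

```python
# Hedged sketch: CoLA-style acceptability scoring with this checkpoint.
# Assumes a placeholder repo id and a fine-tuned classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "Jinawei/AutoDisProxyT-CoLA"  # hypothetical Hub id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

inputs = tokenizer("The book was written by her.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1).squeeze()
# Label order assumed to follow the GLUE/CoLA convention: 0 = unacceptable, 1 = acceptable.
print(f"P(unacceptable) = {probs[0]:.3f}, P(acceptable) = {probs[1]:.3f}")
```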
  If you use this checkpoint in your work, please cite:

  ``` latex
+ @article{xu2022autodistil,
+ title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
+ author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
+ journal={arXiv preprint arXiv:2201.12507},
+ year={2022}
  }
  ```