Update README.md
README.md CHANGED
@@ -6,41 +6,39 @@ tags:
license: mit
---

-#
+# AutoDisProxyT-COLA for Distilling Massive Neural Networks

+AutoDisProxyT is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as outlined in the paper [Few-shot Task-agnostic Neural Architecture Search for
+Distilling Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2022/file/b7c12689a89e98a61bcaa65285a41b7c-Paper-Conference.pdf).

+This AutoDisProxyT checkpoint, with **7** layers, a hidden size of **160**, and **10** attention heads, corresponds to **6.88 million** parameters and **0.27G** FLOPs.

-Other available checkpoints: [xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) and [xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased)
-
-The following table shows the results on GLUE dev set and SQuAD-v2.
-
-| Models | #Params | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
+The following table shows the results on the GLUE dev set.
+
+| Models | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
|----------------|--------|---------|------|------|------|------|------|------|--------|-------|
-| BERT | 109 |
+| BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
+| BERT<sub>SMALL</sub> | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
+| TruncatedBERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
+| DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
+| TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
+| MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
+| AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
+| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 43.7 | 76.4 |
+| NAS-BERT<sub>10</sub> | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
+| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
+| NAS-BERT<sub>5</sub> | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
+| **AutoDisProxyT** | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |
+
+Tested with `torch 1.6.0`

If you use this checkpoint in your work, please cite:

``` latex
-@
-archivePrefix={arXiv},
-primaryClass={cs.CL}
+@article{xu2022autodistil,
+  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
+  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
+  journal={arXiv preprint arXiv:2201.12507},
+  year={2022}
}
```
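For reference, a minimal loading-and-inference sketch for a checkpoint like the one described above, assuming the weights are published as a standard `transformers`-compatible repository (the `model_id` below is a placeholder, not a verified path) and expose a two-label CoLA classification head:

```python
# Hypothetical usage sketch -- the repository id is a placeholder and the
# checkpoint is assumed to load through the standard Auto* classes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-namespace/AutoDisProxyT-CoLA"  # placeholder, replace with the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)  # CoLA: acceptable vs. unacceptable
model.eval()

# Sanity-check the size quoted in the card (~6.88M parameters for the backbone;
# the exact total also depends on embeddings and the classification head).
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")

# Score one sentence for linguistic acceptability.
inputs = tokenizer("The books was on the table.", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)
```

If the published weights are only the task-agnostic backbone without a fine-tuned head, `AutoModel.from_pretrained` would be the safer entry point, with the CoLA classifier trained separately.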