SqueezeBERT

Overview

The SqueezeBERT model was proposed in SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. It's a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.
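
To make the grouped-convolution substitution concrete, here is a minimal PyTorch sketch (not the library's actual implementation) contrasting a fully-connected projection with a grouped pointwise convolution over a (batch, channels, sequence) tensor. The hidden size of 768 and the choice of 4 groups are assumptions made for illustration.

```python
import torch
import torch.nn as nn

hidden_size, groups = 768, 4
x = torch.randn(8, hidden_size, 128)  # (batch, channels, sequence length)

# A fully-connected projection, as used for the Q/K/V and FFN layers in BERT,
# is equivalent to a pointwise (kernel_size=1) convolution with a single group.
dense = nn.Conv1d(hidden_size, hidden_size, kernel_size=1, groups=1)

# A grouped pointwise convolution, the SqueezeBERT-style replacement: each group
# only mixes hidden_size // groups channels, cutting parameters and FLOPs roughly 4x.
grouped = nn.Conv1d(hidden_size, hidden_size, kernel_size=1, groups=groups)

print(dense(x).shape, grouped(x).shape)              # both: torch.Size([8, 768, 128])
print(sum(p.numel() for p in dense.parameters()))    # 590592
print(sum(p.numel() for p in grouped.parameters()))  # 148224
```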

The abstract from the paper is the following:

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.

This model was contributed by forresti.

Usage tips

SqueezeBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left.
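
For example, a minimal sketch of batching with right padding (the tokenizer's default behavior for this model); the sample sentences are arbitrary:

```python
from transformers import AutoTokenizer

# "squeezebert/squeezebert-uncased" is the base pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")

batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that sets the batch length."],
    padding=True,             # pads the shorter sequence...
    return_tensors="pt",
)
print(batch["input_ids"][0])  # ...on the right, with trailing pad token ids
```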

SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is efficient at predicting masked tokens and at NLU in general, but it is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
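
As a quick illustration of the MLM objective, the sketch below queries the pretrained checkpoint through the fill-mask pipeline; the example sentence is arbitrary, and meaningful predictions assume the checkpoint ships its pretrained MLM head.

```python
from transformers import pipeline

# Load the base pretrained checkpoint into a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="squeezebert/squeezebert-uncased")

# Print the top predictions for the masked position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```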

For best results when finetuning on sequence classification tasks, it is recommended to start with the squeezebert/squeezebert-mnli-headless checkpoint.
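
A minimal sketch of that starting point, assuming a binary downstream task; the tokenizer checkpoint, example sentence, and num_labels value are choices made for illustration, not prescribed by the library.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The tokenizer is shared with the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")

# Start from the MNLI-finetuned encoder with its classification head removed;
# a fresh head sized for the downstream task is initialized (num_labels is task-specific).
model = AutoModelForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=2
)

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]), untrained head before finetuning
```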

Resources

- Text classification task guide
- Token classification task guide
- Question answering task guide
- Masked language modeling task guide
- Multiple choice task guide

SqueezeBertConfig

[[autodoc]] SqueezeBertConfig

SqueezeBertTokenizer

[[autodoc]] SqueezeBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

SqueezeBertTokenizerFast

[[autodoc]] SqueezeBertTokenizerFast

SqueezeBertModel

[[autodoc]] SqueezeBertModel

SqueezeBertForMaskedLM

[[autodoc]] SqueezeBertForMaskedLM

SqueezeBertForSequenceClassification

[[autodoc]] SqueezeBertForSequenceClassification

SqueezeBertForMultipleChoice

[[autodoc]] SqueezeBertForMultipleChoice

SqueezeBertForTokenClassification

[[autodoc]] SqueezeBertForTokenClassification

SqueezeBertForQuestionAnswering

[[autodoc]] SqueezeBertForQuestionAnswering