|
|
|
DistilBERT |
|
|
|
Overview |
|
The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
|
The abstract from the paper is the following: |
|
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
|
This model was contributed by victorsanh. The Jax version of this model was contributed by kamalkraj. The original code can be found here.
|
Usage tips |
|
|
|
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`), as shown in the snippet after these tips.
- DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though; just let us know if you need this option.
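For instance, encoding a sentence pair with the DistilBERT tokenizer returns only `input_ids` and an `attention_mask`, with the two segments joined by `[SEP]`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Passing two segments: the tokenizer inserts [SEP] between them itself,
# and the output contains no token_type_ids.
encoded = tokenizer("How old are you?", "I'm 6 years old.")
print(encoded.keys())  # dict_keys(['input_ids', 'attention_mask'])
```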
|
|
|
Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict the same probabilities as the larger model. The actual objective is a combination of:

- finding the same probabilities as the teacher model
- predicting the masked tokens correctly (but no next-sentence objective)
- a cosine similarity between the hidden states of the student and the teacher model
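As a rough illustration, here is a minimal sketch of how these three terms could be combined in PyTorch. The loss weights, temperature, and tensor names are illustrative placeholders, not the values or code from the original training repository:

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                labels, temperature=2.0, w_distill=1.0, w_mlm=1.0, w_cos=1.0):
    # (1) soft-target distillation: match the teacher's output distribution
    distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # (2) standard masked language modeling loss on the hard labels
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1))
    # (3) cosine embedding loss aligning student and teacher hidden states
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    cos = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))
    return w_distill * distill + w_mlm * mlm + w_cos * cos
```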
|
|
|
Resources |
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. |
|
|
|
- A blog post on Getting Started with Sentiment Analysis using Python with DistilBERT.
- A blog post on how to train DistilBERT with Blurr for sequence classification.
- A blog post on how to use Ray to tune DistilBERT hyperparameters.
- A blog post on how to train DistilBERT with Hugging Face and Amazon SageMaker.
- A notebook on how to finetune DistilBERT for multi-label classification. 🌎
- A notebook on how to finetune DistilBERT for multiclass classification with PyTorch. 🌎
- A notebook on how to finetune DistilBERT for text classification in TensorFlow. 🌎
- [DistilBertForSequenceClassification] is supported by this example script and notebook.
- [TFDistilBertForSequenceClassification] is supported by this example script and notebook.
- [FlaxDistilBertForSequenceClassification] is supported by this example script and notebook.
- Text classification task guide
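As a quick start before diving into these resources, a DistilBERT checkpoint fine-tuned on SST-2 for sentiment analysis works out of the box with the pipeline API:

```python
from transformers import pipeline

# A DistilBERT checkpoint fine-tuned on SST-2 for sentiment classification.
classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This movie was surprisingly good."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```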
|
|
|
- [DistilBertForTokenClassification] is supported by this example script and notebook.
- [TFDistilBertForTokenClassification] is supported by this example script and notebook.
- [FlaxDistilBertForTokenClassification] is supported by this example script.
- Token classification chapter of the 🤗 Hugging Face Course.
- Token classification task guide
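Here is a minimal sketch of the token classification head's input/output shapes, using the base checkpoint. The classification head is randomly initialized until fine-tuned, and `num_labels=9` is an arbitrary illustration:

```python
import torch
from transformers import AutoTokenizer, DistilBertForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
# The head is untrained here; fine-tune first for real predictions.
model = DistilBertForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=9
)

inputs = tokenizer("HuggingFace is based in New York City", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, num_labels)
print(logits.shape)
```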
|
|
|
- [DistilBertForMaskedLM] is supported by this example script and notebook.
- [TFDistilBertForMaskedLM] is supported by this example script and notebook.
- [FlaxDistilBertForMaskedLM] is supported by this example script and notebook.
- Masked language modeling chapter of the 🤗 Hugging Face Course.
- Masked language modeling task guide
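Since the pretrained checkpoint was trained with the masked language modeling objective, it can fill in `[MASK]` tokens without any fine-tuning:

```python
from transformers import pipeline

# The pretrained checkpoint already includes the masked language modeling head.
unmasker = pipeline("fill-mask", model="distilbert/distilbert-base-uncased")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], prediction["score"])
```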
|
|
|
- [DistilBertForQuestionAnswering] is supported by this example script and notebook.
- [TFDistilBertForQuestionAnswering] is supported by this example script and notebook.
- [FlaxDistilBertForQuestionAnswering] is supported by this example script.
- Question answering chapter of the 🤗 Hugging Face Course.
- Question answering task guide
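For example, the distilbert/distilbert-base-cased-distilled-squad checkpoint, a DistilBERT model fine-tuned on SQuAD, can answer questions directly through the pipeline API:

```python
from transformers import pipeline

# A DistilBERT checkpoint fine-tuned on SQuAD for extractive question answering.
qa = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad")
result = qa(
    question="How many parameters does DistilBERT save?",
    context="DistilBERT has 40% fewer parameters than BERT base and runs 60% faster.",
)
print(result["answer"], result["score"])
```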
|
|
|
Multiple choice |
|
- [DistilBertForMultipleChoice] is supported by this example script and notebook. |
|
- [TFDistilBertForMultipleChoice] is supported by this example script and notebook. |
|
- Multiple choice task guide |
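A minimal sketch of the expected input shape for multiple choice: each choice is encoded against the prompt, and the batch is reshaped to (batch_size, num_choices, sequence_length). The head on the base checkpoint is randomly initialized until fine-tuned (e.g. on SWAG), so the logits below are only shape illustrations:

```python
import torch
from transformers import AutoTokenizer, DistilBertForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = DistilBertForMultipleChoice.from_pretrained("distilbert/distilbert-base-uncased")

prompt = "France has a bread law with strict rules on what is allowed in a traditional baguette."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

# Encode the prompt once per choice, then add the batch dimension.
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
print(outputs.logits)  # shape: (1, 2), one score per choice
```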
|
⚗️ Optimization |
|
|
|
- A blog post on how to quantize DistilBERT with 🤗 Optimum and Intel.
- A blog post on Optimizing Transformers for GPUs with 🤗 Optimum.
- A blog post on Optimizing Transformers with Hugging Face Optimum.
|
|
|
⚡️ Inference |
|
|
|
- A blog post on how to Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia with DistilBERT.
- A blog post on Serverless Inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker.
|
|
|
🚀 Deploy |
|
|
|
- A blog post on how to deploy DistilBERT on Google Cloud.
- A blog post on how to deploy DistilBERT with Amazon SageMaker.
- A blog post on how to Deploy BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module.
|
|
|
Combining DistilBERT and Flash Attention 2 |
|
First, make sure to install the latest version of Flash Attention 2.
|
|
|
```bash
pip install -U flash-attn --no-build-isolation
```
|
Also make sure that your hardware is compatible with Flash Attention 2; read more about it in the official documentation of the flash-attn repository. Finally, make sure to load your model in half-precision (e.g. `torch.float16`).
|
To load and run a model using Flash Attention 2, refer to the snippet below: |
|
```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda"  # the device to load the model onto

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModel.from_pretrained(
    "distilbert/distilbert-base-uncased",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
model.to(device)

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt").to(device)
output = model(**encoded_input)
```
|
|
|
DistilBertConfig |
|
[[autodoc]] DistilBertConfig |
|
DistilBertTokenizer |
|
[[autodoc]] DistilBertTokenizer |
|
DistilBertTokenizerFast |
|
[[autodoc]] DistilBertTokenizerFast |
|
|
|
DistilBertModel |
|
[[autodoc]] DistilBertModel |
|
- forward |
|
DistilBertForMaskedLM |
|
[[autodoc]] DistilBertForMaskedLM |
|
- forward |
|
DistilBertForSequenceClassification |
|
[[autodoc]] DistilBertForSequenceClassification |
|
- forward |
|
DistilBertForMultipleChoice |
|
[[autodoc]] DistilBertForMultipleChoice |
|
- forward |
|
DistilBertForTokenClassification |
|
[[autodoc]] DistilBertForTokenClassification |
|
- forward |
|
DistilBertForQuestionAnswering |
|
[[autodoc]] DistilBertForQuestionAnswering |
|
- forward |
|
|
|
TFDistilBertModel |
|
[[autodoc]] TFDistilBertModel |
|
- call |
|
TFDistilBertForMaskedLM |
|
[[autodoc]] TFDistilBertForMaskedLM |
|
- call |
|
TFDistilBertForSequenceClassification |
|
[[autodoc]] TFDistilBertForSequenceClassification |
|
- call |
|
TFDistilBertForMultipleChoice |
|
[[autodoc]] TFDistilBertForMultipleChoice |
|
- call |
|
TFDistilBertForTokenClassification |
|
[[autodoc]] TFDistilBertForTokenClassification |
|
- call |
|
TFDistilBertForQuestionAnswering |
|
[[autodoc]] TFDistilBertForQuestionAnswering |
|
- call |
|
|
|
FlaxDistilBertModel |
|
[[autodoc]] FlaxDistilBertModel |
|
- call |
|
FlaxDistilBertForMaskedLM |
|
[[autodoc]] FlaxDistilBertForMaskedLM |
|
- call |
|
FlaxDistilBertForSequenceClassification |
|
[[autodoc]] FlaxDistilBertForSequenceClassification |
|
- call |
|
FlaxDistilBertForMultipleChoice |
|
[[autodoc]] FlaxDistilBertForMultipleChoice |
|
- call |
|
FlaxDistilBertForTokenClassification |
|
[[autodoc]] FlaxDistilBertForTokenClassification |
|
- call |
|
FlaxDistilBertForQuestionAnswering |
|
[[autodoc]] FlaxDistilBertForQuestionAnswering |
|
- call |
|
|
|
|