|
|
|
# BigBird
|
## Overview
|
The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention-based
transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
attention, BigBird also applies global attention as well as random attention to the input sequence. It has been shown
theoretically that applying sparse, global, and random attention approximates full attention while being
computationally much more efficient for longer sequences. As a consequence of its ability to handle longer context,
BigBird has shown improved performance on various long-document NLP tasks, such as question answering and
summarization, compared to BERT or RoBERTa.
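
The sparse pattern is the combination of three block-level components. As a rough illustration (not the library's implementation), the following sketch builds a boolean block-to-block mask from a sliding window, a few random blocks, and a handful of global blocks; all sizes here are illustrative values only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, window, n_global, n_random = 16, 3, 2, 2

# Block-level attention mask: mask[i, j] is True if block i attends to block j.
mask = np.zeros((n_blocks, n_blocks), dtype=bool)
for i in range(n_blocks):
    # Sliding-window attention: each block attends to its immediate neighbours.
    lo, hi = max(0, i - window // 2), min(n_blocks, i + window // 2 + 1)
    mask[i, lo:hi] = True
    # Random attention: each block also attends to a few randomly chosen blocks.
    mask[i, rng.choice(n_blocks, size=n_random, replace=False)] = True

# Global attention: the first n_global blocks attend everywhere and every
# block attends to them (e.g. the block containing the [CLS] token).
mask[:n_global, :] = True
mask[:, :n_global] = True

# Fraction of block pairs actually computed, versus 1.0 for full attention.
print(f"attention density: {mask.mean():.2f}")
```

For a fixed window, global, and random budget, the number of computed block pairs grows linearly with the number of blocks, which is where the linear (rather than quadratic) dependency on sequence length comes from.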
|
The abstract from the paper is the following: |
|
*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
propose novel applications to genomics data.*
|
This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
[here](https://github.com/google-research/bigbird).
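
For a quick end-to-end check, here is a minimal masked language modeling sketch using the `google/bigbird-roberta-base` checkpoint (the tokenizer requires `sentencepiece`); the input sentence is arbitrary:

```python
import torch
from transformers import BigBirdForMaskedLM, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

text = "BigBird extends transformers to much longer [MASK] using sparse attention."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

Note that for an input this short the model typically switches to full attention internally, in line with the usage tips below.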
|
## Usage tips
|
|
|
- For an in-detail explanation of how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention (see the configuration sketch
  after this list).
- The code currently uses a window size of 3 blocks and 2 global blocks.
- The sequence length must be divisible by the block size.
- The current implementation supports only **ITC** (internal transformer construction, where existing tokens are made
  global), not **ETC** (which adds extra global tokens).
- The current implementation doesn't support **num_random_blocks = 0**.
- BigBird is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
  than on the left.
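
As referenced in the implementations tip above, here is a short configuration sketch; the values chosen for `block_size` and `num_random_blocks` are arbitrary examples, not recommendations:

```python
from transformers import BigBirdConfig, BigBirdModel

# Block-sparse attention: the sequence length must be a multiple of block_size,
# and num_random_blocks must be at least 1 in the current implementation.
config = BigBirdConfig(
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
)
model = BigBirdModel(config)

# For short inputs (< 1024 tokens), full attention can be requested at load time:
model_full = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base", attention_type="original_full"
)
```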
|
|
|
## Resources
|
|
|
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)
|
|
|
## BigBirdConfig

[[autodoc]] BigBirdConfig

## BigBirdTokenizer

[[autodoc]] BigBirdTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## BigBirdTokenizerFast

[[autodoc]] BigBirdTokenizerFast

## BigBird specific outputs

[[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
|
|
|
## BigBirdModel

[[autodoc]] BigBirdModel
    - forward

## BigBirdForPreTraining

[[autodoc]] BigBirdForPreTraining
    - forward

## BigBirdForCausalLM

[[autodoc]] BigBirdForCausalLM
    - forward

## BigBirdForMaskedLM

[[autodoc]] BigBirdForMaskedLM
    - forward

## BigBirdForSequenceClassification

[[autodoc]] BigBirdForSequenceClassification
    - forward

## BigBirdForMultipleChoice

[[autodoc]] BigBirdForMultipleChoice
    - forward

## BigBirdForTokenClassification

[[autodoc]] BigBirdForTokenClassification
    - forward

## BigBirdForQuestionAnswering

[[autodoc]] BigBirdForQuestionAnswering
    - forward
|
|
|
## FlaxBigBirdModel

[[autodoc]] FlaxBigBirdModel
    - __call__

## FlaxBigBirdForPreTraining

[[autodoc]] FlaxBigBirdForPreTraining
    - __call__

## FlaxBigBirdForCausalLM

[[autodoc]] FlaxBigBirdForCausalLM
    - __call__

## FlaxBigBirdForMaskedLM

[[autodoc]] FlaxBigBirdForMaskedLM
    - __call__

## FlaxBigBirdForSequenceClassification

[[autodoc]] FlaxBigBirdForSequenceClassification
    - __call__

## FlaxBigBirdForMultipleChoice

[[autodoc]] FlaxBigBirdForMultipleChoice
    - __call__

## FlaxBigBirdForTokenClassification

[[autodoc]] FlaxBigBirdForTokenClassification
    - __call__

## FlaxBigBirdForQuestionAnswering

[[autodoc]] FlaxBigBirdForQuestionAnswering
    - __call__
|
|
|
|