|
|
|
BERTweet |
|
Overview |
|
The BERTweet model was proposed in BERTweet: A pre-trained language model for English Tweets by Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen.
|
The abstract from the paper is the following: |
|
We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification.
|
This model was contributed by dqnguyen. The original code can be found here. |
|
Usage example |
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

# For transformers v4.x+:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

# For transformers v3.x:
# tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples

# With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```
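On recent transformers versions the forward pass returns a model output object that still supports tuple-style indexing. A minimal sketch of reading the per-token features from the call above (the 768 hidden size follows from the BERT-base architecture used by bertweet-base):

```python
# The first element of the output is the last layer's hidden states,
# one 768-dimensional vector per input token.
last_hidden_state = features[0]  # shape: (batch_size, sequence_length, 768)

# On transformers v4.x+ the same tensor is also available as an attribute:
# last_hidden_state = features.last_hidden_state
```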
|
|
|
|
|
This implementation is the same as BERT, except for the tokenization method. Refer to the BERT documentation for API reference information.
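Because only the tokenizer is BERTweet-specific, the normalization that the usage example above assumes can also be delegated to the tokenizer itself. A minimal sketch, assuming the normalization option of the slow BertweetTokenizer (which relies on the emoji package and rewrites user mentions and URLs to @USER and HTTPURL):

```python
from transformers import AutoTokenizer

# normalization=True asks BertweetTokenizer to normalize raw tweets
# (user handles -> @USER, URLs -> HTTPURL, emoji -> text codes)
# before applying BPE, so un-normalized tweets can be passed directly.
tokenizer = AutoTokenizer.from_pretrained(
    "vinai/bertweet-base", use_fast=False, normalization=True
)

raw_line = "SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/example via @some_user"
print(tokenizer.tokenize(raw_line))
```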
|
|
|
BertweetTokenizer |
|
[[autodoc]] BertweetTokenizer |