|
|
|
PhoBERT |
|
Overview |
|
The PhoBERT model was proposed in PhoBERT: Pre-trained language models for Vietnamese by Dat Quoc Nguyen and Anh Tuan Nguyen.
|
The abstract from the paper is the following: |
|
We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.
|
This model was contributed by dqnguyen. The original code can be found here. |
|
Usage example |
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are tuples
```
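The underscore-joined tokens above come from a Vietnamese word segmenter (the paper uses the RDRSegmenter from VnCoreNLP). As a minimal sketch of that preprocessing step, assuming the third-party `underthesea` toolkit is installed; its `word_tokenize` helper with `format="text"` joins multi-syllable words with underscores:

```python
# Hypothetical preprocessing sketch: segment raw Vietnamese text into words
# before tokenization, here using the third-party underthesea toolkit
# (pip install underthesea). Multi-syllable words are joined with "_",
# matching the word-segmented input PhoBERT was pre-trained on.
from underthesea import word_tokenize

raw = "Tôi là sinh viên trường đại học Công nghệ ."
line = word_tokenize(raw, format="text")
print(line)  # expected along the lines of: Tôi là sinh_viên trường đại_học Công_nghệ .
```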
|
With TensorFlow 2.0+:

```python
from transformers import TFAutoModel

phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
```
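Instead of building the input tensor by hand with `tokenizer.encode`, the tokenizer can also be called directly. This is the standard `transformers` tokenizer `__call__` API rather than anything PhoBERT-specific:

```python
# Equivalent feature extraction using the tokenizer's __call__ API, which
# builds input_ids and the attention mask in one step.
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

inputs = tokenizer("Tôi là sinh_viên trường đại_học Công_nghệ .", return_tensors="pt")
with torch.no_grad():
    features = phobert(**inputs)[0]  # last hidden state, shape (1, seq_len, hidden_size)
```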
|
|
|
|
|
The PhoBERT implementation is the same as BERT, except for tokenization. Refer to the BERT documentation for information on configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.
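Because the architecture matches BERT, the usual configuration attributes apply unchanged. A minimal sketch using the standard `AutoConfig` API; the printed values are whatever the hub checkpoint defines:

```python
# Inspect the checkpoint's configuration via the standard AutoConfig API;
# the attribute names below are common to BERT-style configs.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("vinai/phobert-base")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
```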
|
|
|
PhobertTokenizer |
|
[[autodoc]] PhobertTokenizer |