RAG / knowledge_base /model_doc_mbart.txt
Ahmadzei's picture
update 1
57bdca5
MBart and MBart-50
Overview of MBart
The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
on the encoder, decoder, or reconstructing parts of the text.
This model was contributed by valhalla. The Authors' code can be found here
Training of MBart
MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation task. As the
model is multilingual it expects the sequences in a different format. A special language id token is added in both the
source and target text. The source text format is X [eos, src_lang_code] where X is the source text. The
target text format is [tgt_lang_code] X [eos]. bos is never used.
The regular [~MBartTokenizer.__call__] will encode source text format passed as first argument or with the text
keyword, and target text format passed with the text_label keyword argument.
Supervised training
thon
from transformers import MBartForConditionalGeneration, MBartTokenizer
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
inputs = tokenizer(example_english_phrase, text_target=expected_translation_romanian, return_tensors="pt")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
forward pass
model(**inputs)
Generation
While generating the target text set the decoder_start_token_id to the target language id. The following
example shows how to translate English to Romanian using the facebook/mbart-large-en-ro model.
thon
from transformers import MBartForConditionalGeneration, MBartTokenizer
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
article = "UN Chief Says There Is No Military Solution in Syria"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
"Şeful ONU declară că nu există o soluţie militară în Siria"
Overview of MBart-50
MBart-50 was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original mbart-large-cc25 checkpoint by extendeding
its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
languages.
According to the abstract
Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
improving 9.3 BLEU on average over bilingual baselines from scratch.
Training of MBart-50
The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
for both source and target text i.e the text format is [lang_code] X [eos], where lang_code is source
language id for source text and target language id for target text, with X being the source or target text
respectively.
MBart-50 has its own tokenizer [MBart50Tokenizer].
Supervised training
thon
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
model(**model_inputs) # forward pass
Generation
To generate using the mBART-50 multilingual translation models, eos_token_id is used as the
decoder_start_token_id and the target language id is forced as the first generated token. To force the
target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method.
The following example shows how to translate between Hindi to French and Arabic to English using the
facebook/mbart-50-large-many-to-many checkpoint.
thon
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
=> "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
=> "The Secretary-General of the United Nations says there is no military solution in Syria."
Documentation resources
Text classification task guide
Question answering task guide
Causal language modeling task guide
Masked language modeling task guide
Translation task guide
Summarization task guide
MBartConfig
[[autodoc]] MBartConfig
MBartTokenizer
[[autodoc]] MBartTokenizer
- build_inputs_with_special_tokens
MBartTokenizerFast
[[autodoc]] MBartTokenizerFast
MBart50Tokenizer
[[autodoc]] MBart50Tokenizer
MBart50TokenizerFast
[[autodoc]] MBart50TokenizerFast
MBartModel
[[autodoc]] MBartModel
MBartForConditionalGeneration
[[autodoc]] MBartForConditionalGeneration
MBartForQuestionAnswering
[[autodoc]] MBartForQuestionAnswering
MBartForSequenceClassification
[[autodoc]] MBartForSequenceClassification
MBartForCausalLM
[[autodoc]] MBartForCausalLM
- forward
TFMBartModel
[[autodoc]] TFMBartModel
- call
TFMBartForConditionalGeneration
[[autodoc]] TFMBartForConditionalGeneration
- call
FlaxMBartModel
[[autodoc]] FlaxMBartModel
- call
- encode
- decode
FlaxMBartForConditionalGeneration
[[autodoc]] FlaxMBartForConditionalGeneration
- call
- encode
- decode
FlaxMBartForSequenceClassification
[[autodoc]] FlaxMBartForSequenceClassification
- call
- encode
- decode
FlaxMBartForQuestionAnswering
[[autodoc]] FlaxMBartForQuestionAnswering
- call
- encode
- decode