	M2M100
	Overview
	The M2M100 model was proposed in "Beyond English-Centric Multilingual Machine Translation" by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
	Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
	Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin.
	The abstract from the paper is the following:
	Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
	single model able to translate between any pair of languages. However, much of this work is English-Centric by training
	only on data which was translated from or to English. While this is supported by large sources of training data, it
	does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
	model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
	covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
	to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
	to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
	translating between non-English directions while performing competitively to the best single systems of WMT. We
	open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
	This model was contributed by valhalla.
	Usage tips and examples
	M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. Because the model is
	multilingual, it expects sequences in a specific format: a special language id token is used as a prefix in both the
	source and target text. Both texts follow the format `[lang_code] X [eos]`, where X is the text itself and lang_code is
	the source language id for the source text and the target language id for the target text.
	The M2M100Tokenizer depends on sentencepiece, so be sure to install it before running the
	examples. To install sentencepiece, run `pip install sentencepiece`.
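The format described above can be illustrated with a minimal, dependency-free sketch. It does not call the real tokenizer: whitespace splitting stands in for SentencePiece, and the helper names `lang_token` and `add_special_tokens` are hypothetical. The `__en__` spelling of language tokens and the `</s>` end-of-sequence token are assumed to match the `facebook/m2m100_418M` vocabulary.

```python
from typing import List

EOS_TOKEN = "</s>"  # assumed end-of-sequence token


def lang_token(lang_code: str) -> str:
    """Spell a language code as a special token, e.g. "en" -> "__en__" (assumed spelling)."""
    return f"__{lang_code}__"


def add_special_tokens(pieces: List[str], lang_code: str) -> List[str]:
    """Wrap tokenized text as [lang_code] X [eos]."""
    return [lang_token(lang_code), *pieces, EOS_TOKEN]


# Whitespace splitting is only a stand-in for SentencePiece tokenization.
source = add_special_tokens("Life is like a box of chocolates .".split(), "en")
target = add_special_tokens("La vie est comme une boîte de chocolat .".split(), "fr")

print(source)
print(target)
```

The source sequence carries the source language token and the target sequence carries the target language token; everything else about the two layouts is identical.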
	Supervised Training
	```python
	from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

	model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
	tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")

	src_text = "Life is like a box of chocolates."
	tgt_text = "La vie est comme une boîte de chocolat."

	# text_target tokenizes tgt_text with the target language id and returns it as labels
	model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
	loss = model(**model_inputs).loss  # forward pass
	```
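
When only labels are passed, as in the training example, the decoder inputs are derived from the labels by shifting them one position to the right and prepending the decoder start token, which for M2M100 is the eos token. The following is a simplified sketch of that step on plain Python lists rather than tensors. The token ids, including `FR_ID`, are illustrative placeholders and not real vocabulary entries.

```python
from typing import List

PAD_ID = 1  # illustrative pad token id
EOS_ID = 2  # illustrative eos token id, also used as the decoder start token
FR_ID = 9   # hypothetical id for the French language token


def shift_tokens_right(labels: List[int], pad_id: int, decoder_start_id: int) -> List[int]:
    """Prepend the decoder start token and drop the last label.

    Positions marked -100 (ignored by the loss) are replaced with the pad id.
    """
    shifted = [decoder_start_id] + labels[:-1]
    return [pad_id if token == -100 else token for token in shifted]


# Labels follow the target format: [lang_code] X [eos]
labels = [FR_ID, 41, 42, 43, EOS_ID]
decoder_input_ids = shift_tokens_right(labels, PAD_ID, EOS_ID)
print(decoder_input_ids)  # [2, 9, 41, 42, 43]
```

At every position the decoder sees the previous label and is trained to predict the current one, so the first prediction target is the target language token.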

	Generation
	M2M100 uses the eos_token_id as the decoder_start_token_id for generation, with the target language id
	forced as the first generated token. To force the target language id as the first generated token, pass the
	forced_bos_token_id parameter to the generate method. The following example shows how to translate from
	Hindi to French and from Chinese to English using the facebook/m2m100_418M checkpoint.
	```python
	from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

	hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
	chinese_text = "生活就像一盒巧克力。"

	model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
	tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

	# translate Hindi to French
	tokenizer.src_lang = "hi"
	encoded_hi = tokenizer(hi_text, return_tensors="pt")
	generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
	tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
	# "La vie est comme une boîte de chocolat."

	# translate Chinese to English
	tokenizer.src_lang = "zh"
	encoded_zh = tokenizer(chinese_text, return_tensors="pt")
	generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
	tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
	# "Life is like a box of chocolate."
	```

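To make the effect of forced_bos_token_id concrete, here is a toy greedy decoder written in plain Python. It is a simplified sketch, not the library's generation code: the six-entry vocabulary, the scores, and the helper names `force_token` and `greedy_generate` are all made up for illustration. The idea it shows is that decoding starts from the eos token and that, at the first step only, every score except the one for the target language token is masked out.

```python
import math
from typing import List


def force_token(scores: List[float], forced_id: int) -> List[float]:
    """Mask all scores except forced_id so that argmax must select it."""
    return [0.0 if i == forced_id else -math.inf for i in range(len(scores))]


def greedy_generate(
    step_scores: List[List[float]],
    decoder_start_id: int,
    forced_bos_id: int,
    eos_id: int,
) -> List[int]:
    """Greedy decoding over precomputed per-step scores (a stand-in for a model)."""
    output = [decoder_start_id]
    for step, scores in enumerate(step_scores):
        if step == 0:
            # First generated token is forced to the target language id.
            scores = force_token(scores, forced_bos_id)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        output.append(next_id)
        if next_id == eos_id:
            break
    return output


EOS_ID = 2  # illustrative eos id, used as the decoder start token
FR_ID = 4   # hypothetical id for the French language token

toy_scores = [
    [0.1, 0.2, 0.0, 0.9, 0.3, 0.0],  # unforced argmax would be 3
    [0.0, 0.0, 0.1, 0.2, 0.0, 0.8],  # argmax is 5
    [0.0, 0.0, 0.9, 0.1, 0.0, 0.0],  # argmax is 2 (eos), decoding stops
]

generated = greedy_generate(toy_scores, EOS_ID, FR_ID, EOS_ID)
print(generated)  # [2, 4, 5, 2]
```

Without the forcing step the first generated token in this toy example would be id 3, so the model itself would be choosing the output language instead of the caller.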
	Resources

	- Translation task guide
	- Summarization task guide

	M2M100Config
	[[autodoc]] M2M100Config
	M2M100Tokenizer
	[[autodoc]] M2M100Tokenizer
	- build_inputs_with_special_tokens
	- get_special_tokens_mask
	- create_token_type_ids_from_sequences
	- save_vocabulary
	M2M100Model
	[[autodoc]] M2M100Model
	- forward
	M2M100ForConditionalGeneration
	[[autodoc]] M2M100ForConditionalGeneration
	- forward