
# MarianMT

## Overview

MarianMT is a framework for translation models that uses the same architecture as BART. Translations should be similar, but not identical, to the output in the test set linked to in each model card.

This model was contributed by sshleifer.
## Implementation Notes

- Each model is about 298 MB on disk; there are more than 1,000 models.
- The list of supported language pairs can be found here.
- Models were originally trained by Jörg Tiedemann using the Marian C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
- The 80 opus models that require BPE preprocessing are not supported.

The modeling code is the same as [BartForConditionalGeneration] with a few minor modifications, illustrated in the sketch after this list:

- static (sinusoid) positional embeddings (`MarianConfig.static_position_embeddings=True`)
- no layernorm_embedding (`MarianConfig.normalize_embedding=False`)
- the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses `<s/>`)
- Code to bulk convert models can be found in `convert_marian_to_pytorch.py`.
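
A quick way to verify these differences on a real checkpoint is to load its config and print the relevant fields. This is a minimal sketch; the en-de checkpoint is only an example, and `getattr` is used defensively because the attribute names above can vary across library versions:

```python
from transformers import MarianConfig

# Any Marian checkpoint works here; en-de is only an example.
config = MarianConfig.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# Sinusoidal positions and no embedding layernorm, unlike Bart.
print(getattr(config, "static_position_embeddings", None))
print(getattr(config, "normalize_embedding", None))

# 6 encoder and 6 decoder layers; generation starts from pad_token_id.
print(config.encoder_layers, config.decoder_layers)
print(config.decoder_start_token_id == config.pad_token_id)
```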

## Naming

- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}` (see the sketch after this list).
- The language codes used to name models are inconsistent. Two-digit codes can usually be found here; three-digit codes require googling "language code {code}".
- Codes formatted like `es_AR` are usually `code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1,000 models use ISO-639-2 codes to identify languages; the second group uses a combination of ISO-639-5 and ISO-639-2 codes.
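
Because the hub id is fully determined by the source and target codes, a tiny helper can build it. `marian_model_name` below is a hypothetical convenience function, not part of transformers:

```python
# Hypothetical helper: build a hub id from language codes
# using the naming scheme described above.
def marian_model_name(src: str, tgt: str) -> str:
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

print(marian_model_name("en", "de"))  # Helsinki-NLP/opus-mt-en-de
print(marian_model_name("fr", "es"))  # Helsinki-NLP/opus-mt-fr-es
```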

## Examples

- Since Marian models are smaller than many other translation models available in the library, they can be useful for fine-tuning experiments and integration tests.
- Fine-tune on GPU; a minimal single-step sketch follows this list.
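
A minimal sketch of one fine-tuning step on a toy batch of parallel sentences. The checkpoint, data, and learning rate are illustrative choices, not recommendations:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # illustrative pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

src = ["I love you.", "How are you?"]
tgt = ["Ich liebe dich.", "Wie geht es dir?"]  # toy parallel data

# text_target tokenizes the labels with the target-language vocabulary.
# Note: pad tokens in the labels are not masked to -100 in this toy sketch.
batch = tokenizer(src, text_target=tgt, return_tensors="pt", padding=True).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**batch).loss  # cross-entropy over the target tokens
loss.backward()
optimizer.step()
```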

## Multilingual Models

- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`.
- If a model can output multiple languages, you should specify a language code by prepending the desired output language to the `src_text`.
- You can see a model's supported language codes in its model card, under target constituents, like in opus-mt-en-roa.
- Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.

New multilingual models from the Tatoeba-Challenge repo require 3-character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fra<< this is a sentence in english that we want to translate to french",
...     ">>por<< This should go to portuguese",
...     ">>esp<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-roa"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> print(tokenizer.supported_language_codes)
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']

>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
```

Here is the code to see all available pretrained models on the hub:
```python
from huggingface_hub import list_models

model_list = list_models()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split("/")[1] for x in model_ids]
# Old-style multilingual models use upper-case group names (e.g. en-ROMANCE),
# so any id that is not all lower-case is old style.
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
```

## Old Style Multi-Lingual Models

These are the old style multi-lingual models ported from the OPUS-MT-Train repo, along with the members of each language group:
```python no-style
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
```
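
For example, the mapping above can be used to validate a regional code before building the `>>code<<` prefix. `target_prefix` below is a hypothetical helper, not a library function:

```python
# Hypothetical helper built on GROUP_MEMBERS as defined above.
def target_prefix(code: str, group: str) -> str:
    # Fail fast on unsupported codes instead of silently mistranslating.
    if code not in GROUP_MEMBERS[group]:
        raise ValueError(f"{code!r} is not a member of {group}")
    return f">>{code}<< "

print(target_prefix("es_AR", "ROMANCE") + "How are you?")
# >>es_AR<< How are you?
```
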
Example of translating English to many Romance languages, using old-style 2-character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer

>>> src_text = [
...     ">>fr<< this is a sentence in english that we want to translate to french",
...     ">>pt<< This should go to portuguese",
...     ">>es<< And this to Spanish",
... ]

>>> model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
>>> tgt_text
["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
```

## Resources

- Translation task guide
- Summarization task guide
- Causal language modeling task guide

## MarianConfig

[[autodoc]] MarianConfig

## MarianTokenizer

[[autodoc]] MarianTokenizer
    - build_inputs_with_special_tokens

## MarianModel

[[autodoc]] MarianModel
    - forward

## MarianMTModel

[[autodoc]] MarianMTModel
    - forward

## MarianForCausalLM

[[autodoc]] MarianForCausalLM
    - forward

## TFMarianModel

[[autodoc]] TFMarianModel
    - call

## TFMarianMTModel

[[autodoc]] TFMarianMTModel
    - call

## FlaxMarianModel

[[autodoc]] FlaxMarianModel
    - __call__

## FlaxMarianMTModel

[[autodoc]] FlaxMarianMTModel
    - __call__