# BertJapanese

## Overview

The BERT models trained on Japanese text.

There are models with two different tokenization methods:

- Tokenize with MeCab and WordPiece. This requires an extra dependency, [fugashi](https://github.com/polm/fugashi), which is a wrapper around [MeCab](https://taku910.github.io/mecab/).
- Tokenize into characters.

To use *MecabTokenizer*, you should `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install the dependencies.

See [details on the cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).

Example of using a model with MeCab and WordPiece tokenization:

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> # Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)
```

Example of using a model with Character tokenization:

```python
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> # Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs["input_ids"][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)
```

This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku). This implementation is the same as BERT, except for the tokenization method. Refer to the [BERT documentation](bert) for API reference information.

## BertJapaneseTokenizer

[[autodoc]] BertJapaneseTokenizer
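If you want to inspect the subword tokens themselves rather than decoded ids, the tokenizer class can also be loaded explicitly. The following is a minimal sketch, assuming the `"ja"` extras above are installed; the expected token list follows from the decoded output shown in the MeCab/WordPiece example:

```python
>>> from transformers import BertJapaneseTokenizer

>>> # Load the MeCab + WordPiece tokenizer directly (requires fugashi).
>>> tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> # tokenize() returns the MeCab word segmentation after WordPiece splitting,
>>> # matching the decoded ids in the example above.
>>> tokenizer.tokenize("吾輩は猫である。")
['吾輩', 'は', '猫', 'で', 'ある', '。']
```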