It is pre-trained on mC4, a multilingual corpus derived from Common Crawl that covers 101 languages.