|
|
|
MegatronBERT |
|
Overview |
|
The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
|
The abstract from the paper is the following: |
|
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
|
This model was contributed by jdemouth. The original code can be found in NVIDIA's Megatron-LM repository.
|
That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, it contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques.
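
The key idea behind the tensor-parallel part is to split individual weight matrices of a layer across devices and stitch the partial results back together with a few communication operations. The snippet below is only an illustrative sketch, simulated on a single device with plain PyTorch tensors rather than across GPUs; the shapes, names, and the column-wise split shown here are assumptions for illustration, not the Megatron-LM implementation.

```python
# Illustrative sketch: the arithmetic behind a column-parallel linear layer,
# one building block of tensor (intra-layer) model parallelism.
import torch

hidden, ffn, world_size = 1024, 4096, 2  # hypothetical sizes / number of "workers"

x = torch.randn(8, hidden)   # a batch of hidden states
w = torch.randn(hidden, ffn) # full weight of an MLP linear layer

# Serial reference: one big matmul on a single device.
y_full = x @ w

# "Tensor parallel": split the weight column-wise, let each worker compute its
# partial output independently, then concatenate. In a real multi-GPU setup the
# concatenation (or a later reduction) is a communication operation.
shards = torch.chunk(w, world_size, dim=1)
y_parallel = torch.cat([x @ shard for shard in shards], dim=1)

print(torch.allclose(y_full, y_parallel, atol=1e-5))  # True: same result, split compute
```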
|
Usage tips |
|
We have provided pretrained BERT-345M checkpoints to use for evaluation or for finetuning on downstream tasks.

To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.
|
Alternatively, you can directly download the checkpoints using: |
|
BERT-345M-uncased: |
|
|
|
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0_1_uncased.zip
|
BERT-345M-cased: |
|
|
|
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0_1_cased.zip
|
Once you have obtained the checkpoints from NVIDIA GPU Cloud (NGC), you have to convert them to a format that can easily be loaded by Hugging Face Transformers and our port of the BERT code.

The following commands allow you to do the conversion. We assume that the folder models/megatron_bert contains megatron_bert_345m_v0_1_{cased, uncased}.zip and that the commands are run from inside that folder:
|
|
|
python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_uncased.zip |
|
|
|
python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_cased.zip |
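
Once converted, the checkpoint can be loaded like any other Transformers model. The following is a minimal sketch; the output directory name and the use of a standard BERT tokenizer are assumptions, so adjust them to whatever the conversion script actually produced on your machine.

```python
# Minimal usage sketch: load a converted Megatron-BERT checkpoint for masked LM.
from transformers import AutoTokenizer, MegatronBertForMaskedLM

# Hypothetical path: point this at the folder containing the converted
# config.json and model weights.
checkpoint_dir = "models/megatron_bert/megatron-bert-uncased-345m"

# Assumption: the uncased 345M checkpoint uses a BERT-style WordPiece
# vocabulary, so a standard BERT tokenizer is used as a stand-in here.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = MegatronBertForMaskedLM.from_pretrained(checkpoint_dir)

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
outputs = model(**inputs)

# Pick the highest-scoring token at the masked position.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, masked_index].argmax(-1)
print(tokenizer.decode(predicted_id))
```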
|
Resources |
|
|
|
- Text classification task guide
- Token classification task guide
- Question answering task guide
- Causal language modeling task guide
- Masked language modeling task guide
- Multiple choice task guide
|
|
|
MegatronBertConfig |
|
[[autodoc]] MegatronBertConfig |
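
As a rough illustration, a randomly initialized model can be built directly from a configuration. The hyperparameter values below mirror a BERT-large-sized setup (roughly the 345M-parameter regime) and are assumptions for illustration, not the exact values written out by the conversion script.

```python
# Sketch: instantiate a MegatronBERT model from a hand-written config.
from transformers import MegatronBertConfig, MegatronBertModel

config = MegatronBertConfig(
    vocab_size=29056,        # assumption; use the converted checkpoint's value
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
model = MegatronBertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```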
|
MegatronBertModel |
|
[[autodoc]] MegatronBertModel |
|
- forward |
|
MegatronBertForMaskedLM |
|
[[autodoc]] MegatronBertForMaskedLM |
|
- forward |
|
MegatronBertForCausalLM |
|
[[autodoc]] MegatronBertForCausalLM |
|
- forward |
|
MegatronBertForNextSentencePrediction |
|
[[autodoc]] MegatronBertForNextSentencePrediction |
|
- forward |
|
MegatronBertForPreTraining |
|
[[autodoc]] MegatronBertForPreTraining |
|
- forward |
|
MegatronBertForSequenceClassification |
|
[[autodoc]] MegatronBertForSequenceClassification |
|
- forward |
|
MegatronBertForMultipleChoice |
|
[[autodoc]] MegatronBertForMultipleChoice |
|
- forward |
|
MegatronBertForTokenClassification |
|
[[autodoc]] MegatronBertForTokenClassification |
|
- forward |
|
MegatronBertForQuestionAnswering |
|
[[autodoc]] MegatronBertForQuestionAnswering |
|
- forward |