|
# Efficient Large Scale Language Modeling with Mixtures of Experts |
|
|
|
## Introduction |
|
|
|
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This work empirically studies how autoregressive MoE language models scale relative to dense models across a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. See the associated paper for more details.
|
|
|
This repo contains instructions for reproducing results from the paper. |
|
|
|
## Pre-trained models |
|
|
|
These models are intended for research use only: to reproduce the results from the paper and to enable further research on the capabilities and limitations of language models. Please see the [model card](model_card.md) for details on how the models were trained and evaluated, as well as their limitations and intended use.
|
|
|
#### Dense models |
|
|
|
Dense models can be run directly from the `main` branch. |
|
|
|
Model | Layers | Model Dim | Languages | Download |
|
---|---|---|---|--- |
|
`dense_125m` | 12 | 768 | English | [en_dense_lm_125m.tar.gz (0.2GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_125m.tar.gz) |
|
`dense_355m` | 24 | 1024 | English | [en_dense_lm_355m.tar.gz (0.6GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_355m.tar.gz) |
|
`dense_1_3b` | 24 | 2048 | English | [en_dense_lm_1_3b.tar.gz (2.3GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_1_3b.tar.gz) |
|
`dense_2_7b` | 32 | 2560 | English | [en_dense_lm_2_7b.tar.gz (4.6GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_2_7b.tar.gz) |
|
`dense_6_7b` | 32 | 4096 | English | [en_dense_lm_6_7b.tar.gz (12GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_6_7b.tar.gz) |
|
`dense_13b` | 40 | 5120 | English | [en_dense_lm_13b.tar.gz (23GB)](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_13b.tar.gz) |
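
For example, the smallest dense checkpoint can be fetched and unpacked with the Python standard library and then loaded through fairseq's Hub Interface, as in the evaluation examples below. This is a minimal sketch; the name of the extracted directory (`en_dense_lm_125m`) is an assumption about the tarball's contents.

```python
import tarfile
import urllib.request

from fairseq.models.transformer_lm import TransformerLanguageModel

# Download and extract the 125M-parameter dense checkpoint (URL from the table above).
url = 'https://dl.fbaipublicfiles.com/fairseq/models/lm/en_dense_lm_125m.tar.gz'
urllib.request.urlretrieve(url, 'en_dense_lm_125m.tar.gz')
with tarfile.open('en_dense_lm_125m.tar.gz') as tar:
    tar.extractall('.')

# Load the extracted checkpoint (directory name assumed to match the tarball).
lm = TransformerLanguageModel.from_pretrained('en_dense_lm_125m', bpe='gpt2')
```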
|
|
|
#### Mixture of Experts models
|
|
|
MoE models must be run from the `moe` branch. Please see the |
|
[MoE README](https://github.com/pytorch/fairseq/tree/moe#evaluating-moe-language-models) |
|
for more details about how to load and evaluate MoE models. |
|
|
|
Model | Layers | Model Dim | Languages | Download |
|
---|---|---|---|--- |
|
`moe_15b` | 12 | 768 | English | [en_moe_lm_15b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_moe_lm_15b.tar.gz) |
|
`moe_52b` | 24 | 1024 | English | [en_moe_lm_52b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/lm/en_moe_lm_52b.tar.gz) |
|
`moe_207b` | 24 | 2048 | English | Available by request |
|
`moe_1_1t` | 32 | 4096 | English | Available by request |
|
|
|
## Evaluation |
|
|
|
### Example (COPA) |
|
|
|
The following snippet shows how to evaluate our dense models on the [Choice of |
|
Plausible Alternatives (COPA)](https://people.ict.usc.edu/~gordon/copa.html) task. |
|
|
|
```python |
|
from fairseq.models.transformer_lm import TransformerLanguageModel |
|
model_dir = '/path/to/en_dense_lm_125m' |
|
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='gpt2') |
|
lm = lm.eval()  # disable dropout
lm = lm.half()  # use FP16 for evaluation
lm = lm.cuda()  # move to GPU
|
|
|
def get_logprobs(prompt): |
|
import re |
|
    prompt = re.sub('\n+', '\n', prompt)  # collapse repeated newlines, which indicate separate documents
|
return lm.score(prompt, replace_newlines_with_eos=True)['positional_scores'] |
|
|
|
# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task. |
|
# A return value of 1 indicates that the first alternative is more plausible, |
|
# while 2 indicates that the second alternative is more plausible. |
|
def COPA_eval(prompt, alternative1, alternative2): |
|
lprob1 = get_logprobs(prompt + "\n" + alternative1).sum() |
|
lprob2 = get_logprobs(prompt + "\n" + alternative2).sum() |
|
return 1 if lprob1 > lprob2 else 2 |
|
|
|
COPA_eval("The man broke his toe. What was the CAUSE of this?", "He got a hole in his sock.", "He dropped a hammer on his foot.") |
|
# 2 |
|
COPA_eval("I tipped the bottle. What happened as a RESULT?", "The liquid in the bottle froze.", "The liquid in the bottle poured out.") |
|
# 2 |
|
COPA_eval("I knocked on my neighbor's door. What happened as a RESULT?", "My neighbor invited me in.", "My neighbor left his house.") |
|
# 1 |
|
``` |
|
|
|
### Data format |
|
|
|
Few-shot prompting is known to be sensitive to the input formatting, and it is usually best to match the formatting used in pretraining. |
|
|
|
During pretraining our models were presented with data in the following format (i.e., one paragraph per line, with a blank line separating documents): |
|
``` |
|
<doc0,para0,tok0> ... <doc0,para0,tokX> |
|
<doc0,para1,tok0> ... <doc0,para1,tokY> |
|
|
|
<doc1,para0,tok0> ... <doc1,para0,tokX>
|
... |
|
``` |
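
As an illustration, the sketch below assembles a few-shot prompt in this same one-example-per-line layout and scores it as in the COPA snippet above. The sentiment task, its examples, and its labels are hypothetical and only demonstrate the formatting; the newlines in the Python string are converted to the end-of-sentence symbol via `replace_newlines_with_eos` (see the note on newlines below).

```python
from fairseq.models.transformer_lm import TransformerLanguageModel
model_dir = '/path/to/en_dense_lm_125m'
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='gpt2')

# Hypothetical few-shot sentiment prompt: one in-context example per line,
# mirroring the "one paragraph per line" pretraining format.
few_shot_prompt = (
    "Review: A delightful film from start to finish. Sentiment: positive\n"
    "Review: I wasted two hours of my life. Sentiment: negative\n"
    "Review: The plot kept me guessing until the end. Sentiment:"
)

# The newlines above are replaced with the end-of-sentence symbol before scoring,
# matching what the model saw during pretraining.
scores = lm.score(few_shot_prompt, replace_newlines_with_eos=True)['positional_scores']
```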
|
|
|
#### Newlines |
|
|
|
While we use the byte-level BPE from GPT-2/3, fairseq's preprocessing replaces newlines with the end-of-sentence symbol (`</s>`), which corresponds to embedding index `2`. |
|
Thus **the model never saw newline characters during pretraining**, and raw newline characters should not be passed to the model during few-shot prompting; use `replace_newlines_with_eos` to convert them, as shown below.
|
|
|
This is more clearly illustrated in the following example, which uses fairseq's Hub Interface to tokenize two documents in the desired format: |
|
```python |
|
from fairseq.models.transformer_lm import TransformerLanguageModel |
|
model_dir = '/path/to/en_dense_lm_125m' |
|
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='gpt2') |
|
|
|
data = """\ |
|
This is the first paragraph of the first document. |
|
This is the second paragraph of the first document. |
|
|
|
This is the first paragraph of the second document.\ |
|
""" |
|
|
|
# The following is wrong, since it will encode newlines present in `data`. |
|
tokens_bad = lm.score(data)['tokens'] |
|
assert '\n' in lm.decode(tokens_bad) # oops, we encoded a newline |
|
|
|
# Instead pass the replace_newlines_with_eos option to get the correct behavior. |
|
tokens_good = lm.score(data, replace_newlines_with_eos=True)['tokens']
|
assert '\n' not in lm.decode(tokens_good) # no newlines were encoded |
|
``` |
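
As a further sanity check, the end-of-sentence symbol that replaces newlines should sit at embedding index 2, as noted above. The following is a small sketch that assumes the Hub Interface object exposes the underlying `task` and its dictionary:

```python
from fairseq.models.transformer_lm import TransformerLanguageModel
model_dir = '/path/to/en_dense_lm_125m'
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='gpt2')

# </s>, which fairseq substitutes for newlines, should correspond to embedding
# index 2 in the model's dictionary (assumes `task.source_dictionary` is exposed).
eos_index = lm.task.source_dictionary.eos()
assert eos_index == 2
```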
|
|
|
## Citation |
|
|
|
Coming soon. |
|
|