GPT Neo

Overview

The GPTNeo model was released in the EleutherAI/gpt-neo repository by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2-like causal language model trained on the Pile dataset.

The architecture is similar to GPT2, except that GPT Neo uses local attention in every other layer with a window size of 256 tokens.
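
This alternation is expressed in the model configuration. As a minimal sketch using the defaults of GPTNeoConfig (which correspond to the 24-layer 1.3B checkpoint):

```python
from transformers import GPTNeoConfig

# The ["global", "local"] pair is repeated 12 times to cover all 24 layers,
# so every other layer uses local attention with a 256-token window.
config = GPTNeoConfig(
    num_layers=24,
    attention_types=[[["global", "local"], 12]],
    window_size=256,
)
print(config.attention_layers[:4])  # ['global', 'local', 'global', 'local']
```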

This model was contributed by valhalla.

Usage example

The generate() method can be used to generate text with the GPT Neo model.

```python
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample up to 100 tokens in total (prompt included).
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```
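
Because do_sample=True, the decoded gen_text differs from run to run; note that max_length counts the prompt tokens, and the decoded text includes the prompt.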

Combining GPT-Neo and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2, which includes the sliding window attention feature, and make sure your hardware is compatible with Flash Attention 2. More details about the installation are available here.

Make sure as well to load your model in half precision (e.g. torch.float16).

To load and run a model using Flash Attention 2, refer to the snippet below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

prompt = "def hello_world():"

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
model.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
# 'def hello_world():\n >>> run_script("hello.py")\n >>> exit(0)\n<|endoftext|>'
```
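
To confirm which attention backend was actually loaded, you can inspect the model config; the attribute below is internal to transformers and may change between versions:

```python
print(model.config._attn_implementation)  # 'flash_attention_2'
```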

Expected speedups

Below is an expected speedup diagram comparing pure inference time between the native implementation in transformers, using the EleutherAI/gpt-neo-2.7B checkpoint, and the Flash Attention 2 version of the model.

Note that it is not possible to train or run GPT-Neo on very long contexts, since the maximum number of position embeddings is limited to 2048; this applies to all gpt-neo models and is not specific to FA-2.
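
To measure the speedup on your own hardware, you can time generation under both attention implementations. The following is a minimal sketch, not the benchmark behind the diagram; absolute numbers depend on the GPU, batch size, and sequence length:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def time_generation(attn_implementation):
    # Load the checkpoint with the requested attention backend.
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-neo-2.7B",
        torch_dtype=torch.float16,
        attn_implementation=attn_implementation,
    ).to("cuda")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    inputs = tokenizer(["def hello_world():"], return_tensors="pt").to("cuda")

    # Synchronize around generate() so we time the GPU work, not just the launch.
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start


print("eager:", time_generation("eager"))
print("flash_attention_2:", time_generation("flash_attention_2"))
```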

Resources

- Text classification task guide
- Causal language modeling task guide

GPTNeoConfig

[[autodoc]] GPTNeoConfig

GPTNeoModel

[[autodoc]] GPTNeoModel
- forward

GPTNeoForCausalLM

[[autodoc]] GPTNeoForCausalLM
- forward

GPTNeoForQuestionAnswering

[[autodoc]] GPTNeoForQuestionAnswering
- forward

GPTNeoForSequenceClassification

[[autodoc]] GPTNeoForSequenceClassification
- forward

GPTNeoForTokenClassification

[[autodoc]] GPTNeoForTokenClassification
- forward

FlaxGPTNeoModel

[[autodoc]] FlaxGPTNeoModel
- __call__

FlaxGPTNeoForCausalLM

[[autodoc]] FlaxGPTNeoForCausalLM
- __call__