# GPT Neo

## Overview

The GPTNeo model was released in the EleutherAI/gpt-neo repository by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT-2-like causal language model trained on the Pile dataset.

The architecture is similar to GPT2, except that GPT Neo uses local attention in every other layer, with a window size of 256 tokens (a short configuration sketch illustrating this attention pattern is included at the end of this page).

This model was contributed by valhalla.

## Usage example

The `generate()` method can be used to generate text with the GPT Neo model.

```python
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
```

## Combining GPT-Neo and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2 so that it includes the sliding window attention feature, and make sure your hardware is compatible with Flash Attention 2. More details concerning installation are available in the Flash Attention documentation.

Also make sure to load your model in half-precision (e.g. `torch.float16`).

To load and run a model using Flash Attention 2, refer to the snippet below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

prompt = "def hello_world():"

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
model.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
# "def hello_world():\n    >>> run_script("hello.py")\n    >>> exit(0)\n<|endoftext|>"
```

### Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the EleutherAI/gpt-neo-2.7B checkpoint and the Flash Attention 2 version of the model.

Note that GPT-Neo cannot be trained or run on very long contexts, as the maximum position embeddings are limited to 2048 tokens; this applies to all gpt-neo checkpoints and is not specific to FA-2.

## Resources

- Text classification task guide
- Causal language modeling task guide

## GPTNeoConfig

[[autodoc]] GPTNeoConfig

## GPTNeoModel

[[autodoc]] GPTNeoModel
    - forward

## GPTNeoForCausalLM

[[autodoc]] GPTNeoForCausalLM
    - forward

## GPTNeoForQuestionAnswering

[[autodoc]] GPTNeoForQuestionAnswering
    - forward

## GPTNeoForSequenceClassification

[[autodoc]] GPTNeoForSequenceClassification
    - forward

## GPTNeoForTokenClassification

[[autodoc]] GPTNeoForTokenClassification
    - forward

## FlaxGPTNeoModel

[[autodoc]] FlaxGPTNeoModel
    - __call__

## FlaxGPTNeoForCausalLM

[[autodoc]] FlaxGPTNeoForCausalLM
    - __call__
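
## Inspecting the local attention pattern

As noted in the overview, GPT Neo alternates global and local (windowed) attention across layers. Below is a minimal sketch of how this pattern can be inspected through `GPTNeoConfig`; the checkpoint name is reused from the usage example above, and the commented values reflect the released configurations.

```python
from transformers import GPTNeoConfig

# Load the configuration of a released checkpoint and inspect the attention pattern.
config = GPTNeoConfig.from_pretrained("EleutherAI/gpt-neo-1.3B")

# `attention_layers` lists the attention type used in each layer; the released
# checkpoints alternate between "global" and "local" attention.
print(config.attention_layers[:4])  # e.g. ['global', 'local', 'global', 'local']

# `window_size` is the local attention window (256 tokens for the released models).
print(config.window_size)
```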