For example, to enable offloading for the bigscience/bloom-1b7 model, start by creating a [BitsAndBytesConfig]:

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
```
Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:
```py
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
```
Now load your model with the custom `device_map` and `quantization_config`:
```py
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
```
## Outlier threshold
An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. |