For example, to enable offloading for the bigscience/bloom-1b7 model, start by creating a [BitsAndBytesConfig]:
```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
```
Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:

```py
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
```
Now load your model with the custom `device_map` and `quantization_config`:

```py
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
```
## Outlier threshold
An "outlier" is a hidden state value that exceeds a certain threshold; these values are computed in fp16 rather than quantized to int8.
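As a sketch of how this is configured, the outlier threshold can be adjusted through the `llm_int8_threshold` parameter of [BitsAndBytesConfig] (its default is 6.0); the value 10.0 below is only an illustrative choice, not a recommendation:

```python
from transformers import BitsAndBytesConfig

# Raise the outlier threshold so fewer hidden state values are kept in fp16.
# The default threshold in LLM.int8() is 6.0.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10.0,
)
```

Pass this config to `from_pretrained` via `quantization_config`, exactly as in the loading example above.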