|
|
|
# Fuyu
|
## Overview
|
The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. |
|
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs. |
|
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance. |
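
Conceptually, a minimal sketch of this flattening (a toy illustration; the 30x30 patch size and the way the newline marker is attached are assumptions for clarity, not the model's exact implementation):

```py
import torch

# Toy sketch: cut an image tensor into a raster-ordered grid of patches.
# In the real model, each flattened patch is linearly projected into the
# text embedding space and a special image-newline token follows each row,
# which is why no image positional embeddings are needed.
def image_to_patch_rows(image: torch.Tensor, patch: int = 30):
    c, h, w = image.shape
    rows = []
    for y in range(0, h - h % patch, patch):
        row = [
            image[:, y:y + patch, x:x + patch].reshape(-1)  # one c*patch*patch vector
            for x in range(0, w - w % patch, patch)
        ]
        rows.append(torch.stack(row))  # (patches_per_row, c * patch * patch)
    return rows  # an image-newline token would be appended after each row

rows = image_to_patch_rows(torch.rand(3, 90, 120))
print(len(rows), rows[0].shape)  # 3 rows of 4 patches, each 2700-dim
```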
|
|
|
The Fuyu models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the hub use `torch_dtype = 'float16'`, which will be used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
|
The dtype of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the dtype of the checkpoints online), then cast to the default dtype of `torch` (which is `torch.float32`). Users should specify the `torch_dtype` they want; if they don't, it will be `torch.float32`.
|
Fine-tuning the model in `float16` is not recommended and is known to produce `nan`; the model should therefore be fine-tuned in `bfloat16`.
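
For instance, a minimal sketch of loading the weights directly in `bfloat16` (using the `adept-hf-collab/fuyu-8b` checkpoint referenced later on this page):

```py
import torch
from transformers import FuyuForCausalLM

# Load the weights directly in bfloat16, the dtype recommended for fine-tuning,
# instead of letting them be cast to the torch default of float32.
model = FuyuForCausalLM.from_pretrained(
    "adept-hf-collab/fuyu-8b", torch_dtype=torch.bfloat16
)
```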
|
|
|
Tips: |
|
|
|
To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints:
|
|
|
```bash
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
    --ada_lib_path /path/to/adept-inference
```
|
For the chat model: |
|
|
|
```bash
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
```
|
Then, the model can be loaded via:
|
```py
from transformers import FuyuForCausalLM

# from_pretrained loads both the config and the converted weights from the output directory
model = FuyuForCausalLM.from_pretrained('/output/path')
```
|
Inputs need to be passed through a specific `FuyuProcessor` to have the correct formats. A processor requires an `image_processor` and a `tokenizer`. Hence, inputs can be loaded via:
|
|
|
```py
import io

import requests
from PIL import Image

from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu import FuyuImageProcessor

tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

text_prompt = "Generate a coco-style caption.\n"
bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))

inputs_to_model = processor(text=text_prompt, images=bus_image_pil)
```
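
From there, a brief sketch of running generation on these inputs (assuming `model` was loaded as shown above; the `max_new_tokens` value is an illustrative choice):

```py
# Generate a short caption and decode it back to text.
generated_ids = model.generate(**inputs_to_model, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```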
|
|
|
This model was contributed by [Molbap](https://huggingface.co/Molbap). The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
|
|
|
Fuyu uses a `sentencepiece`-based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer.
|
The `LlamaTokenizer` is used as it is a standard wrapper around `sentencepiece`.
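
As a quick sketch of round-tripping text through this tokenizer (assuming the `adept-hf-collab/fuyu-8b` checkpoint used above):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("adept-hf-collab/fuyu-8b")
ids = tokenizer("Generate a coco-style caption.\n").input_ids
print(tokenizer.decode(ids))  # byte fallback lets arbitrary bytes round-trip
```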
|
|
|
The authors suggest using the following prompt for image captioning: `f"Generate a coco-style caption.\n"`
|
|
|
## FuyuConfig
|
[[autodoc]] FuyuConfig |
|
## FuyuForCausalLM
|
[[autodoc]] FuyuForCausalLM |
|
- forward |
|
## FuyuImageProcessor
|
[[autodoc]] FuyuImageProcessor |
|
- __call__
|
## FuyuProcessor
|
[[autodoc]] FuyuProcessor |
|
- __call__