Perception Encoder

Model Details

[📃 Tech Report] [📂 GitHub]

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

Model Overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but also internally produces strong, general features that scale to downstream tasks. With alignment tuning, PE lets large-scale contrastive pretraining transfer to those downstream tasks by capitalizing on its general features.
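
To make "vision-language learning" concrete, here is a minimal sketch of the generic CLIP-style contrastive (InfoNCE) objective that models of this kind are trained with. It is illustrative only: the function name and toy tensors are ours, and it does not reproduce PE's actual pretraining recipe (batching, augmentations, and the synthetic video data engine are described in the tech report).

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings (illustrative, not PE's recipe)."""
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] similarity matrix; the diagonal holds the matched pairs.
    logits = logit_scale * image_features @ text_features.T
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), torch.tensor(100.0))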

Perception Encoder: Core

PE core is our base model trained with our robust image pretraining schedule and finetuned on the data generated by our synthetic video data engine.

Model Configurations

PE core currently comes in 3 sizes. PE core G is the main checkpoint, with the L and B models distilled from it.

| Scale | Tower  | Params | Width | Depth | MLP  | Heads | CLIP Dim | Resolution / Context Len |
|-------|--------|--------|-------|-------|------|-------|----------|--------------------------|
| B/16  | Vision | 0.09B  | 768   | 12    | 3072 | 12    | 1024     | 224px                    |
|       | Text   | 0.31B  | 1024  | 24    | 4096 | 16    | 1024     | 32 tokens                |
| L/14  | Vision | 0.32B  | 1024  | 24    | 4096 | 16    | 1024     | 336px                    |
|       | Text   | 0.31B  | 1024  | 24    | 4096 | 16    | 1024     | 32 tokens                |
| G/14  | Vision | 1.88B  | 1536  | 50    | 8960 | 16    | 1280     | 448px                    |
|       | Text   | 0.47B  | 1280  | 24    | 5120 | 20    | 1280     | 72 tokens                |

All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.
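
For intuition, the sketch below shows a generic learned-query attention pooling head in PyTorch: a single learned query attends over the patch tokens to produce one global embedding. The class name, shapes, and layer placement are illustrative and not PE's exact implementation (see the paper for the real details); the width 768 and 196 = 14x14 patch tokens correspond to the B/16 tower at 224px.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Minimal attention-pooling head: a learned query attends over patch tokens (illustrative sketch)."""

    def __init__(self, width, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, width) * 0.02)
        self.attn = nn.MultiheadAttention(width, num_heads, batch_first=True)
        self.ln = nn.LayerNorm(width)

    def forward(self, tokens):
        # tokens: [batch, num_patches, width] from the vision tower.
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, self.ln(tokens), self.ln(tokens))
        return pooled.squeeze(1)  # [batch, width] global embedding

# e.g. pooling 14x14 = 196 patch tokens of width 768 (the B/16 tower):
pooled = AttentionPool(width=768)(torch.randn(2, 196, 768))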

Model Performance

PE core obtains extremely strong results across the board on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval. We present a sample of its performance across those domains below.

| Model      | Checkpoint      | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I |
|------------|-----------------|-------|-------|------|-----------|----------|--------------|---------|
| B/16 224px | PE-Core-B16-224 | 78.4  | 71.7  | 62.4 | 71.9      | 50.9     | 65.6         | 47.6    |
| L/14 336px | PE-Core-L14-336 | 83.5  | 77.9  | 89.0 | 84.7      | 57.1     | 73.4         | 50.3    |
| G/14 448px | PE-Core-G14-448 | 85.4  | 80.2  | 92.6 | 88.2      | 58.1     | 76.9         | 51.2    |

PE core performs particularly well on hard benchmarks such as ObjectNet and ImageNet-A.

How to use

Model loading code

We provide the model loading code at https://github.com/facebookresearch/perception_models:

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models
conda create --name perception_models python=3.12
conda activate perception_models
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
pip install -e .

This will install an editable version of the repo, allowing you to make changes to the code without needing to reinstall the package every time.

Image and Text Feature extraction with a Trained Model

import torch
from PIL import Image
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())
# CLIP configs: ['PE-Core-G14-448', 'PE-Core-L14-336', 'PE-Core-B16-224']

model = pe.CLIP.from_config("PE-Core-B16-224", pretrained=True)  # Downloads from HF
model = model.cuda()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]

You can find more details in the GitHub repo.
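
As a further usage sketch, the snippet below ranks a small image gallery against a text query with the same interface. It assumes the forward pass accepts a batch of preprocessed images (just as it accepts a batch of text prompts above) and returns normalized features; we re-normalize defensively anyway. The gallery paths (e.g. docs/assets/dog.png) are placeholders.

import torch
import torch.nn.functional as F
from PIL import Image
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

model = pe.CLIP.from_config("PE-Core-B16-224", pretrained=True).cuda()
preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

# A small gallery of images and a single text query (paths are placeholders).
paths = ["docs/assets/cat.png", "docs/assets/dog.png"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).cuda()
query = tokenizer(["a photo of a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, _ = model(images, query)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Cosine similarity of the query against every gallery image.
    scores = (text_features @ image_features.T).squeeze(0)

for rank, idx in enumerate(scores.argsort(descending=True).tolist()):
    print(f"{rank + 1}. {paths[idx]}  (score={scores[idx].item():.3f})")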

Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}