LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Abstract
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates high-resolution image content with coordinate information and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to varying input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across a range of downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
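To make the coordinate-based idea concrete, here is a minimal NumPy sketch of how high-resolution coordinate queries can cross-attend over low-resolution VFM features to produce upsampled features. The Fourier coordinate embedding and the random projection matrices are illustrative stand-ins (a trained upsampler would learn these and would also condition queries on the high-resolution image); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(coords, num_freqs=4):
    """Map normalized (x, y) coordinates in [0, 1] to sin/cos Fourier features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # (num_freqs,)
    angles = coords[:, :, None] * freqs             # (N, 2, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)       # (N, 4 * num_freqs)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def grid_coords(size):
    """Normalized (x, y) coordinates of a size x size pixel grid."""
    ys, xs = np.meshgrid(np.linspace(0, 1, size),
                         np.linspace(0, 1, size), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

# Toy setup: upsample a 4x4 low-res feature map to 16x16.
d, h, H = 32, 4, 16
lr_feats = rng.standard_normal((h * h, d))          # low-res VFM features

# Stand-in for learned projections (assumption: not from the paper).
W_q = rng.standard_normal((16, d)) / np.sqrt(16)    # 16 = 4 * num_freqs
W_k = rng.standard_normal((d, d)) / np.sqrt(d)

queries = fourier_features(grid_coords(H)) @ W_q    # one query per output pixel
keys = lr_feats @ W_k
hr_feats = cross_attention(queries, keys, lr_feats)
print(hr_feats.shape)                               # (256, 32)
```

Because each output row is a convex combination of the low-resolution features, the upsampler can be queried at any coordinate grid, which is what lets the approach adapt to arbitrary output resolutions.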