
	---
	license: apache-2.0
	pipeline_tag: image-feature-extraction
	library_name: transformers
	---

	# GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

	Code: https://github.com/mashijie1028/GenHancer/

	Paper: https://arxiv.org/abs/2503.19480

	Project Page: https://mashijie1028.github.io/GenHancer/

	## Introduction

	The synergy between generative and discriminative models has received growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels at high-level semantics, it struggles to perceive fine-grained visual details. To enhance its representations, generative models typically take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored.

	In this work, we find empirically that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To identify the critical factors, we examine three aspects:

	1. **Conditioning mechanisms:** Even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We therefore conclude that using only global visual tokens as conditions is the most effective strategy.
	2. **Denoising configurations:** End-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge. We also show that lightweight denoisers can yield remarkable improvements.
	3. **Generation paradigms:** We explore both continuous and discrete denoisers, and both give desirable outcomes, validating the versatility of our method.
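The conditioning choice in point (1) can be illustrated with a minimal sketch. It assumes a CLIP-style ViT output in which index 0 holds the global (class) token and the remaining entries are local patch tokens; the helper name `select_condition_tokens` and the toy values are hypothetical and are not taken from the GenHancer code base.

```python
from typing import List, Sequence

Token = List[float]


def select_condition_tokens(tokens: Sequence[Token], num_local: int = 0) -> List[Token]:
    """Return the tokens that would be passed to the denoiser as conditions.

    Assumption: tokens[0] is the global (class) token and tokens[1:] are
    local patch tokens. With num_local=0 only the global token is kept,
    which is the setting the text above reports as most effective.
    """
    if not tokens:
        raise ValueError("expected at least the global token")
    if num_local < 0 or num_local > len(tokens) - 1:
        raise ValueError("num_local must be between 0 and the number of patch tokens")
    global_token = tokens[0]
    local_tokens = list(tokens[1 : 1 + num_local])
    return [global_token] + local_tokens


# Toy sequence: 1 global token followed by 4 patch tokens, each of width 3.
toy_tokens = [[float(i)] * 3 for i in range(5)]

# Global-only conditioning.
global_only = select_condition_tokens(toy_tokens)

# Variant that also leaks two local tokens, shown only for contrast.
with_two_local = select_condition_tokens(toy_tokens, num_local=2)
```

Keeping `num_local` as an explicit parameter makes the contrast easy to reproduce in an ablation: per the text, adding even a few local tokens makes reconstruction too easy for the enhancement signal to reach the visual encoder.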

	This exploration leads to an effective method that consistently outperforms prior methods on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAICLIP. The enhanced CLIP can be plugged into multimodal large language models for better vision-centric performance.

	## This repo

	The proposed two-stage post-training scheme is a plug-and-play method for enhancing the fine-grained representations of pre-trained CLIP models. Here we release the enhanced model weights for [OpenAICLIP](https://huggingface.co/openai/clip-vit-large-patch14-336), [MetaCLIP](https://huggingface.co/facebook/metaclip-h14-fullcc2.5b) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

	We also include the evaluation code in `evaluation/`.
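For orientation, here is a rough sketch of pair-level scoring in the style of MMVP-VLM. It rests on an assumption about the benchmark protocol that this README does not state: each item is a pair of images with a pair of statements, and the pair counts as correct only when both images prefer their own statement. The function names and similarity values are made up for illustration; the scripts in `evaluation/` remain the authoritative reference.

```python
from typing import Sequence

Sim2x2 = Sequence[Sequence[float]]


def pair_is_correct(sim: Sim2x2) -> bool:
    """sim[i][j] is the similarity between image i and statement j.

    Assumption: statement j is the matching statement for image j, and
    the pair is correct only if both images prefer their own statement.
    """
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]


def pair_accuracy(sims: Sequence[Sim2x2]) -> float:
    """Percentage of pairs for which both matches are correct."""
    if not sims:
        raise ValueError("expected at least one pair")
    correct = sum(pair_is_correct(s) for s in sims)
    return 100.0 * correct / len(sims)


# Made-up similarities for two pairs (not real model outputs).
toy_sims = [
    [[0.9, 0.1], [0.2, 0.8]],  # both images matched correctly
    [[0.9, 0.1], [0.7, 0.3]],  # second image prefers the wrong statement
]
toy_accuracy = pair_accuracy(toy_sims)
```

Under this scoring rule a model gets no credit for matching only one image of a pair, which is why small gains in fine-grained perception can move the score noticeably.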

	## Citation

	```bibtex
	@article{ma2025genhancer,
	  title={GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers},
	  author={Ma, Shijie and Ge, Yuying and Wang, Teng and Guo, Yuxin and Ge, Yixiao and Shan, Ying},
	  journal={arXiv preprint arXiv:2503.19480},
	  year={2025}
	}
	```