New foundation model for image and video captioning just dropped by NVIDIA AI
Describe Anything Model (DAM) is a 3B vision language model that generates detailed captions with localized references
The team released the models, the dataset, a new benchmark, and a demo: nvidia/describe-anything-680825bb8f5e41ff0785834c
Most vision LMs caption the image as a whole, lack localized references in their captions, and don't accept visual prompts (points, boxes, drawings around objects)
DAM addresses this on two levels: a new vision backbone that takes in both focal crops and the full image, and a large-scale dataset
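In spirit, the focal-crop idea is: cut out the region of interest with some surrounding context and feed both that crop and the full image to the model. A minimal sketch of the cropping step (the context margin and how DAM actually fuses global and local features are assumptions on my part, not taken from the release):

```python
# Sketch of a context-padded "focal crop" around a region of interest.
# The 0.5 context margin is an illustrative choice, not DAM's setting.
from PIL import Image

def focal_crop(image: Image.Image, box: tuple[int, int, int, int], context: float = 0.5) -> Image.Image:
    """Crop around `box` (x0, y0, x1, y1) with extra context on each side."""
    x0, y0, x1, y1 = box
    pad_w, pad_h = int((x1 - x0) * context), int((y1 - y0) * context)
    x0 = max(0, x0 - pad_w)
    y0 = max(0, y0 - pad_h)
    x1 = min(image.width, x1 + pad_w)
    y1 = min(image.height, y1 + pad_h)
    return image.crop((x0, y0, x1, y1))

image = Image.open("example.jpg")              # hypothetical input image
crop = focal_crop(image, box=(120, 80, 300, 260))
# Both `image` and `crop` (plus the region prompt) then go through the
# vision backbone, so the caption stays grounded in the selected region.
```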
They build the dataset by extending existing segmentation and referring-expression datasets like RefCOCO: the images and class labels are passed to VLMs, which generate detailed region captions.
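A rough sketch of that recipe, with `vlm_caption` standing in for whichever captioning VLM is used; the prompt wording and record format below are my assumptions, not the team's exact pipeline:

```python
# Expand short region annotations (e.g. from RefCOCO) into detailed
# localized captions by prompting a VLM on each region.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegionSample:
    image_path: str
    box: tuple[int, int, int, int]   # (x0, y0, x1, y1) region annotation
    label: str                       # short class name / referring expression

def build_detailed_captions(
    samples: list[RegionSample],
    vlm_caption: Callable[[str, tuple[int, int, int, int], str], str],
) -> list[dict]:
    """Turn short region labels into detailed captions via a VLM call."""
    dataset = []
    for s in samples:
        prompt = f"Describe the {s.label} inside the box {s.box} in detail."
        caption = vlm_caption(s.image_path, s.box, prompt)
        dataset.append({"image": s.image_path, "box": s.box, "caption": caption})
    return dataset
```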
Lastly, they also release a new benchmark, again with self-supervision: an LLM evaluates the detailed captions, focusing on localization.
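Conceptually, the LLM-as-judge step could look like the sketch below; the prompt text and 1-10 scale are placeholders, not the released benchmark's exact protocol:

```python
# Score a region-level caption with a judge LLM instead of reference captions.
from typing import Callable

JUDGE_PROMPT = (
    "You are grading a region-level image caption.\n"
    "Region label: {label}\n"
    "Candidate caption: {caption}\n"
    "Score 1-10 for how detailed and well-localized the caption is, "
    "and reply with only the number."
)

def judge_caption(label: str, caption: str, llm: Callable[[str], str]) -> int:
    """Ask a judge LLM to score a localized caption; `llm` maps prompt -> reply."""
    reply = llm(JUDGE_PROMPT.format(label=label, caption=caption))
    return int(reply.strip())
```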