Performance is too low
Dear authors,
Thank you for your hard work in open-sourcing the model. I’ve tested it on MSR-VTT, but the retrieval performance I get appears to be quite low compared to the reported results. I'm currently trying to debug this issue.
I then tried running the demo example and obtained the same output reported in another discussion in this repo (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B/discussions/1). Would you mind sharing the expected output for the demo to help with debugging?
Thanks!
Hi @fferroni, here are some of my observations (though they may not all be correct):
First, I tried a few other InternVideo2 checkpoints, and the results seemed reasonable. I evaluated their 1B version (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4), and it performed as expected. I haven’t tested their 6B version myself, but I assume this checkpoint (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) should work as well.
Second, even after switching to these models, there still appears to be roughly a 10-point gap on MSR-VTT and other datasets compared to the results reported in their paper. I came across a related GitHub issue that I think provides a reasonable explanation: the numbers reported in the paper likely include an ITM re-ranking stage. (GitHub issue: https://github.com/OpenGVLab/InternVideo/issues/136)
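For anyone unfamiliar with that stage, here is a minimal sketch of BLIP-style ITM re-ranking; the `itm_head` interface below is hypothetical, and InternVideo2's actual fusion module and scoring may differ:

```python
import torch

def rerank_with_itm(sim_matrix, video_feats, text_feats, itm_head, k=16):
    """Re-rank the top-k candidates from the dual encoders with an ITM head.

    sim_matrix: (num_texts, num_videos) cosine similarities.
    itm_head:   hypothetical cross-modal matching module that returns a
                scalar match logit for one (video, text) pair.
    """
    reranked = torch.full_like(sim_matrix, float("-inf"))
    for t in range(sim_matrix.size(0)):
        topk_sim, topk_idx = sim_matrix[t].topk(k)  # coarse candidates
        for rank, v in enumerate(topk_idx):
            # fine-grained score from cross-attention fusion
            itm_logit = itm_head(video_feats[v], text_feats[t])
            reranked[t, v] = topk_sim[rank] + itm_logit
    return reranked
```

Only the top-k candidates per query get a fused score, which is what keeps this cheap enough to run at eval time while still moving the retrieval metrics.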
OK, good to know that other people are running into the same issues as me.
I also got more reasonable metrics from the 1B/6B versions you link. However, that 6B version has an embedding dimensionality of 768, while the 6B checkpoint in this repo has 512, and some of the cross-attention weight matrices are sized differently as well. Rather confusing...
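In case it helps anyone reproduce the comparison, this is roughly how the two checkpoints can be diffed; the file names and the flat state-dict layout are assumptions, so adjust them to whatever the repos actually ship:

```python
import torch

# Placeholder paths -- point these at the two downloaded checkpoints.
ckpt_a = torch.load("InternVideo2-Stage2_6B/pytorch_model.bin", map_location="cpu")
ckpt_b = torch.load("InternVideo2-Stage2_6B-224p-f4/pytorch_model.bin", map_location="cpu")

# Some checkpoints nest the weights under a key like "model" or "module".
ckpt_a = ckpt_a.get("model", ckpt_a)
ckpt_b = ckpt_b.get("model", ckpt_b)

# Parameters present in both but with different shapes (e.g. the 512- vs
# 768-dim projection heads and the cross-attention matrices):
for name in sorted(set(ckpt_a) & set(ckpt_b)):
    if ckpt_a[name].shape != ckpt_b[name].shape:
        print(f"{name}: {tuple(ckpt_a[name].shape)} vs {tuple(ckpt_b[name].shape)}")

# Parameters that exist in only one of the two:
for name in sorted(set(ckpt_a) ^ set(ckpt_b)):
    print("only in one checkpoint:", name)
```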
I also don't get the same metrics on those, though. I agree ITM re-ranking will play a role, and possibly also DSL (dual softmax loss), which is another test-time re-ranking strategy. It isn't mentioned in the paper, but it can be observed in the evaluation code... :shrug:
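For reference, the test-time DSL trick (introduced in the CAMoE paper) just re-weights the similarity matrix with a softmax over the opposite axis before ranking. A minimal sketch, assuming L2-normalized embeddings; the temperature value here is a guess, the actual evaluation code may use a different one:

```python
import torch

def apply_dsl(sim, temperature=100.0):
    """Dual-softmax re-weighting of a (num_texts, num_videos) similarity matrix.

    Each text->video score is scaled by how strongly that video prefers the
    text among all texts (softmax over dim=0). This suppresses 'hub' videos
    that sit close to every query. Note it needs the whole test set's
    similarity matrix at once, so it is a retrieval-set-level trick.
    """
    return sim * torch.softmax(sim * temperature, dim=0)

# Usage: sim = text_embeds @ video_embeds.T, then rank videos per text
# (argsort along dim=1) on apply_dsl(sim) instead of sim.
```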
Thank you
Thanks for the reply! I was aware that DSL might be used during training, but I didn’t realize it was also applied during test-time re-ranking. Would you mind sharing any relevant pointers or references on that part?
By the way, I think another factor contributing to the performance gap is the number of frames. The paper mentions using 8 frames, but the config file indicates 4. In my experience, going from 4 to 8 frames can yield at least a 2–3 point improvement in other video embedding models.
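If anyone wants to test that, the frame count is usually just a sampling parameter. A sketch of the uniform sampling typically used (the function name is mine, not from the repo); note that the temporal positional embeddings may need interpolation if the model was trained at 4 frames:

```python
import numpy as np

def sample_frame_indices(num_total_frames, num_frames=8):
    """Uniformly sample `num_frames` indices across the whole video.

    The released config uses num_frames=4; the paper reports 8.
    """
    return np.linspace(0, num_total_frames - 1, num_frames).round().astype(int)

print(sample_frame_indices(120, num_frames=8))
# -> [  0  17  34  51  68  85 102 119]
```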
Thanks a lot!