Performance is too low
Dear authors,
Thank you for your hard work in open-sourcing the model. I’ve tested it on MSR-VTT, but the retrieval performance I get appears to be quite low compared to the reported results. I'm currently trying to debug this issue.
I then tried running the demo example and obtained the same output reported in another discussion in this repo (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B/discussions/1). Would you mind sharing the expected output for the demo to help with debugging?
Thanks!
Hi @fferroni, here are some of my observations (though they may not all be correct):
First, I tried a few other InternVideo2 checkpoints, and the results seemed reasonable. I evaluated their 1B version (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4), and it performed as expected. I haven’t tested their 6B version myself, but I assume this checkpoint (https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) should work as well.
Second, even after switching to these models, there still appears to be roughly a 10-point gap on MSR-VTT and other datasets compared to the results reported in their paper. I came across a related GitHub issue that I think provides a reasonable explanation: the numbers reported in the paper likely include an ITM re-ranking stage. (GitHub issue: https://github.com/OpenGVLab/InternVideo/issues/136)
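For anyone unfamiliar with that stage, here is a minimal sketch of BLIP-style ITM re-ranking; the `itm_head` interface below is hypothetical, and InternVideo2's actual fusion module and scoring may differ:

```python
import torch

def rerank_with_itm(sim_matrix, video_feats, text_feats, itm_head, k=16):
    """Re-rank the top-k candidates from the dual encoders with an ITM head.

    sim_matrix: (num_texts, num_videos) cosine similarities.
    itm_head:   hypothetical cross-modal matching module that returns a
                scalar match logit for one (video, text) pair.
    """
    reranked = torch.full_like(sim_matrix, float("-inf"))
    for t in range(sim_matrix.size(0)):
        topk_sim, topk_idx = sim_matrix[t].topk(k)  # coarse candidates
        for rank, v in enumerate(topk_idx):
            # fine-grained score from cross-attention fusion
            itm_logit = itm_head(video_feats[v], text_feats[t])
            reranked[t, v] = topk_sim[rank] + itm_logit
    return reranked
```

Only the top-k candidates per query get a fused score, which is what keeps this cheap enough to run at eval time while still moving the retrieval metrics.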
OK, good to know that other people are running into the same issues as me.
I also got more reasonable metrics from the 1B/6B versions you link. However, that 6B version has an embedding dimensionality of 768, while the 6B checkpoint in this repo has 512, and some of the cross-attention weight matrices are sized differently as well. Rather confusing...
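In case it helps anyone reproduce the comparison, this is roughly how the two checkpoints can be diffed; the file names and the flat state-dict layout are assumptions, so adjust them to whatever the repos actually ship:

```python
import torch

# Placeholder paths -- point these at the two downloaded checkpoints.
ckpt_a = torch.load("InternVideo2-Stage2_6B/pytorch_model.bin", map_location="cpu")
ckpt_b = torch.load("InternVideo2-Stage2_6B-224p-f4/pytorch_model.bin", map_location="cpu")

# Some checkpoints nest the weights under a key like "model" or "module".
ckpt_a = ckpt_a.get("model", ckpt_a)
ckpt_b = ckpt_b.get("model", ckpt_b)

# Parameters present in both but with different shapes (e.g. the 512- vs
# 768-dim projection heads and the cross-attention matrices):
for name in sorted(set(ckpt_a) & set(ckpt_b)):
    if ckpt_a[name].shape != ckpt_b[name].shape:
        print(f"{name}: {tuple(ckpt_a[name].shape)} vs {tuple(ckpt_b[name].shape)}")

# Parameters that exist in only one of the two:
for name in sorted(set(ckpt_a) ^ set(ckpt_b)):
    print("only in one checkpoint:", name)
```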
I also don't get the same metrics on those, though. I agree ITM re-ranking will play a role, and possibly also DSL (dual softmax loss), which is another test-time re-ranking strategy. It isn't mentioned in the paper, but it can be observed in the evaluation code... :shrug:
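For reference, the test-time DSL trick (introduced in the CAMoE paper) just re-weights the similarity matrix with a softmax over the opposite axis before ranking. A minimal sketch, assuming L2-normalized embeddings; the temperature value here is a guess, the actual evaluation code may use a different one:

```python
import torch

def apply_dsl(sim, temperature=100.0):
    """Dual-softmax re-weighting of a (num_texts, num_videos) similarity matrix.

    Each text->video score is scaled by how strongly that video prefers the
    text among all texts (softmax over dim=0). This suppresses 'hub' videos
    that sit close to every query. Note it needs the whole test set's
    similarity matrix at once, so it is a retrieval-set-level trick.
    """
    return sim * torch.softmax(sim * temperature, dim=0)

# Usage: sim = text_embeds @ video_embeds.T, then rank videos per text
# (argsort along dim=1) on apply_dsl(sim) instead of sim.
```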
Thank you
Thanks for the reply! I was aware that DSL might be used during training, but I didn’t realize it was also applied during test-time re-ranking. Would you mind sharing any relevant pointers or references on that part?
By the way, I think another factor contributing to the performance gap is the number of frames. The paper mentions using 8 frames, but the config file indicates 4. In my experience, going from 4 to 8 frames can yield at least a 2–3 point improvement in other video embedding models.
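If anyone wants to test that, the frame count is usually just a sampling parameter. A sketch of the uniform sampling typically used (the function name is mine, not from the repo); note that the temporal positional embeddings may need interpolation if the model was trained at 4 frames:

```python
import numpy as np

def sample_frame_indices(num_total_frames, num_frames=8):
    """Uniformly sample `num_frames` indices across the whole video.

    The released config uses num_frames=4; the paper reports 8.
    """
    return np.linspace(0, num_total_frames - 1, num_frames).round().astype(int)

print(sample_frame_indices(120, num_frames=8))
# -> [  0  17  34  51  68  85 102 119]
```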
Thanks a lot!