The model is trained using a bipartite matching loss: the predicted classes and bounding boxes of each of the
N = 100 object queries are compared to the ground truth annotations, which are padded up to the same length N
(so if an image contains only 4 objects, the remaining 96 annotations just have "no object" as class and
"no bounding box" as bounding box).
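
The padding step described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the helper name `pad_annotations`, the sentinel `NO_OBJECT = -1`, the use of `None` for "no bounding box", and the example class ids and box values are all assumptions made for illustration. It only shows how 4 ground truth objects become N = 100 targets; the matching between predictions and targets is not covered.

```python
NUM_QUERIES = 100  # N in the text: the number of object queries
NO_OBJECT = -1     # hypothetical sentinel class id standing for "no object"


def pad_annotations(classes, boxes, num_queries=NUM_QUERIES):
    """Pad ground truth classes and boxes so both lists have num_queries entries.

    Padded slots get NO_OBJECT as class and None as box ("no bounding box").
    """
    if len(classes) != len(boxes):
        raise ValueError("classes and boxes must have the same length")
    if len(classes) > num_queries:
        raise ValueError("more annotations than object queries")

    n_pad = num_queries - len(classes)
    padded_classes = list(classes) + [NO_OBJECT] * n_pad
    padded_boxes = list(boxes) + [None] * n_pad
    return padded_classes, padded_boxes


# An image with 4 annotated objects (made-up class ids and normalized boxes).
classes = [17, 17, 3, 62]
boxes = [
    (0.50, 0.50, 0.20, 0.30),
    (0.25, 0.40, 0.10, 0.15),
    (0.70, 0.60, 0.30, 0.25),
    (0.10, 0.80, 0.05, 0.10),
]

padded_classes, padded_boxes = pad_annotations(classes, boxes)
print(len(padded_classes), padded_classes.count(NO_OBJECT))  # prints "100 96"
```

After padding, predictions and targets both have length N, so each of the 100 object queries can be compared against exactly one target, real or "no object".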