The model is trained using a bipartite matching loss: the predicted classes and bounding boxes of each of the
N = 100 object queries are compared to the ground truth annotations, padded up to the same length N (so if an
image contains only 4 objects, the remaining 96 annotations have "no object" as class and "no bounding box" as
bounding box).
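
The padding step described above can be sketched in plain Python. This is only an illustration and not the library's implementation: the `Annotation` class, the `NO_OBJECT` label, and the `pad_annotations` helper are hypothetical names introduced here, and a missing box is represented by `None` as an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_QUERIES = 100        # N in the text: the number of object queries
NO_OBJECT = "no object"  # hypothetical placeholder class used for padding


@dataclass
class Annotation:
    """Hypothetical container for one ground truth annotation."""
    label: str
    # (x_min, y_min, x_max, y_max); None stands in for "no bounding box"
    box: Optional[Tuple[float, float, float, float]]


def pad_annotations(annotations: List[Annotation],
                    num_queries: int = NUM_QUERIES) -> List[Annotation]:
    """Pad the ground truth list with "no object" entries up to num_queries."""
    if len(annotations) > num_queries:
        raise ValueError("more annotations than object queries")
    missing = num_queries - len(annotations)
    padding = [Annotation(NO_OBJECT, None) for _ in range(missing)]
    return list(annotations) + padding


# An image with 4 annotated objects (made-up labels and boxes).
ground_truth = [
    Annotation("cat", (10.0, 20.0, 50.0, 80.0)),
    Annotation("dog", (60.0, 30.0, 120.0, 90.0)),
    Annotation("ball", (5.0, 5.0, 15.0, 15.0)),
    Annotation("sofa", (0.0, 40.0, 200.0, 150.0)),
]
padded = pad_annotations(ground_truth)
num_padding = sum(1 for a in padded if a.label == NO_OBJECT)
print(len(padded), num_padding)  # 100 96
```

After padding, predictions and targets have the same length N, so each of the N object queries can be paired with exactly one target, either a real object or a "no object" entry.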