DETR consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for object detection.