By default, the processor applies OCR to the image and builds the input sequence as [CLS] question tokens [SEP] word tokens [SEP], where the word tokens come from the words recognized by OCR.
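To make the sequence layout concrete, here is a minimal pure-Python sketch of how the two segments are joined. It is only an illustration and not the processor's actual implementation: `build_qa_sequence` is a hypothetical helper, the question is split on whitespace rather than with a real tokenizer, and the word list is a made-up stand-in for OCR output.

```python
from typing import List


def build_qa_sequence(question_tokens: List[str], word_tokens: List[str]) -> List[str]:
    """Hypothetical helper: lay tokens out as [CLS] question [SEP] words [SEP]."""
    return ["[CLS]", *question_tokens, "[SEP]", *word_tokens, "[SEP]"]


# Assumption: whitespace splitting stands in for real subword tokenization.
question_tokens = "what is the date ?".split()

# Assumption: these words are invented; in practice they would come from OCR.
word_tokens = ["invoice", "date", ":", "2021"]

sequence = build_qa_sequence(question_tokens, word_tokens)
print(" ".join(sequence))
# prints: [CLS] what is the date ? [SEP] invoice date : 2021 [SEP]
```

The question occupies the first segment and the OCR words the second, so the model can tell the query apart from the document content.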