If only PyTorch checkpoints are available for a particular vision encoder-decoder model,
a workaround is to save the encoder and decoder separately and load them into the TensorFlow class:
```python
from transformers import VisionEncoderDecoderModel, TFVisionEncoderDecoderModel

# Load the PyTorch checkpoint, then save its encoder and decoder separately.
_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
_model.encoder.save_pretrained("./encoder")
_model.decoder.save_pretrained("./decoder")

# Rebuild the model in TensorFlow from the saved PyTorch weights.
model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
)

# This is only for copying some specific attributes of this particular model.
model.config = _model.config
```