If there are only PyTorch checkpoints for a particular vision encoder-decoder model, a workaround is: thon from transformers import VisionEncoderDecoderModel, TFVisionEncoderDecoderModel _model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning") _model.encoder.save_pretrained("./encoder") _model.decoder.save_pretrained("./decoder") model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained( "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True ) This is only for copying some specific attributes of this particular model.