This difference means that multimodal inputs and outputs cannot be passed between transformers-agents and langchain.