This discrepancy adds difficulty to multimodal representation learning.