ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning.
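
To make the contrastive objective concrete, here is a minimal sketch of a symmetric image-to-text and text-to-image loss over a batch of paired embeddings. It is not ALIGN's training code: the function names, the toy 2-dimensional embeddings, and the fixed temperature of 0.07 are all assumptions chosen for illustration, and the two encoders are replaced by precomputed vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of (image, text) embedding pairs.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature value is a hypothetical default, not taken from the source.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]

    # logits[i][j] = similarity between image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to score each text against every image
    logits_t = [list(col) for col in zip(*logits)]

    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_t)
    return image_to_text + text_to_image


# Toy batch: matched pairs point in the same direction.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = contrastive_loss(image_batch, matched_text)
shuffled_loss = contrastive_loss(image_batch, shuffled_text)
```

In this toy batch the correctly paired embeddings should yield a much lower loss than the shuffled ones, which is the signal that pulls matching image and text representations together during training.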