Based on this observation, we hypothesize that the general architecture of the Transformer, rather than the specific token mixer module, is more essential to the model's performance.
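
To make the hypothesis concrete, here is a minimal PyTorch sketch (class names are hypothetical, not part of the library) of a MetaFormer-style block in which the token mixer is a pluggable module. Following PoolFormer, self-attention is swapped for simple average pooling while the surrounding architecture, normalization, residual connections, and channel MLP, stays unchanged:

```python
import torch
from torch import nn


class PoolingTokenMixer(nn.Module):
    """Token mixer that replaces attention with average pooling.

    Subtracting the input removes the identity component, since the
    residual connection in the block already adds the input back.
    """

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False
        )

    def forward(self, x):  # x: (batch, channels, height, width)
        return self.pool(x) - x


class MetaFormerBlock(nn.Module):
    """General MetaFormer block: any token mixer can be plugged in.

    Structure: norm -> token mixer -> residual, then norm -> channel MLP
    -> residual. GroupNorm with one group acts as a channel-wise LayerNorm
    over the 2D feature map.
    """

    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        self.token_mixer = token_mixer
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


# The block preserves the input shape, so mixers are interchangeable.
block = MetaFormerBlock(dim=64, token_mixer=PoolingTokenMixer())
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Because the token mixer is just a constructor argument, an attention-based mixer could be passed in its place without touching the rest of the block, which is the design point the hypothesis is making.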