The authors of Reformer: The Efficient Transformer noticed that since the
computation is independent of the sequence_length dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers [batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n
individually and concatenate them afterward to [batch_size, sequence_length, config.hidden_size] with n = sequence_length. This trades increased computation time for reduced memory use, but yields a mathematically
equivalent result.
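
To illustrate the equivalence, here is a minimal PyTorch sketch (not the library's implementation): it applies a feed forward block to each position chunk of shape [batch_size, hidden_size] separately, concatenates the results along the sequence dimension, and checks that the output matches the unchunked computation. The tensor shapes and the feed_forward module are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes; any values would work for this check.
batch_size, sequence_length, hidden_size = 2, 8, 16

# A generic transformer-style feed forward block (assumed for this sketch).
feed_forward = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

hidden_states = torch.randn(batch_size, sequence_length, hidden_size)

# Full computation: one pass over [batch_size, sequence_length, hidden_size].
full_output = feed_forward(hidden_states)

# Chunked computation: process each position ([batch_size, hidden_size]) on its
# own, then concatenate along the sequence_length dimension.
chunks = [feed_forward(hidden_states[:, i, :]) for i in range(sequence_length)]
chunked_output = torch.stack(chunks, dim=1)

# Same result, but the intermediate 4 * hidden_size activation only ever exists
# for one position at a time, reducing peak memory at the cost of extra passes.
assert torch.allclose(full_output, chunked_output, atol=1e-6)
```

The sketch chunks one position at a time for clarity; in practice the chunk size can be chosen larger to balance the memory savings against the overhead of the extra forward passes.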