Local Self Attention
Local self attention is essentially a "normal" self attention layer with key, query and value projections, but chunked so that in each chunk of length config.local_chunk_length the query embedding vectors only attend to the key embedding vectors in their own chunk, to the key embedding vectors of the config.local_num_chunks_before previous neighboring chunks, and to the key embedding vectors of the config.local_num_chunks_after following neighboring chunks.
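To make the chunking concrete, below is a minimal sketch of such chunked local attention. It is simplified on purpose: a single head, no batching, no attention masking, and no padding or wrap-around handling at the sequence boundaries, all of which the actual implementation takes care of. The function name local_self_attention and its signature are illustrative, not the library's API; only the three config values mirror the parameters named above.

```python
import torch

def local_self_attention(query, key, value, chunk_length, num_chunks_before, num_chunks_after):
    # query, key, value: (seq_len, hidden_size); seq_len is assumed divisible by chunk_length
    seq_len, hidden_size = query.shape
    num_chunks = seq_len // chunk_length

    # Split the projected sequences into chunks: (num_chunks, chunk_length, hidden_size)
    q = query.view(num_chunks, chunk_length, hidden_size)
    k = key.view(num_chunks, chunk_length, hidden_size)
    v = value.view(num_chunks, chunk_length, hidden_size)

    outputs = []
    for i in range(num_chunks):
        # Keys/values visible to chunk i: its own chunk plus the configured
        # number of previous and following neighboring chunks
        start = max(0, i - num_chunks_before)
        end = min(num_chunks, i + num_chunks_after + 1)
        k_local = k[start:end].reshape(-1, hidden_size)
        v_local = v[start:end].reshape(-1, hidden_size)

        # Standard scaled dot-product attention, restricted to the local window
        scores = q[i] @ k_local.transpose(0, 1) / hidden_size ** 0.5
        probs = torch.softmax(scores, dim=-1)
        outputs.append(probs @ v_local)

    # Reassemble the per-chunk outputs into the full sequence
    return torch.cat(outputs, dim=0)  # (seq_len, hidden_size)
```

Because each query position only ever sees a fixed number of neighboring chunks, the memory and compute cost of the attention matrix scales with chunk_length times the local window size rather than with the full sequence length squared.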