They follow the architecture described in "Generating Long Sequences with Sparse Transformers", modified to support longer context lengths. |