Abstract
Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explore trade-offs in the ordering and ratio of these blocks to improve upon the current Transformer architecture, and propose the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining perplexity on the WikiText-103 language modelling benchmark. We further validate our results on the text8 and enwiki8 datasets, as well as on the BERT model.
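To make the idea concrete, below is a minimal PyTorch sketch (not the authors' code) of a stack built from a layout string of self-attention ('s') and feed-forward ('f') sub-blocks, so that most attention sub-blocks can be swapped for cheaper feed-forward ones. The layout strings, dimensions, and block internals are illustrative assumptions, not values taken from the paper, and causal masking plus the Transformer-XL recurrence memory are omitted for brevity.

```python
# Sketch of a PAR-style block layout (illustrative only, not the paper's implementation).
# 's' = self-attention sub-block, 'f' = feed-forward sub-block.
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection


class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))  # residual connection


def build_stack(layout: str, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048) -> nn.Sequential:
    """Build a block stack from a layout string of 's' and 'f' characters."""
    blocks = []
    for c in layout:
        if c == "s":
            blocks.append(SelfAttentionBlock(d_model, n_heads))
        elif c == "f":
            blocks.append(FeedForwardBlock(d_model, d_ff))
        else:
            raise ValueError(f"unknown block type: {c!r}")
    return nn.Sequential(*blocks)


# A vanilla Transformer stack alternates attention and feed-forward sub-blocks;
# a PAR-style stack keeps only a few attention sub-blocks and fills the rest
# with feed-forward sub-blocks (hypothetical layouts shown here).
baseline = build_stack("sf" * 6)                   # 6 attention + 6 feed-forward
par_like = build_stack("ff" + "sfff" * 2 + "ff")   # 2 attention + 10 feed-forward

x = torch.randn(2, 128, 512)   # (batch, sequence, d_model)
print(par_like(x).shape)       # torch.Size([2, 128, 512])
```

Since the feed-forward sub-blocks avoid the quadratic cost of attention over the sequence length, shifting the layout toward 'f' sub-blocks is what yields the compute savings the abstract reports.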