In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation.
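As a sketch of this objective (the notation below is assumed for illustration, not quoted from this passage): let $\mathcal{Z}_T$ denote the set of all permutations of a length-$T$ index sequence, and let $z_t$ and $\mathbf{x}_{z_{<t}}$ denote the $t$-th element of a permutation $z \in \mathcal{Z}_T$ and the tokens preceding it under that order. The expected likelihood over factorization orders can then be written as
\[
\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T}\!\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{z_{<t}}\right) \right],
\]
so that each token is predicted autoregressively, but because the factorization order varies across permutations, every position is in expectation conditioned on context from both sides.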