Corrupts the inputs by using random masking: during pretraining, a given percentage of tokens (usually 15%) is selected and replaced by one of the following (see the sketch after this list):
- a special mask token with probability 0.8
- a random token different from the masked one with probability 0.1
- the same token, left unchanged, with probability 0.1
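A minimal sketch of this corruption step, assuming plain Python lists of string tokens and a hypothetical `MASK_TOKEN` placeholder (real implementations work on token-ID tensors in batches):

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical stand-in for the tokenizer's mask token

def corrupt_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style random masking to a token sequence.

    Each token is selected with probability `mask_prob` (15% here). A
    selected token is replaced by the mask token with probability 0.8,
    by a random different token with probability 0.1, and kept as-is
    with probability 0.1. Returns the corrupted sequence and the
    positions whose original token the model must predict.
    """
    corrupted = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets.append(i)
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = MASK_TOKEN
            elif roll < 0.9:
                # random token different from the one being masked
                # (assumes the vocabulary has more than one token)
                corrupted[i] = random.choice([t for t in vocab if t != token])
            # else: keep the original token (probability 0.1)
    return corrupted, targets
```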
The model must then predict the original tokens at the masked positions, but it also has a second objective: the inputs are two sentences A and B (with a separation token in between), and in 50% of cases B actually follows A in the corpus; the model has to predict whether the two sentences are consecutive (next sentence prediction).
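A similarly hedged sketch of how such sentence pairs could be built, assuming `corpus` is an in-order list of sentence strings and using the conventional `[CLS]`/`[SEP]` markers:

```python
import random

def make_nsp_example(corpus, index):
    """Build one next-sentence-prediction example.

    With probability 0.5, sentence B is the sentence that actually
    follows sentence A in the corpus (label 1); otherwise B is drawn
    at random from elsewhere in the corpus (label 0).
    """
    sentence_a = corpus[index]
    is_next = random.random() < 0.5 and index + 1 < len(corpus)
    if is_next:
        sentence_b = corpus[index + 1]
    else:
        # draw a random sentence, avoiding the true next sentence
        other = random.randrange(len(corpus))
        while other == index + 1:
            other = random.randrange(len(corpus))
        sentence_b = corpus[other]
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]", int(is_next)
```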