Corrupts the inputs by using random masking: during pretraining, a given percentage of tokens (usually 15%) is selected and replaced by:

- a special mask token with probability 0.8
- a random token different from the one masked with probability 0.1
- the same token with probability 0.1

A minimal sketch of this 80/10/10 scheme follows the list.
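The snippet below is a minimal sketch of how this 80/10/10 corruption could be applied to a sequence of token ids. The mask token id, vocabulary size, and the `-100` ignore label are assumptions chosen to match a BERT-style setup, not a specific implementation.

```python
import random

MASK_TOKEN_ID = 103   # assumption: id of the [MASK] token in a BERT-style vocabulary
VOCAB_SIZE = 30522    # assumption: BERT-base vocabulary size
MASK_PROB = 0.15      # fraction of tokens selected for corruption

def corrupt(token_ids):
    """Apply the 80/10/10 masking scheme and return (corrupted inputs, labels)."""
    inputs, labels = [], []
    for token_id in token_ids:
        if random.random() < MASK_PROB:
            labels.append(token_id)              # the model must recover this token
            r = random.random()
            if r < 0.8:                          # 80%: replace with the mask token
                inputs.append(MASK_TOKEN_ID)
            elif r < 0.9:                        # 10%: replace with a random token
                inputs.append(random.randrange(VOCAB_SIZE))
            else:                                # 10%: keep the original token
                inputs.append(token_id)
        else:
            inputs.append(token_id)
            labels.append(-100)                  # position ignored by the loss
    return inputs, labels
```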
The model must predict the original tokens at the masked positions, but it also has a second objective: the inputs are two sentences A and B (with a separation token in between). With probability 0.5 the two sentences are consecutive in the corpus; otherwise they are unrelated, and the model has to predict whether the sentences follow each other (next sentence prediction).
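As an illustration, the sentence pair can be encoded with a BERT tokenizer, which inserts the separation token automatically; the checkpoint name and example sentences below are only illustrative.

```python
from transformers import AutoTokenizer

# Assumption: any BERT-style checkpoint works; bert-base-uncased is used for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "The cat sat on the mat."
sentence_b = "It then fell asleep."

# Passing the two sentences together produces [CLS] A [SEP] B [SEP],
# with token_type_ids distinguishing sentence A (0) from sentence B (1).
encoded = tokenizer(sentence_a, sentence_b)
print(tokenizer.decode(encoded["input_ids"]))
print(encoded["token_type_ids"])
```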