To leverage the inductive
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses.
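
As a minimal sketch of how such a triple loss could be assembled, the following pure-Python example combines the three terms for a single token position. The details here are assumptions for illustration and are not taken from the text above: the function names, the softmax temperature, the equal loss weights, the use of cross-entropy against the teacher's softened distribution for the distillation term, and the use of one minus cosine similarity between student and teacher hidden vectors for the cosine-distance term.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an (assumed) temperature parameter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def language_modeling_loss(student_logits, target_index):
    """Negative log-likelihood of the true token under the student."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution.

    The temperature value is a hypothetical choice for this sketch.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def cosine_distance_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of two non-zero hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    student_norm = math.sqrt(sum(a * a for a in student_hidden))
    teacher_norm = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (student_norm * teacher_norm)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                weights=(1.0, 1.0, 1.0), temperature=2.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_lm, w_distill, w_cos = weights
    return (
        w_lm * language_modeling_loss(student_logits, target_index)
        + w_distill * distillation_loss(student_logits, teacher_logits, temperature)
        + w_cos * cosine_distance_loss(student_hidden, teacher_hidden)
    )


# Toy usage: a 3-token vocabulary and 2-dimensional hidden states.
loss = triple_loss(
    student_logits=[1.5, 0.5, 0.2],
    teacher_logits=[2.0, 1.0, 0.1],
    target_index=0,
    student_hidden=[0.9, 0.1],
    teacher_hidden=[1.0, 0.0],
)
print(loss)
```

In this sketch the language modeling term ties the student to the training data, the distillation term pulls its output distribution toward the teacher's, and the cosine term aligns the direction of the two models' hidden vectors. In practice these quantities would be averaged over all token positions in a batch and computed with an automatic-differentiation library.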