5fa1a76
1
2
3
One can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along the sequence dimension, and placing a linear layer on top of that to project the d_latents to num_labels.