Note that any pretrained Transformer-based speech model, e.g.