The model takes log-mel filter bank features extracted from the audio waveform as input and is pretrained autoregressively to generate a transcript or translation.
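To make the input representation concrete, here is a minimal NumPy sketch of how log-mel filter bank features could be computed from a waveform. It is not the model's actual front end: the function names (`mel_filter_bank`, `log_mel_features`) are hypothetical, and the 16 kHz sample rate, 25 ms window, 10 ms hop, and 80 mel bins are assumed values chosen only for illustration.

```python
import numpy as np


def hz_to_mel(hz):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)

    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_features(waveform, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    """Return an array of shape (n_frames, n_mels) of log-mel energies."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack(
        [waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filter_bank(n_mels, n_fft, sample_rate).T
    # The small constant keeps the log finite for silent frames or empty filters.
    return np.log(mel_energy + 1e-10)


# Example: one second of a 440 Hz tone (synthetic input, for illustration only).
tone = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
features = log_mel_features(tone)
```

The resulting matrix has one row per frame and one column per mel bin; a sequence like this is what the encoder would consume, while the decoder generates the output tokens one at a time. Real feature extractors typically add steps omitted here, such as pre-emphasis, dithering, or mean and variance normalization.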