Thank you for the kind words, really glad you enjoyed the post!
We usually compute the forward KL: KL(P‖Q), where P is the original (full-precision) model's output distribution and Q is the quantized model's.
This forward direction emphasizes mode covering: it penalizes cases where the quantized model assigns too little probability to tokens the original model thought were likely. So yes, it's relatively forgiving when Q spreads its probability mass out, but it punishes Q for "missing" key peaks of P.
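For concreteness, here's a minimal sketch of how that per-position forward KL could be computed from the two models' logits (PyTorch; the function and argument names are just illustrative, not from our actual eval code):

```python
import torch
import torch.nn.functional as F

def forward_kl(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """KL(P‖Q) per position.

    logits_p: (batch, vocab) logits from the full-precision model.
    logits_q: (batch, vocab) logits from the quantized model.
    """
    log_p = F.log_softmax(logits_p, dim=-1)  # log P, numerically stable
    log_q = F.log_softmax(logits_q, dim=-1)  # log Q
    p = log_p.exp()
    # KL(P‖Q) = sum_v P(v) * (log P(v) - log Q(v))
    return (p * (log_p - log_q)).sum(dim=-1)
```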
So to your analogy: yes, when I talk about preserving the original model's behavior, I'm leaning into this "mode-covering" perspective. The idea is that the quantized model should keep probability mass on the same likely outputs as the original. If it diverges too far, say by ignoring a high-confidence token the full-precision model preferred, that's exactly where forward KL catches it (and where flips are most likely to show up).
One caveat: reverse KL, KL(Q‖P), behaves very differently. It's mode-seeking: it penalizes the quantized model for being confident about tokens the original model wasn't, which could be useful in some contexts. But it's generally less stable in practice, because the log P term blows up wherever P assigns near-zero probability to a token Q still favors.
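Same hypothetical setup as the sketch above, just with the two roles swapped, and you can see exactly where the instability comes from:

```python
def reverse_kl(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """KL(Q‖P) per position; same shapes as forward_kl above."""
    log_p = F.log_softmax(logits_p, dim=-1)
    log_q = F.log_softmax(logits_q, dim=-1)
    q = log_q.exp()
    # KL(Q‖P) = sum_v Q(v) * (log Q(v) - log P(v)).
    # The -log P(v) term can become enormous wherever Q keeps mass on
    # a token that P has pushed toward zero probability, which is the
    # instability mentioned above.
    return (q * (log_q - log_p)).sum(dim=-1)
```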