Reasoning is difficult for models of all sizes, but larger models tend to perform better on reasoning tasks.