SimpleQA Scores Are WAY off

#3
by phil111

The whole point of SimpleQA is to test the innate broad knowledge of a model, not RAG. It's flat-out misleading to present a RAG-assisted result as your SimpleQA score.

Pretty much all models can use an external database to answer simple factual questions, which pushes every SimpleQA score into the ~80-90 range. That's not the point of the test. Without domain knowledge accurately stored in the weights, performance across broad tasks, like writing coherent, factually consistent stories, is abysmal despite a high SimpleQA score obtained with RAG.

Please remove the RAG SimpleQA scores, or post the true SimpleQA scores next to them. As I said, RAG SimpleQA scores couldn't be more meaningless, since nearly all models get roughly the same score.

Thank you so much for your interest in our model!

As you correctly pointed out, SimpleQA was indeed designed to "test the innate broad knowledge of a model." That said, we believe this does not preclude its use as a WebQA benchmark to evaluate a model's performance when using web-based tools, especially given the benchmark's high accuracy and correctness. Additionally, to the best of our knowledge, we are not the first organization to adopt SimpleQA in this way. For instance, reports such as OpenAI's agent tool blog and Perplexity's deep search blog also utilize SimpleQA in similar contexts. Therefore, we feel it is appropriate to report SimpleQA scores in conjunction with RAG, as this aligns with established practice at other organizations.

In summary, based on prior examples, we utilized SimpleQA as a WebQA benchmark to assess the Agentic QA capabilities of different models within the same web environment. We hope this clarifies our approach, and thank you once again for the thoughtful discussion!
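
For concreteness, here is a rough sketch of what such a setup looks like. This is not our actual harness: the function names, the toy "retrieval" step, and the substring grader below are all placeholders (the real SimpleQA grader is an LLM judge).

```python
# Sketch of scoring SimpleQA-style questions with and without a retrieval step.
# Everything here is illustrative: answer_closed_book / answer_with_retrieval
# stand in for real model calls, and is_correct stands in for the LLM grader.

def answer_closed_book(question: str) -> str:
    """Placeholder for querying the model with no external context (weights only)."""
    return "Frank Herbert"

def answer_with_retrieval(question: str) -> str:
    """Placeholder for a web/RAG pipeline: retrieve snippets, then answer from them."""
    snippets = ["'Dune' is a 1965 novel by Frank Herbert."]  # stand-in for a search tool
    return f"Based on the retrieved text, the answer is Frank Herbert. ({snippets[0]})"

def is_correct(predicted: str, gold: str) -> bool:
    """Toy grader; the real SimpleQA grader is an LLM judging correct/incorrect/not attempted."""
    return gold.lower() in predicted.lower()

def simpleqa_score(dataset, answer_fn) -> float:
    """Fraction of (question, gold) pairs that answer_fn gets right."""
    hits = sum(is_correct(answer_fn(q), gold) for q, gold in dataset)
    return hits / len(dataset)

if __name__ == "__main__":
    dataset = [("Who wrote the novel 'Dune'?", "Frank Herbert")]
    print("closed-book:   ", simpleqa_score(dataset, answer_closed_book))
    print("with retrieval:", simpleqa_score(dataset, answer_with_retrieval))
```

The same questions and grader are used in both runs; the only variable is whether the model may call the retrieval tool.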

Yes, but OpenAI publishes the actual SimpleQA scores of all models, including on the linked OpenAI page you provided. The SimpleQA scores in the context of RAG are just an afterthought to make a point, which, ironically, is that RAG renders the test meaningless.

For example, the effective score range between very weak and very powerful models while using RAG is only about 80-90, and there's virtually no perceivable real-world difference between the models. So SimpleQA with RAG is almost meaningless and a piss-poor test of agentic abilities, which isn't surprising, since it wasn't designed to test them.

Conversely, the true SimpleQA score (sans RAG) is a very accurate test of a model's innate broad knowledge and hallucination rate, with a large effective range of roughly 0-65 (e.g., 1B models vs. GPT-4.5), and it says a TON about the overall power and usability of a model.

In summary, the SimpleQA score using RAG says virtually nothing about the relative abilities of the models being tested. Even small 3B LLMs can use an external database to score nearly as high on SimpleQA as vastly more powerful models like GPT-4.5 or Gemini 2.5. Yet the far bigger models can do things like write eloquent, factually accurate, and nuanced stories that very few humans can match, while the 3B models all write absurdly bad stories filled with blatant contradictions, factual inaccuracies, and absurdities. So the SimpleQA scores without RAG are about 100x more meaningful, and that's not an exaggeration.

I did a little research, and the pointlessness of RAG SimpleQA scores has already been noted and addressed by the industry; it's the explicit reason OpenAI created BrowseComp, a browsing/RAG-oriented counterpart to the SimpleQA test.

Thank you for sharing such detailed feedback!

In fact, our team has recently been actively exploring other benchmarks, including BrowseComp, to enable more comprehensive and accurate evaluations of our model's Agentic QA capabilities.

Once again, thank you for sharing your valuable insights and engaging in this discussion with us.
