
OpenAI’s o3 artificial intelligence (AI) reasoning model has been hailed as the company’s most powerful model, pushing the limits of coding, math, science and visual perception. But it isn’t flawless.
Far from it.
Discrepancies in benchmark tests, along with nagging hallucinations, are poking holes in the reputation of o3, as well as that of the o4-mini model.
The o3 model debuted in December, boasting the ability to answer just over 25% of the questions on FrontierMath, a notoriously difficult math benchmark – 12 times the score of the next-best model, which managed about 2%.
“We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%,” OpenAI Chief Research Officer Mark Chen said on a livestream.
But research institute Epoch AI, which evaluates leading AI models against challenging tasks such as FrontierMath, found otherwise. Its recently released independent benchmark results gave o3 a score of about 10%, and Epoch concluded that OpenAI likely achieved its 25% figure with a version of o3 backed by more computing power than the model publicly launched this month.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” Epoch said.
ARC Prize Foundation corroborated Epoch’s findings. Having tested a prerelease version of o3, ARC Prize said the public o3 model “is a different model […] tuned for chat/product use.”
“All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize said.
Benchmark disparities are not unique to OpenAI. As AI vendors race to gain performance advantages with new models, there has been a corresponding spike in disputes over their results. Meta Platforms Inc. recently copped to promoting benchmark scores for a version of a model different from the one available to developers. Elon Musk’s xAI was also accused of publishing misleading benchmark charts for Grok 3, its latest AI model.
Meanwhile, internal tests of OpenAI’s o3 and o4-mini models revealed they tend to hallucinate more than the company’s previous reasoning models o1, o1-mini and o3-mini, as well as its non-reasoning model GPT-4o.
The o4-mini model hallucinated in response to 48% of questions on PersonQA, OpenAI’s in-house benchmark for measuring the accuracy of a model’s knowledge about people – roughly three times the rate of o1. The o3 model hallucinated 33% of the time, OpenAI found.
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” an OpenAI spokesperson told TechCrunch.