AI Benchmarking Site Maps LLMs on a Human IQ Scale

A new benchmarking project called AI IQ is attempting to solve one of enterprise AI’s most complex problems: how to compare competing LLMs without sifting through incompatible benchmark tables and vendor marketing claims.

Launched by entrepreneur Ryan Shea, the site assigns scores to dozens of AI models by translating benchmark performance into an estimated human IQ scale. It also reports an EQ (emotional intelligence) score, which attempts to measure a model’s performance differently from other ranking charts, and a measure of cost efficiency.

The sheer scope and depth of the site’s ranking process invite doubt. The site itself acknowledges the limits of applying a human IQ scale to AI models, calling IQ a metaphor rather than a clearly quantifiable metric.

“Benchmarking AI model intelligence and performance is a tricky business,” Mitch Ashley, VP, Software Lifecycle Engineering at the Futurum Group, told Techstrong.ai.

“I’m not sure what this type of IQ benchmarking tells us. If buyers come to rely upon this information, it will hold weight in the selection and procurement process. If not, and no AI benchmarking truly has to date, this is interesting from a research perspective.”

Four Categories

The methodology combines 12 established benchmarks into four categories: reasoning, programmatic, academic, and mathematical. Scores from tests including ARC-AGI, FrontierMath, SWE-Bench and Humanity’s Last Exam are converted into estimated IQ values using calibration curves designed by Shea.

The site also attempts to limit distortions caused by benchmarks that are considered easier to manipulate or overly familiar to current models. Those tests receive compressed scoring ceilings so they cannot disproportionately raise a model’s composite score.
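The pipeline described above (calibrate each benchmark score to an IQ estimate, cap the easily gamed benchmarks, then combine by category) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the benchmark names, the calibration points, the ceiling value, the hard cap standing in for the site's "compressed" ceilings, and the use of simple averages are all hypothetical, since the article does not publish Shea's actual curves or weighting.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical benchmark specs: a category, calibration points as
# (benchmark score, estimated IQ), and an optional IQ ceiling for
# benchmarks treated as saturated or easy to manipulate.
BENCHMARKS = {
    "reasoning_test": {
        "category": "reasoning",
        "curve": [(0, 70), (50, 100), (100, 145)],
        "ceiling": None,
    },
    "saturated_test": {
        "category": "academic",
        "curve": [(0, 70), (50, 100), (100, 145)],
        "ceiling": 115,  # assumed cap, not the site's real value
    },
}


def calibrate(score, curve):
    """Map a benchmark score to an IQ estimate by piecewise-linear interpolation."""
    xs = [x for x, _ in curve]
    if score <= xs[0]:
        return curve[0][1]
    if score >= xs[-1]:
        return curve[-1][1]
    i = bisect_right(xs, score)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)


def benchmark_iq(name, score):
    """Calibrate one benchmark score and apply its ceiling, if it has one."""
    spec = BENCHMARKS[name]
    iq = calibrate(score, spec["curve"])
    if spec["ceiling"] is not None:
        iq = min(iq, spec["ceiling"])
    return iq


def composite_iq(scores):
    """Average within each category, then average the category means."""
    by_category = {}
    for name, score in scores.items():
        category = BENCHMARKS[name]["category"]
        by_category.setdefault(category, []).append(benchmark_iq(name, score))
    return mean(mean(values) for values in by_category.values())


if __name__ == "__main__":
    print(composite_iq({"reasoning_test": 75, "saturated_test": 100}))
```

In this sketch, a perfect score on the capped benchmark contributes only 115 rather than 145, so it cannot pull the composite up on its own.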

According to the latest rankings, OpenAI’s GPT-5.5 currently leads the field with an estimated IQ in the mid-130s. Anthropic’s Opus 4.7 and Google’s Gemini 3.1 Pro sit close behind, reflecting how tightly clustered the top frontier models have become.

That compression at the high end may be the most important trend revealed by the charts. While model vendors continue to promote new releases as major leaps forward, the scoring suggests the capability gap between leading systems is narrowing.

Further down the rankings, Chinese-developed models including DeepSeek, Qwen and GLM appear increasingly competitive on a price-performance basis. For enterprise deployments that do not require the highest-end reasoning capabilities, these lower-cost models could become attractive alternatives.

The platform’s emotional intelligence score is based on conversational quality and roleplay performance, and Anthropic’s models rank particularly well on it. However, the methodology acknowledges potential bias because one of the emotional intelligence benchmarks relies partly on Claude models as judges. To compensate, the site applies a penalty adjustment to Anthropic systems.

For enterprise buyers, the most practical feature may be the cost comparison charts. These visualizations map estimated intelligence against effective operating expense, highlighting how cheaper models are approaching the performance of premium systems at a fraction of the price.

That dynamic reinforces a key trend already taking shape inside enterprise AI deployments: routing workloads between multiple models rather than relying on a single flagship system. Expensive frontier models may be reserved for advanced reasoning tasks, while lower-cost systems handle classification or summarization work.
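A simple way to picture this kind of routing is a rule that sends each task to the cheapest model clearing a minimum capability bar. The sketch below is only an illustration: the model names, estimated scores, prices, and per-task thresholds are all hypothetical placeholders, not figures from the AI IQ site or any vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    est_iq: float         # hypothetical composite score
    cost_per_mtok: float  # hypothetical cost per million tokens, in USD


# Placeholder catalog; none of these values are real.
MODELS = [
    Model("frontier-model", 135, 15.0),
    Model("mid-model", 120, 3.0),
    Model("budget-model", 105, 0.5),
]

# Assumed minimum score each task type needs.
MIN_IQ = {
    "advanced_reasoning": 130,
    "summarization": 110,
    "classification": 100,
}


def route(task_type, models=MODELS):
    """Return the cheapest model whose estimated score meets the task's bar."""
    needed = MIN_IQ[task_type]
    eligible = [m for m in models if m.est_iq >= needed]
    if not eligible:
        raise ValueError(f"no model meets the bar for {task_type!r}")
    return min(eligible, key=lambda m: m.cost_per_mtok)


if __name__ == "__main__":
    for task in MIN_IQ:
        print(task, "->", route(task).name)
```

A production router would likely weigh latency, context length, and per-task evaluation results as well. A single composite score is used here only because it keeps the example short.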

Whether a model’s performance can be reduced to a single number is arguable. As the site’s own research shows, various models have distinct strengths. Furthermore, small language models may be best suited for some tasks, yet they may receive a lower overall number.

Additionally, LLMs typically outperform humans in highly specialized domains while simultaneously failing at tasks that require common sense or basic visual reasoning. A composite score may not reflect these key issues.

The AI IQ site acknowledges its limitations. As noted above, its documentation describes IQ as a metaphor, saying that “The IQ scale provides an intuitive frame of reference, not a claim of equivalence.” It also warns that benchmark validity changes rapidly as models improve and training data evolves. “Benchmarks become stale,” it says, describing its methodology as “a living document.”
