AI & TECHNOLOGY

Can you outperform the world's most famous AI?

Frontier model performance on standardised tests has moved fast. Stanford HELM, OpenAI's GPT-4 technical report, and Anthropic's Claude evaluations now publish raw scores on the SAT, GRE, USMLE, LSAT and bar exam alongside benchmark suites like MMLU and GSM8K. Pick a test, enter your own score, and the calculator places you against current published model results so you can see whether you'd actually beat the model on the test you're best at.

Source: OpenAI GPT-4 technical report · Anthropic Claude benchmarks · Stanford HELM 2024

What is ChatGPT's IQ? How AI models score on standardised tests

GPT-4, the model underlying ChatGPT's most capable versions, scored approximately 155 on the Mensa Norway online IQ test, equivalent to the 99.9th percentile of the human population and well above the Mensa membership threshold of 130 (98th percentile). The result was independently replicated by several researchers in 2023-2024 and is consistent with GPT-4's performance on other cognitive benchmarks; GPT-4o, the more recent model, shows comparable or slightly higher performance. By contrast, GPT-3.5 scored approximately 83, below the population average of 100, on the same type of standardised reasoning assessment. That 72-point jump across a single model generation is one of the most dramatic capability leaps yet recorded for an AI system.
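These percentile figures follow directly from how IQ scores are normed: mean 100, standard deviation 15. Here is a minimal sketch in Python of the conversion, assuming an idealised normal distribution (real test norms deviate slightly in the extreme tails, which is why 155 is usually quoted as the 99.9th percentile):

```python
# Convert an IQ score to the share of the population scoring below it,
# assuming the standard Wechsler-style norming: mean 100, SD 15.
import math

def iq_to_percentile(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Gaussian CDF evaluated at `iq`, expressed as a percentile."""
    z = (iq - mean) / sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for label, iq in [("GPT-3.5", 83), ("population mean", 100),
                  ("Mensa threshold", 130), ("GPT-4", 155)]:
    print(f"{label:>16}: IQ {iq:3d} -> {iq_to_percentile(iq):6.2f}th percentile")
```

Running this gives roughly 12.9 for GPT-3.5, 97.7 for the Mensa threshold, and 99.99 for GPT-4, in line with the figures cited above.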

These IQ-proxy scores are not equivalent to administered IQ tests under controlled clinical conditions. They represent performance on the types of verbal reasoning, pattern recognition, and logical inference tasks that IQ tests measure, rather than a full Wechsler Adult Intelligence Scale assessment. The scores are meaningful as benchmarks — GPT-4 genuinely outperforms the vast majority of humans on the cognitive tasks these tests measure — but they do not capture the full picture of intelligence. Areas where GPT-4 underperforms human averages include novel physical reasoning, tasks requiring embodied understanding, long-horizon planning in complex real environments, and reliable factual accuracy on specific current-events questions. The IQ score is a useful shorthand for comparison but should not be mistaken for a complete cognitive profile.

Can AI actually reason, or is it pattern matching?

Whether large language models genuinely reason or merely perform sophisticated pattern matching is among the most actively debated questions in AI research. Webb, Holyoak, and Lu (2023, Nature Human Behaviour) tested GPT-4 on zero-shot analogical reasoning tasks, problems explicitly designed to require relational reasoning rather than pattern completion, and found that it outperformed college undergraduates on most of them. The researchers concluded that "emergent analogical reasoning" had appeared in large language models at GPT-4 scale. The finding is significant because analogical reasoning has long been considered a hallmark of genuine intelligence rather than mere pattern recognition.

The counterargument comes from studies showing systematic failures in exactly the cases where pattern matching would fail: GPT-4 and similar models make errors on simple mathematical reasoning that any human would find trivial, fail on problems that require tracking physical states across multiple steps, and show inconsistent performance on problems that are semantically similar but syntactically different (suggesting sensitivity to surface form rather than underlying structure). The current consensus among researchers is probably somewhere between the extremes: these models exhibit genuine emergent capabilities that exceed what could be explained by simple pattern completion, while also having systematic reasoning gaps that suggest their underlying mechanisms differ from human reasoning in important ways. For the purposes of this calculator's comparison, the IQ-proxy benchmarks reflect real cognitive performance on the tested tasks, regardless of the underlying mechanism.

How does GPT-4 score on standardised tests?

OpenAI's GPT-4 Technical Report revealed some striking benchmark results. The model passed the bar exam at approximately the 90th percentile, scored 1410/1600 on the SAT, and achieved a near-perfect 169/170 on GRE Verbal reasoning. It continues to struggle, however, with competition-level mathematical reasoning, where humans with domain expertise outperform it. The full comparison appears in the table below.

Test             GPT-4 score   Human percentile equiv.   Human average
SAT              1410/1600     ~94th                     1050
GRE Verbal       169/170       ~99th                     150
Bar Exam         ~298/400      ~90th                     pass threshold
LSAT             163/180       ~88th                     151
USMLE Step 1     ~60%          pass                      ~65% pass
AMC 10 (maths)   ~30%          below average             50%

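As a concrete illustration of how the calculator's placement can work, here is a minimal sketch in Python that compares a user's score with GPT-4's on three tests from the table. The GPT-4 scores are the table's; the human means and standard deviations are illustrative assumptions based on commonly published norms (they are not taken from the OpenAI report), and real score distributions are only approximately normal.

```python
# Place a user's score against GPT-4 on a given test, modelling the human
# score distribution as normal with roughly the published mean and SD.
import math

def percentile(score: float, mean: float, sd: float) -> float:
    """Approximate share of human test-takers scoring below `score`."""
    return 100.0 * 0.5 * (1.0 + math.erf((score - mean) / (sd * math.sqrt(2.0))))

# test: (GPT-4 score, approximate human mean, approximate human SD)
NORMS = {
    "SAT":        (1410, 1050, 210),   # total score out of 1600
    "GRE Verbal": (169, 150.6, 8.5),   # scaled score out of 170
    "LSAT":       (163, 151, 10),      # scaled score out of 180
}

def compare(test: str, your_score: float) -> str:
    gpt4_score, mean, sd = NORMS[test]
    verdict = "beat" if your_score > gpt4_score else "did not beat"
    return (f"{test}: ~{percentile(your_score, mean, sd):.0f}th percentile; "
            f"you {verdict} GPT-4 ({gpt4_score}).")

print(compare("SAT", 1450))   # a 1450 sits just above GPT-4's 1410
print(compare("LSAT", 160))   # a 160 sits just below GPT-4's 163
```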
How much these scores prove is heavily debated. LLMs excel at tests that reward pattern matching against text resembling their training data, but struggle with novel mathematical reasoning and spatial tasks. High benchmark scores partly reflect training-data contamination rather than general intelligence: many test questions appear, in some form, in the training corpus.

Humans outperform current AI on: novel physical problem-solving, emotional and social intelligence assessment, creative tasks requiring genuine originality (not pattern recombination), ethical reasoning in ambiguous real-world situations, and highly specialised domain knowledge gained through embodied experience.

ChatGPT does not have an IQ in the clinical sense, because IQ is normed against a human population. However, when GPT-4-class models are given standardised IQ-proxy tests, they consistently score in the 145-160 range on verbal reasoning and matrix pattern tasks. The most widely cited figure is 155, based on GPT-4's performance on the Mensa Norway online IQ test, placing it at the 99.9th percentile of the human distribution. GPT-3.5 scored approximately 83, below the human average of 100. These scores reflect narrow task performance, not general intelligence: GPT-4o excels at pattern matching and logical deduction but cannot navigate a new city or understand sarcasm from a friend. Source: Mao et al. 2023, Webb et al. 2023, Nature Human Behaviour.

Multiple researchers administered standardised IQ-proxy tests to GPT-4-class models. The Mensa Norway online IQ test was completed by GPT-4 with a score equivalent to IQ 155, independently replicated by several AI researchers in 2023-2024. Mao et al. (2023) systematically tested GPT-3.5 and GPT-4 on a battery of psychometric tests, finding GPT-4 scored in the 'very superior' range across multiple measures. Webb, Holyoak, and Lu (2023) in Nature Human Behaviour found that GPT-4 matched or exceeded human performance on analogical reasoning tasks central to IQ testing. These results come with the caveat that models may have been trained on data containing similar test items, potentially inflating scores.

The gap between GPT-3.5 (IQ approximately 83) and GPT-4 (IQ approximately 155) reflects a fundamental architectural and training leap. GPT-4 is a much larger model trained on more data with improved reasoning capabilities, including chain-of-thought processing that allows it to work through multi-step problems. GPT-3.5 frequently fails on problems requiring more than one logical step. The difference is analogous to the gap between recognising a word and understanding a paragraph. GPT-4's ability to hold multiple constraints in context simultaneously is vastly superior. An IQ of 83 still places GPT-3.5 above approximately 13% of the human population on these specific task types. Source: OpenAI technical reports.

On the specific cognitive tasks measured by IQ tests, yes, frontier AI models outperform the vast majority of humans. GPT-4o at IQ 155 exceeds 99.9% of the human population on pattern recognition, verbal analogies, and deductive logic. But intelligence is not a single dimension. AI cannot read a room, comfort a grieving friend, improvise a solution with limited physical tools, or understand why a joke is funny in context. It has no embodied experience, no emotional intelligence in the experiential sense, and no common sense grounded in physical reality. The question 'is AI smarter than humans?' is like asking 'is a calculator better at maths than a poet?' The answer is yes at that specific thing, and irrelevant to most of what matters.

This is a long-standing and legitimate critique. Traditional IQ tests contain cultural biases in vocabulary, assumed knowledge, and communication style that disadvantage certain populations. Raven's Progressive Matrices, which our pattern recognition questions are modelled on, were specifically designed to reduce cultural bias by using abstract visual patterns rather than language-dependent questions. The verbal analogy and deductive logic questions use culturally common English-language constructs, which may disadvantage non-native English speakers. We present results as an estimated IQ range rather than a definitive score, and the primary value is the AI comparison, not the absolute number. Source: Jensen 1998, Flynn 2007.

Your score tells you how well you performed on specific reasoning questions compared to statistical norms and two AI models. It does not tell you how intelligent you are in any comprehensive sense. Intelligence is multidimensional: this test touches verbal reasoning, numerical pattern detection, spatial reasoning, and deductive logic, but ignores emotional intelligence, creativity, practical problem-solving, and social cognition. If you scored above 110, you performed well on abstract reasoning tasks. The most useful takeaway is the comparison: seeing where you sit relative to the AI models that are increasingly part of daily life. That context is more interesting than the number itself.

Based on the trajectory from GPT-3 (IQ approximately 70) to GPT-3.5 (approximately 83) to GPT-4 (approximately 155) to GPT-4o (comparable or slightly higher), the trend is clearly upward, though the rate of improvement appears to be plateauing at the frontier. Gains are shrinking as scores approach the ceiling of current tests, and researchers are already developing harder evaluation frameworks (ARC-AGI, GPQA, FrontierMath) that current models struggle with. The relevant question is not whether AI will keep scoring higher on IQ tests, but whether it will develop the broader cognitive capabilities (common sense, physical reasoning, genuine creativity) that current models lack. Source: Anthropic, OpenAI, and Google DeepMind evaluations.

Yes, and at high levels. GPT-4 scored at approximately the 94th percentile on the SAT (1410/1600, combined reading/writing and maths), around the 88th percentile on the LSAT (163/180), and passed the bar exam at approximately the 90th percentile of actual test-takers (Katz et al., 2024). On the US Medical Licensing Examination (USMLE), GPT-4 scored at or above the passing threshold across all three steps. On the GRE, GPT-4 achieved a near-perfect verbal reasoning score (169/170, roughly the 99th percentile) and approximately the 80th percentile on quantitative reasoning. These results represent significant advances over GPT-3.5, which failed the bar exam and scored below average on most professional examinations. The implications for professional credentialing, education, and gatekeeping by examination are substantial and actively discussed by policymakers across the legal, medical, and educational sectors.

By the IQ-proxy metrics available, GPT-4's estimated score of 155 is higher than the commonly cited estimates for both Einstein (estimated 160-190 in popular sources, though he never sat a standardised IQ test) and Hawking (160, also an informal estimate). However, this comparison has limited meaning. IQ is one narrow measure of cognitive ability, and the historical figures typically cited as having very high IQs were distinguished by capabilities that IQ tests do not measure: sustained original contribution to a field over decades, the ability to conceive of genuinely novel frameworks for understanding the world, and the integration of empirical intuition with mathematical formalism. GPT-4 can outperform most humans on verbal and matrix reasoning tasks that IQ tests measure, but it does not generate genuinely original scientific theories — it synthesises, explains, and applies existing knowledge with extraordinary efficiency. Whether the capabilities that produce a high IQ-proxy score constitute "smarter" than figures whose IQ estimates are similarly high is a question of what intelligence is, not just what it measures.

Methodology

GPT-4 benchmark scores are sourced from OpenAI's GPT-4 Technical Report (2023). For the SAT, GPT-4 scored 1410/1600 (approximately the 94th percentile). For the bar exam, GPT-4 scored ~298/400 (approximately the 90th percentile among test-takers). For GRE Verbal, GPT-4 scored 169/170 (approximately the 99th percentile). Human percentiles use standard score distributions published by the respective testing bodies.


Sources: OpenAI GPT-4 Technical Report (2023), Anthropic Claude model card (2024), Stanford HELM Benchmark (2024), ETS GRE score data.

Reviewed by Find The Norm Research Team · Methodology