Column

Is 'LLMs Reach Expert Level' Actually True? Three Structural Flaws in AI Benchmark Claims

Benchmark studies claiming LLMs match or exceed human experts contain three structural flaws: training data contamination, lack of representativeness, and flawed comparison methodology. On novel tasks designed to eliminate these flaws, human experts outperformed LLMs across the board. Here is what that means for AI investment decisions.

2026 / 06 / 12 5 min read 日本語版 →

A flat illustration of a rising benchmark chart with the label 'expert level' floating above it, while a crack runs through the graph surface below

Hello. This is Keito Inoshita from Affectosphere Group.

“GPT-4 passed the medical licensing exam at a passing threshold.” “An LLM scored in the top 10% of the bar exam.” These claims have become familiar. What gets less attention is whether they mean what they appear to mean.

A study published on arXiv in June 2026 (George Perrett, Javae Elliott, Jennifer Hill, Marc Scott; arXiv:2606.11166) takes that question seriously. The researchers examined existing benchmark studies that claim LLMs match or exceed human experts, identified three structural flaws in those claims, and then ran experiments designed to eliminate the flaws. On the novel tasks they designed, human experts outperformed LLMs across the board.

For anyone making AI investment decisions based on benchmark scores, this is a study worth reading carefully.

Three takeaways for today

Benchmark studies claiming LLM parity with human experts contain three structural flaws: training data contamination, lack of representativeness, and flawed comparison methodology.
On novel tasks designed to eliminate those flaws, human experts outperformed LLMs comprehensively.
AI investment decisions require firsthand evaluation on actual business tasks — benchmark scores alone are not sufficient evidence.

① Three structural flaws in benchmark claims

The research identifies three mechanisms that cause LLM benchmarks to overestimate real-world capability.

The first is training data contamination.

LLMs are trained on large-scale web data. That data may include the benchmark questions and their answers. When a model achieves a high score on a medical licensing exam, it may be retrieving memorized content rather than demonstrating genuine reasoning. A student who has seen the answer key before the test achieves a high score — but not for the reasons we would want.

The second is lack of representativeness.

Most benchmarks are structured around specific formats — multiple choice questions at a specific difficulty level. Actual expert work is not a sequence of well-formed multiple choice problems. It involves complex situational judgment, integration of ambiguous information, and reasoning across multiple domains simultaneously. A benchmark that does not represent the full scope of the task it claims to measure produces scores that reflect only a narrow slice of capability.

The third is flawed comparison methodology.

When LLMs are compared against “human experts,” the conditions under which those experts are evaluated matter significantly. Experts assessed under test conditions that do not match how they work in practice produce comparison data that is not meaningful. Additionally, the comparison is structurally asymmetric: LLMs are often evaluated on isolated subtasks, while experts are responsible for the full scope of the work.

② On novel tasks, human experts won comprehensively

The methodological contribution of this study is its experimental design.

To eliminate training data contamination, the researchers created novel tasks. To address representativeness concerns, they designed tasks that more closely reflect actual professional work. To equalize comparison conditions, they evaluated LLMs and human experts under the same constraints.

The result: human experts outperformed LLMs comprehensively.

The implication is that “LLMs have reached expert level” claims, derived from benchmark designs containing these flaws, may significantly overstate actual capability in real-world professional settings.

This is not an argument that LLMs have no value. The domains where LLMs perform well are clearly established: structured document generation, summarization and synthesis of information, automation of pattern-clear tasks. The problem is inferring from “high score on a specific benchmark” to “applicable to the full scope of professional work.” That inference is where the overreach happens.

③ What this means for AI investment decisions

For CTOs, CIOs, and others making AI adoption decisions, the practical implications are direct.

Start by recognizing the limits of using benchmark scores as the primary evidence for adoption decisions.

When you encounter a claim like “this model matched human performance on X,” three questions help evaluate its reliability. Does that benchmark represent your actual business tasks? Could the training data have included the benchmark content? Under what conditions were the “human experts” evaluated? Working through these three questions often changes how the claim reads.

The more important step is running pilot evaluations on your actual work.

Even if an LLM is claimed to perform at “lawyer level” on legal document review, its performance on your specific contract formats, the industry-specific legal questions your team handles, and against your firm’s own quality standards — none of that is derivable from the benchmark score. It has to be measured directly. Committing significant investment to an LLM based on benchmark scores without a pilot is structurally similar to hiring someone based on a test score without evaluating their work on a representative sample of the actual job.

Concrete evaluation approaches by department:

Legal: Have the LLM work through actual contract templates your team uses, under the same conditions your lawyers face
HR: Measure screening precision against your actual hiring criteria, not generic benchmarks
R&D: Separately evaluate summarization of existing literature (a pattern-matching task) versus generation of novel hypotheses (a reasoning task)

Improving the precision of AI ROI estimates

The findings from this research translate directly into a more rigorous ROI estimation process.

A common mistake in current AI ROI models is converting “benchmark capability” directly into “operational efficiency gain.” The model achieves 95% accuracy on a summarization benchmark, therefore document preparation costs will fall 95%. That inference is rarely valid.

A more defensible framework proceeds in three steps.

First, decompose business tasks into two categories: tasks where LLMs have demonstrated genuine capability under real-world conditions (structured drafting, information synthesis, classification with clear patterns) and tasks that require expert judgment under ambiguity (complex situational analysis, novel problem-solving, domain-specific deep reasoning).

Second, for the first category, run a pilot with your actual task data under realistic conditions, and use those measured values as the basis for the ROI model.

Third, for the second category, either maintain current approaches or consider different solutions. Forcing LLMs into domains where they underperform generates quality problems that produce negative ROI.

The KPIs worth tracking are: accuracy of pre-deployment ROI estimates versus actual results (measured as deviation rate), and the proportion of projects where scope was adjusted based on pilot findings before full deployment. Improvement in the second metric is a direct proxy for reduced project failure rate from inappropriate LLM application.

Distinguishing where LLMs work from where they do not

What this research ultimately communicates is neither uncritical enthusiasm for LLMs nor dismissal of their value.

LLMs deliver real value in well-defined areas: structured information processing, document drafting, automation of tasks with clear patterns and objectively verifiable outputs. In professional domains requiring judgment under genuine novelty, integration of ambiguous information, and deep domain-specific reasoning, human experts currently hold a clear advantage.

The shift worth making is from “which is better overall?” to “for which specific tasks does this work, and for which does it not?” That question is the correct starting point for sound AI investment decisions.

Benchmark scores are useful as one data point. They are not sufficient as the primary basis for adoption decisions in professional contexts — not without pilot evaluation against the actual work.

That’s it for today!

Reference

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott (2026). Flaws in the LLM Automation Narrative. arXiv preprint arXiv:2606.11166.

* This article was written in part with AI assistance and may contain inaccuracies.