Medical LLM pattern matching is brittle
LLMs do well on medical benchmarks, but this nice experiment blows a hole in that.
Reading: Fidelity of Medical Reasoning in Large Language Models, JAMA Network, 8 August 2025.
The setup
The experiment presents an LLM with a medical scenario and prompts it to pick the correct answer from a set of 5 options. LLMs do well at this.
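A minimal sketch of what that multiple-choice evaluation looks like. The prompt format and the scoring helper are assumptions for illustration, not the paper's actual harness; a real harness would send the prompt to a model API and parse the returned letter.

```python
# Hypothetical sketch of a MedQA-style multiple-choice eval loop.
# A real harness would call a model API where `predictions` comes from.

def format_prompt(stem, options):
    """Render a question stem and lettered options as a single prompt string."""
    letters = "ABCDE"
    lines = [stem, ""]
    for letter, option in zip(letters, options):
        lines.append(f"{letter}. {option}")
    lines.append("")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)
```

For example, `score(["A", "B", "C", "D"], ["A", "B", "C", "E"])` gives 0.75.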
The twist
We sampled 100 questions from MedQA, a standard multiple-choice medical benchmark, and replaced the original correct answer choice with “None of the other answers” (NOTA).
A clinician verified that NOTA was the best of the answers available.
Surely, the argument goes, if you’re “reasoning” about the problem you’d still pick the right answer, which is now always “none of the other answers”.
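The substitution itself is simple to sketch. The dict shape below (an `options` list plus an `answer_idx` pointing at the keyed answer) is an assumption for illustration, not the paper's actual data format:

```python
# Sketch of the NOTA substitution: replace the text of the keyed correct
# option with "None of the other answers", so the correct letter is
# unchanged but its surface form no longer pattern-matches the question.
# The question dict shape here is assumed, not taken from the paper.

NOTA = "None of the other answers"

def substitute_nota(question):
    """Return a copy of a question with the correct option's text
    replaced by NOTA, leaving the original question untouched."""
    options = list(question["options"])
    options[question["answer_idx"]] = NOTA
    return {**question, "options": options}
```

A model that is actually reasoning should still land on `answer_idx`; a model that is matching answer text to question patterns now has nothing to match.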
The result
They found that accuracy drops by 9 percentage points in the best case and by 38 in the worst, depending on the model.
A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results suggest that these systems are more brittle than their benchmark scores suggest.
A commentator on the article notes that it’s not a realistic scenario, which is correct. My interest is in how you evaluate LLMs, and for that, it’s a nice illustration.
The code is on GitHub.
* * *
Update: Microsoft Research have a pre-print on The Illusion of Readiness in LLMs and medical benchmarks:
Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren’t glitches; they expose how today’s benchmarks reward test-taking tricks over medical understanding.