Despite the increasing use of artificial intelligence in healthcare by both patients and healthcare professionals, a new study from Mass General Brigham finds that publicly available generative AI models often fail to adequately navigate diagnostic scenarios.
The study, published April 13 in JAMA Network Open, evaluated 21 general-purpose large language models (LLMs) on 29 standardized clinical cases from January to December 2025. Each model received the case records in stages that “preserved clinical context and maintained continuity” throughout the clinical reasoning process.
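The article does not publish the study’s actual pipeline, but the staged protocol it describes can be sketched roughly as follows. This is an illustrative sketch only: the `client.chat` call and message format are hypothetical stand-ins for a generic chat-completion API, not the researchers’ code.

```python
# Illustrative sketch of a staged case evaluation, not the study's code.
# `client.chat` is a hypothetical chat-completion API.

def evaluate_case(client, case_stages):
    """Present a clinical case one stage at a time, preserving context."""
    messages = [{
        "role": "system",
        "content": "You are assisting with step-by-step clinical reasoning.",
    }]
    outputs = []
    for stage in case_stages:  # e.g. history, exam findings, labs, imaging
        messages.append({"role": "user", "content": stage})
        reply = client.chat(messages)  # full history maintains continuity
        messages.append({"role": "assistant", "content": reply})
        outputs.append(reply)  # one scored output per reasoning stage
    return outputs
```

Because the full message history is resent at every stage, earlier findings stay in context, matching the “preserved clinical context” setup the study describes.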
Medical student raters then scored each stage’s output against the MSD Manual. The researchers also developed a new metric, the Proportional Index of Medical Evaluation of LLM (PrIME-LLM), to measure accuracy across five clinical reasoning domains.
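The article does not give PrIME-LLM’s formula. Purely as a reading aid, a composite index over rater scores might be aggregated along the lines below; the equal weighting and the 0–1 rater scale are assumptions, not the published metric.

```python
# Hypothetical aggregation sketch; PrIME-LLM's actual definition is not
# given in the article. Assumes raters score each item between 0 and 1.

def composite_index(scores: dict[str, list[float]]) -> float:
    """Mean of per-domain mean rater scores (equal weighting assumed)."""
    domain_means = [sum(vals) / len(vals)
                    for vals in scores.values() if vals]
    return sum(domain_means) / len(domain_means)

# Example: two reasoning domains, three rated items each.
print(composite_index({
    "differential diagnosis": [0.2, 0.1, 0.3],  # mean 0.2
    "final diagnosis": [1.0, 0.9, 1.0],         # mean ~0.97
}))  # -> ~0.58
```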
Among the LLMs tested by researchers at Mass General Brigham’s MESH Incubator were GPT-5, Gemini 3.0 Flash, and Grok 4.
Although all LLMs achieved an accurate final diagnosis more than 90% of the time, the researchers found that the models “performed poorly in generating differential diagnoses and avoiding uncertainty compared to other inference stages.” All models failed to generate an appropriate differential diagnosis more than 80% of the time.
“While these models are great at assigning a final diagnosis once the data is complete, they struggle at the beginning of an open-ended case when there is less information,” lead author Alia Rao, a MESH researcher and MD student at Harvard Medical School, said in a statement.
MESH Incubator Executive Director Marc Succi, MD, is one of the study’s corresponding authors. Succi said in a statement that, despite continued improvements, off-the-shelf LLMs are “not ready for clinical-grade use without oversight.”
“Differential diagnosis is central to clinical reasoning and is the basis of the ‘art of medicine’ that currently cannot be replicated by AI,” Succi said.
The new study builds on previous research by Succi and the MESH group. Researchers evaluated the clinical capabilities of ChatGPT 3.5 in August 2023 and found that the chatbot was approximately 72% accurate in overall clinical decision making.
The researchers also reported that most models demonstrated improved accuracy when test results and images were provided alongside text, and that recently released models performed better than older ones.
Noted limitations include that web search and inference features were disabled, that prior exposure to the standardized cases cannot be completely ruled out, and that the evaluation did not incorporate model extensions.
The study highlighted that LLMs have the potential to “enhance, rather than replace, physician reasoning.”
“The consistent gap between differential and final diagnoses highlights how differently these systems process information compared to physicians,” the researchers wrote. “Clinicians retain uncertainty and iteratively refine differential diagnoses, but LLMs collapse prematurely into a single answer, and this limitation persists across generations of models.”

