Despite the increasing use of artificial intelligence (AI) in healthcare, a new study led by Mass General Brigham researchers at the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.
By asking 21 different large language models (LLMs) to play the role of doctors in a series of clinical scenarios, researchers showed that LLMs often fail to navigate the diagnostic workup and to create a testable list of potential, or “differential,” diagnoses. All tested LLMs reached the correct final diagnosis more than 90% of the time when all relevant information for a patient’s case was provided, but they consistently underperformed in the earlier, inference-driven steps of the diagnostic process, according to results published in JAMA Network Open.
“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised, clinical-grade deployment. Differential diagnosis is central to clinical reasoning and underlies the ‘art of medicine,’ which AI currently cannot replicate. The promise of AI in clinical medicine continues to be its potential to augment rather than replace physician reasoning when all relevant data are available, but this is not always the case.”
Marc Succi, MD, corresponding author and executive director of the MESH Incubator at Mass General Brigham
This new study follows up on earlier work by Succi’s MESH group, in which researchers evaluated ChatGPT-3.5’s ability to accurately diagnose a series of clinical vignettes.
In the new study, the researchers developed a more comprehensive measure of LLM performance that goes beyond diagnostic accuracy: the PrIME-LLM. It assesses a model’s ability at each stage of clinical reasoning: generating potential diagnoses, ordering appropriate tests, arriving at a final diagnosis, and managing treatment. According to the researchers, if a model performs well in one area but poorly in another, that imbalance is reflected in the PrIME-LLM score rather than being masked by averaging performance across tasks.
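The article does not publish the PrIME-LLM formula, but the idea of a composite score that exposes a weak stage instead of averaging it away can be sketched in a few lines of Python. Everything below (the stage names, example scores, and the choice of a geometric mean) is an illustrative assumption, not the study’s actual method.

```python
# Hypothetical sketch of an "imbalance-sensitive" composite score.
# The real PrIME-LLM formula is not given in this article; a geometric
# mean is used here only to show how one weak stage can drag down the
# composite instead of being hidden by an arithmetic average.
from statistics import geometric_mean

STAGES = ["differential", "testing", "final_diagnosis", "management"]

def composite_score(stage_scores: dict) -> float:
    """Combine per-stage scores (each in 0..1) into a single value that
    drops sharply when any one stage is weak."""
    return geometric_mean([stage_scores[s] for s in STAGES])

# Example: strong final diagnosis but weak differential diagnosis.
scores = {"differential": 0.40, "testing": 0.75,
          "final_diagnosis": 0.92, "management": 0.80}
print(f"arithmetic mean:      {sum(scores.values()) / len(scores):.2f}")  # ~0.72
print(f"imbalance-sensitive:  {composite_score(scores):.2f}")             # ~0.69
```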
The study compared 21 general-purpose LLMs, including the latest versions of ChatGPT, DeepSeek, Claude, Gemini, and Grok available at the time of submission. The researchers tested the models’ performance on 29 published clinical cases. To simulate how a clinical case unfolds, they fed information to the models gradually, starting with basics such as the patient’s age, gender, and symptoms before adding physical examination findings and test results. LLM performance at each stage was assessed by medical student raters, and these ratings were used to calculate each model’s overall PrIME-LLM score.
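A minimal sketch of this staged-prompting protocol follows, assuming a generic `ask_model` callable in place of any particular API; the case text, stage labels, and questions are invented for illustration and are not the study’s materials.

```python
# Illustrative staged-prompting loop: information is revealed cumulatively,
# so early stages see only the history while the final stage sees the full case.
from typing import Callable

STAGED_CASE = [
    ("history", "Patient: 54-year-old woman, 3 days of fever and productive cough."),
    ("exam",    "Physical exam: crackles at the right lung base, SpO2 93% on room air."),
    ("results", "Labs/imaging: WBC 14,000; chest X-ray shows right lower lobe consolidation."),
]

STAGE_QUESTIONS = {
    "history": "List a differential diagnosis for this presentation.",
    "exam":    "Which tests or imaging would you order next, and why?",
    "results": "Give your final diagnosis and an initial management plan.",
}

def run_case(ask_model: Callable[[str], str]) -> dict:
    """Feed case information stage by stage and collect the model's answers,
    which raters could then score to build a per-stage performance profile."""
    revealed, answers = [], {}
    for stage, info in STAGED_CASE:
        revealed.append(info)
        prompt = "\n".join(revealed) + "\n\n" + STAGE_QUESTIONS[stage]
        answers[stage] = ask_model(prompt)
    return answers

# Usage with any chat-completion wrapper, e.g.:
#   answers = run_case(lambda p: my_llm_client.complete(p))
```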
Consistent with previous studies, the researchers found that the LLMs were strong at making an accurate final diagnosis. However, all of the models failed to generate an appropriate differential diagnosis more than 80% of the time. A flawed differential has real consequences in practice, but in this study the models were still given additional information and allowed to proceed to the next stage of the clinical workup even when the differential diagnosis step failed.
“Step-by-step assessment of LLMs moves us beyond treating them as test takers and puts them in the shoes of physicians,” said lead author Arya Rao, MESH researcher and MD/PhD student at Harvard Medical School. “These models are great at making a final diagnosis once the data are complete, but they struggle at the beginning of open-ended cases, when little information is available.”
Most LLMs were more accurate when given test results and images in addition to text. Recently released models generally performed better than older ones, indicating gradual improvement in LLM capability. PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.
Succi said PrIME-LLM is a standardized method to assess the clinical capabilities of AI and can be used by AI developers and hospital leaders to benchmark new technologies as they are released.
“We want to be able to separate the hype from the reality when these tools are applied to medicine,” he said. “Our results confirm that large language models in medicine still require a human in the loop and very close oversight.”
Source:
Mass General Brigham
Journal reference:
Rao, A., et al. (2026). Performance of large language models on clinical reasoning tasks. JAMA Network Open. DOI: 10.1001/jamanetworkopen.2026.4003. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679

