New benchmarks show that even the most advanced AI models can often arrive at a definitive diagnosis, yet still fall short where clinicians most need support: weighing uncertainty, building differential diagnoses, and deciding what to test next.

Study: Performance of Large Language Models on Clinical Reasoning Tasks. Image credit: Iryna Pohrebna / Shutterstock
In a recent study published in JAMA Network Open, researchers investigated the clinical reasoning ability of large language models (LLMs).
LLMs are rapidly attracting interest in medicine, particularly as tools to support diagnostic reasoning and suggest management. Although these systems are actively marketed for clinical use, concerns about hallucinations, reliability, and safety remain. In addition, existing assessments often rely on multiple-choice questions that do not reflect the complexity of patient care, so it remains unclear whether LLMs can support end-to-end clinical reasoning.
LLM Clinical Reasoning Research Design
In this study, researchers investigated the performance of LLMs on clinical reasoning tasks. They compared 21 LLMs: OpenAI’s GPT-5, GPT-4.5, GPT-o3-Mini, GPT-4o, GPT-o1-Pro, and GPT-o1; Anthropic’s Claude 4.5 Opus, Claude 3.7 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet, and Claude 3.5 Haiku; DeepSeek’s DeepSeek-R1 and DeepSeek-V3; Google DeepMind’s Gemini 3.0 Pro, Gemini 2.5 Pro, Gemini 1.5 Pro, Gemini 3.0 Flash, Gemini 2.0 Flash, and Gemini 1.5 Flash; and xAI’s Grok 3 and Grok 4.
The team evaluated the accuracy of the LLMs on 29 standardized clinical vignettes included in the January 2025 update of the Merck Sharp & Dohme (MSD) Manual. Each vignette presents a structured case that includes physical examination findings, medical history, laboratory findings, and a review of systems. Each vignette was presented to each LLM in stages while preserving the clinical context, and each vignette was assessed three times.
The prompts were presented in a question-and-answer format. For LLMs without multimodal features, questions requiring image interpretation were excluded from scoring. Each LLM was prompted using default settings, with optional reasoning settings disabled where available, so that only the base model was evaluated. Real-time browsing, retrieval, and web search capabilities were turned off for all LLMs.
Performance was assessed across five clinical reasoning domains: diagnostic testing, differential diagnosis, final diagnosis, management, and other clinical reasoning. LLM outputs were scored against the answer key in the MSD Manual using a deterministic rubric that maps free-text output onto the multiple-choice options. An answer received full credit only if it included the correct option and excluded the incorrect ones.
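A deterministic rubric of this kind could be sketched as follows; the matching rule and the option lists here are illustrative assumptions, not the study's published scoring code.

```python
def score_answer(llm_output: str, correct_options: set[str], incorrect_options: set[str]) -> int:
    """Full credit (1) only if every correct option is mentioned and no incorrect option is."""
    text = llm_output.lower()
    mentions = lambda option: option.lower() in text
    if all(mentions(o) for o in correct_options) and not any(mentions(o) for o in incorrect_options):
        return 1
    return 0

# Hypothetical vignette item asking which test to order next.
print(score_answer(
    "I would order a D-dimer test before imaging.",
    correct_options={"D-dimer"},
    incorrect_options={"CT pulmonary angiogram"},
))  # -> 1
```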
Additionally, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score was developed to capture reasoning across the staged cases in a single interpretable metric. Performance was visualized as a radar plot, with each vertex representing accuracy in one domain. The PrIME-LLM score is calculated as the area of an LLM's polygon divided by the area of the reference polygon, which corresponds to a model scoring 100% in every domain.
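Under that area-ratio definition, the score can be computed as sketched below; the equal-angle radar layout and the example per-domain accuracies are assumptions, not values from the paper.

```python
import math

def radar_polygon_area(scores: list[float]) -> float:
    """Area of the radar polygon with vertices at the given radii on equally spaced axes."""
    n = len(scores)
    step = 2 * math.pi / n
    # Sum of triangle areas between consecutive axes: 1/2 * r_i * r_{i+1} * sin(step).
    return sum(0.5 * scores[i] * scores[(i + 1) % n] * math.sin(step) for i in range(n))

def prime_llm(scores: list[float]) -> float:
    """Polygon area divided by the area of the all-100% reference polygon."""
    reference = radar_polygon_area([1.0] * len(scores))
    return radar_polygon_area(scores) / reference

# Hypothetical per-domain accuracies: diagnostic testing, differential diagnosis,
# final diagnosis, management, other clinical reasoning.
print(round(prime_llm([0.72, 0.65, 0.95, 0.85, 0.80]), 2))  # -> 0.63
```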
PrIME-LLM Results Across Clinical Tasks
LLMs generally scored highest in the final diagnosis domain and performed relatively better in management than in diagnostic testing and differential diagnosis, but consistently showed deficiencies in those two domains. PrIME-LLM scores varied substantially between LLMs. The best-performing cluster included Claude 4.5 Opus, Grok 4, Gemini 3.0 Flash, GPT-5, Gemini 3.0 Pro, and GPT-4.5, with Grok 4 achieving the highest average PrIME-LLM score. Notably, newer releases within each LLM family generally performed better.
Although overall average accuracy ranged only from 0.81 to 0.90, average PrIME-LLM scores showed a wider separation, distinguishing high- and low-performing models. In particular, there was a large performance gap between reasoning-optimized models, such as Grok 4, GPT-5, and Claude 4.5 Opus, and non-reasoning models. The probability that a randomly selected score from a reasoning-optimized model exceeded a randomly selected score from a non-reasoning model was 0.99.
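That 0.99 figure is a probability-of-superiority statistic. A minimal sketch of how such a value is computed, using made-up scores rather than the study's data, would be:

```python
from itertools import product

def probability_of_superiority(group_a: list[float], group_b: list[float]) -> float:
    """P(random score from group_a > random score from group_b), ties counted as half."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(group_a, group_b))
    return wins / (len(group_a) * len(group_b))

# Hypothetical PrIME-LLM scores for reasoning-optimized vs. non-reasoning models.
print(probability_of_superiority([0.88, 0.91, 0.86], [0.70, 0.74, 0.68]))  # -> 1.0 here
```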
In virtually all LLMs, accuracy on final diagnosis items was significantly higher than on diagnostic testing items. Diagnostic testing items, in turn, consistently showed higher accuracy than differential diagnosis items, whereas management and other item types had intermediate accuracy. Eighteen multimodal LLMs with image interpretation capabilities were evaluated on vignettes that included electrocardiograms, computed tomography scans, and chest radiographs.
Multimodal LLM accuracy was consistent on non-image questions, whereas performance on image-based questions varied across models. GPT-4.5, GPT-o3-Mini, and Claude 3 Opus showed higher accuracy on image-based items than on text-only items, and Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash, and Grok 4 also showed notable gains. The model failure rate, defined as the proportion of questions not answered completely correctly, was lowest for final diagnosis and highest for differential diagnosis, with moderate failure rates in the remaining domains.
LLM Differential Diagnosis and Uncertainty Gap
In summary, frontier LLMs achieved high accuracy in final diagnosis but performed worse at earlier reasoning stages, such as building differential diagnoses and handling uncertainty. The PrIME-LLM score separated models better than raw accuracy, the traditional summary metric, and highlighted critical gaps hidden by conventional benchmarks.
Overall, the PrIME-LLM framework provides an independent, scalable, and reproducible benchmark to track progress and guide safe integration into medical practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised, patient-facing clinical decision-making.

