    Health Magazine

    Study finds top AI models still struggle with clinical inference

    By healthadmin, April 14, 2026


    A new benchmark shows that even the most advanced AI models can often arrive at the final diagnosis, but still fall short when they need to weigh uncertainty, build differential diagnoses, and decide what to test next.

    Study: Performance of large language models and clinical reasoning tasks. Image credit: Iryna Pohrebna / Shutterstock

    In a recent study published in JAMA Network Open, researchers investigated the clinical reasoning ability of large language models (LLMs).

    LLMs are rapidly gaining interest in medicine, particularly as tools to support diagnostic reasoning and suggest management. Although these systems are already actively marketed for clinical use, concerns about hallucinations, integrity, and safety remain. Additionally, existing assessments often rely on multiple-choice questions that do not reflect the complexity of patient care. It is unclear whether LLMs can support end-to-end clinical reasoning.

    LLM Clinical Reasoning Research Design

    In this study, researchers investigated the performance of LLMs on clinical reasoning tasks. They compared 21 LLMs: OpenAI’s GPT-5, GPT-4.5, GPT-o3-Mini, GPT-4o, GPT-o1-Pro, and GPT-o1; Anthropic’s Claude 4.5 Opus, Claude 3.7 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet, and Claude 3.5 Haiku; DeepSeek’s DeepSeek R1 and V3; Google DeepMind’s Gemini 3.0 Pro, Gemini 2.5 Pro, Gemini 1.5 Pro, Gemini 3.0 Flash, Gemini 2.0 Flash, and Gemini 1.5 Flash; and xAI’s Grok 3 and Grok 4.

    The team evaluated the accuracy of the LLMs on 29 standardized clinical vignettes included in the January 2025 update of the Merck Sharp & Dohme (MSD) Manual. Each vignette presents a structured case that includes physical examination findings, medical history, laboratory findings, and a review of systems. Each vignette was presented to each LLM in stages, preserving the clinical context, and each vignette was assessed three times.

    The prompts were presented in a question-and-answer format. For LLMs without multimodal features, questions requiring image interpretation were excluded from scoring. The LLMs were prompted with default settings, and reasoning settings were disabled where available so that only the base model was evaluated. Real-time browsing, retrieval, and web search capabilities were turned off for all LLMs.

    Performance was assessed across five clinical reasoning domains: diagnostic testing, differential diagnosis, final diagnosis, management, and other clinical reasoning. The output of each LLM was scored against the answer key in the MSD Manual, using a deterministic rubric that maps LLM output to multiple-choice options. Answers were given full credit only if they included the correct options and excluded the incorrect ones.
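
The all-or-nothing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the rubric maps free text to options, so the case-insensitive substring match in `map_to_options`, the function names, and the example options are all hypothetical. The article also does not say what happens below full credit, so this sketch simply returns 0.

```python
def map_to_options(output_text, options):
    """Return the labels of all options whose text appears in the output.

    Assumption: a case-insensitive substring match stands in for the
    study's deterministic mapping, which the article does not describe.
    """
    text = output_text.casefold()
    return {label for label, phrase in options.items() if phrase.casefold() in text}


def score_answer(output_text, options, correct_labels):
    """Give full credit (1.0) only if the selected options are exactly the
    correct ones: every correct option present, no incorrect option present.
    Anything else scores 0.0 in this sketch (an assumption)."""
    selected = map_to_options(output_text, options)
    return 1.0 if selected == set(correct_labels) else 0.0


# Hypothetical question: which tests should be ordered next?
options = {
    "A": "chest radiograph",
    "B": "electrocardiogram",
    "C": "head CT",
}
correct = {"A", "B"}

full = score_answer("I would order an electrocardiogram and a chest radiograph.", options, correct)
extra = score_answer("Order an electrocardiogram, a chest radiograph, and a head CT.", options, correct)
partial = score_answer("An electrocardiogram is the next step.", options, correct)
```

Here only the first answer earns credit: the second adds an incorrect option and the third omits a correct one.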

    Additionally, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score was developed to capture longitudinal reasoning in a single interpretable metric. Performance was visualized as a radar plot, with each vertex representing accuracy in one domain. The PrIME-LLM score is calculated as the area of the LLM's polygon divided by the area of the reference polygon, which corresponds to a model that scores 100% across all domains.
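
A minimal sketch of the area ratio described above, assuming five equally spaced radar axes and accuracies on a 0 to 1 scale. The domain order in `DOMAINS`, the function names, and the example accuracies are assumptions for illustration, not values from the study. The order matters because the polygon area depends on which domains sit next to each other on the plot.

```python
import math

# Assumed order of the axes around the radar plot (not given in the article).
DOMAINS = [
    "diagnostic testing",
    "differential diagnosis",
    "final diagnosis",
    "management",
    "other",
]


def radar_area(values):
    """Area of a radar-plot polygon with equally spaced axes.

    Each pair of neighbouring axes spans a triangle of area
    0.5 * r_i * r_next * sin(2*pi/n); the polygon is their sum.
    """
    n = len(values)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))


def prime_llm_score(accuracy_by_domain):
    """Polygon area of the model divided by the area of a perfect model."""
    values = [accuracy_by_domain[d] for d in DOMAINS]
    reference = radar_area([1.0] * len(DOMAINS))
    return radar_area(values) / reference


# Hypothetical per-domain accuracies.
example = {
    "diagnostic testing": 0.8,
    "differential diagnosis": 0.6,
    "final diagnosis": 1.0,
    "management": 0.9,
    "other": 0.7,
}
score = prime_llm_score(example)  # about 0.634
```

Because the area is built from products of neighbouring accuracies, the ratio shrinks faster than the mean: under these assumptions a model with 0.9 in every domain scores 0.81 and one with 0.8 in every domain scores 0.64.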

    PrIME-LLM results across clinical tasks

    The LLMs generally scored highest in the final diagnosis domain and did relatively better on management, but consistently showed deficits in the diagnostic testing and differential diagnosis domains. PrIME-LLM scores varied significantly between LLMs. The best-performing cluster included Claude 4.5 Opus, Grok 4, Gemini 3.0 Flash, GPT-5, Gemini 3.0 Pro, and GPT-4.5, with Grok 4 achieving the highest average PrIME-LLM score. Notably, newer releases within each LLM family generally performed better.

    Although the overall average accuracy ranged from 0.81 to 0.90, the average PrIME-LLM scores showed a wider separation, distinguishing high- and low-performing models. In particular, there was a large performance difference between reasoning-optimized models such as Grok 4, GPT-5, and Claude 4.5 Opus and non-reasoning models. The probability that a randomly selected score from a reasoning-optimized model exceeded a randomly selected score from a non-reasoning model was 0.99.
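
The 0.99 figure is a probability of superiority. One common way to estimate such a quantity is to compare every score in one group against every score in the other and count the share of pairs the first group wins. The sketch below shows that calculation; it is not necessarily the authors' exact procedure, the scores are made up, and counting ties as half a win is an assumed convention.

```python
from itertools import product


def probability_of_superiority(group_a, group_b):
    """Estimate P(random score from A > random score from B).

    Compares all (a, b) pairs; a tie counts as half a win (assumed convention).
    """
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must contain at least one score")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


# Hypothetical PrIME-LLM scores, for illustration only.
reasoning_scores = [0.78, 0.74, 0.71]
non_reasoning_scores = [0.72, 0.60, 0.55]

p = probability_of_superiority(reasoning_scores, non_reasoning_scores)
```

In this made-up example, eight of the nine pairs favour the reasoning group, so `p` is 8/9.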

    In virtually all LLMs, accuracy on final diagnosis items was significantly higher than on diagnostic testing items. Furthermore, diagnostic testing items consistently showed higher accuracy than differential diagnosis items, whereas management items and other item types had intermediate accuracy. Eighteen multimodal LLMs capable of image interpretation were evaluated on vignettes that included electrocardiograms, computed tomography scans, and chest radiographs.

    Accuracy of the multimodal LLMs was consistent on non-image questions, whereas performance on image-based questions varied across LLMs. GPT-4.5, GPT-o3-Mini, and Claude 3 Opus showed higher accuracy on image-based items than on text-only items, and Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash, and Grok 4 also showed significant improvements. Furthermore, the model failure rate, defined as the proportion of questions not answered completely correctly, was lowest for final diagnosis and highest for differential diagnosis. Failure rates in the other domains were intermediate.
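
The failure rate defined above is simple to compute once each question has a domain label and a full-credit flag. The following sketch assumes that record layout; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def failure_rate_by_domain(results):
    """Share of questions per domain that did not earn full credit.

    `results` is an iterable of (domain, full_credit) pairs, where
    full_credit is True only for a completely correct answer.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for domain, full_credit in results:
        totals[domain] += 1
        if not full_credit:
            failures[domain] += 1
    return {domain: failures[domain] / totals[domain] for domain in totals}


# Hypothetical scored questions, for illustration only.
records = [
    ("final diagnosis", True),
    ("final diagnosis", True),
    ("differential diagnosis", False),
    ("differential diagnosis", False),
    ("differential diagnosis", True),
    ("management", True),
    ("management", False),
]
rates = failure_rate_by_domain(records)
```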

    LLM Differential Diagnosis and Uncertainty Gap

    In summary, frontier LLMs achieved high accuracy on final diagnosis, but performed worse than at other reasoning stages when generating differential diagnoses and navigating uncertainty. The PrIME-LLM score separated models better than raw accuracy, the traditional summary metric, and highlighted critical gaps hidden by traditional benchmarks.

    Overall, the PrIME-LLM framework provides an independent, scalable, and reproducible benchmark to track progress and guide safe integration into medical practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised, patient-facing clinical decision-making.


