    Despite its accuracy, generative AI falls short in diagnostic reasoning

    By healthadmin, April 14, 2026

    Despite the increasing use of artificial intelligence (AI) in healthcare, a new study led by Mass General Brigham researchers at the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.

    By asking 21 different large language models (LLMs) to play the role of doctors across a series of clinical scenarios, the researchers showed that LLMs often fail to navigate the diagnostic workup and to generate an appropriate list of possible, or “differential,” diagnoses. All tested LLMs reached the correct final diagnosis more than 90% of the time when all relevant information about a patient’s case was provided, but they consistently underperformed in the early, inference-driven steps of the diagnostic process, according to results published in JAMA Network Open.

    “Despite continuous improvements, off-the-shelf large language models are not ready for unsupervised, clinical-grade deployment. Differential diagnosis is central to clinical reasoning and underpins the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine remains its potential to augment, rather than replace, physician reasoning. These models perform well when all relevant data are available, but that is not always the case.”


    Marc Succi, MD, corresponding author and executive director of the MESH Incubator at Mass General Brigham

    This new study is a follow-up to a previous study led by Succi’s MESH group, in which researchers evaluated ChatGPT 3.5’s ability to accurately diagnose a series of clinical vignettes.

    In the new study, the researchers developed a more comprehensive measure of LLM performance that goes beyond accuracy, called PrIME-LLM. It assesses a model’s ability at several stages of clinical reasoning: generating possible diagnoses, ordering appropriate tests, arriving at a final diagnosis, and managing treatment. According to the researchers, if a model performs well in one area but poorly in another, the PrIME-LLM score reflects this imbalance rather than averaging performance across tasks, which could mask areas of weakness.
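    The article does not give the formula behind the PrIME-LLM score. As a purely illustrative sketch, the Python below uses a geometric mean as a stand-in (an assumption, not the study’s actual method) to show why an imbalance-sensitive aggregate can rank a lopsided model below a balanced one even when their simple averages are identical. The per-stage scores and variable names are hypothetical.

```python
from math import prod

# Stage names follow the article; the scores below are invented for illustration
# and are NOT results from the study.
STAGES = ["differential diagnosis", "diagnostic testing", "final diagnosis", "management"]

balanced_model = [0.75, 0.75, 0.75, 0.75]
lopsided_model = [0.95, 0.95, 0.95, 0.15]  # strong at most stages, weak at one


def arithmetic_mean(scores):
    """Plain average: one weak stage can be hidden by strong ones."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """Stand-in imbalance-sensitive aggregate (assumption, not PrIME-LLM itself):
    a low score at any single stage pulls the result down sharply."""
    return prod(scores) ** (1 / len(scores))


for name, scores in [("balanced", balanced_model), ("lopsided", lopsided_model)]:
    weakest = STAGES[scores.index(min(scores))]
    print(
        f"{name}: mean={arithmetic_mean(scores):.2f}, "
        f"geometric={geometric_mean(scores):.2f}, weakest stage={weakest}"
    )
```

    Both hypothetical models average 0.75, but the geometric mean drops to roughly 0.60 for the lopsided one, which is the kind of weakness a plain average would hide.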

    The study compared 21 general-purpose LLMs, including the latest ChatGPT, DeepSeek, Claude, Gemini, and Grok models available at the time of submission. The researchers tested the models on 29 published clinical cases. To simulate how a clinical case unfolds, they fed information to each model gradually, starting with basic details such as the patient’s age, gender, and symptoms before adding physical examination findings and test results. Medical student raters assessed LLM performance at each stage, and these ratings were used to calculate each model’s overall PrIME-LLM score.

    Consistent with previous studies, the researchers found that LLMs were best at making an accurate final diagnosis. However, all models failed to generate an appropriate differential diagnosis more than 80% of the time. Although differential diagnosis matters in real-world practice, in this study the models were given additional information and could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.

    “Step-by-step assessment of LLMs moves us beyond treating them as test takers and puts them in the shoes of physicians,” said lead author Arya Rao, a MESH researcher and MD/PhD student at Harvard Medical School. “These models are great at making a final diagnosis once the data are complete, but they struggle at the beginning of open-ended cases, when little information is available.”

    Most LLMs became more accurate when test results and images were provided in addition to text. More recently released models generally performed better than older ones, indicating gradual improvement in LLMs. PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.

    Succi said PrIME-LLM is a standardized method to assess the clinical capabilities of AI and can be used by AI developers and hospital leaders to benchmark new technologies as they are released.

    “We want to be able to separate the hype from the reality when these tools are applied to medicine,” he said. “Our results confirm that large language models in medicine continue to require a ‘human in the loop’ and very close oversight.”

    Journal reference:

    Rao, A., et al. (2026). Performance of large language models on clinical reasoning tasks. JAMA Network Open. DOI: 10.1001/jamanetworkopen.2026.4003. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679