    Health Magazine

    Study finds top AI models still struggle with clinical inference

    By healthadmin, April 14, 2026

    A new benchmark shows that even the most advanced AI models can often reach the correct final diagnosis, yet still fall short when they must weigh uncertainty, build a differential diagnosis, and decide what to test next.

    Study: Performance of large language models and clinical reasoning tasks. Image credit: Iryna Pohrebna / Shutterstock

    In a recent study published in JAMA Network Open, researchers investigated the clinical reasoning ability of large language models (LLMs).

    LLMs are rapidly gaining interest in medicine, particularly as tools to support diagnostic reasoning and suggest management. Although these systems are already actively marketed for clinical use, concerns about hallucinations, integrity, and safety remain. Additionally, existing assessments often rely on multiple-choice questions that do not reflect the complexity of patient care. It remains unclear whether LLMs can support end-to-end clinical reasoning.

    LLM clinical reasoning study design

    In this study, researchers investigated the performance of LLMs on clinical reasoning tasks. They compared 21 LLMs: OpenAI’s GPT-5, GPT-4.5, GPT-o3-Mini, GPT-4o, GPT-o1-Pro, and GPT-o1; Anthropic’s Claude 4.5 Opus, Claude 3.7 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet, and Claude 3.5 Haiku; DeepSeek’s DeepSeek R1 and DeepSeek V3; Google DeepMind’s Gemini 3.0 Pro, Gemini 2.5 Pro, Gemini 1.5 Pro, Gemini 3.0 Flash, Gemini 2.0 Flash, and Gemini 1.5 Flash; and xAI’s Grok 3 and Grok 4.

    The team evaluated the accuracy of the LLMs on 29 standardized clinical vignettes included in the January 2025 update of the Merck Sharp & Dohme (MSD) Manual. Each vignette presents a structured case that includes physical examination findings, medical history, laboratory findings, and a review of systems. Each case was presented to each LLM in stages, preserving the clinical context, and each case was assessed three times.

    The prompts were presented in a question-and-answer format. For LLMs without multimodal features, questions requiring image interpretation were excluded from scoring. The LLMs were prompted with their default settings; reasoning settings were disabled where available, so that only the base model was evaluated. Real-time browsing, retrieval, and web search capabilities were turned off for all LLMs.

    Performance was assessed across five clinical reasoning domains: diagnostic testing, differential diagnosis, final diagnosis, management, and other clinical reasoning. LLM outputs were scored against the answer key in the MSD Manual, using a deterministic rubric that maps each output to multiple-choice options. An answer received full credit only if it included the correct options and excluded the incorrect ones.
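
    The all-or-nothing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's rubric: the function names, the question IDs, and the option letters are hypothetical, and the sketch assumes the LLM output has already been mapped to a set of option labels.

```python
def full_credit(selected, correct):
    """Return True only if every correct option was chosen and no incorrect one.

    Both arguments are iterables of option labels (hypothetical format).
    """
    return set(selected) == set(correct)


def accuracy(responses, answer_key):
    """Fraction of questions in answer_key that earned full credit.

    A question missing from `responses` counts as unanswered (no credit).
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    scored = [
        full_credit(responses.get(question, ()), correct)
        for question, correct in answer_key.items()
    ]
    return sum(scored) / len(scored)


# Hypothetical answer key and model responses, for illustration only.
example_key = {"q1": {"B"}, "q2": {"A", "D"}, "q3": {"C"}}
example_responses = {"q1": {"B"}, "q2": {"A"}, "q3": {"C", "E"}}
example_accuracy = accuracy(example_responses, example_key)
```

    In this made-up example only `q1` earns credit: `q2` misses a correct option and `q3` adds an incorrect one, so both count as failures under a strict full-credit rule.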

    Additionally, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score was developed to capture longitudinal reasoning in a single interpretable metric. Performance was visualized as a radar plot, with each vertex representing accuracy in one domain. The PrIME-LLM score is calculated as the area of the LLM's polygon divided by the area of the reference polygon, which corresponds to a model that scores 100% in every domain.
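
    The ratio of polygon areas can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' code: it assumes the five domains sit on equally spaced radar axes in the order listed, and the function names and example accuracies are hypothetical. Because the area depends on which domains are adjacent, a different axis order could give a slightly different score.

```python
import math

# Assumed axis order (hypothetical); the article does not state the order used.
DOMAINS = [
    "diagnostic testing",
    "differential diagnosis",
    "final diagnosis",
    "management",
    "other clinical reasoning",
]


def radar_area(radii):
    """Area of a radar-plot polygon whose axes are equally spaced in angle.

    Each pair of neighbouring axes forms a triangle with the origin whose
    area is 0.5 * r_i * r_next * sin(2*pi/n); the polygon is their sum.
    """
    n = len(radii)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))


def prime_llm_score(domain_accuracy):
    """Model polygon area divided by the area of the all-100% reference polygon."""
    radii = [domain_accuracy[d] for d in DOMAINS]
    reference = [1.0] * len(DOMAINS)
    return radar_area(radii) / radar_area(reference)


# Hypothetical per-domain accuracies, for illustration only.
example = dict(zip(DOMAINS, [0.80, 0.60, 0.95, 0.85, 0.85]))
example_score = prime_llm_score(example)
```

    Under these assumptions a model with the same accuracy `a` in every domain scores `a * a`, so a uniform accuracy of 0.9 yields 0.81. An area-based score therefore penalizes weak domains more heavily than a simple average does.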

    PrIME-LLM results across clinical tasks

    The LLMs generally scored highest in the final diagnosis domain and performed better in management than in diagnostic testing and differential diagnosis, the two domains in which they consistently showed deficiencies. PrIME-LLM scores varied significantly between LLMs. The best-performing cluster included Claude 4.5 Opus, Grok 4, Gemini 3.0 Flash, GPT-5, Gemini 3.0 Pro, and GPT-4.5, with Grok 4 achieving the highest average PrIME-LLM score. Notably, newer releases within an LLM family generally performed better.

    Although overall average accuracy ranged only from 0.81 to 0.90, the average PrIME-LLM scores showed a wider separation, distinguishing high- and low-performing models. In particular, there was a large performance difference between reasoning-optimized models, such as Grok 4, GPT-5, and Claude 4.5 Opus, and non-reasoning models. The probability that a randomly selected score from a reasoning-optimized model exceeded a randomly selected score from a non-reasoning model was 0.99.
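
    A probability of this kind is often estimated by comparing every score in one group with every score in the other. The sketch below shows that standard pairwise estimator; the article does not say which estimator the authors used, so the function name, the choice to count ties as one half, and the example scores are all assumptions made for illustration.

```python
from itertools import product


def probability_of_superiority(group_a, group_b):
    """Estimate P(random score from A > random score from B).

    Compares all cross-group pairs; a tie counts as half a win (assumption).
    """
    if not group_a or not group_b:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for a, b in product(group_a, group_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(group_a) * len(group_b))


# Hypothetical PrIME-LLM scores, NOT values from the study.
reasoning_scores = [0.78, 0.74, 0.71]
non_reasoning_scores = [0.62, 0.58, 0.72, 0.55]
example_probability = probability_of_superiority(
    reasoning_scores, non_reasoning_scores
)
```

    A value near 1 means almost every score in the first group beats almost every score in the second, while 0.5 means the two groups are indistinguishable on this measure.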

    In virtually all LLMs, accuracy on final diagnosis items was significantly higher than on diagnostic testing items. Furthermore, diagnostic testing items consistently showed higher accuracy than differential diagnosis items, whereas management items and other item types had intermediate accuracy. Eighteen multimodal LLMs capable of image interpretation were evaluated on vignettes that included electrocardiograms, computed tomography scans, and chest radiographs.

    Multimodal LLM accuracy was consistent across non-image questions, whereas performance on image-based questions varied across LLMs. GPT-4.5, GPT-o3-Mini, and Claude 3 Opus showed higher accuracy on image-based items than on text-only items, and Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.0 Flash, and Grok 4 also showed significant improvements. Furthermore, the model failure rate, defined as the proportion of questions not answered completely correctly, was lowest for final diagnosis and highest for differential diagnosis. Failure rates in the other domains were moderate.

    LLM differential diagnosis and uncertainty gap

    In summary, frontier LLMs achieved high accuracy on final diagnosis but performed worse, relative to other reasoning stages, at generating differential diagnoses and handling uncertainty. The PrIME-LLM score separated models better than raw accuracy, the traditional summary metric, and highlighted critical gaps that traditional benchmarks hide.

    Overall, the PrIME-LLM framework provides an independent, scalable, and reproducible benchmark to track progress and guide safe integration into medical practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised, patient-facing clinical decision-making.


