Research reveals limitations of large language models in medical diagnosis

Artificial intelligence (AI) is rapidly transforming healthcare. AI systems can now detect diabetic eye disease from retinal photographs and analyze CT images for signs of early-stage lung cancer or stroke.

Today, in hospitals across the country and around the world, specialized algorithms quietly assist doctors by prioritizing urgent scans and flagging subtle abnormalities that might otherwise go unnoticed. These specialized AI tools are often trained on millions of accurately labeled medical images and are increasingly being integrated into real-world clinical settings.

At the same time, another form of AI, large language models (LLMs), is gaining public attention. Widely accessible systems such as ChatGPT and Claude can analyze both text and images. In theory, these capabilities should suit medical tasks, but can general-purpose AI platforms be trusted with medical diagnosis?

A new study led by Milan Toma, Ph.D., associate professor at the New York Institute of Technology College of Osteopathic Medicine (NYITCOM), suggests not. As described in the journal Algorithms, Toma and co-authors, including NYITCOM senior development security operations engineer Mihir Matalia and medical student Sungjoon Hon, tested the reliability of the world’s most advanced multimodal LLMs: GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended.

The researchers provided each AI model with the same CT brain scan, which showed obvious intracranial pathology. The models were then asked to analyze the images as a radiologist would: to identify the imaging technique used, the location of the lesion in the brain, the primary diagnosis, key features, and potential alternative diagnoses. Overall, the findings reveal a fundamental diagnostic error rate of 20% (one of the five models) and concerning variability in interpretation and assessment.

Initially, the models yielded promising results, with all five correctly identifying the images as CT brain scans. Four of the models also detected the key finding: an ischemic stroke near the left middle cerebral artery. However, one model made the fundamental mistake of misclassifying the stroke as a hemorrhage on the opposite side of the brain. In clinical practice, this error could seriously affect a patient's health, because ischemic and hemorrhagic strokes require different treatments.

Even among the four AI models that reached the correct diagnosis, the explanations differed widely. Some offered different interpretations of when the stroke first occurred. Others disagreed on the alternative diagnoses, on which additional brain areas were affected, or on the presence of calcifications.

Next, the researchers introduced a twist: they asked each AI model to score the diagnostic descriptions produced by the other models. This cross-evaluation revealed further discrepancies, with some models grading more harshly than others. One model even concluded that the finding indicated a chronic brain abnormality rather than an acute stroke, and therefore systematically deducted points from the other models’ responses.

In recent years, Toma has published more than 30 peer-reviewed studies on AI in medical diagnostics and healthcare and two books on the subject.

“Our research highlights an important distinction in the AI landscape. Most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated for very specific diagnostic tasks. Large language models, however, are not optimized for diagnostics; they are built for language and conversation. As a result, they can produce explanations that sound authoritative even when the underlying interpretation is wrong or contradictory.”

Dr. Milan Toma, Associate Professor, New York Institute of Technology College of Osteopathic Medicine (NYITCOM)

Toma and his co-authors conclude that the future of healthcare AI is likely to combine specialized diagnostic systems with language models. However, while LLMs may be useful for clinical documentation, summarizing reports, or communicating with patients, oversight by a medical professional remains non-negotiable for all diagnostic interpretations.

Source:

New York Institute of Technology

Journal reference:

Hon, S., et al. (2026). Chat is not diagnosis: Diagnostic variability and fundamental errors in multimodal LLM interpretation in radiology. Algorithms. DOI: 10.3390/a19030170. https://www.mdpi.com/1999-4893/19/3/170
