Artificial intelligence (AI) is rapidly transforming healthcare. AI systems can now detect diabetic eye disease from retinal photographs and analyze CT images for signs of early-stage lung cancer or stroke.
Now, in hospitals across the country and around the world, specialized algorithms are quietly assisting doctors, prioritizing urgent scans and flagging subtle abnormalities that might otherwise go unnoticed. These specialized AI tools are typically trained on millions of accurately labeled medical images and are increasingly being integrated into real-world clinical settings.
At the same time, another form of AI, large language models (LLMs), is gaining public attention. Widely accessible systems such as ChatGPT and Claude can analyze both text and images. In theory, these capabilities should make them suitable for medical tasks, but can general-purpose AI platforms be trusted when it comes to medical diagnosis?
A new study led by Milan Toma, Ph.D., associate professor at the New York Institute of Technology College of Osteopathic Medicine (NYITCOM), suggests otherwise. Published in the journal Algorithms, the study by Toma and co-authors, including NYITCOM senior development security operations engineer Mihir Matalia and medical student Sungjoon Hon, tested the reliability of some of the world's most advanced multimodal LLMs: GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended.
The researchers provided each AI model with the same CT brain scan, which showed obvious intracranial pathology. The models were then asked to analyze the images as a radiologist would: identifying the imaging technique used, the location of the lesion in the brain, the primary diagnosis, key imaging features, and potential alternative diagnoses. Overall, the findings reveal a fundamental diagnostic error rate of 20% across the models and concerning variability in their interpretations and assessments.
Initially, the models yielded promising results, with all five correctly identifying the images as CT brain scans. Four of the models also detected the key finding: an ischemic stroke in the territory of the left middle cerebral artery. The remaining model, however, made the fundamental error of misclassifying the stroke as a hemorrhage on the opposite side of the brain. In actual clinical practice, such an error could seriously harm a patient, because ischemic and hemorrhagic strokes require different treatments.
Even among the four AI models that reached the correct diagnosis, the explanations differed considerably. Some offered conflicting interpretations of when the stroke had occurred. Others disagreed on alternative diagnoses, on which additional brain regions were affected, or on the presence of calcifications. Next, the researchers added a twist: they asked each AI model to score the diagnostic descriptions produced by the other models. This cross-evaluation revealed further discrepancies, with some models grading far more harshly than others. One model even interpreted the findings as a chronic brain abnormality rather than an acute stroke and, on that basis, systematically deducted points from the other models' responses.
In recent years, Toma has published more than 30 peer-reviewed studies on AI in medical diagnostics and healthcare, as well as two books on the subject.
"Our research highlights an important distinction within the AI landscape. The most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated against very specific diagnostic tasks. Large language models, by contrast, are not optimized for diagnostics; they are built for language and conversation. As a result, they produce explanations that sound authoritative even when their underlying interpretations are wrong or contradictory."
Milan Toma, Ph.D., Associate Professor, New York Institute of Technology College of Osteopathic Medicine (NYITCOM)
Toma and his co-authors conclude that the future of healthcare AI will likely combine both specialized diagnostic systems and language models. However, while LLMs may be useful for clinical documentation, summarizing reports, or communicating with patients, oversight by a medical professional remains non-negotiable for any diagnostic interpretation.
Source:
New York Institute of Technology
Journal reference:
Hon, S., et al. (2026). Chat is not diagnosis: Diagnostic variability and fundamental errors in multimodal LLM interpretation in radiology. Algorithms. DOI: 10.3390/a19030170. https://www.mdpi.com/1999-4893/19/3/170

