A new large-scale multimodal AI system trained on tens of millions of medical images could unify fragmented radiology tools and help doctors interpret scans and generate reports more efficiently.

Research: MedVersa: A generalist-based model for diverse medical image processing tasks.
In a recent study published in the journal NEJM AI, researchers introduced “MedVersa,” a generalist artificial intelligence (AI) model that can capture and interpret a wide range of medical imaging modalities and task types. Unlike traditional AI models trained for specific, narrow tasks, MedVersa was trained on tens of millions of medical image instances to detect medical conditions and generate reports within a unified analytical framework.
Encouragingly, when comparing MedVersa’s performance to that of human radiologists in blinded evaluations of chest radiograph reports, the model often produced reports that were judged to be clinically equivalent to human-written reports, particularly for scans containing normal findings, and it significantly reduced the time human radiologists spent documenting findings. Taken together, these results suggest that MedVersa is a promising step toward a new generation of integrated multimodal foundation models that may help unify the currently fragmented ecosystem of AI tools used in clinical practice.
Background: Fragmentation of medical artificial intelligence tools
With recent advances in computational power and AI model design, many AI tools have been approved for use in the medical field, but their use is often fragmented. A model trained on an X-ray dataset can accurately detect pneumonia in a patient’s chest radiograph, but cannot use MRI or ultrasound data for overall patient assessment.
These “expert” models often struggle to adapt to complex clinical workflows in which multiple data types are involved in diagnosing a patient. Computational biologists have attempted to address this limitation by introducing the concept of Generalist Medical Artificial Intelligence (GMAI).
Their goal was to create a “foundation model” (similar to the models underlying ChatGPT, Google Gemini, and other large language models (LLMs)) that can handle multimodal input and output. Unfortunately, previous attempts to realize this concept have focused primarily on text-based input and have proven unable to handle the complex visual tasks essential to radiology.
MedVersa Multimodal AI Model Development
This study aimed to address this capability gap by designing a radiology-focused generalist AI model, “MedVersa,” that can capture, annotate, diagnose, report, and document multimodal clinical image data. The model was trained on MedInterp, a large dataset that aggregates 91 public datasets and comprises more than 29 million medical image instances, including images, bounding-box annotations, segmentation masks, captions, and other vision-language supervision signals spanning a variety of image-processing tasks.
The model features a unique architecture that uses a trained LLM as an “orchestrator” to evaluate the user’s requirements (e.g., “Where is the patient’s tumor?”) and dynamically select the appropriate internal vision module within the MedVersa framework to execute the request. Unlike previous GMAIs, which were primarily text-based, MedVersa is designed to generate text responses or deploy specialized “vision modules” for object detection or segmentation.
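The orchestrator pattern described above can be sketched in a few lines. This is an illustrative stand-in only: the module names and the keyword routing below are hypothetical, whereas in MedVersa the routing decision is made by a trained LLM rather than a lookup table.

```python
# Hypothetical sketch of an orchestrator dispatching to vision modules.
# Module names and keyword routing are illustrative assumptions, not
# MedVersa's actual (learned) orchestration logic.

def detect_module(image, query):
    """Stand-in for an object-detection module (e.g., tumor localization)."""
    return {"task": "detection", "boxes": []}

def segment_module(image, query):
    """Stand-in for a segmentation module producing a pixel mask."""
    return {"task": "segmentation", "mask": None}

def report_module(image, query):
    """Stand-in for free-text report generation."""
    return {"task": "report", "text": "No acute findings."}

# A keyword table stands in here for the LLM's learned routing policy.
ROUTES = {
    "where": detect_module,    # "Where is the patient's tumor?"
    "outline": segment_module, # "Outline the left ventricle."
}

def orchestrate(image, query):
    """Inspect the user's request and dispatch to the matching module."""
    for keyword, module in ROUTES.items():
        if keyword in query.lower():
            return module(image, query)
    # Default: respond with generated report text.
    return report_module(image, query)
```

For example, a query such as “Where is the patient’s tumor?” would be dispatched to the detection module, while a plain request for findings falls through to report generation.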
As a result, MedVersa can simultaneously process diverse inputs such as 2D X-rays, 3D CT and MRI scans, and patient history text. Following model training, MedVersa’s performance was validated across nine imaging tasks against two types of traditional competitors: (1) approved specialist AI models, and (2) board-certified radiologists (n = 10).
Evaluation framework and comparative testing
The performance evaluation required an expert (an AI model or a human radiologist) to review reports generated by humans, GPT-4o, and MedVersa for chest X-ray examinations. Importantly, the experts were blinded to the source of each report. Performance was scored on clinical accuracy and on assessment efficiency (the time taken to complete the assessment and generate the report).
Findings: Performance across imaging tasks
Findings show that MedVersa’s GMAI architecture competes with, and often outperforms, traditional “gold standard” expert models on many object detection and segmentation metrics.
In report generation, MedVersa achieved a BLEU-4 score (which measures text similarity; higher is better) of 17.8, compared to MAIRA’s 14.2, BiomedGPT’s 12.0, and Med-PaLM M’s 11.5. On the RadCliQ metric (which measures deviation from human clinical reports; lower is better), MedVersa scored 2.71, compared to MAIRA’s 3.10 and BiomedGPT’s 3.25. Med-PaLM M reported a slightly better RadCliQ score (2.67), which was statistically indistinguishable from MedVersa’s.
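To make the BLEU-4 metric concrete, here is a minimal, self-contained sketch of sentence-level BLEU-4. It uses simple add-one smoothing on the n-gram precisions, so it is a didactic approximation and will not reproduce the study’s published scores, which use the standard corpus-level formulation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4: geometric mean of modified 1- to 4-gram
    precisions, scaled by a brevity penalty. Add-one smoothing keeps the
    geometric mean defined when an n-gram order has no matches."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

An identical candidate and reference score 1.0, while unrelated sentences score near zero, which is why higher BLEU-4 indicates closer agreement with the reference report.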
Comparison with human radiologist reports
When compared to human experts, researchers found that MedVersa reports were clinically equivalent to human-written reports in 64% of cases. For scans with normal findings, this equivalence rose to 91%. However, for scans with abnormal findings and more complex pathology, equivalence was much lower, and the reviewing radiologists often preferred the human-written reports.
Researchers also demonstrated that physicians using MedVersa as an assistant could complete report-writing workflows more quickly. This reduced report-creation time and, importantly, produced fewer “urgent” discrepancies (errors requiring immediate attention) than reports produced with GPT-4o (a 20% reduction in the 5–10 minute reporting interval).
Conclusion: Towards an integrated clinical AI assistant
This study shows that MedVersa is an important step toward integrated clinical AI assistants, rather than a continued reliance on traditional, fragmented AI tools. Its architecture, which uses an LLM to coordinate specialized vision tools, allowed the new model to match or exceed specialized AI models across several tasks, while significantly streamlining and accelerating the workflow of expert human radiologists.
However, the study also found that while MedVersa performed well on routine cases, board-certified radiologists were still preferred for unusual cases with complex pathology, highlighting the continued importance of expert supervision. The authors further note that broader generalizability across imaging modalities remains an ongoing challenge, as some of the non-thoracic datasets included in the study were dominated by segmentation tasks rather than full diagnostic interpretation.
Therefore, while this study validates that MedVersa is a strong proof of concept, future GMAI models will need to be trained on expanded datasets that include more modalities (e.g., genetic information and electronic health records (EHRs)) to fully realize the potential of AI-assisted, human expert-mediated patient care.
Journal reference:
- Zhou, H.-Y., Acosta, J. N., Adithan, S., Datta, S., Topol, E. J., & Rajpurkar, P. (2026). MedVersa: A generalist-based model for diverse medical image processing tasks. NEJM AI. DOI: 10.1056/AIoa2500595. https://ai.nejm.org/doi/full/10.1056/AIoa2500595

