A state-aware medical AI system interpreted images, electrocardiograms, and clinical documents during live diagnostic chats and outperformed primary care physicians in simulated consultations, while raising urgent questions about how such tools should be tested before they are used in actual care.

Research: Advances in conversational diagnostic AI through multimodal reasoning. Image credit: Explode / Shutterstock
In a recent study published in the journal Nature Medicine, researchers describe the development of a multimodal extension of the Articulate Medical Intelligence Explorer (AMIE). The model uses a state-aware reasoning framework to manage clinical conversations and interpret visual artifacts. The researchers then conducted a randomized, blinded exploratory study that included 105 multimodal clinical scenarios, simulating 210 telemedicine consultations, and compared AMIE’s performance to that of 19 board-certified primary care physicians (PCPs).
According to the study results, the new multimodal model outperformed the PCPs on 29 out of 32 evaluation axes, including consultation quality indicators such as diagnostic accuracy and empathy. These findings suggest that multimodal AI may ultimately be able to support telehealth delivery, pending real-world validation.
Background on multimodal clinical AI
Morbidity risks associated with delayed access to care are increasingly documented in global health care delivery, and experts attribute this pattern to growing pressures from clinician burnout, healthcare fragmentation, and an aging global population. Generative AI shows promise in mitigating these challenges, but early large language model (LLM) implementations in healthcare were primarily limited to text-only chatbots.
Field reviews highlight that this “text-only” constraint departs from standard clinical practice, where much diagnostic information is derived from history taking and physical examination, often supplemented by visual data.
These limitations are particularly evident in telemedicine settings, where patients reportedly often share multimodal information with clinicians, such as skin photos taken with smartphones, electrocardiogram (ECG) traces, or scanned test reports.
AMIE multimodal reasoning study design
This study aimed to address this persistent medical AI limitation by developing a multimodal system that can emulate the structured reasoning of experienced clinicians by strategically requesting and interpreting these visual artifacts during live diagnostic consultations.
The system, named “AMIE,” is built on the Gemini 2.0 Flash foundation model and powered by a new “state-aware” inference-time reasoning framework. AMIE’s custom architecture is designed to allow the model to maintain an internal “patient state” that tracks each patient’s chief complaint, history of present illness, and prioritized knowledge gaps.
During a consultation, this framework directs the diagnostic dialogue through three sequential phases:
History taking. The system iteratively updates the patient state and identifies gaps in information. The model also decides whether and when to request multimodal artifacts to improve its understanding of the patient’s medical history.
Diagnosis and management. The system then generates a differential diagnosis (DDx) and provides the patient with an explanation and management guidance for the most relevant identified conditions.
Follow-up. The system addresses and clarifies patient concerns and communicates the final management plan, ensuring clarity for the patient or caregiver.
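The three phases above can be pictured as a small state machine driven by the tracked patient state. The sketch below is purely illustrative and is not AMIE's implementation: the names `Phase`, `PatientState`, and `next_action`, the `artifact:` prefix convention, and the transition rules are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY = "history taking"
    DIAGNOSIS = "diagnosis and management"
    FOLLOW_UP = "follow-up"


@dataclass
class PatientState:
    """Hypothetical container for what the dialogue has established so far."""
    chief_complaint: str = ""
    history: list = field(default_factory=list)
    # Open questions, highest priority first. Entries starting with
    # "artifact:" stand for a photo, ECG, or document still to be requested.
    knowledge_gaps: list = field(default_factory=list)
    differential: list = field(default_factory=list)


def next_action(phase, state):
    """Return (next_phase, action) under toy transition rules (assumed)."""
    if phase is Phase.HISTORY:
        if state.knowledge_gaps:
            gap = state.knowledge_gaps[0]
            if gap.startswith("artifact:"):
                return phase, "request " + gap[len("artifact:"):]
            return phase, "ask about " + gap
        # No open gaps remain, so move on to the diagnosis phase.
        return Phase.DIAGNOSIS, "present differential diagnosis and plan"
    if phase is Phase.DIAGNOSIS:
        return Phase.FOLLOW_UP, "answer remaining patient questions"
    return Phase.FOLLOW_UP, "confirm final management plan"


# Example turn: the top-priority gap is a plain question, so the
# controller stays in history taking and asks about it.
state = PatientState(
    chief_complaint="itchy rash",
    knowledge_gaps=["duration", "artifact:skin photo"],
)
phase, action = next_action(Phase.HISTORY, state)
print(phase.value, "->", action)
```

In this toy version the caller removes a gap once it has been answered. A real system would presumably update the state from the model's reading of each patient reply rather than from fixed rules.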
Model performance was evaluated using an Objective Structured Clinical Examination (OSCE) format adapted for synchronous chat, in which AMIE was compared against 19 primary care physicians (PCPs). The patient cohort consisted of 25 validated patient actors who participated in a total of 210 consultations, two for each of the 105 scenarios.
Test scenarios were based on real-world datasets, including the Skin Condition Image Network (SCIN) for dermatology, PTB-XL for ECG tracings, and selected clinical documents.
Performance was assessed by 18 specialist physicians using the Multimodal Understanding and Handling (MUH) rubric, the Practical Assessment of Clinical Examination Skills (PACES), and the General Medical Council Patient Questionnaire (GMCPQ).
Diagnostic accuracy and examination findings
OSCE evaluation data showed that multimodal AMIE had significant performance advantages over the PCPs on 29 of the 32 metrics evaluated, spanning both objective accuracy and subjective quality measures.
When assessing diagnostic accuracy, statistical modeling confirmed that the AI’s DDx lists were more accurate and comprehensive than the human physicians’ lists (P < 0.001). Accuracy was analyzed for lists containing 1 to 10 diagnoses, although neither AMIE nor the PCPs consistently submitted 10 differential diagnoses. Across all modalities, AMIE’s top-k accuracy was consistently higher than that of the PCPs over this range.
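For readers unfamiliar with the metric, top-k accuracy is the share of cases in which the correct diagnosis appears among the first k entries of a ranked DDx list. The following minimal sketch uses invented toy cases, not study data, and assumes exact string matching; the function name `top_k_accuracy` is hypothetical, and the study's own procedure for deciding whether a listed diagnosis counts as a match may differ.

```python
def top_k_accuracy(ddx_lists, truths, k):
    """Fraction of cases whose true diagnosis is among the first k entries."""
    if len(ddx_lists) != len(truths) or not truths:
        raise ValueError("need one non-empty ranked list per ground-truth label")
    # Lists shorter than k are scored on the entries they actually contain.
    hits = sum(truth in ddx[:k] for ddx, truth in zip(ddx_lists, truths))
    return hits / len(truths)


# Invented toy cases for illustration only (not from the study).
ddx_lists = [
    ["eczema", "psoriasis"],
    ["atrial fibrillation"],
    ["psoriasis", "tinea", "eczema"],
]
truths = ["eczema", "atrial flutter", "eczema"]

for k in (1, 3, 10):
    print(k, round(top_k_accuracy(ddx_lists, truths, k), 2))
```

Because slicing stops at the end of a short list, a clinician or model that submits fewer than 10 diagnoses is still scored at k = 10 on whatever it listed.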
In a separate automated ablation analysis on clinical document scenarios, the AI’s top-1 accuracy reached 0.98, compared to 0.89 for the “vanilla” baseline Gemini 2.0 Flash model. This indicates that state-aware reasoning improved performance over the base model alone.
In the evaluation of multimodal reasoning and overall robustness, expert ratings using the MUH rubric favored the AI on seven out of nine metrics. AMIE was found to be particularly robust to variations in image quality, with low-quality images causing a greater reduction in diagnostic performance for the PCPs than for AMIE. In this simulated evaluation, AMIE also showed fewer, and less severe, artifact-related false alarm events than the PCPs (P < 0.001).
Furthermore, the patient actors rated the AI significantly higher on 10 out of 11 GMCPQ criteria, including showing empathy and listening. In multimodal tasks, the AI was rated more favorably for its ability to explain findings (P < 0.01).
Impact of conversational diagnostic AI
Using data representing real-world clinical scenarios, this study highlights how state-of-the-art AI models can achieve performance that matches or exceeds that of PCPs in these simulated diagnostic settings by integrating perceptual evidence and state-aware reasoning.
Despite these results, the researchers cautioned that the study was exploratory and not a randomized clinical trial. Future work should evaluate the system’s performance, safety, and reliability, as well as its impact on clinical workflow and health equity, in real-world settings before clinical deployment is considered.