Leading AI models answer many questions about vaccines, but clinical rules still trip them up

New research shows leading AI models can address many questions about vaccines, but their mistakes about schedules, contraindications and eligibility highlight why medical oversight remains important.

Vaccine knowledge base construction and LLM processing pipeline. Study: Evaluating large language models for multilingual vaccine knowledge: A benchmark study

In a recent study published in npj Vaccines, a group of researchers evaluated how accurately large language models (LLMs) answer vaccine-related questions across a variety of vaccines, languages, and prompting strategies.

Background

People increasingly turn to digital tools such as artificial intelligence (AI) chatbots for health information, and many now ask LLMs about vaccines, from safety concerns to vaccination schedules. Incorrect answers in this area can affect medical decisions and public trust.

Vaccines are one of the most effective public health interventions, but vaccine hesitancy is increasingly challenging global immunization efforts. Therefore, it is important to determine whether AI can overcome language barriers and provide accurate and timely vaccine information.

About the research

Researchers developed VaxEval, a multilingual vaccine knowledge benchmark, to evaluate the performance of modern LLMs. The benchmark included 1,886 multiple-choice questions covering 14 vaccines and three UN languages (English, Spanish, and Chinese). Topics covered in these questions include vaccination schedules, efficacy, safety, side effects, debunking myths, access, and disease prevention.

Data for the questions were obtained from trusted health organizations including the World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), United Nations Children’s Fund (UNICEF), Africa CDC, American Medical Association (AMA), and Immunize.org. Additional materials were obtained from peer-reviewed scientific literature. All questions underwent extensive quality checks, and answer keys were verified against trusted scientific sources.

The researchers evaluated Generative Pre-trained Transformer (GPT)-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, General Language Model 4 (GLM-4), Reka Core, and E-Lightning. Each model was tested with three prompting methods: zero-shot, few-shot, and chain-of-thought.
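To make the three prompting methods concrete, the sketch below builds each kind of prompt for a single multiple-choice item. The templates, function names, and sample question are illustrative assumptions and are not the wording used in the study.

```python
from typing import List, Sequence, Tuple

LETTERS = "ABCD"  # assumes four answer options per question

def format_question(question: str, options: Sequence[str]) -> str:
    """Render a question followed by lettered options, one per line."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def zero_shot(question: str, options: Sequence[str]) -> str:
    """Zero-shot: the question alone, with no worked examples."""
    return format_question(question, options) + "\nAnswer with a single letter."

def few_shot(question: str, options: Sequence[str],
             examples: List[Tuple[str, Sequence[str], str]]) -> str:
    """Few-shot: solved examples come first, then the target question."""
    blocks = [format_question(q, o) + f"\nAnswer: {a}" for q, o, a in examples]
    blocks.append(format_question(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)

def chain_of_thought(question: str, options: Sequence[str]) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the answer."""
    return (format_question(question, options)
            + "\nThink step by step, then finish with 'Answer: <letter>'.")

# Hypothetical item, used only to show the prompt shapes.
question = "Which organ is primarily affected by hepatitis A?"
options = ["Heart", "Liver", "Kidney", "Lung"]
print(chain_of_thought(question, options))
```

The only difference between the three builders is what surrounds the question: nothing, solved examples, or a request for explicit reasoning.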

Model responses were scored on whether they selected the correct answer choice. The researchers then performed statistical analyses, including mixed-effects logistic regression, to identify characteristics of correct and incorrect answers and to compare model performance across languages, vaccine types, and model groups.
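Scoring a multiple-choice benchmark comes down to extracting the chosen letter from each reply and comparing it with the answer key. The following is a minimal sketch of that step; the parsing heuristic and the function names are assumptions, since the article does not describe how the study parsed model output.

```python
import re
from typing import Optional, Sequence

def extract_choice(response: str, letters: str = "ABCD") -> Optional[str]:
    """Return the chosen option letter, or None if no choice can be found.

    A simple heuristic: accept a bare letter such as "B" or "(b).",
    otherwise look for the last "Answer: X" or "answer is X" marker.
    """
    text = response.strip()
    bare = re.fullmatch(rf"\(?([{letters}])\)?\.?", text, flags=re.IGNORECASE)
    if bare:
        return bare.group(1).upper()
    marked = re.findall(rf"answer\s*(?:is)?\s*:?\s*\(?([{letters}])\b",
                        text, flags=re.IGNORECASE)
    return marked[-1].upper() if marked else None

def accuracy(responses: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter matches the key.

    Replies with no recognizable choice count as incorrect.
    """
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and equal in length")
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)

# Hypothetical replies: a bare letter, a reasoned reply, and a refusal.
replies = ["B", "The schedule has two doses. Answer: C", "I am not sure."]
print(accuracy(replies, ["B", "C", "A"]))
```

Per-item correctness computed this way (1 for correct, 0 for incorrect) is the kind of binary outcome a logistic regression would then model.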

Research results

The benchmark included 1,340 English questions, 250 Spanish questions, and 296 Chinese questions. Average accuracy across all models was 86.0% for English, 83.7% for Spanish, and 80.0% for Chinese. This indicates that LLMs have substantial vaccine-related knowledge in all three languages, although performance varies by language.
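Because the English set is much larger than the other two, a pooled figure depends on how the languages are weighted. The short calculation below uses only the counts and averages quoted above; the pooled values it produces are an illustration and are not figures reported by the study.

```python
# Question counts and mean accuracies as quoted in the article.
counts = {"English": 1340, "Spanish": 250, "Chinese": 296}
mean_accuracy = {"English": 0.860, "Spanish": 0.837, "Chinese": 0.800}

total_questions = sum(counts.values())

# Question-weighted mean: each language counts in proportion to its size.
weighted = sum(counts[lang] * mean_accuracy[lang] for lang in counts) / total_questions

# Unweighted mean: each language counts equally.
unweighted = sum(mean_accuracy.values()) / len(mean_accuracy)

print(f"total questions: {total_questions}")
print(f"question-weighted accuracy: {weighted:.1%}")
print(f"unweighted accuracy: {unweighted:.1%}")
```

The counts sum to the 1,886 questions in the benchmark. Weighting by question count gives about 84.8%, while treating the three languages equally gives about 83.2%, so the English-heavy composition pulls a pooled number upward.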

Among the systems evaluated, GPT-4o achieved the highest overall accuracy of 90.3%, closely followed by Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6%. As a group, the newer flagship models outperformed the previous-generation models.

Statistical analysis showed that flagship models were 57% more likely to return correct answers than older systems. Even so, GPT-4o, which was classified as an earlier-generation model in this study, achieved the highest overall accuracy.

Prompt type also affected model performance. Few-shot prompts produced the best results, increasing the likelihood of a correct response by 17% compared with zero-shot prompts.

Chain-of-thought prompts had the opposite of the expected effect: they were 21% less likely to yield a correct answer than zero-shot prompts. This suggests that forcing a model to generate step-by-step reasoning does not necessarily improve factual accuracy in structured vaccine-related tasks.
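Relative effects like these are easier to judge as absolute accuracies. The sketch below assumes the quoted percentages are odds ratios from the logistic regression (1.17 for few-shot and 0.79 for chain-of-thought, both relative to zero-shot); the article does not confirm that reading, and the 85% baseline is a hypothetical value chosen only for illustration.

```python
def apply_odds_ratio(baseline_p: float, odds_ratio: float) -> float:
    """Convert a baseline probability and an odds ratio into a new probability."""
    if not 0.0 < baseline_p < 1.0:
        raise ValueError("baseline_p must be strictly between 0 and 1")
    odds = baseline_p / (1.0 - baseline_p)   # probability -> odds
    new_odds = odds * odds_ratio             # scale the odds
    return new_odds / (1.0 + new_odds)       # odds -> probability

baseline = 0.85  # hypothetical zero-shot accuracy, not a study figure

few_shot_p = apply_odds_ratio(baseline, 1.17)  # assumed OR for few-shot
cot_p = apply_odds_ratio(baseline, 0.79)       # assumed OR for chain-of-thought

print(f"few-shot: {few_shot_p:.1%}")
print(f"chain-of-thought: {cot_p:.1%}")
```

Under these assumptions, few-shot prompting would lift an 85% baseline to roughly 87%, and chain-of-thought prompting would lower it to roughly 82%. When baseline accuracy is already high, a sizable change in odds corresponds to a shift of only a few percentage points.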

Performance varied widely depending on the vaccine type. The highest accuracy rates were observed for influenza (90.5%), hepatitis A (89.5%), human papillomavirus (HPV) (88.4%), and coronavirus disease 2019 (COVID-19) vaccines (85.3%).

Vaccines for meningococcal disease (81.7%), respiratory syncytial virus (RSV) (80.6%), pneumococcal disease (77.7%), and dengue fever (76.4%) fell into the lower-performing group. These results indicate that the models performed better on vaccines that feature heavily in public health communications and are widely discussed.

The models achieved their highest accuracy on questions about misconceptions and corrections (93.0%), prevention (90.0%), and regulatory or monitoring systems (87.2%). Accuracy was lower for questions about vaccine type and basic information (82.5%), efficacy and benefits (86.3%), cost and accessibility (82.6%), and dosage and recommendations (82.5%).

Linguistic analysis showed that Spanish and Chinese questions were less likely to be answered correctly than English questions. Further analysis of the semantically aligned multilingual questions showed that many of these differences were related to variations in dataset composition rather than inherent language biases.

The authors also noted that the Spanish and Chinese datasets were independently constructed rather than direct translations of the English questions, which may have contributed to differences in item difficulty, topic distribution, and source composition.

Error analysis highlighted weaknesses in the models. Almost half of the 150 sampled incorrect responses were due to overgeneralization, where the model made broad statements without considering vaccine-specific requirements.

Other common errors included incorrect dosing intervals, misidentified contraindications, incorrect age-based eligibility recommendations, and failure to differentiate between vaccine types. These errors are particularly concerning because they relate to practical guidance that may influence vaccination decisions.

Conclusion

The findings show that modern LLMs have extensive knowledge about vaccines and can accurately answer most vaccine questions across multiple languages.

At the group level, newer flagship models significantly outperformed previous-generation systems, and few-shot prompting improved performance. However, significant weaknesses remain in areas that require precise clinical guidance.

Additionally, accuracy remains inconsistent across different vaccines and languages. Although these systems have shown promise in supporting vaccine education and public health communication, their remaining error rates highlight the need for careful monitoring, continuous evaluation, and structured safety measures before widespread implementation in health-related settings.

The authors also emphasized that multiple-choice accuracy alone cannot establish clinical reliability or readiness for real-world vaccine counseling without prospective validation and contextual safety evaluation.

Further research is needed to evaluate the accuracy, safety, and real-world effectiveness of AI-powered medical communication.

Journal reference:

Chen, S., Wass, L., Wu, Z., Garay, L., Vizoso, J., Leung, K., Wu, J., and Lin, L. (2026). Evaluating large language models for multilingual vaccine knowledge: A benchmark study. npj Vaccines. DOI: 10.1038/s41541-026-01500-1, https://www.nature.com/articles/s41541-026-01500-1
