As AI tools enter clinical and population health research, this paper warns that speed alone is not enough and shows why expert oversight, causal logic, and transparent workflows remain essential for trustworthy science.

Research: Integrating artificial intelligence tools into health research. Image credit: FabrikaSimf / Shutterstock
In a recent article published in the journal npj Digital Medicine, an international research collaboration explored the operational frictions that arise when incorporating artificial intelligence (AI)-enabled tools into field-specific health research workflows. The authors argue that many AI-powered research tools are derived from data science workflows and software codebases, which may incorporate assumptions, terminology, and analytical priorities that do not necessarily align with epidemiological principles such as prespecified study design, causal inference, and bias control.
The article compares typical research workflows (“lifecycles”) and then presents a practical guide consisting of six core recommendations and five levels of automation to protect internal validity and maintain human accountability in high-stakes clinical and population health research.
Background on AI health research workflows
Modern medical research is facing an unprecedented influx of artificial intelligence (AI) tools designed to automate and speed up tasks from hypothesis generation to data synthesis.
However, the authors identify significant methodological differences between quantitative health science and computational data science. Traditional medical disciplines, such as quantitative epidemiology, operate within strict protocol-driven workflows where study designs are prespecified to minimize selection and information bias. In contrast, AI tools are often shaped by data science codebases, reflecting an interdisciplinary field focused on generating insights from existing data.
This difference is evident in how core terms are applied and interpreted. For example, in epidemiology, statistical significance is determined by rigorous hypothesis testing against a prespecified significance threshold (usually p < 0.05). Conversely, data science workflows often define significance by a feature's weight or impact on prediction within a complex model, emphasizing predictive performance over causal mechanisms.
The authors argue that uncritical adoption of data science-centric AI interfaces can alter research workflows in ways that are opaque to researchers, resulting in lower-quality outputs that do not meet established medical or epidemiological standards.
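The gap between the two meanings can be illustrated with a small sketch. It is not taken from the paper: the counts are hypothetical, and "accuracy gain" is only one simple stand-in for the prediction-based notions of importance that data science tools may report. The sketch computes a two-sided p-value from a pooled two-proportion z-test and, separately, how many more outcomes a majority-class predictor gets right once it knows the exposure group.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def accuracy_gain(events_a, n_a, events_b, n_b):
    """Gain in accuracy when a majority-class predictor is told the group."""
    # With the group known, predict the more common outcome within each group.
    correct_with = max(events_a, n_a - events_a) + max(events_b, n_b - events_b)
    # Without it, predict the more common outcome overall.
    total_events = events_a + events_b
    total_n = n_a + n_b
    correct_without = max(total_events, total_n - total_events)
    return (correct_with - correct_without) / total_n


# Hypothetical counts chosen for illustration only (not data from the paper).
EXPOSED_EVENTS, EXPOSED_N = 5300, 100_000
UNEXPOSED_EVENTS, UNEXPOSED_N = 5000, 100_000

p_value = two_proportion_p_value(
    EXPOSED_EVENTS, EXPOSED_N, UNEXPOSED_EVENTS, UNEXPOSED_N
)
gain = accuracy_gain(EXPOSED_EVENTS, EXPOSED_N, UNEXPOSED_EVENTS, UNEXPOSED_N)

print(f"p-value: {p_value:.4f}")
print(f"accuracy gain from knowing exposure: {gain:.4f}")
```

With these made-up counts the difference in event rates falls below the 0.05 threshold, yet knowing the exposure does not change a single majority-class prediction, so the same variable looks "significant" under one definition and unimportant under the other.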
Comparing epidemiology and data science
This article aims to systematically address these vulnerabilities, particularly through a comparative analysis that contrasts the structural elements of epidemiology and data science workflows.
The authors focused on quantitative epidemiology, the study of health distribution within populations, as the primary model for tabular data analysis. Specifically, they contrasted traditional medical workflows with standard data science lifecycles and developed six actionable strategies for researchers.
The study illustrates the problem with a practical example: a test of a multimodal AI-enabled analysis tool that leverages multiple large language models (LLMs) and can ingest raw datasets, generate Python code, and output statistical analyses.
The tool was tested using two prompting strategies to answer a complex causal question: “What is the causal effect of current smoking on heart attacks?” The first test, “Prompt 1,” used a basic prompt that mimicked a novice researcher, while the second, “Prompt 2,” provided specific guidance that directed the AI to generate a directed acyclic graph (DAG), a standard visual causal model in epidemiology.
The study further categorizes human-AI interactions using a framework adapted from self-driving cars, consisting of five automation levels, from Level 1 (basic automation under close human supervision) to Level 5 (full automation, in which the AI is directed to operate completely independently).
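For readers unfamiliar with DAGs, the following minimal sketch shows how one can be represented and queried in code. The edges are illustrative placeholders, not the DAG from the paper and not a clinically validated causal model; the function names are hypothetical. The sketch checks that the graph is acyclic, then lists the variables that are causes of both the exposure and the outcome (candidate confounders) and the variables that sit on a causal path between them (mediators). Shared ancestors are only a starting point for discussion; choosing a valid adjustment set in a real study requires a full causal analysis by domain experts.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical toy DAG: each key maps to the set of its direct causes (parents).
# These edges are placeholders for illustration, not findings from the paper.
PARENTS = {
    "age": set(),
    "sex": set(),
    "smoking": {"age", "sex"},
    "blood_pressure": {"smoking", "age"},
    "heart_attack": {"smoking", "age", "sex", "blood_pressure"},
}


def is_acyclic(parents):
    """Return True if the graph has no directed cycles."""
    try:
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def ancestors(parents, node):
    """Return every variable with a directed path into `node`."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def common_causes(parents, exposure, outcome):
    """Variables that are ancestors of both exposure and outcome."""
    return ancestors(parents, exposure) & ancestors(parents, outcome)


def mediators(parents, exposure, outcome):
    """Variables caused by the exposure that in turn cause the outcome."""
    outcome_ancestors = ancestors(parents, outcome)
    return {
        node
        for node in parents
        if exposure in ancestors(parents, node) and node in outcome_ancestors
    }


print("acyclic:", is_acyclic(PARENTS))
print("candidate confounders:", sorted(common_causes(PARENTS, "smoking", "heart_attack")))
print("mediators:", sorted(mediators(PARENTS, "smoking", "heart_attack")))
```

Writing the graph down explicitly is what allows a reviewer to check whether a later regression actually adjusts for the variables the causal model implies.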
Failures in the AI causal analysis
The study’s worked example showed that even seemingly efficient and well-structured AI-generated analyses can still contain significant methodological errors. Under the unconstrained Prompt 1 condition, the AI tool ran a logistic regression model and provided a functional Python script. However, a peer review of the model’s output revealed three major scientific flaws:
The AI completely bypassed theoretical causal modeling and omitted formal variable adjustment sets and DAG generation.
The system incorrectly interpreted the odds ratios it generated as a direct increase in probability, rather than an increase in odds. This is a fundamental epidemiological error that undermines the clinical relevance and applicability of the output.
The analytical results were not reproducible. Resubmitting the same prompt produced variable statistical output, making the tool’s output less consistent and robust.
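The odds ratio error can be made concrete with a short calculation. The sketch below is not from the paper; the function name and the baseline risks are assumptions chosen for illustration. It converts a baseline probability to odds, multiplies by the odds ratio, and converts back, showing that an odds ratio of 2 does not mean the probability doubles.

```python
def probability_from_odds_ratio(baseline_probability, odds_ratio):
    """Probability in the exposed group implied by a baseline risk and an odds ratio."""
    if not 0 < baseline_probability < 1:
        raise ValueError("baseline_probability must be strictly between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_probability / (1 - baseline_probability)
    exposed_odds = baseline_odds * odds_ratio
    return exposed_odds / (1 + exposed_odds)


# Hypothetical baseline risks; the odds ratio of 2.0 is also illustrative.
for baseline in (0.01, 0.10, 0.50):
    exposed = probability_from_odds_ratio(baseline, 2.0)
    print(
        f"baseline risk {baseline:.2f} -> exposed risk {exposed:.3f} "
        f"(risk ratio {exposed / baseline:.2f})"
    )
```

When the outcome is rare, the odds ratio and the risk ratio are close, which is why the two are easily confused; as the baseline risk grows, they diverge, and reading an odds ratio as a change in probability increasingly overstates the effect.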
Surprisingly, the expert-guided Prompt 2 yielded similarly problematic results. Although the AI succeeded in generating a visual DAG, the graph was judged conceptually meaningless and inconsistent with established medical literature. Additionally, the model was unable to integrate its own DAG into subsequent analysis steps.
Finally, execution terminated abruptly because the system was unable to convert a string variable to a number. This data cleaning error had not occurred on the first attempt. These findings demonstrate that seemingly plausible AI-generated outputs can still be inaccurate, especially when domain-specific causal inference is required.
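The paper does not describe the offending variable or how the failure should have been handled. As a hedged illustration only, the sketch below shows one way a researcher might surface non-numeric entries before model fitting rather than letting a script crash midway; the function name and the sample values are hypothetical.

```python
def coerce_numeric(values):
    """Convert values to floats, recording entries that cannot be parsed.

    Returns (numbers, problems): `numbers` holds a float or None per entry,
    and `problems` lists (index, raw_value) for every entry that failed.
    """
    numbers = []
    problems = []
    for index, raw in enumerate(values):
        try:
            numbers.append(float(str(raw).strip()))
        except ValueError:
            numbers.append(None)
            problems.append((index, raw))
    return numbers, problems


# Hypothetical raw column mixing numeric strings with a free-text entry.
raw_column = ["12", " 7.5", "unknown", "20"]
numbers, problems = coerce_numeric(raw_column)

print("parsed:", numbers)
print("needs human review:", problems)
```

Reporting the unparseable entries, instead of silently dropping or imputing them, keeps the decision about how to treat them with the researcher.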
Human responsibility in AI research
The article warns against the uncritical integration of AI into health research and emphasizes that, at least for now, this integration requires a sustained expert “human-in-the-loop” approach, with researchers evaluating the output of algorithms through a rigorous “peer review” process of rejecting, modifying, or approving text and code.
Researchers should use the proposed levels of automation as a guide, intentionally aligning the role of AI tools with specific workflow boundaries and balancing tight error tolerance with epistemic responsibility. In conclusion, the study highlights that, at this time, keeping human responsibility at the center of the human-AI loop is essential to preserving the scientific and clinical integrity of clinical and population health research.

