Artificial intelligence chatbots routinely answer public questions about sensitive health topics like addiction, providing mostly accurate yet highly generalized information. A recent evaluation found that while chatbot responses are generally in line with national guidelines, they often lack the contextual details needed to make individualized health decisions. These descriptive findings were recently published in the journal Drug and Alcohol Dependence.
Substance use disorder is a chronic medical condition defined by the compulsive use of drugs or alcohol despite negative physical, social, and emotional consequences. Official diagnostic frameworks classify the condition along a spectrum of severity rather than applying a binary label of addiction. The diagnosis reflects changes in brain function that drive craving, physical tolerance, and withdrawal symptoms. Recent health surveys show that in the United States alone, approximately 50 million people over the age of 12 meet diagnostic criteria for the condition.
Despite expanded access to medical care, addiction treatment remains underutilized. Healthcare providers face institutional limitations, time constraints, and a lack of specific training in the condition. At the same time, the social stigma surrounding addiction leads many people to avoid seeking formal medical advice for fear of judgment or legal repercussions.
People often turn to digital platforms as a first, private step toward gathering health information. Chatbots provide anonymous, immediate responses without the perceived judgment of a clinical environment. However, the quality of this machine-generated medical guidance is not always reliable, especially for highly stigmatized behavioral health conditions.
To better understand how these systems perform, researchers designed a study to evaluate the medical accuracy of artificial intelligence responses about addiction. First author Morgan Decker, a medical student, and senior author Lea Sacca, a public health researcher, conducted the study with a team at Florida Atlantic University, collaborating with addiction treatment physicians and data scientists to evaluate the digital guidance.
The research team focused on 14 frequently asked questions about substance use disorders. To create this list, they first asked the chatbot to generate a list of frequently asked questions adults have about diagnosis, treatment, and recovery. The team then cross-referenced these outputs with actual FAQs from leading healthcare organizations.
Benchmark organizations included the Centers for Disease Control and Prevention and the Substance Abuse and Mental Health Services Administration. The researchers also incorporated guidelines from the National Institute on Drug Abuse and the American Society of Addiction Medicine. This ensured that the artificial intelligence’s answers were evaluated against established best practices in the medical field.
The researchers entered the final 14 questions into the software and collected the responses, using the updated fifth version of the application. To standardize the output, they applied settings that limit the model’s randomness so that the answers would remain consistent and factual rather than conversational.
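For readers curious what such settings look like in practice, here is a minimal sketch using the OpenAI Python client. The model identifier and the sample question are illustrative placeholders; the article does not report the study’s exact configuration beyond the use of low-randomness settings.

```python
# Minimal sketch of collecting low-randomness chatbot responses.
# The model name and question are placeholders, not the study's
# exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    "What are the signs and symptoms of substance use disorder?",
    # ...the remaining 13 study questions would go here
]

responses = []
for question in questions:
    completion = client.chat.completions.create(
        model="gpt-5",  # placeholder identifier for the fifth version
        messages=[{"role": "user", "content": question}],
        temperature=0,  # minimize randomness for consistent answers
    )
    responses.append(completion.choices[0].message.content)
```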
Each generated response was independently reviewed by a pair of blinded raters. The assessment intentionally mixed training levels, pairing students with board-certified addiction professionals. Raters scored the responses on a four-point scale based on accuracy, precision, and appropriateness for a general audience, and disagreements between raters were resolved through discussion with additional senior experts.
The highest score on the scale indicated an excellent response requiring no further explanation. The next two tiers represented satisfactory answers needing minimal or moderate clinical explanation. The lowest score denoted an unsatisfactory answer containing information that was inaccurate or dangerously misleading by modern medical standards.
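As a concrete illustration, the scale and the adjudication rule described above could be encoded as follows; this is a hypothetical sketch, with tier labels paraphrased from the article rather than taken from the study’s rubric.

```python
# Hypothetical encoding of the four-point rubric; labels paraphrase
# the tiers described above, not the study's exact wording.
RUBRIC = {
    4: "Excellent: needs no further explanation",
    3: "Satisfactory: needs minimal clinical explanation",
    2: "Satisfactory: needs moderate clinical explanation",
    1: "Unsatisfactory: inaccurate or dangerously misleading",
}

def needs_adjudication(score_a: int, score_b: int) -> bool:
    """Flag a response for senior-expert review when the two
    blinded raters disagree on its rubric score."""
    return score_a != score_b
```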
The evaluators found that none of the answers provided by the software was unsatisfactory. Three of the 14 responses were rated excellent. Nine were satisfactory but required minimal additional detail, and two were satisfactory but required moderate clinical detail.
The artificial intelligence performed best on simple definitional prompts. When asked about the signs and symptoms of substance use disorder, the chatbot produced an accurate list that matched expert guidelines, pinpointing craving, withdrawal symptoms, and loss of control over use as key indicators.
Another highly rated answer addressed whether relapse represents failure. The software accurately emphasized that a return to use does not mean that treatment has failed. Instead, consistent with the empathetic tone recommended by public health officials, it framed relapse as a normal part of the recovery process that may require adjustments to the treatment strategy.
Many responses provided broad summaries but omitted important clinical detail. When asked about the risks of untreated addiction, the software correctly listed overdose, liver damage, and social isolation. However, it did not mention the increased risk of various cancers and infectious diseases, major complications recognized by public health officials.
When evaluating treatment options, the software accurately mentioned behavioral therapy and support groups, but it failed to name any specific federally approved medication for alcohol use disorder. It also gave vague advice to people asking how to help a loved one, telling them to avoid enabling behavior without explaining what enabling actually looks like.
The software also failed to provide practical resources when asked where to seek treatment. It pointed to primary care physicians, mental health professionals, and anonymous support groups as avenues of help, but it completely omitted centralized, government-backed tools such as national helplines and website directories that provide immediate, confidential assistance based on geographic location.
As the medical scenarios became more complex, significant gaps in the software’s knowledge base became apparent. When asked about managing withdrawal symptoms, the application accurately noted that physical symptoms occur when a person stops using a substance. But it did not warn users that withdrawal from certain substances, such as alcohol or benzodiazepines, can be fatal and requires immediate medical supervision.
The software also needed refinement regarding treatment duration. It accurately stated that recovery timelines vary widely depending on individual needs and symptom severity. While that is true, medical institutions typically recommend a minimum of three months in treatment to achieve better recovery outcomes, a benchmark the software never mentioned.
The researchers note that their methodology has several limitations. The study relied on a subjective evaluation process by a specific group of medical professionals, and other clinical experts might rate nuanced answers differently. Additionally, the researchers tested only a small sample of 14 questions, which limits how broadly the findings can be generalized.
Using an artificial intelligence program to generate the initial list of questions may also have introduced circular bias into the experiment: the software may perform better on prompts that match its own structured, rational phrasing. Real patients often write prompts that are highly emotional, ambiguous, or poorly worded, which can elicit very different guidance.
The researchers also did not test how real patients would interpret or apply the digital advice. Health literacy varies widely across the population, and even scientifically accurate but highly generalized paragraphs can confuse people unfamiliar with medical terminology, especially those trying to manage an addiction without a clinician’s guidance.
There are also ethical concerns about the use of personal health data by technology companies. Substance use disorders often come with legal risks, and poorly protected digital searches can compromise patient privacy. The language used by chatbots can also inadvertently reinforce social biases if the software relies on biased training data.
Future research should investigate a wider variety of real-world patient questions drawn from online forums and clinical data. The researchers also recommend evaluating competing digital platforms to see whether different companies’ models offer better medical precision. Until these systems improve, human health professionals will still be needed to safely contextualize digital health information.
The study, “Descriptive Content Analysis Evaluation of ChatGPT Responses to Substance Use Disorder Treatment Questions Compared to National Health Guidelines,” was authored by Morgan Decker, Christine Kamm, Sara Burgoa, Meera Rao, Maria Mejia, Christine Ramdin, Adrienne Dean, Melodie Nasr, Lewis S. Nelson, and Lea Sacca.

