As more people turn to ChatGPT and other large language models (LLMs) for mental health advice, new research suggests these AI chatbots may not be ready for the role. The study found that even when instructed to use established psychotherapy approaches, the models consistently violated professional ethical standards set by organizations such as the American Psychological Association.
Researchers at Brown University worked closely with mental health experts to identify recurring patterns of problematic behavior. In testing, chatbots mishandled crisis situations, responded in ways that reinforced harmful beliefs users held about themselves and others, and used language that gave the impression of empathy without true understanding.
“In this study, we present a practitioner-informed framework of 15 ethical risks to demonstrate how LLM counselors violate ethical standards in mental health practice by mapping model behaviors to specific ethical violations,” the researchers wrote in the study. “We call for future efforts to create ethical, educational, and legal standards for LLM counselors—standards that reflect the quality and rigor of care required for human psychotherapy.”
The findings were presented at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society. The research team is part of Brown’s Center for Technological Responsibility, Reimagining and Redesign.
How prompts shape AI therapy responses
Zainab Iftikhar, a Ph.D. candidate in computer science at Brown University who led the study, set out to examine whether carefully worded prompts could lead AI systems to behave more ethically in mental health settings. Prompts are written instructions given to a model to shape its output without retraining the model or adding new data.
“Prompts are instructions given to a model to guide its behavior to accomplish a specific task,” Iftikhar said. “Although they do not change the underlying model or provide new data, prompts help guide the model’s output based on existing knowledge and learned patterns.
“For example, a user might prompt a model with: ‘Act as a cognitive behavioral therapist and help me reframe my thoughts,’ or ‘Use the principles of dialectical behavior therapy to help me understand and manage my emotions.’ These models don’t actually perform therapeutic techniques the way a human would; they use learned patterns to generate responses in line with CBT or DBT concepts based on the prompts they are given.”
People regularly share prompting strategies like these on platforms such as TikTok, Instagram, and Reddit. Beyond individual experimentation, many consumer-facing mental health chatbots are built by layering therapy-related prompts on top of generic LLMs. That makes it especially important to understand whether prompts alone can make AI counseling safer.
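As a rough sketch of how such prompt-based chatbots are often assembled (an illustrative example, not the setup used in the study; the prompt wording and model name here are assumptions), a developer might simply prepend a therapy-style system prompt to calls against a general-purpose chat API:

```python
# Illustrative sketch only: a "counselor" built purely by prompting a generic LLM.
# The system prompt wording and model name are hypothetical, not from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CBT_SYSTEM_PROMPT = (
    "Act as a cognitive behavioral therapist. Help the user identify and "
    "reframe unhelpful thoughts, and respond with warmth and empathy."
)

def counselor_reply(user_message: str) -> str:
    """Send one user turn to the model with the CBT-style system prompt prepended."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any general-purpose chat model could be substituted
        messages=[
            {"role": "system", "content": CBT_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(counselor_reply("I feel like I always mess everything up."))
```

Nothing in a pipeline like this changes the underlying model or adds clinical safeguards; the prompt only steers the model’s existing learned patterns, which is why the study asks whether prompting alone can make such systems safe.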
Testing AI chatbots in simulated counseling
To evaluate the systems, the researchers worked with seven trained peer counselors experienced in cognitive behavioral therapy (CBT). The counselors conducted self-counseling sessions with AI models prompted to act as CBT therapists. Models tested included versions of OpenAI’s GPT series, Anthropic’s Claude, and Meta’s Llama.
The team then selected simulated chats modeled on real human counseling conversations. Three licensed clinical psychologists reviewed the transcripts and flagged potential ethical violations.
The analysis revealed 15 distinct ethical risks, grouped into five broad categories:
- Lack of contextual adaptation: Ignoring a person’s unique background and circumstances and offering generic, one-size-fits-all advice.
- Poor therapeutic collaboration: Dominating the conversation and at times reinforcing a user’s false or harmful beliefs.
- Deceptive empathy: Using first-person, empathetic-sounding language to suggest an emotional connection without genuine understanding.
- Unfair discrimination: Exhibiting bias related to gender, culture, or religion.
- Lack of safety and crisis management: Refusing to engage with sensitive topics, failing to direct users to appropriate resources, and responding inadequately to crises, including suicidal ideation.
The accountability gap in AI mental health care
Iftikhar pointed out that human therapists can also make mistakes. The main difference is oversight.
“There are governing boards and mechanisms for human therapists to hold them professionally accountable for abuse and malpractice,” Iftikhar said. “However, when an LLM counselor commits such a violation, there is no established regulatory framework.”
The researchers emphasized that their findings do not mean there is no role for AI in mental health care. AI-powered tools could help expand access to care, especially for people who face high costs or have limited access to qualified professionals. But the study underscores the need for clear safeguards, responsible deployment, and stronger regulatory structures before these systems are relied on in high-stakes situations.
For now, Iftikhar hopes the study will inspire caution.
“When you’re talking about chatbots and mental health, there are a few things people should be aware of,” she said.
Why rigorous evaluation is important
Ellie Pavlick, a computer science professor at Brown who was not involved in the study, said the research highlights the importance of carefully evaluating AI systems used in sensitive areas such as mental health. Pavlick leads ARIA, a National Science Foundation AI research institute at Brown focused on building trustworthy AI assistants.
“The reality of AI today is that it is much easier to build and deploy systems than it is to evaluate and understand them,” Pavlick said. “This paper required more than a year of research with a team of clinical experts to demonstrate these risks. Most of today’s AI work is evaluated using automated metrics, which are static by design and do not involve humans in the loop.”
She added that the study could serve as a model for future research aimed at improving the safety of AI mental health tools.
“There is a real opportunity for AI to play a role in combating the mental health crisis facing our society, but it is paramount that we take the time to actually critique and evaluate our systems every step of the way to avoid doing more harm than good,” Pavlick said. “This work provides a good example of what that looks like.”

