Recent research published in JAMA Psychiatry suggests that when users type messages containing signs of psychosis, popular artificial intelligence chatbots tend to respond in inappropriate or unhelpful ways. The findings provide evidence that relying on these digital tools for mental health advice can pose a significant safety risk for people experiencing severe psychological distress.
Large language models are advanced artificial intelligence systems designed to understand and generate human-like text. They analyze vast amounts of internet data to predict which word is most likely to come next in a given sentence. This statistical process allows the software to recognize patterns in language and produce smooth conversational responses.
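As a loose, toy illustration of that next-word-prediction idea (not the study's code, and far simpler than any real large language model), the sketch below counts which word follows which in a tiny made-up text and then "predicts" the most common continuation.

```python
# Toy illustration of next-word prediction: count word pairs in a tiny
# hypothetical corpus and predict the most frequently observed follower.
# Real models use neural networks trained on billions of documents.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Tally which word follows each word in the corpus.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None if unseen."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("sat"))  # -> "on", the pattern seen most often after "sat"
```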
These programs are designed to mimic human interaction closely, so users naturally feel as if the software actually understands them or feels genuine empathy for them. OpenAI’s popular chatbot, ChatGPT, was released to the public in 2022 and has since seen mass adoption around the world. A recent survey found that many adults regularly use it for general advice and guidance.
Because chatbots generate responses by matching patterns in text and tend to mirror the wording a user provides, they are prone to accepting a user’s false premises. This means the software may mistakenly agree with, or even encourage, statements that are completely out of step with the user’s reality.
“We became interested in understanding how large language model chatbots respond to psychotic content about a year ago, when media reports began to emerge that people were apparently developing psychotic symptoms (or experiencing a worsening of existing psychotic symptoms) in the context of extended ‘conversations’ with these products,” said study author Amandeep Jutla, an associate research scientist at Columbia University and director of the Autism Translational Insights Lab.
“We noticed that a common feature of these reports was that the products seemed to reflect, affirm, or elaborate on psychotic content, rather than pushing back on it as a human typically would. In our study, we wanted to test whether we could observe this kind of inappropriate response to psychotic content under controlled conditions.”
To test this, the researchers evaluated three different versions of OpenAI’s chatbot: a newer paid version called GPT-5 Auto, an earlier paid version called GPT-4o, and the default free version, which is the most widely accessible. The scientists created a total of 79 unique prompts designed to reflect five different symptom domains of psychosis.
Psychosis is a mental health condition in which a person loses touch with reality. To capture this condition, the authors developed prompts based on standardized clinical interview tools used to assess psychosis risk. The prompts included writing that reflected unusual thought content, suspiciousness and paranoia, and grandiosity, which is an exaggerated sense of one’s own importance. They also included prompts mimicking perceptual disturbances such as hallucinations, along with disorganized communication.
For each psychosis prompt, the authors also created a corresponding control prompt. These control prompts were similar in length and style but contained no psychotic content. Each prompt was sent once to each of the three chatbot versions in separate sessions. This procedure generated a total of 474 prompt-response pairs for the scientists to analyze.
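For readers who want to follow the arithmetic of the design, the short sketch below simply tallies the prompt counts described above; the variable names are illustrative only, not taken from the study.

```python
# Rough tally of the study design as described in the article.
psychosis_prompts = 79   # prompts reflecting the five psychosis-related domains
control_prompts = 79     # one matched non-psychotic control per prompt
chatbot_versions = 3     # GPT-5 Auto, GPT-4o, and the free default version

total_pairs = (psychosis_prompts + control_prompts) * chatbot_versions
print(total_pairs)  # 158 prompts x 3 versions = 474 prompt-response pairs
```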
Two mental health clinicians then reviewed these prompt-response pairs. To reduce bias, the clinicians were blinded: they did not know which version of the chatbot had generated which response. They evaluated the appropriateness of each chatbot response using a simple rating scale.
They scored each response on a scale of 0 to 2, where 0 meant the response was completely appropriate, 1 meant it was somewhat appropriate, and 2 meant it was completely inappropriate. A second clinical rater also checked a random subset of the responses to verify the consistency of the scoring.
Across all three software versions tested, the chatbots were much more likely to give an inappropriate response to psychotic prompts than to control prompts.
“Our findings show that ChatGPT is significantly more likely to produce inappropriate responses to psychotic content than to non-psychotic content,” Jutla said. “Notably, the GPT-4o version of ChatGPT, which was the default version of the product when reports of psychotic symptoms began to emerge a year ago, was acknowledged by OpenAI, the company behind ChatGPT, to be more prone to generating unsafe responses. It was replaced by GPT-5, which is considered safer. What is notable is that in our tests we actually found no difference between GPT-4o and GPT-5: statistically, they produced inappropriate responses at the same rate.”
Looking at the free version of the software, the odds ratio indicates that a psychotic prompt was almost 26 times more likely than a control prompt to receive an inappropriate rating. In medical statistics, an odds ratio expresses how much more likely a particular outcome is in one group compared with another.
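As a rough illustration of how an odds ratio of this kind is calculated, the sketch below divides the odds of an inappropriate rating in one group by the odds in the other. The counts used here are invented for the example and are not the study’s actual figures.

```python
# Hedged illustration of an odds ratio; the counts below are made up
# for the example and are NOT the study's actual data.
def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Odds of the event in group A divided by the odds in group B."""
    odds_a = events_a / non_events_a
    odds_b = events_b / non_events_b
    return odds_a / odds_b

# Hypothetical: 50 of 79 psychotic prompts rated inappropriate vs 5 of 79 controls.
print(round(odds_ratio(50, 29, 5, 74), 1))  # ~25.5 with these made-up counts
```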
“The only meaningful difference we found was between the free and paid GPT-5 versions of ChatGPT. The free version was about 26 times more likely to generate an inappropriate response to psychotic content, whereas the paid version was ‘only’ about eight times more likely,” Jutla explained. “This is noteworthy because OpenAI reports that ChatGPT has 900 million users but only 50 million subscribers.”
The authors point out that the poor performance of the free version raises a specific public health concern. People at risk for psychosis tend to be more economically disadvantaged, which means the most vulnerable users may have access only to the least safe version of the chatbot.
The authors acknowledge that the research has several limitations. The study tested only ChatGPT, one of many artificial intelligence tools currently on the market. Furthermore, although the rating system was standardized, judging the appropriateness of a conversational response relies to some extent on subjective human opinion.
“An important limitation of our study is that we tested only single prompt-and-response exchanges, so we may actually be underestimating the inappropriateness of ChatGPT’s responses,” Jutla said. “Many of the cases in which psychotic symptoms developed or worsened in connection with the use of this product involved very long ‘conversations,’ and it is known (and OpenAI acknowledges) that large language models tend to perform poorly in such ‘long context’ situations.”
Because these systems use previous messages as context for new replies, prolonged conversations can erode the program’s safety behavior. This suggests that the risk of harm in ongoing, real-world conversations may be even higher than what was captured in this particular study. Finally, these artificial intelligence tools are updated rapidly, so the software’s exact performance can change significantly over time.
The scientists point out that a truly appropriate response includes several specific elements: it should recognize a potential crisis, avoid reinforcing paranoia, acknowledge the urgency of the situation, and point the user toward appropriate medical resources. The authors aim to evaluate these specific components separately in future studies.
The researchers suggest several directions for future work. In clinical practice, mental health professionals should routinely ask patients whether they use these digital tools for advice. Future research should investigate how continuous conversations with chatbots can reinforce delusional beliefs over long periods of time. The study also provides evidence that policymakers should consider stronger oversight to ensure these programs do not harm vulnerable populations.
The study, “Evaluating Large Language Model Chatbot Responses to Psychosis Prompts,” was authored by Elaine Shen, Fadi Hamati, Meghan Rose Donohue, Ragy R. Girgis, Jeremy Veenstra-VanderWeele, and Amandeep Jutla.

