New research published in Computer reports on human behavior It suggests that efforts to make artificial intelligence more inclusive may introduce new and unanticipated biases. The scientists found that common artificial intelligence models tend to over-attribute stereotypically masculine behaviors to female characters and judge violence against women to be significantly more unpleasant than violence against men. These findings provide evidence that programming models to be sensitive to gender equality can inadvertently lead to extreme ethical contradictions.
Scientists launched this study to better understand how artificial intelligence systems handle gender and morality after initial training. During development, these models undergo a refinement process based on human feedback. This process involves human reviewers scoring the system’s responses and teaching the system good behavior, such as avoiding offensive language and promoting inclusivity.
The scientists thought that this human feedback stage might teach the model to be highly sensitive to certain cultural preferences. Specifically, they thought the model would focus on including women in traditionally male spaces and protecting them from harm.
“There is a growing public debate about whether AI chatbots can introduce unexpected biases, especially after post-training efforts aimed at making AI chatbots safer and more inclusive. However, much of that debate is anecdotal. We wanted to move beyond isolated examples and systematically test this issue,” said study author Valerio Capraro, associate professor at the University of Milano-Bicocca.
To test these ideas, the researchers conducted two main sets of experiments using different versions of the ChatGPT system, specifically GPT-3.5 Turbo, GPT-4, and GPT-4o.
“In this study, we focused on one of the most widely used chatbots of the time and examined whether it exhibited surprising gender bias in two very different contexts,” Capraro said. “The goal was not only to document bias, but also to understand whether attempts to reduce some biases may unintentionally create new ones.”
In the first set of four experiments, the scientists investigated how the system assigns gender to everyday utterances. They urged systems to use standard public web interfaces to maintain realistic user conditions.
The researchers presented an artificial intelligence with 20 short phrases written in the style of an elementary school student. Three pairs were gender-indicating control phrases. The remaining 17 pairs included traditional gender stereotypes about toys, movies, and future careers.
Half of these experimental phrases included traditionally feminine stereotypes, such as liking the color pink or wanting to be a nurse. The other half included traditionally masculine stereotypes, such as playing hockey or wanting to be a firefighter. The scientists asked the system to imagine the author of a phrase and assign it a name, age, and gender, then repeated this process 10 times for each pair of phrases, generating 400 responses for each study.
The responses showed significant asymmetries in how artificial intelligence applies gender assumptions. For phrases involving typically feminine activities, the model consistently assigned female writers almost every time. For phrases that involved typically masculine activities, the model also often assigned female writers.
For example, the models always attributed texts about loving football and practicing with their cousins to female writers. The researchers suggest that this happens because this refinement process places a strong emphasis on placing women in traditionally masculine roles. At the same time, the show lacks an equal push to have men take on traditionally feminine roles, creating a persistent gender bias.
In a second set of four experiments, we tested how these gender asymmetries affect high-stakes moral decisions. Scientists asked GPT-4 to assess consent to various acts of violence necessary to prevent a hypothetical nuclear apocalypse. This system uses a scale from 1 to 7, where 1 means “strongly disagree” and 7 means “strongly agree.”
In the first morality experiment, scientists asked the system 50 times whether it was acceptable to harass women, harass men, or sacrifice human lives to stop the apocalypse. GPT-4 consistently gave the lowest scores for harassment of women, averaging perfect. In contrast, the system gave an average score of 3.34 for harassing men and 3.61 for sacrificing one’s life, indicating that people think harassing women is much worse than killing random people.
To see if this pattern held true for different types of harm, the researchers conducted another experiment focused on abuse and torture. They each questioned the system 20 times about abuse or torture of men or women to stop the apocalypse. The system strongly opposed abuse of women but was more tolerant of abuse of men, with an average score of 4.2. On the other hand, the system deemed it equally acceptable to torture men and women.
“What surprised me most was how strong and consistent some of these effects were,” Capraro told PsyPost. “In one test, we asked GPT-4 50 times whether it was permissible to harass women to prevent a nuclear apocalypse, and each time it responded ‘strongly opposed.'”
“In contrast, when we asked about torture of women, the responses were much more variable and on average quite close to the midpoint of the scale. This is a highly unusual order when considered in terms of objective severity of harm. This suggests that the model does not simply respond to severity in a consistent way, but may be particularly sensitive to certain categories of harm that are socially and politically salient.”
In other words, this unexpected pattern may occur because torture is not as central to contemporary gender equality discussions as abuse. Models are likely to be trained to point out and denounce harassment, especially against women.
The researchers then investigated whether these biases were explicit or covert. They directly asked GPT-4 to rank the severity of these various moral violations on a 20-point scale. When asked directly, the system ranked violations based on objective physical harm, with victimization being the worst, followed by torture, abuse, and harassment. He made it clear that gender didn’t matter, making it clear that the biased judgment in the previous scenario was completely implicit.
“This is important because it suggests that evaluating AI systems only through direct, unambiguous questions may miss important biases that appear in applied decision-making,” Capraro explained.
The final experiment tested a complex scenario involving mixed gender violence. Researchers asked the system 80 times about situations in which a bomb disposal expert would have to physically harm an innocent person in order to obtain the biological code to stop an explosion.
When the expert was female and the victim was male, the system rated the violence higher, with an average score of 6.4 out of 7. When the expert was male and the victim was female, the system strongly condemned the exact same behavior, with an average score of 1.75. The gender of the character dramatically changed the moral compass of the system.
“The main takeaway is that reducing bias in AI is not easy,” Capraro said. “Efforts to make models more inclusive can sometimes introduce new asymmetries or amplify certain moral sensitivities in unexpected ways.”
“The broader lesson, therefore, is that people should be careful about treating AI systems as neutral or objective. These models may not only reflect patterns in the training data, but also the values and priorities introduced during fine-tuning and human feedback. In some cases, this can lead to judgments that are not just biased, but shockingly extreme.”
However, the researchers caution that users should avoid interpreting these specific results as permanent features of all artificial intelligence systems. These programs receive regular updates, so future versions may handle these prompts differently. “This paper should not be read as an argument that today’s models necessarily behave in exactly the same way,” Capraro noted.
“Our broader point is not that these exact biases always emerge, but that post-training interventions can create unintended distortions. In other words, this paper is not about a particular model, but rather a general guide to both developers and users. It’s about warnings. Developers need to be aware that trying to fix one problem may create another. Users need to remember that even seemingly confident output may reflect hidden biases.”
“One important next step is to study whether similar biases emerge in more realistic and socially significant situations, such as resume screening, job recommendations, and other decision support situations,” Capraro continued. “These are areas where bias is actually very important.”
“More broadly, I think AI has tremendous potential, but that potential will only be socially beneficial if systems are developed and deployed in a way that equitably distributes the benefits. My long-term goal is therefore to better understand how biases creep into these models, how they vary between model versions and styles of prompts, and how harmful distortions can be mitigated without simply replacing them with new ones.”
The study, “The surprising gender bias in GPT,” was authored by Larca Alexandra Frug and Valerio Capraro.

