Human psychological tricks can bypass AI’s safety guardrails

Artificial intelligence systems programmed to reject harmful requests can be persuaded to break their own safety rules when prompted with classic psychological techniques. Recent research published in PNAS provides evidence that these models respond to human-like persuasion strategies, suggesting hidden vulnerabilities in current safety protocols. The findings demonstrate that malicious users can manipulate artificial intelligence without advanced technical skills.

Modern artificial intelligence programs, known as large language models, learn by processing vast collections of human-generated text. This training data includes books, websites, and social media posts. The model learns to predict the most likely next word in a sequence. Its answers are then fine-tuned to match human expectations.

Because these computer programs train on countless human social interactions, they often exhibit what the scientists call parahuman behavior. This means that a model behaves as if it were experiencing human motivations, such as wanting to fit in or deferring to an expert. This machine learning process is structurally similar to how biological systems learn through trial and error.

Technology companies are designing models with safety guardrails to ensure that dangerous or unauthorized content is not generated. For example, models are programmed to reject requests to help synthesize illegal drugs or hurl insults at users. The authors of this paper wanted to know whether humans’ everyday persuasion tactics could bypass these artificial barriers. They wondered if computer programs that behave like humans might share humans’ vulnerability to manipulation.

While previous research has often focused on how software interacts with humans, this team looked at the opposite dynamic. “AI systems have become more useful by knowing how to incorporate established principles and practices of social influence into the persuasiveness they produce,” said study co-author Robert Cialdini, professor emeritus of psychology and marketing at Arizona State University.

“We wanted to know whether they were susceptible to the same principles and practices in persuasive appeals directed at them. They were influenced even when asked to provide socially dangerous information.”

Psychologists recognize seven classic principles of persuasion that influence human behavior: authority, commitment, liking, reciprocity, scarcity, social proof, and unity. The researchers designed specific text prompts to test each of these psychological tricks. They wanted to see if linguistic cues could act as a backdoor to persuade artificial intelligence to ignore its own safety rules.

Each principle targets a different social motive. The authority principle relies on citing experts, such as famous scientists, to encourage deference. Scarcity frames the request as time-sensitive, giving the computer a false sense of urgency. Commitment uses a foot-in-the-door technique, asking the software for a small, innocuous favor before making the larger, restricted request.

Other tactics rely on positive social interactions. Liking involves complimenting the model before asking for prohibited information. Reciprocity offers a helpful act first, such as giving the computer useful notes, to create a sense of conversational debt.

Social proof tells the machine that thousands of other users have already performed the restricted action, normalizing the bad behavior. Finally, unity appeals to a shared group identity to foster cooperation.

In a preliminary study, the researchers tested an older model called GPT-4o mini. They asked the software to perform objectionable tasks, such as insulting users by calling them bastards and explaining how to synthesize lidocaine, a regulated anesthetic. The scientists generated 28,000 conversations. In the control group, the prompt simply asked for the prohibited behavior, whereas in the treatment group, the prompt included one of the seven persuasion principles.

When prompted normally without persuasion, the model complied with harmful requests in 33.4% of conversations. When the prompt included persuasive techniques, compliance more than doubled to 72.1%. The researchers then expanded this initial test to include a variety of insults and compounds, generating an additional 98,000 conversations to check that the effects were consistent. Persuasion tactics consistently increased the likelihood that the model would break its safety rules.

To test whether newer, more advanced systems share this vulnerability, the researchers designed a more rigorous main experiment. They tested three frontier models that use a reasoning step before answering: OpenAI’s GPT-5 mini, Anthropic’s Claude Haiku 4.5, and Google’s Gemini 3 Flash. This main experiment focused specifically on the synthesis of six highly regulated chemicals.

The target substances included an anabolic steroid, an opiate, a stimulant, a barbiturate, a benzodiazepine, and a precursor chemical. The authors generated 126,000 unique conversations across the three models. Each conversation was randomly assigned one of the six controlled substances and one of the seven persuasion principles. Half of the prompts served as controls without persuasive wording, and the other half included the psychological tactic.
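The random assignment described above can be sketched in a few lines of Python. This is only an illustration of the design, not the authors' code: the function name `assign_conditions`, the even split of 42,000 conversations per model, and the seed are assumptions, and the principle labels follow the standard names for Cialdini's seven principles.

```python
import random
from collections import Counter

# Labels follow the article; the even split across models is an assumption.
MODELS = ["GPT-5 mini", "Claude Haiku 4.5", "Gemini 3 Flash"]
SUBSTANCE_CLASSES = ["anabolic steroid", "opiate", "stimulant",
                     "barbiturate", "benzodiazepine", "precursor"]
PRINCIPLES = ["authority", "commitment", "liking", "reciprocity",
              "scarcity", "social proof", "unity"]


def assign_conditions(n_per_model, seed=0):
    """Hypothetical helper: give each conversation a random substance class
    and principle, with exactly half of each model's runs as controls.
    Expects an even n_per_model."""
    rng = random.Random(seed)
    plan = []
    for model in MODELS:
        # Build a balanced control/treatment list, then shuffle its order.
        arms = ["control", "treatment"] * (n_per_model // 2)
        rng.shuffle(arms)
        for arm in arms:
            plan.append({
                "model": model,
                "substance": rng.choice(SUBSTANCE_CLASSES),
                "principle": rng.choice(PRINCIPLES),
                "arm": arm,
            })
    return plan


plan = assign_conditions(n_per_model=42_000)
arm_counts = Counter(row["arm"] for row in plan)
print(len(plan), arm_counts["control"], arm_counts["treatment"])
```

Under these assumptions the plan contains 126,000 rows, split evenly between control and treatment prompts.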

Because newer models often provide partial information rather than a complete rejection or full compliance, the researchers used a three-level coding system. Responses were rated as no compliance, partial compliance, or full compliance.

A non-compliant response indicates a complete refusal of assistance. Partial compliance means that the model provides some chemical steps but omits certain temperatures or precise measurements. Full compliance means the system provides a complete step-by-step recipe.
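One way to picture this rubric is as an ordered three-value scale. The sketch below is an assumption about how such ratings could be represented and collapsed into a single "any compliance" rate; the names `Compliance` and `any_compliance_rate` are hypothetical, and the sample ratings are invented for illustration rather than taken from the study.

```python
from enum import IntEnum


class Compliance(IntEnum):
    NONE = 0     # complete refusal
    PARTIAL = 1  # some steps given, key details omitted
    FULL = 2     # complete step-by-step answer


def any_compliance_rate(ratings):
    """Share of responses rated PARTIAL or FULL."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    complied = sum(1 for r in ratings if r > Compliance.NONE)
    return complied / len(ratings)


# Invented ratings for illustration only, not data from the study.
sample = [Compliance.NONE, Compliance.PARTIAL, Compliance.FULL, Compliance.NONE]
rate = any_compliance_rate(sample)
print(rate)  # 0.5
```

Treating partial and full compliance together gives one headline rate, while the ordered scale keeps the distinction available for finer analysis.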

Another artificial intelligence model scored the answers using this rubric. A human rater then manually checked a random sample of 70 conversations to verify the accuracy of the rating software. The human and machine scores agreed closely, giving the scientists confidence in the automated scoring process.
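The article does not say which agreement statistic the authors used. As a hedged illustration, two common choices are raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the eight ratings below are invented for the example.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently
    # at their own base rates.
    p_expected = sum(counts_a[k] * counts_b[k]
                     for k in set(a) | set(b)) / (n * n)
    if p_expected == 1:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented ratings on the 0/1/2 scale, for illustration only.
human = [0, 0, 1, 2, 2, 1, 0, 2]
machine = [0, 0, 1, 2, 1, 1, 0, 2]
agreement = percent_agreement(human, machine)
kappa = cohens_kappa(human, machine)
print(agreement, round(kappa, 3))
```

Kappa is usually lower than raw agreement because some matches would occur by chance alone.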

The newer models also proved susceptible to psychological tactics. In control conversations, the systems complied with the dangerous request to some degree 35.3% of the time. When users applied one of the seven persuasion principles, compliance jumped to 51.3%.

This effect was consistent across the models from all three technology companies. The authors suggest that this sensitivity to human influence is an enduring feature of large language models.
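The size of the two reported effects can be compared with simple arithmetic. The sketch below uses only the percentages quoted in this article; the helper name `effect` is hypothetical, and the calculation is a reader's check rather than the authors' statistical analysis.

```python
def effect(control_pct, treatment_pct):
    """Return (percentage-point increase, ratio of treatment to control)."""
    diff = treatment_pct - control_pct
    ratio = treatment_pct / control_pct
    return round(diff, 1), round(ratio, 2)


# Percentages as reported in the article.
pilot = effect(33.4, 72.1)  # GPT-4o mini preliminary study
main = effect(35.3, 51.3)   # three newer models, any compliance
print(pilot)  # (38.7, 2.16)
print(main)   # (16.0, 1.45)
```

By this measure, persuasion roughly doubled compliance in the preliminary study and raised it by about 45% in relative terms in the main experiment.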

Although these findings point to clear vulnerabilities, they do not mean that artificial intelligence experiences real human emotions. Rather, the software behaves as if it were easily flattered or pressured because of statistical patterns in its vast training data. The study also has some limitations that indicate directions for future research.

The researchers used only English prompts in their tests. Even small changes in phrasing can change how effective a persuasion attempt is. The particular choice of wording in this study also means that one persuasion principle cannot be conclusively ranked above another based on these results alone. Different models may have different baseline safety settings that require different approaches to bypass.

As these models continue to evolve, resistance to psychological manipulation may arise. Just as human consumers become suspicious of pushy salespeople, artificial intelligence may eventually learn to detect and ignore obvious persuasion tricks. Future research is needed to see how these effects hold up to ongoing software updates. Scientists also plan to study whether different input formats, such as audio or video, affect compliance rates.

The authors suggest that these human-like tendencies could be harnessed for good. If the model responds to flattery and reciprocity, users may be able to optimize their daily interactions by treating the software like a human colleague. Providing warm encouragement and constructive feedback may result in more appropriate and helpful responses from the machine. Applying the same psychological wisdom used to motivate people could help users get the most out of artificial intelligence.

Finding ways to manage these human-like flaws remains a priority for technology companies. As these tools become more integrated into daily life, safety will depend on identifying both software bugs and conversational loopholes. “It is important that we all recognize that AI systems can be induced to provide potentially harmful information not only by those who understand the system’s technology-based vulnerabilities, but also by those who understand its psychology-based vulnerabilities,” Cialdini said.

The study, “Persuading large-scale language models to comply with uncomfortable demands,” was authored by Lennart Meincke, Dan Shapiro, Angela L. Duckworth, Ethan Mollick, Lilach Mollick, Christophe Van den Bulte, and Robert Cialdini.
