Modern AI is often judged to be more human-like than actual humans in Turing Test experiments.

Recent research published in the Proceedings of the National Academy of Sciences provides evidence that certain modern artificial intelligence systems can pass the standard Turing test. When instructed to adopt specific human personas, these computer programs tricked human judges into thinking they were real people more than half the time. The findings offer the first empirical evidence that modern systems can pass this key scientific benchmark, raising deep questions about the future of online communication.

To fully understand this research, it helps to know a little about large language models (LLMs). These are highly complex computer programs trained on vast amounts of text data collected from the Internet. They power the popular AI chatbots that many people use today to compose emails, brainstorm ideas, and write software.

Large language models learn statistical patterns in human language to predict the next word in a sequence. This allows them to generate remarkably natural-sounding text in response to user questions.
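To make next-word prediction concrete, here is a minimal sketch of a toy model that counts which word tends to follow which. It is an illustration only: real large language models use neural networks trained on vastly more text, and the function names and tiny corpus below are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next_word(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Tiny made-up corpus, for illustration only.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_counts(corpus)

print(predict_next_word(model, "the"))  # prints "cat" ("cat" follows "the" twice)
print(predict_next_word(model, "on"))   # prints "the"
```

Even this crude counter picks up a pattern from its training text. Scaling the same basic idea up with far richer models and data is what lets chatbots produce fluent replies.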

The researchers who conducted the study, Cameron R. Jones and Benjamin K. Bergen, wanted to see how well these modern models could handle a classical evaluation known as the Turing test. This theoretical game, originally proposed by British mathematician Alan Turing in 1950, provides a way to assess whether a machine can imitate human speech so well that it is completely indistinguishable from a real human.

In the standard three-way version of the test, a human judge converses with two hidden participants at exactly the same time using a text chat interface. One of these hidden participants is a real human and the other is a computer program. If the human judge cannot reliably guess which participant is the machine, the computer is said to have passed the test.
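One way to make "cannot reliably guess" concrete is to compare how often the machine is picked as the human against the 50 percent rate expected from pure guessing. The sketch below uses an exact one-sided binomial test for that comparison. It is an assumption made for illustration, not necessarily the analysis the researchers ran, and the counts, function names, and 0.05 threshold are hypothetical.

```python
from math import comb


def lower_tail_p_value(times_judged_human, total_games, chance=0.5):
    """Probability of the machine being picked as human this rarely or less
    often, if judges were only guessing (exact binomial lower tail)."""
    return sum(
        comb(total_games, k) * chance**k * (1 - chance) ** (total_games - k)
        for k in range(times_judged_human + 1)
    )


def judges_reliably_spot_machine(times_judged_human, total_games, alpha=0.05):
    """True if the machine is picked as human significantly less than half the time."""
    return lower_tail_p_value(times_judged_human, total_games) < alpha


# Hypothetical counts, not data from the study.
print(judges_reliably_spot_machine(30, 100))  # prints True: judges usually catch it
print(judges_reliably_spot_machine(48, 100))  # prints False: close to guessing
```

Under this reading, a system "passes" when the second case holds: judges pick it as the human about as often as they would by flipping a coin, or more often.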

Jones and Bergen began this study because standard assessments of artificial intelligence tend to be static and narrow. High scores on common benchmark tests may simply reflect memorization of training data rather than adaptability to real-life conversations. The traditional Turing test provides a flexible, interactive environment in which human judges can ask open-ended questions and probe the machine for perceived weaknesses.

“The Turing test started as a way to ask whether machines could match human intelligence,” said Bergen, a cognitive science professor at the University of California, San Diego. “But we now know that AI can answer many questions faster and more accurately than humans, so the real problem is not one of raw brains.”

“Seeing that a machine can pass a test, and how a machine passes a test, forces you to rethink what it’s measuring,” Bergen says. “Increasingly, we are measuring our humanity.”

To explore these questions, the scientists set up a series of highly controlled online experiments involving nearly 500 participants in total. They evaluated several different artificial intelligence systems, including GPT-4.5, LLaMa-3.1-405B, and GPT-4o. The scientists also included an older and much simpler chatbot program from the 1960s called ELIZA as a baseline.

In the first study, researchers recruited 126 undergraduate students from the University of California, San Diego. The second study recruited 158 participants through an online survey platform called Prolific. Human participants were randomly assigned to play the role of judge or human interlocutor.

To conduct the study, the researchers built an online interface designed to feel like a familiar messaging app. “For the interrogator, there’s a split screen on the computer and he’s questioning both witnesses,” said Jones, who conducted the research while earning his PhD in cognitive science at the University of California, San Diego, and is currently an assistant professor of psychology at Stony Brook University. “They know that one of the witnesses is human and one of them is AI.”

Each judge had exactly five minutes to speak with both participants. When the time was up, the judge had to decide which conversation partner was the real person, rate their confidence in that decision, and explain why.

The researchers tested two different sets of instructions, known as prompts, on the artificial intelligence models. The first prompt simply told the computer that it was participating in a Turing test. The second was a persona prompt, which gave the computer a detailed character to play and asked it to act like a young introvert familiar with internet culture and modern slang.

Persona prompts made a big difference in the results. With a persona prompt, GPT-4.5 was identified as human 73 percent of the time. This means judges picked it as the human far more often than they picked the actual human participants. When the LLaMa-3.1-405B model was given the same prompt, it was identified as human 56 percent of the time.

“What we found is that given the right prompts, advanced LLMs can display the same tone, directness, humor, and fallibility as humans,” Jones says. “We know that LLMs can easily generate knowledge on almost any topic, but in this test we also showed that LLMs can convincingly display social behavioral traits. This has major implications for the way we think about AI.”

However, without specific persona prompts, the models performed much worse. Without these instructions, GPT-4.5’s win rate dropped to 36 percent, and LLaMa-3.1-405B’s win rate dropped to 38 percent. The baseline systems performed poorly overall, with ELIZA selected as human only 23 percent of the time and GPT-4o selected 21 percent of the time.

“They have the ability to look human-like, but they probably have less of an ability to understand what it takes to look human-like,” Bergen said.

The researchers then conducted a replication study to see if longer conversations helped human judges accurately identify machines. They recruited 205 new participants from Prolific and extended the chat time limit to 15 minutes. This time they tested GPT-5 alongside the LLaMa-3.1-405B model and ELIZA.

Even with the extra time, human judges still had trouble identifying machines that used persona prompts. The GPT-5 model was identified as human 59 percent of the time, and LLaMa-3.1-405B was selected 56 percent of the time. Human judges easily identified the simple ELIZA program, guessing that it was human only 17 percent of the time.

The researchers analyzed the specific conversational strategies that human judges used to determine who was genuine. Most of the judges tried to make small talk and asked about daily activities, personal opinions, and emotional experiences. Judges tended to assume that participants were human if they made minor spelling errors, appeared to lack knowledge about a particular topic, or responded directly without sounding overly formal.

“These traits are not the kind of intelligence Turing envisioned for solving mathematical and logical problems,” Bergen said.

Additionally, the scientists found that university students performed slightly better as judges than online participants. This suggests that the students shared more common ground, such as details of their local campus, which may have helped them probe each other more effectively.

The authors caution against misinterpreting the meaning of the results. Just because a machine passes the Turing test doesn’t mean it has true human intelligence or consciousness. Rather, it suggests that this machine is very good at conforming to human expectations about how other people chat online.

This study also has obvious limitations. The high success rate of the large language models depended entirely on the specific persona prompts provided by the researchers. Without these detailed instructions, the models were unable to consistently fool judges, showing that they still require human guidance to behave in a convincingly human way.

Future research could investigate how different types of judges perform on this classic test. Scientists might test whether computer science experts are better at spotting artificial intelligence than the average person. Researchers might also consider whether everyday humans can be trained to recognize machine-generated text over long periods of time.

This finding has real-world implications for online trust. “It’s relatively easy to make these models indistinguishable from humans,” Jones says. “We need to be more vigilant. When interacting with strangers online, people should be less confident that they are talking to a human being and not an LLM.”

“The Turing test is a game of lying for the sake of the model,” Jones said. “One of the implications of that is that the model seems to be very good at it.”

Not being able to tell whether you’re interacting with a human or a bot can have serious implications for everyday people. “There are a lot of people who want to use bots to persuade people to share their Social Security numbers, to vote for their party, or to buy their products,” Bergen said.

The study “Large-scale language models pass the standard three-way Turing test” was authored by Cameron R. Jones and Benjamin K. Bergen.
