AI models completely fail classic psychological tests as cognitive demands increase

New research provides evidence that while advanced artificial intelligence models process language with remarkable skill, they struggle with tasks that require the sustained focus and conflict resolution characteristic of human attention.

The study, published in PNAS Nexus, shows that as cognitive demands increase, the ability of these programs to override automatic responses completely collapses. This finding suggests that artificial intelligence systems currently lack the fundamental executive control needed to develop true artificial general intelligence.

To understand these findings, it helps to examine how modern artificial intelligence works. Programs like ChatGPT are built on a framework called the transformer architecture. It uses an attention mechanism that lets the model assign weights to different parts of the input text and predict which word will come next based on statistical patterns.
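The weighting step described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention weights for a single query, not code from the study; the two-dimensional vectors and the function names are assumptions chosen for readability.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-d vectors standing in for three tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print(weights)  # the key most similar to the query gets the largest weight
```

Every token keeps some positive weight after the softmax; the mechanism ranks inputs rather than excluding any of them outright.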

Suketu Patel is a doctoral candidate in Comparative and Cognitive Psychology at the City University of New York Graduate Center. Patel and his colleagues conducted the study in Jin Fan’s lab at Queens College in New York. He noted that the initial public reception of modern language models inspired the research team to investigate the software’s true cognitive capabilities.

“When ChatGPT came out, a lot of the excitement centered around its ability to complete tasks and its apparent theory of mind and emotional intelligence,” Patel said. “Still, these models were prone to hallucinations and confabulations. LLM performance was strong in some tasks but surprisingly weak in others. We needed a standard attention task to rigorously investigate these systems and compare them to biological attention.”

Human attention is a complex process supported by multiple interconnected brain networks. “The Stroop task is appropriate because the success of LLMs relies on transformer attention mechanisms,” Patel said. “In humans, attention consists of three separate but overlapping systems: vigilance, orienting, and executive control. So we decided to test whether these models had all three.”

First introduced in the 1930s, the Stroop task measures how well subjects can process conflicting information. In the standard version, participants see a color word such as “BLUE” printed in red ink and must name the ink color instead of reading the word. “It’s worth emphasizing that the Stroop task is not a test of thinking or high-level reasoning,” Patel said. “It specifically targets conflict resolution and control.”

The automatic human response is to read the word itself, and overriding that response requires active inhibition. “The core idea is that word reading in humans is essentially automatic, and highly trained responses become what we call prepotent responses, the ones that fire first and most strongly,” Patel explained. “AI is in a similar position, as it is far more heavily trained to read words than to name colors.”

The researchers investigated two major artificial intelligence models: OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet. The models received an image as a prompt and were asked either to read the words shown or to name the color in which each word was printed. The team tested the models under five conditions: words printed in matching colors, words printed in non-matching colors, a mix of the two, neutral office words, and strings of the letter “X.”

To test how well the models could sustain attention, the scientists varied the number of words displayed in each image, from 1 to 40. “Goal maintenance is the ability to hold fast to instructions and continue to follow them under any circumstances while excluding interfering information,” Patel said. “Humans develop this ability over time. AI can certainly follow instructions and achieve goals, but it does so in fundamentally different ways, and those differences become more pronounced as the context becomes longer or contains contradictory information.”
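The design described above (five conditions, lists of 1 to 40 words) can be sketched as a small stimulus generator. This is an illustrative reconstruction only: the color palette, the office words, the condition labels, and the function name `make_trial` are assumptions, and the study rendered its stimuli as images rather than as the text pairs used here.

```python
import random

COLORS = ["red", "blue", "green", "yellow"]             # assumed palette
NEUTRAL_WORDS = ["desk", "chair", "stapler", "folder"]  # assumed office words

def make_trial(condition, n_items, rng):
    """Return a list of (text, ink_color) pairs for one stimulus list."""
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        # Assumption: a "mixed" list draws matching and non-matching items at random.
        kind = rng.choice(["congruent", "incongruent"]) if condition == "mixed" else condition
        if kind == "congruent":
            text = ink                                   # word matches its ink color
        elif kind == "incongruent":
            text = rng.choice([c for c in COLORS if c != ink])
        elif kind == "neutral":
            text = rng.choice(NEUTRAL_WORDS)
        elif kind == "nonword":
            text = "XXXX"
        else:
            raise ValueError(f"unknown condition: {condition}")
        items.append((text.upper(), ink))
    return items

rng = random.Random(0)  # fixed seed so the lists are reproducible
trial = make_trial("incongruent", 40, rng)
print(trial[:3])
```

In a color-naming trial the correct answers are the ink colors, and in a word-reading trial they are the texts, so the same list can serve both instructions.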

When processing short lists of one or five words, the artificial intelligence models performed nearly as well as humans. They achieved high accuracy in the word reading task, and their performance dropped only slightly on color naming trials with non-matching words. However, as the lists became longer, the performance of both models in the non-matching condition completely collapsed.

With a list of five words, GPT-4o correctly named ink colors 91% of the time on non-matching trials. This accuracy plummeted to just 1% for both the 20- and 40-word lists. Claude 3.5 Sonnet remained stable slightly longer, but its accuracy ultimately dropped to just 10% on the 40-word non-matching list.

During these failures, the models completely abandoned the color naming instruction and defaulted back to reading the text. “We were surprised at how accuracy degrades at relatively small context sizes, where the list is around 10 words,” Patel said. “What made this remarkable was the contrast with the nonword condition, namely XXXX, where accuracy was nearly perfect. This gap highlights how LLMs’ automatic reading behavior, just like that of humans, requires meaningful words.”
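One way to separate the two outcomes described above, naming the ink correctly versus falling back to reading the word, is to score each response against both. The sketch below is a hypothetical scoring helper, not the authors' analysis code; the function name and the returned keys are assumptions.

```python
def score_color_naming(items, responses):
    """Score a color-naming trial.

    items: list of (text, ink_color) pairs shown to the model.
    responses: one answer string per item, in order.
    """
    if not items or len(items) != len(responses):
        raise ValueError("need one response per item")
    correct = 0
    intrusions = 0
    for (text, ink), response in zip(items, responses):
        answer = response.strip().lower()
        if answer == ink.lower():
            correct += 1        # followed the instruction
        elif answer == text.lower():
            intrusions += 1     # read the word instead of naming the ink
    n = len(items)
    return {"accuracy": correct / n, "intrusion_rate": intrusions / n}

# Made-up example: two correct answers and two word-reading intrusions.
items = [("BLUE", "red"), ("GREEN", "yellow"), ("RED", "blue"), ("YELLOW", "green")]
responses = ["red", "green", "red", "green"]
result = score_color_naming(items, responses)
print(result)  # {'accuracy': 0.5, 'intrusion_rate': 0.5}
```

A high intrusion rate alongside low accuracy is the pattern the article describes: the errors are not random, they are the word itself.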

The researchers suggest that the models fail this way because they lack the kind of top-down control found in the human brain. “Our central argument is that this limitation is due to the lack of an explicit mechanism for top-down modulation,” Patel told PsyPost. “That is, a mechanism in which rules or goals actively enforce priorities among competing representations from the start, so that a constraint is maintained by suppressing competitors rather than merely deprioritizing them.”

Without this kind of override, the models are overwhelmed by their most heavily trained habits. “This study shows that the ability to detect and resolve conflicts at the signal level is reduced because the transformer’s attention can only impose soft constraints on its automatic reading, rather than hard constraints like those provided by executive control mechanisms,” Patel added.
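The distinction between soft and hard constraints can be illustrated with toy arithmetic. The sketch below is not the authors' model and makes no claim about what happens inside GPT-4o or Claude; it assumes a single softmax over one target and several equally scored distractors, with a hypothetical `penalty` standing in for down-weighting.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def distractor_share(n_distractors, penalty, hard=False):
    """Fraction of total weight that lands on distractors.

    The target has score 0. Each distractor is either down-weighted by
    `penalty` (soft constraint) or masked out entirely (hard constraint).
    """
    distractor_score = -math.inf if hard else -penalty
    scores = [0.0] + [distractor_score] * n_distractors
    weights = softmax(scores)
    return sum(weights[1:])

for n in (1, 5, 10, 20, 40):
    soft = distractor_share(n, penalty=2.0)
    masked = distractor_share(n, penalty=2.0, hard=True)
    print(n, round(soft, 3), masked)
```

Under these assumptions, a soft penalty leaves every distractor with some weight, so their combined share grows as the list gets longer, while a hard mask keeps it at exactly zero regardless of length.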

Newer artificial intelligence systems may try to circumvent this problem by using additional programming layers. “Scaffolding techniques found in modern AI systems include the use of tools, reasoning, and code generation to replace missing components, but each is still bolted onto the underlying model that propagates errors,” Patel said.

Relying on external code to solve the test misses the point of a cognitive assessment. “This is why strategies that avoid having to inhibit the strong word-reading response defeat the purpose of the Stroop task,” Patel explained. “Some of the models we studied were inconsistent in whether they resorted to code, but once the code was executed, they tended to completely solve the task.”

The scientists address this issue extensively in their report, pointing out that relying on code generation is not true cognitive control. “Shortcutting a task through chain-of-thought reasoning or code generation is really just avoiding it, glossing over signal-level deficiencies that become important as goals become more complex,” Patel said. “Humans can cheat in exactly the same way. They can verbalize their answers, blur their vision, or use tools that prevent them from reading the words. Each of those moves invalidates the assessment.”

The study has certain methodological limitations, and the researchers note that the models could ultimately pass similar tests through brute-force pattern recognition. “We do not argue that LLMs cannot perform this task,” Patel said. “With more training data, they could certainly handle even larger contexts.”

“But it will be a task-specific kind of gating, achieved through sheer exposure, rather than a general form of control that does not rely on intensive training,” Patel added. “It is also noteworthy that very few tasks share the specific dynamics of the Stroop task, in which one response (reading) is so strongly preactivated that it competes with the instructed response (color naming).”

These findings challenge current assumptions within the technology industry. “Thus, the Stroop task is not just a measure of task performance, but a diagnostic of the structural constraints of LLMs,” Patel said. “The bitter lesson, and the implicit bet behind scaling to ever larger models on the way to artificial superintelligence (ASI), is that this gating mechanism, called executive control in neuroscience, will emerge from greater scale and data without a dedicated architecture.”

Future developments in artificial intelligence may require more than simply increasing data processing speeds and expanding text databases. “We have started looking at ways to incorporate executive control directly into current AI architectures,” Patel said. “We believe this is an essential component of long-term instruction following: the ability to stay on task through complex interactions over time.”

The study, “Executive Control Deficiencies in Transformer Attention,” was authored by Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan.
