Artificial intelligence programs designed to process and generate text show strong verbal reasoning abilities but struggle with visual and numerical puzzles. A new study evaluating a variety of commercial and open-source models on traditional intelligence tests reveals significant differences in performance depending on question format. The research was published in Computers in Human Behavior: Artificial Humans.
Large language models are computer algorithms trained on vast amounts of text data collected from the Internet. They calculate the statistical probability of which word should follow the words that came before. Because they are essentially designed as advanced text prediction engines, scientists debate whether these programs actually understand what is being said or are simply mimicking human language patterns.
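The idea of predicting the next word from statistics can be illustrated with a toy example. The sketch below is not how the models in the study are built (they use neural networks over much longer contexts); it simply counts which word follows which in a tiny made-up corpus. The function names and the corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Convert the follow-up counts for one word into probabilities."""
    followers = counts[prev.lower()]
    total = sum(followers.values())
    if total == 0:
        return {}  # the word never appeared with a successor
    return {word: n / total for word, n in followers.items()}

# Hypothetical toy corpus, used only for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the toy model rates "cat" as the more likely continuation.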
Standard benchmarks, such as the Massive Multitask Language Understanding exam, test how well artificial intelligence systems can recall specialized academic facts. A high score on a law or medical exam is impressive, but it only shows that the program can recall information it has already seen in its training data. These tests do not directly measure a machine’s ability to perform generalized abstract reasoning.
To fill this gap, scientists are turning to cognitive tests designed for humans. IQ tests assess what psychologists call fluid intelligence: the ability to think logically and solve problems in new situations, independent of acquired knowledge. Sections featuring spatial rotation prompts and word analogies present unfamiliar scenarios and require test takers to infer the underlying rules of the puzzle without relying on memorized trivia.
Lead researcher Sherif Abdelkarim, a computer scientist at the University of California, Irvine, organized a study to see how artificial intelligence programs handle these fluid intelligence tests. He co-authored the study with David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi. The team wanted to measure whether advanced models had general reasoning skills that were independent of specific academic knowledge.
The researchers selected 18 different large language models to provide a comprehensive view of the current landscape. They tested proprietary systems developed by major technology companies as well as open-source models created by the broader research community. By comparing models of different sizes, the team wanted to track how cognitive limits change as the software grows larger.
This assessment is based on the Self-Scoring Intelligence Quotient Suite, first published in 1996. This test includes 14 different categories covering three modes of thinking. The verbal section asks candidates to identify synonyms or complete complex analogies. In the numerical section, participants must solve arithmetic equations based on implicit mathematical rules or identify missing numbers in a number sequence. The visual section asks participants to analyze geometric shapes, imagine those shapes rotating in space, and predict the next image in a matrix pattern.
Administering a test designed for humans to a computer program poses distinct logistical challenges. Language models generate responses based on probabilities, so the same prompt asked twice may yield completely different answers. The researchers adjusted the models’ sampling settings, setting a parameter known as temperature to zero. This setting minimizes the randomness of the program so that it consistently gives its most likely answer.
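The effect of temperature can be sketched with a small example. The code below is an illustration rather than the study's setup: the names `softmax_with_temperature` and `pick_token` and the example scores are hypothetical, and treating a temperature of zero as "always pick the highest-scoring option" is a common convention assumed here, since dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature, rng=random):
    """Choose an index: greedy at temperature 0, random sampling otherwise."""
    if temperature == 0:
        # Assumed convention: temperature 0 means take the top-scoring option.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate answers.
logits = [2.0, 1.0, 0.1]
greedy_picks = {pick_token(logits, 0) for _ in range(100)}
sampled_picks = {pick_token(logits, 1.5) for _ in range(100)}
print(greedy_picks, sampled_picks)
```

With temperature zero the same option is chosen on every run, while a higher temperature spreads the choices across several options. Hosted services may still show small run-to-run differences in practice, so this should be read as an idealized picture.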
When analyzing the results, the researchers noticed that the size of a model influenced its performance. In software development, model size refers to the number of mathematical parameters that a system uses to connect different concepts and process information. Generally, the more parameters a system has, the better it performs.
The smallest language models, containing approximately 7 billion parameters, achieved scores equivalent to a human IQ range of 89 to 110. The largest and most advanced programs achieved estimated scores ranging from 111 to 131. In human testing protocols, a score of 100 corresponds exactly to the population mean.
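To put these numbers in context, an IQ score can be converted to an approximate population percentile. The sketch below assumes scores follow a normal distribution with a mean of 100 and a standard deviation of 15, a convention used by many modern IQ scales. The article does not state which standard deviation the test in the study uses, so the value of 15 is an assumption.

```python
from statistics import NormalDist

# Mean of 100 comes from the text; sigma=15 is an assumed convention.
IQ_SCALE = NormalDist(mu=100, sigma=15)

def iq_percentile(score):
    """Approximate share of the population (in percent) scoring below `score`."""
    return 100 * IQ_SCALE.cdf(score)

for score in (89, 100, 110, 111, 131):
    print(f"IQ {score}: roughly percentile {iq_percentile(score):.0f}")
```

Under this assumption, the range reported for the smaller models sits around the middle of the human distribution, while the top of the range for the largest models falls in the upper few percent.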
Despite the high intelligence estimates for the larger models, the researchers found large variations across different subject areas. The models showed a strong bias toward verbal tasks. For example, OpenAI’s GPT-4 correctly answered 79 percent of the verbal questions, but only 53 percent of the numerical questions. This split makes intuitive sense because the model is primarily trained on linguistic data rather than numerical logic systems.
The gap widened further when comparing textual and visual comprehension. The top model achieved an estimated IQ of about 125 on text-based questions, but hovered around an estimated score of 103 on visual questions. Some visual reasoning sections stumped the programs entirely. In the section that required counting specific shapes hidden within larger overlapping geometric patterns, all models had a success rate of 0 percent.
These programs also demonstrated a persistent inability to answer abstract numerical puzzles. Even the most advanced commercial models performed badly on missing number tasks, which ask candidates to find hidden mathematical relationships within a series of numbers and fill in the blanks. In this section, no model achieved more than 20 percent accuracy. The researchers note that these programs lack external memory capabilities and struggle to hold information in a temporary working space when performing multi-step operations.
The researchers also evaluated the specialized personality settings provided by Microsoft’s Bing Chat interface. This interface allows users to dictate whether the chat agent behaves in a creative, precise, or balanced manner. These three modes use exactly the same underlying software architecture, but are guided by hidden instructions that modify their behavior.
Creative mode achieved the highest score, with an estimated IQ of 132. It performed exceptionally well on analogies and tasks that required innovative and flexible thinking. Precise mode scored slightly lower overall, but did better on rigorous logical reasoning sequences. Balanced mode performed the worst of the three. This result suggests that attempting to combine instructions for both accuracy and creativity actually impedes the program’s ability to reason effectively and leads to weaker responses.
To see if performance could be improved beyond these baseline scores, the team designed a multi-agent system. In this setup, one model generates an initial answer, a second model critiques that answer, and a third model uses that critique to suggest revisions. The first program then tries to answer the original question again using the new advice. This mimics the human peer review process.
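The answer, critique, advise, and retry loop described above can be sketched as follows. The article does not give the study's prompts or code, so everything here is hypothetical: the `Agent` type, the prompt wording, and the stub agents, which stand in for real model calls and return canned text.

```python
from typing import Callable, Dict

# Hypothetical interface: an agent takes a prompt string and returns text.
Agent = Callable[[str], str]

def answer_with_review(question: str, solver: Agent, critic: Agent,
                       advisor: Agent) -> Dict[str, str]:
    """Run one round of answer -> critique -> advice -> revised answer."""
    first = solver(f"Question: {question}\nAnswer:")
    critique = critic(
        f"Question: {question}\nProposed answer: {first}\nCritique this answer:"
    )
    advice = advisor(
        f"Question: {question}\nAnswer: {first}\nCritique: {critique}\n"
        "Suggest how to revise the answer:"
    )
    # The original solver retries, now seeing the advice.
    second = solver(
        f"Question: {question}\nPrevious answer: {first}\n"
        f"Advice: {advice}\nRevised answer:"
    )
    return {"first": first, "critique": critique,
            "advice": advice, "second": second}

# Stub agents with canned replies, used only to make the sketch runnable.
def stub_solver(prompt: str) -> str:
    return "8" if "Advice:" in prompt else "6"

def stub_critic(prompt: str) -> str:
    return "The terms double each time, so 6 does not fit."

def stub_advisor(prompt: str) -> str:
    return "Double the last term instead of adding 2."

result = answer_with_review("What comes next: 1, 2, 4, ?",
                            stub_solver, stub_critic, stub_advisor)
print(result["first"], "->", result["second"])
```

Because each role is just a function, the solver and critic can be swapped for models of different sizes, which is the kind of pairing the study compared.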
The makeup of this team strongly affected the final test scores. When the researchers assigned a smaller model to answer questions and a larger, more sophisticated model to act as the critic, the smaller model improved its score on the second try. The larger critic steered the smaller model toward the correct logic.
Conversely, when the larger model answered the question first and the smaller model acted as the critic, the larger model performed worse on the second attempt. The flawed criticisms raised by the smaller program caused the larger model to doubt its own initially correct answers. Letting the largest models act as their own critics provided little additional benefit, suggesting that the reasoning capabilities of top systems may have temporarily plateaued.
The research has certain limitations regarding how intelligence is defined and measured. The tests used in this assessment were originally designed to measure human cognitive abilities and may not accurately capture the unique inner workings of artificial intelligence systems. Such systems can ingest millions of text documents in seconds, but have no physical interaction with the real world. Many psychologists also question the validity of intelligence tests even for humans, arguing that they are imperfect tools for measuring general cognitive ability.
Future research could involve administering current clinical diagnostic assessments used by psychologists in professional medical settings. The researchers also hope to conduct large-scale trials that focus solely on images, as visual reasoning remains a major hurdle for the current generation of generative artificial intelligence software.
The study, “Assessing the Intelligence of Large-Scale Language Models: A Comparative Study Using Verbal and Visual IQ Tests,” was authored by Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi.

