    Health Magazine
    Mental Health

    Artificial intelligence models show major gaps with traditional human intelligence tests

    By healthadmin, June 29, 2026


    Artificial intelligence programs designed to process and generate text show strong verbal reasoning abilities but struggle with visual and numerical puzzles. A new study that evaluated a range of commercial and open-source models on traditional intelligence tests found significant differences in performance depending on question format. The results were published in Computers in Human Behavior: Artificial Humans.

    Large language models are computer algorithms trained on vast amounts of text data collected from the Internet. They calculate the statistical probability of which word is most likely to follow the preceding words. Because they are essentially designed as advanced text prediction engines, scientists debate whether these programs actually understand what is being said or are simply mimicking human language patterns.
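    The idea of predicting the next word from probabilities can be illustrated with a toy bigram counter. This is only a minimal sketch: real language models use neural networks over much longer contexts, and the corpus and the function names `train_bigram` and `next_word_probs` are invented here for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Convert the raw follow-counts for `prev` into probabilities."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen with a successor
    return {word: c / total for word, c in followers.items()}

# Tiny made-up corpus, purely for demonstration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

    In this toy corpus, "cat" follows "the" more often than "mat" does, so it receives the higher probability.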

    Standard benchmarks, such as the Massive Multitask Language Understanding exam, test how well artificial intelligence systems can recall specialized academic facts. A high score on a law or medical exam is impressive, but it may only show that the program can recall information it has already seen in its training data. These tests do not directly measure a machine’s ability to perform generalized abstract reasoning.

    To fill this gap, scientists are turning to cognitive tests designed for humans. IQ tests assess what psychologists call fluid intelligence: the ability to think logically and solve problems in new situations, regardless of acquired knowledge. Sections featuring spatial rotation prompts and word analogies present unfamiliar scenarios and require test takers to infer the underlying rules of the puzzle without relying on memorized trivia.

    Lead researcher Sherif Abdelkarim, a computer scientist at the University of California, Irvine, organized a study to see how artificial intelligence programs handle these fluid intelligence tests. He co-authored the study with David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi. The team wanted to measure whether advanced models had general reasoning skills that were independent of specific academic knowledge.

    The researchers selected 18 different large language models to provide a comprehensive view of the current software landscape. They tested proprietary systems developed by major technology companies as well as open-source models created by the broader research community. By comparing models of different sizes, the team wanted to track how cognitive limits change as the models grow larger.

    The assessment was based on the Self-Scoring Intelligence Quotient Suite, first published in 1996. The test includes 14 categories covering three modes of thinking: verbal, numerical, and visual. The verbal section asks candidates to identify synonyms or complete complex analogies. In the numerical section, participants must solve arithmetic equations based on implicit mathematical rules or identify missing numbers in a number sequence. The visual section asks participants to analyze geometric shapes, imagine those shapes rotating in space, and predict the next image in a matrix pattern.

    Administering a test designed for humans to a computer program poses distinct logistical challenges. Language models generate responses based on probabilities, so the same prompt asked twice may produce completely different answers. The researchers adjusted the models’ sampling settings, setting a parameter known as temperature to zero. This setting minimizes randomness so that the program consistently returns its most likely answer.
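    To make the role of temperature concrete, the sketch below shows the standard temperature-scaled softmax and how it collapses onto the single most likely option as temperature approaches zero. The scores in `logits` are made up, and treating a temperature of exactly zero as a pure "pick the top option" rule is an assumption about how such a setting is typically handled, not a description of any particular vendor's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumed limiting case: all probability goes to the top-scoring option.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate answers.
logits = [2.0, 1.0, 0.5]
warm = softmax_with_temperature(logits, 1.0)  # some randomness remains
cool = softmax_with_temperature(logits, 0.1)  # heavily favors the top answer
cold = softmax_with_temperature(logits, 0.0)  # always the top answer
print(warm, cool, cold)
```

    Lower temperatures concentrate probability on the highest-scoring answer, which is why a setting of zero makes repeated runs of the same prompt far more consistent.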

    When analyzing the results, the researchers noticed that the size of a model influenced its performance. Model size refers to the number of mathematical parameters that a system uses to connect different concepts and process information. Generally, models with more parameters tend to perform better.

    The smallest language model, containing approximately 7 billion parameters, achieved scores equivalent to the human intelligence quotient range of 89 to 110. The largest and most advanced programs achieved simulated scores ranging from 111 to 131. In human testing protocols, a score of 100 corresponds exactly to the population mean.
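    For readers who want to place these scores on the human scale, the sketch below converts an IQ score into an approximate population percentile. It assumes the common convention of a normal distribution with mean 100 and standard deviation 15; the standard deviation is an assumption that the article does not state, and the study's own conversion from raw scores to IQ estimates may differ.

```python
from statistics import NormalDist

IQ_MEAN = 100  # population mean, as stated in the article
IQ_SD = 15     # assumed standard deviation (common convention, not from the study)

def iq_percentile(iq):
    """Approximate share of the population scoring at or below `iq`, in percent."""
    return NormalDist(mu=IQ_MEAN, sigma=IQ_SD).cdf(iq) * 100

# The range endpoints reported for the smallest and largest models.
for score in (89, 100, 110, 111, 131):
    print(score, round(iq_percentile(score), 1))
```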

    Despite the high intelligence estimates for the largest models, the researchers found large variations across subject areas. The models showed a strong bias toward linguistic tasks. For example, OpenAI’s GPT-4 correctly answered 79 percent of the verbal questions, but only 53 percent of the numerical questions. This split makes intuitive sense because the model is trained primarily on linguistic data rather than numerical logic systems.

    The gap widened further when comparing textual and visual comprehension. The top model achieved an estimated IQ of about 125 on text-based questions, but only about 103 on visual questions. Some visual reasoning sections defeated the programs entirely. In the section that required counting specific shapes hidden within larger overlapping geometric patterns, every model had a success rate of 0 percent.

    The programs also showed a persistent inability to solve abstract numerical puzzles. Even the most advanced commercial models performed badly on missing number tasks, which ask candidates to find hidden mathematical relationships in a series of numbers and fill in the blanks. In this section, no model achieved more than 20 percent accuracy. The researchers note that these programs lack external memory capabilities and struggle to hold information in a temporary working space while performing multi-step operations.

    The researchers also evaluated the conversation style settings provided by Microsoft’s Bing Chat interface. This interface allows users to choose whether the chat agent behaves in a creative, precise, or balanced manner. The three modes use exactly the same underlying software architecture, but are guided by hidden instructions that modify their behavior.

    Creative mode achieved the highest score, with an estimated IQ of 132. It performed exceptionally well on analogies and tasks that required innovative and flexible thinking. Precise mode scored slightly lower overall, but did better on rigorous logical reasoning sequences. Balanced mode performed the worst of the three. This result suggests that combining instructions for both accuracy and creativity may actually impede the program’s ability to reason effectively and lead to weaker responses.

    To see if performance could be improved beyond these basic scores, the team designed a multi-agent system. In this setup, one artificial intelligence generates an initial answer, a second artificial intelligence criticizes that answer, and a third artificial intelligence uses that criticism to suggest modifications. The first program then tries to answer the original question again using the new advice. This mimics the human peer review process.
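    The review loop described above can be sketched in a few lines. This is an illustrative outline only, not the researchers’ implementation: the function `answer_with_review`, the prompt wording, and the stub agents are all hypothetical, and in practice each agent would be a call to a real language model.

```python
from typing import Callable, Dict

# An agent is modeled as any function that maps a prompt to a text reply.
Agent = Callable[[str], str]

def answer_with_review(question: str, answerer: Agent,
                       critic: Agent, reviser: Agent) -> Dict[str, str]:
    """Run one answer -> critique -> suggested revision -> second answer cycle."""
    first = answerer(question)
    critique = critic(f"Question: {question}\nAnswer: {first}\nCritique this answer.")
    advice = reviser(f"Question: {question}\nCritique: {critique}\nSuggest modifications.")
    # The first agent retries the original question with the new advice.
    final = answerer(f"Question: {question}\nAdvice: {advice}\nAnswer again.")
    return {"first": first, "critique": critique, "advice": advice, "final": final}

# Hypothetical stand-ins for real models, used only to show the data flow.
def stub_answerer(prompt: str) -> str:
    return "8" if "Advice:" in prompt else "6"

def stub_critic(prompt: str) -> str:
    return "Each term doubles, so 6 does not fit the pattern."

def stub_reviser(prompt: str) -> str:
    return "Double the previous term."

result = answer_with_review("2, 4, ?, 16", stub_answerer, stub_critic, stub_reviser)
print(result["first"], "->", result["final"])
```

    Keeping each role as a separate callable makes it easy to swap a smaller or larger model into any position, which is the comparison the team went on to run.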

    The composition of this team strongly affected the final test scores. When the researchers assigned a smaller model to answer questions and a larger, more sophisticated model to act as the critic, the smaller model improved its score on the second try. The larger critic accurately guided the smaller model toward the correct logic.

    Conversely, when the larger model answered the question first and the smaller model acted as the critic, the larger model performed worse on the second try. The flawed criticisms raised by the small program caused the large model to question its own initially correct answers. Letting the largest models act as their own critics provided little additional benefit, suggesting that the reasoning capabilities of top systems may have temporarily plateaued.

    The research has limitations regarding how intelligence is defined and measured. The tests used in this assessment were originally designed to measure human cognitive abilities and may not accurately capture the unique inner workings of artificial intelligence systems, which can ingest millions of text documents in seconds but have no physical interaction with the real world. Many psychologists also debate the validity of intelligence tests even for humans, arguing that they are imperfect tools for measuring general mental ability.

    Future research could involve administering current clinical diagnostic assessments used by psychologists in professional medical settings. The researchers also hope to conduct large-scale trials that focus solely on images, as visual reasoning remains a major hurdle for the current generation of generative artificial intelligence software.

    The study, “Assessing the Intelligence of Large-Scale Language Models: A Comparative Study Using Verbal and Visual IQ Tests,” was authored by Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi.


