Researchers have noticed a growing problem: artificial intelligence systems are scoring very high on academic benchmarks that have been used for years, and the tests once posed to machines are no longer difficult enough. Well-known assessments such as Massive Multitask Language Understanding (MMLU), once considered highly demanding, can no longer adequately measure the capabilities of today’s advanced AI models.
To address this problem, a global group of nearly 1,000 researchers, including faculty at Texas A&M University, developed a new kind of test. Their goal was to build a benchmark broad and difficult enough, and grounded deeply enough in human expertise, that current AI systems still struggle with it.
The result is Humanity’s Last Exam (HLE), a 2,500-question assessment covering mathematics, the humanities, the natural sciences, ancient languages, and a wide range of highly specialized academic fields. Additional information about the exam is available at lastexam.ai.
Among the many contributors is Dr. Tung Nguyen, an associate professor in the Department of Computer Science and Engineering at Texas A&M, who helped write and refine many of the exam’s questions.
“When an AI system starts performing really well on human benchmarks, it’s tempting to think that it’s getting closer to human-level understanding,” Nguyen says. “But HLE reminds us that intelligence is about more than just pattern recognition; it’s about depth, context, and expertise.”
The purpose of the exam is not to trick or embarrass the AI systems that take it. Instead, the goal is to carefully identify the areas where those systems still fall short.
A global effort to test the limits of AI
Experts from around the world created and reviewed the questions included in Humanity’s Last Exam. Each question is designed to have a single clear, verifiable answer, and each is written so that it cannot be answered with a quick internet search.
The questions draw on advanced academic material: some ask test takers to translate ancient Palmyrene inscriptions, others to identify tiny anatomical structures in birds, and still others to analyze fine details of Biblical Hebrew pronunciation.
The researchers tested every question against leading AI systems. If a model answered a question correctly, that question was removed from the final exam, a filtering step that kept the benchmark just beyond what current AI systems can reliably solve.
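As a rough illustration only (the article does not describe the project’s actual pipeline), this screening step can be pictured as a filter that drops any candidate question a frontier model already answers correctly. The model list, the query_model helper, and the exact-match grading in this minimal Python sketch are hypothetical placeholders, not the team’s real tooling.

```python
# Hypothetical sketch of adversarial question filtering: keep only questions
# that none of the reference models answers correctly.
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    answer: str  # single, verifiable reference answer


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: send `prompt` to the named model and return its answer."""
    raise NotImplementedError


def is_correct(model_answer: str, reference: str) -> bool:
    # Simplistic exact-match grading; real grading would need normalization
    # or expert review of the model's answer.
    return model_answer.strip().lower() == reference.strip().lower()


def filter_questions(candidates: list[Question], models: list[str]) -> list[Question]:
    kept = []
    for q in candidates:
        solved_by_any = any(
            is_correct(query_model(m, q.prompt), q.answer) for m in models
        )
        if not solved_by_any:  # only questions no model solves survive
            kept.append(q)
    return kept
```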
Initial testing confirmed that the strategy works: even powerful AI models struggled on the exam. GPT-4o scored 2.7 percent and Claude 3.5 Sonnet 4.1 percent, while OpenAI’s o1 model performed slightly better at 8 percent. The most capable systems to date, such as Gemini 3.1 Pro and Claude Opus 4.6, reach accuracy levels of around 40 to 50 percent.
Why we need new AI benchmarks
Nguyen explained that the problem of AI outgrowing older benchmarks is more than a technical concern. He contributed 73 of HLE’s 2,500 published questions, the second most of any contributor, and wrote the most questions in mathematics and computer science.
“Without accurate assessment tools, policymakers, developers and users risk misunderstanding what AI systems can actually do,” he said. “Benchmarking provides a basis for measuring progress and identifying risks.”
The researchers say that high scores on tests originally designed for humans do not necessarily indicate true intelligence. These benchmarks primarily measure how well an AI can complete specific tasks created for human learners, rather than capturing deeper understanding.
A tool, not a threat
Despite its dramatic name, Humanity’s Last Exam does not suggest that humanity is obsolete. Instead, it highlights the vast amount of knowledge and expertise that remains uniquely human.
“This is not a competition with AI,” Nguyen said. “This is a way to understand where these systems are strong and where they struggle. That understanding will help us build safer and more reliable technology. And, importantly, it will remind us why human expertise remains important.”
Building long-term AI benchmarks
Humanity’s Last Exam is designed to serve as a durable and transparent benchmark for future AI systems. To help with this goal, the researchers released the questions publicly but also kept a private held-out set, so they can detect whether a model has simply memorized the answers.
“So far, Humanity’s Last Exam is one of the clearest assessments of the gap between AI and human intelligence. Despite rapid advances in technology, that gap remains large,” Nguyen said.
A large-scale international research effort
Nguyen emphasized that the scale of the project demonstrates the value of cooperation across sectors and countries.
“What made this project extraordinary was its scale,” he said. “Experts from almost every field contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. It’s that diversity that reveals exactly where today’s AI systems fall short. Ironically, perhaps, it took humans working together to show it.”

