    Artificial intelligence struggles to consistently evaluate scientific facts

    By healthadmin · March 17, 2026 · 7 min read


    Generative artificial intelligence programs can write fluently, but they still struggle to evaluate basic scientific claims accurately and consistently. Recent research shows that asking an artificial intelligence the exact same question multiple times often returns completely different answers. These results, reported in Rutgers Business Review, highlight the limitations of current automated reasoning and the continued need for human oversight.

    Generative artificial intelligence is a type of technology that is trained on large text databases to generate human-like sentences. Today, millions of people use these applications every day for everything from marketing to software development. The software writes in an authoritative tone that often makes it sound right, even when it’s completely wrong. Some well-known consulting firms have even faced public embarrassment for relying on automated reports containing fabricated data.

    Despite these known flaws, many companies are partnering with technology vendors to incorporate these tools into their daily operations. Professionals frequently utilize automated software to analyze data, answer customer questions, and summarize research. Researchers wanted to know whether the logical abilities of these programs actually matched their impressive vocabularies. They designed tests to see if the technology could reliably evaluate rigorous business concepts.

    Mesut Cicek, an associate professor in the Department of Marketing and International Business at Washington State University, led the study. His co-authors include Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team designed an experiment to test the software’s ability to interpret academic literature.

    Researchers collected 719 scientific hypotheses from nine open-access business journals published since 2021. A hypothesis is a formal, testable prediction about how two or more things will interact in the real world. For example, a statement might predict that a certain type of advertising will increase consumer spending.

    The team presented these statements to ChatGPT, a very popular automated text generator. The program was asked to determine whether each statement was ultimately true or false based on actual research data. To test the program’s stability, the researchers sent the exact same prompt for each statement 10 separate times.

    The entire experiment was performed twice to track the progress of the technology over time. The first test was conducted in mid-2024 using an older version of the software. The researchers repeated the entire process using an updated version of the application in mid-2025.

    The results revealed a slight improvement in overall accuracy, but the raw numbers were highly misleading. The software selected the correct answer 76.5 percent of the time in 2024 and 80 percent of the time in 2025. Since there are only two possible answers to a question, a completely blind guess will be correct half of the time.

    When researchers mathematically adjusted the scores to account for random guessing, actual performance dropped significantly: the effective accuracy rate was only around 60 percent. In other words, the software barely earned a passing grade at predicting actual scientific findings.
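    The study does not spell out its exact adjustment formula, but a common way to correct a binary-choice score for guessing (the same idea behind chance-corrected agreement measures such as Cohen's kappa) is to rescale accuracy so that chance-level performance maps to zero. A minimal sketch, using the raw figures reported above:

```python
def chance_corrected(raw_accuracy: float, chance: float = 0.5) -> float:
    """Rescale raw accuracy so random guessing scores 0.0 and perfect scores 1.0.

    For a true/false task, chance = 0.5: a blind guess is right half the time,
    so raw accuracy overstates real skill.
    """
    return (raw_accuracy - chance) / (1.0 - chance)

# Figures reported in the article:
print(chance_corrected(0.765))  # 2024 run -> approximately 0.53
print(chance_corrected(0.80))   # 2025 run -> approximately 0.60
```

    Under this correction, the 2025 score of 80 percent corresponds to roughly 60 percent effective accuracy, consistent with the figure the researchers report.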

    The program performed very poorly when evaluating ideas that the original researchers had found to be false. The software correctly identified these unsupported statements only 16.4 percent of the time in 2025. The program exhibited a strong bias toward agreeing with whatever statement it was given, acting as a compliant assistant rather than an objective analyst. This tendency to confirm existing ideas creates an echo chamber that can mislead decision makers.

    Consistency has proven to be an even bigger problem for automated systems. The software often contradicted itself if you asked the same question 10 times in a row. In some cases, the program would jump back and forth between true and false on successive trials.

    “We’re not just talking about accuracy, we’re talking about consistency, because if you ask the same question over and over again, you’re going to get different answers,” Cicek said. In 2025, the program gave identical answers across all 10 attempts for only 73 percent of the statements. For more than a quarter of the questions, the software gave at least one incorrect answer out of 10 attempts.

    The lack of a stable response pattern makes the software unreliable for one-off queries. After a user asks a question, simply refreshing the page can produce a completely different answer. “There were some cases where five were true and five were false,” Cicek said.

    The researchers also categorized the test questions by logical difficulty. The software handled direct causal relationships best, where one event leads directly to another. It struggled most with conditional statements, hypotheses whose truth depends on the state of a third, moderating variable.

    These results suggest that the program relies on recognizing common word patterns rather than actually understanding concepts. It is possible to imitate the structure of a logical argument without grasping the underlying meaning or context. Although the system has a high degree of linguistic fluency, it lacks true theoretical flexibility. When faced with complex scenarios, the technology cannot adapt its reasoning.

    Software remains tied to pattern recognition rather than true understanding. “They just memorize it and can give you some insight, but they don’t understand what you’re talking about,” Cicek said. The apparent improvement over the past year appears to be due to improved text processing rather than deeper cognitive abilities.

    For managers and analysts, these limitations pose significant risks. The findings reveal that automated systems are currently too shallow to handle high-stakes decisions on their own. As the text produced by these programs becomes smoother, users can easily miss hidden conceptual flaws.

    Researchers advise experts to use artificial intelligence for speed, not substitution. Marketing teams may use text generators to brainstorm ideas or quickly summarize long reports. However, human experts must intervene to verify whether the logic is consistent with real market evidence.

    Experts also need to iterate and validate automated insights. Asking the same questions multiple times can help uncover underlying biases and instabilities in the software. Conclusions generated by artificial intelligence should be treated as diagnostic clues rather than absolute facts.

    The authors advocate building organizational literacy around automation tools. Employees need to understand exactly what these programs do well and where they fail. Organizations must train their staff to audit the reasoning behind automated answers, rather than simply trusting the numerical output.

    The ultimate goal is to create a hybrid system that combines human intelligence and automated speed. In this configuration, the software handles the structural analysis while humans retain interpretive judgment. This balanced approach ensures that technology supports human understanding rather than replacing it.

    The authors noted that the experiment had some minor limitations. This study assumes that all published and peer-reviewed findings are either completely true or false, ignoring the nuances of real-world science. Scientific discoveries can include a variety of results that do not easily fit into strict binary categories.

    The team also limited consistency testing to 10 iterations per question using a single software platform. Future studies will need more repetitions to confirm these patterns. Researchers should also test different artificial intelligence programs to see if the flaw is universal.

    Despite these limitations, research suggests that users should remain vigilant. Human judgment is still required to check these increasingly common digital systems. “Always be skeptical,” Cicek says. “I’m not against AI. I’m using AI. But we have to be very careful.”

    The study, “Unstable Intelligence: GenAI Struggles with Accuracy and Consistency,” was authored by Mesut Cicek, Sevincgul Ulu, Can Uslay, and Kate Karniouchina.


