As artificial intelligence becomes more commonplace in professional settings, human oversight is often promoted as a safeguard against automated mistakes. New research published in PNAS Nexus suggests that human experts are significantly more likely to mistakenly accept a harsh decision if they believe that the decision was made by an artificial intelligence system rather than a human colleague. This pattern provides evidence that relying on human reviews to spot algorithmic errors may not be as effective as many expect.
Study author Rigissa Megalokonomou, associate professor of economics at Monash University, was drawn to this issue by the rapid integration of new technology in the classroom. “My research is in educational economics, so pressing topics in education are naturally within my scope of interest, and right now AI is probably the most pressing topic of all,” she said. Grading decisions provide a realistic way to observe how experts react to flawed advice in high-stakes environments.
“Grading is one of the most important moments in a student’s school life, shaping their future and self-perception as a learner,” Megalokonomou explained. “If AI enters that process and introduces errors that experts can’t spot, that’s a serious problem worth studying.” Automated scoring tools promise to save time and provide consistent scoring, but they can still make mistakes and introduce bias.
When organizations use automated tools to support decision-making, they typically expect humans to review the output and spot mistakes. This expectation assumes that people can objectively evaluate the suggestions of computer programs and correct them if they deviate from the truth. To realize the benefits of automated technology without compromising accuracy or public trust, we must ensure that these monitoring mechanisms function properly.
Teachers, as the final decision-makers in the classroom, are expected to act as a safety net to find and correct mistakes. Research in this area investigates how users respond to automated advice, focusing on the psychological factors that drive acceptance or rejection. Sometimes people stop trusting computer programs completely, a concept known as algorithm aversion. In other situations, individuals may blindly trust automated output over human judgment, often referred to as automation bias.
The authors wanted to understand exactly why experts sometimes fix machine mistakes and sometimes let them pass without intervening. “The standard reassurance about AI is, ‘Don’t worry, a human will check how it works,’” Megalokonomou said. “Our research tests whether that’s actually true.” To answer this question, the researchers conducted a preregistered, randomized experiment among practicing teachers in Greece.
“We conducted an experiment with more than 1,300 teachers in Greece, randomly assigning them to grade student work in combination with intentionally incorrect scores that were labeled as coming from an AI system or a human colleague,” Megalokonomou said. “We then measured how far their final grade was from the objectively correct answer.” The participants taught a variety of subjects, including math, science, and humanities.
During the study, each teacher reviewed a sample of student work that matched their specific area of expertise. Student work was accompanied by a bulleted checklist showing exactly which parts of the answer were correct or incorrect. Along with the students’ answers, teachers reviewed the pre-assigned score of 5 out of 10, which was intentionally incorrect.
The researchers manipulated the direction of scoring error for different participant groups. In one scenario, a score of 5 out of 10 was too harsh because the student’s work actually deserved an 8 based on an objective checklist. In the second scenario, the same score was too lenient because the student only provided enough correct answers to earn 2 points.
After reviewing the assignments and suggested scores, teachers assigned their own final grades. The primary measurement in this study was the grading fairness gap. This metric is the absolute distance between a teacher’s final grade and the objectively correct benchmark grade. A large gap indicates that the teacher did not fully correct the flawed initial recommendation.
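To make the arithmetic concrete, here is a minimal Python sketch of the gap metric described above, using the scores reported for the two scenarios. The function name `grade_gap` and the example final grades of 7 and 2 are illustrative assumptions, not values from the study, and the published paper may scale or normalize the measure differently.

```python
def grade_gap(final_grade: float, correct_grade: float) -> float:
    """Absolute distance between a teacher's final grade and the benchmark grade."""
    return abs(final_grade - correct_grade)


SUGGESTED_SCORE = 5    # the intentionally incorrect pre-assigned score
HARSH_BENCHMARK = 8    # correct grade when the suggested 5 is too harsh
LENIENT_BENCHMARK = 2  # correct grade when the suggested 5 is too lenient

# Harsh scenario: a teacher who keeps the 5 leaves a 3-point gap.
harsh_uncorrected = grade_gap(SUGGESTED_SCORE, HARSH_BENCHMARK)
# Hypothetical teacher who raises the grade to 7 narrows the gap to 1 point.
harsh_partly_corrected = grade_gap(7, HARSH_BENCHMARK)

# Lenient scenario: keeping the 5 also leaves a 3-point gap.
lenient_uncorrected = grade_gap(SUGGESTED_SCORE, LENIENT_BENCHMARK)
# Hypothetical teacher who lowers the grade to 2 closes the gap entirely.
lenient_fully_corrected = grade_gap(2, LENIENT_BENCHMARK)

print(harsh_uncorrected, harsh_partly_corrected,
      lenient_uncorrected, lenient_fully_corrected)
```

Because the suggested score sits exactly 3 points from the benchmark in both scenarios, an uncorrected grade produces the same gap in either direction, which lets the two conditions be compared directly.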
“We found that when the AI gave a harsh rating that was too low, teachers were significantly less likely to correct it than if they were given the same incorrect rating by a human colleague,” Megalokonomou told SciPost. “For severe AI errors, the grading fairness gap was 22% larger.” Teachers were more likely to accept a harsh automated grade and leave it largely uncorrected. When teachers received the same harsh evaluation from a human peer, they were more willing to correct the mistake and raise the student’s score.
In the lenient scenario, the source of the recommendation made no statistically significant difference. Regardless of whether they thought a machine or a human had made the mistake, teachers made the same appropriate corrections to overly generous grades. They did not show the same deference to computer programs when those programs gave students too much credit. This provides evidence that the reliability of human oversight of algorithmic grading depends heavily on the direction of the error.
The scientists also asked participants to rate the original scorers on five psychological dimensions to understand their thought processes. These dimensions include perceived competence, understanding of the subject matter, fairness, benevolence, and responsibility. These responses helped explain why teachers responded differently to severe and lenient computer errors.
Megalokonomou highlighted major discrepancies in the survey responses. The most surprising finding, she noted, was “the gap between what teachers say about AI and what they actually do.” “They rated AI as less fair, competent, and responsible than their human colleagues, and most said they did not want to use AI.”
Despite these negative perceptions, teachers behaved differently when evaluating actual work. “But when the AI gave a tough score, they deferred to it more than if a human had made the same mistake,” Megalokonomou explained. “Mistrust didn’t make them more wary; in fact, it went in the opposite direction.”
In the harsh scenario, teachers perceived the algorithm to have high technical ability and responsibility. This perception led educators to accept the harsh grade. The harshness itself seemed to serve as a signal that the computer program was rigorous and competent. More than half of the effect in the harsh scenario is explained by higher perceived competence and responsibility.
In the lenient scenario, teachers viewed artificial intelligence much more negatively across all five psychological dimensions. They actively rejected its advice because they felt the lenient algorithm lacked competence, fairness, and benevolence. They intervened to correct inflated scores and bring grades back to a fair level. Because the algorithm scored poorly on these characteristics, teachers overrode the generous advice.
The researchers also looked at how different demographic groups responded. “Surprisingly, this pattern was most pronounced among teachers who are young, well-educated, and confident with technology, exactly the people we expect to be the most important users of AI,” Megalokonomou said. Because these groups are often seen as early adopters of new technologies, this finding challenges the common belief that tech-savvy professionals automatically exercise strong oversight.
Additionally, humanities teachers were slightly more likely to defer to the machine than science and mathematics teachers. The researchers suggest that algorithmic advice can be more influential when evaluation criteria are highly subjective. At the end of the survey, the researchers also asked teachers about their general attitudes toward artificial intelligence. Almost half of respondents reported using generative artificial intelligence tools at least weekly to prepare for lessons.
Despite using these tools for planning, teachers remained skeptical about delegating actual grading authority to machines. In open text responses, many educators expressed concerns about computers’ inability to account for individual student situations. They noted that human grading often requires empathy and context, such as understanding a student’s learning difficulties or family issues. This suggests that approaches that rely solely on improving the technical accuracy of algorithms are unlikely to overcome teachers’ ethical objections.
The study has some limitations. “This experiment was conducted with Greek teachers, so readers should be careful about direct generalizations to other national contexts and professional settings,” Megalokonomou noted. The way these particular teachers interact with technology may not fully reflect the behavior of professionals in other countries and cultural settings.
“This study was also designed around a specific controlled scenario, a single scoring task with intentionally incorrect scores, which allowed us to clearly isolate effects, whereas real-world scoring involves more complexity and requires repeated interactions with AI tools over time,” she added. The experimental design made it relatively easy to determine the correct grade using a simple checklist. In real classroom environments, grading is often more ambiguous and occurs under intense time pressure.
Real-world ambiguity could lead either to greater reliance on algorithmic advice or to stronger independent judgment. Future research could investigate whether deference to harsh automated judgments extends to other evaluation tasks, such as formative assessments and hiring decisions. Scientists may also vary the amount of explanation a computer program provides to see if a detailed rationale prompts humans to take a closer look at the results.
The researchers hope to apply these insights to improving professional practice. “I am already working on a teacher training program that focuses on AI oversight, not just how to use AI tools, but how to recognize when its decisions may be wrong,” Megalokonomou said. “We hope this research will reach policy makers and school leaders who are currently making decisions about AI in education.”
“The question of whether human oversight actually works tends to be assumed rather than tested,” she said. The findings send a strong warning that treating humans as automatic safeguards is insufficient.
“What I want readers to understand is the broader implications. Although this study is about teachers and grading, the dynamics we uncovered, where human oversight selectively breaks down in response to AI behavior, have applications beyond education,” Megalokonomou stressed. As automation tools become more common around the world, these insights provide a useful starting point for understanding how professionals interact with machines.
“Any environment where AI recommendations are combined with human review, such as healthcare, hiring, and criminal justice, faces the same fundamental challenges,” she warned. “Involving humans is not enough. If we want meaningful oversight, we must intentionally design oversight with structured checks and clear accountability mechanisms, rather than assuming that good intentions are enough.”
The study, “Why do experts miss AI errors? Evidence from a randomized labeling experiment,” was authored by Sofoklis Goulas, Rigissa Megalokonomou, and Panagiotis Sotirakopoulos.

