As artificial intelligence rapidly transforms industries, its role in education is under increasing scrutiny, with tools like ChatGPT promising to ease workloads and personalize learning. A new IZA discussion paper by Arnaud Chevalier, Jakub Orzech and Petar Stankov investigates whether AI tools, specifically ChatGPT 3.5 and ChatGPT 4, can match human instructors in providing feedback and grading student work.
In a randomized controlled trial (RCT), undergraduate students were divided into three groups, receiving feedback from human graders, ChatGPT 3.5, or ChatGPT 4. The quality of each feedback source was evaluated by students' performance on the subsequent assignment. The double-blind design ensured that neither students nor instructors knew the source of the feedback, isolating its effect on student outcomes.
Inconsistencies in grading reveal critical shortcomings
The results show that ChatGPT 4 can deliver feedback comparable to human instructors, with students receiving its guidance performing on par with those who received human feedback. In contrast, students who received feedback from ChatGPT 3.5 performed worse in subsequent assessments, suggesting that this earlier version of the AI struggled with providing actionable and effective insights.
When it came to grading, the study revealed significant gaps. Both versions of ChatGPT tended to assign more generous grades than human graders, and their evaluations lacked consistency and contextual understanding. For example, ChatGPT 3.5 struggled with complex tasks such as assessing draft work or interpreting tables and empirical data. Even ChatGPT 4, while more capable, showed limitations: not only did its grade distribution differ from that of human graders, but students' ranks within the distribution also varied considerably. Crucially, the same submission could receive drastically different scores across evaluations, underscoring the current unsuitability of AI for grading.
AI shows potential to save educators time
AI tools like ChatGPT show promise in reducing the time educators spend on feedback and marking, freeing them to focus on teaching-oriented tasks. Still, the study concludes that these tools are not yet ready to replace human expertise in grading. As generative AI continues to improve, this research offers critical insights for educators and policymakers navigating its integration into the classroom.
[Editor’s note: In keeping with the focus of the study, this summary is based on a ChatGPT-generated draft, edited by a human.]