Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Article Number: e2025591 | Available Online: December 2025 | DOI: 10.22521/edupij.2025.19.591
Georgios Zacharis, Stamatios Papadakis
Abstract
Background/purpose. Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared to human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.
Materials/methods. Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system, using a common rubric.
Results. Human inter-rater reliability was excellent (ICC(2,1) = .884; ICC(2,k) = .938 for k = 2). In contrast, AI–human agreement was substantially weaker (AI vs Human-Z: ICC(2,1) = .406; ICC(2,k) = .578; AI vs Human-S: ICC(2,1) = .279; ICC(2,k) = .436). The AI consistently inflated scores by 2.71–3.32 points and compressed distributions, limiting its ability to discriminate across performance levels. Bland–Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = .017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = .367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that AI captured a narrower construct of quality than human raters.
Conclusion. These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.
Keywords: Higher education assessment, human–AI agreement, AI grade, generative AI in education
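For readers who want to trace the agreement statistics named in the abstract, the following is a minimal Python sketch and not the authors' analysis code. It assumes the standard two-way random-effects, absolute-agreement definitions of ICC(2,1) and ICC(2,k), and a simple least-squares slope as one possible check for proportional bias in a Bland–Altman analysis. The function names (`icc2`, `spearman_brown`, `bland_altman`) and the five toy scores are hypothetical and chosen only for illustration; they are not data from the study.

```python
from statistics import mean


def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a table of n essays x k raters.

    Assumes the two-way random-effects, absolute-agreement form
    computed from the ANOVA mean squares.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between essays
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


def spearman_brown(single, k):
    """Step a single-rater reliability up to the mean of k raters."""
    return k * single / (1 + (k - 1) * single)


def bland_altman(a, b):
    """Return (mean difference a - b, slope of difference on pairwise mean).

    A negative slope means `a` exceeds `b` on low-scoring work and falls
    below `b` on high-scoring work, i.e. proportional bias.
    """
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias, centre = mean(diffs), mean(means)
    sxx = sum((m - centre) ** 2 for m in means)
    sxy = sum((m - centre) * (d - bias) for m, d in zip(means, diffs))
    return bias, sxy / sxx


# Hypothetical toy scores (NOT study data): the "AI" column is shifted
# upward on average and squeezed into a narrow range.
human = [4, 8, 12, 16, 20]
ai = [11, 12, 13, 14, 15]

single, average = icc2([list(pair) for pair in zip(human, ai)])
bias, slope = bland_altman(ai, human)
print(f"ICC(2,1) = {single:.3f}, ICC(2,k) = {average:.3f}")
print(f"mean difference = {bias:.2f}, slope = {slope:.2f}")

# The abstract's single-rater values stepped up to k = 2.
for r in (0.884, 0.406, 0.279):
    print(r, "->", round(spearman_brown(r, 2), 3))
```

Under these definitions ICC(2,k) equals the Spearman-Brown step-up of ICC(2,1), so each reported pair can be checked by hand: .884, .406, and .279 step up to .938, .578, and .436 for k = 2, matching the values in the abstract. In the toy data, the negative Bland–Altman slope reproduces the qualitative pattern described there, with compressed scores that over-score weaker work and under-score stronger work.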