Volume 19 (2025)

Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment

Article Number: e2025591  |  Available Online: December 2025  |  DOI: 10.22521/edupij.2025.19.591

Georgios Zacharis, Stamatios Papadakis

Abstract

Background/purpose. Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared to human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods. Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system, using a common rubric.

Results. Human inter-rater reliability was excellent (ICC(2,1) = .884; ICC(2,k) = .938, with k = 2). In contrast, AI–human agreement was substantially weaker (AI vs Human-Z: ICC(2,1) = .406; ICC(2,k) = .578; AI vs Human-S: ICC(2,1) = .279; ICC(2,k) = .436). The AI consistently inflated scores by 2.71–3.32 points and compressed distributions, limiting its ability to discriminate across performance levels. Bland–Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. AI performance was also markedly inconsistent across raters: while the model failed to align with Human-S (κ = .017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = .367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that the AI captured a narrower construct of quality than the human raters.
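As a reading aid for the paired ICC values above, the following minimal Python sketch checks that each reported average-measures ICC(2,k) follows from its single-measure ICC(2,1) through the Spearman-Brown relation, ICC(2,k) = k · ICC(2,1) / (1 + (k − 1) · ICC(2,1)), with k = 2 raters. The function name and the pair labels are illustrative assumptions, not part of the study; only the numeric values are taken from the abstract.

```python
def spearman_brown(icc_single: float, k: int = 2) -> float:
    """Average-measures ICC implied by a single-measure ICC for k raters."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# (ICC(2,1), ICC(2,k)) pairs as reported in the abstract; labels are illustrative.
reported = {
    "Human raters": (0.884, 0.938),
    "AI vs Human-Z": (0.406, 0.578),
    "AI vs Human-S": (0.279, 0.436),
}

for pair, (icc_1, icc_k) in reported.items():
    implied = spearman_brown(icc_1, k=2)
    print(f"{pair}: ICC(2,1) = {icc_1:.3f}, implied ICC(2,k) = {implied:.3f}, "
          f"reported ICC(2,k) = {icc_k:.3f}")
```

Under this relation the implied and reported values agree to three decimals for all three pairs, so the weaker AI–human figures reflect low single-rater agreement rather than a different averaging convention.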

Conclusion. These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.

Keywords: Higher education assessment, human–AI agreement, AI grade, generative AI in education
