Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation

Wetzler, Elizabeth L.; Cassidy, Kenneth S.; Jones, Margaret J.; Frazier, Chelsea R.; Korbut, Nickalous A.; Sims, Chelsea M.; Bowen, Shari S.; Wood, Michael D.

Authors

Wetzler, Elizabeth L.

Cassidy, Kenneth S.

Jones, Margaret J.

Frazier, Chelsea R.

Korbut, Nickalous A.

Sims, Chelsea M.

Bowen, Shari S.

Wood, Michael D.

Issue Date

2024-09-19

Type

Journal articles

Keywords

generative AI, essay grading, grading bias, educational assessment, human instructor, AI bias, ChatGPT scoring, Artificial Intelligence

Abstract

Background: Generative artificial intelligence (AI) represents a potentially powerful, time-saving tool for grading student essays. However, little is known about how AI-generated essay scores compare to human instructor scores. Objective: The purpose of this study was to compare the essay scores produced by AI with those of human instructors to explore similarities and differences. Method: Eight human instructors and two versions of OpenAI’s ChatGPT (3.5 and 4o) independently graded 186 deidentified student essays from an introductory psychology course using a detailed rubric. Scoring consistency was analyzed using Bland-Altman and regression analyses. Results: ChatGPT 3.5 scores were, on average, higher than human scores, whereas average ChatGPT 4o scores were closer to human scores. Notably, both versions graded more leniently than human instructors at lower performance levels and more strictly at higher levels, reflecting proportional bias. Conclusion: Although AI may offer potential for supporting grading processes, the pattern of results suggests that AI and human instructors differ in how they score using the same rubric. Teaching Implications: Educators should be aware that AI-generated scores for psychology writing assignments that require reflection or critical thinking may differ markedly from scores generated by human instructors.

Citation

Wetzler, Elizabeth L., Kenneth S. Cassidy, Margaret J. Jones, Chelsea R. Frazier, Nickalous A. Korbut, Chelsea M. Sims, Shari S. Bowen, and Michael D. Wood. "Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation." Teaching of Psychology (2024): 00986283241282696.

Publisher

SAGE Publications

URI

https://journals.sagepub.com/doi/10.1177/00986283241282696
https://hdl.handle.net/20.500.14216/1570

DOI

10.1177/00986283241282696

ISSN

0098-6283
1532-8023

Collections

Behavioral Sciences & Leadership Scholarship
