A Comparative Study of AI-generated and Teacher-constructed Language Tests

Abstract

This article examines the quality of multiple-choice (MC) vocabulary tests generated with the help of artificial intelligence (AI) by comparing their psychometric properties to those of human-constructed tests. Two sets of criterion-referenced tests (CRTs), designed to assess vocabulary previously taught in class, were developed. In each set, one test was generated entirely by AI, while the other incorporated MC options either created fully by a human constructor or modified from the AI-generated options. The tests were administered to high school students. The analysis focused on reliability estimates and item statistics, particularly those relevant to CRTs. The findings suggest that the use of AI significantly improved test practicality by reducing the time and effort needed to develop the tests, although the human-constructed tests exhibited superior psychometric qualities.

Keywords:

language assessment, criterion-referenced testing, multiple choice, automatic item generation, ChatGPT, test evaluation
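For readers unfamiliar with CRT-oriented item analysis, the sketch below illustrates two of the item statistics commonly reported for criterion-referenced tests: item facility and Brennan's B-index (the difference in item facility between examinees at or above a mastery cut score and those below it). The code is not taken from the article; the response matrix, cut score, and function names are hypothetical, and it assumes dichotomously scored (0/1) items with both mastery groups non-empty.

```python
import numpy as np

def crt_item_statistics(responses, total_scores, cut_score):
    """Illustrative CRT item statistics for dichotomously scored items.

    responses:    (n_students, n_items) array of 0/1 item scores
    total_scores: (n_students,) array of total test scores
    cut_score:    mastery threshold applied to the total score
    """
    responses = np.asarray(responses, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    masters = total_scores >= cut_score  # assumes both groups are non-empty

    # Item facility: proportion of all examinees answering each item correctly.
    facility = responses.mean(axis=0)

    # B-index: facility among masters minus facility among non-masters,
    # a discrimination index oriented toward criterion-referenced decisions.
    b_index = responses[masters].mean(axis=0) - responses[~masters].mean(axis=0)

    return facility, b_index

# Hypothetical example: five students, three items, cut score of 2 out of 3.
resp = np.array([[1, 1, 1],
                 [1, 1, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
facility, b_index = crt_item_statistics(resp, resp.sum(axis=1), cut_score=2)
print(facility, b_index)
```

Items with a high facility but a near-zero or negative B-index would be flagged for review, since they fail to separate examinees who reach the mastery criterion from those who do not.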




