Aryadoust, V., Zakaria, A., & Jia, Y. (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6, Article 100204. https://doi.org/10.1016/j.caeai.2024.100204
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge University Press.
Bezirhan, U., & von Davier, M. (2023). Automated reading passage generation with OpenAI’s large language model. Computers and Education: Artificial Intelligence, 5, Article 100161. https://doi.org/10.1016/j.caeai.2023.100161
Bhandari, S., Liu, Y., Kwak, Y., & Pardos, Z. A. (2024). Evaluating the psychometric properties of ChatGPT-generated questions. Computers and Education: Artificial Intelligence, 7, Article 100284. https://doi.org/10.1016/j.caeai.2024.100284
Bitew, S. K., Deleu, J., Develder, C., & Demeester, T. (2025). Distractor generation for multiple-choice questions with predictive prompting and large language models. In R. Meo & F. Silvestri (Eds.), Machine learning and principles and practice of knowledge discovery in databases (pp. 48–63). Springer Nature Switzerland.
Brown, J. D. (2012). Classical test theory. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 323–335). Routledge.
Brown, J. D. (2014). Score dependability and decision consistency. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3, Chap. 71, pp. 1182–1206). John Wiley & Sons.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge University Press.
Circi, R., Hicks, J., & Sikali, E. (2023). Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education, 8, Article 858273. https://doi.org/10.3389/feduc.2023.858273
Coşkun, Ö., Kıyak, Y. S., & Budakoğlu, I. İ. (2025). ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Medical Teacher, 47(2), 268–274.
Doyle, A., Sridhar, P., Agarwal, A., Savelka, J., & Sakr, M. (2025). A comparative study of AI-generated and human-crafted learning objectives in computing education. Journal of Computer Assisted Learning, 41(1), Article e13092. https://doi.org/10.1111/jcal.13092
Ebel, R. L. (1954). Procedures for the analysis of classroom tests. Educational and Psychological Measurement, 14, 352–364.
Flodén, J. (2025). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201–224. https://doi.org/10.1002/berj.4069
Gierl, M. J., & Haladyna, T. M. (Eds.). (2013). Automatic item generation: Theory and practice. Routledge.
Haynie, W. J. (1992). Post hoc analysis of test items written by technology education teachers. Journal of Technology Education, 4(1), 26–38.
Jincheng, Z., Thada, J., & Rukthin, L. (2025). Meta-analysis of artificial intelligence in education. Higher Education Studies, 15(2), 189–210.
Khademi, A. (2023). Can ChatGPT and Bard generate aligned assessment items? A reliability analysis against human performance. Journal of Applied Learning & Teaching, 6(1), 75–80.
Kıyak, Y. S., Coşkun, Ö., Budakoğlu, I. İ., & Uluoğlu, C. (2024). ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. European Journal of Clinical Pharmacology. https://doi.org/10.1007/s00228-024-03649-x
Malec, W. (2024). Investigating the quality of AI-generated distractors for a multiple-choice vocabulary test. In O. Poquet, A. Ortega-Arranz, O. Viberg, I.-A. Chounta, B. McLaren, & J. Jovanovic (Eds.), CSEDU 2024: Proceedings of the 16th International Conference on Computer Supported Education – Volume 1 (pp. 836–843). SCITEPRESS.
Malec, W. (2025). Validating classroom tests on WebClass. In M. Bloch-Trojnar, A. Bloch-Rozmej, & E. Cyran (Eds.), Form, function, and learning: Linguistic studies in honour of Professor Anna Malicka-Kleparska from her students, colleagues, and friends (pp. 205–220). Wydawnictwo Werset.
Memarian, B., & Doleck, T. (2023). ChatGPT in education: Methods, potentials, and limitations. Computers in Human Behavior: Artificial Humans, 1(2), Article 100022. https://doi.org/10.1016/j.chbah.2023.100022
Mendoza, K. K. R., Zúñiga, L. H. P., & López García, A. Y. (2024). Creación y jueceo de ítems: ChatGPT como diseñador y juez [Item creation and judging: ChatGPT as designer and judge]. Texto Livre, 17, Article e51222. https://doi.org/10.1590/1983-3652.2024.51222
Ngo, A., Gupta, S., Perrine, O., Reddy, R., Ershadi, S., & Remick, D. (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Academic Pathology, 11(1), Article 100099. https://doi.org/10.1016/j.acpath.2023.100099
O, K.-M. (2024). A comparative study of AI-human-made and human-made test forms for a university TESOL theory course. Language Testing in Asia, 14(1), Article 19. https://doi.org/10.1186/s40468-024-00291-3
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3–13.
Sayin, A., & Gierl, M. (2024). Using OpenAI GPT to generate reading comprehension items. Educational Measurement: Issues and Practice, 43(1), 5–18. https://doi.org/10.1111/emip.12590
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680. https://doi.org/10.1007/BF01068419
Shin, D., & Lee, J. H. (2023). Can ChatGPT make reading comprehension testing items on par with human experts? Language Learning & Technology, 27(3), 27–40.
Shin, D., & Lee, J. H. (2024). AI-powered automated item generation for language testing. ELT Journal, 78(4), 446–452. https://doi.org/10.1093/elt/ccae016
Song, Y., Du, J., & Zheng, Q. (2025). Automatic item generation for educational assessments: A systematic literature review. Interactive Learning Environments, 1–20. https://doi.org/10.1080/10494820.2025.2482588
Tewachew, A., Shiferie, K., & Tefera, E. (2024). Practices of EFL teachers in test construction. Cogent Education, 11(1), Article 2412496. https://doi.org/10.1080/2331186X.2024.2412496
Zuckerman, M., Flood, R., Tan, R. J. B., Kelp, N., Ecker, D. J., Menke, J., & Lockspeiser, T. (2023). ChatGPT for assessment writing. Medical Teacher, 45(11), 1224–1227. https://doi.org/10.1080/0142159X.2023.2249239