This article reports the results of an experiment comparing ChatGPT's performance with human performance on tests that require specific knowledge and skills, such as university admission tests. We chose one general undergraduate admission test, the Scholastic Assessment Test (SAT), and two tests for admission to biomedical programs, the Cambridge BioMedical Admission Test (BMAT) and the Italian Medical School Admission Test (IMSAT). In particular, we examined the difference in performance between ChatGPT-4 and its predecessor, ChatGPT-3.5, to assess how the model has evolved. ChatGPT-4 showed a significant improvement over ChatGPT-3.5 and, compared to real students, scored on average within the top 10% on the SAT, while its IMSAT score was high enough to grant admission to the two highest ranked Italian medical schools. In addition to the performance analysis, we provide a qualitative analysis of incorrect answers and a classification of three different types of logical and computational errors made by ChatGPT-4, which reveal important weaknesses of the model. These findings provide insight into the skills needed to use such models effectively despite their weaknesses, and suggest possible applications of our analysis in the field of education.
Copyright: © 2024 Giunti et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.