Performance Assessment of GPT 4.0 on the Japanese Medical Licensing Examination

Curr Med Sci. 2024 Dec;44(6):1148-1154. doi: 10.1007/s11596-024-2932-9. Epub 2024 Oct 26.

Abstract

Objective: To evaluate the accuracy and parsing ability of GPT 4.0 for Japanese medical practitioner qualification examinations in a multidimensional way to investigate its response accuracy and comprehensiveness to medical knowledge.

Methods: We evaluated the performance of the GPT 4.0 on Japanese Medical Licensing Examination (JMLE) questions (2021-2023). Questions are categorized by difficulty and type, with distinctions between general and clinical parts, as well as between single-choice (MCQ1) and multiple-choice (MCQ2) questions. Difficulty levels were determined on the basis of correct rates provided by the JMLE Preparatory School. The accuracy and quality of the GPT 4.0 responses were analyzed via an improved Global Qualily Scale (GQS) scores, considering both the chosen options and the accompanying analysis. Descriptive statistics and Pearson Chi-square tests were used to examine performance across exam years, question difficulty, type, and choice. GPT 4.0 ability was evaluated via the GQS, with comparisons made via the Mann-Whitney U or Kruskal-Wallis test.

Results: The correct response rate and parsing ability of the GPT4.0 to the JMLE questions reached the qualification level (80.4%). In terms of the accuracy of the GPT4.0 response to the JMLE, we found significant differences in accuracy across both difficulty levels and option types. According to the GQS scores for the GPT 4.0 responses to all the JMLE questions, the performance of the questionnaire varied according to year and choice type.

Conclusion: GTP4.0 performs well in providing basic support in medical education and medical research, but it also needs to input a large amount of medical-related data to train its model and improve the accuracy of its medical knowledge output. Further integration of ChatGPT with the medical field could open new opportunities for medicine.

Keywords: ChatGPT; Japanese medical licensing examination; artificial intelligence; large language model.

MeSH terms

  • Artificial Intelligence*
  • Education, Medical / standards
  • Educational Measurement* / methods
  • Educational Measurement* / standards
  • Japan
  • Licensure, Medical* / standards