QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams

Pac Symp Biocomput. 2025:30:54-69.

Abstract

The United States Medical Licensing Examination (USMLE) is a critical step in assessing the competence of future physicians, yet the process of creating exam questions and study materials is both time-consuming and costly. While Large Language Models (LLMs), such as OpenAI's GPT-4, have demonstrated proficiency in answering medical exam questions, their potential in generating such questions remains underexplored. This study presents QUEST-AI, a novel system that utilizes LLMs to (1) generate USMLE-style questions, (2) identify and flag incorrect questions, and (3) correct errors in the flagged questions. We evaluated this system's output by constructing a test set of 50 LLM-generated questions mixed with 50 human-generated questions and conducting a two-part assessment with three physicians and two medical students. The assessors attempted to distinguish between LLM-generated and human-generated questions and evaluated the validity of the LLM-generated content. A majority of exam questions generated by QUEST-AI were deemed valid by a panel of three clinicians, with strong correlations between performance on LLM-generated and human-generated questions. This pioneering application of LLMs in medical education could significantly increase the ease and efficiency of developing USMLE-style medical exam content, offering a cost-effective and accessible alternative for exam preparation.
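
The three stages named in the abstract (generate, flag, correct) can be pictured as a simple control flow. The following is a minimal sketch only, not the authors' implementation: the `Question` type, the `run_pipeline` function, and the three callables (`generate`, `is_valid`, `correct`) are hypothetical names introduced here for illustration, and each callable stands in for an LLM call whose prompts and model settings the abstract does not describe.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: Dict[str, str]   # option label -> option text
    answer: str               # label of the keyed answer
    flagged: bool = False     # True once the verifier has rejected the draft


def run_pipeline(
    generate: Callable[[], Question],
    is_valid: Callable[[Question], bool],
    correct: Callable[[Question], Question],
    n_questions: int,
) -> List[Question]:
    """Run generate -> verify -> refine once per question.

    Each callable is a placeholder for an LLM-backed step (assumption).
    """
    questions: List[Question] = []
    for _ in range(n_questions):
        q = generate()                              # stage 1: draft a question
        if not is_valid(q):                         # stage 2: flag a bad draft
            q = correct(replace(q, flagged=True))   # stage 3: repair it
        questions.append(q)
    return questions
```

In this sketch a flagged question keeps its `flagged` marker after correction, so downstream reviewers could still see which items needed repair. Whether QUEST-AI records this is not stated in the abstract.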

MeSH terms

  • Artificial Intelligence
  • Clinical Competence / statistics & numerical data
  • Computational Biology*
  • Educational Measurement* / methods
  • Educational Measurement* / standards
  • Educational Measurement* / statistics & numerical data
  • Humans
  • Licensure, Medical* / standards
  • Students, Medical / statistics & numerical data
  • United States