Background: ChatGPT is an AI platform whose relevance in the peer review of scientific articles is steadily growing. Nonetheless, it has sparked debates over its potential biases and inaccuracies. This study aims to assess ChatGPT's ability to qualitatively emulate human reviewers in scientific research.
Methods: We included the first submitted version of the latest twenty original research articles published by the 3rd of July 2023 in a high-profile medical journal. Each article underwent evaluation by a minimum of three human reviewers during the initial review stage. Subsequently, three researchers with medical backgrounds and expertise in manuscript revision independently and qualitatively assessed the agreement between the peer reviews generated by ChatGPT version GPT-4 and the comments provided by human reviewers for these articles. The level of agreement was categorized as complete, partial, none, or contradictory.
Results: A total of 720 human reviewers' comments were assessed. Agreement among the three assessors was good (overall kappa >0.6). ChatGPT's comments demonstrated complete agreement in terms of quality and substance with 48 (6.7 %) human reviewers' comments; partial agreement with 92 (12.8 %), identifying issues necessitating further elaboration or recommending supplementary steps to address concerns; no agreement with 565 (78.5 %); and contradiction with 15 (2.1 %). ChatGPT's comments on methods had the lowest proportion of complete agreement (13 comments, 3.6 %), while general comments on the manuscript displayed the highest proportion of complete agreement (17 comments, 22.1 %).
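The overall percentages follow directly from the four category counts reported above. As a quick arithmetic check, here is a minimal sketch; the variable names are illustrative and not taken from the study, and only the counts come from the Results:

```python
# Counts of human reviewers' comments per agreement category, as reported in the Results.
counts = {
    "complete": 48,
    "partial": 92,
    "none": 565,
    "contradictory": 15,
}

# The four categories should account for every assessed comment.
total = sum(counts.values())

# Share of each category, rounded to one decimal place as in the abstract.
percentages = {
    category: round(100 * n / total, 1) for category, n in counts.items()
}

for category, pct in percentages.items():
    print(f"{category}: {counts[category]} of {total} ({pct} %)")
```

The counts sum to the 720 assessed comments, and the rounded shares match the reported 6.7 %, 12.8 %, 78.5 %, and 2.1 %.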
Conclusion: ChatGPT version GPT-4 has a limited ability to emulate human reviewers within the peer review process of scientific research.
Keywords: Artificial Intelligence; Bias; ChatGPT; Peer review; Quality agreement.
Copyright © 2024 Elsevier B.V. All rights reserved.