Background: Amid the increasing use of AI in medical research, this study assesses and compares the accuracy and credibility of OpenAI's GPT-4 and Google's Gemini in generating medical research introductions, focusing on the precision and reliability of their citations across five medical fields.
Methods: We compared OpenAI's GPT-4 and Google's Gemini Ultra across five medical fields, evaluating the credibility and accuracy of the citations each model generated, along with introduction length and the amount of unreferenced content.
Results: Gemini outperformed GPT-4 in reference precision. Gemini's references showed 77.2% correctness and 68.0% accuracy, compared with GPT-4's 54.0% correctness and 49.2% accuracy (p < 0.001 for both). These gaps of 23.2 percentage points in correctness and 18.8 percentage points in accuracy represent a substantial difference in citation reliability. GPT-4 generated longer introductions (332.4 ± 52.1 words vs. Gemini's 256.4 ± 39.1 words, p < 0.001) but included more unreferenced facts and assumptions (1.6 ± 1.2 vs. 1.2 ± 1.06 instances, p = 0.001).
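For readers wishing to reproduce comparisons of this kind from the underlying counts, the sketch below shows how the reported percentage-point gaps and p-values of this type are commonly computed: a two-proportion z-test for correctness/accuracy rates and Welch's t-test for word counts. The specific tests and the helper functions here are illustrative assumptions, not the paper's stated statistical analysis, and the raw counts would come from the study data.

```python
# Illustrative sketch only: how percentage-point gaps and p-values like those
# reported above are typically obtained. Counts and samples are placeholders
# for the study's raw data; the test choices are assumptions.
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Percentage-point gaps computed from the reported rates
correctness_gap = 77.2 - 54.0   # 23.2 points (Gemini vs. GPT-4)
accuracy_gap = 68.0 - 49.2      # 18.8 points

def compare_rates(correct_a, total_a, correct_b, total_b):
    """Two-proportion z-test, e.g. correct references out of all references per model."""
    z_stat, p_value = proportions_ztest([correct_a, correct_b], [total_a, total_b])
    return z_stat, p_value

def compare_word_counts(words_model_a, words_model_b):
    """Welch's t-test on per-introduction word counts (unequal variances assumed)."""
    return ttest_ind(words_model_a, words_model_b, equal_var=False)
```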
Conclusion: While Gemini demonstrates significantly superior performance in generating credible and accurate references for medical research introductions, both models produced fabricated evidence, limiting their reliability for reference searching. This snapshot comparison of two prominent AI models highlights the potential and limitations of AI in academic content creation. The findings underscore the critical need for verification of AI-generated academic content and call for ongoing research into evolving AI models and their applications in scientific writing.
Copyright © 2024 The Author(s). Published by Elsevier Ltd. All rights reserved.