Between human and AI: assessing the reliability of AI text detection tools

Curr Med Res Opin. 2024 Mar;40(3):353-358. doi: 10.1080/03007995.2024.2310086. Epub 2024 Feb 2.

Abstract

Objective: Large language models (LLMs) such as ChatGPT-4 have raised critical questions regarding their distinguishability from human-generated content. In this research, we evaluated the effectiveness of online detection tools in identifying ChatGPT-4 vs human-written text.

Methods: A two texts produced by ChatGPT-4 using differing prompts and one text created by a human author were analytically assessed using the following online detection tools: GPTZero, ZeroGPT, Writer ACD, and Originality.

Results: The findings revealed a notable variance in the detection capabilities of the employed detection tools. GPTZero and ZeroGPT exhibited inconsistent assessments regarding the AI-origin of the texts. Writer ACD predominantly identified texts as human-written, whereas Originality consistently recognized the AI-generated content in both samples from ChatGPT-4. This highlights Originality's enhanced sensitivity to patterns characteristic of AI-generated text.

Conclusion: The study demonstrates that while automatic detection tools may discern texts generated by ChatGPT-4 significant variability exists in their accuracy. Undoubtedly, there is an urgent need for advanced detection tools to ensure the authenticity and integrity of content, especially in scientific and academic research. However, our findings underscore an urgent need for more refined detection methodologies to prevent the misdetection of human-written content as AI-generated and vice versa.

Keywords: Anesthesia; ChatGPT-4; artificial intelligence; intensive care; large language model.

MeSH terms

  • Artificial Intelligence*
  • Humans
  • Writing*