Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Copyright remains with the author(s).

The Great AI Witch Hunt: Reviewers’ Perception and (Mis)Conception of Generative AI in Research Writing

Hilda Hadan (ORCID: 0000-0002-5911-1405), Derrick M. Wang (ORCID: 0000-0003-3564-2532), Reza Hadi Mogavi (ORCID: 0000-0002-4690-2769), Joseph Tu (ORCID: 0000-0002-7703-6234), Leah Zhang-Kennedy (ORCID: 0000-0002-0756-0022), and Lennart E. Nacke (ORCID: 0000-0003-4290-8829)
Stratford School of Interaction Design and Business, University of Waterloo, Waterloo, Canada

(2024)

Abstract.

Generative AI (GenAI) use in research writing is growing fast. However, it is unclear how peer reviewers recognize or misjudge AI-augmented manuscripts. To investigate the impact of AI-augmented writing on peer reviews, we conducted a snippet-based online survey with 17 peer reviewers from top-tier HCI conferences. Our findings indicate that while AI-augmented writing improves readability, language diversity, and informativeness, it often lacks research details and reflective insights from authors. Reviewers consistently struggled to distinguish between human and AI-augmented writing but their judgements remained consistent. They noted the loss of a “human touch” and subjective expressions in AI-augmented writing. Based on our findings, we advocate for reviewer guidelines that promote impartial evaluations of submissions, regardless of any personal biases towards GenAI. The quality of the research itself should remain a priority in reviews, regardless of any preconceived notions about the tools used to create it. We emphasize that researchers must maintain their authorship and control over the writing process, even when using GenAI’s assistance.

Artificial intelligence, Generative AI, Reviewer Perception, Research Writing, AI Writing Augmentation

Journal year: 2024. Venue: Computers in Human Behavior: Artificial Humans. CCS Concepts: Human-centered computing → Empirical studies in HCI.

1. Introduction

The emergence of generative artificial intelligence (GenAI) tools such as ChatGPT (https://chat.openai.com/) and Gemini (formerly Google Bard; https://gemini.google.com/) has sparked a wave of excitement in academia and industry. Since the release of ChatGPT in November 2022 (OpenAI, 2022), GenAI has become increasingly popular in assisting people with written, auditory, and visual tasks (Tu et al., 2024; Mogavi et al., 2024; Khalifa and Albadawy, 2024). In research, GenAI offers a new approach to manuscript writing, as it can handle tasks ranging from text improvement suggestions to speech-to-text translation and even crafting initial drafts (Khalifa and Albadawy, 2024; Lin et al., 2024). Its ability to understand context and generate human-like and grammatically accurate responses fosters innovative brainstorming and enhances the quality and readability of research publications (Babl and Babl, 2023). However, along with GenAI’s potential to augment research activities, concerns about transparency, academic integrity, and the urgency of maintaining the credibility of research work have emerged (Liu et al., 2020; Rivera et al., 2020; COPE (2023b), Committee on Publication Ethics; Tu et al., 2024).

Despite the growing interest in using GenAI for manuscript writing and research activities (Phillips et al., 2023; Khalifa and Albadawy, 2024), many researchers hesitate to acknowledge its use in their papers. This is illustrated by several instances where research publications with undisclosed GenAI use were identified by readers (e.g., retractionwatch.com, nd; Reddit.com, 2024; LinkedIn.com, 2024; Twitter.com, 2023). Studies have identified the phenomenon of AI aversion, where AI-generated content, even if factual, is often perceived as inaccurate and misleading (Longoni et al., 2022; Burton et al., 2020), and disclosing its use can negatively impact readers’ satisfaction and perception of the authors’ qualifications and effort (Rae, 2024). Therefore, researchers’ hesitancy is partly due to their fear that acknowledging GenAI use might damage reviewers’ perceptions. However, given the widespread adoption of GenAI, researchers’ undisclosed GenAI use will harm the transparency, credibility, and integrity of research knowledge mobilization in the long term.

Our research investigates the perceptions of academia and industry professionals experienced in peer-reviewing manuscripts for top-tier human-computer interaction (HCI) conferences. Through understanding reviewers’ perceptions and clarifying their possible misconceptions, we seek to reduce researchers’ concerns about disclosing GenAI use. Our findings will shed light on the impacts of using GenAI as writing assistance for both reviewers and researchers, and foster a transparent and credible research environment. Specifically, we answer four Research Questions (RQs):

RQ1:

How much are reviewers aware of the use of AI in the context of research manuscripts?

A recent study has identified concerns among researchers about their writing being indistinguishable from AI-generated text, especially for those trained in formal writing structures (Tu et al., 2024). In fact, non-native English writing samples are more likely to be misclassified as AI-generated (Liang et al., 2023), and humans cannot differentiate between AI- and human-written content (Gao et al., 2023). Therefore, false positives might occur in peer reviewers’ assessments of manuscripts. Our RQ1 aims to validate this hypothesis by examining reviewers’ awareness across various levels of AI involvement in research writing.

RQ2:

How much is reviewers’ judgement on research and manuscript quality influenced by the use of AI in its writing?

The phenomenon of AI aversion (Longoni et al., 2022; Burton et al., 2020) further raises the issue that reviewers might be biased in their assessment of the quality and credibility of the research presented in submissions. Our RQ2 aims to explore this issue by examining how snippets with various levels of AI involvement in writing influence reviewers’ judgments.

RQ3:

To what extent do reviewers’ peer-review experience, disciplinary expertise, and AI familiarity influence their perception and judgement?

Literature suggests that people’s familiarity with algorithms and expertise in relevant fields shape their perceptions (Dietvorst et al., 2015; Graefe et al., 2018; Logg et al., 2019). Therefore, reviewers’ peer-review experience, disciplinary expertise, and familiarity with GenAI may also shape their perceptions. Our RQ3 aims to investigate how these factors impact reviewers’ perceptions and judgments.

RQ4:

What aspects of research writing impact reviewers’ perception and judgement?

Prior research indicates that GPT detectors often misclassify content with limited linguistic proficiency as AI-generated (Liang et al., 2023), and that human-authored articles are generally seen as more pleasant to read and less boring (Clerwall, 2017). Our RQ4 seeks to identify the specific manuscript elements that shape reviewers’ perceptions. Through identifying these elements, we aim to uncover the rationale behind reviewers’ judgments and misconceptions about GenAI in manuscript writing.

We investigated peer-reviewer perception through an online survey. To the best of our knowledge, our study is the first to empirically examine how peer reviewers from top-tier HCI conferences perceive AI-augmented academic writing across three types of text: original human-written, AI-paraphrased, and AI-generated snippets. Our approach for assessing peer-reviewer perceptions of AI-augmented writing can be adapted for use in academic fields other than HCI. While our research is focused on HCI, it has broader implications for academic publishing across disciplines. We offer insights into the relationships among GenAI, authorship, and peer review. Our research makes four additional contributions to research on GenAI-augmented manuscript writing and its regulation. First, we show that all peer reviewers struggled to distinguish between AI-processed and human-written snippets. All reviewers perceived AI-paraphrased snippets as more honest. Reviewers with more disciplinary expertise and AI familiarity consistently perceived snippets, regardless of AI involvement, as clearer and more compelling. Responsible and transparent use of GenAI can improve research manuscripts without compromising reviewers’ perceptions. Second, we report how our survey revealed reviewers’ contradictory perceptions of AI and human authorship indicators. This revelation has substantial implications for fair and unbiased manuscript evaluation with the potential to reshape peer-review processes across disciplines. We encourage reviewers to prioritize manuscript coherence, research validity, and effective communication, without letting their attitudes and misconceptions about GenAI influence their assessments. Third, we show that reviewers valued the subjective expressions of human authors in research manuscripts. This “human touch” resonated with reviewers because it maintains the collaborative nature of the research community. 
Therefore, we suggest researchers retain adequate involvement in their writing and act as the primary driver of the writing process—even with GenAI assistance. Fourth, our qualitative findings show that reviewers’ apprehensions about GenAI may worsen the publish-or-perish culture in academia. This could disproportionately affect researchers who rely on traditional writing methods. As a result, it would ultimately stifle human creativity. Our findings directly inform best practices for integrating GenAI in manuscript preparation—while maintaining research integrity—because we identify specific elements that shape reviewers’ perceptions. We conducted this research to provide crucial insights for the timely development of ethical AI use policies in academia. In addition, our findings contribute to the ongoing debate about GenAI’s role in academia by providing empirical evidence of its effects on peer review—a hallmark of scientific progress.

2. Background and Related Work

In this section, we summarize the technical evolution of GenAI as a manuscript writing assistant and the emerging perceptions and concerns within the academic community. In the end, we illustrate how our research addresses these concerns and promotes the ethical, transparent, and effective use of GenAI to support future researchers.

2.1. Generative AI as a Writing Assistant

Manuscript writing is crucial for researchers to share their ideas and contribute to their fields. However, writing high-quality research papers is challenging due to the need to simplify complex findings while ensuring accuracy, logical flow, and adequate evidence (Gupta et al., 2022). Beginners and non-native English speakers often struggle with using proper terminology and literature references (Gupta et al., 2022; Inouye and McAlpine, 2019; Liang et al., 2023). In addition, manuscript writing often competes with other responsibilities like teaching and supervising (De Rond and Miller, 2005), making efficiency and time management vital. The pressure of the “publish or perish” mindset (De Rond and Miller, 2005) further intensifies these challenges. GenAI has thus become valuable in research writing, easing researchers’ writing burden and helping them keep their focus on the innovative and critical aspects of their research.

With the rise of Large Language Models (LLMs), GenAI’s potential to transform manuscript writing has garnered significant interest (Tu et al., 2024; Khalifa and Albadawy, 2024; Biswas, 2023). Traditional writing assistants offer word and sentence corrections, synonym suggestions, and sentence completion predictions (Arnold et al., 2016; Quinn and Zhai, 2016; Chen et al., 2019). In contrast, GenAI offers a broader array of functionalities to ensure high-quality writing across diverse research disciplines, such as inspiring new ideas (Lee et al., 2022; Shaer et al., 2024), enhancing readability (Babl and Babl, 2023), and assisting with narrative construction and creative writing (Lee et al., 2022; Singh et al., 2023; Yuan et al., 2022). However, GenAI has the limitation of generating factually incorrect information, known as hallucination (Ji et al., 2023; Achiam et al., 2023). For example, researchers have reported encountering fake references from GenAI (COPE (2023a), Committee on Publication Ethics). In addition, GenAI can be opinionated, which can influence the perspectives and attitudes researchers convey in their writing and compromise research integrity (Jakesch et al., 2023). Therefore, while GenAI holds benefits for manuscript writing, its use requires researchers’ careful consideration to avoid these risks.

These problems highlight the importance of transparently disclosing the use of GenAI. Such disclosure enables reviewers and readers to critically evaluate the research and be aware of potential biases or inaccuracies introduced by GenAI. Our study investigates reviewers’ perceptions and misconceptions in order to reduce current concerns and hesitations among researchers, encourage researchers to openly disclose their GenAI use, and foster a more transparent and accountable research environment.

2.2. Perceptions of Generative AI in Research Community

A central debate in the research community regarding GenAI involves authorship and content attribution (COPE (2023b), Committee on Publication Ethics). Research manuscripts reflect the knowledge, expertise, and contributions of their authors (The Committee on Publication Ethics (2019), COPE). The use of GenAI in manuscript writing has raised questions about how to acknowledge its involvement, as crediting it as a co-author is inappropriate because “AI tools cannot meet the requirements for authorship as they cannot take responsibility for the submitted work” (COPE (2023b), Committee on Publication Ethics, para. 2). GenAI also cannot be accountable for the content it produces (COPE (2023b), Committee on Publication Ethics; COPE (2023a), Committee on Publication Ethics). Beyond authorship, ethical concerns arise, such as copyright infringement from using third-party materials, possible conflicts of interest, and plagiarism issues that replicate content and images, ideas, and methods from already published works (COPE (2023a), Committee on Publication Ethics; Lund et al., 2023). In 2023, the Committee on Publication Ethics (COPE) recommended that authors explicitly disclose the use of AI-assisted technologies, including LLMs like ChatGPT, in their work (COPE (2023b), Committee on Publication Ethics). Following COPE’s lead, the Association for Computing Machinery (ACM) established policies on GenAI, stating “the use of generative AI tools and technologies to create content is permitted but must be fully disclosed” (Association for Computing Machinery, 2023). Following these policies, efforts have been made to develop comprehensive reporting guidelines for evaluating the impact of tools like ChatGPT on scientific research writing, as seen in initiatives by Elsevier (elsevier.com, nda) and the World Association of Medical Editors (World Association of Medical Editors, 2023). 
These guidelines aim to promote transparency by providing a framework for declaring the use of GenAI in research.

Scholarly work has revealed two opposing perceptions of AI-generated content: algorithm aversion and algorithmic appreciation. Algorithm aversion is a negative bias towards AI-generated content, even when the AI output is objectively better than human-produced content (Hong, 2018; Burton et al., 2020). For example, people tend to rate AI-written content as inaccurate regardless of its truthfulness (Longoni et al., 2022). In addition, informing users about AI involvement can harm the creator-reader relationship rather than facilitate content judgment (Rae, 2024). This bias worsens after people see AI make mistakes (Dietvorst et al., 2015). On the other hand, algorithmic appreciation refers to cases where people are more willing to adhere to advice from an algorithm than from a human (Logg et al., 2019), and find AI-created articles more credible and higher in journalistic expertise (Graefe et al., 2018).

Manuscript writing involves various decisions about word choice and sentence structure to effectively convey authors’ meaning and purpose, with each word representing a decision made by the authors (Kreminski, 2024). With GenAI, many of these decisions are delegated to AI, which relies on highly probable options, pre-defined rules, large databases, or specific text corpora (Kreminski, 2024). This delegation can reduce human authors’ sense of ownership (Draxler et al., 2024; Lee et al., 2022), which may potentially lead to irresponsible assertions in research papers. Therefore, regulating the extent of GenAI assistance is crucial for maintaining the accountability and credibility of research publications. Our research aims to encourage transparency in disclosing GenAI use, which is the foundational step for responsible AI augmentation in research manuscript writing.

2.3. Connection to Our Research

While guidelines exist to guide researchers and promote transparency in the research community, many researchers are hesitant to acknowledge their use of GenAI in their manuscripts (e.g., retractionwatch.com, nd; Reddit.com, 2024; LinkedIn.com, 2024; Twitter.com, 2023). Although previous studies have examined the human ability to detect AI-generated content (e.g., Gao et al., 2023; Köbis and Mossink, 2021; Ragot et al., 2020), these studies were not conducted in the context of research publications, nor with participants experienced in reviewing academic manuscripts for peer-reviewed venues. Therefore, their findings offer limited insight into the specific issue of GenAI use in research manuscript writing. Our study addresses this gap by investigating experienced reviewers’ perceptions and misconceptions of manuscripts due to GenAI use. Through this investigation, we aim to reduce researchers’ concerns about negatively impacting reviewers’ perceptions and judgments, and encourage them to openly acknowledge their use of GenAI in future manuscripts. Given the increasing adoption of GenAI in research writing and the ethical need for research transparency, our research is crucial and urgent in charting a path for ethical and beneficial GenAI augmentation in research manuscript writing while avoiding detrimental consequences.

3. Methodology

To investigate reviewers’ perceptions of GenAI use in research writing, we employed a text snippet-based online survey. After obtaining Research Ethics Board approval [details omitted for blind review], we recruited 17 participants who have experience reviewing manuscripts for publication at top-tier HCI conferences, including the ACM CHI Conference on Human Factors in Computing Systems (CHI) and the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW). We refer to our participants as “reviewers” in the following sections. Reviewers were presented with six snippets tailored to their areas of expertise in HCI, chosen from 16 example human-written abstracts and 32 GenAI-augmented snippets. The six snippets were presented in a randomized sequence. This approach allowed us to explore reviewers’ perceptions of a wide range of topics with different levels of GenAI use without overwhelming them with a long survey. In this section, we describe our snippet design, survey development, participant recruitment, and data analysis procedure.

3.1. Study Material Construction

In research paper writing, GenAI is used in various ways, from recommending text and performing spelling or grammar corrections to generating entire sections (Association for Computing Machinery, 2023). To comprehensively evaluate reviewers’ perceptions, we presented each participant with three types of snippets (Content_Type):

(1) original: snippets written entirely by human authors.

(2) paraphrased: snippets rephrased with a GenAI by rewriting human-written text while preserving its original meaning.

(3) generated: snippets generated entirely with a GenAI by using human-written text as reference to ensure relevance to the original manuscript.

In this section, we discuss the selection of original human-written snippets, and the production of paraphrased and generated snippets using GenAI prompts.

3.1.1. Original Snippets

To ensure comprehensive coverage in our original snippets, we selected abstracts from example papers for the submission topics of the CHI 2023 conference (CHI’23, “Selecting a Subcommittee”, last accessed on March 19, 2024: https://chi2023.acm.org/subcommittees/selecting-a-subcommittee/), the premier venue for HCI research. As of June 7, 2024, CHI was ranked as the premier venue in human-computer interaction research, with an h5-index of 122, twice that of the second-ranked venue (see https://scholar.google.ca/citations?view_op=top_venues&hl=en&vq=eng_humancomputerinteraction). For each topic, we selected the most-cited paper published before the prevalent use of GenAI in November 2022 to ensure it was written by human researchers. When multiple papers had the same citation count, we subsequently selected papers based on download counts and the most recent publication date. This process resulted in a total of 16 abstracts as our original snippets. Details of these source papers are in Appendix C.
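The selection rule above (most cited, then most downloaded, then most recent) can be expressed as a lexicographic comparison. The following is a minimal sketch rather than the authors' actual tooling: the `Paper` record, the `select_source_papers` helper, the sample data, and the exact cutoff date of November 1, 2022 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one candidate example paper."""
    title: str
    topic: str
    citations: int
    downloads: int
    published: date


# Assumed cutoff: the text only says "November 2022".
GENAI_CUTOFF = date(2022, 11, 1)


def rank_key(paper):
    # Tuples compare element by element: citations first,
    # then downloads, then the most recent publication date.
    return (paper.citations, paper.downloads, paper.published)


def select_source_papers(papers, cutoff=GENAI_CUTOFF):
    """Return {topic: paper}, one pre-cutoff paper per topic."""
    best = {}
    for paper in papers:
        if paper.published >= cutoff:
            continue  # may post-date widespread GenAI use
        current = best.get(paper.topic)
        if current is None or rank_key(paper) > rank_key(current):
            best[paper.topic] = paper
    return best


# Made-up candidates, purely for illustration.
papers = [
    Paper("A", "Design", 120, 900, date(2019, 5, 2)),
    Paper("B", "Design", 120, 1500, date(2018, 4, 21)),
    Paper("C", "Health", 300, 400, date(2023, 4, 23)),
    Paper("D", "Health", 80, 700, date(2021, 5, 8)),
]
selected = select_source_papers(papers)
print({topic: p.title for topic, p in selected.items()})
```

In this made-up data, papers A and B tie on citations, so the download count decides between them, and paper C is skipped because it falls after the assumed cutoff.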

We chose to use abstracts due to three considerations. First, abstracts are crucial for research manuscripts as they comprehensively summarize the papers’ significance, research goals, methodology, findings, and contributions (Belcher, 2019). Second, in the early stage of a peer-review process, abstracts guide editors and reviewers in efficiently evaluating a manuscript (Belcher, 2019). Third, since we recruited experienced reviewers who are academia and industry professionals, using abstracts ensured our study was manageable and not overly time-consuming while still offering sufficient information for evaluating participants’ perceptions.

3.1.2. Paraphrased and Generated Snippets

The selected original snippets were then processed through a GenAI tool, Google Gemini (https://gemini.google.com/app), to create the corresponding paraphrased and generated snippets. We chose Gemini for its ability to provide comprehensive summaries, valuable suggestions, and rationales, as well as its transparency in disclosing limitations rather than fabricating content, which distinguish it from other GenAI tools such as ChatGPT (Tu et al., 2024).

Figure 1. Example prompt used for creating a paraphrased snippet from the original snippet.

Building upon literature on constructing GenAI prompts (Mollick and Mollick, 2023) and discussions with our research team of GenAI researchers and enthusiasts, we incorporated four components in our construction of the prompts for snippet processing:

(1) Goal: the goal of the prompt. For producing paraphrased snippets, we set the goal as “rephrase” the original snippet; for producing generated snippets, we set the goal as “improve” the paraphrased snippet to allow GenAI to maximize its creativity while ensuring content consistency.

(2) Step-by-step instruction: the detailed instruction that specifies expected GenAI behaviour step-by-step. For producing paraphrased snippets, we provided a guide based on best practices of abstract writing (Belcher, 2019). For producing generated snippets, we used two sequential prompts that guide GenAI to first generate a new snippet based on the paraphrased snippet and the introduction section of the paper, then refine its contribution statements based on the manuscript’s conclusion section.

(3) Context: the context information that facilitates the GenAI behaviours. For producing paraphrased snippets, the original snippet served as the context. For producing generated snippets, the paraphrased snippet and the corresponding manuscript’s introduction and conclusion sections were used.

(4) Constraints: to ensure consistency in length, we set a 150-word constraint for both paraphrased and generated snippets based on typical CHI submissions (see the section “Preparing and Submitting Your Paper” at https://chi2023.acm.org/for-authors/papers/).
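To make the four-component structure concrete, the sketch below assembles a prompt from a goal, step-by-step instructions, context, and a word-limit constraint. It is illustrative only: the `PromptSpec` and `build_prompt` names, the section labels, and the example instruction wording are assumptions and do not reproduce the prompts actually used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical container for the four prompt components."""
    goal: str
    steps: tuple
    context: str
    word_limit: int = 150  # the length constraint stated in the text


def build_prompt(spec):
    """Join the four components into one prompt string."""
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(spec.steps, start=1)
    )
    return (
        f"Goal: {spec.goal}\n\n"
        f"Instructions:\n{numbered}\n\n"
        f"Context:\n{spec.context}\n\n"
        f"Constraints: keep the result within {spec.word_limit} words."
    )


def within_limit(text, word_limit=150):
    """Whitespace-based word count check for a returned snippet."""
    return len(text.split()) <= word_limit


# Placeholder wording, not the study's actual instructions.
paraphrase_spec = PromptSpec(
    goal="Rephrase the abstract below while preserving its meaning.",
    steps=(
        "State the significance of the research.",
        "Summarize the goals, methodology, and findings.",
        "Close with the contributions.",
    ),
    context="<original snippet goes here>",
)
prompt = build_prompt(paraphrase_spec)
print(prompt)
```

A check such as `within_limit` could complement, but not replace, the manual review of content and length consistency described in the text.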

Researchers in our team reviewed the snippets to ensure consistency in content and length across the three content_types. Figure 1 and Figure 2 illustrate the prompt structure, and Appendix D provides examples of the snippet production process in Gemini. This approach ensures that the snippets derived from the same abstract maintain consistent length, level of detail, and content. In this way, we ensure that our reviewers assess the snippets based on variations in writing style, word choice, structure, and flow due to GenAI involvement, rather than differences in interpretations and opinions that naturally vary among human authors.

3.2. Survey Design

In this section, we provide a detailed description of our survey design. Figure 3 summarizes the survey flow. A complete set of questions is included in Appendix E.

3.2.1. Screening Questionnaire

The survey began with a study information sheet and consent form, followed by a screening questionnaire. Our screening targeted participants who have experience serving as reviewers in peer-reviewed HCI conferences. Participants had to be at least 18 years old, have previous experience as a reviewer or associate chair, and have encountered or suspected the undisclosed use of GenAI in submissions they reviewed.

3.2.2. Instruction and Presentation of Snippets

To ensure reviewers’ perceptions were related to their experience with GenAI, not conventional writing assistants, we first provided a description of GenAI’s functionality: “AI writing assistants can help researchers by suggesting phrasing, structuring sentences, and even generating initial drafts.” Reviewers then selected two research topics from the 16 CHI’23 topics (see Q1 & Q2 in Appendix E): one in which they were most knowledgeable and one in which they had the least knowledge. From each topic, we presented the original, AI-paraphrased, and AI-generated snippets from an example paper (as described in Section 3.1.1). This approach allowed us to compare how reviewers’ perceptions and judgements varied between content_types, and to investigate how their expertise influenced their perceptions. To avoid biasing reviewers, we did not disclose the content_type of each snippet. We told reviewers that the six snippets could be human-written or AI-processed, without confirming AI or human authorship. The three snippets from the same abstract were presented in random order. Since the snippets were from published papers, we included bold red text instructing reviewers not to search for the snippets in literature databases.
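The presentation logic can be sketched as follows. This is an assumed reconstruction rather than the survey platform's actual configuration: the `build_presentation` helper is hypothetical, and it assumes that the three snippets of one abstract are shown together as a block and shuffled within that block, which the text does not state explicitly.

```python
import random

CONTENT_TYPES = ("original", "paraphrased", "generated")


def build_presentation(most_topic, least_topic, rng=None):
    """Return six (topic, content_type) pairs in presentation order.

    The content_type labels are kept for later analysis only;
    they would not be shown to reviewers.
    """
    rng = rng or random.Random()
    sequence = []
    for topic in (most_topic, least_topic):
        block = [(topic, content_type) for content_type in CONTENT_TYPES]
        rng.shuffle(block)  # random order within one abstract
        sequence.extend(block)
    return sequence


# A seeded generator makes one reviewer's order reproducible.
order = build_presentation("Games and Play", "Health", random.Random(7))
for position, (topic, content_type) in enumerate(order, start=1):
    print(position, topic, content_type)
```

Passing a seeded `random.Random` instance keeps each reviewer's order reproducible for auditing while still varying the order across reviewers.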

3.2.3. Perceptions of the Snippets and the Research Presented

For each snippet, reviewers were asked to provide a more detailed rating of their expertise in the topic, using a scale from 0 (no knowledge or expertise in this field) to 10 (I am an expert in this field). We coded these responses as disciplinary_expertise in our statistical analysis. This question served three purposes. First, it clarified what “the most” and “the least” knowledgeable meant to each reviewer. Second, it captured cases in which reviewers mistakenly believed that a paraphrased or generated snippet came from a completely different abstract than its original. Third, it acted as an attention check. Reviewers who selected a topic they claimed to be most or least knowledgeable in but gave an opposite rating here showed a lack of attention to our instructions.

To determine if reviewers’ judgements on research integrity, value, and soundness varied because of the writing across the three content_types, we asked them to rate each snippet’s accuracy (perceived_accuracy), reliability (perceived_reliability), honesty (perceived_honesty), clarity (perceived_clarity), and compellingness (perceived_compellingness) in representing the research (International Center for Academic Integrity [ICAI], 2018). Reviewers rated these aspects on 5-point Likert scales, from 1 (strongly disagree) to 5 (strongly agree), following Longoni et al. (2022)’s study on readers’ perception of news headlines.

Next, we asked reviewers to rate their perceived level of AI involvement (perceived_AI_involvement) in each snippet’s writing process on a scale from 0 (completely human) to 10 (completely GenAI), inspired by the methodology from Draxler et al. (2024), which asked participants to select the possible author attribution from a set of randomized options. Our 0-to-10 scale offered finer granularity for reviewers to express their perceptions more accurately. For reviewers who suspected at least some degree of GenAI involvement (i.e., not completely human-written), we included a highlight question, asking them to highlight specific sentences they believed were AI-processed. After that, reviewers were asked in an open-ended question to share observations about the snippet’s style, structure, or content that influenced their perception of its authorship. The combination of these questions allowed us to identify specific segments that influenced reviewers’ judgments.
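One way to picture the per-snippet measures is as a single validated record. The sketch below is a hypothetical data structure, not the authors' analysis code: the `SnippetRating` class and its range checks are assumptions, while the field names and scale bounds follow the variables described above.

```python
from dataclasses import dataclass

LIKERT_ITEMS = (
    "perceived_accuracy",
    "perceived_reliability",
    "perceived_honesty",
    "perceived_clarity",
    "perceived_compellingness",
)


def _check_range(name, value, low, high):
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}-{high}")


@dataclass(frozen=True)
class SnippetRating:
    """One reviewer's ratings for one snippet (hypothetical layout)."""
    disciplinary_expertise: int    # 0 to 10
    perceived_accuracy: int        # 1 to 5 Likert
    perceived_reliability: int     # 1 to 5 Likert
    perceived_honesty: int         # 1 to 5 Likert
    perceived_clarity: int         # 1 to 5 Likert
    perceived_compellingness: int  # 1 to 5 Likert
    perceived_AI_involvement: int  # 0 (human) to 10 (GenAI)

    def __post_init__(self):
        for name in LIKERT_ITEMS:
            _check_range(name, getattr(self, name), 1, 5)
        _check_range("disciplinary_expertise",
                     self.disciplinary_expertise, 0, 10)
        _check_range("perceived_AI_involvement",
                     self.perceived_AI_involvement, 0, 10)

    @property
    def gets_highlight_question(self):
        # Shown only when the reviewer suspects some AI involvement.
        return self.perceived_AI_involvement > 0


rating = SnippetRating(7, 4, 4, 5, 3, 3, 0)
print(rating.gets_highlight_question)
```

Validating ranges at construction time would surface data-entry or export errors before any statistical analysis is run.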

To ensure data quality, we included an attention check question between the six snippets. The question asked reviewers to select a specific option. Reviewers who failed to select the designated option were excluded from our analysis for not following instructions.

3.2.4. General Perception of GenAI and Demographic Information

After all six snippets, we closed the survey with questions about reviewers’ general perceptions of GenAI writing. We asked about their views on the capability of human researchers (perceived_human_researcher_capability) and GenAI (perceived_AI_capability) in communicating research ideas and outcomes through writing. These questions aimed to assess the reviewers’ algorithmic aversion or appreciation (Hong, 2018; Burton et al., 2020; Graefe et al., 2018), as their negative or positive attitudes toward GenAI may influence their perceptions of the snippets. Finally, we asked reviewers about their demographic information, the estimated number of papers they had reviewed (peer-review_experience), and their use of GenAI in their own writing (AI_familiarity). We included these questions because AI background knowledge can influence perceptions (Ehsan et al., 2024), and people’s algorithmic aversion increases after witnessing AI mistakes (Dietvorst et al., 2015). Reviewers were also given an open-ended space for additional comments on our study before completing the survey.

3.3. Participants Recruitment and Demographics

Before distributing the survey, we piloted the questionnaire with five PhD students with peer-review experience and refined the language and question structure based on their feedback to improve clarity, comprehension, and conciseness. An a priori power analysis (Faul et al., 2007, 2009) for a within-subject Wilcoxon signed-rank test determined that a sample size of $N=15$ was needed, with an effect size of 0.8, a power of 0.8, and a margin for random error $\leq 5\%$. Following ethics approval, we recruited participants using a snowball sampling method in April and May 2024. Our research team reached out to CHI and CSCW conference committees for participation and assistance in distributing recruitment materials. This recruitment method was used due to the difficulty of recruiting reviewers, even in real peer-review processes (Henderson et al., 2020). We closed the survey on May 7, 2024, one month after receiving the last response, resulting in a total of 41 responses. Of these, we excluded 23 responses for completing less than 50% of the questions (11 only completed the consent form) and one for failing the attention check. Our final analysis was based on the remaining $N=17$ valid responses.
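The order of magnitude of the required sample size can be checked by hand. The sketch below is not the authors' power-analysis software run. It uses a normal approximation for a one-sample t-test with a commonly used small-sample correction term, then divides by the asymptotic relative efficiency (ARE) of the Wilcoxon signed-rank test under a normal parent distribution, 3/π. It assumes a two-sided test with α = 0.05 as one reading of the “margin for random error ≤ 5%”; the function name and defaults are illustrative.

```python
from math import ceil, pi
from statistics import NormalDist


def approx_wilcoxon_n(effect_size, alpha=0.05, power=0.8, are=3 / pi):
    """Approximate N for a paired Wilcoxon signed-rank test.

    Assumptions (not taken from the paper): two-sided test,
    normal approximation, normal parent distribution (ARE = 3/pi).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    # Normal-approximation N for a one-sample t-test, plus a
    # small-sample correction of z_alpha^2 / 2.
    n_t = ((z_alpha + z_power) / effect_size) ** 2 + z_alpha ** 2 / 2
    # The rank test needs slightly more data than the t-test.
    return ceil(n_t / are)


required_n = approx_wilcoxon_n(0.8)
valid_responses = 41 - 23 - 1  # total minus the two exclusion groups
print(required_n, valid_responses)
```

Under these assumptions the approximation yields 15, in line with the reported requirement, and the exclusion arithmetic gives the 17 valid responses. A dedicated power-analysis tool remains the authoritative source for the exact figure.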

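Table 1. Participant Reviewers’ (N=17) Demographic Information.

| Characteristic | Category | Value |
|---|---|---|
| Age | Range | 27-49 |
| Age | Mean | 34.52 |
| Age | SD | 5.62 |
| Gender | Woman | 9 (53%) |
| Gender | Man | 7 (41%) |
| Gender | Non-binary | 1 (6%) |
| Occupation | Professor | 6 (35%) |
| Occupation | Postdoctoral Researcher | 5 (29%) |
| Occupation | Graduate Researcher | 4 (24%) |
| Occupation | Industry Professional | 1 (6%) |
| Occupation | Other-freelancer | 1 (6%) |
| Education Level | Graduate or professional | 16 (94%) |
| Education Level | Bachelor | 1 (6%) |
| Area of Expertise* | Games and Play | 6 (35%) |
| Area of Expertise* | Interaction Techniques & Modalities | 3 (18%) |
| Area of Expertise* | Design | 2 (12%) |
| Area of Expertise* | Learning, Education, and Families | 2 (12%) |
| Area of Expertise* | Critical Computing, Sustainability, and Social Justice | 1 (6%) |
| Area of Expertise* | Health | 1 (6%) |
| Area of Expertise* | Specific Applications Areas | 1 (6%) |
| Area of Expertise* | Understanding People | 1 (6%) |
| AI Familiarity | Sometimes | 10 (59%) |
| AI Familiarity | Rarely | 2 (12%) |
| AI Familiarity | Never | 5 (29%) |
| Peer-review Experience | Range | 5-500 |
| Peer-review Experience | Mean | 110.94 |
| Peer-review Experience | SD | 152.93 |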

Note. *Research areas are based on the topics from the ACM CHI Conference on Human Factors in Computing Systems in 2023 (CHI’23) subcommittees.

Our study included 17 reviewers from premier HCI conferences, representing a range of experience levels and areas of expertise within the field. Table 1 summarizes their demographics: 53% were women, 41% were men, and one (6%) was non-binary. Reviewers were aged 27 to 49 and held post-secondary degrees (graduate or professional=94%, bachelor=6%), and most worked as academic researchers (graduate researchers=24%, postdoctoral researchers=29%, and professors=35%). The sample included both novice and senior reviewers across HCI sub-fields, with Games and Play being the most selected area of expertise (35%). In terms of personal GenAI use, 59% of reviewers reported sometimes using it for targeted research writing purposes, 12% rarely used it, and 29% had never used it. While our sample size is limited and may not be fully representative of the entire HCI reviewer community, it captures diverse perspectives and offers valuable insights into reviewer attitudes towards AI-augmented writing in HCI.

3.4. Data Analysis

We present our quantitative data analysis and corresponding results in section 4. For the qualitative open-ended question, two researchers conducted an inductive thematic analysis, following the established guidelines by Clarke et al. (2015). We first reviewed the data to familiarize ourselves with it and to ensure it contained no blank or incoherent responses to each question. We retained “N/A” responses, which represent an inability to differentiate human-written snippets from GenAI output. The two researchers independently coded 15% (n=16) of the total responses (N=102). We did not calculate inter-coder reliability, as it “prioritises uniformity over depth of insights” and often results in superficial themes, especially for studies with more than 20 codes (like ours) (Clarke et al., 2015, p. 303). Instead, the two researchers discussed and resolved conflicts in a meeting and created an initial codebook. This process was repeated twice, with each meeting addressing half of the remaining data, until the codebook was finalized and all data were coded. The finalized codebook served as the foundation for developing and refining the themes from our data. We present our codebook and themes in Appendix A.

4. Findings

4.1. RQ1: How Much Are Reviewers Aware of the Use of AI in the Context of Research Writing?

Table 2 shows the response distribution among the $N=17$ reviewers regarding their perceived_AI_involvement across the three content_types. Both the original human-written snippets and the AI-generated snippets received a median=5, with a mean=4.44 (SD=3.13) and mean=5.12 (SD=3.18), respectively. This result indicates that reviewers generally believed GenAI was similarly involved in both human-written and AI-generated snippets. This similarity reveals a general misconception about GenAI use in the snippets and suggests that reviewers had difficulty differentiating between AI-generated and human-written snippets. By comparison, the ratings for AI-paraphrased snippets were notably lower (median=2, mean=2.74, SD=2.61).

Table 2. Reviewers’ (N=17) Perceived AI Involvement (0-completely human to 10-completely AI) Across Content Types

To validate the observed differences in reviewers’ perceptions, we performed a Friedman test (Friedman, 1937) and confirmed significant within-subject differences across the three types of snippets ( $\chi^{2}=6.92,df=2,P=0.03$ ). We further conducted post-hoc pairwise Wilcoxon comparisons (Woolson, 2007) with Bonferroni correction (Chen et al., 2017) (see Table 2). The results show that, compared to AI-generated snippets, reviewers perceived significantly lower AI involvement in AI-paraphrased snippets ( $W=92,P=0.01,r=-0.60$ ). There was no significant difference in reviewers’ perceptions between AI-generated and human-written snippets ( $W=80,P=0.55$ ). Additionally, no significant difference was found between reviewers’ perceptions of human-written and AI-paraphrased snippets ( $W=26.5,P=0.06$ ). The validity of these results is further supported by our reviewers’ qualitative responses, several of whom indicated they were confused about which snippets were AI- or human-written.
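For readers who want to see how such a test statistic arises, the following is a minimal sketch of the Friedman test in plain Python. The `ratings` values are made up for illustration and are not the study data; the function names are our own; and the sketch omits the tie correction that statistical packages typically apply. Because the test here has $df=2$, the upper-tail chi-square probability reduces to $e^{-\chi^{2}/2}$, which lets us check the reported $\chi^{2}=6.92$ against the reported $P=0.03$.

```python
import math


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi2(rows):
    """Friedman statistic (without tie correction) for n rows of k repeated measures."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Hypothetical ratings (0-10), one row per reviewer:
# columns are human-written, AI-paraphrased, AI-generated.
ratings = [
    [5, 2, 6],
    [4, 3, 7],
    [6, 1, 5],
    [3, 2, 8],
    [5, 4, 4],
]

chi2 = friedman_chi2(ratings)
p_value = chi2_sf_df2(chi2)

# Consistency check on the statistic reported in the text.
reported_p = chi2_sf_df2(6.92)  # about 0.031, which rounds to the reported 0.03
```

A full analysis would use an established statistics package, which also handles ties and the post-hoc pairwise comparisons.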

4.2. RQ2: How Much Is Reviewers’ Judgement of Research and Manuscript Influenced by the Use of AI in Its Writing?

Table 3 presents the distribution of reviewers’ judgments across the three content_types. The results show that reviewers’ responses were mainly neutral, with mean ratings ranging from 3.29 (SD=1.12) to 3.82 (SD=0.80), and there were no sizeable differences in reviewers’ perceptions of accuracy, reliability, honesty, clarity, and compellingness.

To further validate our observations, we conducted a Friedman test (Friedman, 1937) and found no significant within-subject differences in reviewers’ perceptions across the three content_types. We suspect that this result arose because our reviewers exhibited neither algorithmic aversion nor appreciation, but held a neutral opinion towards GenAI. To validate this, we conducted a within-subject Wilcoxon signed-rank analysis (Woolson, 2007) with Bonferroni correction (Chen et al., 2017) to compare reviewers’ perceived_human_researcher_capability (mean=4.35, SD=0.79) and perceived_AI_capability (mean=4.06, SD=0.97). The results showed no significant difference in reviewers’ perceptions of AI and human researchers’ writing abilities (W=31.5, P=0.28, r=-0.26). Although the effect size is small, the validity of this result is supported by our reviewers’ lower perceived AI involvement and higher perceived honesty in AI-paraphrased snippets in subsection 4.3 and by their qualitative responses, which highlighted the advantages and weaknesses of both AI and human writing, in subsection 4.4.
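The effect sizes reported alongside the Wilcoxon tests can be related to their P values with a short sketch. We assume here, since the text does not state it, that $r$ follows the common convention $r=Z/\sqrt{N}$ with $N=17$, and that P comes from a two-sided normal approximation; the function names are illustrative only.

```python
import math


def z_from_r(r, n):
    """Recover the standardized test statistic Z from r = Z / sqrt(n) (assumed convention)."""
    return r * math.sqrt(n)


def two_sided_p_from_z(z):
    """Two-sided P value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


N_REVIEWERS = 17

# AI-generated vs. AI-paraphrased snippets: reported r = -0.60, P = 0.01
p_paraphrased = two_sided_p_from_z(z_from_r(-0.60, N_REVIEWERS))

# Perceived human vs. AI writing capability: reported r = -0.26, P = 0.28
p_capability = two_sided_p_from_z(z_from_r(-0.26, N_REVIEWERS))
```

Under these assumptions, both recovered P values round to the reported figures, which suggests the effect sizes and significance levels are mutually consistent.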

AI-authorship Markers			Human-authorship Markers
“foster…”	“…, do-ing…”	“By…”	contractions e.g., can’t, aren’t.
“leverage”	“a suite of”	“fueled”	parentheses “()”
“bridge this gap”	“human-centered”	“pave the way…”	use “:” instead of a clause
“yet”	“While…”	“ultimately…”
“neglecting…”	“utmost important”	“seamless…”
“However,…”	“multimodal”	“go beyond…”
“thereby do-ing…”	“revealing”	“humanized technological future”
“envision”	“Contemperary ”	contractions e.g., can’t, aren’t.
“state-of-the-art”	use “:” instead of a clause	“struggle”
