LLM-Collaboration on Automatic Science Journalism
for the General Audience

Gongyao Jiang Xinran Shi
The Hong Kong University of
Science and Technology (Guangzhou)
[email protected]
&Qiong Luo
The Hong Kong University of
Science and Technology (Guangzhou) /
The Hong Kong University of
Science and Technology
[email protected]
Corresponding Author

Abstract

Science journalism reports current scientific discoveries to non-specialists, aiming to enable public comprehension of the state of the art. However, this task can be challenging as the audience often lacks specific knowledge about the presented research. To address this challenge, we propose a framework that integrates three LLMs mimicking the real-world writing-reading-feedback-revision workflow, with one LLM acting as the journalist, a smaller LLM as the general public reader, and the third LLM as an editor. The journalist’s writing is iteratively refined by feedback from the reader and suggestions from the editor. Our experiments demonstrate that by leveraging the collaboration of two 7B and one 1.8B open-source LLMs, we can generate articles that are more accessible than those generated by existing methods, including advanced models such as GPT-4.

Gongyao Jiang and Xinran Shi The Hong Kong University of Science and Technology (Guangzhou) [email protected] Qiong Luo^†^†thanks: Corresponding Author The Hong Kong University of Science and Technology (Guangzhou) / The Hong Kong University of Science and Technology [email protected]

1 Introduction

Science journalism creates journalistic content that covers a wide range of scientific research, enhancing the public’s understanding of science (Göpfert, 2008; Allan, 2011; Angler, 2017). However, with rapid advances in various disciplines, science journalism struggles to keep pace with the exponential growth of knowledge. In response, automatic science journalism (ASJ) has been proposed to expedite the filtering, learning, and communication of scientific knowledge (Dangovski et al., 2021).

The essence of ASJ lies in elucidating complex technical content for readers, thereby facilitating their comprehension of advanced research (Cardenas et al., 2023). However, ASJ-generated content can be challenging for the general audience who lack in-depth knowledge of specific fields. As depicted in Figure 1, the degree to which content is embraced varies among readers with different levels of domain knowledge (August et al., 2024), underscoring the need for high readability to broaden a diverse readership. Some researchers have developed parallel corpora (Dangovski et al., 2021; Goldsack et al., 2022; Cardenas et al., 2023), where the target content is extracted from online scientific news or journals. However, these press releases often remain technical, likely because they are originally tailored for professional researchers rather than the general audience. Models trained on such content struggle to generate materials easily understandable for a broader audience.

Refer to caption — Figure 1: Reader experience varies with content technicality. Science journalism for the general audience demands high accessibility.

Large language models (LLMs) have shown impressive proficiency in instruction adherence and content generation (Achiam et al., 2023; Bai et al., 2023), thereby making them potential tools for ASJ. Furthermore, LLMs have exhibited social intelligence (Park et al., 2023), enabling them to play realistic roles and collaborate in real-world tasks (Qian et al., 2023; Talebirad and Nadiri, 2023). Motivated by these observations, we propose a novel framework that leverages LLMs as communicative agents to collectively accomplish the ASJ task.

Our goal is to automatically generate a popular science article based on a technical paper and make it accessible to the general public. In the real world, a journalist typically receives revision suggestions from a professional editor; to make it accessible to the general public, the journalist could pass their article to friends without domain expertise on the topic to get feedback for revision. Therefore, we design our framework to simulate such real-world scenarios, in which three LLMs collaborate to go through a four-step iterative process, which includes writing, reading, feedback, and revision, to generate highly accessible popular science articles. The questions for our framework are then (1) how the reader gives feedback; (2) how the editor gives suggestions based on the reader feedback; (3) how the journalist takes suggestions to revise the generated article; and (4) whether this iterative process improves the quality, especially accessibility, of the generated article.

As depicted in Figure 2, we have an LLM serve as the journalist writing for readers who lack domain knowledge of the given paper. We have another LLM, smaller than the journalist LLM, which acts as a general reader, to read the generated article and take notes to give reading feedback. As a less proficient model, the reader LLM needs material that is easily understandable to take comprehensive notes on. Therefore, the more accessible the explanations in the written article are, the greater the clarity and accuracy of the notes will exhibit.

LLMs have shown the capability of evaluating the quality of text (Chan et al., 2023; Zheng et al., 2024a; Desmond et al., 2024). Therefore, we let an editor (the third LLM) evaluate the correctness and comprehensiveness of the reader’s notes and then provide suggestions for the revision of the journalist’s article. The journalist then revises the previous version of the article based on the suggestions. By this iterative and tuning-free process, the popular science article is enhanced continuously and made more accessible to a general audience. To the best of our knowledge, our work is the first comprehensive study on LLMs for ASJ.

To assess the proposed method for ASJ, we employ both automatic metrics and human evaluation on measures including readability, information conveyance, authenticity, and interestingness of our generated articles. Compared with other methods, including those with fine-tuning and prompting on various LLMs, our proposed method achieves the highest readability while remaining competitive on the other measures. We also provide a detailed analysis, including ablation studies of removing the editor LLM, removing the reader LLM, or removing both, as well as trend analysis and case studies, to offer a comprehensive understanding of LLMs in the ASJ task.

In brief, we make the following contributions:

•

A novel ASJ framework with collaborative LLMs, generating articles of high readability.
•

Comprehensive experiments, analyses, and recommendations for LLM usage in ASJ.

2 Methodology

Following Dangovski et al. (2021); Goldsack et al. (2022); Cardenas et al. (2023), ASJ aims to automatically distill a scientific paper into an article accessible to a broader audience. Our ASJ framework employs an iterative workflow of writing, reading, suggestion-making, and revision, as illustrated in Figure 2. All prompts for each LLM agent are listed in Appendix A.

2.1 The LLM Journalist

LLMs have shown strong writing abilities (Yuan et al., 2022; Wasi et al., 2024). Thus, they are promising tools for rewriting a provided paper into a more accessible version. Following established strategies (Zheng et al., 2024a; Zhang et al., 2024), we start with prompting an LLM to assume the role of a journalist. Subsequently, the LLM is prompted that, given the paper, its task is to compose an article for the general public.

2.2 The LLM Reader

In our preliminary attempts, we asked the journalist LLM to directly assess the readability of the content. However, the results are unsatisfactory, probably due to the gap between human and model perceptions of reading difficulty. As illustrated in the two text boxes at the bottom of Figure 3, LLMs may regard both pieces of writing are of a similar level of readability, as they all incorporate essential information, even if the terminology “low-cost paper-based microfluidic diagnostic tests” on the left side is not clearly explained. However, in the eyes of a human reader, the content on the right is perceived as more accessible.¹¹1We briefly document other failed attempts in Appendix B for reader information.

To address this readability assessment problem, we design a separate reader LLM to read the content and generate reading notes. Our idea is inspired by the accumulation of errors, a common phenomenon in pipeline systems (Caselli et al., 2015; Wu et al., 2018; Jiang et al., 2023b; Dziri et al., 2023). Specifically, we can utilize the propagation from the textual readability of the journalist’s article to the reading comprehensibility of the reader in the writing-reading pipeline, to induce the readability of the generated article to become explicit.

Different from the LLM journalist, the reader LLM is of a smaller scale and simulates a general reader with limited domain knowledge. Once presented with the article crafted by the journalist, the reader LLM is employed to read the article and take notes. Specifically, we instruct the reader LLM to explain key terms in the article by extracting the explanations, if present, directly from the article or offering explanations for these terms, otherwise.

Intuitively, if the article is more accessible, the reading notes will be more comprehensive. For instance, in Figure 3, if the piece (left) lacks a detailed explanation of the term “paper-based microfluidic diagnostic test”, the reader will only note this term as being “combined with deep learning”. If the article (right) explains the usage and advantages of this term in plain and readable language, the reader LLM can grasp this knowledge. Through this readability propagation from the journalist’s article to the comprehensiveness of the reader’s notes, the editor LLM can better recognize issues in the journalist’s writing and then provide suitable suggestions for modifications.

2.3 Automated Suggestions and Revisions

LLMs have demonstrated strong capabilities in serving as evaluators, widely utilized in various generative tasks (Chan et al., 2023; Zheng et al., 2024a; Desmond et al., 2024). Therefore, we employ an LLM as a senior editor for automated evaluation of reader comprehension and providing recommendations for article enhancement. Given the article from the journalist and notes from the reader, an LLM editor is tasked with assessing the quality of the reader’s notes and identifying issues in the journalist’s writing, which may lead to reading obstacles.

Next, the editor offers advice for the journalist’s content development. For example, in Figure 2, the editor finds that the reader’s understanding of the term “blockchain” is limited, possibly due to an insufficient explanation in the reading material. To address this perceived issue, the editor suggests that the article should “explain technical terms”. These suggestions are then incorporated into the instructions that will guide the journalist in revising the article.

Subsequently, with the strong ability to follow instructions, the journalist LLM rewrites the article according to the suggestions. Then, the revised piece is fed to the reader for reading and taking notes to continue the process. By an iterative cycle encompassing writing, note-taking, suggesting modifications, and revision among three LLMs, the article tailored for the general readership undergoes steady enhancement. We further analyze this process in Section 4.

3 Experiments

3.1 Settings

Datasets. We use three publicly available corpora in different disciplines as benchmarks, namely SCITech, eLife, and PLOS. Appendix C presents a brief introduction and some statistics of these datasets. The same as the previous studies, in SCITech, we use 1431 instances for training and validation and the remaining 1000 for testing. We also separate each of eLife and PLOS datasets into training, validation, and testing splits at a ratio of 90%/5%/5%.

	SCITech			eLife			PLOS
Approach	CLI $\downarrow$	FKGL $\downarrow$	DCRS $\downarrow$	CLI $\downarrow$	FKGL $\downarrow$	DCRS $\downarrow$	CLI $\downarrow$	FKGL $\downarrow$	DCRS $\downarrow$	Avg.
LLaMA-2-7B	15.13	13.79	10.38	15.16	14.03	10.50	15.36	14.28	10.54	13.24
Gemma-7B	14.93	13.75	10.52	15.01	12.08	11.03	15.52	12.29	10.92	12.89
Mistral-7B	14.90	13.54	10.82	14.61	11.72	10.85	15.38	11.98	11.21	12.78
Qwen-1.5-7B	14.77	13.50	10.72	14.72	11.83	10.92	15.06	11.94	11.09	12.73
LLaMA-3-8B	14.84	13.18	10.41	14.55	11.65	10.49	15.18	12.01	10.88	12.58
Mixtral-8x7B	13.98	13.25	10.36	14.21	12.01	10.28	15.34	11.58	10.98	12.44
Qwen-1.5-72B	13.78	13.10	10.25	14.17	12.09	10.35	15.18	11.75	10.62	12.37
GPT-3.5-Turbo	14.98	13.62	10.81	14.35	11.87	10.98	15.11	11.92	10.87	12.72
GPT-4	13.48	12.13	10.14	13.96	10.87	10.11	14.86	11.78	10.47	11.98
BART-FT	13.43	15.22	10.66	12.32	10.65	9.19	15.61	14.24	10.51	12.43
Qwen-1.5-7B-FT	13.37	14.79	10.48	12.15	10.63	9.12	15.54	13.95	10.58	12.29
LLM-WS-CLBR	12.94	13.33	10.33	12.04	9.85	9.04	13.15	11.48	10.17	11.37
LLM-CLBR	12.69	10.16	9.79	11.60	10.10	9.46	12.74	10.00	9.69	10.69
Reader: 1.8B $\rightarrow$ 7B	12.81	10.35	9.68	11.82	10.01	9.51	12.67	9.93	9.78	10.73
$-$ Reading Notes	13.21	10.63	10.33	12.22	10.78	10.02	13.35	10.59	10.25	11.26
$-$ Suggestions	13.25	10.69	10.39	12.17	10.83	10.08	13.31	10.74	10.42	11.32
$-$ Collaboration	13.50	11.01	10.71	12.47	10.99	10.41	13.65	10.91	10.70	11.59
Paper Abstracts	16.67	15.27	11.39	17.53	15.35	11.87	16.38	14.98	11.10	14.50
Plain Summaries	14.23	14.79	11.13	12.52	10.91	8.94	15.90	14.76	10.91	12.68

Table 1: The results of automated evaluation. We tested various methods, including the open-source LLM prompting, the closed LLM prompting, fine-tuning, and collaboration of LLMs (ours).

Approach	Read.	Info.	Auth.	Intr.
Within Field
Plain Summaries	2.95	2.90	3.35	2.70
Qwen1.5-7B	3.50	3.35	3.40	3.10
GPT-4	3.80	3.75	3.80	3.40
LLM-CLBR	3.95	3.60	3.70	3.55
Outside Field
Plain Summaries	2.75	2.85	3.25	2.65
Qwen1.5-7B	3.35	3.10	3.30	3.10
GPT-4	3.40	3.55	3.70	3.15
LLM-CLBR	3.65	3.40	3.55	3.20

Table 2: Results of human evaluation, where ‘Read.’ indicates ‘Readability,’ ‘Info.’ denotes ‘Information Conveyance,’ ‘Auth.’ represents ‘Authenticity,’ and ‘Intr.’ signifies ‘Interestingness’.

Methods For Comparison.

•

BART. Goldsack et al. (2022); Cardenas et al. (2023) used BART (Lewis et al., 2020) for ASJ, showing strong performance.
•

LLMs. We test the performance of various LLMs, including both open-source and closed LLMs, i.e., LLaMA-2-7B (Touvron et al., 2023), Gemma-7B (Google, 2024), Mistral (7B, 8x7B, Jiang et al., 2023a), Qwen1.5 (7B, 72B, Bai et al., 2023), LLaMA-3-8B (Meta, 2024), GPT-3.5-Turbo-1106 (OpenAI, 2023), and GPT-4-1106-preview (Achiam et al., 2023). We prompt these LLMs and also fine-tune the Qwen1.5-7B model for ASJ.
•

Our Methods. We test two versions of collaborating LLMs. One is 2 $\times$ Qwen1.5-7B+Qwen1.5-1.8B (LLM-CLBR, CLBR for ‘collaboration’). Another is this combination where the journalist is replaced by a fine-tuned version as a warm start (LLM-WS-CLBR).

Automatic Evaluation. Following Goldsack et al. (2022); Cardenas et al. (2023), we use Coleman-Liau Index (CLI), Flesch-Kincaid Grade Level (FKGL) and Dale-Chall Readability Score (DCRS) to automatically assess the readability. CLI considers the count of sentences, words, and characters, while FKGL is based on the number of sentences, words, and syllables. DCRS assesses readability by analyzing the average sentence length and the presence of familiar words from a list of the most commonly used words.

Human Evaluation. Automatically assessing the authenticity and informativeness of content has been a challenging task. Cardenas et al. (2023) used QuestEval (Scialom et al., 2021) to assess the faithfulness of ASJ-generated content, yet the results exhibited significant variances. Therefore, human evaluation remains the main method for such assessments. We enlist four human participants for evaluation. All of them are including master’s students or holders of master’s degrees, two from computer science and two from biomedical science. Specifically, we sample 10 pairs of original papers and generated articles in computer science from SCITech, as well as 10 pairs in biomedical science pairs, 5 pairs from eLife and 5 pairs from PLOS.

We choose four representative methods for human evaluation: (1) plain summaries by human writers, (2) Qwen1.5-7B generation, (3) GPT-4 generation, and (4) generation by our LLM-Collaboration method. The human evaluation encompasses multiple dimensions, namely Readability, Information Conveyance, Authenticity, and Interestingness. Participants are tasked with evaluating the articles using a 1-5 Likert scale (Likert, 1932), grounded on specific questions. Each participant is assigned to assess all articles both in the field they are familiar with and those they are not familiar with, to provide a genuine evaluation from readers within the specific discipline and general readers. Appendix D shows the details of these measures and the questionnaire form.

Hyperparameters. We list hyperparameters in Appendix C for brevity.

3.2 Automatic Evaluation

The results in Table 1 show the comparison of different methods. Recent LLMs of similar scales have shown comparable performance, surpassing the LLaMA-2 introduced in 2023. Larger LLMs such as Mixstral-8x7B and Qwen1.5-72B show even better performance, indicating that performance improves as the model scale increases. Additionally, the formidable LLM GPT-4 outperforms all the other single LLMs. These findings demonstrate their performance on ASJ consistent with the capacity of LLMs. The fine-tuning methods exhibit competitive performance, slightly superior to prompting the open-source LLMs. Fine-tuning a model with a larger scale (Qwen1.5-7B) outperforms fine-tuning BART, in line with prompting.

Our collaboration with LLMs has demonstrated significant improvement over previous methods. Interestingly, it can be observed that fine-tuning the journalist LLM to warm up does not lead to any noticeable improvement. One possible reason may be the decay in the ability to follow instructions after this specialized training. Nonetheless, these empirical results show the effectiveness of our framework that integrates multiple LLMs.

Scientific Paper. In infectious disease diagnosis, results need to be communicated rapidly to healthcare professionals once testing has been completed so that care pathways can be implemented. This represents a particular challenge when testing in remote, low-resource rural communities, in which such diseases often create the largest burden. Here, we report a smartphone-based end-to-end platform for multiplexed DNA diagnosis of malaria. The approach uses a low-cost paper-based microfluidic diagnostic test, which is combined with deep learning algorithms for local decision support and blockchain technology for secure data connectivity and management. We validated the approach via field tests in rural Uganda, where it correctly identified more than 98% of tested cases. Our platform also provides secure geotagged diagnostic information, which creates the possibility of integrating infectious disease data within surveillance frameworks.
Press Release. Researchers at the U.K.ś University of Glasgow and Ugandaś Ministry of Health have developed easy-to-administer blood tests that could help curtail the spread of malaria in sub-Saharan Africa. These "origami" tests can be conducted in the field, with a blood sample taken via fingerprick placed in a channel made from sheets of folded wax paper. The blood is directed into three small chambers of a loop-mediated isothermal amplification (LAMP) machine, which tests for the presence of the mosquito-borne parasite that causes malaria. A smartphone app uses deep learning to ensure the tests are accurate and to monitor community transmission. The results are stored on a blockchain-based ledger to maintain privacy, with anonymized data shared with local authorities. In Uganda, the blood tests were shown to be 98% accurate.
Article Written by Our Method. In the fight against malaria, timely diagnosis is crucial, especially in rural communities where these diseases often hit hardest. Researchers have come up with a brilliant solution using smartphones! It’s a smartphone-based system that combines a low-cost test with deep learning and blockchain technology. First, imagine a tiny paper device that can detect malaria DNA right from blood samples. This microfluidic test is as simple as using a piece of paper - no fancy lab equipment needed! Then, the results are analyzed by smart algorithms on your phone. These "AI helpers" give instant feedback to healthcare workers without delay. All this diagnostic data is stored securely on blockchain. This means information is protected and cannot be tampered with. In Uganda, field tests showed that this system was almost perfect - it correctly identified 98% of cases! The best part? The platform also records each diagnosis’s location through geotags. This opens up the possibility to connect local disease data to bigger surveillance networks, helping us track and combat malaria more efficiently. So next time you hear about remote villages facing health challenges due to lack of fast diagnosis tools, remember: there’s hope with these smartphone-powered innovations!

Scientific Paper. In infectious disease diagnosis, results need to be communicated rapidly to healthcare professionals once testing has been completed so that care pathways can be implemented. This represents a particular challenge when testing in remote, low-resource rural communities, in which such diseases often create the largest burden. Here, we report a smartphone-based end-to-end platform for multiplexed DNA diagnosis of malaria. The approach uses a low-cost paper-based microfluidic diagnostic test, which is combined with deep learning algorithms for local decision support and blockchain technology for secure data connectivity and management. We validated the approach via field tests in rural Uganda, where it correctly identified more than 98% of tested cases. Our platform also provides secure geotagged diagnostic information, which creates the possibility of integrating infectious disease data within surveillance frameworks.

Press Release. Researchers at the U.K.ś University of Glasgow and Ugandaś Ministry of Health have developed easy-to-administer blood tests that could help curtail the spread of malaria in sub-Saharan Africa. These "origami" tests can be conducted in the field, with a blood sample taken via fingerprick placed in a channel made from sheets of folded wax paper. The blood is directed into three small chambers of a loop-mediated isothermal amplification (LAMP) machine, which tests for the presence of the mosquito-borne parasite that causes malaria. A smartphone app uses deep learning to ensure the tests are accurate and to monitor community transmission. The results are stored on a blockchain-based ledger to maintain privacy, with anonymized data shared with local authorities. In Uganda, the blood tests were shown to be 98% accurate.

Article Written by Our Method. In the fight against malaria, timely diagnosis is crucial, especially in rural communities where these diseases often hit hardest. Researchers have come up with a brilliant solution using smartphones! It’s a smartphone-based system that combines a low-cost test with deep learning and blockchain technology. First, imagine a tiny paper device that can detect malaria DNA right from blood samples. This microfluidic test is as simple as using a piece of paper - no fancy lab equipment needed! Then, the results are analyzed by smart algorithms on your phone. These "AI helpers" give instant feedback to healthcare workers without delay. All this diagnostic data is stored securely on blockchain. This means information is protected and cannot be tampered with. In Uganda, field tests showed that this system was almost perfect - it correctly identified 98% of cases! The best part? The platform also records each diagnosis’s location through geotags. This opens up the possibility to connect local disease data to bigger surveillance networks, helping us track and combat malaria more efficiently. So next time you hear about remote villages facing health challenges due to lack of fast diagnosis tools, remember: there’s hope with these smartphone-powered innovations!

Table 3: An example of the original paper, excerpts from the manually written press release, and the article generated by our method. We highlight statements that we regard easy to understand and explanations for technical details.

1st Iteration. 1. The mobile lab on your phone is a portable, paper-based test that uses deep learning algorithms to analyze blood samples for the presence of malaria. 2. The end-to-end system combines a smartphone app, allowing healthcare workers to make informed decisions about treatment without delay. 3. The smartphone app uses deep learning algorithms to interpret the test results, providing accurate diagnoses over 98% in field tests. 4. The platform ensures secure data management by using blockchain technology, which records transparently and cannot be tampered with.
2nd Iteration. 1. Smartphones and Blockchain: The article introduces the use of smartphones and blockchain technology in diagnosing diseases like malaria, particularly in remote areas where traditional diagnostic methods may not be available or accessible. 2. Malaria Diagnosis: The low-cost paper-based microfluidic diagnostic test uses a piece of tiny paper device to detect malaria DNA. It allows rapid detection with the support of AI systems. 3. Low-Cost Test: The AI-based test is referred to as a "smart decision support" system, which uses advanced technology such as artificial intelligence to interpret the results right on the user’s device. 4. Artificial Intelligence (AI): The AI acts like a smart detective, quickly identifying potential cases with an accuracy rate of 98% in real-world tests, enabling healthcare workers to make informed decisions about treatment without needing fancy labs or waiting too long.
3rd Iteration. 1. Smartphone: A device with advanced features such as cameras, sensors, and processors that can perform various tasks, including medical diagnosis. 2. Blood test: A diagnostic procedure where blood is collected from a finger or other body part through a small needle and analyzed for specific signs of malaria (e.g., anemia or fever). 3. AI as personal health assistant: An artificial intelligence system designed to assist individuals in managing their health by providing personalized recommendations based on medical data analysis (e.g., detecting early signs of disease). 4. Secure digital diary: A digital record containing sensitive information about an individual’s health status stored on a secure blockchain network to ensure accuracy and prevent unauthorized access or manipulation. 5. Blockchain technology: A distributed ledger system that allows secure sharing of data across multiple parties without the need for intermediaries or centralized authorities (e.g., storing patient records in hospitals). In this case, it serves as the secure digital ledger for tracking where diseases are spreading in remote areas due to malaria prevalence.

1st Iteration. 1. The mobile lab on your phone is a portable, paper-based test that uses deep learning algorithms to analyze blood samples for the presence of malaria. 2. The end-to-end system combines a smartphone app, allowing healthcare workers to make informed decisions about treatment without delay. 3. The smartphone app uses deep learning algorithms to interpret the test results, providing accurate diagnoses over 98% in field tests. 4. The platform ensures secure data management by using blockchain technology, which records transparently and cannot be tampered with.

2nd Iteration. 1. Smartphones and Blockchain: The article introduces the use of smartphones and blockchain technology in diagnosing diseases like malaria, particularly in remote areas where traditional diagnostic methods may not be available or accessible. 2. Malaria Diagnosis: The low-cost paper-based microfluidic diagnostic test uses a piece of tiny paper device to detect malaria DNA. It allows rapid detection with the support of AI systems. 3. Low-Cost Test: The AI-based test is referred to as a "smart decision support" system, which uses advanced technology such as artificial intelligence to interpret the results right on the user’s device. 4. Artificial Intelligence (AI): The AI acts like a smart detective, quickly identifying potential cases with an accuracy rate of 98% in real-world tests, enabling healthcare workers to make informed decisions about treatment without needing fancy labs or waiting too long.

3rd Iteration. 1. Smartphone: A device with advanced features such as cameras, sensors, and processors that can perform various tasks, including medical diagnosis. 2. Blood test: A diagnostic procedure where blood is collected from a finger or other body part through a small needle and analyzed for specific signs of malaria (e.g., anemia or fever). 3. AI as personal health assistant: An artificial intelligence system designed to assist individuals in managing their health by providing personalized recommendations based on medical data analysis (e.g., detecting early signs of disease). 4. Secure digital diary: A digital record containing sensitive information about an individual’s health status stored on a secure blockchain network to ensure accuracy and prevent unauthorized access or manipulation. 5. Blockchain technology: A distributed ledger system that allows secure sharing of data across multiple parties without the need for intermediaries or centralized authorities (e.g., storing patient records in hospitals). In this case, it serves as the secure digital ledger for tracking where diseases are spreading in remote areas due to malaria prevalence.

Table 4: Notes taken from the reader, two technical terms and their explanations are highlighted. The notes become more detailed and comprehensive during the iterative process.

1st Iteration. 1. Simplify technical terms 2. Break down processes 3. Emphasize benefits 4. Conclusion statement
2nd Iteration. 1. Emphasize technology: Instead of listing specific platforms (Ethereum or Hyperledger Fabric), mention that a secure digital ledger is used without specifying which one. 2. Simplify medical terms: For "Plasmodium falciparum," provide a simpler explanation or use a common term like "malaria parasite type." 3. Break down process: Explain how smartphone analysis works in more detail - perhaps by describing how it compares to traditional lab methods. 4. Connect to equity: Highlight how this technology addresses health disparities by providing quick diagnosis in remote areas.
3rd Iteration. 1. Emphasize simplicity: For accessibility, rephrase "low-cost paper-based microfluidic diagnostic test" as "affordable, easy-to-use test with a paper strip." 2. Explain AI in simpler terms: Instead of "AI instantly interprets results," say "The smartphone app quickly analyzes the data to give a diagnosis." 3. Break down data security: Highlight that information is stored securely on a phone or cloud server with strong passwords or encryption. 4. Quantify success: Mention that 98% accuracy rate is exceptional but could be framed as an impressive achievement ("This system detected almost all cases correctly!"). 5. Cite real-life impact: Share examples of how this technology has made a difference in remote communities to connect it emotionally with readers.

1st Iteration. 1. Simplify technical terms 2. Break down processes 3. Emphasize benefits 4. Conclusion statement

2nd Iteration. 1. Emphasize technology: Instead of listing specific platforms (Ethereum or Hyperledger Fabric), mention that a secure digital ledger is used without specifying which one. 2. Simplify medical terms: For "Plasmodium falciparum," provide a simpler explanation or use a common term like "malaria parasite type." 3. Break down process: Explain how smartphone analysis works in more detail - perhaps by describing how it compares to traditional lab methods. 4. Connect to equity: Highlight how this technology addresses health disparities by providing quick diagnosis in remote areas.

3rd Iteration. 1. Emphasize simplicity: For accessibility, rephrase "low-cost paper-based microfluidic diagnostic test" as "affordable, easy-to-use test with a paper strip." 2. Explain AI in simpler terms: Instead of "AI instantly interprets results," say "The smartphone app quickly analyzes the data to give a diagnosis." 3. Break down data security: Highlight that information is stored securely on a phone or cloud server with strong passwords or encryption. 4. Quantify success: Mention that 98% accuracy rate is exceptional but could be framed as an impressive achievement ("This system detected almost all cases correctly!"). 5. Cite real-life impact: Share examples of how this technology has made a difference in remote communities to connect it emotionally with readers.

Table 5: Suggestions provided by the editor LLM are becoming increasingly specific over the iterative process.

3.3 Human Evaluation

For a thorough evaluation, we carry out a human assessment on four representative methods. We request that participants assess articles relevant to their fields as well as those in their unfamiliar fields. The results are reported in Table 2, categorized into within-field and outside-field articles. Notably, participants assign lower ratings to articles outside their expertise, possibly due to the inherent comfort and familiarity bias elevating subjective scores. Despite the evident discrepancies arising from differing reader familiarity contexts, the method comparisons and overarching trends appear consistent across both settings.

Interestingly, all LLM-based methods outperform the plain summaries written by humans, probably because these summaries remain technical, providing a poor reading experience for a wide readership. The LLM-generated content, targeted to popular science, should be easier for both in-domain and out-of-domain readers to read. Furthermore, our LLM-collaboration approach surpasses the single Qwen in all dimensions, demonstrating the effectiveness of LLM collaboration. Most importantly, our LLM collaboration method achieves the highest readability and interest in both within-field and outside-field reading evaluation, with high information conveyance and authenticity close to the most advanced model GPT-4. Collectively, these findings attest to the potency and effectiveness of our proposed approach.

4 Analysis

4.1 Ablation Study

We conduct an ablation experiment to validate the effectiveness of each component. The results have been included in Table 6 for clarity. In the “Reader: 1.8B $\rightarrow$ 7B” setting, we substitute the 1.8B reader model with the 7B version. This substitution leads to a minor performance fluctuation. On the one hand, the 7B model’s strong ability gives it greater tolerance for low readability of content, making it harder to highlight writing issues in articles. On the other hand, it excels in instruction following, enhancing task execution and reducing intermediate errors in the workflow. The dynamics between gains and losses render the use of the 7B model as a “reader” comparably advantageous at times and disadvantageous at others. Nevertheless, we recommend using the 1.8B model due to its higher resource efficiency and throughput.

For $-$ Reading Notes, we eliminate the requirement for the reader LLM to read the article and make notes. Instead, we ask the editor LLM to offer suggestions directly. For $-$ Suggestions, the editor’s advice is omitted, and the journalist revises the article based on the reader LLM’s reading. For $-$ Collaboration, the journalist revises the article based on the previous writing without any input from the reader or the editor. As depicted in Table 6, our approach exhibits a decrease in performance when each module is removed, underscoring the significance of each module.

4.2 Trend Analysis

We provide an in-depth analysis of the performance trajectory over iteration cycles. The entire writing-reading-suggesting-revising process is carried out for five iterations, with readability score changes depicted in Figure 4. Our findings reveal a successive and pronounced decline in all reading difficulty metrics over the initial three iterations. This pattern demonstrates the efficacy of our iterative revision methodology. Following the third iteration, the performance levels off, indicating diminishing improvements from subsequent suggesting and editing efforts.

4.3 Writing Case Analysis

To facilitate an intuitive comparison of our method, we present a case study on one writing sample. As shown in Table 3, our method can generate articles that are more readable, with concise and vivid expressions, along with explanations for technical details. For instance, our generated article states that “timely diagnosis is crucial” rather than “results need to be communicated rapidly to healthcare professionals once testing has been completed so that care pathways can be implemented”, making it more brief and accessible for readers to grasp the research objective. Moreover, our generated article details that the proposed system is a “smartphone-based system that combines a low-cost test with deep learning and blockchain technology” rather than “smartphone-based end-to-end platform for multiplexed DNA diagnosis of malaria”, enhancing the comprehensibility for a broader audience.

4.4 Case of Reading Notes

We also present a case study on the notes taken by the reader LLM during the first three iterations. As evidenced in Table 4, the notes offer more detailed explanations for technical jargon than those in preceding iterations. For example, the technical terms “AI” and “blockchain” are more thoroughly explained in the third iteration than in the first two iterations. This phenomenon suggests that as the readability of writing improves, readers can acquire deeper and more elaborate insights, aligning with human reading behaviours.

4.5 Suggestions for Revision

We further examine the advice given by the LLM editor for the revision over the iterative process, as exemplified in Table 5. In the initial stage, the editor offers general advice. Following the revision, the advice becomes more specific and detailed. Subsequent rounds of suggestions highlight specific content in the article and recommend revisions based on various aspects. In the 3rd iteration, the feedback from the editor becomes more detailed and specific, with specific adjustments from the original version to the revised content. This revision behaviour is similar to real-world science journalism, showcasing the efficacy of our framework and the social intelligence of LLMs.

5 Related Work

Automatic Science Journalism. ASJ has gained increasing interest in recent years. Dangovski et al. (2021) created a parallel corpus and provided a sequence-to-sequence method to generate news summaries from scientific articles. However, this dataset is not available to the public because of the licensing restriction. Goldsack et al. (2022) released two corpora, primarily focused on the biomedical and life science domains. Similarly, they employed a standard sequence-to-sequence model for such tasks. Cardenas et al. (2023) constructed a dataset in various scientific fields and integrated the discourse structure of papers with their metadata to guide the generation. These methods of fine-tuning on small models can provide a good match with the reference, but there is still room for improvement in readability. In this work, we present a novel approach that integrates LLMs acting as agents for enhancing readability.

Large Language Models. With the emergence of the LLM, advanced models (Achiam et al., 2023; Touvron et al., 2023; Anthropic, 2023; Jiang et al., 2023a; Bai et al., 2023) have shown performance comparable to that of humans in a variety of real-world tasks. Park et al. (2023) leveraged LLMs for social simulation, showing communication and collaboration between LLM-based agents. Liu et al. (2023) built an LLM-collaboration architecture for enhancing reasoning and code generation tasks. Qian et al. (2023) used different LLM-based agents throughout the development process for software engineering. This body of work has demonstrated the strong collective intelligence of LLMs. Inspired by this collection of work, we utilize LLMs as communicative agents, writing, reading, and revising scientific articles to make content accessible to the general audience through a process resembling real-world practice.

6 Conclusion

This study presents the first examination of LLM collaboration for automatic science journalism aimed at general readers. Initially, an LLM functions as a journalist by composing an explanatory article for the general public. Subsequently, a smaller LLM, acting as a general audience, reads these articles and takes notes, potentially helping reveal the low readability of the article content. An LLM editor assesses the reader’s notes and offers suggestions for improvement. Following the editor’s advice, the journalist revises the manuscript, which is then passed to the reader again to continue the iterative process. Extensive experiments are conducted to evaluate the effectiveness of our framework, including both automatic metrics and human evaluation. In comparison to directly prompting and fine-tuning various LLMs, our method achieves the highest readability while maintaining high quality. Additionally, we offer an in-depth analysis, which further validates our method and provides insights into designing LLM-driven ASJ systems.

Limitations

We identify the following four limitations of our work. First, following previous work, we have defined ASJ as the process of transforming a single paper into an article intended for a general audience. In practice, a popular science article may encompass multiple studies. Therefore, an extension can be to address the challenge of consolidating several papers into a single article. Second, we have utilized some statistical indexes for automatic assessment, but these statistical measures may miss semantic information. LLM-driven evaluation could offer a solution. While there remains a gap between LLMs and humans in evaluating text on readability and authenticity, efforts could be made to minimize this gap. Third, given our exploratory approach in utilizing LLMs for ASJ, we strategically chose abstracts as input to maintain both simplicity and resource efficiency. Nevertheless, long-context ASJ is an intriguing task potentially with new challenges to address. Lastly, all components of our framework are powered by LLMs. In addition to our efforts to make each LLM simulate humans, it will be interesting to incorporate genuine human preferences to enhance the generated content.

Ethics Statement

The experiments in this study were conducted on publicly available datasets. Any information involving privacy was removed. All annotators have been properly paid for their efforts.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, and Sam Altman et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Allan (2011) Stuart Allan. 2011. Introduction: Science journalism in a digital age. Journalism, 12(7):771–777.
Angler (2017) Martin Angler. 2017. Science journalism: an introduction. Routledge.
Anthropic (2023) Anthropic. 2023. Claude anthropic. https://www.anthropic.com/claude.
August et al. (2024) Tal August, Kyle Lo, Noah A Smith, and Katharina Reinecke. 2024. Know your audience: The benefits and pitfalls of generating plain language summaries beyond the" general" audience. arXiv preprint arXiv:2403.04979.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Cardenas et al. (2023) Ronald Cardenas, Bingsheng Yao, Dakuo Wang, and Yufang Hou. 2023. ‘don’t get too technical with me’: A discourse structure-based framework for automatic science journalism. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1186–1202.
Caselli et al. (2015) Tommaso Caselli, Piek Vossen, Marieke van Erp, Antske Fokkens, Filip Ilievski, Rubén Izquierdo, Minh Le, Roser Morante, and Marten Postma. 2015. When it’s all piling up: investigating error propagation in an nlp pipeline. In WNACP@ NLDB.
Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
Dangovski et al. (2021) Rumen Dangovski, Michelle Shen, Dawson Byrd, Li Jing, Desislava Tsvetkova, Preslav Nakov, and Marin Soljačić. 2021. We can explain your research in layman’s terms: Towards automating science journalism at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12728–12737.
Desmond et al. (2024) Michael Desmond, Zahra Ashktorab, Qian Pan, Casey Dugan, and James M Johnson. 2024. Evalullm: Llm assisted evaluation of generative outputs. In Companion Proceedings of the 29th International Conference on Intelligent User Interfaces, pages 30–32.
Dziri et al. (2023) Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, and Yejin Choi. 2023. Faith and fate: Limits of transformers on compositionality. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Goldsack et al. (2022) Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. 2022. Making science simple: Corpora for the lay summarisation of scientific literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604.
Google (2024) Google. 2024. Gemma: Introducing new state-of-the-art open models.
Göpfert (2008) Winfried Göpfert. 2008. The strength of pr and the weakness of science journalism. In Journalism, science and society, pages 227–238. Routledge.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023a. Mistral 7b. arXiv preprint arXiv:2310.06825.
Jiang et al. (2023b) Gongyao Jiang, Shuang Liu, Meishan Zhang, and Min Zhang. 2023b. A pilot study on dialogue-level dependency parsing for chinese. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9526–9541.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Likert (1932) Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology.
Lin et al. (2023) Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978.
Liu et al. (2023) Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170.
Meta (2024) Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date.
OpenAI (2023) OpenAI. 2023. Introducing ChatGPT.
Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.
Qian et al. (2023) Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924.
Scialom et al. (2021) Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. Questeval: Summarization asks for fact-based evaluation. Empirical Methods in Natural Language Processing.
Talebirad and Nadiri (2023) Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314.
Tkachenko et al. (2020-2022) Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. 2020-2022. Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wasi et al. (2024) Azmine Toushik Wasi, Rafia Islam, and Raima Islam. 2024. Llms as writing assistants: Exploring perspectives on sense of ownership and reasoning. arXiv preprint arXiv:2404.00027.
Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Wu et al. (2018) Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120.
Yang et al. (2023) Zeyuan Yang, Peng Li, and Yang Liu. 2023. Failures pave the way: Enhancing large language models through tuning-free rule accumulation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1751–1777.
Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852.
Zhang et al. (2024) Meishan Zhang, Gongyao Jiang, Shuang Liu, Jing Chen, and Min Zhang. 2024. Llm–assisted data augmentation for chinese dialogue–level dependency parsing. Computational Linguistics, pages 1–24.
Zheng et al. (2024a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024a. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
Zheng et al. (2024b) Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024b. Llamafactory: Unified efficient fine-tuning of 100+ language models. Preprint, arXiv:2403.13372.

Appendix A Prompts for LLMs

We list all prompts in Table 7. All prompts follow a similar format. First, we assign a role to each LLM agent by a sentence. We then specify the task and background in one or two sentences. Next, we give each LLM step-by-step instructions. After that, we input the rules to be followed for each LLM. Finally, the format of the output is specified as “markdown” style to facilitate the extraction and support the information flow among LLMs. In our preliminary study, this pattern works well in prompting various LLMs, with strong task completion and format adherence. Algorithm 1 presents the workflow of our framework.

1:a scientific paper

\boldsymbol{x}

; journalist

\mathcal{J}

; reader

\mathcal{R}

; editor

\mathcal{S}

; number of iterations

t

\boldsymbol{p}_{0}\leftarrow\mathcal{J}(\boldsymbol{x})

\triangleright

Initialization

3:for

i=1

t

\boldsymbol{r}\leftarrow\mathcal{R}\left(\boldsymbol{p}_{i-1}\right)

\triangleright

Reader’s notes

\boldsymbol{s}\leftarrow\mathcal{S}\left(\boldsymbol{p_{i-1},r}\right)

\triangleright

Editor’s suggestions

\boldsymbol{p}_{i}\leftarrow\mathcal{J}\left(\boldsymbol{x,p_{i-1},s}\right)

\triangleright

Revision

7:end for

8:return

\boldsymbol{p}_{t}

Algorithm 1 LLM-Collaboration for ASJ

Appendix B Failed Attempts

This appendix outlines our unsuccessful attempts. We hope that it will save time for other researchers. First, as a commonly used mechanism in LLM agents, reflection can support iterative enhancement by consolidating prior experiences (Yao et al., 2022; Park et al., 2023; Yang et al., 2023). In our pilot experiments, however, this approach did not succeed in refining the journalist’s writing. One potential explanation could be that the varied nature of journalistic content necessitates more targeted revision instructions, whereas summarizing writing experiences results in general writing guidance.

Within our framework, an LLM acts as an audience, reading the written article and taking notes. How about having this audience perform the reading comprehension task instead of taking notes? Intuitively, it can also demonstrate the reader’s understanding, and then induce the content complexity. Preliminary investigations, however, revealed that this approach yields lesser efficacy compared to the note-taking strategy. One possible reason might be that the quality of question generation greatly affects the efficiency of the whole workflow. Besides, asking a fixed number of questions narrows the expanse of textual exploration, thereby constricting the comprehensive perception of content complexity.

Appendix C Dataset and Hyperparameters

Table 6 shows some statistics of the three datasets used in our experiments. SCITech(News) is released by Cardenas et al. (2023), who gathered press releases from ACM Technews as well as their source articles from various publishers, involving fields of computer science, engineering, astrophysics, biology, and others. eLife is an open-access journal that focuses on biomedical and life sciences. Goldsack et al. (2022) collected some eLife articles as well as digests written by expert editors based on both the article itself and questions answered by the author. Similarly, PLOS hosts journals across areas of science and medicine. Some of these articles, also collected by Goldsack et al. (2022), come with the author’s summary. For resource saving, the original paper’s abstract serves as the scientific content input.

Statistic	SCITech	eLife	PLOS
# pairs	2431	4828	27525
# $\text{words}^{\text{ori}}$	216.8	166.3	268.3
# $\text{sentences}^{\text{ori}}$	5.7	6.8	10.2
# $\text{words}^{\text{pln}}$	176.1	347.6	175.6
# $\text{sentences}^{\text{pln}}$	7.9	15.7	7.8

Table 6: Statistics of benchmark datasets. Words and sentences are average values. The “ori” superscript indicates abstracts of original papers, and “pln” represents plain summaries written by authors or journalists.

Our local LLM service runs on a machine with eight GTX 4090 GPUs. We utilize the Huggingface platform (Wolf et al., 2019) for downloading and loading checkpoints. For rapid inference and memory efficiency, we utilize the vLLM library²²2https://docs.vllm.ai to develop API services. We deploy agents from the Qwen-1.5 series, for their good performance and diverse model scales Bai et al. (2023). In particular, Qwen-1.5-7B is employed for the steps of writing, providing suggestions, and revision, whereas Qwen-1.5-1.8B serves as the reader for taking notes. To improve memory efficiency, we implement activation-aware weight quantization (AWQ, Lin et al. 2023) for model quantization. For fine-tuning, we utilize LoRA (Hu et al., 2021) with Llama-Factory (Zheng et al., 2024b), adopting the default setting with the number of the epochs set to 10. We use the default temperature setting and empirically set top_p to 0.4, frequency penalty and repetition penalty to 1, ensuring the stability of the LLMs’ output while retaining diversity. We iterate five times and empirically select the output from the third iteration as the final result. The maximum number of tokens in the model output is 4096.

Appendix D Details of Human Evaluation

We create a questionnaire for human evaluation using a 1-5 Likert scale, as shown in Figure 5. All participants were informed that their assessments would be used for research purposes. We utilize Label Studio (Tkachenko et al., 2020-2022) to construct the annotation platform. Initially, participants indicate their familiarity with the given topic. They are then tasked with answering 4 questions related to Readability, Information Conveyance, Authenticity, and Interestingness. Readability assesses how easily the article can be read, serving as a supplementary and further validation of automated evaluation. Information conveyance determines if the rewritten content accurately and comprehensively conveys the information from the original paper. Similarly, authenticity assesses the correctness of the content. A high-quality article should contain minimal factual or common sense errors to avoid misleading readers. The level of interest is also a crucial factor; content of high appeal will attract more readers.

Appendix E Use of AI Assistants

We use ChatGPT for correcting grammar and improving expressions in this manuscript.

Journalist. You are a science journalist for general audiences. Given a paper’s summary, you are assigned to rewrite it into a short understandable article for general audiences.

Follow the rules strictly:

- Keep short yet informative.

- The output format:

## Article

…

Reader. You are a general reader. Given a popular science article, please read it carefully and take some notes.

Please take the following steps:

1. First, extract all technical terms with their context from the article.

2. Then, explain the technical terms based on their context.

Follow the rules strictly:

- Extraction should mention the specific location of each technical term in the article.

- Explanation should be first extracted from the article; if not found, it can be some common-sense or specialized knowledge.

- Extraction and explanation should be in points, like "1…2…3…".

- The output format:

### Extraction

1. …

2. …

…

### Explanation

1. …

2. …

Editor. You are a senior editor. Here are a scientific paper summary, and a short popular science article. A general reader has read the science article and takes some notes.

Please take the following steps:

1. First, evaluate the **reader’s notes** based on these factors: content accuracy, lexical and technical complexity, and information conveyance (from the original content).

2. Then, based on the above evaluation, list some brief yet informative writing advice that may benefit the popular science article, to make the article easier for general readers without specialized knowledge to read and understand. Specifically, the advice should benefit these factors of the article:

a) Content Accuracy: The factual correctness, scientific validity, and absence of errors in the general popular science article.

b) Accessibility: Higher accessibility means less technical, more readable and interestingly, etc. c) Information Conveyance: How effectively key information from the original paper is transferred to the popular science article.

Follow the rules strictly:

- Evaluation and advice sections should be in points, like "1…2…3…".

- Each advice should not go beyond the fact of original paper, but can be some common-sense or specialized knowledge.

- Each advice should be targeted at one specific aspect of the article.

- Don’t suggest visualization, references and links.

- Suggest explanations rather than content additions.

- The output format:

## Evaluation for reader’s notes

- Content accuracy of reader’s notes: …

- Lexical and technical complexity of reader’s notes: …

- Information conveyance of reader’s notes: …

## Advice

1. …

2. …

Revision You are a science journalist for general audiences. Given the paper summary and a short summary of the popular science article, you are assigned to rewrite the popular science summary for general audiences, who have no specialized knowledge on this field. There are some writing advice.

Please take the following steps:

1. Choose and refine the most relevant and suitable advice for writing improvement.

2. Then, based on the refined advice and the paper summary, rewrite the popular science article.

Follow the rules strictly:

- Keep short yet informative.

- Don’t include visualization, references and links.

- Revision must not go beyond the fact of original paper, but can be with some additional common-sense or professional knowledge for explanation.

- The output format:

## Improvement

…

## Revised Article

…

Table 7: Prompts for each LLM agent.

LLM-Collaboration on Automatic Science Journalism for the General Audience

Abstract

1 Introduction

2 Methodology

2.1 The LLM Journalist

2.2 The LLM Reader

2.3 Automated Suggestions and Revisions

3 Experiments

3.1 Settings

3.2 Automatic Evaluation

3.3 Human Evaluation

4 Analysis

4.1 Ablation Study

4.2 Trend Analysis

4.3 Writing Case Analysis

4.4 Case of Reading Notes

4.5 Suggestions for Revision

5 Related Work

6 Conclusion

Limitations

Ethics Statement

References

Appendix A Prompts for LLMs

Appendix B Failed Attempts

Appendix C Dataset and Hyperparameters

Appendix D Details of Human Evaluation

Appendix E Use of AI Assistants

LLM-Collaboration on Automatic Science Journalism
for the General Audience