Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Gao, Fan; Jiang, Hang; Yang, Rui; Zeng, Qingcheng; Lu, Jinghui; Blum, Moritz; Liu, Dairui; She, Tianwei; Jiang, Yuang; Li, Irene

arXiv:2308.10410v3 (cs)

[Submitted on 21 Aug 2023 (v1), revised 22 Feb 2024 (this version, v3), latest version 23 May 2024 (v4)]

Abstract: Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert input and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 outperforms GPT-3.5, PaLM2, and LLaMa2 when their outputs are compared against the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses such as missing details or factual errors. Finally, we compared the rating behavior of humans and GPT-4 and found systematic bias in GPT-based evaluation.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2308.10410 [cs.CL]
	(or arXiv:2308.10410v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.10410

Submission history

From: Irene Li
[v1] Mon, 21 Aug 2023 01:32:45 UTC (622 KB)
[v2] Wed, 6 Sep 2023 00:03:11 UTC (796 KB)
[v3] Thu, 22 Feb 2024 02:54:19 UTC (1,336 KB)
[v4] Thu, 23 May 2024 12:42:06 UTC (1,339 KB)
