MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension

Li, Zekun; Yang, Xianjun; Choi, Kyuri; Zhu, Wanrong; Hsieh, Ryan; Kim, HyeonJung; Lim, Jin Hyuk; Ji, Sungyoung; Lee, Byungju; Yan, Xifeng; Petzold, Linda Ruth; Wilson, Stephen D.; Lim, Woosang; Wang, William Yang

arXiv:2407.04903 (cs)

[Submitted on 6 Jul 2024]

Abstract: The rapid advancement of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has heightened the demand for AI-based scientific assistants capable of understanding scientific articles and figures. Despite progress, there remains a significant gap in evaluating models' comprehension of professional, graduate-level, and even PhD-level scientific content. Current datasets and benchmarks primarily focus on relatively simple scientific tasks and figures, lacking comprehensive assessments across diverse advanced scientific disciplines. To bridge this gap, we collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals. This dataset spans 72 scientific disciplines, ensuring both diversity and quality. We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content. Our evaluation revealed that these tasks are highly challenging: many open-source models struggled significantly, and even GPT-4V and GPT-4o faced difficulties. We also explored using our dataset as training resources by constructing visual instruction-following data, enabling the 7B LLaVA model to achieve performance comparable to GPT-4V/o on our benchmark. Additionally, we investigated the use of our interleaved article texts and figure images for pre-training LMMs, resulting in improvements on the material generation task. The source dataset, including articles, figures, constructed benchmarks, and visual instruction-following data, is open-sourced.

Comments:	Code and data are available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.04903 [cs.CL]
	(or arXiv:2407.04903v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.04903

Submission history

From: Zekun Li
[v1] Sat, 6 Jul 2024 00:40:53 UTC (3,976 KB)
