A Survey of Multimodal Large Language Model from A Data-centric Perspective

Tianyi Bai (Hong Kong University of Science and Technology, Hong Kong, China), Hao Liang (Peking University, Beijing, China), Binwang Wan (Harbin Institute of Technology, Weihai, China), Yanran Xu, Xi Li, Shiyu Li (Apple, China), Ling Yang, Bozhou Li (Peking University, Beijing, China), Yifan Wang (University of Science and Technology of China, Hefei, China), Bin Cui (Peking University, China), Ping Huang, Jiulong Shan (Apple, China), Conghui He (Shanghai Artificial Intelligence Laboratory, China), Binhang Yuan (Hong Kong University of Science and Technology, China) and Wentao Zhang (Peking University, China)


Abstract.

Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal role in the development and refinement of these models. In this survey, we comprehensively review the literature on MLLMs from a data-centric perspective. Specifically, we explore methods for preparing multimodal data during the pretraining and adaptation phases of MLLMs. Additionally, we analyze the evaluation methods for the datasets and review the benchmarks for evaluating MLLMs. Our survey also outlines potential future research directions. This work aims to provide researchers with a detailed understanding of the data-driven aspects of MLLMs, fostering further exploration and innovation in this field.

Large Language Models; Generative Models; Multimodal Large Language Models

CCS Concepts: Computing methodologies → Natural language processing; Computing methodologies → Computer vision

1. Introduction

In recent years, we have witnessed rapid advancement of large language models (LLMs) and multimodal large language models (MLLMs) (Zhao et al., 2023c; Wu et al., 2023b). MLLMs, such as GPT-4 (OpenAI, 2023), Flamingo (Alayrac et al., 2022), LLaVA (Liu et al., 2024b), BLIP2 (Li et al., 2023e), Cambrian-1 (Tong et al., 2024) and X-InstructBLIP (Panagopoulou et al., 2023), integrate information from multiple modalities and demonstrate impressive comprehension and generation capabilities. These models achieve competitive performance in traditional multimodal tasks, such as visual recognition (Zhang et al., 2024a), video understanding (Xu et al., 2021; Tang et al., 2023), speech recognition (Min and Wang, 2023) and 3D understanding (Guo et al., 2023; Hong et al., 2023). Moreover, their excellent language understanding capacity enables strong performance in text-rich tasks, such as question answering (Hu et al., 2024), multi-turn conversation and logical reasoning (Xu et al., 2024; Li et al., 2023k).

Most existing MLLMs focus on modifying the model architecture to explore the use of information from multiple modalities (Xu et al., 2024; Panagopoulou et al., 2023; Alayrac et al., 2022). While model effectiveness is crucial, data also significantly impacts the success of MLLMs. For example, Hoffmann et al. (2022) show that scaling up models requires a corresponding increase in the scale of the training data. Beyond data volume, data quality is equally important. Previous research (Sorscher et al., 2022) indicates that carefully curated datasets can enable smaller models to achieve performance comparable to larger ones. However, comprehensive studies on data curation and utilization for MLLMs are still lacking. Therefore, this study aims to provide a comprehensive understanding of MLLMs from a data-centric perspective.

In contrast to model-centric approaches that prioritize architectural enhancements while relying on fixed datasets, data-centric perspectives emphasize the comprehensive impact of the training corpus on model performance. Within the scope of data-centric MLLMs, we focus on leveraging the heterogeneous nature of data modalities, enhancing data structure, increasing data quantity, and improving data quality to strengthen MLLMs (Zha et al., 2023). Our survey discusses three key questions from a data-centric perspective at different stages of MLLMs:

• Q1: How to collect, select, and manage data for MLLMs? The substantial data volume requirements and the heterogeneity of multimodal data pose challenges in gathering, selecting, and effectively managing data for model training. Different training stages of MLLMs also lead to varying data type requirements.

• Q2: How does data affect the performance of MLLMs? Understanding the relationship between data characteristics and the performance of MLLMs is crucial for optimizing datasets and enhancing model capabilities.

• Q3: How to evaluate the data for MLLMs? It is necessary to develop comprehensive evaluation benchmarks to assess the performance and robustness of MLLMs across diverse tasks.

Unique contribution of this survey. Several existing works have focused on LLMs (Zhao et al., 2023c; Hadi et al., 2023; Naveed et al., 2023) and MLLMs (Wu et al., 2023b; Zhang et al., 2024c) from a model-centric perspective, but lack an in-depth analysis from the data-centric perspective. Recently, some work has started to focus on data preparation for LLMs, such as data management methods (Wang et al., 2023b), data selection methods (Albalak et al., 2024), and comprehensive reviews of LLM datasets (Liu et al., 2024a). However, such work primarily focuses on data management and selection methods for text-only LLMs and does not provide a detailed analysis of the data processing pipeline for MLLMs. The most closely related work to ours is data-centric artificial intelligence (DCAI) (Zha et al., 2023; Jarrahi et al., 2022; Jakubik et al., 2022; Polyzotis and Zaharia, 2021; Whang et al., 2023), which also focuses on data-centric views of AI research but does not specifically analyze LLMs and MLLMs.

With the rapid growth of MLLMs and the increasingly important role of data in this large model era, we believe it is crucial to provide a comprehensive overview of data-centric approaches for MLLMs. This survey aims to thoroughly review the literature on the advances of MLLMs from a data-centric perspective and discusses open issues or future directions in this field.

This survey gives researchers and developers a general and comprehensive understanding of the latest developments on the data side of MLLMs. The key contributions of this survey are summarized as follows:

• New data-centric perspective. We provide a comprehensive review of MLLMs from a data-centric perspective, considering modalities such as text, image, video, and audio.

• Data preparation and management pipeline. We summarize the data preparation and management pipeline for MLLMs in both the pre-training and adaptation phases.

• Data evaluation methods and evaluation benchmarks. We outline commonly used methods to evaluate datasets as well as MLLMs evaluation benchmarks from a data-centric perspective.

• Open issues and future directions. We discuss open issues in current research on data-centric LLMs and propose several future research directions.

Figure 1. Overview of the data pipeline for MLLMs.

The rest of this survey is organized as follows: Section 2 introduces the preliminaries for LLMs and MLLMs, and discusses the motivations to analyze them from a data-centric perspective. Sections 3 to 5 summarize the main stages of collecting, processing, and selecting data for MLLMs training. Section 6 enumerates the data evaluation methods and existing evaluation datasets for MLLMs according to evaluation tasks. Section 7 discusses open issues and highlights several future research directions in this field. Finally, we conclude the survey in Section 8.

2. Background and Categorization

2.1. Large Language Models

The field of natural language processing (NLP) has witnessed a remarkable evolution in language modeling techniques, culminating in the development of LLMs. The journey began with statistical methods (Jelinek, 1998; Gao and Lin, 2004; Rosenfeld, 2000), which laid the foundation for language modeling by capturing the probabilistic distributions of words and phrases. Subsequently, neural network approaches emerged (Bengio et al., 2000; Mikolov et al., 2010; Kombrink et al., 2011), leveraging the power of deep learning to learn more complex and abstract representations of language. However, such models were typically designed for specific tasks and encountered challenges related to scalability and generalization.

A significant breakthrough came with the introduction of pre-trained language models, which aimed to capture general language representations by training on vast amounts of unlabeled text data (Kenton and Toutanova, 2019; Radford et al., [n. d.]). Building upon these advancements, LLMs have taken the concept of pre-training to an unprecedented scale, with models containing billions of parameters and trained on massive corpora spanning hundreds of gigabytes to terabytes of text data (Brown et al., 2020; Raffel et al., 2020).

LLMs exhibit several notable properties that distinguish them from previous language models. One fundamental characteristic is the scaling law, which describes how the performance of large language models improves as they scale in terms of model size, training data, and computational resources (Kaplan et al., 2020; Hoffmann et al., 2022). This power-law relationship suggests that larger models trained on more data tend to capture more complex patterns and generalize better to new tasks. Another intriguing property of LLMs is the emergence of abilities that were not explicitly trained for, often referred to as emergent abilities (Wei et al., 2022). This ability suggests that LLMs can capture and leverage complex linguistic patterns and knowledge from the pre-training data, enabling them to perform tasks beyond their original training objective. Furthermore, LLMs have demonstrated the capability of in-context learning (Brown et al., 2020), where they can perform tasks based on a few examples provided in the input prompt, without the need for explicit fine-tuning. This highlights the models’ ability to rapidly adapt to new tasks and generalize from limited examples.

The evolution of LLMs typically involves three key stages: training, adaptation, and evaluation (Zhao et al., 2023c). The training stage focuses on learning general language representations from large-scale unlabeled corpora, capturing the underlying patterns and structures of natural language. The adaptation stage involves guiding the pre-trained model to specific tasks and human preference, often through supervised finetuning and human preference alignment (Raffel et al., 2020; Ouyang et al., 2022). This step is crucial for optimizing the model’s performance on the target application and ensuring its outputs are aligned with the desired objectives and human preference. Finally, the evaluation stage involves assessing the performance of the adapted model through various metrics and benchmarks to ensure that it meets the required standards and performs effectively in real-world scenarios.

2.2. Multimodal Large Language Models

Multimodal large language models (MLLMs) extend the capabilities of traditional LLMs by leveraging their comprehension, generation, and task-solving abilities across different modalities, including natural language (text and audio) and visual information (videos, images, and 3D models). The development of MLLMs is driven by the need to address real-world problems that often involve multiple modalities. For instance, in robotics, an MLLM can process visual input from cameras, interpret natural language instructions, and analyze structured data from sensors to perform complex tasks (Zeng et al., 2023). Similarly, in healthcare, MLLMs can analyze medical images, process electronic health records, and interpret patient-doctor conversations to aid in diagnosis and treatment planning.

MLLMs typically consist of three primary components: a modality encoder, a projector, and a large language model (LLM). The modality encoder encodes data from different modalities, the projector aligns the encoded data from the various modalities with the LLM, and the LLM serves as the backbone, providing comprehension capabilities. By integrating LLMs with multimodal encoders and projectors, MLLMs can process various types of information, demonstrating strong understanding and analytical capabilities to address downstream tasks across various modalities (Wu et al., 2023b).
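To make the three-component structure concrete, the following is a minimal sketch of how data could flow from a modality encoder through a projector into the LLM input sequence. Everything in it is an illustrative assumption rather than the design of any particular model: `encode_image` is a hypothetical stand-in that returns random patch features, the projector is assumed to be a single linear map (real projectors may be MLPs or more elaborate modules), and prepending the projected tokens to the text embeddings is only one possible way of combining them.

```python
import random


def encode_image(num_patches, enc_dim, seed=0):
    """Hypothetical modality encoder: one random feature vector per image patch."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(enc_dim)]
            for _ in range(num_patches)]


def make_projector(enc_dim, llm_dim, seed=1):
    """Assumed linear projector mapping encoder features to the LLM embedding size."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(llm_dim)]
               for _ in range(enc_dim)]

    def project(features):
        # Plain matrix product: (num_patches x enc_dim) @ (enc_dim x llm_dim).
        return [[sum(f[i] * weights[i][j] for i in range(enc_dim))
                 for j in range(llm_dim)]
                for f in features]

    return project


def build_llm_input(modality_tokens, text_embeddings):
    """Prepend projected modality tokens to the text embeddings (an assumption)."""
    dims = {len(v) for v in modality_tokens + text_embeddings}
    if len(dims) > 1:
        raise ValueError("all tokens must share the LLM embedding size")
    return modality_tokens + text_embeddings


# Usage: 3 image patches with 4-dim features, projected into a 6-dim LLM space.
project = make_projector(enc_dim=4, llm_dim=6)
visual_tokens = project(encode_image(num_patches=3, enc_dim=4))
text_embeddings = [[0.0] * 6 for _ in range(2)]  # placeholder text embeddings
llm_input = build_llm_input(visual_tokens, text_embeddings)
```

The resulting `llm_input` is the sequence the LLM backbone would consume; the backbone itself is omitted from the sketch.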

MLLMs represent a significant advancement in AI by extending the capabilities of traditional LLMs to process and analyze multiple modalities. MLLMs can solve complex real-world problems that require understanding and reasoning across different types of information. As research in this field progresses, we can expect MLLMs to play an increasingly important role in various domains, from robotics and healthcare to education and entertainment.

Figure 2. Overview of Data-Centric MLLMs: categorization of data-centric MLLM work.

2.3. Data-Centric AI and Why Data-Centric MLLMs

The field of artificial intelligence (AI) has experienced a paradigm shift towards data-centric approaches in recent years (Zha et al., 2023). Data-centric AI emphasizes the critical role of data quality, diversity, and representativeness in building robust and effective AI systems. This shift acknowledges that the performance of AI models depends heavily on the quality and characteristics of the training data, not on algorithmic improvements alone.

In the context of LLMs and MLLMs, data play a pivotal role in determining their capabilities and limitations. LLMs, such as GPT-3 (Brown et al., 2020) and BERT (Kenton and Toutanova, 2019), are trained on vast amounts of textual data to learn the intricacies of language and generate coherent, contextually relevant outputs. Similarly, MLLMs, which integrate multiple modalities such as text, images, and speech, require diverse and well-aligned datasets in different modalities to learn meaningful cross-modal representations (Gadre et al., 2024). The quality, diversity, and representativeness of the training data directly impact the models’ ability to understand and generate language, as well as their capacity to reason and perform tasks across multiple modalities (meta llama, 2024; Liu et al., 2023c; Chen et al., 2023e).

Analyzing LLMs and MLLMs from a data-centric perspective offers several advantages. First, it enables researchers to identify and address potential biases and limitations in the training data that can propagate to the models’ outputs. By carefully curating and augmenting datasets to ensure diversity and representativeness, researchers can mitigate biases and improve the fairness and generalizability of the models (Chen et al., 2024a). Second, a data-centric approach allows for a deeper understanding of the models’ capabilities and limitations based on the characteristics of the training data (McKinzie et al., 2024). By systematically varying data properties and evaluating model performance, researchers can gain insights into the specific data attributes that contribute to the models’ success or failure in various tasks. This understanding can guide the development of more efficient and effective data collection and curation strategies. Moreover, adopting a data-centric perspective in the development of LLMs and MLLMs opens up opportunities for data-efficient training and adaptation (Liu et al., 2024b, 2023c; Chen et al., 2023e). Additionally, data-centric approaches facilitate the development of more interpretable and explainable models by linking model behaviors to specific data characteristics.

2.4. Categories of Data-Centric MLLMs

In this survey, we review previous work on MLLMs from a data-centric perspective. Our taxonomy is primarily based on different stages of MLLM development, as illustrated in Figure 2. Specifically, we organize our article around a data pipeline, shown in Figure 1, which comprises three stages: pre-processing (data collection and data processing, Section 3), pre-training data processing (Section 4), and adaptation data processing (Section 5). Along this pipeline, we discuss data-related model performance and analyze how dataset curation affects model performance. During the pre-processing stage, large-scale multimodal data are collected from various sources and constructed into datasets. Processing steps such as filtering, deduplication, and data enhancement are applied to improve data quality. In the pre-training stage, data selection, domain mixing and modality mixing are performed to choose appropriate data for model training. In the adaptation stage, we focus on data processing and generation for supervised fine-tuning and human preference alignment. Furthermore, we discuss data evaluation metrics and evaluation datasets for MLLMs in Section 6, examining their construction and main characteristics. Finally, we suggest possible future directions for further research on data-centric MLLMs in Section 7.

3. Data Collecting and Processing

The first and fundamental step in training MLLMs is to collect and process sufficient data from various sources. In this section, we first introduce the sources for data collection. Next, we summarize the processing steps for multimodal data, including commonly used filtering, deduplication, and data enhancement methods. Finally, we provide an overview of commonly used datasets for MLLMs.

3.1. Data Collecting Sources

We first introduce the sources where researchers typically find raw data for MLLM pre-training. Commonly used data sources can be divided into six categories: common webpages, social media, academic papers, books, code repositories, and professional sources. Data from these different sources have various characteristics, each contributing uniquely to improving different abilities of the models.

Common Webpages.

Webpages are the main source for collecting large-scale training corpora. The CommonCrawl project serves as the most commonly used starting point for large-scale web data. It is instrumental in generating large-scale pre-training corpora like C4 (Raffel et al., 2020) for LLM pre-training. Beyond textual content, CommonCrawl’s vast archive of web pages, which includes numerous image-text pairs, has also become a vital resource for constructing multimodal pre-training datasets such as LAION-5B (Schuhmann et al., 2022).

Apart from CommonCrawl, several works focus on selecting specific webpages and crawling data independently. For textual datasets, initiatives like WuDaoCorpora (Yuan et al., 2021) crawl thousands of web pages to construct vast textual corpora. For multimodal datasets, there are three main self-crawling approaches: general crawling, focused crawling of specific platforms, and querying search engines for images. Examples of the search-engine approach include AI Challenger Captions (Wu et al., 2017) and Wukong (Gu et al., 2022). General approaches, such as CC3M (Sharma et al., 2018) and CC12M (Changpinyo et al., 2021), involve crawling billions of web pages without targeting specific platforms. Focused approaches, on the other hand, involve selecting specific webpages to crawl. For example, Wikipedia is a commonly used resource that serves as a rich and accurate source for text-based single-modal pre-training datasets like the Pile (Gao et al., 2020). It also plays a crucial role in creating multimodal pre-training datasets such as WIT (Srinivasan et al., 2021) and WikiCaps (Schamoni et al., 2018). For the vision modality, Flickr (Young et al., 2014) serves as an important source, containing photos and videos shared by online users. This contributes to image-text datasets such as Flickr30k (Young et al., 2014). For the audio modality, platforms such as BBC Sound Effects, FreeSound, and SoundBible are commonly used, contributing to datasets such as WavCaps (Mei et al., 2023) and FSD50K (Fonseca et al., 2022).

Social Media.

Social media platforms significantly enhance the training datasets for MLLMs by providing real-time, varied, and colloquial text data, as well as multimodal content that captures human expression and interaction. However, questions regarding the ownership and copyright of social media content are on the rise. MLLM researchers need to be aware of these issues and understand the rights and licensing of the data sources they use. Stack Exchange offers high-quality textual and visual content, hosting extensive forums with rich dialogues and images. In addition to Stack Exchange, Reddit offers user-generated content with discussions and voting on a wide array of topics, contributing to both textual and visual content. For video data, YouTube is a standout platform, with a vast number of videos uploaded every minute. X, formerly known as Twitter, hosts a large volume of shared text messages, images, audio, and videos.

Academic Papers.

Academic papers offer high-quality, authoritative content, enriching pre-training datasets with specialized knowledge and formal language, which is critical for developing language models with expertise in academic and professional domains. Academic papers are primarily utilized as sources for text-based pre-training datasets. arXiv offers a vast array of scientific preprints for training specialized language models. This is evidenced by its role in creating text datasets such as RedPajama-Data-1T (Computer, 2023) and Pile-arXiv (Gao et al., 2020), as well as image-text datasets such as MMC (Liu et al., 2023d). Another academic platform, Semantic Scholar, contributes to the S2ORC (Lo et al., 2020) corpus, which contains 81.1 million academic papers, offering metadata, abstracts, references, and full texts for 8.1 million open-access works.

Books.

Books provide a rich source of high-quality textual content for language model training and serve as fertile ground for multimodal pre-training datasets. They offer diverse visual data from book covers and illustrations, enriching models with a blend of literary depth and visual context.

Project Gutenberg acts as a pivotal source for language modeling and analysis with its extensive library of more than 70,000 free eBooks, fueling datasets such as PG-19 (Rae et al., 2019). Smashwords contributes to the well-known BookCorpus (Zhu et al., 2015) dataset, which encompasses a diverse collection of 11,038 self-published novels that span genres such as romance, science fiction, and fantasy. This extensive corpus has been instrumental in training landmark models such as GPT (Radford et al., [n. d.]) and BERT (Kenton and Toutanova, 2019). Bibliotik, which consists of a mix of fiction and non-fiction books, contributes to Books3 (Gao et al., 2020), part of the Pile dataset, which contains about 197,000 books. This dataset is essential for context modeling and narrative research, accounting for 2.1% of the RedPajama-Data-1T dataset.

Meanwhile, books also serve as an important source for constructing multimodal datasets. Book covers hold vast multimodal information, combining visual elements with metadata such as author names, titles, and genres. This fusion of data supports tasks like genre classification and visual question-answering. The OCR-VQA dataset (Mishra et al., 2019), sourced from Amazon.com, exemplifies how book covers can be leveraged to build rich pre-training datasets for MLLMs, demonstrating the value of covers in enhancing model understanding of visual and textual content. Old photo books offer a unique advantage for constructing multimodal pre-training datasets by providing historical visual and textual contexts, enriching models’ understanding of past societies and landscapes. For example, previous work collected 9,516 image-text pairs from 175 Japanese old photo books (Okamoto et al., 2023).

Domain-Specific Sources.

To enhance the performance of MLLMs in specific domains, general pre-training is supplemented with incremental, domain-specific pre-training.

In the legal domain, pre-training data is primarily textual. FreeLaw stands out by offering free access to court opinions from U.S. federal and state courts, exemplified by the Pile-FreeLaw (Gao et al., 2020) dataset, which structures these opinions for machine learning.

In the math domain, textual datasets primarily originate from Khan Academy’s exercises for foundational concepts and DeepMind’s Pile-DeepMind Mathematics (Saxton et al., 2018) for advanced, algorithmically generated problems. Additionally, DVQA (Kafle et al., 2018) introduces a multimodal aspect by combining bar charts with question-answer pairs generated through Matplotlib.

In the medical domain, pre-training data comes from online medical websites, knowledge bases, and in-hospital database systems. Online medical websites such as Qianwen Health and PubMed contribute to datasets like Huatuo-26M (Li et al., 2023i) and MedQuAD (Ben Abacha and Demner-Fushman, 2019). Knowledge bases like Wikipedia also contribute to Huatuo-26M (Li et al., 2023i) and MedHop (Welbl et al., 2018). In-hospital database systems, including electronic health record (EHR) systems and ICU-specific clinical information systems, are used in MIMIC-IV (Johnson et al., 2023). Multimodal datasets in radiology, such as MIMIC-CXR-JPG (Johnson et al., 2019) and PADCHEST (Bustos et al., 2020), are generated from in-hospital radiology reports.

In the financial domain, pre-training primarily relies on textual data sourced from both English and Chinese platforms. English contributions come from EDGAR and the SEC Financial Statement and Notes Data Sets, which enhance the FinTree project (Ok, 2023). Chinese sources include major sites like Sina Finance and Tencent Finance, as well as Eastmoney for specific documents, and forums like Guba and Xueqiu, culminating in the comprehensive BBT-FinCorpus (Lu et al., 2023b). This assortment of sources underscores the rich textual foundation for financial model pre-training.

3.2. Data Processing

3.2.1. Filtering

Data filtering in MLLM training is critical for enhancing model reliability and efficiency. Filtering out unwanted content is essential, as even minimal exposure to hate speech can negatively influence model behavior (Luccioni and Viviano, 2021; Gunasekar et al., 2023). For multimodal datasets, previous work has focused on filtering out unwanted data from different modalities separately. Specifically, for X-Text datasets, where X denotes a non-text modality, filters are applied to the text and to the X modality independently.

Textual filters primarily encompass language and content filtering. For language filtering, documents or sentences whose score for the target language falls below a certain threshold are removed. For English-only datasets, commonly used tools like Langdetect and FastText operate at the document level. For instance, C4 (Raffel et al., 2020) utilizes Langdetect to filter out any page whose probability of being English is below 0.99. Another renowned language filter, FastText, is widely employed by datasets such as Dolma (Soldaini et al., 2024) and RefinedWeb (Penedo et al., 2023). CLUECorpus2020 (Xu et al., 2020) adopts a sentence-level filtering approach, selecting sentences whose language type is Chinese when a language type is specified. For multilingual datasets, FastText, trained on Wikipedia data to classify 176 languages, remains a preferred choice, capable of processing 1,000 documents per second on a single CPU core. ROOTS (Laurençon et al., 2022) utilizes FastText for document-level language classification, resulting in a 1.6TB dataset containing 59 languages. For code datasets, detection methods are more elementary, often based on file extensions. For instance, in PaLM (Chowdhery et al., 2023), files are filtered based on filename extensions to restrict to one of 24 common programming languages, resulting in 196GB of source code.
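The two kinds of language filtering described above can be sketched as follows. This is an illustrative sketch, not the pipeline of any cited dataset: `detect` is a caller-supplied function assumed to return a `(language_code, probability)` pair (in practice it would wrap a tool such as Langdetect or FastText), `toy_detect` is a deliberately crude stand-in used only for demonstration, and `CODE_EXTENSIONS` is a hypothetical allow-list, since the 24 languages used by PaLM are not enumerated here.

```python
from pathlib import PurePosixPath


def filter_by_language(docs, detect, target="en", threshold=0.99):
    """Keep documents detected as `target` with probability >= threshold."""
    kept = []
    for doc in docs:
        lang, prob = detect(doc)
        if lang == target and prob >= threshold:
            kept.append(doc)
    return kept


# Hypothetical allow-list of source-code extensions (an assumption).
CODE_EXTENSIONS = {".py", ".java", ".c", ".cpp", ".js", ".go", ".rs"}


def filter_code_files(paths, allowed=CODE_EXTENSIONS):
    """Keep files whose filename extension is in the allow-list."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in allowed]


def toy_detect(text):
    """Toy detector for demonstration only: share of ASCII letters among letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return ("unknown", 0.0)
    ascii_share = sum(ch.isascii() for ch in letters) / len(letters)
    if ascii_share >= 0.5:
        return ("en", ascii_share)
    return ("other", 1.0 - ascii_share)


# Usage: only the fully English document passes the 0.99 threshold.
docs = ["Hello world", "你好世界", "Hello 你好"]
english_docs = filter_by_language(docs, toy_detect)
code_files = filter_code_files(["src/main.py", "README.md", "lib/core.RS"])
```

Passing the detector in as a parameter keeps the thresholding logic independent of the detection tool, so the same function covers document-level and sentence-level filtering depending on what is passed as `docs`.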

Another type of textual filtering is content filtering, which includes removing toxic and distracting content. Toxic content filtering targets text that is deemed rude, disrespectful, or unreasonable, employing heuristic and rule-based methods. For instance, in C4 (Raffel et al., 2020), pages containing words in a predefined "Dirty, Naughty, Obscene, or Otherwise Bad Words" list are removed. Distracting content includes short or incomplete sentences, "lorem ipsum" text, and useless identifiers like HTML, CSS, and JavaScript tags. Filtering methods usually involve multiple rule-based filters, such as punctuation-based segmentation in WuDaoCorpora (Yuan et al., 2021), or machine learning approaches, as seen in phi-1 (Gunasekar et al., 2023), which uses GPT-4 as an annotator and then trains a high-quality content classifier on the annotated results.
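A rule-based content filter of the kind described above might look like the following sketch. The specific rules and thresholds (`min_words`, `min_lines`, the terminal-punctuation check) are assumptions chosen for illustration and should not be read as the exact rules of C4 or any other dataset; `BLOCKLIST` contains placeholder tokens standing in for a real bad-word list.

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder entries, not a real list
TAG_RE = re.compile(r"<[^>]+>")       # crude matcher for HTML-like tags


def clean_page(text, blocklist=BLOCKLIST, min_words=5, min_lines=3):
    """Return the cleaned page text, or None if the whole page is dropped."""
    lowered = text.lower()
    # Drop pages containing placeholder text.
    if "lorem ipsum" in lowered:
        return None
    # Drop pages containing any blocklisted word.
    if set(re.findall(r"[a-z0-9']+", lowered)) & blocklist:
        return None
    kept = []
    for line in text.splitlines():
        line = TAG_RE.sub("", line).strip()
        # Drop short fragments such as menu items.
        if len(line.split()) < min_words:
            continue
        # Drop lines that do not end like a complete sentence.
        if not line.endswith((".", "!", "?", '"')):
            continue
        kept.append(line)
    # Drop pages with too little remaining content.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)
```

A learned quality classifier, as used in phi-1, would replace or complement these hand-written rules with a model score compared against a threshold.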

For image-level filtering, the most fundamental step is to remove images with excessively low resolution, as these images often fail to convey effective information. Additionally, it is necessary to filter out images with inappropriate aspect ratios, as such unconventional images frequently resemble banner-like advertisements (Zhu et al., 2024). Furthermore, as in text filtering, harmful content such as NSFW material must be filtered out of images. This can be achieved by training a binary classification model using appropriate datasets. Additionally, any potential occurrences of human faces or other sensitive elements in the images should be detected using a face detector and blurred accordingly (Guo et al., 2021). These filtering processes have a minimal effect on the overall model performance (Yang et al., 2022).
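The resolution and aspect-ratio checks can be expressed as a small predicate over image dimensions. The sketch below assumes the width and height are already known from image metadata, and the thresholds `min_side` and `max_aspect_ratio` are illustrative assumptions, not values taken from any cited dataset.

```python
def keep_image(width, height, min_side=200, max_aspect_ratio=3.0):
    """Return True if the image passes the resolution and aspect-ratio checks."""
    if width <= 0 or height <= 0:
        return False
    # Resolution check: the shorter side must be large enough.
    if min(width, height) < min_side:
        return False
    # Aspect-ratio check: very elongated images often look like banner ads.
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect_ratio


# Usage: filter a list of (name, width, height) records.
records = [("photo.jpg", 640, 480), ("icon.png", 64, 64), ("banner.gif", 1200, 200)]
kept = [name for name, w, h in records if keep_image(w, h)]
```

NSFW filtering and face blurring need learned models and are therefore not covered by this sketch.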

For video-level filtering, previous work has leveraged image or image-text filtering methods to handle static content and applied specific techniques for dynamic content. Video filtering involves four main components: scene transition detection, video quality and integrity improvement, refinement and coherence evaluation, and modality completeness. For scene transition detection, algorithms are used to remove abrupt scene changes, ensuring a cleaner dataset free from disruptive transitions (Wang et al., 2023a; Blattmann et al., 2023). To improve video quality and integrity, three aspects are considered: motion, textual information in the video, and resolution and frame rate. For motion analysis, dense optical flow (Farnebäck, 2003) is typically utilized to measure motion complexity and filter out static or repetitive sequences with limited value for training. To detect textual information in videos, OCR (Baek et al., 2019) is applied to identify and remove clips with substantial text, which can otherwise mislead the model’s visual interpretation. The resolution and frame rate of training video datasets are standardized, enabling the model to learn from uniformly structured data (Wang et al., 2023a). In constructing video-text datasets, ensuring the relevance and clarity of annotations is paramount. Focused efforts are made to refine the textual content of video-text pairs, particularly subtitles, to ensure they are contextually meaningful and devoid of promotional or irrelevant material (Xu et al., 2023c). Models such as CLIP are employed to evaluate the coherence between video frames and the accompanying text (Chen et al., 2024b; Xu et al., 2023c; Wang et al., 2023a), ensuring that annotations accurately reflect the video content. In some studies, clips missing any modality (vision, audio, or subtitles) are excluded to meet the requirements for comprehensive understanding across all modalities (Chen et al., 2024b).
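The coherence check between video frames and text can be sketched as a similarity threshold over embeddings. The example below assumes that frame and caption embeddings have already been computed by a CLIP-like model and are passed in as plain lists of floats; the dictionary keys `frames` and `text`, the mean-over-frames aggregation, and the threshold value are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_text_coherence(frame_embeddings, text_embedding):
    """Mean cosine similarity between sampled frames and the caption."""
    if not frame_embeddings:
        return 0.0
    total = sum(cosine(f, text_embedding) for f in frame_embeddings)
    return total / len(frame_embeddings)


def filter_pairs(pairs, threshold=0.25):
    """Keep video-text pairs whose coherence score reaches the threshold."""
    return [p for p in pairs
            if clip_text_coherence(p["frames"], p["text"]) >= threshold]


# Usage with toy 2-dimensional embeddings.
pairs = [
    {"id": "aligned", "frames": [[1.0, 0.0], [1.0, 0.0]], "text": [1.0, 0.0]},
    {"id": "unrelated", "frames": [[0.0, 1.0]], "text": [1.0, 0.0]},
]
kept_ids = [p["id"] for p in filter_pairs(pairs)]
```

Averaging over sampled frames is only one option; taking the minimum or the score of a single middle frame are alternatives with different tolerance for partially matching clips.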

3.2.2. Deduplication

Previous work has found a large amount of duplicate data in various training datasets. For example, the BOOK Corpus (Zhu et al., 2015) contains thousands of duplicated books (Bandy and Vincent, 2021). The commonly used curated web crawl dataset C4 (Raffel et al., 2020) contains a single 61-word English sentence that is repeated more than 60,000 times (Lee et al., 2022). Repeated data in training datasets can increase the rate at which a model emits memorized training data verbatim (Carlini et al., 2022; Kandpal et al., 2022), and deduplicating training datasets mitigates this memorization problem, thus alleviating privacy concerns (Kandpal et al., 2022). Moreover, duplicated training data can also degrade the performance of pre-trained models (Hernandez et al., 2022). Training on deduplicated datasets reduces training cost without hurting model perplexity (Lee et al., 2022).

Existing deduplication methods consider exact duplication, approximate duplication, and semantic duplication at the sentence level (or sequence level) and the document level. Exact deduplication is the simplest way to remove duplicates. At the sentence or sequence level, exact deduplication removes sentences (or sequences) that match exactly as strings (Raffel et al., 2020; Suárez et al., 2019). To improve computational and memory efficiency, suffix arrays (Manber and Myers, 1993; Lee et al., 2022) and Bloom filters (Computer, 2023; Soldaini et al., 2024) are used to achieve parallelized linear time complexity. For document-level exact deduplication, URL deduplication is used to remove identical web pages (Penedo et al., 2023). Approximate deduplication methods mainly focus on document-level duplicates. Usually, Locality Sensitive Hashing (LSH)-based MinHash (Broder, 1997) or SimHash (Charikar, 2002) methods are used to remove approximately duplicate documents (Computer, 2023; Lee et al., 2022; Penedo et al., 2023). These methods achieve document-level deduplication with time and space complexity linear in the number of documents and can be implemented in a highly distributed setting (Lee et al., 2022). Apart from hashing-based approximate deduplication, some recent work leverages pre-trained foundation models to produce document-level semantic embeddings (Silcock et al., 2022; Kaddour, 2023; Abbas et al., 2023; Tirumala et al., 2023). These semantic deduplication methods use Sentence-BERT (Reimers and Gurevych, 2019) with MPNet (Song et al., 2020) for embedding (Silcock et al., 2022), E5-Large (Wang et al., 2022) for embedding (Kaddour, 2023), and OPT-125M (Zhang et al., 2022) for embedding (Abbas et al., 2023; Tirumala et al., 2023). 
Semantic deduplication is usually applied after exact and approximate deduplication, removing semantic duplicates by clustering the embedding points and keeping representative data in each cluster. However, how to choose an appropriate pre-trained foundation model and whether such a model can embed semantic information efficiently are still open questions.
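As an illustration of the hashing-based approximate deduplication discussed above, the following sketch builds MinHash signatures over word 3-gram shingles and estimates the Jaccard similarity between two documents. It is a minimal sketch under stated assumptions: the shingle size, the number of hash functions, and the use of seeded BLAKE2b hashes in place of random permutations are illustrative choices, and the LSH banding step that avoids all-pairs comparison is omitted.

```python
import hashlib


def shingles(text, n=3):
    """Set of lowercase word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed simulates one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=128, n=3):
    """For each seed, keep the minimum hash value over all shingles."""
    shingle_set = shingles(text, n)
    return [min(_seeded_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "multimodal models align images with text using contrastive objectives at scale"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; in large-scale settings the signatures are typically split into bands and bucketed so that only candidates sharing a bucket are compared.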

For image-text pairs or interleaved image-text documents, common methods of image deduplication include using the image URL (Zhu et al., 2024) or employing pHash (Zauner, 2010) algorithms. It’s worth noting that in the realm of interleaved image-text documents, there is currently no universally accepted, reliable method for deduplication based on both image and text elements simultaneously.
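Perceptual hashing can be illustrated with a much simpler relative of pHash. The sketch below implements an average hash rather than the DCT-based pHash of Zauner (2010), and it assumes the image has already been converted to a small grayscale grid (given here as nested lists); both function names are hypothetical.

```python
def average_hash(gray):
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


original = [[0, 255], [255, 0]]
slightly_edited = [[10, 250], [245, 5]]   # small brightness changes
inverted = [[255, 0], [0, 255]]           # visually different pattern
```

Images whose hashes lie within a small Hamming distance of each other would be treated as duplicates; the distance threshold is a tuning choice that the survey does not specify.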

For video multimodal data, video fingerprinting technology is applied to identify and remove exact duplicates, addressing the challenge of video reuploads and mirrored content, which is more prevalent and complex in videos than in static images (Xu et al., 2023c).

3.2.3. Data Enhancement

Data enhancement of multimodal datasets usually focuses on two aspects: enhancing the X-modality data and enhancing the text data. Traditional data augmentation methods for single-modality data have been discussed thoroughly in previous work (Xu et al., 2023d; Cauli and Reforgiato Recupero, 2022; Ko et al., 2015; Wei et al., 2020; Zhao et al., 2021; Feng et al., 2021). In this work, we only consider data enhancement methods for MLLM datasets.

For vision-language models, enhancing the quality of image-caption datasets is crucial for training MLLMs. Improving the captions not only enhances their alignment with the images but also prevents the discarding of highly informative images due to poor text quality. Rewriting text using BLIP2 has been shown to effectively improve training results (Nguyen et al., 2023). However, for large-scale image-text datasets (approximately 1.28 billion pairs), this method experiences diminishing returns in terms of ImageNet accuracy, possibly due to the lack of diversity in the generated captions (Nguyen et al., 2023). Approaches like LaCLIP (Fan et al., 2024) and VeCLIP (Lai et al., 2023a) employ LLMs with carefully designed prompts to generate more diverse captions. ShareGPT4V (Chen et al., 2023e) uses 100K high-quality captions generated by GPT4-Vision to fine-tune an alternative caption model, named Share-Captioner, which is capable of generating highly content-related descriptions with a unified instruction for the pre-training dataset. MLM (Wang et al., 2024c) uses GPT-4 or GPT-4V to construct multimodal instruction tuning data on the proposed quality scoring tasks and fine-tunes an MLM on these data for accurate quality assessment. The fine-tuned MLM Filter then generates a quality score for each data point in the data pool, and the high-quality data are selected.

Improving image resolution can enhance the performance of MLLMs. Initially, MLLMs primarily processed fixed, lower-resolution inputs, typically around 224 pixels (Liu et al., 2024b; Chen et al., 2023j; Zhu et al., 2023). Recent models such as LLaVA-1.5 (Liu et al., 2023c) and BLIVA (Hu et al., 2024) have improved performance by increasing the input resolution to 336 pixels and integrating task-specific global features. Furthermore, models like Qwen-VL (Bai et al., 2023a) and OtterHD (Li et al., 2023m) have pushed resolution support to 448 pixels, incorporating fine-tuning in the visual encoder during training, while maintaining the original image size during inference, leading to more precise segmentation recognition. Notably, Monkey (Li et al., 2023j) has significantly raised the resolution to 896 pixels by employing multiple visual encoders and leveraging fine-tuning techniques from Qwen-VL. Models like LLaVA-HR (Luo et al., 2024) and Vary (Wei et al., 2023) have introduced additional visual encoders to capture more complex features, requiring extensive pre-training tasks. LLaVA-UHD (Xu et al., 2024) and UReader (Ye et al., 2023a) have improved the model’s capacity to understand detailed image features through adaptive region segmentation. However, a fundamental challenge remains within the MLLM architecture: models using lower-resolution inputs struggle to detect fine details, whereas those with higher resolutions may underperform in tasks requiring a broader global understanding.

3.3. Commonly-used Datasets for MLLMs

In this section, we briefly introduce commonly used multimodal datasets from different modalities. These datasets are not utilized for this research. A comprehensive summary of text-only datasets has been given in previous literature (Liu et al., 2024a); therefore, text-only datasets are not covered extensively here.

3.3.1. Image Datasets

For vision-language models, the commonly used image-text datasets can be categorized into three distinct types: image-caption datasets, interleaved image-text datasets, and visual question answering datasets.

Image-Caption Datasets.

Image-caption pairs datasets are the most commonly used datasets in the pre-training stage of MLLMs. By leveraging the captions of each image, one can align the image modality with the text modality, enabling LLMs to understand image information.

General image-caption datasets contain images with short captions, typically one sentence or a few words, that describe the key features of the image. Notably, the LAION series (comprising LAION-400M (Schuhmann et al., 2021) and LAION-5B (Schuhmann et al., 2022)) represents some of the largest compilations, aggregating billions of pairs from CommonCrawl. These datasets emphasize the maintenance of high quality through meticulous filtering based on the relevance of image-text pairs and the exclusion of inappropriate content. Derived from a subset of LAION-5B, the LAION-COCO (Schuhmann et al., 2022) dataset is specifically designed to explore the impact of synthetic captions on model training. Also curated from the CommonCrawl project, the COYO-700M (Minwoo Byeon, 2022) dataset emphasizes the informative correlation between images and alt-texts. The DataComp (Gadre et al., 2024) challenge introduces COMMONPOOL, a substantial multimodal dataset constructed from 12.8 billion image-text pairs that encourages innovation in dataset design. Apart from the CommonCrawl project, some self-crawled datasets also play a vital role. Several datasets are curated from images with alt attributes from webpages. For example, Conceptual Captions (CC3M (Sharma et al., 2018) and CC12M (Changpinyo et al., 2021)) use generalized alt-text from the web and are processed with multiple filters, containing 3.3 million and 12 million image-text pairs respectively. Also leveraging alt-text for images on webpages, the ALT200M (Hu et al., 2022) dataset collects 200 million images with their alt attributes, aiming to understand the scaling of vision-language models. For multilingual image-caption datasets, a representative one is the Wukong (Gu et al., 2022) dataset, which addresses the scarcity of large-scale datasets in the Chinese language, offering 100 million quality-assured pairs that are instrumental for the development of Chinese vision-language pre-training models. 
The larger Long text & image pairs (LTIP) (Alayrac et al., 2022) dataset provides a unique collection of 312 million images with lengthier textual descriptions, supporting models like Flamingo that are tailored for complex multimodal tasks.

Content descriptive image-caption datasets contain longer captions with more descriptive information. Usually, each image in a content descriptive image-caption dataset has at least 5 caption sentences. Content descriptive image-caption datasets are essential for advancing the interface between visual content and text descriptions, enhancing the ability of MLLMs to interpret images and generate more detailed descriptions of them. For example, MS-COCO (Chen et al., 2015) employs images with multiple reference captions to refine evaluation metrics, while Flickr30K (Young et al., 2014) extends the variety of descriptions to bolster semantic inference capabilities. In addition, datasets like Visual Genome (Krishna et al., 2017) provide comprehensive annotations beyond simple captions, including objects, attributes, and relationships, to facilitate complex scene understanding. The AI Challenger Captions (Wu et al., 2017) dataset specifically addresses the need for non-English language representation in image captioning, offering extensive annotations in Chinese to bridge the semantic gap between low-level visual features and high-level conceptual descriptions. Narrative and textual comprehension are further introduced by datasets like VIST (Huang et al., 2016), which emphasizes storytelling through visual sequences. TextCaps (Sidorov et al., 2020) incorporates reading comprehension into captioning. Collectively, these datasets not only propel advancements in automated image captioning but also tackle broader challenges in AI’s capability to process and generate meaningful visual and textual data, significantly enhancing context-aware machine understanding.

Interleaved Image-Text Documents.

A fundamental contrast between image-caption datasets and interleaved image-text document datasets lies in their composition: image-caption pair datasets typically consist of a single image accompanied by multiple closely related captions, whereas interleaved image-text datasets consist of a text document interspersed with several illustrative images. The correlation between the images and the accompanying text tends to be relatively lower, but these datasets usually provide more semantic information, which is crucial for maintaining models’ semantic understanding. M3W (Alayrac et al., 2022) leverages data from 43 million web pages, integrating images with text to train models in few-shot learning scenarios. MMC4 (Zhu et al., 2024), an extension of the C4 corpus, aligns images with texts using CLIP features, resulting in a dataset containing over 101.2 million documents. OBELICS (Laurençon et al., 2024), sourced from CommonCrawl, comprises 141 million web pages. OmniCorpus (Li et al., 2024) scales interleaved image-text datasets to the 10-billion level, providing much richer image and text information from diverse sources.

Visual Question Answering (VQA) Datasets.

Visual question answering (VQA) datasets facilitate advanced research in MLLMs by allowing models to interpret images and answer related questions. These datasets typically consist of images paired with corresponding questions and answers, aiming to develop and evaluate models’ ability to understand visual content and provide accurate responses. For example, VQAv2.0 (Goyal et al., 2019) expands on its forerunner by introducing complementary images that prompt different answers to the same question, effectively addressing biases and increasing the dataset’s diversity. Visual-7W (Zhu et al., 2016) extends the scope by incorporating questions across multiple dimensions (what, where, when, who, why, how, and which) linked to specific objects within images. ST-VQA (Biten et al., 2019) integrates scene text into the VQA framework, making it possible to answer text-based questions from visual data. Shikra-RD (Chen et al., 2023j) leverages advanced language models to annotate images with relational descriptions, enhancing image comprehension. OCR-VQA (Mishra et al., 2019) focuses on reading text from book covers to answer related questions, combining elements of optical character recognition and VQA. DocVQA (Mathew et al., 2021) targets document images for extractive question answering, emphasizing precision in answers derived from visible text. A-OKVQA (Marino et al., 2019) introduces questions that demand commonsense and world knowledge, pushing the envelope on reasoning capabilities required from AI systems. TextVQA (Singh et al., 2019) specifically challenges models to read and understand text within images to respond accurately, and GQA (Hudson and Manning, 2019) promotes advanced reasoning over detailed visual scenes annotated with comprehensive scene graphs. These datasets collectively advance the field by enabling more nuanced interactions between AI models and the rich content within images, aiming for greater depth in the understanding and contextual integration of visual and textual data.

3.3.2. Video Datasets

Video datasets are comprehensive collections of video clips accompanied by associated annotations or labels. These datasets are meticulously designed to facilitate the training and evaluation of models for a wide range of video-related tasks, including action recognition, various video-text tasks, as well as video-centric dialogue.

Commonly used video-text datasets include MSR-VTT (Bain et al., 2021), which features 10,000 video clips totaling 41.2 hours, each annotated with multiple descriptive sentences. Similarly, the WebVid series offers millions of web-sourced video clips with accompanying captions, expanding from 2.5 million pairs in WebVid-2M to 10 million pairs in WebVid-10M (Bain et al., 2021), covering 13,000 hours of video content. The VTP (Alayrac et al., 2022) dataset further enriches the field with 27 million short video-text pairs sourced from a select few high-quality websites. Moreover, the recently introduced Panda-70M (Chen et al., 2024c) dataset contains 70 million high-quality video-caption pairs, designed to enhance the training of high-performance MLLMs. InternVid (Wang et al., 2023a) scales video-text datasets to 230M annotated video-text pairs, with each video lasting 351.9s on average. Collectively, these datasets are integral to advancing the capabilities of MLLMs in understanding and interacting with video content.

3.3.3. Audio Datasets

Audio datasets are curated collections of sound recordings along with associated annotations or labels. These datasets are specifically designed to support the training and evaluation of models for various audio-related tasks, such as speech recognition, sound event detection, music classification, and speaker identification. Audio datasets vary significantly in size, quality, and scope, with some focusing on specific languages or acoustic environments, while others strive for diversity to ensure generalizability across different audio processing scenarios. Notable examples include AISHELL-2 (Du et al., 2018), a Mandarin Chinese speech corpus with over 1000 hours of data from multiple regions; WavCaps (Mei et al., 2023), an extensive English audio captioning dataset featuring approximately 400,000 clips; and VSDial-CN (Chen et al., 2023b), a multi-modal dataset derived from VisDial, encompassing visual data alongside related dialogues and captions, tailored for Automatic Speech Recognition (ASR) systems.

3.3.4. 3D Datasets

The development of MLLMs benefits from the utilization of diverse 3D datasets, which provide comprehensive environmental and structural data critical for tasks like scene understanding and semantic segmentation. Notably, the ScanNet (Dai et al., 2017) dataset offers an extensive collection of RGB-D video data across 1,513 indoor scenes, annotated with 3D camera poses, surface reconstructions, and semantic segmentations, totaling over 2.5 million views. Similarly, the S3DIS (Armeni et al., 2017) dataset includes point clouds from six large indoor areas, encompassing 271 rooms, with detailed semantic annotations for each point. The Structured3D (Zheng et al., 2020) dataset provides a vast repository of 3,500 home designs, encompassing 21,835 rooms detailed with elements such as object geometry, materials, and textures, tailored for analyses in 3D reconstruction and interior design.

4. Data-centric pre-training

The pre-training stage of multimodal large language models (MLLMs) is crucial for developing models that can effectively process and generate information across multiple modalities. This stage can be divided into two distinct phases, each focusing on specific aspects of the model’s architecture and training objectives.

The first phase involves two parallel processes: pre-training the LLM backbone (such as Vicuna (Chiang et al., 2023) and LLaMA2 (Touvron et al., 2023)) using text-only datasets, and pre-training the modality encoders (such as ViT (Dosovitskiy et al., 2020) and CLIP-ViT (Radford et al., 2021) for visual encoding, C-Former (Chen et al., 2023b) and HuBERT (Hsu et al., 2021) for audio encoding, and ULIP-2 (Xue et al., 2022, 2023) for 3D point cloud encoding) using pairs of data from different modalities, such as image-text or video-text pairs. This phase aims to establish a strong foundation for the model’s understanding of both textual and non-textual information.

The second phase builds upon the knowledge acquired in the first phase by further training the LLM and modality encoders using a mixture of multimodal data. During this phase, the input projector (such as a linear projector, cross-attention mechanism, Q-Former (Li et al., 2022a), or P-Former (Jian et al., 2024)) is trained to effectively map the features extracted by the modality encoders into the LLM’s embedding space. This mapping allows for a unified representation of textual and non-textual information within the model’s architecture. Some researchers consider employing a selective training approach during the second phase, which involves keeping certain components of the model frozen while training specific parts. This approach helps to preserve the knowledge acquired during the first phase, allows for targeted adaptation to specific multimodal tasks, and reduces computational costs.

4.1. Domain Mixture

The performance of a language model (LM) is significantly influenced by the composition of its pre-training data, which often includes sources such as Wikipedia, books, and web text (Xie et al., 2024a, b).

Previous methods typically select domain weights heuristically or optimize them using downstream tasks (Du et al., 2022). However, these approaches can be suboptimal or costly, requiring the training of LMs for different sets of domain weights and potentially leading to overfitting to specific downstream tasks. DoReMi (Xie et al., 2024a) offers a solution by optimizing domain weights using proxy models that are 30 times smaller than the target LLM, without requiring knowledge of downstream tasks. Recent work (Liu et al., 2024e) formulates the domain mixture problem as a regression problem. It assumes that the ranking of domain mixtures is invariant between small and large models, and develops a domain mixture regression function to predict the optimal domain mixture for model training. Other work approaches this problem through domain mixture scaling laws, extending the original scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) to domain data mixing laws (Que et al., 2024; Ye et al., 2024; Ge et al., 2024) that relate the domain mixture rate to the loss across model scales.
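Once domain weights have been chosen by any of the methods above, they are typically used to decide which domain each training example is drawn from. The sketch below shows only that sampling step; it does not implement DoReMi or any mixing-law fitting, and the domain names, weights, and `sample_domains` helper are hypothetical.

```python
import random
from collections import Counter


def sample_domains(domain_weights, num_samples, seed=0):
    """Draw a domain label for each training example in proportion to its weight."""
    rng = random.Random(seed)
    names = list(domain_weights)
    weights = [domain_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


# Hypothetical weights; in practice they would come from a method such as
# proxy-model optimization or a fitted mixing law.
domain_weights = {"web": 0.6, "books": 0.3, "wikipedia": 0.1}
draws = sample_domains(domain_weights, num_samples=10_000)
empirical = {name: count / len(draws) for name, count in Counter(draws).items()}
```

With enough draws, the empirical proportions in `empirical` approach the requested weights, which is what lets a small change in the weights shift the effective training distribution.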

In the realm of video understanding, there is a consensus on the necessity for diverse datasets (Xu et al., 2023c; Chen et al., 2024b; Song et al., 2023; Chen et al., 2023l; Wang et al., 2023a). Xu et al. (2023c) highlight the use of hierarchical multi-label classification models (Giunchiglia and Lukasiewicz, 2020) to ensure dataset balance and comprehensiveness across multiple dimensions, thereby optimizing the model’s ability to understand and process various video domains, which is instrumental in training more versatile and broadly applicable video-language models.

4.2. Modality Mixture

When pre-training MLLMs, determining the optimal proportions of multimodal data is crucial for enhancing the model’s performance across different tasks. MM1 (McKinzie et al., 2024) explores the balance between image-caption pairs, interleaved image-text documents, and text-only data for vision-language pre-training. They found that a ratio of 5:5:1 for caption/interleaved/text data yields the best overall performance on text-only tasks as well as zero-shot and few-shot image-text tasks. Additionally, incorporating synthetic data (Lai et al., 2023b) improves the model’s few-shot learning capabilities for image-text tasks.
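To show what a mixture ratio such as 5:5:1 means in practice, the sketch below converts a ratio into per-source example budgets for a fixed total. The `split_by_ratio` helper, the source names, and the rule for assigning the rounding remainder are assumptions for illustration and are not taken from MM1.

```python
def split_by_ratio(total, ratio):
    """Split `total` examples across sources according to integer ratio parts."""
    parts = sum(ratio.values())
    budget = {name: total * share // parts for name, share in ratio.items()}
    # Floor division can leave a few examples unassigned; give them to the
    # first source with the largest share (an arbitrary tie-breaking choice).
    leftover = total - sum(budget.values())
    largest = max(ratio, key=ratio.get)
    budget[largest] += leftover
    return budget


ratio = {"caption": 5, "interleaved": 5, "text_only": 1}
budget = split_by_ratio(1100, ratio)
```

For a total of 1,100 examples, the 5:5:1 ratio corresponds to 500 caption examples, 500 interleaved examples, and 100 text-only examples.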

For the development of MLLMs tailored for video understanding, the strategic integration of both video-text and image-text pairs during the pre-training phase is pivotal. This approach, highlighted in recent research (Lin et al., 2023b; Zhang et al., 2023; Jin et al., 2023; Han et al., 2023; Li et al., 2023c; Wang et al., 2023a), enhances the model’s capabilities by leveraging the complementary nature of these data types. Integrating labeled image datasets like COCO with video data broadens the variety of training datasets and addresses the scarcity of high-quality video-text resources. Studies (Lin et al., 2023b; Jin et al., 2023) indicate that training on both images and videos enhances models’ ability to understand static and dynamic visual information without needing task-specific adaptations. Adding a temporal modeling module to the vision encoder bridges the temporal dynamics of videos with the static nature of images, fostering cohesive visual understanding across different types of visual media. However, treating images as single-frame videos (Zhang et al., 2023; Luo et al., 2023) or pseudo-videos may weaken the model’s ability to grasp the temporal aspects of video sequences. An imbalanced mix could also bias the model towards one modality, complicating its learning process. Therefore, determining the optimal mix ratio of image-text pairs and video-text pairs is crucial for improving the model’s comprehensive understanding capabilities, ensuring a balanced approach to learning from both static images and dynamic videos. Additionally, video LLMs are increasingly being trained through multiple branches (Chen et al., 2024b; Zhang et al., 2023; Lyu et al., 2023), such as vision-language, audio-language, and subtitle-language. These methods leverage the correlations among different modalities to broaden the models’ understanding and learning capabilities. 
Recent studies (Shu et al., 2023; Chen et al., 2024b; Han et al., 2023) have shown that combining audio, subtitles, and visual data during training enhances performance across various video understanding benchmarks. This modular training approach also offers the flexibility to pre-train with partial data (Han et al., 2023; Sun et al., 2023c), such as visual-only or audio-only, particularly when comprehensive multimodal data is not available.

4.3. Quality Selection

Due to the diverse distribution of data, training a large model with all available data is not always optimal. Therefore, data selection becomes essential, offering benefits such as reduced training time and energy consumption (Gadre et al., 2024). Previous work has studied data selection methods using n-gram similarity with high-quality datasets (Xie et al., 2023; Gao et al., 2020; Chowdhery et al., 2023), perplexity (Marion et al., 2023; Wenzek et al., 2020), influence functions (Park et al., 2023), and LLM-based classifiers (Wettig et al., 2024). Unlike pure text datasets, data selection for multimodal datasets must consider the alignment between different modalities. Data selection methods can be categorized into two types: active learning-based and pre-training selection. Active learning-based methods, such as CiT (Xu et al., 2023a), use data proxies to dynamically select training data during the training process, achieving significant acceleration effects. Pre-training selection methods, like DataComp (Gadre et al., 2024) and Bunny (He et al., 2024), evaluate and select all data before training begins.

Regarding data selection criteria, methods can be distribution-agnostic, focusing solely on individual data point quality, or distribution-aware, considering the overall data distribution. Distribution-agnostic methods include training a model with the top 30% of data ranked by CLIP score, which can significantly improve results (Gadre et al., 2024). More comprehensive metrics have also been developed, showing better performance than using CLIP score alone (Wang et al., 2024c). For instance, Mahmoud et al. (2024) use the difference between original and synthetic captions generated by a small model to evaluate image-text pair alignment. Given the limited scope of purely distribution-agnostic work, combining distribution-agnostic and distribution-aware methods is often more effective. For example, using CLIP-score-based filters alongside image-based filters can outperform either method alone on large datasets (Wang et al., 2024a; Gadre et al., 2024).
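The distribution-agnostic filter mentioned above, keeping the top 30% of samples by CLIP score, can be sketched as follows. The scores are assumed to be precomputed by some image-text similarity model; the `select_top_percent` helper and the sample records are hypothetical.

```python
def select_top_percent(samples, percent=30, key="clip_score"):
    """Keep the highest-scoring `percent` of samples (at least one)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(samples, key=lambda s: s[key], reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]


# Hypothetical precomputed image-text similarity scores.
scores = [0.31, 0.12, 0.28, 0.35, 0.05, 0.22, 0.40, 0.18, 0.27, 0.09]
samples = [{"id": i, "clip_score": s} for i, s in enumerate(scores)]
selected_ids = [s["id"] for s in select_top_percent(samples)]
```

A distribution-aware variant would add a second stage, for example grouping images by embedding cluster before applying the score threshold, so that the retained subset still covers the overall data distribution.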

5. Data-centric adaptation

Adaptation is crucial for aligning pre-trained multimodal large language models (MLLMs) with specific tasks and user preferences. Self-supervised pre-training provides LLMs with a broad understanding of textual and multimodal information, while supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are essential for adapting these models to excel in targeted applications and adhere to user preferences and societal values. Data-centric SFT involves training models on carefully curated datasets across multiple modalities. In contrast, data-centric RLHF focuses on collecting human judgments to guide the model toward generating desirable and ethically aligned responses. Emphasizing data-centric approaches highlights the importance of high-quality, domain-specific, and multimodal datasets in the successful adaptation of MLLMs.

5.1. Data-Centric Supervised Finetuning

Supervised fine-tuning (SFT) has become an indispensable technique for adapting MLLMs to specific domains and tasks. By leveraging carefully curated instruction-formatted datasets that encompass multimodal information, these models can be guided to acquire the necessary knowledge and abilities to excel in targeted applications.

During the SFT stage, various strategies have been explored, including fine-tuning one or all of the three primary pre-trained components: the LLM backbone, the modality encoders, and the input projector. This selective approach allows for a balance between adapting the model to specific tasks and preserving the generalized knowledge acquired during pre-training.

To facilitate effective supervised fine-tuning, diverse datasets that align with the target domain and task are used. These datasets typically include a combination of multimodal instruction-response pairs and text-only SFT data. By exposing the MLLMs to this rich variety of data, the model can develop a comprehensive understanding of the target domain and acquire the necessary skills to perform the desired tasks effectively.

The curation of supervised fine-tuning datasets usually includes four steps:

• Collect X-text pairs from various data sources, similar to the pre-training stage as introduced in Section 3.1.
• Process the data for further use, including filtering, deduplication, translation, etc.
• Construct the processed X-text pairs into instruction-response form, which involves designing the instruction and converting captions into answers.
• Select high-quality instruction-response pairs based on the fine-tuning needs. This step is not always necessary but can enhance the dataset’s quality.
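The four steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the record fields, the word-count thresholds, the fixed instruction string, and all function names are hypothetical, and the collection step is represented by a hard-coded list.

```python
def process(pairs, min_words=3):
    """Step 2: normalize whitespace, drop very short captions and exact duplicates."""
    seen, kept = set(), []
    for pair in pairs:
        caption = " ".join(pair["caption"].split())
        key = (pair["image"], caption.lower())
        if len(caption.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append({"image": pair["image"], "caption": caption})
    return kept


def construct(pairs, instruction="Describe this image in detail."):
    """Step 3: attach an instruction and reuse the caption as the response."""
    return [
        {"image": p["image"], "instruction": instruction, "response": p["caption"]}
        for p in pairs
    ]


def select(examples, min_response_words=5):
    """Step 4 (optional): keep examples whose responses are long enough."""
    return [e for e in examples if len(e["response"].split()) >= min_response_words]


def curate(raw_pairs):
    return select(construct(process(raw_pairs)))


# Step 1: collected X-text pairs (hard-coded here as a stand-in for real sources).
raw_pairs = [
    {"image": "img1.jpg", "caption": "A brown dog runs across a grassy field."},
    {"image": "img1.jpg", "caption": "A brown dog  runs across a grassy field."},
    {"image": "img2.jpg", "caption": "Nice photo"},
    {"image": "img3.jpg", "caption": "Two red boats."},
]
sft_data = curate(raw_pairs)
```

Keeping each step as a separate function makes it easy to swap in stronger components, such as a learned quality scorer for the selection step, without touching the rest of the pipeline.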

5.1.1. SFT Data Collection and Processing

After collecting these data, several general approaches are employed to improve data quality, including filtering, deduplication, and data enhancement. These steps are similar to the data processing methods used in the pre-training stage described in Section 3.2.

In addition to these general data processing approaches, specific methods aim to improve data quality for supervised fine-tuning tasks. The main goal is to add more textual information to the datasets or improve the quality of captions. These methods, while akin to the data enhancement techniques mentioned in Section 3.2, are more focused on SFT data generation. For example, MiniGPT-4 (Zhu et al., 2023) curated detailed image description datasets for vision-language adaptation by leveraging their pre-trained model with instructions such as “Describe this image in detail,” ensuring generated sentences contained more than 80 tokens. They also used ChatGPT to refine the descriptions and manually checked for quality, obtaining 3,500 high-quality image-text pairs. Similarly, ShareGPT4V (Chen et al., 2023e) used GPT-4 Vision to generate 100,000 high-quality captions that included world knowledge, spatial information, aesthetic evaluation, and more. These captions were directly used to generate SFT instruction-response pairs.

5.1.2. Instruction-Response Pairs Generation

After collecting and processing the original X-text datasets, the next step is to generate instruction-response pair datasets from them. The supervised fine-tuning stage typically aims to enhance the model’s ability on downstream tasks. For this reason, previous work generates instruction-response pairs for different downstream tasks, including descriptive tasks, question-answer tasks, reasoning tasks, and classification tasks. Based on this task categorization, previous work selects different source datasets and uses various methods to generate the instruction-response pairs. In this section, we first introduce the differences between the downstream tasks, and then summarize representative methods of generating instruction-response datasets for each type of downstream task.

Captioning Instruction-Response Datasets.

The simplest tasks are captioning tasks. For this type of task, images (or data of another modality) are provided, and the MLLM is asked to describe the data. Instruction-response datasets designed to enhance the model’s captioning ability are relatively easy to construct: the instruction is usually a single question asking for a description of the X-modality data. Specifically, for image data, MiniGPT-4 (Zhu et al., 2023) constructs description instruction-tuning data by randomly sampling instructions such as “Describe this image in detail”, and the refined high-quality captions described in Section 5.1.1 are used directly as the responses. LLaVA (Xu et al., 2024) designs two types of description instructions, brief and detailed, and uses GPT-4 (OpenAI, 2023) to generate the answers. For brief descriptions, they add keywords such as “concisely”, “brief”, and “short” to their questions, and for detailed descriptions, they add keywords such as “descriptive” and “comprehensive”. ShareGPT4V (Chen et al., 2023e) replaces the detailed description data in LLaVA with its own high-quality captions generated by GPT-4 Vision. X-InstructBLIP (Panagopoulou et al., 2023) further applies these descriptive prompts as description instructions not only to image-text datasets but also to audio-text, video-text, and 3D-text datasets.
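As a rough illustration of this construction, the following Python sketch pairs each caption with a randomly sampled description instruction. Only the instruction “Describe this image in detail” and the keyword lists come from the works above; the remaining instruction wordings, the function name `build_caption_pairs`, and the record layout are assumptions made for the example.

```python
import random

# Hypothetical instruction pools. Only "Describe this image in detail." and the
# keywords (concisely/brief/short, descriptive/comprehensive) come from the text.
BRIEF_INSTRUCTIONS = [
    "Describe the image concisely.",
    "Give a brief description of the image.",
    "Provide a short summary of what the image shows.",
]
DETAILED_INSTRUCTIONS = [
    "Describe this image in detail.",
    "Give a descriptive account of the image.",
    "Provide a comprehensive description of the image.",
]


def build_caption_pairs(records, detailed=True, seed=0):
    """Turn (image_id, caption) records into instruction-response pairs.

    The caption is used directly as the response; the instruction is sampled
    at random from the brief or the detailed pool.
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    pool = DETAILED_INSTRUCTIONS if detailed else BRIEF_INSTRUCTIONS
    pairs = []
    for image_id, caption in records:
        pairs.append({
            "image": image_id,
            "instruction": rng.choice(pool),
            "response": caption,
        })
    return pairs
```

Calling `build_caption_pairs([("img_001", "A dog runs on the beach.")], detailed=False)` would yield one pair whose instruction is drawn from the brief pool and whose response is the original caption.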

Question Answer Instruction-Response Datasets.

Question-answering tasks require the model to answer questions based on the given X-modality information. This kind of task includes simple question answering and multiple-choice questions. The most well-known question-answering task is visual question answering (VQA) for the visual modality. Previous research in computer vision has developed a large number of VQA datasets, and recent work on MLLMs leverages them to construct instruction-response datasets. For example, LLaVA-1.5 (Liu et al., 2023c) leverages GQA (Hudson and Manning, 2019), OCR-VQA (Mishra et al., 2019), OKVQA (Marino et al., 2019) and VQAv2 (Goyal et al., 2019) to enrich its instruction-response datasets, formatting them with a uniform prompt: “Answer the question using a single word or phrase.” TextVQA (Singh et al., 2019), whose questions concern text appearing in images, is another source used for constructing instruction-response datasets. M³IT (Li et al., 2023k) utilizes VQA datasets, such as VQA-v2 (Goyal et al., 2019), Shapes VQA (Andreas et al., 2016) and DocVQA (Mathew et al., 2021), to develop a multimodal instruction tuning dataset that contains 2.4 million instances. Beyond visual question answering, instruction tuning for video question answering likewise leverages existing video QA datasets, such as MSRVTT-QA (Xu et al., 2017), iVQA (Yang et al., 2021), MSVD-QA (Xu et al., 2017), and ActivityNet-QA (Yu et al., 2019).
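A minimal sketch of such a uniform template is given below. The format prompt is the one quoted above; the field names (`image`, `question`, `answer`) and the choice to append the prompt on a new line after the question are assumptions for illustration, not the exact format of any particular release.

```python
# Format prompt quoted in the text for short-answer VQA data.
ANSWER_FORMAT_PROMPT = "Answer the question using a single word or phrase."


def vqa_to_instruction(sample):
    """Convert one VQA record into an instruction-response pair.

    `sample` is assumed to be a dict with "image", "question", and "answer"
    keys (hypothetical field names).
    """
    question = sample["question"].strip()
    return {
        "image": sample["image"],
        # Append the format prompt so that every source dataset shares one style.
        "instruction": f"{question}\n{ANSWER_FORMAT_PROMPT}",
        "response": sample["answer"].strip(),
    }
```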

Reasoning Instruction-Response Datasets

Shikra-RD (Chen et al., 2023j) focuses on enhancing the models’ ability for referential dialogue problems, i.e., users point to specific areas and ask questions. They collect public VQA datasets, image captioning data, and several datasets with existing positional annotations to build instruction tuning datasets. Leveraging captions from the Flickr30K (Young et al., 2014) image-caption dataset and using GPT-4, they obtain high-quality RD annotations. PVIT (Chen et al., 2023h) converts the traditional VQA datasets such as GQA and OCR into instruction-response datasets with region-level information. MiniGPT-V2 (Chen et al., 2023m) designs a multi-task instruction template, following the conversation template from LLaMA-2 (Li et al., 2023f). They add task identifier tokens and spatial location representations to identify the task and spatial information in the image. They build five types of instruction-response datasets, including LLaVA-Instruct-150K, grounded image captions from Flickr30K (Young et al., 2014) with direct instruction and object parsing, multi-round conversations, and text-only unnatural instruction datasets (Honovich et al., 2022b). CogVLM (Wang et al., 2024b) provides an IT dataset called CogVLM-SFT-311K. To construct this dataset, they manually select high-quality IT data from MiniGPT-4 and integrate it with LLaVA-Instruct-150K, translating it into Chinese. They then manually correct any noise in their collected instruction tuning datasets and translate them back into English. In their model training, they also leverage VQA datasets and bounding box instruction tuning datasets generated from four types of datasets: grounded captioning (GC) datasets with box-noun phrase pairs in each image, referring expression generation (REG) datasets with box and explanation, referring expression comprehension (REC) datasets with boxes annotated for text, and grounded visual question answering (GroundedVQA) datasets. 
LLaVA-1.5 (Liu et al., 2023c) leverages region-level VQA datasets such as Visual Genome (Krishna et al., 2017) and RefCOCO (Kazemzadeh et al., 2014; Mao et al., 2016) in constructing its IT datasets.

Other Instruction-Response Datasets

Another type of downstream task is classification, where the data is assigned either a label from a candidate pool or an open label. M³IT (Li et al., 2023k) leverages various datasets for image classification, including ImageNet (Russakovsky et al., 2015), Grounded Object Identification (COCO-GOI) (Lin et al., 2014), COCO-Text (Veit et al., 2016), Image Text Matching (COCO-ITM) (Lin et al., 2014), e-SNLI-VE (Kayser et al., 2021), Multi-modal Fact Checking (Mocheg) (Yao et al., 2023), and IQA (Duanmu et al., 2021). Additionally, for image classification tasks, InstructBLIP (Dai et al., 2024) uses HatefulMemes (Kiela et al., 2020), a binary classification dataset for detecting hateful content in memes, to generate an instruction-response dataset. X-InstructBLIP (Panagopoulou et al., 2023) further extends this approach to audio classification by using AudioSet (Gemmeke et al., 2017). The instructions are sentences asking for the classification of the audio, such as “Classify the following audio”, and the response is the class label of the original data.

5.1.3. SFT Data Selection

Supervised fine-tuning is crucial for adapting LLMs to downstream applications. Zhou et al. (2024) found that LLMs acquire their primary knowledge during the pre-training stage, and that the purpose of instruction tuning is to enable LLMs to perform well on specific tasks and interact with humans. They determined that a small set of carefully crafted, high-quality instructions is sufficient to endow LLMs with powerful instruction-following capabilities. Subsequently, various data selection methods have been proposed to identify high-quality data, improving performance and reducing training costs. These methods fall into four categories: coreset-based, LLM-based, gradient-based, and self-instruction-based methods.

Coreset-Based Methods

A coreset provides a compact representation of a larger dataset while preserving its essential characteristics. From a geometric perspective, similar data points are close to each other in the feature space. For example, Sener and Savarese (2017) employ the greedy k-center algorithm to select a coreset and apply it to CNN image classification. For multi-class classification, MODERATE CORESET (Xia et al., 2022) scores data points based on their distance to the class center. Data points with scores close to the median score are selected as the coreset, which achieves a balance between being discriminative, compressible, and diverse. SIMILAR (Kothawade et al., 2021) picks data points according to submodular information measures (SMI). A properly selected SMI function can maintain the diversity of the selected data and handle imbalanced classes, out-of-distribution data, and redundancy. Some algorithms select data points based on model performance, focusing on “important” samples. For example, Bachem et al. (2017) perform importance sampling based on an upper bound of the sensitivity score to generate a coreset. Generally speaking, data points with high costs are more likely to be selected. Other performance-based metrics include GraNd (Paul et al., 2021) and least confidence (Wang and Shang, 2014).
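To make the geometric idea concrete, the sketch below implements the standard greedy k-center heuristic on plain feature vectors: it repeatedly adds the point that is farthest from the centers chosen so far. It is a minimal illustration under simplifying assumptions (Euclidean distance, a fixed starting index, in-memory lists) and not the exact procedure of any cited work; the function name `k_center_greedy` is chosen for this example.

```python
import math


def k_center_greedy(points, k, first=0):
    """Select k indices with the greedy k-center heuristic.

    points: list of equal-length numeric tuples (feature vectors).
    first:  index of the initial center (an arbitrary choice in this sketch).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # The next center is the point that is currently covered worst.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected
```

For instance, with two tight clusters and one isolated point, the heuristic tends to pick one representative per cluster plus the isolated point, rather than several near-duplicates from the same cluster.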

Given that coreset techniques have demonstrated promising results both theoretically and empirically, it is feasible to apply these algorithms to the data used for training LLMs. Chen et al. (2023k) employ the K-greedy algorithm to identify just 0.5% of the core data for fine-tuning a pre-trained language model, achieving performance only 1-2% lower than that obtained using the entire dataset. Similarly, the approach proposed by Das and Khetan (2023) selects data that represents the easiest and most challenging examples, a strategy shown to be effective by Sorscher et al. (2022). By utilizing only 32.5% of carefully selected data, they achieved state-of-the-art results.

LLMs-Based Methods

LLM-based methods use external models, most often LLMs, to evaluate the quality, diversity, and complexity of data. These models serve various functions, from scoring data to enhancing its characteristics. For instance, Du et al. (2023) leverage DeBERTa (He et al., 2020) to score data and retain high-quality samples, and combine it with the k-center greedy algorithm to select diverse data. Chen et al. (2023f) score the accuracy of data using ChatGPT to pick out high-quality data. Xu et al. (2023b) use GPT-4 to rewrite data to increase their complexity and then streamline it by reducing its variety and improving its quality. Liu et al. (2023e) train two models on data labeled by ChatGPT to score the quality and complexity of the data. Lu et al. (2023c) rely on ChatGPT to tag each instance, defining its complexity and diversity based on these tags. Parkar et al. (2024) first cluster the data and then use GPT-4 to select high-quality data from each cluster. Zheng et al. (2024) and Sun et al. (2024) leverage LLMs to automatically select high-quality domain data, achieving superior performance. For multimodal instruction tuning datasets, MLLMs are typically used to obtain or select high-quality instruction tuning data. LLaVA-1.5 (Liu et al., 2023c) pioneers the use of text-only GPT-4 to expand the COCO (Lin et al., 2014) bounding box and caption dataset into a multimodal instruction-following dataset. Following LLaVA, ShareGPT4V (Chen et al., 2023e) utilizes GPT-4V to enhance a portion of LLaVA’s instruction tuning data. For Video Large Language Models (VideoLLMs), Liang et al. (2024) pioneered the use of the CLIP score for video keyframe data selection, achieving SoTA performance.

Gradient-Based Methods

Neural networks are typically trained with gradient-based optimization. Consequently, researchers have explored gradient-based methods that use a subset of the data to approximate the full gradient effectively. The EL2N score was introduced to evaluate how removing a single data point affects the gradient (Paul et al., 2021). However, the EL2N score is limited to single-task learning scenarios. To address multi-task learning, particularly in NLP tasks, Attendu and Corbeil (2023) apply the EL2N score to each individual task and then combine these scores using the L2 norm. Another notable work (Xia et al., 2024) adapts existing influence formulations to work with the Adam optimizer. It uses only a few steps of gradients to compute gradient features efficiently and stores the compressed features in a gradient datastore for efficient data selection.
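As a hedged illustration, the error-norm score of Paul et al. (2021) can be computed for a single example as the L2 norm of the difference between the predicted class probabilities and the one-hot label, and the multi-task extension described above combines per-task scores with another L2 norm. The sketch below assumes raw logits from one model checkpoint; in practice the score is usually averaged over several checkpoints or runs, which is omitted here. All function names are chosen for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def error_l2_norm_score(logits, label):
    """L2 norm of (softmax(logits) - one_hot(label)) for one example."""
    probs = softmax(logits)
    return math.sqrt(sum(
        (p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs)
    ))


def multitask_score(per_task_scores):
    """Combine the per-task scores of one example with an L2 norm."""
    return math.sqrt(sum(s * s for s in per_task_scores))
```

Examples with a high score are those the model currently gets wrong or is uncertain about, so a selection rule would typically keep the highest-scoring fraction of the data.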

Self-Instruction-Based Methods

Self-instruction-based methods evaluate and select data without involving any external model. Li et al. (2023l) propose a self-instruction method that lets LLMs autonomously identify and select challenging data. Li et al. (2023d) identify high-quality data as those that, when included as a few-shot example, enhance the model’s performance. Kung et al. (2023) propose a task-level uncertainty metric that measures the sensitivity of LLMs to instruction perturbations for a task in order to identify high-quality data, and combine it with active learning to select data iteratively. Liu et al. (2024c) score data at three levels of granularity: token, sentence, and model. Each level’s score is calculated from the previous level, with the model-level score serving as the final score.

5.2. Data-Centric Human Preference Alignment

To align language models with human preferences, previous work has considered constructing instruction-response datasets that respect human preferences. These datasets usually involve three main elements. Reinforcement learning from human feedback (RLHF) uses reinforcement learning (RL) to align the model with human preferences (Ouyang et al., 2022). After training the LLM, RLHF collects human feedback that ranks Q&A text data, trains a reward model (RM), and then uses RL to fine-tune the LLM.

The most influential RLHF work is InstructGPT (Ouyang et al., 2022). To collect human feedback, InstructGPT first gathers prompts, both written manually and submitted by users of the InstructGPT playground. The prompts are then processed heuristically, for example by de-duplicating prompts that share a long common prefix and by limiting the number of prompts to 200 per user ID, to protect users’ sensitive information and improve the quality of the prompts.
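The two heuristics can be sketched as follows. This is an illustrative approximation and not the InstructGPT implementation: “sharing a long common prefix” is approximated by comparing the first `prefix_len` characters of the normalized prompt, and `prefix_len` is a hypothetical parameter. Only the cap of 200 prompts per user ID is taken from the text.

```python
from collections import defaultdict


def filter_prompts(records, prefix_len=30, max_per_user=200):
    """Heuristically de-duplicate prompts and cap the number kept per user.

    records: iterable of (user_id, prompt) tuples, processed in order.
    prefix_len: hypothetical threshold; prompts whose first `prefix_len`
        normalized characters match an earlier kept prompt are dropped.
    """
    seen_prefixes = set()
    kept_per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        key = prompt.strip().lower()[:prefix_len]
        if key in seen_prefixes:
            continue  # shares a long prefix with an earlier prompt
        if kept_per_user[user_id] >= max_per_user:
            continue  # this user has already reached the cap
        seen_prefixes.add(key)
        kept_per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept
```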

These prompts encompass multiple tasks such as text generation, question answering, dialogue, summarization, and more. Human annotators are then asked to rank the responses to each prompt, constructing a Q&A dataset used to train the reward model (RM). Subsequently, the fine-tuning stage of the LLM is framed as a reinforcement learning (RL) problem using the trained reward model. In LLaMA-2 (Li et al., 2023f), human preferences are further divided into fine-grained aspects, with two separate reward models trained to rank helpfulness and safety, respectively. Further work also aims to improve the models’ generation abilities from these aspects.
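A common way to train a reward model from such rankings is to expand each ranked list into (preferred, rejected) pairs and minimize a pairwise logistic loss, -log σ(r_w - r_l), on the scalar rewards of each pair. The sketch below shows only these two pieces; the reward values are assumed to come from some reward model that is not shown, and the function names are chosen for this example.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_reward_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected)).

    Written as log(1 + exp(-margin)), which is the same quantity.
    """
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))
```

The loss is small when the reward model scores the preferred response well above the rejected one and grows as the ordering is violated.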

For MLLMs, several works have focused on aligning models with human preferences. Based on LLaVA (Liu et al., 2023b), LLaVA-RLHF (Sun et al., 2023a) designed a reinforcement learning from human feedback (RLHF) method to align LLaVA with human feedback. DRESS (Chen et al., 2023i) aims to improve models’ response quality from the perspective of human values, particularly the 3H criteria (Ouyang et al., 2022) (helpfulness, honesty, and harmlessness). They provide two datasets: the Large Vision Language Model with Natural Language Feedback (LVLM_NLF) dataset and the Vision-Language Safety (VLSafe) dataset. To construct the VLSafe dataset, they adopt the LLM-Human-in-the-Loop approach, iteratively creating and filtering the data based on the COCO dataset. The LLM they used is GPT-3.5 Turbo. The final dataset contains 5,874 samples, meeting most requirements for harmlessness alignment and evaluation.

6. Evaluation

To offer a comprehensive data-centric viewpoint for assessing MLLMs, this section covers two key aspects. First, we review widely used methods for evaluating data quality. Then, we provide an overview of MLLM evaluation datasets, highlighting their collection, processing, and characteristics.

6.1. Data Evaluation

Evaluating datasets is essential to ensure their quality and reliability. This process helps identify and correct errors and biases, enhancing the accuracy of model training and predictions. In this section, we summarize recent dataset evaluation metrics from the following perspectives.

6.1.1. Dataset Diversity

Dataset diversity is crucial for developing robust and generalizable machine learning models. A diverse dataset enables the model to handle various scenarios and minimizes the risk of bias. Some metrics (Heusel et al., 2017; Sajjadi et al., 2018) require a reference distribution or dataset. For instance, Heusel et al. (2017) measure the Wasserstein-2 distance between two Gaussian distributions: one fitted to the embeddings of the reference sample and the other to the embeddings of the sample being evaluated for diversity. Sajjadi et al. (2018) propose a two-metric evaluation paradigm using precision and recall, with precision measuring quality and recall assessing diversity in terms of coverage of the reference distribution. Other work uses similarity scores to define diversity. For example, Shen et al. (2019) and Fomicheva et al. (2020) utilize the average pairwise similarity score or its complement, the average dissimilarity. The Vendi score (Friedman and Dieng, 2023; Pasarkar and Dieng, 2023; YEH et al., 2023) stands out from these metrics as a reference-free, flexible, and interpretable metric that measures internal diversity without comparison to a reference distribution. Its reliance on a user-defined similarity function allows for broad applicability across domains, incorporating correlations between features while being computationally efficient and unsupervised. In addition, Lee et al. (2023) use the Task2Vec diversity coefficient (Miranda et al., 2022) to evaluate the diversity of publicly available datasets.
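The simplest of these measures, the average pairwise dissimilarity, can be sketched in a few lines. The example below assumes that each sample has already been mapped to an embedding vector and uses cosine similarity as the similarity function; both choices, and the function names, are assumptions for illustration rather than the setup of any cited work.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_pairwise_dissimilarity(embeddings):
    """One minus the mean cosine similarity over all unordered pairs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

A dataset of near-identical embeddings scores close to 0, while mutually orthogonal embeddings score 1. The computation is quadratic in the number of samples, so large datasets would need subsampling.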

6.1.2. Dataset Quality

Dataset quality is essential for the accuracy and reliability of machine learning models: high-quality data enables models to learn effectively and make precise predictions. There are several approaches to evaluating data quality. TRUE (Honovich et al., 2022a) offers a thorough evaluation of factual consistency metrics in grounded text generation systems. By standardizing datasets and introducing a meta-evaluation protocol, it highlights the robust performance of large-scale NLI and QG-QA methods across different tasks. Object Hallucination (Rohrbach et al., 2018) introduces CHAIR (Caption Hallucination Assessment with Image Relevance), which measures the proportion of words in a generated caption that correspond to objects present in the image, using ground truth sentences and object segmentations. FAITH SCORE (Jing et al., 2023) assesses the faithfulness of generated answers from large vision-language models (LVLMs). It identifies descriptive statements, extracts atomic facts, and checks their consistency with input images. This structured process checks that generated answers align closely with the visual content, providing a comprehensive evaluation of faithfulness in LVLM outputs.
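As a rough illustration of an object-hallucination measure in the spirit of CHAIR, the sketch below computes, for a single caption, the fraction of mentioned objects that do not appear in the image’s ground-truth object set. It assumes that object mentions have already been extracted from the caption and normalized to the same vocabulary as the annotations, a step the original metric handles with an object vocabulary and synonym lists; the function name is hypothetical.

```python
def hallucinated_object_rate(mentioned_objects, ground_truth_objects):
    """Fraction of mentioned objects that are absent from the ground truth.

    mentioned_objects: object names extracted from one generated caption.
    ground_truth_objects: object names annotated for the image.
    Returns 0.0 when the caption mentions no objects.
    """
    mentioned = list(mentioned_objects)
    if not mentioned:
        return 0.0
    truth = set(ground_truth_objects)
    hallucinated = [obj for obj in mentioned if obj not in truth]
    return len(hallucinated) / len(mentioned)
```

One minus this value is the proportion of mentioned objects that are actually present in the image.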

6.1.3. Dataset Similarity

Learning theory suggests that performance improves when the training data distribution closely matches the test data distribution. Therefore, it is essential to measure the distance between data distributions both theoretically and empirically. Common metrics for measuring distribution similarity include Euclidean distance, KL divergence, CORAL loss (Sun and Saenko, 2016), Wasserstein distance, and maximum mean discrepancy (MMD) (Jiang et al., 2022). The MAUVE score (Pillutla et al., 2023) serves as a comparison measure between pairs of distributions, such as those encountered in generative modeling of text or images. Its authors provide statistical bounds and extensive experiments to demonstrate its effectiveness.
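To illustrate one of these distances, the sketch below computes the (biased) empirical estimate of the squared maximum mean discrepancy between two samples using a Gaussian (RBF) kernel. The kernel choice and the `bandwidth` value are assumptions made for the example; in practice the bandwidth is often set with a heuristic such as the median pairwise distance.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when both samples are identical and grows as the two samples move apart in feature space.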

6.2. Evaluation Datasets for MLLMs

In this section, we will briefly summarize different types of evaluation datasets according to downstream tasks.

Common evaluation tasks for MLLMs can be divided into captioning tasks, question answering tasks, perception and reasoning tasks, and other noteworthy tasks such as classification. Captioning datasets (Chen et al., 2015; Venugopalan et al., 2015; Kim et al., 2019) act as benchmarks to evaluate a model’s fundamental ability to understand information from different modalities. They can be used to assess the zero-shot or few-shot learning capabilities of MLLMs in multimodal tasks, offering insights into their ability to generalize and adapt to new scenarios with limited training examples. Evaluation datasets for question answering (QA), in turn, are crucial for assessing the performance of MLLMs that handle diverse data types such as text, images, and audio (Goyal et al., 2019; Xu et al., 2017; Lipping et al., 2022). These datasets require models to analyze and combine information from multiple sources to answer the questions accurately. Perception and reasoning tasks are crucial tests for assessing the general abilities of multimodal large language models (MLLMs). These tasks require the model to reason over complicated logic and capture structured or spatial information (Singh et al., 2019; Masry et al., 2022; Liu et al., 2023b; Mangalam et al., 2024). There are also several noteworthy tasks not mentioned above, including classification tasks that require the model to assign the correct labels to examples (Kiela et al., 2020; Piczak, 2015; Wu et al., 2015; Goyal et al., 2017), and multimodal dialog tasks that rely on the model’s in-context learning abilities to answer questions over multiple rounds of conversation (Bai et al., 2023b).

7. Future Direction

In this section, we turn to research avenues for MLLMs from a data-centric viewpoint. We organize these directions according to the structure of our article. We begin with future directions concerning data collection and processing, followed by pre-training, adaptation, and evaluation.

7.1. Data Processing System for MLLMs.

The processing of data for MLLMs entails numerous intricate steps across various modalities, as outlined in Section 3.2. Recent efforts in preparing data for MLLM training involve processing data autonomously through specialized data pipelines and processing operators (Alayrac et al., 2022; Changpinyo et al., 2021; Bain et al., 2021; Du et al., 2018). While previous work has focused on designing data processing systems for LLMs, such as Data-Juicer (Chen et al., 2023d) and Oasis (Zhou et al., 2023), there remains a gap in the availability of data processing systems tailored specifically for MLLM data. Such systems have to be equipped to handle multi-modal data types, encompassing not only textual data but also images, videos, audio, and 3D data formats.

7.2. Data Quantity Analysis for MLLM Pre-training.

Large language models exhibit distinct characteristics such as emergent abilities and scaling laws. Previous research has examined emergent phenomena concerning model scale, measured by factors like training compute and the number of model parameters, as well as in relation to evaluation metrics (Wei et al., 2022; Schaeffer et al., 2024). Scaling laws have been investigated concerning model size, data scale, and data quality. OpenAI initially explored the power-law relationship of pre-training loss with respect to model size, dataset size, and training compute in neural language models (Kaplan et al., 2020). Subsequently, DeepMind introduced a new scaling law demonstrating the optimal allocation of compute resources (Hoffmann et al., 2022). Building on this, Goyal et al. (2024) investigated the trade-offs between dataset quality and quantity, while Ye et al. (2024) examined scaling laws associated with data mixtures. Furthermore, Zhang et al. (2024b) analyzed the impact of these properties, focusing on text-only LLMs. However, there remains a gap in understanding how data quantity influences emergent abilities in MLLMs. Exploring the scaling laws of MLLMs, particularly with respect to data quantity, represents an area that warrants further investigation.
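For a sense of how such a power-law relationship is typically examined, the sketch below fits loss ≈ a · size^(-alpha) by ordinary least squares in log-log space. It is a deliberately simplified form: published scaling laws often include an irreducible loss term and joint dependence on several factors, which are omitted here, and the function name is chosen for this example.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by linear regression on log-log values.

    Returns (a, alpha). Requires at least two distinct positive sizes and
    positive losses.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - alpha * log(size), so alpha is the negated slope.
    return math.exp(intercept), -slope
```

Applied to measured (dataset size, validation loss) pairs, the fitted exponent gives a first estimate of how quickly additional data reduces the loss, which is the kind of analysis that remains largely open for multimodal data.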

7.3. Data Quality Analysis for MLLM Pre-training.

Adjusting the ratio and quantity of pre-training data holds significant potential for enhancing model performance. However, given the abundance of available pre-training data, efficient data selection algorithms are critically needed. Evaluating data for large models typically requires running the models themselves, which can impose significant computational burdens. Hence, the development of proxy models becomes crucial, as they offer substantial reductions in computational cost. Moreover, proxy models are instrumental in optimizing the mixing ratio of data from diverse sources. While several approaches such as DoReMi (Xie et al., 2024a) and DoGE (Fan et al., 2023) have been proposed in this regard, proxy models still need refinement, and the relationship between proxy models and their full-scale counterparts requires further exploration.

7.4. MLLM Data Evaluation.

As elaborated in Section 6.1, despite the introduction of various data evaluation metrics, comprehensive metrics tailored specifically to multimodal data are still notably absent. Assessing multimodal data is more challenging because of its diverse data types, which often span multiple tasks and modalities (Liu et al., 2023c; Chen et al., 2024c). The quality of such data can be affected by the individual models corresponding to each modality and by their alignment. Hence, it is imperative to examine the quality of multimodal data along multiple dimensions. Rather than relying solely on human-defined features, data quality can also be improved by leveraging statistical metrics and methodologies, such as examining distributions. The notion of domain adaptation offers theoretical underpinnings for matching distributions (Jiang et al., 2022). Consequently, assessing distributions emerges as a promising avenue, both in terms of theoretical frameworks and empirical validation.

7.5. Data Quality Improving for MLLM Supervised Fine-Tuning.

Instruction tuning data can substantially improve a model’s ability to follow instructions and its performance on specific tasks. Given the pivotal role of both quality and quantity in optimizing model performance, the exploration of this area is a crucial research endeavor. The evaluation of data for large-scale models necessitates model-agnostic metrics that effectively capture the distinct characteristics of the data. As highlighted in the preceding paragraph, there is a pressing need to devise metrics tailored specifically for instruction tuning data. An alternative pragmatic approach is to employ Large Language Models (LLMs) for data evaluation, leveraging the extensive knowledge acquired during their pre-training phase (Du et al., 2023; Chen et al., 2023f; Xu et al., 2023b; Liu et al., 2023e; Lu et al., 2023c; Parkar et al., 2024). Utilizing LLMs for data assessment proves to be cost-effective; however, despite the success of GPT-based methods for automated data quality evaluation, such approaches often lack interpretability. Hence, further investigation is warranted to better understand GPT’s efficacy and to delineate the boundaries of employing GPT for data evaluation.

7.6. MLLM Lifelong Learning.

MLLMs often learn from diverse data types sequentially across various training phases (Li et al., 2022a; Liu et al., 2024b, 2023c). Typically, this sequential training starts with initializing the model using pre-trained weights for each modality, followed by pre-training, instruction tuning, and concluding with reinforcement learning from human feedback. It is critical throughout these stages to prevent catastrophic forgetting, ensuring the model retains its foundational language capabilities while acquiring multimodal skills. Therefore, integrating lifelong learning strategies with MLLMs represents a significant and necessary research direction.

8. Conclusion

In this survey, we review recent advancements in data-centric multimodal large language models (MLLMs), introducing key concepts, findings, and techniques essential for processing training data for these models. Specifically, our discussion focuses on three crucial aspects of data handling: pre-training, adaptation, and evaluation. For each aspect, we highlight pivotal techniques and insights for data processing that are critical for effectively training MLLMs. Additionally, we provide a comprehensive summary of available data resources for training these models and discuss strategies for utilizing machine learning to enhance the performance of MLLMs. This survey aims to encapsulate the most recent literature on data-centric MLLMs, serving as a valuable reference for both researchers and engineers in the field.

References

Abbas et al. (2023) Amro Kamal Mohamed Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, and Ari S Morcos. 2023. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
Agrawal et al. (2019) Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. 2019. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision. 8948–8957.
Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022), 23716–23736.
Albalak et al. (2024) Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. A Survey on Data Selection for Language Models. arXiv:2402.16827 [cs.CL]
Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 39–48.
Anne Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803–5812.
Armeni et al. (2017) Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. 2017. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105 (2017).
Attendu and Corbeil (2023) Jean-michel Attendu and Jean-Philippe Corbeil. 2023. NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP). 129–146.
Bachem et al. (2017) Olivier Bachem, Mario Lucic, and Andreas Krause. 2017. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476 (2017).
Baek et al. (2019) Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. Character region awareness for text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9365–9374.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023a. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023).
Bai et al. (2023b) Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. 2023b. Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890 (2023).
Bain et al. (2021) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728–1738.
Bandy and Vincent (2021) Jack Bandy and Nicholas Vincent. 2021. Addressing “Documentation Debt” in Machine Learning: A Retrospective Datasheet for BookCorpus. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=Qd_eU1wvJeu
Barbieri et al. (2020) Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. arXiv:2010.12421 [cs.CL]
Baumgartner et al. (2020) Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, Vol. 14. 830–839.
Ben Abacha and Demner-Fushman (2019) Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC bioinformatics 20 (2019), 1–23.
Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000).
Biten et al. (2019) Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. 2019. Scene Text Visual Question Answering. arXiv:1905.13648 [cs.CV]
Blattmann et al. (2023) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023).
Broder (1997) Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21–29.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Bustos et al. (2020) Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria De La Iglesia-Vaya. 2020. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis 66 (2020), 101797.
Carlini et al. (2022) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying Memorization Across Neural Language Models. In The Eleventh International Conference on Learning Representations.
Cauli and Reforgiato Recupero (2022) Nino Cauli and Diego Reforgiato Recupero. 2022. Survey on videos data augmentation for deep learning models. Future Internet 14, 3 (2022), 93.
Changpinyo et al. (2021) Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. arXiv:2102.08981 [cs.CV]
Charikar (2002) Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. 380–388.
Chen et al. (2023g) Chongyan Chen, Mengchen Liu, Noel Codella, Yunsheng Li, Lu Yuan, and Danna Gurari. 2023g. Fully authentic visual question answering dataset from online communities. arXiv preprint arXiv:2311.15562 (2023).
Chen et al. (2023h) Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu. 2023h. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437 (2023).
Chen et al. (2024a) Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, and Lichao Sun. 2024a. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. arXiv preprint arXiv:2402.04788 (2024).
Chen and Dolan (2011) David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 190–200.
Chen et al. (2023d) Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2023d. Data-juicer: A one-stop data processing system for large language models. arXiv preprint arXiv:2309.02033 (2023).
Chen et al. (2023b) Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. 2023b. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160 (2023).
Chen et al. (2023l) Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. 2023l. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292 (2023).
Chen et al. (2023k) Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023k. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246 (2023).
Chen et al. (2023m) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023m. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478 [cs.CV]
Chen et al. (2023j) Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023j. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195 (2023).
Chen et al. (2023e) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023e. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023).
Chen et al. (2023f) Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023f. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701 (2023).
Chen et al. (2023c) Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. 2023c. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345 (2023).
Chen et al. (2024b) Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. 2024b. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems 36 (2024).
Chen et al. (2024c) Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. 2024c. Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers. arXiv preprint arXiv:2402.19479 (2024).
Chen et al. (2023a) Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. 2023a. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 (2023).
Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
Chen et al. (2022) Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022. PaLI: A Jointly-Scaled Multilingual Language-Image Model. In The Eleventh International Conference on Learning Representations.
Chen et al. (2023i) Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2023i. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081 (2023).
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
Computer (2023) Together Computer. 2023. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data
Dai et al. (2017) Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5828–5839.
Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36 (2024).
Das and Khetan (2023) Devleena Das and Vivek Khetan. 2023. DEFT: Data Efficient Fine-Tuning for Large Language Models via Unsupervised Core-Set Selection. arXiv preprint arXiv:2310.16776 (2023).
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
Desai et al. (2021) Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. 2021. Redcaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431 (2021).
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Drossos et al. (2020) Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: An audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 736–740.
Du et al. (2018) Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. 2018. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583 (2018).
Du et al. (2022) Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569.
Du et al. (2023) Qianlong Du, Chengqing Zong, and Jiajun Zhang. 2023. Mods: Model-oriented data selection for instruction tuning. arXiv preprint arXiv:2311.15653 (2023).
Duanmu et al. (2021) Zhengfang Duanmu, Wentao Liu, Zhongling Wang, and Zhou Wang. 2021. Quantifying visual image quality: A bayesian view. Annual Review of Vision Science 7 (2021), 437–464.
Fan et al. (2024) Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. 2024. Improving clip training with language rewrites. Advances in Neural Information Processing Systems 36 (2024).
Fan et al. (2023) Simin Fan, Matteo Pagliardini, and Martin Jaggi. 2023. Doge: Domain reweighting with generalization estimation. arXiv preprint arXiv:2310.15393 (2023).
Farnebäck (2003) Gunnar Farnebäck. 2003. Two-frame motion estimation based on polynomial expansion. In Image Analysis: 13th Scandinavian Conference, SCIA 2003 Halmstad, Sweden, June 29–July 2, 2003 Proceedings 13. Springer, 363–370.
Feng et al. (2021) Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075 (2021).
Fomicheva et al. (2020) Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics 8 (2020), 539–555.
Fonseca et al. (2022) Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. 2022. FSD50K: An Open Dataset of Human-Labeled Sound Events. arXiv:2010.00475 [cs.SD]
Friedman and Dieng (2023) Dan Friedman and Adji Bousso Dieng. 2023. The vendi score: A diversity evaluation metric for machine learning. Transactions on Machine Learning Research (2023).
Fu et al. (2024) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2024. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv:2306.13394 [cs.CV]
Gadre et al. (2024) Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. 2024. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems 36 (2024).
Gao and Lin (2004) Jianfeng Gao and Chin-Yew Lin. 2004. Introduction to the special issue on statistical language modeling. 87–93.
Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision. 5267–5275.
Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
Ge et al. (2024) Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and Bolin Ding. 2024. Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining. arXiv preprint arXiv:2405.14908 (2024).
Gemmeke et al. (2017) Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 776–780.
Giunchiglia and Lukasiewicz (2020) Eleonora Giunchiglia and Thomas Lukasiewicz. 2020. Coherent hierarchical multi-label classification networks. Advances in neural information processing systems 33 (2020), 9662–9673.
Goyal et al. (2017) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision. 5842–5850.
Goyal et al. (2024) Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi Raghunathan, and J Zico Kolter. 2024. Scaling Laws for Data Filtering–Data Curation cannot be Compute Agnostic. arXiv preprint arXiv:2404.07177 (2024).
Goyal et al. (2019) Yash Goyal, Tejas Khot, Aishwarya Agrawal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2019. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision 127, 4 (2019), 398–414.
Grave et al. (2018) Édouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomáš Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Gu et al. (2022) Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. 2022. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems 35 (2022), 26418–26431.
Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. arXiv preprint arXiv:2306.11644 (2023).
Guo et al. (2021) Jia Guo, Jiankang Deng, Alexandros Lattas, and Stefanos Zafeiriou. 2021. Sample and computation redistribution for efficient face detection. arXiv preprint arXiv:2105.04714 (2021).
Guo et al. (2023) Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. 2023. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023).
Gupta et al. (2022) Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, and Derek Hoiem. 2022. Grit: General robust image task benchmark. arXiv preprint arXiv:2204.13653 (2022).
Gurari et al. (2018) Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3608–3617.
Hadi et al. (2023) Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints (2023).
Han et al. (2023) Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman. 2023. AutoAD: Movie description in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18930–18940.
He et al. (2024) Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. 2024. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530 (2024).
He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In International Conference on Learning Representations.
Hernandez et al. (2022) Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, et al. 2022. Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487 (2022).
Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
Hong et al. (2023) Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 2023. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems 36 (2023), 20482–20494.
Honovich et al. (2022a) Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022a. TRUE: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991 (2022).
Honovich et al. (2022b) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022b. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689 (2022).
Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
Hu et al. (2024) Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. Bliva: A simple multimodal llm for better handling of text-rich visual questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2256–2264.
Hu et al. (2022) Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. 2022. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 17980–17989.
Huang et al. (2016) Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies. 1233–1239.
Hudson and Manning (2019) Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6700–6709.
Jakubik et al. (2022) Johannes Jakubik, Michael Vössing, Niklas Kühl, Jannis Walk, and Gerhard Satzger. 2022. Data-centric artificial intelligence. arXiv preprint arXiv:2212.11854 (2022).
Jang et al. (2017) Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2758–2766.
Jarrahi et al. (2022) Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2022. The principles of data-centric ai (dcai). arXiv preprint arXiv:2211.14611 (2022).
Jelinek (1998) Frederick Jelinek. 1998. Statistical methods for speech recognition. MIT press.
Jian et al. (2024) Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2024. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. Advances in Neural Information Processing Systems 36 (2024).
Jiang et al. (2022) Junguang Jiang, Yang Shu, Jianmin Wang, and Mingsheng Long. 2022. Transferability in deep learning: A survey. arXiv preprint arXiv:2201.05867 (2022).
Jin et al. (2023) Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. 2023. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046 (2023).
Jing et al. (2023) Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, and Xinya Du. 2023. Faithscore: Evaluating hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477 (2023).
Johnson et al. (2023) Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. 2023. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data 10, 1 (2023), 1.
Johnson et al. (2019) Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, and Steven Horng. 2019. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv:1901.07042 [cs.CV]
Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
Kaddour (2023) Jean Kaddour. 2023. The MiniPile Challenge for Data-Efficient Language Models. arXiv preprint arXiv:2304.08442 (2023).
Kafle et al. (2018) Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5648–5656.
Kandpal et al. (2022) Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning. PMLR, 10697–10707.
Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128–3137.
Kayser et al. (2021) Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, and Thomas Lukasiewicz. 2021. e-vil: A dataset and benchmark for natural language explanations in vision-language tasks. In Proceedings of the IEEE/CVF international conference on computer vision. 1244–1254.
Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 787–798.
Kembhavi et al. (2016) Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 235–251.
Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
Kiela et al. (2020) Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems 33 (2020), 2611–2624.
Kim et al. (2019) Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 119–132.
Ko et al. (2015) Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Interspeech, Vol. 2015. 3586.
Kocetkov et al. (2022) Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. 2022. The stack: 3 tb of permissively licensed source code. arXiv preprint arXiv:2211.15533 (2022).
Kombrink et al. (2011) Stefan Kombrink, Tomas Mikolov, Martin Karafiát, and Lukás Burget. 2011. Recurrent Neural Network Based Language Modeling in Meeting Recognition. In Interspeech, Vol. 11. 2877–2880.
Kothawade et al. (2021) Suraj Kothawade, Nathan Beck, Krishnateja Killamsetty, and Rishabh Iyer. 2021. Similar: Submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems 34 (2021), 18685–18697.
Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (2017), 32–73.
Kung et al. (2023) Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, and Nanyun Peng. 2023. Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Lai et al. (2023b) Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, et al. 2023b. From scarcity to efficiency: Improving clip training via visual-enriched captions. arXiv preprint arXiv:2310.07699 (2023).
Lai et al. (2023a) Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. 2023a. VeCLIP: Improving CLIP Training via Visual-enriched Captions. https://api.semanticscholar.org/CorpusID:263835242
Laurençon et al. (2024) Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. 2024. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 36 (2024).
Laurençon et al. (2022) Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems 35 (2022), 31809–31826.
Lee et al. (2023) Alycia Lee, Brando Miranda, and Sanmi Koyejo. 2023. Beyond scale: the diversity coefficient as a data quality metric demonstrates llms are pre-trained on formally diverse data. arXiv preprint arXiv:2306.13840 (2023).
Lee et al. (2022) Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8424–8445.
Lei et al. (2021) Jie Lei, Tamara L Berg, and Mohit Bansal. 2021. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems 34 (2021), 11846–11858.
Lei et al. (2020) Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer, 447–463.
Li et al. (2023h) Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023h. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 (2023).
Li et al. (2023m) Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. 2023m. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219 (2023).
Li et al. (2022b) Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, and Di Hu. 2022b. Learning to answer questions in dynamic audio-visual scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19108–19118.
Li et al. (2023e) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023e. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022a. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888–12900.
Li et al. (2023i) Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023i. Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv:2305.01526 [cs.CL]
Li et al. (2023c) KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023c. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023).
Li et al. (2023g) Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. 2023g. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005 (2023).
Li et al. (2023k) Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. 2023k. M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387 (2023).
Li et al. (2023l) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023l. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032 (2023).
Li et al. (2024) Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. 2024. OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text. arXiv preprint arXiv:2406.08418 (2024).
Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
Li et al. (2023b) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023).
Li et al. (2023d) Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu, Tongliang Liu, Fei Huang, et al. 2023d. One shot learning as instruction data prospector for large language models. arXiv preprint arXiv:2312.10302 (2023).
Li et al. (2023f) Yanwei Li, Chengyao Wang, and Jiaya Jia. 2023f. LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043 (2023).
Li et al. (2023j) Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. 2023j. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607 (2023).
Liang et al. (2024) Hao Liang, Jiapeng Li, Tianyi Bai, Chong Chen, Conghui He, Bin Cui, and Wentao Zhang. 2024. KeyVideoLLM: Towards Large-scale Video Keyframe Selection. arXiv preprint arXiv:2407.03104 (2024).
Lin et al. (2023b) Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. 2023b. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122 (2023).
Lin et al. (2023a) Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. 2023a. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533 (2023).
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
Lipping et al. (2022) Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. 2022. Clotho-aqa: A crowdsourced dataset for audio question answering. In 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, 1140–1144.
Liu et al. (2023b) Fangyu Liu, Guy Emerson, and Nigel Collier. 2023b. Visual spatial reasoning. Transactions of the Association for Computational Linguistics 11 (2023), 635–651.
Liu et al. (2023d) Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. 2023d. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. arXiv preprint arXiv:2311.10774 (2023).
Liu et al. (2023c) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023c. Improved Baselines with Visual Instruction Tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
Liu et al. (2024b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024b. Visual instruction tuning. Advances in neural information processing systems 36 (2024).
Liu et al. (2024c) Liangxin Liu, Xuebo Liu, Derek F Wong, Dongfang Li, Ziyi Wang, Baotian Hu, and Min Zhang. 2024c. SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection. arXiv preprint arXiv:2402.16705 (2024).
Liu et al. (2024e) Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. 2024e. RegMix: Data Mixture as Regression for Language Model Pre-training. arXiv preprint arXiv:2407.01492 (2024).
Liu et al. (2023e) Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. 2023e. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. In The Twelfth International Conference on Learning Representations.
Liu et al. (2024a) Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2024a. Datasets for Large Language Models: A Comprehensive survey. arXiv:2402.18041 [cs.CL]
Liu et al. (2023a) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023a. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023).
Liu et al. (2024d) Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. 2024d. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473 (2024).
Lo et al. (2020) Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4969–4983.
Lu et al. (2023b) Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, and Yanghua Xiao. 2023b. Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark. arXiv preprint arXiv:2302.09432 (2023).
Lu et al. (2023c) Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023c. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. In The Twelfth International Conference on Learning Representations.
Lu et al. (2023a) Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023a. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023).
Lu et al. (2022) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35 (2022), 2507–2521.
Luccioni and Viviano (2021) Alexandra Luccioni and Joseph Viviano. 2021. What’s in the box? an analysis of undesirable content in the Common Crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 182–189.
Luo et al. (2024) Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. 2024. Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models. arXiv preprint arXiv:2403.03003 (2024).
Luo et al. (2023) Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. 2023. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023).
Lyu et al. (2023) Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. 2023. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093 (2023).
Maaz et al. (2023) Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 2023. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023).
Mahmoud et al. (2024) Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, and Ari Morcos. 2024. Sieve: Multimodal Dataset Pruning Using Image Captioning Models. arXiv:2310.02110 [cs.CV]
Manber and Myers (1993) Udi Manber and Gene Myers. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.
Mangalam et al. (2024) Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2024. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36 (2024).
Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 11–20.
Marino et al. (2019) Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3195–3204.
Marion et al. (2023) Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. 2023. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564 (2023).
Masry et al. (2022) Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244 (2022).
Mathew et al. (2021) Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2200–2209.
McKinzie et al. (2024) Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. 2024. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. arXiv preprint arXiv:2403.09611 (2024).
Mei et al. (2023) Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2023. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395 (2023).
meta llama (2024) meta llama. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/ Accessed: 2024-05-02.
Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. arXiv:1906.03327 [cs.CV]
Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. Interspeech 2010 (2010).
Min and Wang (2023) Zeping Min and Jinbo Wang. 2023. Exploring the integration of large language models into automatic speech recognition systems: An empirical study. In International Conference on Neural Information Processing. Springer, 69–84.
Minwoo Byeon (2022) Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. 2022. COYO-700M: Image-Text Pair Dataset. https://github.com/kakaobrain/coyo-dataset.
Miranda et al. (2022) Brando Miranda, Patrick Yu, Yu-Xiong Wang, and Oluwasanmi O Koyejo. 2022. The Curse of Low Task Diversity: On the Failure of Transfer Learning to Outperform MAML and Their Empirical Equivalence. In Sixth Workshop on Meta-Learning at the Conference on Neural Information Processing Systems.
Mishra et al. (2019) Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. 2019. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR). IEEE, 947–952.
Naveed et al. (2023) Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435 (2023).
Nguyen et al. (2023) Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. 2023. Improving Multimodal Datasets with Image Captioning. arXiv preprint arXiv:2307.10350 (2023). https://api.semanticscholar.org/CorpusID:259991316
Nijkamp et al. (2022) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
Ok (2023) Hyunjong Ok. 2023. FinTree: Financial Dataset Pretrain Transformer Encoder for Relation Extraction. arXiv preprint arXiv:2307.13900 (2023).
Okamoto et al. (2023) Yamato Okamoto, Haruto Toyonaga, Yoshihisa Ijiri, and Hirokatsu Kataoka. 2023. Constructing Image-Text Pair Dataset from Books. arXiv:2310.01936 [cs.CV]
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
Overbay et al. (2023) Keighley Overbay, Jaewoo Ahn, Joonsuk Park, Gunhee Kim, et al. 2023. mRedditSum: A Multimodal Abstractive Summarization Dataset of Reddit Threads with Images. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Panagopoulou et al. (2023) Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. 2023. X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning. arXiv preprint arXiv:2311.18799 (2023).
Park et al. (2023) Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Mądry. 2023. TRAK: attributing model behavior at scale. In Proceedings of the 40th International Conference on Machine Learning. 27074–27113.
Parkar et al. (2024) Ritik Sachin Parkar, Jaehyung Kim, Jong Inn Park, and Dongyeop Kang. 2024. SelectLLM: Can LLMs Select Important Instructions to Annotate? arXiv preprint arXiv:2401.16553 (2024).
Pasarkar and Dieng (2023) Amey Pasarkar and Adji Bousso Dieng. 2023. Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning. arXiv preprint arXiv:2310.12952 (2023).
Paul et al. (2021) Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems 34 (2021), 20596–20607.
Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023).
Piczak (2015) Karol J Piczak. 2015. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia. 1015–1018.
Pillutla et al. (2023) Krishna Pillutla, Lang Liu, John Thickstun, Sean Welleck, Swabha Swayamdipta, Rowan Zellers, Sewoong Oh, Yejin Choi, and Zaid Harchaoui. 2023. Mauve scores for generative models: Theory and practice. Journal of Machine Learning Research 24, 356 (2023), 1–92.
Polyzotis and Zaharia (2021) Neoklis Polyzotis and Matei Zaharia. 2021. What can data-centric ai learn from data and ml engineering? arXiv preprint arXiv:2112.06439 (2021).
Que et al. (2024) Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, et al. 2024. D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models. arXiv preprint arXiv:2406.01375 (2024).
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
Radford et al. ([n. d.]) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. [n. d.]. Improving Language Understanding by Generative Pre-Training. ([n. d.]).
Rae et al. (2019) Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. 2019. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507 (2019).
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
Rasheed et al. (2023) Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. 2023. GLaMM: Pixel Grounding Large Multimodal Model. arXiv:2311.03356 [cs.CV]
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.
Rohrbach et al. (2018) Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156 (2018).
Rohrbach et al. (2017) Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision 123 (2017), 94–120.
Rosenfeld (2000) Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 88, 8 (2000), 1270–1278.
Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (2015), 211–252.
Sajjadi et al. (2018) Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. Advances in neural information processing systems 31 (2018).
Saxton et al. (2018) David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2018. Analysing Mathematical Reasoning Abilities of Neural Models. In International Conference on Learning Representations.
Schaeffer et al. (2024) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2024. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems 36 (2024).
Schamoni et al. (2018) Shigehiko Schamoni, Julian Hitschler, and Stefan Riezler. 2018. A dataset and reranking method for multimodal MT of user-generated image captions. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track). 140–153.
Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35 (2022), 25278–25294.
Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
Sener and Savarese (2017) Ozan Sener and Silvio Savarese. 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489 (2017).
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565.
Shen et al. (2019) Tianxiao Shen, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. In International conference on machine learning. PMLR, 5719–5728.
Shu et al. (2023) Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. 2023. Audio-Visual LLM for Video Understanding. arXiv preprint arXiv:2312.06720 (2023).
Sidorov et al. (2020) Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. 2020. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 742–758.
Silcock et al. (2022) Emily Silcock, Luca D’Amico-Wong, Jinglin Yang, and Melissa Dell. 2022. Noise-Robust De-Duplication at Scale. In The Eleventh International Conference on Learning Representations.
Singh et al. (2019) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8317–8326.
Soboleva et al. (2023) Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama. https://huggingface.co/datasets/cerebras/SlimPajama-627B
Soldaini et al. (2024) Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. 2024. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint arXiv:2402.00159 (2024).
Soldaini and Lo (2023) Luca Soldaini and Kyle Lo. 2023. peS2o (Pretraining Efficiently on S2ORC) Dataset. Technical Report. Allen Institute for AI. ODC-By, https://github.com/allenai/pes2o.
Song et al. (2023) Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. 2023. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449 (2023).
Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems 33 (2020), 16857–16867.
Sorscher et al. (2022) Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. 2022. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems 35 (2022), 19523–19536.
Srinivasan et al. (2021) Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2443–2449.
Suárez et al. (2019) Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.
Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep coral: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer, 443–450.
Sun et al. (2023c) Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2023c. Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models. arXiv preprint arXiv:2310.05863 (2023).
Sun et al. (2024) Linzhuang Sun, Hao Liang, Jingxuan Wei, Linkun Sun, Bihui Yu, Bin Cui, and Wentao Zhang. 2024. Efficient-Empathy: Towards Efficient and Effective Selection of Empathy Data. arXiv preprint arXiv:2407.01937 (2024).
Sun et al. (2023b) Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. 2023b. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023).
Sun et al. (2023a) Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023a. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525 (2023).
Tang et al. (2023) Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. 2023. Video understanding with large language models: A survey. arXiv preprint arXiv:2312.17432 (2023).
Tirumala et al. (2023) Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S Morcos. 2023. D4: Improving llm pretraining via document de-duplication and diversification. arXiv preprint arXiv:2308.12284 (2023).
Tong et al. (2024) Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. 2024. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv preprint arXiv:2406.16860 (2024).
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Veit et al. (2016) Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie. 2016. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016).
Venugopalan et al. (2015) Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1494–1504.
Wang and Shang (2014) Dan Wang and Yi Shang. 2014. A new active labeling method for deep learning. In 2014 International joint conference on neural networks (IJCNN). IEEE, 112–119.
Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022).
Wang et al. (2024b) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. 2024b. CogVLM: Visual Expert for Pretrained Language Models. arXiv:2311.03079 [cs.CV]
Wang et al. (2024c) Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, and Heng Wang. 2024c. Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters. arXiv preprint arXiv:2403.02677 (2024).
Wang et al. (2019) Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4581–4591.
Wang et al. (2024a) Yiping Wang, Yifang Chen, Wendan Yan, Kevin Jamieson, and Simon Shaolei Du. 2024a. Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning. arXiv:2402.02055 [cs.LG]
Wang et al. (2023a) Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. 2023a. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. In The Twelfth International Conference on Learning Representations.
Wang et al. (2023b) Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023b. Data management for large language models: A survey. arXiv preprint arXiv:2312.01700 (2023).
Wei et al. (2023) Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2023. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109 (2023).
Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
Wei et al. (2020) Shengyun Wei, Shun Zou, Feifan Liao, et al. 2020. A comparison on data augmentation methods based on deep learning for audio classification. In Journal of physics: Conference series, Vol. 1453. IOP Publishing, 012085.
Welbl et al. (2018) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6 (2018), 287–302.
Wenzek et al. (2020) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Édouard Grave. 2020. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 4003–4012.
Wettig et al. (2024) Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for Training Language Models. In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models.
Whang et al. (2023) Steven Euijong Whang, Yuji Roh, Hwanjun Song, and Jae-Gil Lee. 2023. Data collection and quality challenges in deep learning: A data-centric ai perspective. The VLDB Journal 32, 4 (2023), 791–813.
Wu et al. (2023b) Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu. 2023b. Multimodal large language models: A survey. arXiv preprint arXiv:2311.13165 (2023).
Wu et al. (2017) Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, et al. 2017. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475 (2017).
Wu et al. (2023a) Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023a. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519 (2023).
Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1912–1920.
Xia et al. (2024) Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333 (2024).
Xia et al. (2022) Xiaobo Xia, Jiale Liu, Jun Yu, Xu Shen, Bo Han, and Tongliang Liu. 2022. Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In The Eleventh International Conference on Learning Representations.
Xie et al. (2024a) Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. 2024a. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems 36 (2024).
Xie et al. (2023) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2023. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
Xie et al. (2024b) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2024b. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems 36 (2024).
Xu et al. (2017) Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia. 1645–1653.
Xu et al. (2021) Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, and Luke Zettlemoyer. 2021. VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4227–4239.
Xu et al. (2023a) Hu Xu, Saining Xie, Po-Yao (Bernie) Huang, Licheng Yu, Russ Howes, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. 2023a. CiT: Curation in Training for Effective Vision-Language Data. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), 15134–15143. https://api.semanticscholar.org/CorpusID:255440514
Xu et al. (2023c) Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, et al. 2023c. Youku-mplug: A 10 million large-scale chinese video-language dataset for pre-training and benchmarks. arXiv preprint arXiv:2306.04362 (2023).
Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5288–5296.
Xu et al. (2020) Liang Xu, Xuanwei Zhang, and Qianqian Dong. 2020. CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. arXiv preprint arXiv:2003.01355 (2020).
Xu et al. (2023d) Mingle Xu, Sook Yoon, Alvaro Fuentes, and Dong Sun Park. 2023d. A comprehensive survey of image augmentation techniques for deep learning. Pattern Recognition 137 (2023), 109347.
Xu et al. (2024) Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. 2024. LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images. arXiv preprint arXiv:2403.11703 (2024).
Xu et al. (2023b) Yang Xu, Yongqiang Yao, Yufan Huang, Mengnan Qi, Maoquan Wang, Bin Gu, and Neel Sundaresan. 2023b. Rethinking the Instruction Quality: LIFT is What You Need. arXiv:2312.11508 [cs.CL]
Xue et al. (2022) Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2022. ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding. arXiv preprint arXiv:2212.05171 (2022).
Xue et al. (2023) Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2023. ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding. arXiv:2305.08275 [cs.CV]
Yang et al. (2021) Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2021. Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF international conference on computer vision. 1686–1697.
Yang et al. (2022) Kaiyu Yang, Jacqueline H Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. 2022. A study of face obfuscation in imagenet. In International Conference on Machine Learning. PMLR, 25313–25330.
Yao et al. (2023) Barry Menglong Yao, Aditya Shah, Lichao Sun, Jin-Hee Cho, and Lifu Huang. 2023. End-to-end multimodal fact-checking and explanation generation: A challenging dataset and models. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2733–2743.
Ye et al. (2023a) Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. 2023a. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. arXiv preprint arXiv:2310.05126 (2023).
Ye et al. (2024) Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. 2024. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. arXiv preprint arXiv:2403.16952 (2024).
Ye et al. (2023b) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023b. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023).
YEH et al. (2023) SHIH-YING YEH, Yu-Guan Hsieh, Zhidong Gao, Bernard BW Yang, Giyeong Oh, and Yanmin Gong. 2023. Navigating text-to-image customization: From lycoris fine-tuning to model evaluation. In The Twelfth International Conference on Learning Representations.
Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.
Yu et al. (2023) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023).
Yu et al. (2019) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9127–9134.
Yuan et al. (2021) Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open 2 (2021), 65–68.
Zauner (2010) Christoph Zauner. 2010. Implementation and benchmarking of perceptual image hash functions. (2010).
Zeng et al. (2023) Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S Yu. 2023. Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226 (2023).
Zha et al. (2023) Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 2023. Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158 (2023).
Zhang et al. (2024b) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024b. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method. arXiv preprint arXiv:2402.17193 (2024).
Zhang et al. (2024c) Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. 2024c. MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv preprint arXiv:2401.13601 (2024).
Zhang et al. (2023) Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023).
Zhang et al. (2024a) Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024a. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
Zhao et al. (2023b) Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, et al. 2023b. Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning. arXiv preprint arXiv:2307.09474 (2023).
Zhao et al. (2021) Qian Zhao, Xiaorong Gao, Jinlong Li, and Lin Luo. 2021. Optimization algorithm for point cloud quality enhancement based on statistical filtering. Journal of Sensors 2021 (2021), 1–10.
Zhao et al. (2023c) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023c. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
Zhao et al. (2023a) Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. 2023a. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581 (2023).
Zheng et al. (2020) Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. 2020. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Proceedings of The European Conference on Computer Vision (ECCV).
Zheng et al. (2023) Kaizhi Zheng, Xuehai He, and Xin Eric Wang. 2023. Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239 (2023).
Zheng et al. (2024) Miao Zheng, Hao Liang, Fan Yang, Haoze Sun, Tianpeng Li, Lingchu Xiong, Yan Zhang, Yozhen Wu, Kun Li, Yanjun Sheng, et al. 2024. PAS: Data-Efficient Plug-and-Play Prompt Augmentation System. arXiv preprint arXiv:2407.06027 (2024).
Zhou et al. (2024) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024).
Zhou et al. (2017) Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2017. Towards Automatic Learning of Procedures from Web Instructional Videos. arXiv:1703.09788 [cs.CV]
Zhou et al. (2023) Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, and Shengping Liu. 2023. Oasis: Data curation and assessment system for pretraining of large language models. arXiv preprint arXiv:2311.12537 (2023).
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592 [cs.CV]
Zhu et al. (2024) Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. 2024. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems 36 (2024).
Zhu et al. (2016) Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4995–5004.
Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision. 19–27.

9. Appendix

9.1. Evaluation datasets for MLLMs

9.1.1. Captioning Tasks

Captioning datasets act as benchmarks to evaluate a model’s fundamental ability to understand information from different modalities. They can be used to assess the zero-shot or few-shot learning capabilities of MLLMs in multimodality tasks, offering insights into their ability to generalize and adapt to new scenarios with limited training examples.

For image captioning tasks, numerous datasets are available to assess a model’s capability in image understanding. MS-COCO (Chen et al., 2015) is one of the most frequently used datasets for training and evaluating image captioning. Flickr30K (Young et al., 2014) is also extensively utilized to evaluate models’ capabilities across Zero/Few/Full-shot scenarios. The Karpathy split for MS-COCO and Flickr30K (Karpathy and Fei-Fei, 2015) is commonly adopted to divide training and testing data in the training and evaluation of MLLMs. Nocaps (Agrawal et al., 2019) was developed to overcome the limitations of MS-COCO by focusing on objects not present in MS-COCO captions, serving as a valuable complement to the MS-COCO dataset.
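As a rough illustration of how a fixed split such as the Karpathy split is applied in practice, the sketch below partitions caption records by a per-image split label. The record layout (`filename`, `split`, `captions`), the label names, and the helper `partition_by_split` are assumptions made for this example; they are not a specification of any released annotation file.

```python
from collections import defaultdict


def partition_by_split(records):
    """Group caption records by their split label (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        groups[record["split"]].append(record)
    return dict(groups)


# Toy records for illustration only; real annotation files may use
# different field names and additional split labels.
records = [
    {"filename": "img_0001.jpg", "split": "train", "captions": ["a dog runs"]},
    {"filename": "img_0002.jpg", "split": "val", "captions": ["a cat sleeps"]},
    {"filename": "img_0003.jpg", "split": "test", "captions": ["a bird flies"]},
    {"filename": "img_0004.jpg", "split": "train", "captions": ["two people talk"]},
]

splits = partition_by_split(records)
split_sizes = {name: len(items) for name, items in splits.items()}
```

Keeping the split label attached to each image, rather than splitting at random at load time, is what lets different models be compared on the same held-out images.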

For video captioning, the MSVD dataset (Chen and Dolan, 2011) is a popular choice for evaluating action recognition and video description tasks (Venugopalan et al., 2015), comprising 2,089 video segments and 85,550 English descriptions, with a standard split provided by Venugopalan et al. (2015). The MSRVTT dataset (Xu et al., 2016) includes 10,000 web video clips totaling 41.2 hours, each annotated with approximately 20 natural sentences. The DiDeMo dataset (Anne Hendricks et al., 2017) contains over 10,000 unedited personal videos, each associated with 3-5 pairs of descriptions pinpointing specific moments. The VATEX dataset (Wang et al., 2019) includes more than 41,250 videos paired with 825,000 captions in English and Chinese, featuring over 206,000 English-Chinese parallel translation pairs. Additionally, the TVC dataset (Lei et al., 2020) includes both validation and private test components for further evaluation. These datasets are instrumental in evaluating both the video captioning and the video information retrieval capabilities.

For evaluating audio captioning ability, the two most commonly used datasets are AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020). The AudioCaps dataset features 46,000 pairs of audio clips and human-written text descriptions. The Clotho dataset includes 4,981 audio samples, each lasting between 15 and 30 seconds, accompanied by 24,905 captions that range from 8 to 20 words in length. Both datasets are widely used to assess a model’s audio understanding capabilities.
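The two Clotho counts quoted above are consistent with a fixed number of captions per clip. The ratio below is derived here from the stated counts, not quoted from the dataset paper:

```latex
\frac{24{,}905\ \text{captions}}{4{,}981\ \text{audio samples}} = 5\ \text{captions per audio sample}
```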

9.1.2. Question Answering Tasks

Evaluation datasets for question answering (QA) are crucial for assessing the performance of MLLMs that handle diverse data types like text, images, and audio. These datasets require models to analyze and combine information from multiple sources to respond to relevant questions accurately.

Visual Question Answering (VQA) evaluation datasets test an MLLM’s ability to integrate and interpret both text and image data. These datasets contain images paired with questions that require visual understanding to answer. Among the commonly used VQA evaluation datasets, VQAv2 (Goyal et al., 2019) contains 13 million answers associated with approximately 200,000 images from the COCO dataset. It also contains multi-image question-answer pairs that can test models’ multi-image understanding abilities. The GQA dataset (Hudson and Manning, 2019) is designed to address deficiencies in previous VQA datasets by fostering advanced visual reasoning and reducing biases. It comprises over 22 million questions generated from 113K real-world images annotated with detailed scene graphs that describe objects, attributes, and relationships. OKVQA (Marino et al., 2019) was also created to address the limitations of existing VQA benchmarks by focusing on knowledge-based visual question answering. This dataset consists of more than 14,000 questions that require external knowledge to answer, covering various categories such as science & technology, history, and sports. Vizwiz (Gurari et al., 2018) focuses on collecting visual questions from blind users, presenting unique challenges such as image quality and conversational question styles. POPE (Li et al., 2023b) focuses on evaluating object hallucination in vision-language models, using 6,136 binary questions to test whether models hallucinate non-existent objects in images from MS-COCO and other sources.
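Binary probes of the kind POPE uses can be scored with standard classification metrics. The following is a minimal sketch, assuming answers have already been normalized to the strings "yes" and "no" and that "yes" is treated as the positive class; the function name `score_binary_probe` and the toy data are hypothetical, and this is not the official evaluation script of any benchmark.

```python
def score_binary_probe(predictions, labels):
    """Score yes/no answers, treating "yes" as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one question is required")

    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and t == "yes" for p, t in pairs)
    # A "yes" for an absent object is the hallucination case.
    fp = sum(p == "yes" and t == "no" for p, t in pairs)
    fn = sum(p == "no" and t == "yes" for p, t in pairs)
    tn = sum(p == "no" and t == "no" for p, t in pairs)

    total = len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of questions answered "yes"; a high value hints at a
        # bias toward affirming that objects are present.
        "yes_ratio": (tp + fp) / total,
    }


# Toy example: six questions, ground truth says whether the object exists.
predictions = ["yes", "yes", "no", "no", "yes", "no"]
labels = ["yes", "no", "no", "yes", "yes", "no"]
scores = score_binary_probe(predictions, labels)
```

Reporting the share of "yes" answers alongside accuracy helps distinguish a model that genuinely recognizes objects from one that simply tends to agree with the question.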

Video question-answering evaluation datasets are developed to assess the capabilities of models in video-to-text translation and overall video comprehension. Key datasets include MSVD-QA (Xu et al., 2017) and MSRVTT-QA (Xu et al., 2017), derived from the MSVD (Chen and Dolan, 2011) and MSRVTT (Xu et al., 2016) video captioning datasets, respectively, which are frequently utilized for video QA tasks. For more intricate web videos, the ActivityNet-QA (Yu et al., 2019) dataset, containing 58,000 QA pairs from 5,800 complex videos, tests models’ abilities to understand complex video content. The TGIF-QA (Jang et al., 2017) dataset, with 165,165 QA pairs from 71,741 animated GIFs, evaluates spatio-temporal reasoning and visual question-answering skills in videos. LSMDC (Rohrbach et al., 2017), which includes 118,114 sentences aligned with clips from 202 movies, is used for assessing video description generation and movie QA abilities. Additionally, MoVQA (Rohrbach et al., 2017), featuring 21,953 manually annotated QA pairs from 100 diverse movies, is aimed at evaluating the understanding of long-form videos over various temporal durations, focusing on complex and extended video content comprehension.

Audio question-answering evaluation datasets are designed to test audio-to-text translation and the ability to understand audio. The ClothoAQA (Lipping et al., 2022) dataset, derived from the Clotho dataset, includes 1,991 audio files, each lasting between 15 and 30 seconds, with six different questions collected per audio file. The MUSIC-AVQA (Li et al., 2022b) dataset contains more than 45,000 question-answer pairs extracted from more than 9,000 videos and more than 150 hours of content. It uses 33 question templates across 9 types of questions to support spatio-temporal reasoning in audio-visual scenarios.

9.1.3. Perception and Reasoning Tasks

Perception and reasoning tasks are crucial tests of the general abilities of multimodal large language models (MLLMs). TextVQA (Singh et al., 2019) consists of 28,408 images with questions that require reading and reasoning about the text in the image. ChartQA (Masry et al., 2022) offers various questions based on real-world charts, testing visual and logical reasoning. AI2D (Kembhavi et al., 2016) includes over 5,000 grade school science diagrams with extensive annotations to evaluate diagram interpretation skills. ScienceQA (Lu et al., 2022) presents around 21,000 multimodal questions from science curricula, emphasizing multi-hop reasoning. MathVista (Lu et al., 2023a) comprises 6,141 image-text pairs to benchmark mathematical reasoning in visual contexts. MMVet (Yu et al., 2023) includes 200 images with questions that test large multimodal models in six capabilities. MMBench (Liu et al., 2023a) assesses 20 distinct abilities with 3,000 questions, ranging from object localization to social reasoning. LLaVAW (Liu et al., 2024b) features 24 images with diverse visual content and complex reasoning tests. MME (Fu et al., 2024) focuses on 14 subtasks covering perception and cognition, such as text translation and arithmetic. Lastly, MVBench (Li et al., 2023g) contains 20 video tasks that challenge comprehension of spatial and temporal dynamics, designed to evaluate advanced understanding skills.

There are several datasets designed to evaluate MLLMs’ spatial reasoning abilities. VSR (Liu et al., 2023b) comprises over 10,000 natural text-image pairs with 66 types of spatial relations, making it an excellent resource for evaluating spatial reasoning. The datasets RefCOCO, RefCOCO+, and RefCOCOg (Kazemzadeh et al., 2014; Mao et al., 2016) are crucial to assess how well models understand and ground natural language expressions within visual contexts, featuring images paired with expressions that describe specific objects or areas. These datasets, which vary in expression collection and annotations, challenge models to integrate vision and language by capturing detailed spatial relationships and contextual nuances. Additionally, the GRIT (Gupta et al., 2022) benchmark addresses seven vision tasks using multiple data sources to enhance referring expression grounding. DocVQA (Mathew et al., 2021) focuses on visual question answering within document images, utilizing a web-based tool to generate its 50,000 questions based on over 12,000 document images. OCR-VQA (Mishra et al., 2019) contains 207,572 images of book covers, with more than a million question-answer pairs, testing models’ ability to understand text in document images and perform visual question-answering. Lastly, SEED-Bench (Li et al., 2023h) evaluates spatial and temporal comprehension in multimodal contexts, serving as a platform to test generative comprehension capabilities.

For video reasoning, there exist numerous evaluation datasets. EgoSchema (Mangalam et al., 2024) features over 5,000 very long-form video language understanding questions derived from 250 hours of diverse egocentric video data. VideoChatGPT (Maaz et al., 2023), which is used to tackle challenges in video-based conversation models, addresses aspects such as temporal understanding, spatial consistency, and contextual comprehension. Both EgoSchema (Mangalam et al., 2024) and VideoChatGPT (Maaz et al., 2023) offer the means to assess models on longer and more intricate video content. Charades-STA (Gao et al., 2017) comprises approximately 10,000 videos, each annotated with temporal activity details across 157 activity categories and multiple video-level descriptions. Additionally, QVHighlights (Lei et al., 2021) contains over 10,000 YouTube videos covering various topics such as everyday activities, travel, and social and political events, facilitating the evaluation of MLLMs’ abilities in detecting moments and generating highlights.

9.1.4. Other Noteworthy Tasks

There are also several noteworthy tasks not mentioned above. One traditional task is classification. InstructBLIP uses HatefulMemes (Kiela et al., 2020), a binary hateful-content classification dataset, for meme classification. ImageNet-1K (Deng et al., 2009) is also widely used to test models’ image classification ability. X-InstructBLIP extends classification to other modalities, using ESC50 (Piczak, 2015) for audio classification and ModelNet40 (Wu et al., 2015) for 3D classification. Furthermore, M³IT (Li et al., 2023k) uses Something-Something (Goyal et al., 2017) for video action classification. The multimodal dialogue task differs from question answering because it requires multiple rounds of conversation. TouchStone (Bai et al., 2023b) explores multimodal dialogue through detailed image annotations, addressing the challenge of evaluating LLMs in open-ended dialogues.

9.2. Supervised fine-tuning datasets

Table 1. A detailed list of datasets and processing methods for different models with image modality during the fine-tuning stage. * indicates that the dataset is newly generated using a certain method within the respective model, while the other datasets (without *) serve as the original data sources for that model. - denotes directly using the original data sources without additional processing. #Examples denotes the statistics for each dataset.

Models	Time	Finetuning Datasets	Processing Method	#Examples
Flamingo(Alayrac et al., 2022)	2022.04	VQAv2	-	#images: 265K #questions: 1.4M
		VizWiz		#image-question pairs: 33.8K
		TextVQA		#images: 28.4K #questions: 45.3K
BLIP-2(Li et al., 2023e)	2023.01	COCO	-	#images: 330K #captions: 5 per image
		NoCaps		#images: 15.1K #captions: 11 per image
		VQAv2		#images: 265K #questions: 1.4M
		Flickr30K		#images: 31.7K #captions: 5 per image
LLaVA(Liu et al., 2024b)	2023.04	COCO	GPT-4, Prompt, Manual examples	#images: 330K #captions: 5 per image
LLaVA(Liu et al., 2024b)	2023.04	LLaVA-Instruct-150K*	GPT-4, Prompt, Manual examples	#image-text pairs: 158K
MiniGPT-4(Zhu et al., 2023)	2023.04	CC12M	Manual verifying and refining, ChatGPT, Random selecting	#image-text pairs: 12M
MiniGPT-4(Zhu et al., 2023)	2023.04	cc_sbu_align*	Manual verifying and refining, ChatGPT, Random selecting	#image-text pairs: 3.5K
mPLUG-Owl(Ye et al., 2023b)	2023.04	LLaVA-Instruct-150K	-	#image-text pairs: 158K
InstructBLIP(Dai et al., 2024)	2023.05	ScienceQA	-	#image-question pairs: 10.3K
		OCR-VQA		#images: 207K #image-question pairs: 1M
		OKVQA		#images: 14K #questions: 14K
		A-OKVQA		#images: 23.6K #questions: 24.9K
PaLI-X(Chen et al., 2023a)	2023.05	COCO	-	#images: 330K #captions: 5 per image
		NoCaps		#images: 15.1K #captions: 11 per image
		TextCaps		#images: 28.4K #captions: 5 per image
		VizWizCap		#images: 39.1K #captions: 5 per image
		Screen2Words		#images: 22.4K #captions: 5 per image
		WidgetCap		#images: 21.7K #captions: 162K #widgets: 61.2K
Shikra(Chen et al., 2023j)	2023.06	LLaVA-Instruct-150K	GPT-4, Prompt, Sampling ratio: 0.5	#image-text pairs: 158K
Shikra(Chen et al., 2023j)	2023.06	Flickr30K	GPT-4, Prompt, Sampling ratio: 0.5	#images: 31.7K #captions: 5 per image
		Shikra-RD*		#question-answer pairs: 5.9K
DLP(Jian et al., 2024)	2023.07	COCO	-	#images: 330K #captions: 5 per image
ChatSpot(Zhao et al., 2023b)	2023.07	Visual Genome	Unification, GPT-4, Prompt	#images: 108K #region descriptions: 5.4M #image-question pairs: 1.7M
MiniGPT-5(Zheng et al., 2023)	2023.10	VIST	Flexible framework for various tasks, Placeholders for images -> Text prompts	#images: 210K #stories: 50K
MiniGPT-5(Zheng et al., 2023)	2023.10	MMDialog		#images: 1.53M #texts: 1.08M #turns: 4.92M
LLaVA-1.5(Liu et al., 2023c)	2023.10	RefCOCO	Merging and concatenating, Splitting, Sampling	#images: 20K #captions: 142K
		GQA		#images: 113K #questions: 22M
		OCR-VQA		#images: 207K #image-question pairs: 1M
		TextCaps		#images: 28.4K #captions: 5 per image
		Visual Genome		#images: 108K #region descriptions: 5.4M #image-question pairs: 1.7M
		llava_v1_5_mix665k*		#image-text pairs: 665K
MiniGPT-v2(Chen et al., 2023m)	2023.10	LLaVA-Instruct-150K	Object parsing and grounding, Selecting captions	#image-text pairs: 158K
MiniGPT-v2(Chen et al., 2023m)	2023.10	Flickr30K	Object parsing and grounding, Selecting captions	#images: 31.7K #captions: 5 per image
CogVLM(Wang et al., 2024b)	2023.11	LLaVA-Instruct-150K	Translation into Chinese, Correcting and retranslation	#image-text pairs: 158K
		cc_sbu_align		#image-text pairs: 3.5K
		CogVLM-SFT-311K*		#images: 155K #image-text pairs: 311K
DRESS(Chen et al., 2023i)	2023.11	LLaVA-Instruct-150K	Partition the multi-turn into separate turns, LLM filtering, LLM-Human-in-the-Loop process	#image-text pairs: 158K
		COCO		#images: 330K #captions: 5 per image
		VLSafe*		#image-text pairs: 5.8K
VILA(Lin et al., 2023a)	2023.12	llava_v1_5_mix665k	-	#image-text pairs: 665K
ShareGPT4V(Chen et al., 2023e)	2023.11	COCO	GPT-4, Data-specific prompt, Data replacement	#images: 330K #captions: 5 per image
		LAION2B-en		#image-text pairs: 2.32B
		CC3M		#image-text pairs: 3.3M
		SBU Captions		#image-text pairs: 1M
		SAM		#images: 11M #segmentation masks: 1.1B
		TextCaps		#images: 28.4K #captions: 5 per image
		WikiArt		#images: 81.4K #artist classes: 129 #genre classes: 11 #style classes: 27
		llava_v1_5_mix665k		#image-text pairs: 665K
		ShareGPT4V-1.2M*		#image-text pairs: 1.2M
GLaMM(Rasheed et al., 2023)	2023.11	RefCOCO	Re-purposing of data sources, Manually annotation	#images: 20K #captions: 142K
		RefCOCO+		#images: 20K #captions: 141K
		RefCOCOg		#images: 25.8K #captions: 95K
		Visual Genome		#images: 108K #region descriptions: 5.4M #image-question pairs: 1.7M
		LLaVA-Instruct-150K		#image-text pairs: 158K
		GranD-f*		#image-text pairs: 214K
PVIT(Chen et al., 2023h)	2023.08	GQA	ChatGPT, Prompt, Multi-turn data	#images: 113K #questions: 22M
		VCR		#images: 110K #questions: 290K
		COCO		#images: 330K #captions: 5 per image
		Visual Genome		#images: 108K #region descriptions: 5.4M #image-question pairs: 1.7M
		COCO-Text		#images: 63.6K #labeled text regions: 173K
		PVIT*		#image-text pairs: 13.7M (Stage 1) #image-question pairs: 10.4K (Stage 2)
TextMonkey(Liu et al., 2024d)	2024.03	COCO-Text	Structured data (documents, tables, charts), 5% of pre-train data	#images: 63.6K #labeled text regions: 173K
		TextOCR		#images: 28.1K #texts: 903K
		HierText		#images: 11.6K #texts: 1.2M
		TextVQA		#images: 28.4K #questions: 45.3K
		MLT		#image-text pairs: 10K
		ChartQA		#image-text pairs: 20.8K #tables: 20.8K
		DocVQA		#images: 12K #questions: 50K
		InfoVQA		#images: 5K #questions: 30K
		Monkey_Data*		#image-text pairs: 409.1K

Table 2. A detailed list of datasets and processing methods for different models with video modality during the fine-tuning stage. * indicates that the dataset is newly generated using a certain method within the respective model, while the other datasets (without *) serve as the original data sources for that model. - denotes directly using the original data sources without additional processing. #Examples denotes the statistics for each dataset.

Models	Time	Finetuning Datasets	Processing Method	#Examples
VideoChat(Li et al., 2023c)	2023.05	WebVid	GPT-4, Prompt, Randomly choosing	#video-text pairs: 10M
VideoChat(Li et al., 2023c)	2023.05	VideoChat*	GPT-4, Prompt, Randomly choosing	#video-description pairs: 7K #video-conversation pairs: 4K
Video-LLaMA(Lin et al., 2023b)	2023.01	VideoChat	-	#video-description pairs: 7K #video-conversation pairs: 4K
Video-ChatGPT(Maaz et al., 2023)	2023.01	ActivityNet	Human-assisted annotations, GPT, Semi-automatic annotation	#videos: 20K #texts: 100K
Video-ChatGPT(Maaz et al., 2023)	2023.01	VideoInstruct100K*	Human-assisted annotations, GPT, Semi-automatic annotation	#video-instruction pairs: 100K

Table 3. A detailed list of datasets and processing methods for different models with audio modality during the fine-tuning stage. * indicates that the dataset is newly generated using certain methods within the respective model, while the other datasets (without *) serve as the original data sources for that model. Datasets from other modalities that are used to generate multimodal datasets (including audio) are also shown in this table. - denotes directly using the original data sources without additional processing. #Examples denotes the statistics for each dataset.

Models	Time	Finetuning Datasets	Processing Method	#Examples
BuboGPT(Zhao et al., 2023a)	2023.07	Clotho	GPT-4, Prompt	#audios: 5K #captions: 24K
		VGGSS		#video-audio pairs: 5K
		Clotho-Detail*		#audio-caption pairs: 3.9K #tokens: 207K
		VGGSS-Instruction-Tuning*		#audio-image-caption pairs: 5.1K
X-LLM(Chen et al., 2023b)	2023.05	AISHELL-2	Manually selecting, ChatGPT	#reading-speech audios: 1000 hours
		VSDial-CN		#automatic speech recognition (ASR) samples: 1.2M
		cc_sbu_align (Image-Text)		#image-text pairs: 3.5K
		ActivityNet (Video-Text)		#videos: 20K #texts: 100K
NExT-GPT(Wu et al., 2023a)	2023.09	WebVid (Video-Text)	External resources, GPT-4, Prompt	#video-text pairs: 10M
		AudioCaps		#audio-text pairs: 46K
		CC3M (Image-Text)		#image-caption pairs: 3.3M
		T2M*		#images: 4.9K #videos: 4.9K #audios: 4.9K #instances: 14.7K
		MosIT*		#images: 4K #videos: 4K #audios: 4K #instances: 5K
X-InstructBLIP(Panagopoulou et al., 2023)	2023.11	AudioCaps	Automatic generation, Prompt	#audio-text pairs: 46K
		AudioCapsQA*		#audio QA pairs: 25.4K #unique questions: 10.8K
		DisCRn*		#audio-video pairs: 8.8K
Qwen-Audio(Bai et al., 2023a)	2023.11	Self-constructed	Manual annotation, GPT-3.5, Data Mixing	#audio-text pairs: 20K