LMMs-Eval: Reality Check on the Evaluation of
Large Multimodal Models

Kaichen Zhang^∗,1,2 Bo Li^∗,1,2 Peiyuan Zhang^∗,1,2 Fanyi Pu^∗,1,2
Joshua Adrian Cahyono ^1,2 Kairui Hu ^1,2 Shuai Liu ^1,2
Yuanhan Zhang ^1,2 Jingkang Yang ^1,2 Chunyuan Li¹ Ziwei Liu^1,2,^🖂

¹LMMs-Lab Team ²S-Lab, NTU, Singapore

{zhan0564, libo0013, peiyuan.zhang, fpu001, ziwei.liu}@ntu.edu.sg

Abstract

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMs-Eval, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMs-Eval offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMs-Eval Lite, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LiveBench, which utilizes continuously updating news and online forums to assess models’ generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain a leaderboard of LiveBench at GitHub and LiveBench.

Figure 1: To best navigate the trilemma in LMM evaluation benchmarking, we contribute (1) LMMs-Eval: a unified and standardized multimodal benchmark suite that encompasses over 50 tasks and more than 10 models, ensuring wide coverage; (2) LMMs-Eval Lite: an efficient benchmark set whose results are reliable and aligned with the time-consuming full-set evaluation, addressing low-cost concerns; (3) LiveBench: an evaluation benchmark built from the latest information on news and forum websites, aiming to evaluate models’ zero-shot generalization ability on the most recent events, thereby preventing contamination during evaluations.

^∗Equal contribution. ^🖂Corresponding author.

https://github.com/EvolvingLMMs-Lab/lmms-eval

1 Introduction

Good benchmarks guide AI development. Current large foundation models such as GPT-4 [59], Gemini [69], Claude [2], and many others [71, 60, 57, 14] have demonstrated transformative capabilities, approaching or surpassing human-level performance in many tasks. In this context, benchmarks become both challenging and crucial to differentiate among the models and detect their weaknesses.

In the field of language models, exemplary works such as [38, 68, 19] aimed to comprehensively assess models across a wide range of dimensions. As generative AI evolves from language-centric to multimodal, a unified evaluation framework and a closer look at existing benchmarks are needed.

Transparent, standardized, and reproducible evaluations are crucial. We observe that there is so far no unified evaluation protocol in the field of LMMs. Model publishers [42, 71, 16, 87, 33] come up with custom evaluation pipelines, which often differ significantly in data preparation, output postprocessing, and metrics calculation, hindering transparency and reproducibility. To this end, we build LMMs-Eval, a standardized and reliable benchmark suite to assess multimodal models in their entirety. LMMs-Eval covers over 50 tasks in various scenarios to thoroughly assess more than 10 multimodal models with around 30 variants. It offers a standardized evaluation pipeline to ensure transparency and reproducibility. It also comes with a unified interface to facilitate the integration of new models and datasets.

Wide coverage, low cost, and zero contamination are hard to achieve simultaneously. We believe it is an impossible triangle to evaluate models with wide coverage and low cost without making the benchmarks susceptible to contamination, as shown in Figure 1. For instance, the Hugging Face OpenLLM leaderboard [72] provides an economical way to evaluate language models across a wide range of tasks, but it is also prone to overfitting and contamination. The LMSys Chatbot Arena [13] and AI2 WildVision [50] offer robust and non-contaminated evaluation through real user interactions. However, it is expensive to gather tens of thousands of human preferences. In this work, we do not break this impossible triangle. Instead, we complement the evaluation landscape of LMMs by introducing LMMs-Eval Lite and LiveBench. By covering diverse sets of tasks and pruning unnecessary data instances, LMMs-Eval Lite features a low-cost and wide-coverage LMM evaluation. On the other hand, LiveBench gathers the latest information from news and online forums to construct the test data, targeting an economical and generalizable way to do benchmarks.

In summary, we aim to offer a comprehensive view of the evaluations on multimodal models while presenting our observations and solutions. Our paper makes the following contributions:

(1) LMMs-Eval: a unified multimodal models evaluation suite that covers over 50 tasks and more than 10 models with around 30 sub-variants. With LMMs-Eval, we aim to streamline and standardize the evaluation process of multimodal models to ensure standardized comparisons between models.

(2) LMMs-Eval Lite: an efficient evaluation set that provides reliable and aligned results with the time-consuming full-set evaluation. LMMs-Eval Lite prunes unnecessary data instances to reduce the evaluation cost while maintaining the evaluation quality.

(3) LiveBench: an evaluation benchmark that gathers the latest information from news and forum websites to evaluate models’ zero-shot generalization ability on the most recent events. LiveBench aims to provide a low-cost and generalizable way to evaluate multimodal models.

2 LMMs-Eval: A Unified Multimodal Models Evaluation Suite

Evaluation often takes a significant amount of time in the model development cycle. In Section 2.1 we argue that existing evaluation pipelines for LMMs contain much overhead and are not standardized. By introducing LMMs-Eval, we reduce this overhead and scale up the evaluation. However, as we note in Section 2.2, there is still a trilemma in LMM evaluation that we cannot fully resolve; we can only find a better trade-off.

2.1 Scaling Evaluations with a Standardized Framework

Table 1: An overview of selected results on LMMs-Eval, achieved through a standardized and transparently reproducible pipeline.

Models	Parameters	AI2D	ChartQA	DocVQA	LLaVA^W	Mathvista	MME	MMMU	RealworldQA
LLaVA-1.5-7B	7B	54.8	18.2	28.1	59.6	26.7	1859.0	35.3	55.8
LLaVA-NeXT-Vicuna-7B	7B	66.6	54.8	74.4	72.3	34.4	1841.8	35.1	57.8
LLaVA-NeXT-Mistral-7B	7B	60.8	38.8	72.2	71.7	37.4	1823.4	33.4	59.3
Qwen-VL-Chat	7B	45.9	60.1	66.3	21.2	24.6	1890.8	27.7	1.7
InstructBLIP-Vicuna-7B	7B	33.8	12.5	13.9	55.2	23.4	1508.7	28.4	37.4
LLaVA-NeXT-LLaMA3-8B	8B	71.6	69.5	78.2	80.1	37.5	1971.5	41.7	60.0
Xcomposer4K-HD	8B	78.1	80.6	90.8	74.2	57.3	2189.8	42.6	62.6
Idefics2-8B	8B	69.2	26.4	73.4	43.7	48.0	1792.1	39.7	25.5
LLaVA-1.5-13B	13B	59.5	18.2	30.3	66.1	26.4	1818.3	34.8	54.9
LLaVA-NeXT-Vicuna-13B	13B	70.0	62.2	77.5	72.3	35.1	1891.9	35.9	58.7
InstructBLIP-Vicuna-13B	13B	36.8	12.7	13.6	54.4	25.0	1529.6	33.7	42.4
InternVL-1.5	26B	79.0	83.8	92.4	90.2	61.5	2183.6	43.1	65.0
LLaVA-NeXT-34B	34B	74.9	68.7	84.0	88.8	46.0	2030.4	46.7	62.0
LLaVA-NeXT-72B	72B	77.4	77.0	84.4	89.2	46.6	2158.9	46.4	65.4
LLaVA-NeXT-110B	110B	80.4	79.7	85.7	90.4	49.0	2200.4	49.1	63.1

Reducing the overhead Existing evaluations in LMMs are often done on a model-by-model and dataset-by-dataset basis [42, 71]. Researchers create custom inference scripts for their models across different benchmarks. While manageable for a single model and a few benchmarks, this process becomes highly inefficient when evaluating multiple checkpoints across ten or more datasets. Users need to manually launch each individual script to preprocess the datasets, run model inference, and calculate final scores based on the outputs. Boilerplate code is also abundant. To address this, LMMs-Eval follows the framework design of lm-eval-harness [19] to allow for a one-command evaluation of multiple models and datasets. We preprocess and handle all the data needed during evaluation, ensuring a single data source is used across different models for a standardized evaluation. Furthermore, detailed model outputs and results are logged for future analysis.
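To make the idea of a single entry point concrete, the following is a minimal sketch of a registry-based evaluation loop. It is not the LMMs-Eval API: the names `MODELS`, `TASKS`, `register_model`, and `evaluate` are hypothetical, the toy task is text-only, and exact string match stands in for task-specific metrics.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (prompt, reference answer); images omitted for brevity

# Hypothetical shared registries; not part of any real library.
MODELS: Dict[str, Callable[[str], str]] = {}
TASKS: Dict[str, List[Sample]] = {}


def register_model(name: str):
    """Decorator that adds a model's generate function to the shared registry."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODELS[name] = fn
        return fn
    return wrap


def evaluate(model_names: List[str], task_names: List[str]) -> Dict[Tuple[str, str], dict]:
    """Run every requested model on every requested task from one entry point."""
    results = {}
    for m in model_names:
        for t in task_names:
            logs = []
            for prompt, reference in TASKS[t]:
                output = MODELS[m](prompt)
                logs.append({"prompt": prompt, "output": output,
                             "correct": output.strip() == reference})
            accuracy = sum(entry["correct"] for entry in logs) / len(logs)
            # Keep per-sample logs next to the score for later analysis.
            results[(m, t)] = {"accuracy": accuracy, "samples": logs}
    return results


TASKS["toy_arithmetic"] = [("1+1=", "2"), ("2+2=", "4")]


@register_model("always_two")
def always_two(prompt: str) -> str:
    return "2"


report = evaluate(["always_two"], ["toy_arithmetic"])
print(report[("always_two", "toy_arithmetic")]["accuracy"])  # 0.5
```

Because every model reads the same task data through the same loop, adding a model or a dataset touches only one registry entry rather than a new script per pair.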

Standardized evaluation Custom evaluation scripts also lead to another issue: the scores reported in different places are not directly comparable. For instance, [35] extracts model answers by comparing the output probabilities among the choices. An answer is counted as correct so long as the ground-truth answer has the lowest perplexity among the choices (PPL-based). In contrast, [40] uses generation-based evaluation: an answer is counted as correct only if the model’s generation matches the option letter. To this end, we design a unified framework in LMMs-Eval covering different evaluation setups. We believe there is no best setup, but one needs to fix a single setup when comparing results across different models. For a fair comparison, we also respect the chat template of the models if they are instruction-tuned. For reproducibility and transparency, a detailed log containing the evaluation setup, model generations, and score breakdown is generated automatically. Since we designed a unified interface, new models and datasets can also be quickly added to LMMs-Eval.
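The difference between the two setups can be sketched as follows. This is an illustrative toy under stated assumptions, not the scoring code of [35], [40], or LMMs-Eval: the per-token log-probabilities are made-up inputs, and the regular expression for extracting an option letter is a hypothetical, deliberately simple heuristic.

```python
import math
import re
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_based_correct(choice_logprobs: Dict[str, List[float]], answer: str) -> bool:
    """Correct if the ground-truth choice has the lowest perplexity."""
    best = min(choice_logprobs, key=lambda c: perplexity(choice_logprobs[c]))
    return best == answer


def generation_based_correct(generation: str, answer: str) -> bool:
    """Correct only if the first standalone option letter in the text matches.

    Assumption: options are labeled A-D. A sentence starting with the
    article "A" would be misread as option A by this simple pattern.
    """
    match = re.search(r"\b([A-D])\b", generation)
    return match is not None and match.group(1) == answer


# Made-up log-probabilities for three candidate answers.
logprobs = {"A": [-2.0, -2.0], "B": [-0.5, -0.5], "C": [-3.0]}
print(ppl_based_correct(logprobs, "B"))                        # True
print(generation_based_correct("The answer is B.", "B"))       # True
print(generation_based_correct("It shows a bar chart.", "B"))  # False
```

The last call shows how the same model can be scored differently under the two setups: a model that ranks the right choice first but answers in free-form text passes the PPL-based check and fails the generation-based one.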

Equipped with these two core designs, we successfully scaled up our evaluation to over 10 models and more than 50 datasets. We present partial results in Table 1; the full list of supported models, datasets, and scores can be found in Appendix E and Appendix F. We believe that large-scale evaluations are crucial. They enable a comprehensive comparison across various aspects of model performance, revealing whether a model is a versatile performer or excels only in specific tasks. Additionally, large-scale, reproducible, and standardized evaluations are essential in ablation experiments to enhance our understanding of model architectures and training data.

2.2 The Evaluation Trilemma

Our ultimate goal is to find a wide-coverage, low-cost, and zero-contamination way to evaluate LMMs. However, even with LMMs-Eval, we find this to be hard or even impossible. Specifically, once we scale the evaluation datasets to 50+, it becomes time-consuming to perform a full evaluation run on those datasets. Besides, those benchmarks are also susceptible to contamination during training [79]. As shown in Figure 1, we believe there is a trilemma in model evaluation. One cannot achieve the three goals simultaneously but can only find a trade-off. The LMSys Chatbot Arena [13] and AI2 WildVision [50] are foundational works in stressing wide coverage and anti-contamination. We present our solutions to balance the other two sides of the triangle in Section 3 and Section 4.

3 LMMs-Eval Lite: Affordable Evaluation with Broad Domain Coverage

Figure 2: Evaluation cost demonstration on Full and Lite set.

We estimate the time to evaluate various LLaVA models on all LMMs-Eval datasets in Figure 2. These evaluations were conducted using 8×A100 GPUs with flash attention enabled. We replicate the model weights across GPUs and use data parallelism by default. For models larger than 72B, we use pipeline parallelism [26] to load a single model across different GPUs.

We aim to construct a lite benchmark set that can provide useful and fast signals during the model development. If we can identify a subset of the benchmark where the absolute scores and relative rankings among models remain similar to the full set, we can consider it to be safe to prune the datasets. We thus present LMMs-Eval Lite to complement the full datasets in LMMs-Eval.

Table 2: Overview of datasets in LMMs-Eval Lite. In addition to reducing the size of large evaluation datasets, we also retain the complete versions of certain datasets to ensure comprehensive coverage.

Task Domain	Dataset	Split	Full Size	Lite Size
Doc & Infographic Understanding	ChartQA	test	2500	400
	DocVQA	val	5349	400
	InfoVQA	val	2801	200
Image Understanding & Captioning	Flickr30k	val	31784	400
	NoCaps	val	4500	400
	TextCaps	val	3166	300
	RefCOCO	val	8811	500
Visual Question Answering	TextVQA	val	5000	300
Math & Science	MathVista	testmini	1000	1000
Math & Science	AI2D	test	3088	300
Visual Dialogue	LLaVA-W	test	60	60
Multi-discipline	MME	cog. & percep.	2374	2374
	MMMU	val	900	900
	CMMMU	val	900	900
	Seed-Bench	test	17990	700
-	Total	-	90223	9134

Lite set selection Let the benchmark be represented as $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ and the scoring function underlying the benchmark system be denoted as $S$. Given a model $f$, let the response of the model to a particular question in the dataset be denoted as $f(x_{i})=\widehat{y}_{i}$. We aim to select a subset of the benchmark $V\subseteq D$ such that

$$\min_{V:\left|V\right|\leq\left|D\right|}\left|\frac{1}{\left|D\right|}\sum_{i=1}^{\left|D\right|}S(y_{i},\widehat{y}_{i})-\frac{1}{\left|V\right|}\sum_{i=1}^{\left|V\right|}S(y_{i},\widehat{y}_{i})\right| \tag{1}$$

This objective function has been proven to be equivalent to solving the $k$-Center problem [63] and can be viewed as finding a subset of data points that covers the full set. This corresponds to our motivation to find a subset that serves as a proxy for the full benchmarks. However, finding the exact solution to the $k$-Center problem is NP-hard [15]. Consequently, we use a greedy algorithm to compute the results efficiently. The greedy algorithm is capable of achieving a $2$-OPT solution. The details of the algorithm can be found in Appendix H.

To perform $k$-center clustering, an embedding needs to be extracted for each data point. In [63], image features were extracted by a CNN for $k$-center clustering. We employ CLIP [62] for image embeddings and BGE-M3 [8] for text embeddings, and concatenate them to produce the final embedding.
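A common greedy strategy for $k$-Center is farthest-first traversal, which is known to give a 2-approximation. The sketch below illustrates that strategy on plain Python lists; it is not the exact procedure of Appendix H. The function name `greedy_k_center`, the choice of the first center, and the toy 2-D points (standing in for concatenated image and text embeddings) are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def greedy_k_center(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Farthest-first traversal: repeatedly add the point farthest from all chosen centers."""
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of points")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy 2-D "embeddings": two tight clusters and one isolated point.
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [5.0, 5.0]]
print(greedy_k_center(toy, k=3))  # [0, 3, 4]
```

Each iteration costs one pass over the data, so selecting $k$ points from $n$ takes $O(nk)$ distance computations, which stays practical for benchmarks with tens of thousands of samples.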

To ensure that our selected subset retains the basic testing ability of the original benchmarks, we examine the correlation between the original scores and the lite set scores across six versions of LLaVA [40]. We present some of our results in Figure 3, where all results achieve a correlation coefficient $r$ larger than 0.9. Results for all the datasets we choose can be found in Appendix D.
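The check described above can be sketched as a correlation between per-model scores on the full set and on the lite set. The sketch assumes that $r$ denotes the Pearson correlation coefficient, which the text does not state explicitly, and the score lists are made-up placeholders rather than results from this paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for six hypothetical checkpoints (not real results).
full_scores = [18.0, 30.0, 45.0, 55.0, 62.0, 70.0]
lite_scores = [20.0, 29.0, 47.0, 53.0, 64.0, 69.0]
r = pearson_r(full_scores, lite_scores)
print(f"r = {r:.3f}; keep lite subset: {r > 0.9}")
```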

Lite benchmark construction We refer to the datasets used in works such as [58, 69, 2, 40] to construct LMMs-Eval Lite and select 15 datasets across different task domains for wide coverage. To maintain a low cost during evaluation, we apply the selection method to pick representative points for datasets containing more than 1500 data points. The correlation between the original scores and the lite set scores is low for MME [18], so we decided to keep the full version of it. In addition, we curate a new version of LMMs-Eval Lite in Appendix G that contains more datasets.

Score Aggregation

To provide an overall signal to guide model development, we designed a strategy to aggregate the scores across different benchmarks in LMMs-Eval Lite. Since different datasets and benchmarks come with their own metrics, it is not reasonable to simply average the raw scores. Instead, we first normalize the scores from each dataset to a range of 0 to 100 and then take their average as the final aggregated score. We report the aggregated score before and after the lite set pruning in Figure 4 to demonstrate the effectiveness of our selection method. Note that LMMs-Eval Lite is not designed to fully compare the performance of different model families. Instead, it serves as a tool to provide useful and low-cost signals during model training and ablations.
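A minimal sketch of this aggregation is shown below. The text only says that scores are normalized to a range of 100; dividing by the maximum attainable score of each benchmark is an assumption made here, as are the function name `aggregate` and the maximum of 2800 used for MME. The two raw scores are the LLaVA-1.5-7B entries for AI2D and MME from Table 1.

```python
from typing import Dict, Tuple


def aggregate(scores: Dict[str, float], max_scores: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Rescale each benchmark score to 0-100, then average the rescaled scores."""
    normalized = {name: 100.0 * scores[name] / max_scores[name] for name in scores}
    overall = sum(normalized.values()) / len(normalized)
    return overall, normalized


# Raw scores from Table 1 (LLaVA-1.5-7B); maximum scores are assumptions.
raw = {"AI2D": 54.8, "MME": 1859.0}
maxima = {"AI2D": 100.0, "MME": 2800.0}
overall, per_task = aggregate(raw, maxima)
print(round(per_task["MME"], 1), round(overall, 1))
```

Without the rescaling step, the MME score would dominate a plain average simply because its metric lives on a scale of thousands.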

4 LiveBench: From Static to Live Evaluation

4.1 Probing into Multimodal Data Contamination

LMMs are trained on massive amounts of data. For instance, Qwen-VL [3] leverages 1.4 billion pretraining samples and CogVLM [75] uses 1.5 billion. At this scale, overlap with evaluation data is hard to rule out, and research in both LLMs [86, 76] and LMMs [9] has indicated that data contamination can significantly skew benchmark scores. This highlights the need for careful data management and validation to ensure accurate and fair evaluations.

We explore multimodal training within the LLaVA framework, which uses two primary data types: (1) pretraining data to align visual and textual embeddings and train the vision encoder, and (2) high-quality supervised finetuning data to improve diverse instruction-following capabilities. The re-annotation and conversion of large web and academic datasets into training materials frequently lead to issues of overlap and contamination. To address this, we developed an analytical tool to assess the overlap between training and benchmark data, and we showcase our findings using the training data from [40] with user data removed.

Text Overlap To measure text overlap, we use a string matching technique similar to those used by GPT-4 [59], PaLM [70], and LLaMA [74]. Typically, an n-gram size between $8$ and $13$ is used [6], but we consistently use $8$-grams for simplicity. We exclude any n-gram appearing more than $10$ times in the training data, labeling these as meaningless n-grams. We also calculate an overlap ratio for each new n-gram candidate against our set of meaningless n-grams, excluding those exceeding a predefined threshold.
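The core of this procedure can be sketched with a set of training 8-grams and a frequency filter. The sketch makes several assumptions that the text does not specify: whitespace tokenization, lowercasing, and flagging a benchmark sample as soon as one 8-gram matches. The additional overlap-ratio filter against meaningless n-grams is omitted, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int = 8) -> List[NGram]:
    """All contiguous n-grams of a token list (empty if the list is too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(train_texts: Iterable[str], n: int = 8, max_count: int = 10) -> Set[NGram]:
    """Collect training n-grams, dropping those seen more than max_count times."""
    counts: Counter = Counter()
    for text in train_texts:
        counts.update(ngrams(text.lower().split(), n))
    return {gram for gram, count in counts.items() if count <= max_count}


def is_contaminated(sample: str, index: Set[NGram], n: int = 8) -> bool:
    """Flag a benchmark sample if any of its n-grams appears in the training index."""
    return any(gram in index for gram in ngrams(sample.lower().split(), n))


train = ["the quick brown fox jumps over the lazy dog today"]
index = build_index(train)
print(is_contaminated("Note: the quick brown fox jumps over the lazy dog", index))  # True
print(is_contaminated("what is the total sales value in this chart", index))        # False
```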

Image Overlap Compared with text overlap, determining image overlap is a more challenging task. While it is common practice to compute image embeddings and then calculate their cosine similarity, selecting an appropriate threshold applicable to all datasets is difficult. Instead of computing similarity in the embedding space, we empirically find that using the pretrained SEED-tokenizer [20] leads to meaningful separation in detecting the overlap. We first tokenize each image into a 1-D sequence of 32 tokens. Similar to text, we then construct an 8-gram lookup table from those image tokens to detect image contamination. Since 8 of the 32 tokens match, the occurrence of an 8-gram overlap can be interpreted as approximately $1/4$ of the image overlapping.
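The $1/4$ interpretation can be illustrated on token sequences directly. In the sketch below the integer token IDs are made up and stand in for the output of an image tokenizer, which is not invoked here; the function names are hypothetical.

```python
from typing import List, Set, Tuple


def image_ngram_table(token_seqs: List[List[int]], n: int = 8) -> Set[Tuple[int, ...]]:
    """Lookup table of all n-grams over the training images' token sequences."""
    table = set()
    for tokens in token_seqs:
        for i in range(len(tokens) - n + 1):
            table.add(tuple(tokens[i:i + n]))
    return table


def covered_fraction(tokens: List[int], table: Set[Tuple[int, ...]], n: int = 8) -> float:
    """Fraction of a benchmark image's tokens that lie inside a matched n-gram."""
    covered = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in table:
            covered.update(range(i, i + n))
    return len(covered) / len(tokens)


# Made-up 32-token sequences; the benchmark image shares one 8-token run.
train_image = list(range(100, 132))
bench_image = list(range(32))
bench_image[4:12] = train_image[10:18]

table = image_ngram_table([train_image])
print(covered_fraction(bench_image, table))  # 0.25
```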

4.1.1 Results & Analysis on Decontamination

To evaluate the potential contamination of current benchmarks, we selected over 20 benchmarks, including AI2D [29], ChartQA [54], NoCaps [1], VQA v2 [21], and LLaVA-in-the-wild [42]. We report the percentages of image and text overlap for our selected datasets in Figure 5 and further qualitative results in Figure 6. Our examination of both image and text overlaps has revealed three primary types of data contamination across various benchmarks.

Duplicate Images Instances of completely identical images between the training set and benchmark datasets were observed. This issue is exemplified by two identical images in ChartQA [54] and MM-Vet [83].

Similar Images Our image n-gram analysis has successfully identified visually similar images in both the training and benchmark datasets. Such similarities could lead to semantically similar questions, as demonstrated in examples from NoCaps [1], ChartQA [54] and MM-Vet [83].

Similar Questions We also observe recurring question structures in the training data that mirror those in the benchmark dataset. Although the corresponding images may differ, the similarity in question structure could give the model an advantage in responding to benchmark queries.

4.2 Multimodal LiveBench

Traditional benchmarks focus on static evaluations using fixed questions and answers. As multimodal research progresses, open-source models often outperform commercial ones like GPT-4V in benchmarks, yet they lag in real user experience. Dynamic, user-oriented public arenas like LMSys and WildVision are gaining popularity for model evaluation but struggle with prompt quality control, difficulty, distribution, and noisy traffic, making consistent comparisons difficult. Additionally, they require collecting tens of thousands of user preferences, which makes the evaluation extremely costly. Recent benchmarks such as Vibe-Eval [61] and LLaVA-Wilder [32] use real-world data to test models’ abilities in the wild more authentically. However, because the training data of current foundation models is continuously crawled and updated from the web, a trained model may inevitably have seen the evaluation data, contaminating the benchmark.

To address this issue, we propose a new evaluation framework, LiveBench. The key idea of LiveBench is to evaluate the model’s performance on a continuously updated dataset to achieve zero contamination while maintaining low cost. We collect the evaluation dataset from the web, and build a pipeline to automatically gather the latest global information from sources such as news websites and community forums. The details are as follows.

4.2.1 Data Collection From the Web

To ensure the timeliness and authenticity of our information, we select sources from over 60 news outlets, including CNN, BBC, Japan’s Asahi Shimbun, and China’s Xinhua News Agency, as well as insights from forums like Reddit. A detailed list of these sources is provided in Section I.1.

We begin by capturing screenshots of home pages and then refine these images by removing white margins and other non-news elements to ensure the content focuses on news information, not advertisements or errors due to website blocking. For analysis, we select a quiz model from our pool of the most powerful current commercial multimodal models, such as GPT-4V, Claude-3-Opus, and Gemini-1.5-Pro. We then guide the quiz model to progressively ask questions across multiple dimensions, including (1) basic understanding, (2) contextual analysis, (3) deeper and broader implications, and (4) further insights. The quiz model designs a Q&A set to address these dimensions. Subsequently, another model from our pool reviews and revises the questions for accuracy and relevance.

The final Q&As are then reviewed by humans for ultimate validation. To balance data collection costs and user evaluation, we aim to gather about 500 questions monthly, selecting 100-300 for our final LiveBench problem set, tagged with identifiers like LiveBench-2024-05.
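The collection steps above can be summarized as a short pipeline sketch. Everything in it is hypothetical scaffolding: `quiz_model`, `reviewer_model`, and `human_approves` are caller-supplied stand-ins for the commercial models and the human reviewers, screenshots are represented by file-name strings, and the sketch does not reproduce the authors' prompts or selection criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

DIMENSIONS = ["basic understanding", "contextual analysis",
              "deeper and broader implications", "further insights"]


@dataclass
class QAItem:
    source: str      # identifier of the cleaned screenshot
    dimension: str
    question: str
    answer: str


def build_problem_set(
    screenshots: List[str],
    quiz_model: Callable[[str, str], Tuple[str, str]],
    reviewer_model: Callable[[str, str, str], Tuple[str, str]],
    human_approves: Callable[[QAItem], bool],
    year: int,
    month: int,
) -> Tuple[str, List[QAItem]]:
    """Generate, revise, and filter Q&A pairs, then tag the set by month."""
    candidates = []
    for shot in screenshots:
        for dim in DIMENSIONS:
            question, answer = quiz_model(shot, dim)                   # draft Q&A
            question, answer = reviewer_model(shot, question, answer)  # second-model revision
            candidates.append(QAItem(shot, dim, question, answer))
    approved = [item for item in candidates if human_approves(item)]   # human validation
    return f"LiveBench-{year:04d}-{month:02d}", approved


# Stub callables so the sketch runs end to end.
tag, items = build_problem_set(
    ["news_home.png"],
    quiz_model=lambda shot, dim: (f"[{dim}] What is the top headline?", "placeholder"),
    reviewer_model=lambda shot, q, a: (q.strip(), a.strip()),
    human_approves=lambda item: item.dimension != "further insights",
    year=2024, month=5,
)
print(tag, len(items))  # LiveBench-2024-05 3
```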

4.2.2 Evaluation Metrics & Results on LiveBench

We adopt the scoring criteria from LLaVA-Wilder [32] and Vibe-Eval [61]. The judge model assigns a score in $[1,10]$ based on the provided ground-truth answer, as detailed in Section 4.2.3. We use GPT-4o as the default judge model due to its popularity and high-throughput API. Additionally, Claude-3-Opus and Gemini 1.5 Pro are implemented as alternative judge models. The final reported results are scaled from these scores to an accuracy metric from 0 to 100.

Table 3: LiveBench-2024-06 Results. We include the overall accuracy and the accuracy of each dimension. We use DI, BI, and FI as abbreviations for Deeper Implications, Broader Implications, and Further Insights. We maintain the benchmark monthly and publish the results of SOTA-level multimodal models on the LiveBench Leaderboard.

Model	LLM	Overall	Basic	Contextual	DI	BI	FI
Idefics-2-8B	Mistral-v0.1-7B	36.1	41.4	29.6	35.6	45.4	28.6
InstructBLIP-7B	Vicuna-1.1-7B	40.4	16.0	32.8	44.2	60.4	48.8
InstructBLIP-13B	Vicuna-1.1-13B	42.9	24.6	32.6	48.8	66.6	41.8
LLaVA-1.5-7B	Vicuna-1.5-7B	45.6	19.0	36.4	56.2	69.2	47.4
LLaVA-1.5-13B	Vicuna-1.5-13B	48.9	23.2	37.4	56.0	72.2	55.8
GPT-4-Turbo (wo/vision)	-	51.9	8.4	36.4	72.0	76.8	66.0
InternVL-2-2b	InternLM-2-1.8B	51.9	49.0	44.6	48.4	61.8	55.8
LLaVA-NeXT-8B	LLaMA-3-8B	67.8	50.9	62.7	74.7	80.0	70.0
InternVL-2-4b	Phi-3-3.8B	68.2	71.2	60.2	66.6	76.4	66.4
XComposer-4KHD	InternLM-2-7B	70.7	76.8	65.4	70.0	72.8	68.4
InternVL-2-8B	InternLM-2.5-7B	73.4	81.2	68.6	71.0	76.6	69.6
InternVL-2-26B	InternLM-2-20B	77.2	75.8	72.0	80.4	78.6	79.2
LLaVA-NeXT-34B	Nous-Hermes-2-Yi-34B	78.4	73.0	72.4	82.4	87.8	76.2
InternVL-1.5-26B	InternLM-2-20B	80.1	80.6	80.8	79.2	80.6	79.4
LLaVA-NeXT-72B	Qwen-1.5-72B	80.2	76.2	72.8	84.8	86.2	80.8
Gemini-1.5-Flash	-	85.7	86.8	83.0	84.6	87.8	86.2
Gemini-1.5-Pro	-	85.8	92.4	81.0	84.0	92.2	79.6
Claude-3-5-Sonnet	-	92.3	93.2	90.8	91.4	95.2	91.0
GPT-4o	-	92.4	91.0	89.8	92.8	96.4	92.0
GPT-4-Turbo	-	93.0	91.6	89.4	90.8	99.0	94.0

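We present the results of the LiveBench evaluation in Table 3. The results clearly indicate that the GPT-4 series models, including GPT-4-Turbo and GPT-4o, are among the top performers, with Claude-3-5-Sonnet close behind and the Gemini series following. In contrast, open-source models still lag behind these commercial models: the best open-source result, 80.2 for LLaVA-NeXT-72B, is below every commercial model with vision input.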

Although many open-source models outperform these commercial models in static academic benchmarks (e.g. MME [18] and MMBench [44]), our findings support the hypothesis that commercial multimodal models like GPT-4V possess robust capabilities that existing benchmarks fail to fully capture. Specifically, LiveBench requires models to demonstrate strong zero-shot generalization abilities to interpret constantly updated content from news and forum websites.

We may still be far from reaching the level of GPT-4V. Where open-source models currently surpass it on benchmarks, this is merely because the considered scenarios are too simple, fixed, or already contaminated. These findings, despite appearing as a setback for competitors, actually illuminate the limitations of conventional evaluation benchmarks. They emphasize the necessity for more thorough evaluations to accurately gauge model performance. Benchmarking serves as a compass for advancing AI, and these results offer valuable insights for prospective challengers seeking to enhance their models.

4.2.3 Case Analysis on LiveBench

The evaluation results on LiveBench show a different trend. In many existing benchmarks, the performance of open-source multimodal models has surpassed commercial models like GPT-4V, Gemini, and Claude. However, in LiveBench, the commercial models still outperform the open-source models. Here we list some of the hallucination cases in the open-source models that caused the poor performance. For more details, please refer to LiveBench Details.

In Figure 8, we present a case analysis of hallucination in the LLaVA-NeXT-72B and InternVL-2-26B models. In the first case, the question is about the Biden-Trump debate, but LLaVA-NeXT-72B hallucinates by interpreting the headlines "Why an Israel-Hezbollah war would be far more…" and "Julian Assange…" as indicating contrasting consequences. However, these headlines are neither directly related to the debate’s outcome nor suggestive of broader international issues. In the second case, InternVL-2-26B fails to describe the image accompanying the article "Lynas Bets on New Rare Earths Products, Breaking China Stranglehold" and instead focuses on the image next to the article.

Both open-source models hallucinate by attributing context to nearby headlines or images. This may suggest that the models are not well trained to understand the context of news articles and the layout of a modern website. Meanwhile, we did not observe such hallucinations to be common in commercial models.

5 Conclusions

In this work, we conducted a thorough reality check on the current evaluation pipeline and benchmarks for LMMs. We recognize the difficulties in the evaluation due to the evaluation trilemma. Although we cannot break this trilemma, we present three key contributions to find a better trade-off: 1) LMMs-Eval, a unified evaluation suite for a standardized and large-scale LMM evaluation, 2) LMMs-Eval Lite to balance low-cost evaluation with wide coverage, and 3) LiveBench, a benchmark that transforms traditional static evaluation into a dynamic format to address potential data contamination in LMMs evaluation. We hope our LMMs-Eval family makes a valuable contribution to the community towards the holistic evaluation of LMMs.

Limitation & Future Work Through this reality check, we explore the field of evaluation in LMMs and re-examine the evaluation process. Throughout the paper, we assume that the evaluation trilemma cannot be resolved. This suggests future work that goes deeper into finding a better trade-off among the sides of the trilemma or potentially overcoming it. Additionally, we address the issue of data contamination using a relatively simple method that requires access to the training data, while most research groups do not open-source their data. Future work may focus on methods that rely solely on the model and develop more efficient approaches.

References

[1] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 8948–8957, 2019.
[2] Anthropic. Introducing the next generation of claude. Anthropic News, March 2024.
[3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
[4] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
[5] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. Scene text visual question answering, 2019.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[7] Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, and Alex Kot. Benchlmm: Benchmarking cross-style visual capability of large multimodal models, 2023.
[8] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[9] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024.
[10] Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu. Websrc: A dataset for web-based structural reading comprehension, 2021.
[11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[12] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024.
[13] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.
[14] Cohere. Introducing command r+: A scalable llm built for business, 2024.
[15] W. Cook. Combinatorial Optimization. A Wiley-Interscience publication. Wiley, 1997.
[16] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[17] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
[18] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024.
[19] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023.
[20] Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023.
[21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023.
[23] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[24] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
[25] Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, and Maosong Sun. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv preprint arXiv:2308.12038, 2023.
[26] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism, 2019.
[27] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[28] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October 2014. Association for Computational Linguistics.
[29] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
[30] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV), 2022.
[31] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
[32] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024.
[33] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023.
[34] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models, 2023.
[35] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023.
[36] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[37] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023.
[38] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[40] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
[41] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[42] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[43] Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? arXiv preprint arXiv:2404.05955, 2024.
[44] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024.
[45] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[46] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
[47] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024.
[48] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[49] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning, 2022.
[50] Yujie Lu, Dongfu Jiang, Wenhu Chen, William Wang, Yejin Choi, and Bill Yuchen Lin. Wildvision arena: Benchmarking multimodal llms in the wild, February 2024.
[51] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[52] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
[53] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[54] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022.
[55] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V. Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1697–1706, January 2022.
[56] Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. Docvqa: A dataset for vqa on document images. arXiv preprint arXiv:2007.00398, 2020.
[57] Mistral. Mixtral 8x22b: Cheaper, better, faster, stronger, 2024.
[58] OpenAI. Gpt-4v(ision) system card, 2023.
[59] OpenAI. Gpt-4 technical report, 2024.
[60] Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024.
[61] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024.
[62] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[63] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[64] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020.
[65] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020.
[66] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
[67] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
[68] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[69] Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
[70] PaLM Team. Palm 2 technical report, 2023.
[71] Qwen Team. Introducing qwen-vl, 2024.
[72] The HuggingFaceH4 Team. Open llm leaderboard - a hugging face space by huggingfaceh4, 2023.
[73] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multi-page docvqa, 2023.
[74] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
[75] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2024.
[76] Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahui Zhou. Skywork: A more open bilingual foundation model, 2023.
[77] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
[78] xAI. Grok-1.5 vision preview, 2024.
[79] Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023.
[80] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
[81] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[82] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[83] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
[84] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
[85] Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Wenhu Chen, and Jie Fu. Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark, 2024.
[86] Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, and Summer Yue. A careful examination of large language model performance on grade school arithmetic, 2024.
[87] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition, 2023.
[88] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024.

Appendix A Broader Impacts

A comprehensive evaluation framework can help identify the limitations of existing multimodal models, preventing potential AI misuse. On the other hand, benchmarks can also introduce biases that may not reflect real-world scenarios. If the benchmarks are not representative of diverse applications and contexts, there is a risk that models optimized for these benchmarks may perform poorly in practical settings. Besides, automatic evaluations cannot replace expert human assessment in specialized fields such as medical imaging. The construction of LiveBench uses real-world data crawled from the web. It could potentially lead to concerns regarding data privacy. The benchmarks we provide are meant for research purposes only and should be used with caution.

Appendix B Data Contamination

Table 4: Detailed image overlap and text overlap statistics across different datasets

		Image overlap (%)	Text overlap (%)
Dataset	Split	LLaVA-NeXT Data	LLaVA-NeXT Data
Math & Science
AI2D [29]	test	6.09	25.97
MathVista [47]	testmini	9.90	7.70
ScienceQA [48]	img	0.35	1.54
Doc & Infographic
ChartQA [54]	test	68.64	26.52
DocVQA [56]	val	36.08	4.06
InfoVQA [56]	test	0.14	0.39
Caption
COCO2014 [39]	val	46.05	22.19
Flickr30k [81]	test	2.97	0.00
NoCaps [1]	val	2.53	19.98
TextCaps [64]	val	3.79	0.00
VQA
GQA [27]	testdev-balanced	13.91	9.50
TextVQA [66]	val	3.90	2.00
VQAv2 [21]	val	46.21	2.90
Multi-task benchmark
CMMMU [85]	val	2.89	1.11
MMBench [44]	cn-dev	2.77	0.81
MMBench [44]	en-dev	2.77	7.97
MME [18]	test	1.60	1.39
MMMU [84]	val	2.67	3.56
MMVet [83]	val	4.13	3.21
SEED-Bench [35]	all	1.11	13.84
Others
LLaVA-W [42]	test	5.00	1.67
POPE [37]	val	42.20	0.00

We present detailed image and text overlap statistics in Table 4. Datasets such as ChartQA [54], DocVQA [56], COCO [39], and VQAv2 [21] were included in the LLaVA-NeXT [40] training data and thus suffer the most from data contamination. Most of the remaining benchmarks maintain a relatively low contamination proportion, with image and text overlap below 10%. POPE [37] shows a high image overlap ratio because it draws its images from COCO [39].

Appendix C More Qualitative Examples

We present more qualitative results here to illustrate the data contamination problem. We observe identical images in benchmarks such as LLaVA-W [42], MathVista [47], and InfoVQA [56]. Similar images are another issue across datasets; we present two more examples from NoCaps [1] and MM-Vet [83]. Text overlap helps us detect questions with similar sentence structure. Even when the images are not similar enough to count as overlapping, such questions may still be marked as in-domain. For example, we present two cases in MathVista [47]. Though not necessarily contaminated or overlapping, the two images in each case test similar domain knowledge, which may help the model answer questions in the benchmark.

Appendix D Coreset Selection correlation

Table 5: The full correlation results we achieve using our selection methods

				Correlation
Dataset	Split	Lite Size	Original Size	LLaVA Embedding	CLIP+BGE Embedding
Math & Science
AI2D [29]	test	300	3088	0.94	0.98
Doc & Infographic
ChartQA [54]	test	400	2500	0.96	0.97
DocVQA [56]	val	400	5349	0.99	0.99
InfoVQA [56]	val	200	2801	0.94	0.94
Caption
Flickr30k [81]	test	400	31784	0.99	0.91
NoCaps [1]	val	400	4500	0.99	0.98
TextCaps [64]	val	300	3166	0.98	0.96
RefCOCO [28]	val	500	8811	0.99	0.99
VQA
TextVQA [66]	val	300	5000	0.99	0.99
Multi-task benchmark
SeedBench [35]	test	700	17990	0.77	0.87

For each dataset, we compare model scores on the Lite subset with scores on the full dataset and compute the correlation between them. We tried two different embeddings to perform $k$-center clustering. In addition to using CLIP [62] and BGE [8] embeddings, we also trained a LLaVA-Qwen 1.8B model following the training recipe of [40] to embed image and text pairs simultaneously. For LLaVA embeddings, the last hidden states of all tokens were averaged into a single vector that serves as the feature vector for each data point. We report the correlation results for both embeddings in Table 5.
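The two operations described above, averaging per-token hidden states into one feature vector and correlating Lite scores with full-set scores, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the hidden states are assumed to be already extracted as a NumPy array, and Pearson correlation is an assumption because the text does not name the correlation coefficient used.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average last-layer hidden states of shape (num_tokens, dim) into one (dim,) vector."""
    if hidden_states.ndim != 2 or hidden_states.shape[0] == 0:
        raise ValueError("expected a non-empty array of shape (num_tokens, dim)")
    return hidden_states.mean(axis=0)


def score_correlation(full_scores, lite_scores) -> float:
    """Pearson correlation between per-model scores on the full set and on the Lite subset.

    Assumption: Pearson is used here only for illustration.
    """
    full = np.asarray(full_scores, dtype=float)
    lite = np.asarray(lite_scores, dtype=float)
    if full.ndim != 1 or full.shape != lite.shape or full.size < 2:
        raise ValueError("expected two 1-D score lists of equal length >= 2")
    return float(np.corrcoef(full, lite)[0, 1])
```

Here each entry of `full_scores` and `lite_scores` would correspond to one evaluated model; a correlation close to 1 indicates that the Lite subset preserves the model ranking signal of the full dataset.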

Appendix E LMMs-Eval Suite Information

Table 6: Dataset Statistics in LMMs-Eval. This table categorizes the initial set of tasks, detailing their task domains, ground-truth types, instance counts, and splits. We provide a comprehensive overview of the diverse datasets employed, which cover various task domains and evaluation metrics.

Datasets	Task Domains	Ground-Truth Types	Instances	Splits
AI2D [29]	Science, Diagram	Multi-Choice	3088	test
BenchLMM [7]	Cross Style Understanding	Short Answer / Multi-Choice	102	test
ChartQA [54]	Chart	Short Answer	2500	test
CMMMU [85]	Multi-task, World Knowledge	Free-form / Multi-Choice	900 / 11000	val / test
COCO 2014 Caption [39]	Captioning	Short Answer	40775 / 40504	test / val
COCO 2017 Caption [39]	Captioning	Short Answer	40670 / 5000	test / val
DocVQA [56]	Document	Short Answer	5349	test
Ferret [80]	Referring or Grounding Actions	Free-form Answer	120	test
Flickr30k [82]	Visual Understanding	Captioning	31783	test
GQA [27]	Real-World/Compositional QA	Short Answer	12578	test / dev
Hallusion-Bench [22]	Multimodal Image-Context Reasoning	Yes or No	951	image
IconQA [49]	Abstract Diagrams	Multi-Choice / Short Answer	21489 / 21488	test / val
InfoVQA [55]	Infographics understanding	Extractive / Numerical	2801	val
LLaVA-COCO [42]	Conversation, Reasoning	Free-form Answer	90	test
LLaVA-W [42]	Conversation, Reasoning	Free-form Answer	60	test
LLaVA-Wilder [41]	Conversation, Reasoning	Free-form Answer	210/1020	test
LiveBench (Ours)	Webpage Understanding / Lively Updated	Free-form	dynamic	test
MathVista [47]	Mathematical Reasoning / Understanding	Free-form / Multi-Choice	1000	testmini
MathVerse [88]	Mathematical Reasoning / Understanding	Free-form / Multi-Choice	3940	testmini
MMBench [45]	Reasoning / Perception	Multi-Choice	6666 / 4329	test / dev
MME [18]	Perception, Cognition	Yes or No	2374	test
MMMU [84]	Multi-task, World Knowledge	Free-form / Multi-Choice	10500 / 900	test / val
MM-Vet [83]	Multi-task	Free-form	218	test
Multilingual-LLaVA-W	Multi-lingual Conversation, Reasoning	Free-form Answer	60	test
MultiDocVQA [73]	Document	Short Answer	5019 / 5187	test / val
NoCaps [1]	Novel Object Captioning	Short Answer	4500	val
OCRBench [46]	Text Recognition	Short Answer	1000	test
OKVQA [52]	Knowledge-based Visual QA	Short Answer	5046	val
OlympiadBench [24]	Reasoning	Short Answer	2126 / 6351	test-en / test-cn
POPE [37]	Hallucination	Yes or No	9000	test
Q-Bench [77]	Image Quality Assessment	Short Answer / Multi-Choice	2990	test
RealWorldQA [78]	Real world scenarios QA	Multi-Choice	765	test
Refcoco [28, 51]	Referring Expression	Short Answer	5000 / 1975 / 1810 / 8811	bbox-test / A / B / val
Refcoco [28, 51]	Referring Expression	Short Answer	5000 / 1975 / 1810 / 8811	seg-test / A / B / val
Refcoco+ [28, 51]	Referring Expression	Short Answer	1975 / 1798 / 3805	bbox-testA / B / val
Refcoco+ [28, 51]	Referring Expression	Short Answer	1975 / 1798 / 3805	seg-testA / B / val
Refcocog [28, 51]	Referring Expression	Short Answer	5023 / 7573	bbox-test / val
Refcocog [28, 51]	Referring Expression	Short Answer	5023 / 7573	seg-test / val
ScienceQA [48]	Science, World Knowledge, Reasoning	Multi-Choice	4241	test
ScreenSPOT [12]	GUI Understanding / Navigation	Short Answer / Coordinates	1272	test
SEED-Bench [36]	Spatial and Temporal Understanding	Multi-Choice	17990	test
SEED-Bench-2 [34]	Multi-disciplinary Knowledge	Multi-Choice	24371	test
ST-VQA [5]	High-level Semantic Information Understanding	Short Answer	4070	test
SynthDoG [30]	Text Understanding	Free-form	500 / 500	val-en / val-zh
TextCaps [65]	Text Understanding	Captioning	21953 / 3166 / 3289	train / val / test
TextVQA [67]	Text Understanding	Short Answer	5000 / 5734	val / test
VisualWebBench [43]	Webpage Understanding / OCR / Reasoning	Short Answer / Multi-Choice	1536	test
VizwizVQA [23]	Low Quality Image Understanding	Short Answer	8000 / 4319	test / val
VQAv2 [21]	Visual QA	Free-form	447793 / 214354	test / val
WebSRC [10]	Structure of Webpage	Short Answer / Yes or No	40357 / 52826	test / dev

Table 7: Detailed Statistics of the Initial Set of Models in LMMs-Eval. The models are categorized by their model family, with their inference parameters, model types (indicating whether they are open-sourced or accessed via API), and parallel types, which denote the strategy leveraged during the model inference.

Model Family	Model Version	Parameters	Model Type	Parallel Type
InstructBLIP	InstructBLIP-Vicuna-7B	7B	Open-sourced	Data
InstructBLIP	InstructBLIP-Vicuna-13B	13B	Open-sourced	Data
Fuyu	Fuyu-8B	8B	Open-sourced	Data
Idefics	Idefics-2-8B	8B	Open-sourced	Data
MiniCPM	MiniCPM-V 2.8B	2.8B	Open-sourced	Data
XComposer	XComposer-4KHD	8B	Open-sourced	Data
InternVL	InternVL-1.5	26B	Open-sourced	Data
LLaVA	LLaVA-1.5-7B	7B	Open-sourced	Data
	LLaVA-1.5-13B	13B	Open-sourced	Data
	LLaVA-NeXT-Vicuna-7B	7B	Open-sourced	Data
	LLaVA-NeXT-Vicuna-13B	13B	Open-sourced	Data
	LLaVA-NeXT-Mistral-7B	7B	Open-sourced	Data
	LLaVA-NeXT-Yi-34B	34B	Open-sourced	Data
	LLaVA-NeXT-LLaMA-3-8B	8B	Open-sourced	Data
	LLaVA-NeXT-Qwen-72B	72B	Open-sourced	Model
	LLaVA-NeXT-Qwen-110B	110B	Open-sourced	Model
Qwen-VL	Qwen-VL-Chat-7B	7B	Open-sourced	Data
	Qwen-VL-Plus	N/A	Closed-source, API	Data
	Qwen-VL-MAX	N/A	Closed-source, API	Data
Gemini	Gemini-1.0-Pro	N/A	Closed-source, API	Data
	Gemini-1.5-Flash	N/A	Closed-source, API	Data
	Gemini-1.5-Pro	N/A	Closed-source, API	Data
GPT4	GPT-4V	N/A	Closed-source, API	Data
GPT4	GPT-4o	N/A	Closed-source, API	Data
Claude	Claude-3-Haiku	N/A	Closed-source, API	Data
	Claude-3-Sonnet	N/A	Closed-source, API	Data
	Claude-3-Opus	N/A	Closed-source, API	Data

Datasets on LMMs-Eval. In previous research, benchmarks such as AI2D [29], TextVQA [66], TextCaps [64], Flickr30k [81], and OK-VQA [53], among many others, have been employed to assess a model's performance in tasks such as captioning, optical character recognition (OCR), and visual QA. With the advent of Large Multimodal Models (LMMs), evaluations have increasingly focused on broader capabilities spanning both vision and language, including reasoning [48] and visual instruction following [42]. Consequently, new benchmarks featuring increasingly challenging tasks and more comprehensive evaluations have been proposed. For example, ScienceQA [48] and MathVista [47] assess mathematical and scientific competencies, while benchmarks like SEED-Bench [35], CMMMU [85], MMMU [84], and MM-Bench [44] evaluate the multifaceted dimensions of multimodal models.

Models on LMMs-Eval. To enable comparisons of different models on new benchmarks and to understand their capabilities across multiple tasks, LMMs-Eval supports over 10 models, including open-source models such as Fuyu [4], LLaVA [42], Instruct-BLIP [16], InternVL [11], XComposer [17], Qwen-VL [3], MiniCPM [25], and Idefics [31], and closed-source models such as GPT-4V [58], Gemini [69], Qwen-VL-Max [71], and Claude [2].

Appendix F Unified Evaluation Results with LMMs-Eval

Table 8: More results using LMMs-Eval

	Split	Metric	#Num	LLaVA-1.5-7B	LLaVA-1.5-13B	LLaVA-NeXT-mistral-7B	LLaVA-NeXT-vicuna-7B	LLaVA-NeXT-13B	LLaVA-NeXT-34B
COCO-Cap	cococap_val_2014	CIDEr	40,504	108.66	113.88	107.66	96.98	99.45	103.16
COCO-Cap	cococap_val_2017	CIDEr	5,000	110.38	115.61	109.22	99.93	101.99	105.89
DocVQA	val	ANLS	5,349	28.08	30.29	72.16	74.35	77.45	83.98
GQA	testdev_balanced_instructions	Acc	12,578	61.97	63.24	54.98	64.23	65.36	67.08
MultidocVQA	val	ANLS/Acc	5,187	16.65/7.21	18.25/8.02	41.4/27.89	44.42/31.32	46.28/32.56	50.16/34.93
NoCaps	nocaps_eval	CIDEr	4,500	105.54	109.28	96.14	88.29	88.27	91.94
OKVQA	val	Acc	5,046	53.44	58.22	54.77	44.25	46.27	46.84
POPE	test	F1 Score	9,000	85.87	85.92	86.79	86.4	86.26	87.77
ScienceQA	scienceqa-full	Acc	4,114	70.41	74.96	28.84	73.21	75.85	85.81
Refcoco	all	CIDEr	17,596	29.76	34.26	9.47	34.2	34.75	33.56
Refcoco+	all	CIDEr	7,578	28.92	31.01	9.05	31.82	32	30.66
Refcocog	all	CIDEr	12,596	57.76	59.23	19.35	52.18	58.02	59.26
ScienceQA	scienceqa-img	Acc	2,017	70.43	72.88	28.56	70.15	73.57	81.85
SEED-Bench	Seed-1	Image-Acc	17,990	60.49	67.06	65.97	64.74	65.64	69.55
SEED-Bench-2	Seed-2	Acc	24,371	57.89	59.88	60.83	59.88	60.72	64.98
TextCaps	val	CIDEr	3,166	98.15	103.92	70.39	71.79	67.39	67.11
TextVQA	val	exact_match	5,000	46.07	48.73	65.76	64.85	66.92	69.31
VizWiz(val)	val	Acc	4,319	54.39	56.65	63.79	60.64	63.56	66.61
VQAv2	val	Acc	214,354	76.64	78.26	80.32	80.06	80.92	82.07

We present additional results using LMMs-Eval here. Due to limited computational resources, we are only able to provide a holistic view of models from the LLaVA [40] series. This demonstrates that achieving both wide coverage and low-cost evaluation simultaneously is not feasible, necessitating a balance between these two aspects.

Appendix G Curating more datasets in LMMs-Eval Lite

Table 9: LMMs-Eval Lite with more datasets, where we fix the size of the Lite version and include more fields and datasets for a more holistic and diverse evaluation during swift development

Task Domain	Dataset	Split	Full Size	Lite Size
Doc & Infographic Understanding	ChartQA	test	2500	500
	DocVQA	val	5349	500
	InfoVQA	val	2801	500
Image Understanding & Captioning	Flickr30k	val	31784	500
	NoCaps	val	4500	500
	TextCaps	val	3166	500
	RefCOCO	val	8811	500
	COCO	val	5000	500
Visual Question Answering	GQA	test	12578	500
	OKVQA	val	5046	500
	VizWiz-VQA	val	4319	500
	VQA-V2	val	214354	500
	TextVQA	val	5000	500
Math & Science	MathVista	testmini	1000	1000
Math & Science	AI2D	test	3088	500
Visual Dialogue	LLaVA-W	test	60	60
Multi-discipline	MM-Bench	cn-dev	4329	500
	MM-Bench	en-dev	4377	500
	MME	cog. & percep.	2374	2374
	MMMU	val	900	900
	CMMMU	val	900	900
	Seed-Bench	test	17990	500
-	Total	-	340226	13734

We applied the same algorithm to additional datasets to develop a more comprehensive and diverse Lite version. In contrast to the original LMMs-Eval Lite, this extended version incorporates more datasets, including COCO [39] and VQA [21].

Appendix H K-Center Greedy algorithm

Algorithm 1 k-Center-Greedy
Input: data $\mathbf{x}_{i}$ for $i\in V$ and budget $k\leq\left|V\right|$
Initialize $\mathbf{s}=\{u_{0}\}$ with a randomly chosen $u_{0}\in V$
while $\left|\mathbf{s}\right|<k$ do
    $u=\arg\max_{i\in V\setminus\mathbf{s}}\min_{j\in\mathbf{s}}\Delta(\mathbf{x}_{i},\mathbf{x}_{j})$
    $\mathbf{s}=\mathbf{s}\cup\{u\}$
end while
return $\mathbf{s}$

The greedy algorithm we use for $k$-center clustering is detailed in Algorithm 1. In $k$-center clustering, the objective is to select $k$ points from the vertex set $V$ such that the maximum distance from any point in $V$ to its nearest cluster center is minimized. In the employed greedy algorithm, a random point is initially chosen as a center. The distance from every other point to its nearest center is then updated. The point with the maximum distance from the current centers is selected and added to the center list. This process is repeated until $k$ center points have been identified.
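The greedy procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name `k_center_greedy` and its parameters are hypothetical, and Euclidean distance is assumed for $\Delta$, which the text does not specify.

```python
import numpy as np


def k_center_greedy(features: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Select k row indices of `features` (shape (n, dim)) with the greedy k-center rule."""
    n = features.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")

    # Start from a randomly chosen point, as described in the text.
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    centers = [first]

    # min_dist[i] holds the distance from point i to its nearest selected center.
    # Assumption: Euclidean distance stands in for the distance function Delta.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    min_dist[first] = -np.inf  # never re-select an existing center

    while len(centers) < k:
        # Pick the point that is farthest from all current centers.
        u = int(np.argmax(min_dist))
        centers.append(u)
        # Update nearest-center distances with the new center.
        dist_to_u = np.linalg.norm(features - features[u], axis=1)
        min_dist = np.minimum(min_dist, dist_to_u)
        min_dist[u] = -np.inf

    return centers
```

Keeping a running `min_dist` array means each iteration costs one pass over the data instead of recomputing distances to every selected center, so selecting $k$ points from $n$ takes on the order of $nk$ distance evaluations.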

Appendix I LiveBench Details

I.1 Website Candidates for LiveBench

To evaluate the performance and reliability of various news and information sources, a diverse set of websites has been selected for LiveBench. We present the websites in Table 10. These websites span multiple categories, ensuring comprehensive coverage of different domains such as general news, business, technology, and international affairs. The list of candidate websites for LiveBench includes prominent sources like BBC, CNN, Bloomberg, WSJ, and Reuters, among others. Each of these websites has been categorized based on its primary content focus. This categorization aids in the systematic evaluation of the content quality and the impact of imagery and reporting styles across different domains. Note that this is an initial set of candidate websites, and it may change depending on the availability and status of these websites.

Table 10: List of websites selected for LiveBench.

Name	URL	Category
BBC Main	https://www.bbc.com/	General News
BBC News	https://www.bbc.com/news	News
BBC Sport	https://www.bbc.com/sport	Sports
BBC Business	https://www.bbc.com/business	Business
BBC Innovation	https://www.bbc.com/innovation	Innovation
BBC Culture	https://www.bbc.com/culture	Culture
BBC Travel	https://www.bbc.com/travel	Travel
BBC Future Planet	https://www.bbc.com/future-planet	Environment
CNN Main	https://edition.cnn.com/	General News
CNN Politics	https://edition.cnn.com/politics	Politics
CNN Entertainment	https://edition.cnn.com/entertainment	Entertainment
CNN Style	https://edition.cnn.com/style	Style
Bloomberg Economics	https://www.bloomberg.com/economics	Economics
Bloomberg Industries	https://www.bloomberg.com/industries	Industries
Bloomberg Technology	https://www.bloomberg.com/technology	Technology
Bloomberg Politics	https://www.bloomberg.com/politics	Politics
Bloomberg Opinion	https://www.bloomberg.com/opinion	Opinion
WSJ Main	https://www.wsj.com/	General News
WSJ Africa	https://www.wsj.com/world/africa?mod=nav_top_subsection	Africa
WSJ Americas	https://www.wsj.com/world/americas?mod=nav_top_subsection	Americas
WSJ Asia	https://www.wsj.com/world/asia?mod=nav_top_subsection	Asia
WSJ China	https://www.wsj.com/world/china?mod=nav_top_subsection	China
WSJ Europe	https://www.wsj.com/world/europe?mod=nav_top_subsection	Europe
WSJ Middle East	https://www.wsj.com/world/middle-east?mod=nav_top_subsection	Middle East
WSJ India	https://www.wsj.com/world/india?mod=nav_top_subsection	India
WSJ Oceania	https://www.wsj.com/world/oceania?mod=nav_top_subsection	Oceania
WSJ Russia	https://www.wsj.com/world/russia?mod=nav_top_subsection	Russia
WSJ UK	https://www.wsj.com/world/uk?mod=nav_top_subsection	UK
WSJ Science	https://www.wsj.com/science?mod=nav_top_subsection	Science
WSJ Archaeology	https://www.wsj.com/science/archaeology?mod=nav_top_subsection	Archaeology
WSJ Biology	https://www.wsj.com/science/biology?mod=nav_top_subsection	Biology
WSJ Environment	https://www.wsj.com/science/environment?mod=nav_top_subsection	Environment
WSJ Physics	https://www.wsj.com/science/physics?mod=nav_top_subsection	Physics
WSJ Space	https://www.wsj.com/science/space-astronomy?mod=nav_top_subsection	Space
WSJ Central Banking	https://www.wsj.com/economy/central-banking?mod=nav_top_subsection	Central Banking
WSJ Consumers	https://www.wsj.com/economy/consumers?mod=nav_top_subsection	Consumers
WSJ Housing	https://www.wsj.com/economy/housing?mod=nav_top_subsection	Housing
WSJ Jobs	https://www.wsj.com/economy/jobs?mod=nav_top_subsection	Jobs
WSJ Trade	https://www.wsj.com/economy/trade?mod=nav_top_subsection	Trade
WSJ Global	https://www.wsj.com/economy/global	Global Economy
WSJ AI	https://www.wsj.com/tech/ai?mod=nav_top_subsection	AI
WSJ Biotech	https://www.wsj.com/tech/biotech	Biotech
WSJ Cybersecurity	https://www.wsj.com/tech/cybersecurity?mod=nav_top_subsection	Cybersecurity
WSJ Personal Tech	https://www.wsj.com/tech/personal-tech?mod=nav_top_subsection	Personal Tech
Reuters Main	https://www.reuters.com/	General News
Reuters Aerospace and Defense	https://www.reuters.com/business/aerospace-defense/	Aerospace and Defense
Reuters Autos and Transportation	https://www.reuters.com/business/autos-transportation/	Autos and Transportation
Reuters Davos	https://www.reuters.com/business/davos/	Davos
Reuters Energy	https://www.reuters.com/business/energy/	Energy
Reuters Environment	https://www.reuters.com/business/environment/	Environment
Reuters Finance	https://www.reuters.com/business/finance/	Finance
Reuters Healthcare	https://www.reuters.com/business/healthcare-pharmaceuticals/	Healthcare
Reuters Media and Telecom	https://www.reuters.com/business/media-telecom/	Media and Telecom
Reuters Retail and Consumer	https://www.reuters.com/business/retail-consumer/	Retail and Consumer
Reuters Future of Health	https://www.reuters.com/business/future-of-health/	Future of Health
Reuters Future of Money	https://www.reuters.com/business/future-of-money/	Future of Money
Reuters Take Five	https://www.reuters.com/business/take-five/	Analysis
Reuters World at Work	https://www.reuters.com/business/world-at-work/	World at Work
Reuters Breakingviews	https://www.reuters.com/breakingviews/	Opinion
Reuters Technology	https://www.reuters.com/technology/	Technology
Reuters Cybersecurity	https://www.reuters.com/technology/cybersecurity/	Cybersecurity
Reuters Space	https://www.reuters.com/technology/space/	Space
Reuters Disrupted	https://www.reuters.com/technology/disrupted/	Disruption
Reuters Momentum	https://www.reuters.com/technology/reuters-momentum/	Technology
Reuters Investigations	https://www.reuters.com/investigations/	Investigations
Andreessen Horowitz	https://a16z.com/news-content/#latest	Technology
Hacker News	https://news.ycombinator.com/	Technology
Reddit	https://www.reddit.com/?rdt=48006	Social Media
Crunchbase News	https://news.crunchbase.com/	Startups
CCTV	https://www.cctv.com/	International News

I.2 Examples from LiveBench-2024-06

Figures 10 and 11 illustrate selected examples from the LiveBench-2024-06 evaluation. These figures categorize results into three distinct types: Basic Understanding, Contextual Analysis, and Broader Implications.
