M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

Florian Schneider¹
Language Technology Group
Universität Hamburg, Germany
Sunayana Sitaram
Microsoft Research India
Bangalore, India
Abstract

Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their impressive capabilities, LLMs often exhibit significant performance disparities across different languages and cultural contexts, as demonstrated by various text-only benchmarks. However, current research lacks such benchmarks for multimodal visio-linguistic settings. This work fills this gap by introducing M5, the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual and multicultural context. M5 includes eight datasets covering five tasks and 41 languages, with a focus on underrepresented languages and culturally diverse images. Furthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a new Visio-Linguistic Outlier Detection task, in which all evaluated open-source models fail to significantly surpass the random baseline. Through extensive evaluation and analyses, we highlight substantial task-agnostic performance disparities between high- and low-resource languages. Moreover, we show that larger models do not necessarily outperform smaller ones in a multilingual setting.



¹ This work was done during a research internship with Microsoft Research India (Bangalore).

1 Introduction

Since the release of ChatGPT, Natural Language Processing has experienced a significant surge in interest and research, with a particular focus on LLMs finetuned to follow human instructions. Besides proprietary models like GPT-4 (Achiam et al., 2023), Claude (Bai et al., 2022), or Gemini (Anil et al., 2023), there are also successful open-source variants such as Llama (Touvron et al., 2023), Phi (Gunasekar et al., 2023; Abdin et al., 2024), or Mistral (Jiang et al., 2023).

Figure 1: An overview of the average performance of the models on the datasets included in the M5 benchmark. For xFlickrCO and XM3600, we report BERTScore F1. For the other datasets, the accuracy metric is reported.

While LLMs often demonstrate impressive performance on a wide range of tasks, quantifying and measuring this performance is challenging. Nevertheless, recent evaluation studies have shown that LLMs generally perform well in English but much worse in other languages (Ahuja et al., 2023a, b; Holtermann et al., 2024).

In this work, we focus on multimodal variants of LLMs, Large Multimodal Models (LMMs), such as GPT 4V (OpenAI, 2023), Gemini Pro V (Anil et al., 2023), or the popular open-source model LLaVA (Liu et al., 2023a, b). LMMs are not text-only but can also process images in addition to text.

Figure 2: An informative overview of the M5 Benchmark introduced in this work.

Most open-source LMMs comprise three major components: an LLM, a vision-encoder model, and a mapping network that projects image embeddings into the text embedding space. With this architecture, where an LLM serves as the core, we argue that LMMs inherently suffer from the same issue as LLMs: they generally perform much worse in non-English languages. However, existing benchmarks are either text-only (Ahuja et al., 2023a) or multimodal but monolingual (Yue et al., 2023) and thus cannot test this hypothesis. In other words, current research lacks multimodal multilingual benchmarks to examine LMMs' multilingual capabilities. In this work, we fill this gap by introducing the M5 Benchmark, taking a significant step towards identifying and measuring the performance disparities of current LMMs between various languages. Figure 2 and Figure 1 present a high-level summary of our benchmark. Moreover, we introduce two new evaluation datasets, including a novel vision-language task. Both datasets focus on African and Asian cultures, which are underrepresented or even non-existent in previous benchmarks. Our exhaustive analyses additionally investigate the influence of different factors on performance, such as the models' size or language fidelity.

Major Contributions

The major contributions of this work are: (a) M5, the first multimodal benchmark to assess the performance of current LMMs across five tasks, eight datasets, and 41 languages; (b) two novel datasets spanning 10 underrepresented African and Asian languages, English, and German, with images depicting the respective cultures; (c) a novel vision-language task, Visio-Linguistic Outlier Detection (VLOD); (d) a large-scale evaluation of 18 recent LMMs and a thorough analysis of their multilingual performance; and (e) a public release of our codebase and all datasets in a uniform schema to foster future research toward more equitable and accessible LMMs, and AI in general (we will release all code and data upon acceptance).

2 Related Work

Large Multi-Modal Models

This work focuses on the multimodal counterpart of large language models (LLMs), often referred to as Large Multimodal Models (LMMs). LMMs are language models capable of processing and "understanding" data other than text. While this generally subsumes images, video, audio, or more, we concentrate on visio-linguistic LMMs, i.e., models that take text and/or images as input and generate textual output.

The vast majority of open-source LMMs comprise three major components: a pretrained generative LLM as the core, a pretrained vision-encoder model that computes semantically rich image embeddings, and a shallow mapping network trained to project image embeddings into the text embedding space. One successful open-source implementation of this architecture with a recent LLM, i.e., the Llama-based Vicuna (Chiang et al., 2023; Touvron et al., 2023), is LLaVA (Liu et al., 2023b), from which many others took inspiration, also regarding the training data and process. There are also LMMs that use Cross-Attention (Wang et al., 2023; Bai et al., 2023), Q-Formers (Li et al., 2023; Geigle et al., 2023), Adapters (Eichenberg et al., 2022), or Perceiver Resamplers (Alayrac et al., 2022; Awadalla et al., 2023) to process image embeddings. For an overview including architectural details and the number of parameters of the components of the 18 LMMs we employed in this work, please see Table 8.
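The three-component design described above can be illustrated with a toy, pure-Python sketch. All dimensions, weights, and function names here are illustrative assumptions and are not taken from LLaVA or any other evaluated model. In practice the mapping network is learned and may be an MLP rather than the single linear layer assumed here.

```python
from typing import List

Vector = List[float]


def project(image_embeddings: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Linearly map each image embedding (dim d_v) into the text space (dim d_t).

    `weights` is a d_v x d_t matrix; in a real LMM it would be learned.
    """
    d_t = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_t)]
        for v in image_embeddings
    ]


def build_llm_input(
    image_embeddings: List[Vector],
    text_embeddings: List[Vector],
    weights: List[Vector],
) -> List[Vector]:
    """Prepend the projected image tokens to the text token embeddings."""
    return project(image_embeddings, weights) + text_embeddings


# Toy example: two image patches with d_v = 3, projected to d_t = 2.
patches = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[0.1, 0.2]]  # one text token embedding with d_t = 2

llm_input = build_llm_input(patches, text_tokens, W)
```

The LLM then consumes `llm_input` as one sequence, which is why the language capabilities of the core LLM plausibly carry over to the multimodal model.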

Evaluation Benchmarks

With the recent surge in research on LLMs and LMMs, analyzing the models' performance is crucial yet challenging. Popular benchmarks like BIG-Bench (bench authors, 2023), HELM (Liang et al., 2022), or MMLU (Hendrycks et al., 2020) are the de facto standard for evaluating LLMs on text-only tasks, primarily in English. Efforts like MEGA, MEGAVERSE, or MultiQ (Ahuja et al., 2023a, b; Holtermann et al., 2024) extended these monolingual benchmarks to a large set of diverse languages and showed that LLMs' performance in English versus non-English languages differs significantly.

Similarly, efforts have been made to evaluate multimodal models. Benchmarks like MMMU (Yue et al., 2023), MME (Fu et al., 2023), or MMBench (Yuan et al., 2023) assess the performance of LMMs on a vast number of text-image tasks. However, these benchmarks primarily focus on English, with some tasks available in Chinese. Like MMMU, there is CMMMU (Ge et al., 2024), which focuses on text-image tasks in Chinese. Nonetheless, evaluating state-of-the-art LMMs in a massively multilingual large-scale setting remains largely unexplored. There are only a few multimodal multilingual evaluation datasets (see Sections 3.2 and 8.6) and only two benchmarks: IGLUE (Bugliarello et al., 2022) and MEGAVERSE. However, IGLUE evaluates only non-autoregressive transformer encoders and thus does not cover state-of-the-art LLMs. In MEGAVERSE, only five recent LMMs are evaluated on two datasets.

3 The M5 Benchmark

This section describes the setup of the M5 Benchmark introduced by this work. Details about the experimental setup, including prompts and hyperparameters, are reported in Appendix A.

3.1 Models

We chose the LMMs included in this benchmark for the following reasons: Firstly, we focused on publicly available models released on Hugging Face, except for GPT-4 Vision and Gemini Pro. Secondly, we included LMMs that perform well on popular multimodal English-only benchmarks such as MMMU (Yue et al., 2023) and MME (Fu et al., 2023). Thirdly, we aimed to cover a mixture of different model families and a broad model size spectrum, including small models with 3B to 9B, medium models with 10B to 19B, and large models with 20B to 40B parameters. For an overview of all models, including their number of parameters and other architectural details, see Table 8.

3.2 Datasets

This section briefly introduces the existing datasets included in our benchmark. In addition to these, we crafted two novel datasets described in Section 4. For details about the languages covered by the datasets, please refer to Table 6.

xGQA

The xGQA dataset (Pfeiffer et al., 2022) is a cross-lingual visual question-answering dataset. Each of the 9666 questions is available in eight languages covering five scripts, while the answers are in English only. The dataset holds 300 unique images from Visual Genome (Krishna et al., 2017).

MaXM

The MaXM dataset was introduced by Changpinyo et al. (2023) and is a VQA dataset comprising seven languages in five scripts. In MaXM, the questions and their respective answers are in the same language. To increase cultural diversity, the images were selected to match the region where the target language is spoken.

XVNLI

The XVNLI dataset (Bugliarello et al., 2022) introduces the task of Cross-lingual Visual Natural Language Inference, where a model needs to predict whether a textual hypothesis entails, contradicts, or is neutral concerning a visual premise. XVNLI comprises five languages covering three scripts and 357 unique images from Visual Genome.

MaRVL

The MaRVL dataset (Liu et al., 2021) aims to benchmark models on Multicultural Reasoning over Vision and Language. A task sample comprises two images, a textual statement, and a binary true or false answer grounded in the images. MaRVL comprises five languages covering three scripts and 4914 culturally diverse images that match the respective languages.

XM3600

The XM3600 dataset (Thapliyal et al., 2022) is a large multilingual image captioning dataset comprising 36 languages with 261,375 captions for 100 unique images per language. The images are selected to match the language's cultural background, ensuring cultural and linguistic diversity.

xFlickrCO

The xFlickrCO dataset (Bugliarello et al., 2022) is an image captioning dataset and comprises 1000 images from Flickr30k (Young et al., 2014) and 1000 images from COCO (Lin et al., 2014). Each image is captioned in eight languages, covering four different scripts.

4 Novel M5 Datasets

In addition to the existing datasets introduced in the previous section, we crafted two novel multimodal and multilingual evaluation datasets. The principal motivation is to address the lack of underrepresented languages, tasks, and cultural diversity in existing vision-language datasets. Moreover, we aim to enable further examination of LMMs and their performance on non-English and non-Western data, with a particular focus on African and Asian regions. Details, statistics, and examples are reported in Appendix B.

Common Characteristics

Languages

Both datasets comprise samples in 12 languages covering seven scripts (see Table 6): Amharic, Berber, Bengali, German, English, Filipino, Hausa, Hindi, Russian, Swahili, Thai, and Zulu. The languages were selected to enrich the set of languages covered by existing datasets, focusing on underrepresented languages from Asian and African countries or regions. To our knowledge, no other visio-linguistic evaluation dataset covers Amharic, Berber, Hausa, or Zulu.

Depicting Cultural Diversity

The images in our datasets originate from the Dollar Street dataset (Gaviria Rojas et al., 2022), comprising around 38K photos taken in 63 different regions or countries around the globe. These photos depict the lives of families, including their homes, neighborhoods, or everyday objects, in a culturally diverse way. Further, each image in the original dataset is tagged with one or more "topics" that roughly describe its visual content.

Image Basis

For our datasets, we sampled a subset of images from the dataset taken in regions where our 12 target languages are spoken. In this subset, which forms the visual basis for both of our datasets and is referred to as $\mathbb{B}$, each image $i_{l}^{t} \in \mathbb{B}$ is tagged with exactly one topic $t \in \mathbb{T} = \{t_{0}, \dots, t_{86}\}$ and was taken in a region $r_{l}$ where language $l \in \mathbb{L} = \{l_{0}, \dots, l_{11}\}$ is spoken.

4.1 M5-VGR

Figure 3: A Zulu example from the novel M5-VGR dataset. Hypothesis: "Isithombe sokuqala nesithombe sesibili sibonisa iqanda elisehhokweni. (The first picture and the second picture show the egg on the head.)", Label: False

Inspired by MaRVL, the goal of the M5-VGR dataset is to provide a visually grounded reasoning (VGR) evaluation dataset that covers a wide range of typologically different languages and, at the same time, visually represents a diverse set of cultures in which the respective languages are spoken. However, since the MaRVL dataset contains only five languages, we chose 11 additional typologically diverse languages for our dataset. To guarantee visual and linguistic diversity and high data quality in our dataset, we hired professional native-speaker annotators of the respective languages to annotate the data. Moreover, we performed several rounds of data quality assessment in close collaboration with the annotators.

A task sample $s$ in M5-VGR contains two images $i_{a}$ and $i_{b}$, a textual visually grounded hypothesis $h$, and a binary label $c$, which is either true or false concerning the two visual premises (see Figure 3). More specifically, for each language $l \in \mathbb{L}$, we created 120 tasks $s_{l} \in \mathbb{S}_{l}$ as follows: In the first step, we sampled 120 unique images $a_{l}^{t} \in \mathbb{B}$ from our image basis so that each topic $t \in \mathbb{T}$ occurs at least once across all 12 languages. Then, for each of the 120 images, we randomly selected another image $b_{l_{2}}^{t} \in \mathbb{B}$ associated with another language $l_{2} \in \mathbb{L}$, $l_{2} \neq l$, that shares the topic $t$. In the third step, we asked the native-speaker annotators of the language $l$ to manually create a hypothesis $h$ and a label $c$, which is either true or false concerning the image premises $\left(a_{l}^{t}, b_{l_{2}}^{t}\right)$. Further, the annotators were instructed to generate a hypothesis semantically related to the topic $t$ if possible.
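The first two sampling steps can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout of `image_basis`, the function name, and the toy image identifiers are hypothetical. The sketch does not enforce the constraint that every topic occurs at least once across all languages, and the third step (writing the hypothesis and label) is manual annotation and is not modeled.

```python
import random
from collections import defaultdict


def sample_vgr_pairs(image_basis, language, n_samples, seed=0):
    """Pair sampled images of `language` with same-topic images of other languages.

    `image_basis` is a list of (image_id, language, topic) tuples, a
    hypothetical stand-in for the image basis B.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for image_id, lang, topic in image_basis:
        by_topic[topic].append((image_id, lang))

    # Step 1: sample unique first images a_l^t taken where `language` is spoken.
    own = [(i, t) for i, l, t in image_basis if l == language]
    first_images = rng.sample(own, min(n_samples, len(own)))

    pairs = []
    for image_id, topic in first_images:
        # Step 2: pick a second image b_{l2}^t with the same topic, l2 != l.
        candidates = [i for i, l in by_topic[topic] if l != language]
        if candidates:
            pairs.append((image_id, rng.choice(candidates), topic))
    return pairs


basis = [
    ("zu_1", "zu", "eggs"), ("sw_1", "sw", "eggs"),
    ("zu_2", "zu", "soap"), ("hi_1", "hi", "soap"),
    ("zu_3", "zu", "soap"),
]
pairs = sample_vgr_pairs(basis, "zu", n_samples=2)
```

Each returned triple holds the two image premises and their shared topic, which an annotator would then turn into a hypothesis and a true/false label.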

4.2 M5-VLOD

Figure 4: A Swahili example from the novel M5-VLOD dataset. Hypothesis: "Picha zote zinaonyesha sabuni inayotumika kwa mikono na mwili bila mtu yeyote. (All the images show soap applied to the hands and body without anyone.)", Outlier: 1.

With the M5-VLOD dataset, we introduce a novel multimodal task: Visio-Linguistic Outlier Detection. The objective of the task is to detect an outlier image from a set of images considering a textual statement. An example of the task is shown in Figure 4, where five images related to the topic "soap for hands and body" are shown. The machine-translated English statement is: "All the images show soap applied to the hands and body without anyone." Because only the first image shows a person, the statement is incorrect for the first image, which is therefore considered the outlier image.

The dataset was collected similarly to M5-VGR, as described in the previous section. The major difference is that instead of sampling only one image in the second step, we sample four images so that a sample $s_{l_{0}} \in \mathbb{S}_{l_{0}}$ for language $l_{0} \in \mathbb{L}$ comprises five images $\{a_{l_{0}}^{t}, b_{l_{1}}^{t}, c_{l_{2}}^{t}, d_{l_{3}}^{t}, e_{l_{4}}^{t}\}$ associated with five different languages $\{l_{0}, \dots, l_{4}\} \subset \mathbb{L}$ that share one topic $t \in \mathbb{T}$. In the third step, we asked the native-speaker annotators of the language $l_{0}$ to manually create a textual statement $h$ that is valid for all but one of the images; the remaining image is labeled as the outlier image.
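A minimal sketch of how such a five-image sample could be drawn, and how predictions could be scored, is given here. The data layout, function names, and toy identifiers are assumptions made for illustration and do not reproduce the authors' code. With five candidate images, a uniformly random guess is correct with probability 1/5 = 0.2.

```python
import random


def sample_vlod_images(image_basis, language, topic, seed=0):
    """Return five same-topic image ids, each from a different language.

    The first image comes from `language`; the other four come from four
    distinct other languages. Returns None if too few languages share the topic.
    `image_basis` is a hypothetical list of (image_id, language, topic) tuples.
    """
    rng = random.Random(seed)
    by_language = {}
    for image_id, lang, image_topic in image_basis:
        if image_topic == topic:
            by_language.setdefault(lang, []).append(image_id)

    other_languages = sorted(l for l in by_language if l != language)
    if language not in by_language or len(other_languages) < 4:
        return None
    chosen = [language] + rng.sample(other_languages, 4)
    return [rng.choice(by_language[lang]) for lang in chosen]


def outlier_accuracy(predicted, gold):
    """Fraction of samples whose predicted outlier index equals the gold index."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


basis = [(f"{lang}_0", lang, "soap") for lang in ["sw", "am", "ha", "hi", "th", "zu"]]
images = sample_vlod_images(basis, "sw", "soap")
```

The annotator then writes one statement that holds for four of the five images, and the index of the remaining image becomes the gold label passed to `outlier_accuracy`.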

5 General Results Discussion

Model           xGQA        MaXM        XVNLI       MaRVL       M5-VLOD     M5-VGR      xFlickrCO   XM3600      ALL
                E     NE    E     NE    E     NE    E     NE    E     NE    E     NE    E     NE    E     NE    E     NE    Δ
CogVLM          0.59  0.30  0.43  0.02  0.47  0.29  0.60  0.51  0.10  0.08  0.68  0.55  0.87  0.60  0.88  0.65  0.58  0.38  -0.20
BakLLaVA        0.62  0.32  0.53  0.08  0.48  0.34  0.59  0.53  0.14  0.20  0.71  0.48  0.91  0.63  0.88  0.64  0.61  0.40  -0.21
LLaVA 1.6 7B    0.60  0.34  0.34  0.16  0.59  0.45  0.62  0.53  0.14  0.21  0.55  0.42  0.88  0.64  0.88  0.67  0.57  0.43  -0.15
LLaVA 1.5 7B    0.62  0.30  0.52  0.15  0.60  0.47  0.57  0.52  0.15  0.20  0.48  0.42  0.92  0.68  0.89  0.67  0.59  0.43  -0.17
Yi-VL 6B        0.57  0.32  0.53  0.20  0.56  0.38  0.59  0.53  0.20  0.19  0.73  0.61  0.91  0.64  0.91  0.66  0.62  0.44  -0.18
MiniCPM-V       0.55  0.31  0.56  0.19  0.66  0.49  0.61  0.53  0.20  0.20  0.80  0.56  0.91  0.65  0.90  0.65  0.65  0.45  -0.20
LLaVA 1.5 13B   0.62  0.34  0.56  0.19  0.59  0.49  0.60  0.54  0.16  0.21  0.57  0.46  0.91  0.69  0.90  0.69  0.61  0.45  -0.16
Qwen-VL         0.59  0.33  0.50  0.23  0.62  0.54  0.60  0.53  0.16  0.21  0.82  0.54  0.89  0.62  0.90  0.65  0.64  0.46  -0.18
Yi-VL 34B       0.58  0.38  0.53  0.20  0.59  0.51  0.62  0.58  0.26  0.19  0.77  0.52  0.91  0.64  0.90  0.66  0.65  0.46  -0.19
Gemini Pro V    0.46  0.34  0.48  0.23  0.49  0.49  0.55  0.55  0.52  0.36  0.79  0.66  0.86  0.67  0.63  0.41  0.60  0.46  -0.13
OmniLMM 12B     0.49  0.36  0.48  0.11  0.64  0.54  0.64  0.56  0.19  0.21  0.78  0.59  0.91  0.66  0.89  0.68  0.63  0.46  -0.16
LLaVA 1.6 13B   0.65  0.38  0.46  0.24  0.61  0.55  0.65  0.65  0.14  0.21  0.78  0.50  0.90  0.67  0.88  0.68  0.63  0.48  -0.15
mBLIP BloomZ    0.44  0.39  0.55  0.29  0.40  0.44  0.55  0.56  0.14  0.21  0.69  0.56  0.92  0.72  0.91  0.71  0.58  0.49  -0.09
InternVL V1.1   0.63  0.48  0.58  0.34  0.61  0.56  0.63  0.60  0.13  0.21  0.73  0.62  0.92  0.66  0.91  0.68  0.64  0.52  -0.12
LLaVA 1.6 34B   0.65  0.46  0.58  0.32  0.62  0.58  0.64  0.66  0.26  0.22  0.87  0.64  0.89  0.68  0.88  0.70  0.67  0.53  -0.14
mBLIP mT0       0.44  0.40  0.50  0.42  0.59  0.57  0.60  0.63  0.12  0.17  0.74  0.69  0.92  0.73  0.91  0.71  0.60  0.54  -0.07
InternVL V1.2+  0.67  0.43  0.60  0.42  0.63  0.58  0.68  0.61  0.28  0.23  0.86  0.68  0.92  0.71  0.90  0.70  0.69  0.55  -0.15
GPT 4V          0.45  0.41  0.49  0.53  0.69  0.68  0.64  0.66  0.70  0.42  0.88  0.81  0.90  0.70  0.89  0.72  0.70  0.62  -0.09
Average         0.57  0.37  0.51  0.24  0.58  0.50  0.61  0.57  0.22  0.22  0.73  0.57  0.90  0.67  0.88  0.66  0.63  0.47  -0.15
Table 1: Average performance in English (E) and non-English languages (NE) on all datasets for all models. For each dataset and the Δ column, the heatmaps are created individually, indicated by the column gutter. The column "ALL" represents the average across all datasets. For xFlickrCO and XM3600, we report BERTScore F1, and for the rest of the datasets, we report the relaxed accuracy.

This section discusses the models' performance on the datasets considered in our benchmark. Table 1 provides an overview of the performance in English compared to non-English languages for all models and datasets. Note that we use friendly names for the models for better readability (see Table 8). Detailed results for each dataset and all their respective languages are provided in Appendix D.

5.1 Summary of Findings

Table 1 shows a clear pattern: LMMs generally perform considerably worse in non-English languages across all tasks. More specifically, the average performance across all models and datasets is 0.63 in English versus 0.47 in non-English languages. Most models have an average performance difference from English to non-English of at least 0.12. However, for GPT 4V and, despite their much smaller size, also for mBLIP BloomZ and mBLIP mT0, the difference is smaller than 0.1. For the two mBLIP models, the authors explicitly stated the language distribution of the training data, which covers 96 languages. Hence, it can be assumed that this is the reason for the small absolute performance difference, and, further, this might indicate that GPT 4V was also trained in a multilingual fashion. Given the differences in size and architecture between the mBLIP models and GPT 4V (while the architecture of GPT 4V is not known, it is likely different from the mBLIP models' architecture, which employs Q-Formers, rarely used in state-of-the-art LMMs), this suggests that applying such a multilingual training strategy to LMMs would generally lead to more robust multilingual performance.
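As a worked example, the "ALL" columns and the gap Δ can be recomputed from the per-dataset scores in Table 1. The sketch assumes that "ALL" is the unweighted mean over the eight datasets and that Δ is the non-English average minus the English average. Since the inputs are the rounded table values, recomputed numbers can deviate from the reported ones by about 0.01. The function name `english_gap` is our own.

```python
from statistics import mean

# (E, NE) scores per dataset, copied from Table 1 in this dataset order:
# xGQA, MaXM, XVNLI, MaRVL, M5-VLOD, M5-VGR, xFlickrCO, XM3600
scores = {
    "GPT 4V": [(0.45, 0.41), (0.49, 0.53), (0.69, 0.68), (0.64, 0.66),
               (0.70, 0.42), (0.88, 0.81), (0.90, 0.70), (0.89, 0.72)],
    "CogVLM": [(0.59, 0.30), (0.43, 0.02), (0.47, 0.29), (0.60, 0.51),
               (0.10, 0.08), (0.68, 0.55), (0.87, 0.60), (0.88, 0.65)],
}


def english_gap(per_dataset):
    """Return (mean English score, mean non-English score, NE minus E)."""
    english = mean(e for e, _ in per_dataset)
    non_english = mean(ne for _, ne in per_dataset)
    return english, non_english, non_english - english


for model, per_dataset in scores.items():
    e, ne, delta = english_gap(per_dataset)
    print(f"{model}: E={e:.2f} NE={ne:.2f} delta={delta:+.2f}")
```

A Δ close to zero, as for GPT 4V, means the model loses little when moving away from English, whereas CogVLM loses roughly a fifth of a point on the 0 to 1 scale.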

The average performance difference of the models is largest on the MaXM, XM3600, and xFlickrCO datasets, for which the models are required to generate non-English text.

Interestingly, for the M5-VLOD dataset, the models that performed worse than the random baseline of 0.2 in English performed better in non-English languages. One possible explanation is that the models draw false assumptions from the English text. This finding also explains why the average English and non-English performance across all models is equal on this dataset and lies around the random baseline, indicating the challenge introduced by our dataset.

5.1.1 Dataset-Specific Discussion

Note that due to brevity constraints, we report exact numbers and diagrams of the language-specific results for each dataset in Appendix D.

xGQA

Most models perform best in English, with a gap in accuracy to the second-best language as large as that between 0.62 in English and 0.36 in Russian for LLaVA 1.6 7B. Bengali has the lowest average accuracy (0.19); all models except GPT 4V, which achieves 0.44, perform worst by far in this language. The best-performing model in English and the best-performing model on average over all languages are InternVL V1.2+ and InternVL V1.1, respectively. Notably, despite their (estimated) much larger size, GPT 4V and Gemini Pro V are among the worst-performing models in English. After manually inspecting the results, we found the reason to be that these models did not respond with a single word but with a brief sentence, which is considered a false answer according to the applied metric (see Appendix A.2 and Section 8.2).
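To illustrate why a sentence-length response is penalized, consider a deliberately simple matching rule. This is a hypothetical sketch and not the metric defined in Appendix A.2; the function names and the normalization steps are assumptions chosen only to show the effect.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(prediction: str, gold: str) -> bool:
    """Count a prediction as correct only if it equals the gold answer after normalization."""
    return normalize(prediction) == normalize(gold)


short_answer = exact_match("Red.", "red")
sentence_answer = exact_match("The bus is red.", "red")
```

Under such a rule, `short_answer` is counted as correct while `sentence_answer` is not, even though both responses contain the right information.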

MaXM

The average accuracy of the models for Hindi (0.22), Hebrew (0.19), Romanian (0.27), Thai (0.25), and Chinese (0.24) is much lower than for English (0.51) and French (0.35). It is also worth pointing out that most models, regardless of their size, perform remarkably worse in languages other than English (and French). In contrast, on xGQA, which is also a VQA dataset, the differences between the languages are much smaller. This is likely due to a difference between the two datasets: xGQA has multilingual questions but only English answers, while MaXM has multilingual questions and also expects the answers in the respective language. We further underline this in our language fidelity analysis in Section 6.3.

XVNLI

English accuracy is the best for most models, with an average of 0.58, whereas Arabic accuracy is the worst, with an average of 0.43. The performance drop from English to the other languages, i.e., Spanish (0.51), French (0.52), and Russian (0.52), is less substantial. Note that XVNLI is an NLI dataset, i.e., the random baseline is at 1/3. All models surpass this baseline in all languages, except for CogVLM in Arabic (0.26) and French (0.27). The best-performing model is GPT 4V with an average accuracy across all languages of 0.68, followed by LLaVA 1.6 34B and InternVL V1.2+ with average scores of 0.59 and 0.58, respectively.

MaRVL

The dataset’s random baseline is 0.5, which most models surpass only slightly, especially for Swahili and Tamil, with average accuracies of 0.53 and 0.54, respectively. Notably, only 8 of 18 models perform best in English, with an average accuracy of 0.61. For the other models, the English performance is surpassed by Chinese, Indonesian, or Turkish, with average accuracies of 0.60, 0.60, and 0.59, respectively. GPT 4V is on par with LLaVA 1.6 34B despite the latter having far fewer parameters.

M5-VGR

As with MaRVL, this dataset’s random baseline is at 0.5. Only one of 18 models, i.e., InternVL V1.2+, could surpass or reach this baseline in all languages. As expected, most models performed best in English, German, or Russian, with average accuracies of 0.73, 0.68, and 0.69, respectively. They performed worst in low-resource languages such as Amharic, Berber, Bengali, Hausa, or Zulu, with an average accuracy of 0.53, 0.49, 0.55, and 0.52, respectively. Only three models, i.e., Gemini Pro V, mBLIP mT0, and GPT 4V, consistently and significantly surpass the random baseline in all languages except for Berber. The only languages where the average performance is significantly higher than the 0.5 random baseline are English (0.73), German (0.68), Russian (0.69), and Thai (0.62). The average scores of the other languages range from 0.49 in Berber to 0.57 in Hindi.

M5-VLOD

The dataset’s random baseline is 0.2 since the models need to find the outlier within five images. Only GPT 4V and Gemini Pro V significantly surpassed that baseline in all languages, with an average accuracy of 0.42 and 0.36, respectively. They achieve their best scores in English, with an accuracy of 0.70 (GPT 4V) and 0.52 (Gemini Pro V). However, in Berber, both models only achieve scores around the random baseline. In all languages, including English, none of the other models surpasses the random baseline by more than 0.1, with average scores between 0.08 (CogVLM) and 0.23 (InternVL V1.2+). This highlights the challenge introduced by our dataset and the performance gap between proprietary and open-source models.

xFlickrCO

The majority of models perform best in English, often with a significant margin in average chrF++, i.e., 24.93 in English and 12.49 in non-English languages. Other languages where the models perform comparably well are German and Spanish, with average chrF++ scores of 19.95 and 19.55, respectively. Interestingly, all models perform worse in non-Latin script languages, i.e., Russian (9.70), Chinese (4.53), and Japanese (4.05). Unexpectedly, the proprietary models GPT 4V and Gemini Pro V are surpassed by mBLIP BloomZ, mBLIP mT0, and InternVL V1.2+, which are much smaller open-source models. Even in English, most open-source models outperform the proprietary models.

XM3600

Note that due to limited resources, we evaluated GPT 4V only on a subset of 12 of 36 languages. Most models perform best in English (27.14 average chrF++) by a large margin, followed by other high-resource Latin-script languages such as French (23.65), Spanish (23.52), or Dutch (21.01). On average, the models perform worst on non-Latin script languages like Korean (3.50), Telugu (4.79), and Bengali (5.11). However, although the chrF++ metric is intended to be script- and language-independent, the low scores in high-resource languages like Chinese (3.95) and Japanese (5.13) call its suitability into question. While a detailed analysis is out of the scope of this work, we will investigate this issue further in future work (see Section 8.1).

6 Aggregated Result Analyses

6.1 Performance per Language

Figure 5 shows the average performances aggregated by language (for better readability, we do not show all 36 languages of XM3600) or by language taxonomy classes Joshi et al. (2020). These taxonomy classes indicate how well a respective language is represented and considered within the research field of NLP based on papers published at *CL conferences. High-resource languages such as English or German are in Class 5, whereas low-resource languages such as Berber are in Class 0. For details about the languages and their taxonomy classes, please refer to Table 12.

As can be observed from Figure 5(a) and Figure 5(b), the models perform best in English, followed by other European languages across all datasets. Our newly presented M5-VLOD dataset is an exception, where the average performance for all languages is around the random baseline, indicating the challenge it implies. As expected, the models consistently perform worse on low-resource languages than on high-resource languages on all datasets. This is also displayed in Figure 5(c), where it can be observed that the average performance decreases with the language taxonomy class. Note that this is not precisely true for xFlickrCO and XVNLI because the average on Class-5 languages is lowered by outliers, as indicated by the large error bars. In contrast, the models performed comparably well in only one Class 3 or 4 language, respectively.

(a) Performance on VQA, VGR, and VNLI datasets aggregated by language.
(b) Performance on image captioning datasets aggregated by language.
(c) Performance on datasets aggregated by language taxonomy class as introduced by Joshi et al. (2020).
Figure 5: Models’ performances on all datasets aggregated by language or language taxonomy classes.

6.2 Performance vs. Model Parameters

In Figure 6, we plot the English and non-English average performance on the employed datasets versus the models’ sizes in multiple regression plots. Note that, on the x-axes, we indicate the unknown sizes of GPT 4V and Gemini Pro V by “???”. Both models are estimated to be orders of magnitude larger than all other models evaluated in this benchmark and hence should be placed much further to the right. However, we did not do so to keep the plots readable.

In the figures, we can make several observations: Firstly, the average English performance is higher than the non-English performance for all models on all datasets. Secondly, the markers, which represent the average performance of a specific model on a dataset, show that the largest model does not always perform best and that the difference between smaller and larger models is often negligible. The same finding is reflected in the relatively flat slope of the regression lines. However, for the M5-VLOD and M5-VGR datasets, the regression line for the average English scores is steeper, meaning that larger models perform considerably better than the smaller models. Since this work introduces both datasets and M5-VLOD even introduces a novel task, this suggests that larger models generalize better to unseen data.
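To make the notion of a "flat" versus "steep" regression line concrete, the following minimal sketch fits an ordinary least-squares line to score versus model size. The data points are synthetic placeholders and not results from this benchmark, and regressing on the base-10 logarithm of the parameter count is an assumption made here for illustration; the text does not state how the x-axes of the plots are scaled.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder points (NOT results from the paper):
# model sizes in billions of parameters and made-up average scores.
sizes_b = [1.0, 10.0, 100.0]
scores = [0.40, 0.50, 0.60]

# Assumption: regress on log10(size) so that each tenfold increase in
# parameters corresponds to one unit on the x-axis.
slope, intercept = fit_line([math.log10(s) for s in sizes_b], scores)
```

Under this setup, a slope close to zero would correspond to the flat regression lines described above, while a larger positive slope would correspond to the steeper English lines reported for M5-VLOD and M5-VGR.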

Figure 6: Regression plots showing the English and average non-English performance versus model size on different datasets. On the x-axis, we indicated the unknown sizes of GPT 4V and Gemini Pro V by “???”.

6.3 Language Fidelity Analysis

Inspired by Holtermann et al. (2024), we report the results of a language fidelity analysis, which assesses how often a model responds in the requested language on average. For this, we used GlotLIDv3 Kargaran et al. (2023) to predict the language based on the output text of the respective models. Since it is hard to predict the language of a word or a multi-word expression due to ambiguity, we selected the xFlickrCO dataset, where the expected response of a model is an image caption, i.e., a sentence, in one of eight languages. As can be observed from Table 2, all models achieve (almost) perfect fidelity in English, whereas for Japanese, Russian, and Turkish, the average fidelity drops to about two-thirds. Interestingly, the small-sized mBLIP models have almost perfect fidelity in all languages, (slightly) surpassing larger models like InternVL V1.2+ and GPT 4V.
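The fidelity score described above can be sketched as the share of responses whose identified language equals the requested one. The function names below are hypothetical, and `toy_identifier` is only a stand-in for a real language identifier such as GlotLID; it is not the tool's actual interface.

```python
from collections import defaultdict


def language_fidelity(samples, identify_language):
    """Return {requested_lang: share of outputs identified as that language}.

    `samples` is an iterable of (requested_lang, output_text) pairs and
    `identify_language` is any callable mapping text to a language code.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for requested_lang, output_text in samples:
        totals[requested_lang] += 1
        if identify_language(output_text) == requested_lang:
            hits[requested_lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


def toy_identifier(text):
    """Hypothetical stand-in for a real language identification model."""
    return "de" if "Hund" in text else "en"


# Made-up captions for illustration only.
samples = [
    ("de", "Ein Hund läuft über die Wiese."),
    ("de", "A dog runs across a meadow."),  # model answered in English
    ("en", "A dog runs across a meadow."),
]
fidelity = language_fidelity(samples, toy_identifier)
```

In this toy input, one of the two German requests is answered in English, so the German fidelity is one half while the English fidelity is perfect.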

Table 2: Language fidelity results on the xFlickrCO dataset. Columns are the requested languages; the last column is the average over languages.
Model            zh   en   de   id   ja   ru   es   tr   Avg.
BakLLaVA         .00  1.0  .39  .06  .00  .00  .44  .00  .24
Yi-VL 6B         .14  1.0  .20  .00  .20  .01  .57  .00  .28
Qwen-VL          .95  .99  .18  .11  .15  .08  .15  .07  .33
Yi-VL 34B        .43  1.0  .79  .45  .58  .22  .25  .33  .51
CogVLM           .44  .95  .74  .76  .38  .43  .82  .54  .63
LLaVA 1.5 13B    .88  1.0  .75  .55  .90  .26  .75  .40  .69
LLaVA 1.5 7B     .83  1.0  .96  .83  .09  .22  .97  .67  .70
MiniCPM-V        .21  1.0  .93  .79  .89  .96  .91  .68  .80
LLaVA 1.6 7B     .99  .99  .66  .91  .59  .88  .91  .89  .85
InternVL V1.1    .96  1.0  .93  .78  .88  .89  .97  .66  .89
OmniLMM 12B      .63  1.0  .95  .92  .83  .92  .98  .88  .89
Gemini Pro       .95  .95  .95  .88  .91  .96  .97  .96  .94
LLaVA 1.6 13B    1.0  1.0  .90  .96  .91  .87  .97  .93  .94
LLaVA 1.6 34B    .88  1.0  .99  .99  .86  .99  .99  .99  .96
GPT 4V           .97  1.0  1.0  .98  .88  .99  .99  1.0  .98
InternVL V1.2+   .99  1.0  1.0  .95  .97  .99  .99  .96  .98
mBLIP BloomZ     .96  1.0  1.0  .99  .99  1.0  1.0  .99  .99
mBLIP mT0        .96  1.0  1.0  .99  .99  1.0  1.0  1.0  .99
Avg.             .73  .99  .79  .72  .67  .65  .81  .66  .75

While the language fidelity of a model focuses on the generated text, we argue that the fidelity is also an indicator of the model’s general language capabilities. To test this hypothesis, we computed Pearson correlation coefficients between the reported fidelity and the models’ performance on the datasets for the xFlickrCO languages. As shown in Table 17, there is a moderate or high positive correlation between the average fidelity and the average score for most datasets. However, for xGQA and M5-VLOD, there is only a minor positive average correlation.
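As a minimal sketch of the correlation computation, the Pearson coefficient can be implemented in a few lines. The inputs below are the language-level averages quoted in this paper (average fidelity and average xFlickrCO chrF++ for English, German, Spanish, Russian, Chinese, and Japanese); they are used here only to exercise the function, and this is not the computation behind Table 17, whose exact aggregation is not specified in this section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Order: en, de, es, ru, zh, ja (language-level averages quoted in the text).
avg_fidelity = [0.99, 0.79, 0.81, 0.65, 0.73, 0.67]
avg_chrf = [24.93, 19.95, 19.55, 9.70, 4.53, 4.05]

r = pearson(avg_fidelity, avg_chrf)
```

A value of `r` close to 1 indicates that languages with higher fidelity also tend to have higher scores; with only six points, such a coefficient is illustrative rather than statistically robust.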

7 Conclusion

We introduced M5, a diverse benchmark in which we evaluated 18 Large Multimodal Models (LMMs) with varying sizes across five visio-linguistic tasks in eight datasets comprising 41 unique languages. Further, we presented two novel datasets, M5-VGR and M5-VLOD, which focus on underrepresented languages and depict culturally diverse scenes. With M5-VLOD, we introduce a new visio-linguistic outlier detection task in which only proprietary models achieve reasonable scores. Our experiments revealed that model size does not always correlate with better performance, especially in non-English languages, underscoring the importance of diverse, multilingual training data and robust architectures. Performance disparities were prominent between high-resource languages like English and low-resource languages across all datasets and models, highlighting ongoing challenges in achieving globally equitable multilingual AI. With M5, we aim to impel the development of more inclusive models suitable for diverse languages and cultures.

8 Limitations

This section outlines several limitations of our current study that will be addressed in future work.

8.1 Metrics for Multilingual Image Captioning

Our benchmark and current research generally lack robust metrics for evaluating multilingual image captioning, especially for non-Latin script languages. The issue, which is the same for machine translation tasks, arises because of the nature of most metrics, such as chrF Popović (2017), CIDEr Vedantam et al. (2015), ROUGE Lin (2004), BLEU Papineni et al. (2002), or METEOR Banerjee and Lavie (2005), which are based on comparing word or character n-grams between the generated and the reference sequence. For non-Latin scripts, tokenization or segmentation can be challenging because the text might not contain spaces or punctuation, or the characters are logographic. Hence, the usability and effectiveness of these metrics are doubtful in such scenarios because they rely on tokenization.
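The tokenization issue can be illustrated with a small sketch. It assumes naive whitespace tokenization, which is a simplification chosen here for illustration; actual metric implementations may tokenize differently. The two captions are made-up examples.

```python
def word_ngrams(text, n):
    """Word n-grams under naive whitespace tokenization (an assumption)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


english = "a dog runs across the meadow"
chinese = "一只狗跑过草地"  # written without spaces

english_bigrams = word_ngrams(english, 2)   # five word bigrams
chinese_unigrams = word_ngrams(chinese, 1)  # the whole caption is one "word"
chinese_bigrams = word_ngrams(chinese, 2)   # no word bigrams at all
chinese_char_bigrams = char_ngrams(chinese, 2)
```

Under whitespace splitting, the Chinese caption collapses into a single token, so word-level n-gram overlap becomes all-or-nothing, whereas character n-grams still provide some signal.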

Other metrics, such as BERTScore Zhang et al. (2020), CLIPScore Hessel et al. (2021), or COMET Rei et al. (2020), do not rely on the captions’ surface forms but on their token or sentence embeddings. However, they suffer from other issues: They require strong multilingual or cross-lingual encoder models capable of computing embeddings for many languages, which itself is a challenging task. Further, the scores computed with these metrics are often not calibrated across languages and thus not directly comparable between different languages.

A promising and currently popular solution might be the use of robust multilingual state-of-the-art LLMs such as GPT 4o (https://openai.com/index/hello-gpt-4o/), Claude 3 Opus (https://www.anthropic.com/news/claude-3-family), or Gemini 1.5 Ultra (https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/) as a judge Zheng et al. (2024). However, this would require more computational and financial resources and, most importantly, more investigation.

8.2 VQA Metrics for Generative Models

The problem when employing and evaluating generative language models on question-answering tasks is that the models can generally output arbitrary token sequences. However, the gold label answers are limited and often comprise only a short phrase, a single word, or even a binary label. Hence, mapping the predicted answers to their gold labels is not straightforward, and the difficulty drastically increases in multilingual scenarios. The relaxed accuracy metric employed in this study (see Section A.1) occasionally classifies correct answers as incorrect, leading to false negatives, especially in open vocabulary visual question answering (VQA). One way to address this issue is to leverage strong state-of-the-art LLMs as judges, as described above, to enhance the accuracy of the evaluations.

8.3 Influence of Prompting

Another limitation of this and most, if not all, other current studies lies in the model prompting. Since different models might react differently to specific prompting styles, and we only employ a single prompt per dataset for all models (see Figure 7), the results might not be optimal. Note that we do apply the model-specific prompt or chat templates, though. This issue has been partially addressed by Ahuja et al. (2023a) but is out of the scope of this work.

8.4 “Outdated” Models

Since the pace of current research in NLP, CV, and multimodal machine learning is swift, the models employed in our benchmarking exercise might be considered slightly outdated. Note that we considered models released until March 2024. Since then, numerous improved LMMs based on state-of-the-art LLMs, such as Llama 3 (https://ai.meta.com/blog/meta-llama-3), and novel image encoder techniques, such as NaViT Dehghani et al. (2024), have been publicly released. Because this was foreseeable, we designed our benchmark to be easily extendable with newer models, which we will include in future work.

8.5 Small M5 Datasets

This work introduced two datasets, M5-VGR and M5-VLOD, which comprise about 115 samples for each of the 12 languages. Compared to other datasets, they can be considered small. We will increase their sizes in future work to obtain more robust and generalizable results.

8.6 Missing Multimodal and Multilingual Datasets

Currently, the M5 Benchmark comprises 5 text-image tasks, i.e., VQA, VGR, VNLI, VLOD, and image captioning, thus missing other suitable tasks like multimodal and multilingual summarization. Further, other multimodal multilingual VQA and VGR datasets have emerged while writing this paper. We will include both new tasks and new datasets in future versions of M5.

References

  • Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally On Your Phone. ArXiv, 2404.14219.
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. ArXiv, 2303.08774.
  • Ahuja et al. (2023a) Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023a. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore.
  • Ahuja et al. (2023b) Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, et al. 2023b. MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks. arXiv preprint arXiv:2311.07463.
  • AI et al. (2024) 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. Yi: Open Foundation Models by 01.AI. Preprint, arXiv:2403.04652.
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: A Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  • Anil et al. (2023) Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A Family of Highly Capable Multimodal Models. ArXiv, 2312.11805.
  • Awadalla et al. (2023) Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv preprint arXiv:2308.01390.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness From AI Feedback. ArXiv, 2212.08073.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
  • BIG-bench authors (2023) BIG-bench authors. 2023. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research.
  • Bugliarello et al. (2022) Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulić. 2022. IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 2370–2392.
  • Changpinyo et al. (2023) Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, and Radu Soricut. 2023. MaXM: Towards Multilingual Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2667–2682, Singapore.
  • Chen et al. (2023) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2023. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv preprint arXiv:2312.14238.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  • Dehghani et al. (2024) Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. 2024. Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. Advances in Neural Information Processing Systems, 36.
  • Eichenberg et al. (2022) Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2416–2428, Abu Dhabi, United Arab Emirates.
  • Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
  • Gaviria Rojas et al. (2022) William Gaviria Rojas, Sudnya Diamos, Keertan Kini, David Kanter, Vijay Janapa Reddi, and Cody Coleman. 2022. The Dollar Street Dataset: Images Representing the Geographic and Socioeconomic Diversity of the World. Advances in Neural Information Processing Systems, 35:12979–12990.
  • Ge et al. (2024) Zhang Ge, Du Xinrun, Chen Bei, Liang Yiming, Luo Tongxu, Zheng Tianyu, Zhu Kang, Cheng Yuyang, Xu Chunpu, Guo Shuyue, Zhang Haoran, Qu Xingwei, Wang Junjie, Yuan Ruibin, Li Yizhi, Wang Zekun, Liu Yudong, Tsai Yu-Hsuan, Zhang Fengji, Lin Chenghua, Huang Wenhao, Chen Wenhu, and Fu Jie. 2024. CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark. arXiv preprint arXiv:2401.20847.
  • Geigle et al. (2023) Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2023. mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. ArXiv, 2307.06930.
  • Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. ArXiv, 2306.11644.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
  • Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic.
  • Holtermann et al. (2024) Carolin Holtermann, Paul Röttger, Timm Dill, and Anne Lauscher. 2024. Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ. arXiv preprint arXiv:2403.03814.
  • Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. 2024. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. arXiv preprint arXiv:2404.06395.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. ArXiv, 2310.06825.
  • Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online.
  • Kargaran et al. (2023) Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. 2023. GlotLID: Language Identification for Low-Resource Languages. In The 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision (IJCV), 123(1):32–73.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19730–19742.
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755, Zurich, Switzerland.
  • Liu et al. (2021) Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. 2021. Visually Grounded Reasoning across Languages and Cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic.
  • Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved Baselines with Visual Instruction Tuning. ArXiv, 2310.03744.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916, New Orleans, LA, USA.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Vision System Card.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
  • Pfeiffer et al. (2022) Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych. 2022. xGQA: Cross-Lingual Visual Question Answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2497–2511, Dublin, Ireland.
  • Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark.
  • Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online.
  • Thapliyal et al. (2022) Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv, 2307.09288.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, Salt Lake City, UT, USA.
  • Wang et al. (2023) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. 2023. CogVLM: Visual Expert for Pretrained Language Models. ArXiv, 2311.03079.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions. Transactions of the Association for Computational Linguistics (TACL), 2:67–78.
  • Yu et al. (2023) Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. 2023. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849.
  • Yuan et al. (2023) Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, Chen Kai, and Lin Dahua. 2023. MMBench: Is Your Multi-modal Model an All-around Player? arXiv:2307.06281.
  • Yue et al. (2023) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2023. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv preprint arXiv:2311.16502.
  • Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, Online.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-Judge with MT-Bench and ChatBot Arena. Advances in Neural Information Processing Systems, 36.

Appendix A Experimental Setup Details

This section details the employed metrics, prompts, and generation hyperparameters.

Note that we ran all experiments on A6000 (50 GB) and A100 (80 GB) GPUs. The largest evaluated model (40B) fits on an A100.

A.1 Metrics

Following Geigle et al. (2023), we report a relaxed accuracy metric for the xGQA, MaXM, XVNLI, and MaRVL datasets due to the generative nature of the considered models. More specifically, we post-process the generated answers by, e.g., lowercasing, stripping whitespace, and removing punctuation. We then consider the processed generated answer correct if it matches the gold answer or starts or ends with the gold answer. Further, we allow synonyms for boolean and numerical values. Examples can be found in Table 3 in Section A.2.

Inspired by Ahuja et al. (2023b), we report the chrF++ (Popović, 2017) metric for the xFlickrCo and XM3600 datasets.

A.2 Relaxed Accuracy Metric

Table 3: Examples of generated answers considered correct or incorrect in the relaxed accuracy metric used to measure the performance on the xGQA, MaXM, MaRVL, XVNLI, M5-VGR, and M5-VLOD datasets. For more details, please refer to our GitHub repository.
Generated Answer Gold Answer Considered Correct
{Yes, 1, True} true yes
{No, 0, False} false yes
A car. car yes
Yes, it is correct. yes yes
It is not correct, no. no yes
The color of the leaf is green. green yes
There are three birds. three birds yes
Five 5 yes
{yes, true} entailment yes
{no, false} contradiction yes
maybe neutral yes
There are three birds in the image. three birds no
There are three birds. 3 no
three birds 3 no
three birds 3 birds no
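
The matching rule described in Section A.1 and illustrated in Table 3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the synonym table `GOLD_SYNONYMS`, the number-word range (0 to 10), and the ASCII-only punctuation stripping are all assumptions chosen so that the sketch reproduces the rows of Table 3.

```python
import string

# Assumed synonym table, modelled only on the rows of Table 3.
# The lists actually used by the benchmark may differ.
GOLD_SYNONYMS = {
    "true": {"yes", "1"},
    "false": {"no", "0"},
    "entailment": {"yes", "true"},
    "contradiction": {"no", "false"},
    "neutral": {"maybe"},
}

# Assumed numeric synonyms: digits and number words from 0 to 10.
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]
for value, word in enumerate(NUMBER_WORDS):
    GOLD_SYNONYMS.setdefault(str(value), set()).add(word)
    GOLD_SYNONYMS.setdefault(word, set()).add(str(value))

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCTUATION).split())


def relaxed_match(generated: str, gold: str) -> bool:
    """Return True if the generated answer counts as correct."""
    gen, ref = normalize(generated), normalize(gold)
    if not gen or not ref:
        return False
    # Exact match, or the answer starts or ends with the gold answer.
    if gen == ref or gen.startswith(ref) or gen.endswith(ref):
        return True
    # Synonyms are only accepted when they make up the whole answer.
    return gen in GOLD_SYNONYMS.get(ref, set())
```

Applying synonyms only to the whole normalized answer is a design choice of this sketch; it is what makes "three birds" with gold answer "3" count as incorrect while "Five" with gold answer "5" counts as correct, as in Table 3.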

A.3 Prompts

xGQA Question: {QUESTION} Short answer in English:
MaXM Question: {QUESTION} Short answer in {LANGUAGE}:
MaRVL Based on the two images, is it correct to say "{HYPOTHESIS}"? Yes or no? One word answer in English:
XVNLI Is it guaranteed true that "{HYPOTHESIS}"? Yes, no, or maybe? One word answer in English:
M5-VGR Based on the two images, is it correct to say "{HYPOTHESIS}"? Yes or no? One word answer in English:
M5-VLOD Based on the 5 images ordered from top-left to bottom-right, which image does not match the hypothesis "{HYPOTHESIS}"? Choose one from [A, B, C, D, E] and only output a single letter:
xFlickrCo Brief caption in {LANGUAGE}:
XM3600 Brief caption in {LANGUAGE}:
Figure 7: Prompts employed for the different datasets.

Figure 7 presents the dataset-specific textual prompts we used for all models in this benchmark. Note that this does not include model-specific prompt templates, image placeholders, special tags, or symbols, only the "raw" textual prompt, which is then embedded in the template as required by the respective model. The placeholders {QUESTION}, {LANGUAGE}, and {HYPOTHESIS} are replaced by the sample-specific text. The prompts are partially inspired by Geigle et al. (2023) and Bugliarello et al. (2022).
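
As an illustration of how the placeholders could be filled, the following sketch stores the raw prompts of Figure 7 in a dictionary and substitutes the sample-specific text. The names `PROMPTS` and `build_prompt` are hypothetical, the straight double quotes around the hypothesis are an assumption about the exact quote characters, and the model-specific template wrapping is deliberately left out.

```python
# Raw textual prompts per dataset, as listed in Figure 7.
PROMPTS = {
    "xGQA": "Question: {QUESTION} Short answer in English:",
    "MaXM": "Question: {QUESTION} Short answer in {LANGUAGE}:",
    "MaRVL": 'Based on the two images, is it correct to say "{HYPOTHESIS}"? '
             "Yes or no? One word answer in English:",
    "XVNLI": 'Is it guaranteed true that "{HYPOTHESIS}"? '
             "Yes, no, or maybe? One word answer in English:",
    "M5-VGR": 'Based on the two images, is it correct to say "{HYPOTHESIS}"? '
              "Yes or no? One word answer in English:",
    "M5-VLOD": "Based on the 5 images ordered from top-left to bottom-right, "
               'which image does not match the hypothesis "{HYPOTHESIS}"? '
               "Choose one from [A, B, C, D, E] and only output a single letter:",
    "xFlickrCo": "Brief caption in {LANGUAGE}:",
    "XM3600": "Brief caption in {LANGUAGE}:",
}


def build_prompt(dataset: str, **fields: str) -> str:
    """Fill the placeholders of the raw prompt for one sample.

    Plain string replacement is used instead of str.format so that
    braces inside a question or hypothesis are left untouched.
    """
    prompt = PROMPTS[dataset]
    for name, value in fields.items():
        prompt = prompt.replace("{" + name + "}", value)
    return prompt


example = build_prompt("MaXM", QUESTION="What is on the table?", LANGUAGE="French")
```

The resulting string would then be embedded in whatever chat or instruction template the respective model expects.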

A.4 Hyperparameters

This section briefly reports hyperparameters used within our experiments for better reproducibility.

A.4.1 Generation Parameters

We used the same generation hyperparameters to generate responses with all the employed open-source models on all datasets (see Table 4). These are inspired by the default parameters in the "transformers" library (https://huggingface.co/docs/transformers/en/main_classes/text_generation). Because CogVLM does not support beam search, we set "num_beams" to 1 for this model. For GPT 4V and Gemini Pro V, we use the default parameters of the respective Python clients.

Table 4: Generation hyperparameters to generate responses with all the employed models on all datasets.
Parameter Value
num_beams 2
do_sample True
max_new_tokens 50
temperature 1.0
top_k 50
top_p 0.95
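
For reproducibility, the settings of Table 4 can be written as keyword arguments in the form accepted by the `generate` method of the "transformers" library. This is only a sketch: the helper name `generation_kwargs_for` and the model identifier string "CogVLM" are hypothetical, and the override mirrors the statement that beam search is not supported for CogVLM.

```python
# Generation settings from Table 4, expressed as keyword arguments.
GENERATION_KWARGS = {
    "num_beams": 2,
    "do_sample": True,
    "max_new_tokens": 50,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 0.95,
}


def generation_kwargs_for(model_name: str) -> dict:
    """Return a copy of the shared settings with per-model overrides."""
    kwargs = dict(GENERATION_KWARGS)
    if model_name == "CogVLM":
        # Beam search is not supported for CogVLM, so fall back to one beam.
        kwargs["num_beams"] = 1
    return kwargs
```

With a loaded model and prepared inputs, a call could then look like `model.generate(**inputs, **generation_kwargs_for("CogVLM"))`, where `model` and `inputs` stand for objects that are not defined here.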

A.4.2 Image Order for Multi-Image Datasets

Most models employed in our benchmark support only a single image per prompt. For datasets where a sample comprises more than one image, i.e., for MaRVL, M5-VGR, and M5-VLOD, we use the following strategy: We first stack the images horizontally with a gutter of 10 pixels, provide them as a single image in the prompt, and generate the response. Then, we do the same again but stack the images vertically. For M5-VLOD, we also create a stacked image with two columns and three rows. The reported scores are the average of all variants.
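
The geometry of the three stacking variants can be sketched in pure Python. The function below only computes the canvas size and the paste offset of each image; the top-left alignment within each cell, the absence of resizing, and the name `grid_layout` are assumptions of this sketch, since the text does not specify them.

```python
from typing import List, Tuple

Size = Tuple[int, int]    # (width, height) in pixels
Offset = Tuple[int, int]  # (x, y) of the top-left corner


def grid_layout(sizes: List[Size], columns: int,
                gutter: int = 10) -> Tuple[Size, List[Offset]]:
    """Place images row by row in a grid with the given column count.

    columns == len(sizes) gives the horizontal stack, columns == 1 the
    vertical stack, and columns == 2 the two-column grid used for the
    five M5-VLOD images (three rows, last cell empty).
    """
    rows = [sizes[i:i + columns] for i in range(0, len(sizes), columns)]
    n_cols = min(columns, len(sizes))
    # Each column is as wide as its widest image, each row as tall as
    # its tallest image.
    col_widths = [max(row[c][0] for row in rows if c < len(row))
                  for c in range(n_cols)]
    row_heights = [max(height for _, height in row) for row in rows]

    offsets = []
    y = 0
    for r, row in enumerate(rows):
        x = 0
        for c in range(len(row)):
            offsets.append((x, y))
            x += col_widths[c] + gutter
        y += row_heights[r] + gutter

    canvas = (sum(col_widths) + gutter * (n_cols - 1),
              sum(row_heights) + gutter * (len(rows) - 1))
    return canvas, offsets


five = [(100, 80)] * 5
horizontal = grid_layout(five, columns=5)
vertical = grid_layout(five, columns=1)
two_by_three = grid_layout(five, columns=2)
```

The returned offsets could be passed to an imaging library, for example by creating a blank canvas and pasting each image at its offset. Each variant is then scored separately and the scores are averaged.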

Appendix B Dataset Details

B.1 M5-VGR and M5-VLOD Details

B.1.1 M5-VGR Examples

Figure 8: Amharic M5-VGR Sample.
Figure 9: Bengali M5-VGR Sample.
Figure 10: Berber M5-VGR Sample.
Figure 11: English M5-VGR Sample.
Figure 12: Filipino M5-VGR Sample.
Figure 13: German M5-VGR Sample.
Figure 14: Hausa M5-VGR Sample.
Figure 15: Hindi M5-VGR Sample.
Figure 16: Russian M5-VGR Sample.
Figure 17: Swahili M5-VGR Sample.
Figure 18: Thai M5-VGR Sample.
Figure 19: Zulu M5-VGR Sample.

B.1.2 M5-VLOD Examples

Figure 20: Amharic M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 21: Bengali M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 22: Berber M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 23: English M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 24: Filipino M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 25: German M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 26: Hausa M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 27: Hindi M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 28: Russian M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 29: Swahili M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 30: Thai M5-VLOD Sample. The images are ordered from top-left to bottom-right.
Figure 31: Zulu M5-VLOD Sample. The images are ordered from top-left to bottom-right.

B.1.3 Topics

Table 5: Number of images tagged with a certain topic in the M5-VGR (A) and M5-VLOD (B) datasets.
Topic Language
Amharic Berber Bengali German English Filipino Hausa Hindi Russian Swahili Thai Zulu
A B A B A B A B A B A B A B A B A B A B A B A B
armchair 1 2 1 1 1 1 1 1 1 2 3 1 1 1 1 1 2 1 3 1 1 1 1 1
backyard 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
bathroom privacy 1 1 3 3 1 1 2 1 1 1 3 4 1 1 1 1 1 1 2 1 1 1 1 1
bathroom/toilet 1 2 3 1 1 2 1 3 2 1 1 1 2 1 1 1 1 1 3 3 1 1 1 1
bed 1 1 1 2 2 1 1 1 1 3 1 2 4 1 1 1 2 2 1 1 4 1 1 1
bedroom 2 4 1 2 2 2 1 1 1 2 1 1 3 1 1 2 1 1 2 1 1 1 1 1
books 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
ceiling 1 2 1 1 2 1 1 1 2 2 1 1 1 4 2 1 2 1 2 2 2 1 1 2
children room 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1
cleaning equipment 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
cooking pots 1 1 2 2 2 1 1 1 1 1 2 1 2 2 1 1 2 1 1 2 1 2 1 1
cooking utensils 1 1 3 2 1 3 1 1 1 1 1 1 1 3 2 1 1 1 1 1 2 1 1 1
couch 1 1 1 1 1 1 1 2 2 1 3 3 3 1 1 1 2 1 2 1 1 1 3 1
cups/mugs/glasses 1 1 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
cutlery 1 1 1 1 1 1 1 1 2 1 3 1 1 1 1 2 1 1 1 1 1 2 1 1
dish racks 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 1
dish washing brush/cloth 2 1 1 1 3 1 1 1 1 3 1 1 1 1 3 2 1 2 1 1 1 1 1 1
dish washing soap 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 2 2 2 2 2 1 2 1 1
drainage 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1
drinking water 3 4 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 4 2
drying 3 1 1 1 5 1 1 1 1 1 2 2 1 2 3 1 1 1 1 1 1 1 1 1
everyday shoes 1 2 1 2 2 1 3 1 1 1 2 1 2 3 1 1 1 2 1 2 2 1 2 2
family 2 2 4 1 2 1 3 2 1 1 2 1 3 3 1 1 1 2 1 1 2 2 2 2
floor 1 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1
front door 2 1 4 1 1 1 1 1 1 1 1 1 1 3 2 1 1 1 4 1 3 2 1 1
grains 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1
guest bed 3 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1
hair brush/comb 1 1 1 1 3 1 2 3 1 1 1 1 2 2 1 1 2 1 1 2 1 1 3 1
hand back 1 2 1 1 1 1 1 1 1 1 2 2 3 1 2 1 2 1 1 1 1 2 1 2
hand palm 1 1 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1
hand washing 2 1 3 2 1 1 1 5 1 1 1 4 2 1 1 3 2 2 2 1 2 1 2 3
home 1 1 3 2 2 1 2 1 1 2 1 4 1 1 5 1 1 2 1 2 2 2 1 2
jewelry 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1
kitchen 2 1 1 2 1 1 1 1 4 1 2 2 1 1 1 1 2 2 1 2 2 2 1 1
kitchen sink 1 2 2 2 1 1 4 2 1 2 2 1 1 2 1 2 1 1 1 3 1 1 3 3
light source in kitchen 1 1 2 1 2 3 2 2 3 2 1 1 1 1 3 1 1 1 2 1 1 1 1 1
light source in livingroom 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1
living room 1 1 1 1 1 1 1 2 2 1 3 1 1 1 1 1 2 2 1 1 2 1 1 1
lock on front door 1 1 1 1 1 4 1 1 3 1 1 1 2 1 1 3 1 2 1 1 2 3 1 1
make up 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1
meat or fish 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
medication 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1
most loved item 1 1 1 1 1 2 3 1 2 2 3 3 2 1 2 2 2 2 2 0 1 1 4 4
most loved toy 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1
nicest shoes 1 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 2 2 2 1 1
oven 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1
paper 2 1 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
pen/pencils 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1
phone 2 2 1 1 2 1 1 1 2 3 1 3 2 1 2 1 2 1 1 3 1 1 2 1
place where eating dinner 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 1
plate of food 2 1 1 4 1 1 2 3 1 1 2 1 3 1 1 1 1 2 1 1 1 1 1 1
plates 2 1 1 1 2 2 1 1 1 1 1 3 1 1 1 1 1 2 1 1 2 1 1 1
play area 1 1 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1
power outlet 1 2 1 1 1 1 3 4 2 1 1 1 1 1 3 1 1 1 1 1 1 4 2 1
refrigerator 1 1 1 1 2 1 4 4 1 3 1 3 1 1 1 3 2 1 1 1 1 1 1 1
roof 2 1 1 1 1 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
shampoo 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 2 1 1
shower 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 2 2 1 1 1 1 1 1
sitting area 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
soap for hands and body 1 1 2 2 2 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 2 2 1 1
social drink 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
sofa 1 1 1 1 1 1 1 1 1 2 1 1 2 4 1 1 1 1 1 1 2 1 1 1
source of cool 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 2
spices 1 2 1 1 2 1 1 3 3 3 2 1 1 1 2 1 2 2 1 2 1 2 1 3
storage room 1 2 1 1 1 2 1 1 1 1 5 1 2 1 1 2 1 2 1 1 1 1 1 1
stove/hob 2 1 1 2 1 1 1 1 1 3 1 1 1 5 1 2 1 1 2 3 2 1 1 4
street detail 4 1 1 1 1 1 1 1 1 3 2 1 1 2 1 2 2 1 2 1 2 1 1 1
street view 1 1 1 2 1 4 2 1 1 1 1 1 1 2 1 3 1 1 2 1 2 2 1 2
switch on/off 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 3 1 1
table with food 2 4 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
teeth 1 1 1 1 1 1 1 1 1 2 1 2 3 1 1 1 2 2 2 2 1 2 1 1
toilet 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1
toilet paper 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 2
tooth paste 2 1 1 1 1 2 2 3 2 2 2 1 1 1 4 1 2 2 3 1 1 1 2 3
toothbrush 1 2 1 1 1 1 3 2 1 1 2 1 1 2 1 1 1 2 1 3 1 1 3 3
toys 2 1 2 1 3 5 1 1 1 3 1 2 2 1 1 4 2 3 1 1 2 3 1 1
trash/waste 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
tv 1 1 1 2 3 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 4 6
vegetables 1 2 2 1 2 1 1 1 3 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
wall 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
wall clock 2 1 1 2 1 1 1 1 1 1 1 4 1 1 1 2 2 1 1 1 1 2 1 1
wall decoration 1 2 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 2 2 0 2 1 1 1
wall inside 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 1 1 1 1 1 1
wardrobe 1 3 2 1 2 1 1 2 1 1 2 1 1 1 2 2 1 1 2 2 2 2 1 1
washing clothes/cleaning 1 1 1 2 1 1 1 1 1 2 1 1 1 1 3 1 1 1 4 4 1 3 1 1
washing detergent 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 2 1 1
water outlet 2 1 3 2 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1

B.2 Dataset Language Details

Table 6: Language support of the datasets considered in this work. More details on the languages are reported in Table 12.
Language Script MaXM xGQA XVNLI MaRVL M5-VLOD M5-VGR xFlickrCO XM3600
Amharic Ethiopic no no no no yes yes no no
Arabic Arabic no no yes no no no no yes
Bengali Bengali no yes no no yes yes no yes
Berber Tifinagh no no no no yes yes no no
Chinese Hanzi yes yes no yes no no yes yes
Croatian Latin no no no no no no no yes
Czech Latin no no no no no no no yes
Danish Latin no no no no no no no yes
Dutch Latin no no no no no no no yes
English Latin yes yes yes no yes yes yes yes
Filipino Latin no no no no yes yes no yes
Finnish Latin no no no no no no no yes
French Latin yes no yes no no no no yes
German Latin no yes no no yes yes yes yes
Greek Greek no no no no no no no yes
Hausa Latin no no no no yes yes no no
Hebrew Hebrew yes no no no no no no yes
Hindi Devanagari yes no no no yes yes no yes
Hungarian Latin no no no no no no no yes
Indonesian Latin no yes no yes no no yes yes
Italian Latin no no no no no no no yes
Japanese Japanese no no no no no no yes yes
Korean Hangul no yes no no no no no yes
Maori Latin no no no no no no no yes
Norwegian Latin no no no no no no no yes
Persian Perso-Arabic no no no no no no no yes
Polish Latin no no no no no no no yes
Portuguese Latin no yes no no no no no yes
Quechua Latin no no no no no no no yes
Romanian Latin yes no no no no no no yes
Russian Cyrillic no yes yes no yes yes yes yes
Spanish Latin no no yes no no no yes yes
Swahili Latin no no no yes yes yes no yes
Swedish Latin no no no no no no no yes
Tamil Tamil no no no yes no no no no
Telugu Telugu no no no no no no no yes
Thai Thai yes no no no yes yes no yes
Turkish Latin no no no yes no no yes yes
Ukrainian Cyrillic no no no no no no no yes
Vietnamese Latin no no no no no no no yes
Zulu Latin no no no no yes yes no no
Unique Languages 7 8 5 5 12 12 8 36
Unique Scripts 4 5 3 3 7 7 4 12

B.3 Language Details

Table 7: Details and statistics of languages comprised in the datasets of this benchmark. The continent and subregion columns refer to the continent or subregion where the respective language is mostly spoken. The number of speakers is an estimate of the number of L1 and L2 speakers based on different public sources such as Wikipedia, Ethnologue, and Statista. The "Taxonomy" column indicates the taxonomy class of the language based on Joshi et al. (2020).
Language ISO 639 Lang. Family Script Continent Subregion Taxonomy Speakers / 10^6
Arabic ar Afro-Asiatic Arabic Africa & Asia North Africa & Middle East 5 630.00
Chinese zh Sino-Tibetan Hanzi Asia Northeastern Asia 5 1330.00
English en Indo-European Latin America North America 5 1457.00
French fr Indo-European Latin Europe Western Europe 5 310.00
German de Indo-European Latin Europe Western Europe 5 175.00
Japanese ja Japonic Japanese Asia Northeastern Asia 5 128.00
Spanish es Indo-European Latin Europe Southern Europe 5 600.00
Croatian hr Indo-European Latin Europe Central & Eastern Europe 4 6.80
Czech cs Indo-European Latin Europe Central & Eastern Europe 4 11.00
Dutch nl Indo-European Latin Europe Western Europe 4 30.00
Finnish fi Uralic Latin Europe Northern Europe 4 5.80
Hindi hi Indo-European Devanagari Asia Central & South Asia 4 600.00
Hungarian hu Uralic Latin Europe Central & Eastern Europe 4 17.00
Italian it Indo-European Latin Europe Southern Europe 4 68.00
Korean ko Koreanic Hangul Asia Northeastern Asia 4 82.00
Persian fa Indo-European Perso-Arabic Asia Middle East 4 130.00
Polish pl Indo-European Latin Europe Central & Eastern Europe 4 41.00
Portuguese pt Indo-European Latin Europe & America Southern Europe & South America 4 360.00
Russian ru Indo-European Cyrillic Asia Central Asia 4 260.00
Swedish sv Indo-European Latin Europe Northern Europe 4 13.00
Turkish tr Turkic Latin Asia Middle East 4 90.00
Vietnamese vi Austroasiatic Latin Asia Southeastern Asia 4 85.00
Bengali bn Indo-European Bengali Asia Central & South Asia 3 270.00
Danish da Indo-European Latin Europe Western Europe 3 6.00
Filipino fil Austronesian Latin Asia Southeastern Asia 3 83.00
Greek el Indo-European Greek Europe Central & Eastern Europe 3 13.50
Hebrew he & iw Afro-Asiatic Hebrew Asia Middle East 3 9.00
Indonesian id Austronesian Latin Asia Southeastern Asia 3 300.00
Romanian ro Indo-European Latin Europe Central & Eastern Europe 3 28.50
Tamil ta Dravidian Tamil Asia Central & South Asia 3 86.00
Thai th Kra-Dai Thai Asia Southeastern Asia 3 80.00
Ukrainian uk Indo-European Cyrillic Europe Central & Eastern Europe 3 32.80
Amharic am Afro-Asiatic Ethiopic Africa Eastern Africa 2 57.00
Hausa ha Afro-Asiatic Latin Africa Western Africa 2 79.00
Swahili sw Niger-Congo Latin Africa Eastern Africa 2 73.00
Zulu zu Niger-Congo Latin Africa Southern Africa 2 28.00
Maori mi Austronesian Latin Australia & Oceania Australia & Oceania 1 0.19
Norwegian no Indo-European Latin Europe Northern Europe 1 4.32
Quechua quz Quechuan Latin America South America 1 9.00
Telugu te Dravidian Telugu Asia Central & South Asia 1 96.00
Berber ber Afro-Asiatic Tifinagh Africa Northern Africa 0 26.20
Sources: Wikipedia (https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Ethnologue (https://www.ethnologue.com/), and Statista (https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/).

Appendix C Model Details

Table 8: Architectural details of the LMMs evaluated in this study. The columns LM, VM, and MM are "Language Model", "Vision Model", and "Mapping Modules", respectively; the columns |LM|, |VM|, and |MM| show the number of parameters of the particular module. "|Total|" shows all parameters of the model. Note that we report friendly names of the models, which are enriched with hyperlinks pointing to the respective Huggingface repositories (when viewed digitally). For Gemini Pro Vision and GPT-4 Vision, we used the gemini-1.0-pro-vision and gpt-4-1106-vision-preview variants, respectively.
Model LM VM MM |Total| |LM| |VM| |MM|
MiniCPM-V [27; 49] MiniCPM-2B SigLIP 400M MLP 3.43B 3.01B 397.75M 29.51M
mBliP mT0 [22] Flan-T5-XL EVA01 CLIP-ViT-g QFormer 4.84B 3.74B 985.95M 106.71M
Yi-VL 6B [5] Yi-6B-Chat CLIP-ViT-H-14 MLP 6.71B 5.80B 631.75M 22.04M
LLaVA 1.6 7B [37] Vicuna-7B-v1.5 CLIP-ViT-L MLP 6.76B 6.61B 303.51M 20.98M
LLaVA 1.5 7B [38] Vicuna-7B-v1.5 CLIP-ViT-L MLP 7.06B 6.74B 303.51M 20.98M
BakLLaVA [38] Mistral 7B v0.1 CLIP-ViT-L MLP 7.57B 7.24B 303.51M 20.98M
mBliP BloomZ [22] BloomZ 7B EVA01 CLIP-ViT-g QFormer 8.16B 7.07B 985.95M 108.29M
Qwen-VL [9] Qwen-7B CLIP-VIT-bigG CrossAttn 9.66B 7.10B 1.94B 80.00M
OmniLMM 12B [49] Zephyr 7B β EVA02 CLIP ViT-E MLP 11.61B 7.24B 4.28B 93.36M
LLaVA 1.6 13B [37] Vicuna-13B-v1.5 CLIP-ViT-L MLP 13.05B 12.85B 303.51M 31.47M
LLaVA 1.5 13B [38] Vicuna-13B-v1.5 CLIP-ViT-L MLP 13.35B 13.02B 303.51M 31.47M
CogVLM [47] Vicuna-7B-v1.5 EVA02 CLIP ViT-E CrossAttn 17.64B 6.74B 4.28B 6.62B
InternVL V1.1 [15] Llama-2-13B InternViT 6B MLP 19.11B 13.12B 5.91B 91.79M
LLaVA 1.6 34B [37] Nous-Hermes-2-Yi-34B CLIP-ViT-L MLP 34.45B 33.93B 303.51M 58.73M
Yi-VL 34B [5] Yi-34B-Chat CLIP-ViT-H MLP 35.08B 33.93B 631.75M 60.60M
InternVL V1.2+ [15] Nous-Hermes-2-Yi-34B InternViT-6B V1-2 MLP 40.07B 34.39B 5.54B 143.17M
Gemini Pro Vision [7] ? ? ? ? ? ? ?
GPT-4 Vision [39] ? ? ? ? ? ? ?

Appendix D Results Details

D.1 General Results

D.1.1 xGQA

Figure 32: A bar plot showing the average accuracy per language and model on the xGQA dataset. The models on the x-axis are ordered by their average score across all languages so that the best performing model is on the right and the worst is on the left.
Table 9: The average accuracy per language and model on the xGQA dataset. The column "NEA" stands for the average of Non-English languages.
Model Language
bn de en id ko pt ru zh NEA
LLaVA 1.5 7B 0.06 0.35 0.62 0.33 0.29 0.35 0.36 0.35 0.30
CogVLM 0.05 0.38 0.59 0.34 0.30 0.33 0.33 0.37 0.30
MiniCPM-V 0.11 0.42 0.55 0.33 0.40 0.45 0.35 0.08 0.31
BakLLaVA 0.06 0.39 0.62 0.16 0.34 0.37 0.44 0.45 0.32
Yi-VL 6B 0.11 0.39 0.57 0.35 0.34 0.39 0.41 0.22 0.32
Qwen-VL 0.13 0.43 0.59 0.34 0.34 0.37 0.39 0.31 0.33
LLaVA 1.6 7B 0.07 0.42 0.60 0.37 0.33 0.39 0.37 0.38 0.34
Gemini Pro V 0.33 0.37 0.46 0.34 0.34 0.34 0.31 0.35 0.34
LLaVA 1.5 13B 0.10 0.44 0.62 0.34 0.31 0.38 0.40 0.40 0.34
OmniLMM 12B 0.21 0.42 0.49 0.35 0.37 0.38 0.41 0.39 0.36
LLaVA 1.6 13B 0.11 0.52 0.65 0.37 0.39 0.40 0.44 0.41 0.38
Yi-VL 34B 0.18 0.50 0.58 0.42 0.39 0.47 0.41 0.32 0.38
mBliP BloomZ 0.40 0.38 0.44 0.41 0.29 0.43 0.39 0.41 0.39
mBliP mT0 0.39 0.42 0.44 0.39 0.39 0.41 0.41 0.40 0.40
GPT 4V 0.44 0.42 0.45 0.42 0.41 0.41 0.38 0.41 0.41
InternVL V1.2+ 0.22 0.51 0.67 0.46 0.49 0.52 0.47 0.37 0.43
LLaVA 1.6 34B 0.21 0.54 0.65 0.48 0.44 0.52 0.50 0.56 0.46
InternVL V1.1 0.31 0.53 0.63 0.48 0.48 0.51 0.49 0.55 0.48
Average 0.19 0.43 0.57 0.37 0.37 0.41 0.40 0.37 0.37

D.1.2 MaXM

Figure 33: A bar plot showing the average accuracy per language and model on the MaXM dataset. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 10: The average accuracy per language and model on the MaXM dataset. The column "NEA" stands for the average over the non-English languages.
Model Language
en fr hi iw ro th zh NEA
CogVLM 0.43 0.03 0.01 0.04 0.02 0.00 0.03 0.02
BakLLaVA 0.53 0.14 0.02 0.02 0.06 0.14 0.07 0.08
OmniLMM 12B 0.48 0.28 0.03 0.01 0.17 0.13 0.06 0.11
LLaVA 1.5 7B 0.52 0.34 0.13 0.05 0.16 0.09 0.12 0.15
LLaVA 1.6 7B 0.34 0.38 0.09 0.11 0.14 0.10 0.12 0.16
LLaVA 1.5 13B 0.56 0.35 0.09 0.05 0.32 0.12 0.19 0.19
MiniCPM-V 0.56 0.28 0.12 0.09 0.13 0.13 0.39 0.19
Yi-VL 34B 0.53 0.21 0.14 0.14 0.16 0.23 0.31 0.20
Yi-VL 6B 0.53 0.32 0.13 0.16 0.12 0.18 0.29 0.20
Qwen-VL 0.50 0.37 0.15 0.20 0.20 0.29 0.17 0.23
LLaVA 1.6 13B 0.46 0.43 0.13 0.16 0.38 0.17 0.17 0.24
mBliP BloomZ 0.55 0.23 0.53 0.18 0.32 0.19 0.42 0.31
LLaVA 1.6 34B 0.58 0.44 0.25 0.27 0.43 0.25 0.32 0.32
InternVL V1.1 0.58 0.47 0.33 0.22 0.36 0.28 0.40 0.34
mBliP mT0 0.50 0.42 0.50 0.37 0.41 0.58 0.24 0.42
InternVL V1.2+ 0.60 0.52 0.35 0.35 0.44 0.31 0.55 0.42
Gemini Pro V 0.48 0.50 0.47 0.43 0.43 0.61 0.29 0.45
GPT 4V 0.49 0.55 0.52 0.62 0.53 0.64 0.31 0.53
Average 0.51 0.35 0.22 0.19 0.27 0.25 0.24 0.25
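
The "NEA" column used in these tables can be reproduced from a table row with a few lines of Python. The following is a minimal sketch, assuming that NEA is the unweighted mean of the per-language scores with English excluded; the helper name `non_english_average` is hypothetical and not taken from the paper. Because the printed table values are already rounded, a recomputed NEA may differ from the reported one in the last digit.

```python
from statistics import mean
from typing import Mapping, Optional


def non_english_average(
    scores: Mapping[str, Optional[float]], english: str = "en"
) -> float:
    """Unweighted mean of the per-language scores, excluding English.

    Assumption: languages a model was not evaluated on are passed as None
    (shown as "-" in the tables) and are skipped.
    """
    others = [
        value
        for lang, value in scores.items()
        if lang != english and value is not None
    ]
    if not others:
        raise ValueError("no non-English scores available")
    return mean(others)


# GPT 4V row of Table 10 (MaXM), using the rounded values as printed.
gpt4v_maxm = {
    "en": 0.49, "fr": 0.55, "hi": 0.52, "iw": 0.62,
    "ro": 0.53, "th": 0.64, "zh": 0.31,
}
nea = non_english_average(gpt4v_maxm)
print(f"{nea:.2f}")  # 0.53, matching the NEA entry in the table
```

Comparing the English column with the recomputed NEA gives a quick per-model view of the gap between English and the remaining languages.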

D.1.3 XVNLI

Figure 34: A bar plot showing the average accuracy per language and model on the XVNLI dataset. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 11: The average accuracy per language and model on the XVNLI dataset. The column "NEA" stands for the average over the non-English languages.
Model Language
ar en es fr ru NEA
CogVLM 0.26 0.47 0.31 0.27 0.32 0.29
BakLLaVA 0.32 0.48 0.33 0.33 0.36 0.34
Yi-VL 6B 0.34 0.56 0.38 0.39 0.41 0.38
mBliP BloomZ 0.40 0.40 0.45 0.48 0.44 0.44
LLaVA 1.6 7B 0.36 0.59 0.46 0.50 0.46 0.45
LLaVA 1.5 7B 0.34 0.60 0.52 0.53 0.50 0.47
Gemini Pro V 0.46 0.49 0.48 0.50 0.52 0.49
LLaVA 1.5 13B 0.39 0.59 0.53 0.54 0.52 0.49
MiniCPM-V 0.36 0.66 0.53 0.57 0.51 0.49
Yi-VL 34B 0.39 0.59 0.55 0.56 0.54 0.51
OmniLMM 12B 0.43 0.64 0.55 0.57 0.59 0.54
Qwen-VL 0.46 0.62 0.57 0.57 0.57 0.54
LLaVA 1.6 13B 0.49 0.61 0.57 0.56 0.57 0.55
InternVL V1.1 0.50 0.61 0.57 0.57 0.57 0.56
mBliP mT0 0.55 0.59 0.56 0.57 0.58 0.57
InternVL V1.2+ 0.53 0.63 0.59 0.60 0.59 0.58
LLaVA 1.6 34B 0.54 0.62 0.59 0.60 0.59 0.58
GPT 4V 0.67 0.69 0.66 0.68 0.70 0.68
Average 0.43 0.58 0.51 0.52 0.52 0.50

D.1.4 MaRVL

Figure 35: A bar plot showing the average accuracy per language and model on the MaRVL dataset. Note that MaRVL does not originally contain English data; we machine-translated the other languages into English and averaged the results. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 12: The average accuracy per language and model on the MaRVL dataset. Note that MaRVL does not originally contain English data; we machine-translated the other languages into English and averaged the results. The column "NEA" stands for the average over the non-English languages.
Model Language
en id sw ta tr zh NEA
CogVLM 0.60 0.53 0.51 0.49 0.51 0.53 0.51
LLaVA 1.5 7B 0.57 0.53 0.51 0.51 0.51 0.53 0.52
BakLLaVA 0.59 0.54 0.51 0.50 0.53 0.55 0.53
LLaVA 1.6 7B 0.62 0.57 0.51 0.50 0.51 0.54 0.53
Qwen-VL 0.60 0.52 0.50 0.50 0.54 0.59 0.53
Yi-VL 6B 0.59 0.53 0.49 0.50 0.54 0.61 0.53
MiniCPM-V 0.61 0.53 0.50 0.50 0.56 0.58 0.53
LLaVA 1.5 13B 0.60 0.60 0.51 0.50 0.54 0.56 0.54
Gemini Pro V 0.55 0.55 0.53 0.55 0.56 0.55 0.55
OmniLMM 12B 0.64 0.62 0.51 0.51 0.57 0.57 0.56
mBliP BloomZ 0.55 0.57 0.56 0.57 0.56 0.56 0.56
Yi-VL 34B 0.62 0.62 0.53 0.51 0.59 0.65 0.58
InternVL V1.1 0.63 0.61 0.54 0.58 0.65 0.63 0.60
InternVL V1.2+ 0.68 0.67 0.53 0.53 0.64 0.70 0.61
mBliP mT0 0.60 0.63 0.60 0.64 0.66 0.62 0.63
LLaVA 1.6 13B 0.65 0.66 0.60 0.65 0.69 0.64 0.65
LLaVA 1.6 34B 0.64 0.72 0.56 0.57 0.70 0.76 0.66
GPT 4V 0.64 0.71 0.59 0.63 0.73 0.66 0.66
Average 0.61 0.60 0.53 0.54 0.59 0.60 0.57

D.1.5 M5-VGR

Figure 36: A bar plot showing the average accuracy per language and model on the M5-VGR dataset. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 13: The average accuracy per language and model on the M5-VGR dataset. The column "NEA" stands for the average over the non-English languages.
Model Language
am ber bn de en fil ha hi ru sw th zu NEA
LLaVA 1.5 7B 0.43 0.50 0.36 0.44 0.47 0.52 0.42 0.38 0.41 0.36 0.38 0.36 0.42
LLaVA 1.6 7B 0.43 0.50 0.36 0.47 0.55 0.52 0.42 0.39 0.45 0.36 0.36 0.36 0.42
LLaVA 1.5 13B 0.43 0.50 0.37 0.65 0.57 0.52 0.42 0.45 0.56 0.37 0.41 0.36 0.46
BakLLaVA 0.42 0.51 0.37 0.62 0.71 0.55 0.48 0.37 0.68 0.42 0.48 0.33 0.48
LLaVA 1.6 13B 0.44 0.50 0.36 0.79 0.78 0.49 0.42 0.53 0.81 0.33 0.48 0.37 0.50
Yi-VL 34B 0.43 0.50 0.51 0.74 0.77 0.60 0.42 0.44 0.69 0.40 0.57 0.36 0.52
Qwen-VL 0.30 0.17 0.60 0.63 0.82 0.53 0.57 0.56 0.66 0.63 0.62 0.61 0.54
CogVLM 0.53 0.46 0.54 0.74 0.68 0.53 0.54 0.59 0.61 0.54 0.60 0.41 0.55
mBliP BloomZ 0.46 0.50 0.64 0.61 0.69 0.50 0.42 0.64 0.60 0.60 0.46 0.69 0.56
MiniCPM-V 0.61 0.64 0.55 0.69 0.80 0.55 0.43 0.64 0.68 0.38 0.56 0.41 0.56
OmniLMM 12B 0.51 0.69 0.58 0.65 0.78 0.62 0.49 0.51 0.78 0.47 0.64 0.51 0.59
Yi-VL 6B 0.62 0.31 0.64 0.74 0.72 0.54 0.70 0.62 0.72 0.55 0.63 0.59 0.61
InternVL V1.1 0.48 0.50 0.63 0.76 0.73 0.68 0.47 0.68 0.75 0.58 0.81 0.47 0.62
LLaVA 1.6 34B 0.51 0.65 0.57 0.80 0.87 0.58 0.47 0.67 0.82 0.63 0.74 0.59 0.64
Gemini Pro V 0.71 0.50 0.64 0.62 0.79 0.63 0.62 0.66 0.68 0.68 0.83 0.66 0.66
InternVL V1.2+ 0.51 0.55 0.66 0.78 0.86 0.73 0.54 0.67 0.85 0.64 0.90 0.66 0.68
mBliP mT0 0.81 0.42 0.67 0.68 0.74 0.56 0.87 0.67 0.75 0.67 0.75 0.73 0.69
GPT 4V 0.82 0.47 0.80 0.81 0.88 0.84 0.93 0.79 0.88 0.80 0.94 0.83 0.81
Average 0.53 0.49 0.55 0.68 0.73 0.58 0.53 0.57 0.69 0.52 0.62 0.52 0.57

D.1.6 M5-VLOD

Figure 37: A bar plot showing the average accuracy per language and model on the M5-VLOD dataset. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 14: The average accuracy per language and model on the M5-VLOD dataset. The column "NEA" stands for the average over the non-English languages.
Model Language
am ber bn de en fil ha hi ru sw th zu NEA
CogVLM 0.10 0.07 0.08 0.06 0.10 0.07 0.09 0.08 0.06 0.06 0.07 0.09 0.08
mBliP mT0 0.14 0.22 0.16 0.10 0.12 0.24 0.24 0.15 0.10 0.22 0.15 0.14 0.17
Yi-VL 6B 0.14 0.21 0.20 0.12 0.20 0.26 0.21 0.21 0.13 0.24 0.22 0.19 0.19
Yi-VL 34B 0.15 0.22 0.14 0.27 0.26 0.21 0.17 0.27 0.21 0.16 0.16 0.17 0.19
MiniCPM-V 0.17 0.19 0.20 0.11 0.20 0.19 0.22 0.16 0.15 0.29 0.23 0.24 0.20
LLaVA 1.5 7B 0.18 0.22 0.15 0.13 0.15 0.19 0.25 0.19 0.19 0.25 0.27 0.19 0.20
BakLLaVA 0.25 0.19 0.21 0.12 0.14 0.21 0.22 0.15 0.17 0.22 0.26 0.26 0.20
LLaVA 1.5 13B 0.18 0.23 0.19 0.17 0.16 0.24 0.25 0.14 0.13 0.26 0.28 0.20 0.21
InternVL V1.1 0.18 0.22 0.20 0.11 0.12 0.24 0.29 0.16 0.11 0.29 0.29 0.19 0.21
LLaVA 1.6 7B 0.17 0.22 0.18 0.14 0.14 0.24 0.27 0.18 0.15 0.29 0.27 0.19 0.21
LLaVA 1.6 13B 0.18 0.23 0.19 0.13 0.14 0.25 0.29 0.16 0.13 0.28 0.26 0.22 0.21
Qwen-VL 0.18 0.22 0.20 0.14 0.16 0.25 0.29 0.16 0.13 0.29 0.27 0.19 0.21
mBliP BloomZ 0.20 0.20 0.19 0.15 0.14 0.24 0.29 0.17 0.12 0.26 0.28 0.21 0.21
OmniLMM 12B 0.18 0.16 0.25 0.17 0.19 0.25 0.30 0.17 0.25 0.20 0.21 0.22 0.21
LLaVA 1.6 34B 0.19 0.24 0.20 0.14 0.26 0.30 0.28 0.16 0.19 0.26 0.25 0.18 0.22
InternVL V1.2+ 0.24 0.20 0.28 0.29 0.28 0.20 0.14 0.20 0.24 0.24 0.28 0.24 0.23
Gemini Pro V 0.33 0.19 0.37 0.42 0.52 0.43 0.27 0.38 0.40 0.43 0.37 0.39 0.36
GPT 4V 0.36 0.22 0.38 0.42 0.70 0.53 0.38 0.47 0.50 0.44 0.48 0.46 0.42
Average 0.20 0.20 0.21 0.18 0.22 0.25 0.25 0.20 0.19 0.26 0.26 0.22 0.22

D.1.7 xFlickrCO

Figure 38: A bar plot showing the average chrF++ score per language and model on the xFlickrCO dataset. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 15: The average chrF++ score per language and model on the xFlickrCO dataset. The column "NEA" stands for the average over the non-English languages.
Model Language
de en es id ja ru tr zh NEA
Qwen-VL 9.00 18.68 8.69 4.88 0.77 0.74 3.91 5.62 4.80
Yi-VL 6B 11.53 24.54 14.61 8.37 0.78 0.90 8.15 0.79 6.45
CogVLM 11.08 16.76 12.32 11.27 0.56 3.71 7.62 0.46 6.72
BakLLaVA 13.21 26.79 14.17 10.48 0.06 0.75 9.49 0.09 6.89
Yi-VL 34B 17.02 24.62 11.36 11.79 2.00 2.57 9.50 2.44 8.10
MiniCPM-V 19.05 27.43 18.81 14.62 4.69 10.73 13.18 1.40 11.78
InternVL V1.1 18.21 27.98 20.74 14.69 4.31 7.07 8.67 9.38 11.87
LLaVA 1.5 7B 23.22 28.32 21.95 17.58 0.44 4.45 10.77 5.29 11.96
LLaVA 1.5 13B 21.66 29.39 19.37 15.59 6.63 5.02 10.45 6.72 12.21
LLaVA 1.6 7B 19.70 19.31 21.48 19.32 4.60 11.27 13.14 6.78 13.75
OmniLMM 12B 23.39 30.76 22.05 20.50 2.89 13.29 14.55 2.59 14.18
LLaVA 1.6 13B 22.55 23.94 21.98 20.73 7.57 13.26 14.79 6.39 15.33
LLaVA 1.6 34B 24.38 23.52 23.98 22.36 5.08 16.40 15.05 6.34 16.23
GPT 4V 24.56 24.17 22.82 23.29 4.73 15.82 17.58 5.60 16.34
mBliP BloomZ 24.39 25.99 25.12 23.56 7.18 15.31 17.16 3.93 16.67
Gemini Pro V 24.17 22.13 23.50 23.10 5.75 17.28 18.03 5.24 16.73
InternVL V1.2+ 25.81 28.41 24.13 20.48 7.25 17.34 16.73 8.54 17.18
mBliP mT0 26.10 26.07 24.74 22.41 7.56 18.64 19.58 3.87 17.56
Average 19.95 24.93 19.55 16.95 4.05 9.70 12.69 4.53 12.49

D.1.8 XM3600

Figure 39: A bar plot showing the average chrF++ score per language and model on the XM3600 dataset. Due to resource restrictions, we evaluated GPT 4V only on a subset of languages. The models on the x-axis are ordered by their average score across all languages, so that the best-performing model is on the right and the worst is on the left.
Table 16: The average chrF++ score per language and model on the XM3600 dataset. Due to resource restrictions, we evaluated GPT 4V only on a subset of languages. The column "NEA" stands for the average over the non-English languages.
Model Language
ar bn cs da de el en es fa fi fil fr
CogVLM 0.07 0.04 9.30 11.92 12.50 0.25 24.26 14.25 0.02 10.52 10.96 13.18
BakLLaVA 0.21 0.22 8.65 11.45 14.33 0.24 25.39 17.13 0.64 10.02 11.41 18.33
Qwen-VL 2.08 0.17 9.89 14.38 13.14 2.32 27.89 16.00 4.09 7.13 11.36 14.70
Yi-VL 6B 4.65 2.98 9.48 13.55 15.58 4.54 28.59 18.58 3.50 9.29 12.42 17.12
Yi-VL 34B 4.24 4.14 9.52 15.40 17.00 8.00 27.11 17.86 10.06 9.17 14.73 16.93
MiniCPM-V 6.38 1.96 9.05 15.52 19.60 2.98 28.53 23.54 3.57 12.33 16.19 23.98
Gemini Pro V 14.90 4.94 17.79 18.32 17.63 10.36 21.81 18.64 0.21 14.50 2.25 20.15
LLaVA 1.5 7B 6.30 3.71 13.80 15.93 21.18 7.42 26.02 23.60 7.45 15.67 17.38 23.83
mBliP mT0 12.68 10.79 17.20 19.43 18.74 15.76 28.68 20.71 16.19 13.26 20.79 20.52
InternVL V1.1 12.23 2.55 14.74 22.82 23.77 10.20 32.10 27.91 11.94 16.47 19.20 25.95
mBliP BloomZ 18.10 14.92 16.99 19.16 21.17 11.03 28.05 26.73 15.59 11.86 14.47 25.28
OmniLMM 12B 9.48 3.51 14.24 23.15 25.05 7.37 24.42 26.75 10.65 13.78 20.92 28.18
LLaVA 1.6 7B 12.52 6.13 15.79 14.50 24.06 11.11 26.41 27.37 13.07 17.23 17.76 27.48
LLaVA 1.5 13B 7.07 1.80 14.75 21.74 24.15 6.49 29.55 26.59 14.90 19.51 22.91 29.14
InternVL V1.2+ 13.59 6.19 15.34 24.85 27.05 11.20 29.84 29.50 15.69 17.01 27.22 29.80
LLaVA 1.6 13B 14.07 5.42 17.51 22.30 25.95 11.90 26.42 28.39 14.72 20.44 23.14 29.42
LLaVA 1.6 34B 13.85 6.20 16.94 24.44 26.51 12.17 26.52 28.90 16.09 18.08 28.35 29.83
GPT 4V 22.67 16.27 - - 29.24 - 26.89 30.86 - - - 31.82
Average 9.73 5.11 12.83 17.16 20.92 7.41 27.14 23.52 8.80 13.13 16.19 23.65
Model Language
he hi hr hu id it ja ko mi nl no pl
CogVLM 0.52 0.38 10.25 8.25 10.70 13.11 0.07 0.13 10.00 13.59 11.73 9.98
BakLLaVA 1.07 0.71 10.33 8.98 12.59 16.12 0.07 0.16 10.62 14.56 11.48 10.97
Qwen-VL 0.58 2.32 11.33 9.60 11.50 13.76 2.75 0.70 8.73 15.91 12.64 10.59
Yi-VL 6B 2.78 3.86 9.82 9.12 10.90 14.69 2.40 1.32 8.81 16.04 13.30 10.88
Yi-VL 34B 5.58 5.64 10.31 9.23 13.30 16.55 2.21 2.02 9.55 17.43 13.79 10.40
MiniCPM-V 4.86 2.36 11.96 10.91 16.94 19.06 2.92 0.39 10.49 18.47 14.27 11.51
Gemini Pro V 7.12 6.98 13.48 9.22 16.98 18.44 6.63 6.43 3.55 19.67 17.43 17.29
LLaVA 1.5 7B 3.76 6.29 13.05 11.69 19.33 20.73 3.48 3.93 10.10 23.30 19.79 16.10
mBliP mT0 11.16 12.08 10.26 14.59 17.39 17.92 5.79 6.00 11.88 24.20 19.97 14.49
InternVL V1.1 8.80 6.47 15.05 12.49 24.31 23.13 6.09 4.83 15.93 25.02 22.45 17.58
mBliP BloomZ 9.16 16.18 9.78 13.84 21.44 21.39 6.53 3.67 5.99 26.17 17.35 16.07
OmniLMM 12B 3.99 9.91 18.84 16.72 25.07 22.50 3.16 2.31 14.94 26.47 21.36 19.16
LLaVA 1.6 7B 10.61 10.26 16.52 18.26 24.05 24.71 6.66 6.09 13.12 25.07 20.49 19.38
LLaVA 1.5 13B 11.63 9.13 16.87 16.54 25.13 26.11 8.16 6.86 13.98 27.52 23.77 17.96
InternVL V1.2+ 10.88 7.69 17.07 14.70 24.65 25.94 7.96 5.53 14.17 29.11 23.02 18.37
LLaVA 1.6 13B 12.54 11.00 19.99 19.52 26.15 26.66 8.27 6.95 13.73 27.15 21.19 21.03
LLaVA 1.6 34B 11.30 7.27 18.16 16.57 27.69 27.40 7.75 5.60 16.69 28.42 24.45 19.49
GPT 4V - 17.16 - - 33.24 - 11.46 - - - - -
Average 6.46 7.54 12.95 12.23 20.08 19.35 5.13 3.50 10.68 21.01 17.14 14.51
Model Language
pt quz ro ru sv sw te th tr uk vi zh NEA
CogVLM 12.87 9.75 11.23 0.86 12.57 9.41 0.51 0.26 9.58 0.46 6.74 0.29 7.04
BakLLaVA 14.00 9.00 11.30 0.85 11.61 9.37 1.47 0.57 9.36 0.31 7.11 0.03 7.58
Qwen-VL 14.17 8.25 13.60 4.30 13.59 8.75 1.44 1.28 8.26 5.66 5.76 6.20 8.20
Yi-VL 6B 13.77 8.25 10.04 6.57 15.64 8.94 4.93 2.57 9.55 2.65 7.76 2.61 8.82
Yi-VL 34B 14.57 7.64 10.95 6.95 14.42 9.71 5.62 2.92 10.84 4.19 8.74 2.82 9.78
MiniCPM-V 18.21 7.21 14.94 3.69 15.36 11.16 1.83 2.24 13.47 1.74 8.88 2.46 10.30
Gemini Pro V 20.60 4.72 10.98 15.27 20.60 15.80 1.87 12.45 15.62 10.82 16.48 4.88 12.37
LLaVA 1.5 7B 21.57 9.55 12.38 10.08 20.68 9.59 2.23 5.51 11.78 5.84 14.34 3.87 12.44
mBliP mT0 19.35 7.70 13.05 14.63 20.66 14.45 12.42 14.76 14.13 13.60 18.73 2.59 14.80
InternVL V1.1 24.47 7.91 17.55 16.39 23.40 9.82 4.73 6.85 13.22 11.26 10.80 7.76 14.97
mBliP BloomZ 23.93 4.32 14.59 16.25 18.31 14.82 14.12 9.19 15.34 13.35 22.14 2.65 15.20
OmniLMM 12B 22.75 10.61 18.61 17.49 22.09 13.68 5.41 6.84 14.68 17.49 16.58 3.00 15.34
LLaVA 1.6 7B 23.42 10.04 15.55 15.18 21.42 11.69 4.60 9.62 14.81 11.40 15.54 5.58 15.46
LLaVA 1.5 13B 26.51 9.70 21.33 8.53 24.80 13.81 3.39 10.84 15.98 6.36 21.66 6.22 16.05
InternVL V1.2+ 26.63 6.20 18.06 19.30 26.27 14.83 7.79 5.30 17.30 13.79 17.22 7.71 17.05
LLaVA 1.6 13B 25.07 10.60 21.96 14.86 21.01 14.80 5.18 11.11 17.03 14.03 21.44 6.02 17.44
LLaVA 1.6 34B 22.85 10.39 20.08 20.11 24.92 18.73 8.70 7.19 18.83 15.36 16.23 7.02 17.79
GPT 4V 30.13 - 25.41 - - - - - 25.70 - - - 24.91
Average 20.83 7.88 15.65 10.63 18.19 11.63 4.79 6.08 14.19 8.24 13.12 3.98 13.64
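
A note on reading the Average row when a model has no score for a language: the sketch below recomputes the quz column mean in two ways, using the values printed in the table above. Dividing the sum of the 17 reported scores by all 18 model rows reproduces the tabulated 7.88, whereas dividing by the 17 reported scores gives 8.34. This suggests, but does not confirm, that "-" entries contribute zero to the Average row. The helper `column_mean` and its `missing_as_zero` flag are illustrative names, not something the paper defines.

```python
# quz column of the table above, one entry per model row in table order.
# None stands for the "-" entry of GPT 4V.
quz = [9.75, 9.00, 8.25, 8.25, 7.64, 7.21, 4.72, 9.55, 7.70, 7.91,
       4.32, 10.61, 10.04, 9.70, 6.20, 10.60, 10.39, None]


def column_mean(values, missing_as_zero):
    """Mean of a table column, with two ways of handling missing entries."""
    present = [v for v in values if v is not None]
    # Either divide by every row (missing counts as 0) or only by reported rows.
    denominator = len(values) if missing_as_zero else len(present)
    return sum(present) / denominator


mean_all_rows = round(column_mean(quz, missing_as_zero=True), 2)
mean_reported = round(column_mean(quz, missing_as_zero=False), 2)
print(mean_all_rows, mean_reported)  # 7.88 8.34
```

Under this reading, averages for languages that GPT 4V was not evaluated on are slightly lower than a mean over the reported scores alone would be.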

D.2 Language Fidelity Analysis

Table 17: Pearson correlation coefficients between language fidelity on xFlickrCO and performance on other datasets.
Dataset Language
Avg. zh en de id ja ru es tr
xFlickrCO .91 .85 .65 .86 .88 .91 .92 .90 .84
XM3600 .81 .74 .63 .63 .69 .74 .76 .67 .82
MaXM .55 .17 .43 - - - - - -
XVNLI .51 - .46 - - - .47 .20 -
MaRVL .46 .21 .41 - .50 - - - .50
M5-VGR .34 - .11 .15 - - .42 - -
xGQA .21 .35 .47 .08 .37 - -.04 - -
M5-VLOD .14 - .44 .20 - - .14 - -
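
For readers who want to compute a coefficient of the kind reported in Table 17, the following is a minimal sketch of the Pearson correlation between a per-model language-fidelity value and a per-model task score. Pairing one fidelity value with one score per model is an assumption about how the table entries were obtained. The function name `pearson` and all numbers are illustrative placeholders, not values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder data, one entry per model (NOT from the paper):
# language fidelity on a captioning dataset and a score on some other dataset.
fidelity = [0.10, 0.35, 0.60, 0.80, 0.95]
score = [5.0, 9.0, 14.0, 15.0, 21.0]

r = pearson(fidelity, score)
print(f"r = {r:.2f}")
```

A value near 1 would mean that models with higher language fidelity also tend to score higher on the other dataset, while a value near 0, as in the lower rows of Table 17, indicates little linear relationship.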