M5 - A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
Abstract
Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their impressive capabilities, LLMs often exhibit significant performance disparities across different languages and cultural contexts, as demonstrated by various text-only benchmarks. However, current research lacks such benchmarks for multimodal visio-linguistic settings. This work fills this gap by introducing M5, the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual and multicultural context. M5 includes eight datasets covering five tasks and languages, with a focus on underrepresented languages and culturally diverse images. Furthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a new Visio-Linguistic Outlier Detection task, in which all evaluated open-source models fail to significantly surpass the random baseline. Through extensive evaluation and analyses, we highlight substantial task-agnostic performance disparities between high- and low-resource languages. Moreover, we show that larger models do not necessarily outperform smaller ones in a multilingual setting.
Florian Schneider, Language Technology Group, Universität Hamburg, Germany; Sunayana Sitaram, Microsoft Research India, Bangalore, India
1 Introduction
Since the release of ChatGPT, Natural Language Processing has experienced a significant surge in interest and research, with a particular focus on LLMs finetuned to follow human instructions. Besides proprietary models like GPT-4 (Achiam et al., 2023), Claude (Bai et al., 2022), or Gemini (Anil et al., 2023), there are also successful open-source variants such as Llama (Touvron et al., 2023), Phi (Gunasekar et al., 2023; Abdin et al., 2024), or Mistral (Jiang et al., 2023).
![Refer to caption](x1.png)
While LLMs often demonstrate impressive performance on a wide range of tasks, quantifying and measuring this performance is challenging. Nevertheless, recent evaluation studies have shown that LLMs generally perform well in English but much worse in other languages (Ahuja et al., 2023a, b; Holtermann et al., 2024).
In this work, we focus on multimodal variants of LLMs, Large Multimodal Models (LMMs), such as GPT 4V (OpenAI, 2023), Gemini Pro V (Anil et al., 2023), or the popular open-source model LLaVA (Liu et al., 2023a, b). LMMs are not text-only but are also capable of processing images in addition to text.
![Refer to caption](x2.png)
Most open-source LMMs comprise three major components: an LLM, a vision-encoder model, and a mapping network that projects image embeddings into the text embedding space. With this architecture, where an LLM serves as the core, we argue that LMMs inherently suffer from the same issue as LLMs: they generally perform much worse in non-English languages. However, existing benchmarks are either text-only Ahuja et al. (2023a) or multimodal but monolingual Yue et al. (2023), and thus cannot test this hypothesis. In other words, current research lacks multimodal multilingual benchmarks to examine LMMs' multilingual capabilities. In this work, we fill this gap by introducing the M5 Benchmark, taking a significant step towards identifying and measuring the performance disparities of current LMMs between various languages. Figure 2 and Figure 1 present a high-level summary of our benchmark. Moreover, we introduce two new evaluation datasets, including a novel vision-language task. Both datasets focus on African and Asian cultures, which are underrepresented or even non-existent in previous benchmarks. Our exhaustive analyses additionally investigate the influence of different factors on the performance, such as the models' size or language fidelity.
Major Contributions
The major contributions of this work are (a) M5, the first multimodal benchmark to assess the performance of current LMMs across five tasks, eight datasets, and languages; (b) two novel datasets spanning underrepresented African and Asian languages, English, and German, with images depicting the respective cultures; (c) a novel vision-language task: Visio-Linguistic Outlier Detection (VLOD); (d) a large-scale evaluation of recent LMMs and a thorough analysis of their multilingual performance; (e) a public release of our codebase and all datasets in a uniform schema to foster future research for more equitable and accessible LMMs or AI in general. (We will release all code and data upon acceptance.)
2 Related Work
Large Multi-Modal Models
This work focuses on the multimodal counterpart of large language models (LLMs), often referred to as Large Multimodal Models (LMMs). LMMs are language models capable of processing and "understanding" data other than text. While this generally subsumes images, video, audio, or more, we concentrate on visio-linguistic LMMs, i.e., models that take text and/or images as input and generate textual output.
The vast majority of open-source LMMs comprise three major components: a pretrained generative LLM as the core, a pretrained vision-encoder model that computes semantically rich image embeddings, and a shallow mapping network that learned to project image embeddings into the text embedding space. One of this architecture's successful open-source implementations with a recent LLM, i.e., the Llama-based Vicuna Chiang et al. (2023); Touvron et al. (2023), is LLaVA Liu et al. (2023b), from which many others took inspiration, also regarding the training data and process. Besides this, there are also LMMs that use Cross-Attention Wang et al. (2023); Bai et al. (2023), Q-Formers Li et al. (2023); Geigle et al. (2023), Adapters Eichenberg et al. (2022), or Perceiver Resamplers Alayrac et al. (2022); Awadalla et al. (2023) to process image embeddings. For an overview including architectural details and the number of parameters of the LMMs' components we employed in this work, please see Table 8.
Evaluation Benchmarks
With the recent surge in the research of LLMs and LMMs, analyzing the models' performances is crucial yet challenging. Popular benchmarks like BIG-Bench bench authors (2023), HELM Liang et al. (2022), or MMLU Hendrycks et al. (2020) are the de facto standard to evaluate LLMs on text-only tasks primarily in English. Efforts like MEGA, MEGAVERSE, or MultiQ Ahuja et al. (2023a, b); Holtermann et al. (2024) extended these monolingual benchmarks to a large set of diverse languages and showed that the LLMs' performance in English versus non-English languages differs significantly.
Similarly, efforts have been made to evaluate multimodal models. Benchmarks like MMMU Yue et al. (2023), MME Fu et al. (2023), or MMBench Yuan et al. (2023) assess the performance of LMMs on a vast number of text-image tasks. However, these benchmarks primarily focus on English, with some tasks available in Chinese. Like MMMU, there is CMMMU Ge et al. (2024), which focuses on text-image tasks in Chinese. Nonetheless, evaluating state-of-the-art LMMs in a massively multilingual large-scale setting remains largely unexplored. There are only a few multimodal multilingual evaluation datasets (see Sections 3.2 and 8.6) and only two benchmarks: IGLUE Bugliarello et al. (2022) and MEGAVERSE. However, IGLUE evaluates only non-autoregressive transformer-encoders, thus lacking state-of-the-art LLMs. In MEGAVERSE, only five recent LMMs are evaluated on two datasets.
3 The M5 Benchmark
This section describes the setup of the M5 Benchmark introduced by this work. Details about the experimental setup, including prompts and hyperparameters, are reported in Appendix A.
3.1 Models
We chose the LMMs included in this benchmark for the following reasons: Firstly, we focused on publicly available models released on Hugging Face, except for GPT-4 Vision and Gemini Pro. Secondly, we included LMMs that perform well on popular multimodal English-only benchmarks such as MMMU (Yue et al., 2023) and MME (Fu et al., 2023). Thirdly, we aimed to cover a mixture of different model families and a broad model size spectrum, including small, medium, and large models. For an overview of all models, including their number of parameters and other architectural details, see Table 8.
3.2 Datasets
This section briefly introduces the existing datasets included in our benchmark. In addition to these, we crafted two novel datasets described in Section 4. For details about the languages covered by the datasets, please refer to Table 6.
xGQA
MaXM
The MaXM dataset was introduced by Changpinyo et al. (2023) and is a VQA dataset comprising seven languages in five scripts. In MaXM, the questions and their respective answers are in the same language. To increase cultural diversity, the images were selected to match the region where the target language is spoken.
XVNLI
The XVNLI dataset Bugliarello et al. (2022) introduces the task of Cross-lingual Visual Natural Language Inference, where a model needs to predict whether a textual hypothesis entails, contradicts, or is neutral concerning a visual premise. XVNLI comprises five languages covering three scripts and unique images from Visual Genome.
MaRVL
The MaRVL dataset Liu et al. (2021) aims to benchmark models on Multicultural Reasoning over Vision and Language. A task sample comprises two images, a textual statement, and a binary true or false answer grounded in the images. MaRVL comprises five languages covering three scripts and culturally diverse images that match the respective languages.
XM3600
The XM3600 dataset Thapliyal et al. (2022) is a large multilingual image captioning dataset comprising languages with captions for unique images per language. The images are selected to match the language's cultural background, ensuring cultural and linguistic diversity.
xFlickrCO
4 Novel M5 Datasets
In addition to the existing datasets introduced in the previous section, we crafted two novel multimodal and multilingual evaluation datasets. The principal motivation behind this is to fill the gap in existing vision-language datasets concerning the lack of underrepresented languages, tasks, and cultural diversity. Moreover, we aim to enable further examination of LMMs and their performance on non-English and non-Western data with a particular focus on African and Asian regions. Details, statistics, and examples are reported in Appendix B.
Common Characteristics
Languages
Both datasets comprise samples in 12 languages covering seven scripts (see Table 6): Amharic, Berber, Bengali, German, English, Filipino, Hausa, Hindi, Russian, Swahili, Thai, and Zulu. The languages were selected to enrich the set of languages covered by existing datasets, focusing on underrepresented languages from Asian and African countries or regions. To our knowledge, no other visio-linguistic evaluation dataset covers Amharic, Berber, Hausa, or Zulu.
Depicting Cultural Diversity
The images in our datasets originate from the Dollar Street dataset (Gaviria Rojas et al., 2022), comprising around photos taken in different regions or countries around the globe. These photos depict the lives of families, including their homes, neighborhoods, or everyday objects, in a culturally diverse way. Further, each image in the original dataset is tagged with one or more "topics" that roughly describe its visual content.
Image Basis
For our datasets, we sampled a subset of images from the Dollar Street dataset taken in regions where our target languages are spoken. In this subset, which forms the visual basis for both of our datasets, each image is tagged with exactly one topic and was taken in a region where one of the target languages is spoken.
4.1 M5-VGR
![Refer to caption](extracted/5710554/gfx/vgr_sample_zu_90_4bbacd9003aa4d0199939fa2fd80c276.png)
Inspired by MaRVL, the goal of the M5-VGR dataset is to provide a visually grounded reasoning (VGR) evaluation dataset that covers a wide range of typologically different languages and, at the same time, visually represents a diverse set of cultures in which the respective languages are spoken. However, since the MaRVL dataset contains only five languages, we chose additional typologically diverse languages for our dataset. To guarantee visual and linguistic diversity and high data quality in our dataset, we hired professional native-speaker annotators of the respective languages to annotate the data. Moreover, we performed several rounds of data quality assessment in close collaboration with the annotators.
A task sample in M5-VGR contains two images, a textual, visually grounded hypothesis, and a binary label that is either true or false concerning the two visual premises (see Figure 3). More specifically, for each language, we created the tasks as follows: In the first step, we sampled unique images from our image basis so that each topic occurs at least once across all languages. Then, for each of these images, we randomly selected another image that is associated with another language and shares the same topic. In the third step, we asked the native-speaker annotators of the language to manually create a hypothesis and a label, which is either true or false concerning the two image premises. Further, the annotators were instructed to generate a hypothesis semantically related to the topic if possible.
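The image-pairing part of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Image` record, the `build_vgr_pairs` helper, and the toy data are hypothetical, and the constraint that every topic occurs at least once across all languages is omitted for brevity. The hypothesis and label are written by human annotators, so only the first two steps are shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    image_id: str
    language: str  # language spoken in the region where the photo was taken
    topic: str     # each image in the image basis has exactly one topic


def build_vgr_pairs(image_basis, language, n_samples, seed=0):
    """Pair images of `language` with same-topic images of other languages."""
    rng = random.Random(seed)
    own = [img for img in image_basis if img.language == language]
    pairs = []
    # Step 1: sample unique images for the target language.
    for first in rng.sample(own, min(n_samples, len(own))):
        # Step 2: pick a partner image with the same topic but another language.
        candidates = [
            img for img in image_basis
            if img.topic == first.topic and img.language != language
        ]
        if candidates:  # skip topics without a partner in another language
            pairs.append((first, rng.choice(candidates)))
    return pairs


# Toy image basis (made-up identifiers and topics).
basis = [
    Image("a", "sw", "soap"), Image("b", "hi", "soap"),
    Image("c", "sw", "stove"), Image("d", "th", "stove"),
    Image("e", "zu", "soap"),
]
pairs = build_vgr_pairs(basis, "sw", n_samples=2)
for first, second in pairs:
    print(first.image_id, second.image_id, first.topic)
```

Each resulting pair would then be handed to a native-speaker annotator, who writes the hypothesis and assigns the true or false label.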
4.2 M5-VLOD
![Refer to caption](extracted/5710554/gfx/vlod_sample_sw_87_9fbc54f2a62f4cbbbe216e84565cbdd6.jpg)
With the M5-VLOD dataset, we introduce a novel multimodal task: Visio-Linguistic Outlier Detection. The objective of the task is to detect an outlier image from a set of images considering a textual statement. An example of the task is shown in Figure 4, where five images related to the topic "soap for hands and body" are shown. The machine-translated English statement is: "All the images show soap applied to the hands and body without anyone." Because only the first image shows a person, the statement is incorrect for the first image, which is therefore considered the outlier image.
The dataset was collected similarly to M5-VGR, as described in the previous section. The major difference is that instead of sampling only one image in the second step, we sample four images so that a sample for a given language comprises five images associated with five different languages that share one topic. In the third step, we asked the native-speaker annotators of the language to manually create a textual statement that is valid for all but one of the images, which is labeled as the outlier image.
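To make the task format concrete, the following sketch shows one possible representation of a VLOD sample and how predictions could be scored against uniform random guessing. The `VlodSample` record, the helper names, and the toy samples are hypothetical and do not reflect the released data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlodSample:
    language: str
    statement: str
    image_ids: tuple    # five images sharing one topic
    outlier_index: int  # position of the image for which the statement is false


def vlod_accuracy(samples, predictions):
    """Fraction of samples whose predicted outlier position is correct."""
    if not samples:
        raise ValueError("no samples")
    hits = sum(s.outlier_index == p for s, p in zip(samples, predictions))
    return hits / len(samples)


def random_baseline(num_images):
    """Expected accuracy when guessing the outlier uniformly at random."""
    return 1 / num_images


# Toy samples with made-up identifiers.
samples = [
    VlodSample("sw", "statement 1", ("i1", "i2", "i3", "i4", "i5"), 0),
    VlodSample("sw", "statement 2", ("i6", "i7", "i8", "i9", "i10"), 3),
]
accuracy = vlod_accuracy(samples, predictions=[0, 1])
print(accuracy, random_baseline(5))
```

With five images per sample, a model has to exceed one correct guess in five before its score says anything about outlier detection ability.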
5 General Results Discussion
Model | xGQA E | xGQA NE | MaXM E | MaXM NE | XVNLI E | XVNLI NE | MaRVL E | MaRVL NE | M5-VLOD E | M5-VLOD NE | M5-VGR E | M5-VGR NE | xFlickrCO E | xFlickrCO NE | XM3600 E | XM3600 NE | ALL E | ALL NE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CogVLM | | | | | | | | | | | | | | | | | | |
BakLLaVA | | | | | | | | | | | | | | | | | | |
LLaVA 1.6 7B | | | | | | | | | | | | | | | | | | |
LLaVA 1.5 7B | | | | | | | | | | | | | | | | | | |
Yi-VL 6B | | | | | | | | | | | | | | | | | | |
MiniCPM-V | | | | | | | | | | | | | | | | | | |
LLaVA 1.5 13B | | | | | | | | | | | | | | | | | | |
Qwen-VL | | | | | | | | | | | | | | | | | | |
Yi-VL 34B | | | | | | | | | | | | | | | | | | |
Gemini Pro V | | | | | | | | | | | | | | | | | | |
OmniLMM 12B | | | | | | | | | | | | | | | | | | |
LLaVA 1.6 13B | | | | | | | | | | | | | | | | | | |
mBliP BloomZ | | | | | | | | | | | | | | | | | | |
InternVL V1.1 | | | | | | | | | | | | | | | | | | |
LLaVA 1.6 34B | | | | | | | | | | | | | | | | | | |
mBliP mT0 | | | | | | | | | | | | | | | | | | |
InternVL V1.2+ | | | | | | | | | | | | | | | | | | |
GPT 4V | | | | | | | | | | | | | | | | | | |
Average | | | | | | | | | | | | | | | | | | |
This section discusses the models' performance on the datasets considered in our benchmark. Table 1 provides an overview of the performance in English (E) compared to non-English (NE) languages for all models and datasets. Note that we use friendly names for the models for better readability (see Table 8). Detailed results for each dataset and all their respective languages are provided in Appendix D.
5.1 Summary of Findings
Table 1 shows a clear pattern: Generally, LMMs perform significantly worse in non-English languages across all tasks. More specifically, the average performance across all models and datasets in English is versus in non-English languages. Most models have an average performance difference from English to non-English larger than or equal to . However, for GPT 4V and, despite their much smaller size, also for mBlip BloomZ and mBlip mT0, the difference is smaller than . For the two mBLIP models, the authors explicitly stated in their paper the language distribution in the training data, which covers languages. Hence, it can be assumed that this is the reason for this slight absolute performance difference, and, further, this might indicate that GPT 4V was also trained in a multilingual fashion. Given the differences in size and architecture between the mBlip models and GPT 4V (while the architecture of GPT 4V is not known, it is likely different from the mBlip models' architecture, which employs Q-Formers, rarely used in state-of-the-art LMMs), applying this multilingual training strategy to LMMs would likely lead to more robust multilingual performance in general.
The average performance difference of the models is most significant on the MaXM, XM3600, and xFlickrCo datasets, for which the models are required to generate non-English text.
Interestingly, for the M5-VLOD dataset, the models that performed worse than the random baseline of in English performed better in non-English languages. An explanation for this could be false assumptions drawn from the English text. This finding also explains why the average English versus non-English performance disparity across all models is equal for the dataset and lies around the random baseline, indicating the challenge introduced by our dataset.
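The English versus non-English comparison used throughout this section can be sketched as follows. This is an illustrative helper only: the function name `english_gap` and the scores are made up, and the aggregation is an assumption (an unweighted mean over non-English languages) rather than the exact procedure behind Table 1.

```python
from statistics import mean


def english_gap(scores_by_language):
    """Return (English score, mean non-English score, difference)."""
    english = scores_by_language["en"]
    others = [score for lang, score in scores_by_language.items() if lang != "en"]
    if not others:
        raise ValueError("need at least one non-English language")
    non_english = mean(others)
    return english, non_english, english - non_english


# Made-up per-language scores of one model on one dataset.
toy_scores = {"en": 0.8, "de": 0.6, "sw": 0.4}
english, non_english, gap = english_gap(toy_scores)
print(english, non_english, gap)
```

A positive difference means the model scores higher in English; a negative one corresponds to the M5-VLOD cases described above, where some models do better in non-English languages.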
5.1.1 Dataset-Specific Discussion
Note that due to brevity constraints, we report exact numbers and diagrams of the language-specific results for each dataset in Appendix D.
xGQA
All models mostly perform best in English, with a significant gap in accuracy to the second-best language from up to in English to in Russian for LLaVA 1.6 7B. In Bengali, where the models have the lowest average accuracy of , all models besides GPT 4V, which achieves , perform worst by far. The best-performing model in English and the best-performing model on average over all languages are the InternVL v1.2 and InternVL v1.1 models. Notably, despite their (estimated) much larger size, GPT 4V and Gemini Pro V are among the worst-performing models in English. After manually inspecting the results, we found the reason for this to be that the models did not respond in a single word but with a brief sentence, which is considered a false answer according to the applied metric (see Appendix A.2 and Section 8.2).
MaXM
The average accuracy of the models for Hindi (), Hebrew (), Romanian (), Thai (), and Chinese () is much lower than for English () and French (). It is also worth pointing out that most models, regardless of their size, perform remarkably worse in languages other than English (and French). In contrast, on xGQA, which is also a VQA dataset, the differences between the languages are much smaller. This is likely due to the difference between the two datasets, i.e., that xGQA has multilingual questions but only English answers, while MaXM has multilingual questions and expects the answers in the respective language, too. We further underline this in our language fidelity analysis in Section 6.3.
XVNLI
English accuracy is the best for most models, with an average of , whereas Arabic accuracy is the worst, with an average of . The performance drop from English to the other languages, i.e., Spanish, French, and Russian, with average accuracy scores of , , and , is less substantial. Note that XVNLI is an NLI dataset, i.e., the random baseline is at . All models surpass this baseline in all languages, except for CogVLM in Arabic () and French (). The best-performing model is GPT 4V with an average accuracy across all languages of , followed by LLaVA 1.6 34B and InternVL V1.2+ with average scores of and , respectively.
MaRVL
The dataset's random baseline is , which is often only slightly surpassed by most models, especially for Swahili and Tamil, with an average accuracy of and , respectively. Notably, only of models perform best in English, with an average accuracy of . For the other models, the English performance is surpassed by Chinese, Indonesian, or Turkish, with an average accuracy of , , and , respectively. GPT 4V is on par with LLaVA 1.6 34B despite the latter having far fewer parameters.
M5-VGR
As with MaRVL, this dataset's random baseline is at . Only one of models, i.e., InternVL V1.2+, could surpass or reach this baseline in all languages. As expected, most models performed best in English, German, or Russian, with average accuracies of , , and , respectively. They performed worst in low-resource languages such as Amharic, Berber, Bengali, Hausa, or Zulu, with an average accuracy of , , , and , respectively. Only three models, i.e., Gemini Pro V, mBlip mT0, and GPT 4V, consistently and significantly surpass the random baseline in all languages except for Berber. The only languages where the average performance is significantly higher than the random baseline are English (), German (), Russian (), and Thai (). The average scores of the other languages range from in Berber to in Hindi.
M5-VLOD
The dataset's random baseline is one in five since the models need to find the outlier within five images. Only GPT 4V and Gemini Pro V significantly surpassed that baseline in all languages, with an average accuracy of and , respectively. They achieve the best scores in English with an average accuracy of (GPT 4V) and (Gemini Pro V). However, in Berber, both models only achieve scores around the random baseline. No other model surpasses the random baseline by more than in any language, including English, with average scores between (CogVLM) and (InternVL V1.2+). This highlights the challenge introduced by our dataset and the performance gap between proprietary and open-source models.
xFlickrCO
The majority of models perform best in English, often with a significant margin in average chrF++, i.e., in English and in non-English languages. Other languages where the models perform comparably well are German and Spanish, with average chrF++ scores of and , respectively. Interestingly, all models perform worse in non-Latin script languages, i.e., Russian (), Chinese (), and Japanese (). Unexpectedly, the proprietary models GPT 4V and Gemini Pro V are surpassed by mBliP BloomZ, mBliP mT0, and InternVL V1.2+, which are much smaller open-source models. Even in English, most open-source models outperform the proprietary models.
XM3600
Note that due to limited resources, we evaluated GPT 4V only on a subset of of languages. Most models perform best in English ( average chrF++) by a large margin, followed by other Latin-script, high-resource languages such as French (), Spanish (), or Dutch (). On average, the models perform worst on non-Latin script languages like Korean (), Telugu (), and Bengali (). However, although the chrF++ metric claims to be script- and language-independent, the low scores in high-resource languages like Chinese () and Japanese () make the metric questionable. While a detailed analysis is out of the scope of this work, we will investigate this issue further in future work (see Section 8.1).
6 Aggregated Result Analyses
6.1 Performance per Language
Figure 5 shows the average performances aggregated by language or language taxonomy classes Joshi et al. (2020); for better readability, we do not show all languages of XM3600. These taxonomy classes indicate how well a respective language is represented and considered within the research field of NLP based on papers published at *CL conferences. High-resource languages such as English or German are in Class 5, whereas low-resource languages such as Berber are in Class 0. For details about the languages and their taxonomy classes, please refer to Table 12.
As can be observed from Figure 5(a) and Figure 5(b), the models perform best in English, followed by other European languages across all datasets. Our newly presented M5-VLOD dataset is an exception, where the average performance for all languages is around the random baseline, indicating the challenge it implies. As expected, the models consistently perform worse on low-resource languages than on high-resource languages on all datasets. This is also displayed in Figure 5(c), where it can be observed that the average performance decreases with the language taxonomy class. Note that this is not precisely true for xFlickrCO and XVNLI because the average on Class-5 languages is lowered by outliers, as indicated by the large error bars. In contrast, the models performed comparably well in only one Class 3 or 4 language, respectively.
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
6.2 Performance vs. Model Parameters
In Figure 6, we plot the English and non-English average performance on the employed datasets versus the models' sizes in multiple regression plots. Note that, on the x-axes, we indicated the unknown sizes of GPT 4V and Gemini Pro V by "???". Both are estimated to be magnitudes larger than all other models evaluated in this benchmark and hence should be placed much further to the right; we did not do so to keep the plots readable.
In the figures, we can make several observations: Firstly, the average English performance is higher than the non-English performance for all models on all datasets. Secondly, the markers, which represent the average performance of a specific model on a dataset, show that the largest model does not always perform best and that the difference between smaller and larger models is often negligible. The same finding is shown by the relatively flat slope of the regression lines. However, for the M5-VLOD and M5-VGR datasets, the regression line for the average English scores is steeper, meaning that larger models perform considerably better than the smaller models. Since this work introduces both datasets, and M5-VLOD even introduces a novel task, this suggests that larger models can better generalize to unseen data.
![Refer to caption](x6.png)
![Refer to caption](x7.png)
6.3 Language Fidelity Analysis
Inspired by Holtermann et al. (2024), we report the results of a language fidelity analysis, which assesses how often a model responds in the requested language on average. For this, we used GlotLIDv3 Kargaran et al. (2023) to predict the language based on the output text of the respective models. Since it is hard to predict the language of a word or a multi-word expression due to ambiguity, we selected the xFlickrCO dataset, where the expected response of a model is an image caption, i.e., a sentence, in one of eight languages. As can be observed from Table 2, all models achieve (almost) perfect fidelity in English, whereas for Japanese, Russian, and Turkish, the average fidelity drops to two-thirds. Interestingly, the small-sized mBLIP models have almost perfect fidelity in all languages, (slightly) surpassing larger models like InternVL V1.2+ and GPT 4V.
Model | zh | en | de | id | ja | ru | es | tr | Avg. |
---|---|---|---|---|---|---|---|---|---|
BakLLaVA | | | | | | | | | |
Yi-VL 6B | | | | | | | | | |
Qwen-VL | | | | | | | | | |
Yi-VL 34B | | | | | | | | | |
CogVLM | | | | | | | | | |
LLaVA 1.5 13B | | | | | | | | | |
LLaVA 1.5 7B | | | | | | | | | |
MiniCPM-V | | | | | | | | | |
LLaVA 1.6 7B | | | | | | | | | |
InternVL V1.1 | | | | | | | | | |
OmniLMM 12B | | | | | | | | | |
Gemini Pro | | | | | | | | | |
LLaVA 1.6 13B | | | | | | | | | |
LLaVA 1.6 34B | | | | | | | | | |
GPT 4V | | | | | | | | | |
InternVL V1.2+ | | | | | | | | | |
mBliP BloomZ | | | | | | | | | |
mBliP mT0 | | | | | | | | | |
Avg. | | | | | | | | | |
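The fidelity score described above is simply the share of responses whose detected language equals the requested one. A minimal sketch, assuming a generic `detect_language` callable: the `toy_detector` below is a hypothetical stand-in for a real language identifier such as GlotLID and only distinguishes German from English via a few marker words.

```python
def language_fidelity(responses, requested_language, detect_language):
    """Share of responses whose detected language equals the requested one."""
    if not responses:
        raise ValueError("no responses")
    hits = sum(detect_language(text) == requested_language for text in responses)
    return hits / len(responses)


def toy_detector(text):
    # Hypothetical stand-in for a real language identification model.
    german_markers = {"ein", "eine", "der", "die", "das", "und"}
    words = set(text.lower().split())
    return "de" if words & german_markers else "en"


# Made-up model outputs for a prompt that requested German captions.
captions = [
    "Ein Hund auf der Wiese",
    "A dog on a meadow",
    "Eine Frau und ein Kind",
]
fidelity = language_fidelity(captions, "de", toy_detector)
print(fidelity)
```

Passing the detector as an argument keeps the metric independent of any particular language identification library.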
While the language fidelity of a model focuses on the generated text, we argue that the fidelity is also an indicator of the model's general language capabilities. To test this hypothesis, we computed Pearson correlation coefficients between the reported fidelity and the models' performance on the datasets for the xFlickrCO languages. As shown in Table 17, there is a moderate or high positive correlation between the average fidelity and the average score for most datasets. However, for xGQA and M5-VLOD, there is only a minor positive average correlation.
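For reference, the Pearson correlation coefficient between two paired lists (for example, per-language fidelity and per-language task score) can be computed as in the sketch below. The function name and the toy values are illustrative and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up per-language fidelity and score values.
toy_fidelity = [1.0, 0.9, 0.6, 0.7]
toy_score = [0.8, 0.7, 0.4, 0.5]
print(pearson(toy_fidelity, toy_score))
```

Values close to 1 indicate that languages with higher fidelity also tend to receive higher task scores; values near 0 indicate no linear relationship.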
7 Conclusion
We introduced M5, a diverse benchmark in which we evaluated Large Multimodal Models (LMMs) with varying sizes across five visio-linguistic tasks in eight datasets comprising unique languages. Further, we presented two novel datasets, M5-VGR and M5-VLOD, which focus on underrepresented languages and depict culturally diverse scenes. With M5-VLOD, we introduce a new visio-linguistic outlier detection task in which only proprietary models achieve reasonable scores. Our experiments revealed that model size does not always correlate with better performance, especially in non-English languages, underscoring the importance of diverse, multilingual training data and robust architectures. Performance disparities were prominent between high-resource languages like English and low-resource languages across all datasets and models, highlighting ongoing challenges in achieving globally equitable multilingual AI. With M5, we aim to impel the development of more inclusive models suitable for diverse languages and cultures.
8 Limitations
This section outlines several limitations of our current study that will be addressed in future work.
8.1 Metrics for Multilingual Image Captioning
Our benchmark and current research generally lack robust metrics for evaluating multilingual image captioning, especially for non-Latin script languages. The issue, which is the same for machine translation tasks, arises because of the nature of most metrics, such as chrF Popović (2017), CIDEr Vedantam et al. (2015), ROUGE Lin (2004), BLEU Papineni et al. (2002), or METEOR Banerjee and Lavie (2005), which are based on comparing word or character n-grams between the source and target sequence. For non-Latin scripts, tokenization or segmentation can be challenging because the text might not contain spaces or punctuation, or the characters are logographic. Hence, the usability and effectiveness of these metrics are doubtful in such scenarios because they rely on tokenization.
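To illustrate what comparing character n-grams means in practice, the sketch below implements a simplified character n-gram F-score in the spirit of chrF. It is an assumption-laden approximation for illustration only: it ignores the word n-grams that chrF++ adds, removes spaces, and averages precision and recall over n-gram orders before combining them, so its values are not comparable to scores from reference implementations.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams of `text` with spaces removed."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score between two strings."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("a cat sits on the mat", "a cat sat on the mat"))
```

Because every character counts as one unit, a caption in a logographic script yields far fewer n-grams per unit of meaning than a caption in an alphabetic script, which is one reason why such scores are hard to compare across scripts.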
Other metrics, such as BERTScore Zhang et al. (2020), CLIPScore Hessel et al. (2021), or COMET Rei et al. (2020), do not rely on the captions' surface forms but on their token or sentence embeddings. However, they suffer from other issues: They require strong multilingual or cross-lingual encoder models capable of computing embeddings for many languages, which itself is a challenging task. Further, the scores computed with these metrics are often not calibrated across languages and thus not directly comparable between different languages.
A promising and currently popular solution might be the use of robust multilingual state-of-the-art LLMs such as GPT 4o (https://openai.com/index/hello-gpt-4o/), Claude 3 Opus (https://www.anthropic.com/news/claude-3-family), or Gemini 1.5 Ultra (https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/) as a judge Zheng et al. (2024). However, this would require more computational and financial resources and, most importantly, more investigation.
8.2 VQA Metrics for Generative Models
The problem when employing and evaluating generative language models on question-answering tasks is that the models can generally output arbitrary token sequences. However, the gold label answers are limited and often comprise only a short phrase, a single word, or even a binary label. Hence, mapping the predicted answers to their gold labels is not straightforward, and the difficulty drastically increases in multilingual scenarios. The relaxed accuracy metric employed in this study (see Section A.1) has been found to occasionally incorrectly classify correct answers, leading to false negatives, especially in open vocabulary visual question answering (VQA). One way to address this issue is to leverage strong state-of-the-art LLMs as judges, as described above, to enhance the accuracy of the evaluations.
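The false-negative problem can be illustrated with a small sketch. The matching rule below (exact match after lowercasing and stripping ASCII punctuation) is an assumption chosen for illustration and is not necessarily the relaxed accuracy defined in the appendix; the helper names and examples are hypothetical.

```python
import string


def normalize(answer):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def relaxed_match(prediction, gold):
    """True if prediction and gold answer agree after normalization."""
    return normalize(prediction) == normalize(gold)


def relaxed_accuracy(predictions, golds):
    """Fraction of predictions that match their gold answer."""
    if not golds:
        raise ValueError("no gold answers")
    hits = sum(relaxed_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# The second prediction is semantically correct but answers in a sentence,
# so a string-matching metric counts it as wrong (a false negative).
predictions = ["Cat.", "It is a cat."]
golds = ["cat", "cat"]
print(relaxed_accuracy(predictions, golds))
```

Note that `string.punctuation` covers ASCII punctuation only, so a multilingual variant would need script-aware normalization, which is part of why the problem becomes harder across languages.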
8.3 Influence of Prompting
Another limitation of this and most, if not all, other current studies lies in the model prompting. Since different models might react differently to specific prompting styles, and we employ only a single prompt per dataset for all models (see Figure 7; we do apply the model-specific prompt or chat templates, though), the results might not be optimal. This issue has been partially addressed by Ahuja et al. (2023a) but is out of the scope of this work.
8.4 “Outdated” Models
Since the pace of current research in NLP, CV, and multimodal machine learning is swift, the models employed in our benchmarking exercise might be considered slightly outdated. Note that we considered models released until March 2024. Since then, numerous improved LMMs based on state-of-the-art LLMs, such as Llama3 (https://ai.meta.com/blog/meta-llama-3), and novel image encoder techniques, such as NaViT Dehghani et al. (2024), have been publicly released. Because this was foreseeable, we designed our benchmark to be easily extendable with newer models, which we will include in future work.
8.5 Small M5 Datasets
This work introduced two datasets, M5-VGR and M5-VLOD, which comprise about samples for each of the languages. Compared to other datasets, they can be considered small. We will increase their sizes in future work to obtain more robust and generalizable results.
8.6 Missing Multimodal and Multilingual Datasets
Currently, the M5 Benchmark comprises text-image tasks, i.e., VQA, VGR, VNLI, and image captioning, thus missing other suitable tasks like multimodal and multilingual summarization. Further, other multimodal multilingual VQA and VGR datasets have emerged while writing this paper. We will include both new tasks and new datasets in future versions of the M5.
References
- Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally On Your Phone. ArXiv, 2404.14219.
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. ArXiv, 2303.08774.
- Ahuja et al. (2023a) Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023a. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore.
- Ahuja et al. (2023b) Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, et al. 2023b. MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks. arXiv preprint arXiv:2311.07463.
- AI et al. (2024) 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. Yi: Open Foundation Models by 01.AI. Preprint, arXiv:2403.04652.
- Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: A Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- Anil et al. (2023) Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A Family of Highly Capable Multimodal Models. ArXiv, 2312.11805.
- Awadalla et al. (2023) Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv preprint arXiv:2308.01390.
- Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
- Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness From AI Feedback. ArXiv, 2212.08073.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
- bench authors (2023) BIG bench authors. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
- Bugliarello et al. (2022) Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulić. 2022. IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 2370–2392.
- Changpinyo et al. (2023) Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, and Radu Soricut. 2023. MaXM: Towards Multilingual Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2667–2682, Singapore.
- Chen et al. (2023) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2023. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv preprint arXiv:2312.14238.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
- Dehghani et al. (2024) Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. 2024. Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. Advances in Neural Information Processing Systems, 36.
- Eichenberg et al. (2022) Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2416–2428, Abu Dhabi, United Arab Emirates.
- Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
- Gaviria Rojas et al. (2022) William Gaviria Rojas, Sudnya Diamos, Keertan Kini, David Kanter, Vijay Janapa Reddi, and Cody Coleman. 2022. The Dollar Street Dataset: Images Representing the Geographic and Socioeconomic Diversity of the World. Advances in Neural Information Processing Systems, 35:12979–12990.
- Ge et al. (2024) Zhang Ge, Du Xinrun, Chen Bei, Liang Yiming, Luo Tongxu, Zheng Tianyu, Zhu Kang, Cheng Yuyang, Xu Chunpu, Guo Shuyue, Zhang Haoran, Qu Xingwei, Wang Junjie, Yuan Ruibin, Li Yizhi, Wang Zekun, Liu Yudong, Tsai Yu-Hsuan, Zhang Fengji, Lin Chenghua, Huang Wenhao, Chen Wenhu, and Fu Jie. 2024. CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark. arXiv preprint arXiv:2401.20847.
- Geigle et al. (2023) Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2023. mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. ArXiv, 2307.06930.
- Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. ArXiv, 2306.11644.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
- Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic.
- Holtermann et al. (2024) Carolin Holtermann, Paul Röttger, Timm Dill, and Anne Lauscher. 2024. Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ. arXiv preprint arXiv:2403.03814.
- Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. 2024. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. arXiv preprint arXiv:2404.06395.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. ArXiv, 2310.06825.
- Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online.
- Kargaran et al. (2023) Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. 2023. GlotLID: Language Identification for Low-Resource Languages. In The 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.
- Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision (IJCV), 123(1):32–73.
- Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19730–19742.
- Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755, Zurich, Switzerland.
- Liu et al. (2021) Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. 2021. Visually Grounded Reasoning across Languages and Cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic.
- Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved Baselines with Visual Instruction Tuning. ArXiv, 2310.03744.
- Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916, New Orleans, LA, USA.
- OpenAI (2023) OpenAI. 2023. GPT-4 Vision System Card.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
- Pfeiffer et al. (2022) Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych. 2022. xGQA: Cross-Lingual Visual Question Answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2497–2511, Dublin, Ireland.
- Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online.
- Thapliyal et al. (2022) Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv, 2307.09288.
- Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, Boston, MA, USA.
- Wang et al. (2023) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. 2023. CogVLM: Visual Expert for Pretrained Language Models. ArXiv, 2311.03079.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions. Transactions of the Association for Computational Linguistics (TACL), 2:67–78.
- Yu et al. (2023) Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. 2023. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback. arXiv preprint arXiv:2312.00849.
- Yuan et al. (2023) Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, Chen Kai, and Lin Dahua. 2023. MMBench: Is Your Multi-modal Model an All-around Player? arXiv:2307.06281.
- Yue et al. (2023) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2023. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv preprint arXiv:2311.16502.
- Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, Online.
- Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-Judge with MT-Bench and ChatBot Arena. Advances in Neural Information Processing Systems, 36.
Appendix A Experimental Setup Details
This section details the employed metrics, prompts, and generation hyperparameters.
Note that we ran all experiments on A6000 (GB) and A100 (GB) GPUs. The largest evaluated model (B) fits on an A100.
A.1 Metrics
Following Geigle et al. (2023), we report a relaxed accuracy metric for the xGQA, MaXM, XVNLI, and MaRVL datasets due to the generative nature of the considered models. More specifically, we post-process the generated answers by, e.g., lowercasing, stripping whitespace, or removing punctuation. We then consider the processed generated answer correct if it matches the gold answer or starts or ends with the gold answer. Further, we allow synonyms for boolean and numerical values. Examples can be found in Table A.2.
A.2 Relaxed Accuracy Metric
Generated Answer | Gold Answer | Considered Correct |
---|---|---|
{Yes, 1, True} | true | yes |
{No, 0, False} | false | yes |
A car. | car | yes |
Yes, it is correct. | yes | yes |
It is not correct, no. | no | yes |
The color of the leaf is green. | green | yes |
There are three birds. | three birds | yes |
Five | 5 | yes |
{yes, true} | entailment | yes |
{no, false} | contradiction | yes |
maybe | neutral | yes |
There are three birds in the image. | three birds | no |
There are three birds. | 3 | no |
three birds | 3 | no |
three birds | 3 birds | no |
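The matching rules and the examples in the table can be approximated with the following minimal sketch. It is not the benchmark's implementation: the function names, the synonym tables (inferred from the example rows), the restriction to ASCII punctuation, and the choice to match the gold answer only at word boundaries are all assumptions made for illustration.

```python
import string

# Assumed synonym groups, inferred from the example table (hypothetical).
_LABEL_SYNONYMS = {
    "true": {"yes", "1"},
    "false": {"no", "0"},
    "entailment": {"yes", "true"},
    "contradiction": {"no", "false"},
    "neutral": {"maybe"},
}
_NUMBER_WORDS = {
    str(i): word
    for i, word in enumerate(
        ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
    )
}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)  # ASCII only


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_STRIP_PUNCT).split())


def relaxed_match(generated: str, gold: str) -> bool:
    """True if the generated answer counts as correct for the gold answer."""
    pred, ref = normalize(generated), normalize(gold)
    if not pred or not ref:
        return False
    # Whole-answer match, including boolean/label and numeric synonyms.
    accepted = {ref} | _LABEL_SYNONYMS.get(ref, set())
    if ref in _NUMBER_WORDS:
        accepted.add(_NUMBER_WORDS[ref])
    if pred in accepted:
        return True
    # Otherwise the answer must start or end with the literal gold answer.
    return pred.startswith(ref + " ") or pred.endswith(" " + ref)


def relaxed_accuracy(generated: list, gold: list) -> float:
    """Fraction of generated answers accepted by `relaxed_match`."""
    if not gold:
        return 0.0
    hits = sum(relaxed_match(g, r) for g, r in zip(generated, gold))
    return hits / len(gold)
```

For instance, `relaxed_match("It is not correct, no.", "no")` is accepted, whereas `relaxed_match("There are three birds in the image.", "three birds")` is rejected because the gold answer appears only in the middle of the generated text. The last rows of the table are examples of the false negatives discussed in the limitations.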
A.3 Prompts
Figure 7 presents the dataset-specific textual prompts we used for all models in this benchmark. Note that this does not include model-specific prompt templates, image placeholders, special tags, or symbols, only the “raw” textual prompt, which is then embedded in the template as required by the respective model. The placeholders {QUESTION}, {LANGUAGE}, or {HYPOTHESIS} are replaced by the sample-specific text. The prompts are partially inspired by Geigle et al. (2023) or Bugliarello et al. (2022).
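The two-step procedure (fill the placeholders of the raw prompt, then embed it in a model-specific template) can be sketched as follows. The raw prompt text and the chat template shown here are hypothetical stand-ins; the prompts actually used are those in Figure 7, and each model defines its own template.

```python
# Hypothetical raw prompt; the real dataset-specific prompts are in Figure 7.
RAW_PROMPT = "Question: {QUESTION} Answer briefly in {LANGUAGE}."

# Hypothetical chat template with an image placeholder; real templates
# are model-specific and not reproduced here.
CHAT_TEMPLATE = "USER: <image>\n{prompt}\nASSISTANT:"


def fill_prompt(raw_prompt: str, **fields: str) -> str:
    """Replace {PLACEHOLDER} fields with the sample-specific text."""
    return raw_prompt.format(**fields)


def apply_template(prompt: str, template: str = CHAT_TEMPLATE) -> str:
    """Embed the filled raw prompt in a model-specific template."""
    return template.format(prompt=prompt)


filled = fill_prompt(
    RAW_PROMPT,
    QUESTION="What color is the leaf?",
    LANGUAGE="English",
)
model_input = apply_template(filled)
print(model_input)
```

Keeping the raw prompt separate from the template means the same dataset prompt can be reused unchanged across models, which matches the single-prompt-per-dataset setup described above. A missing placeholder value raises a `KeyError` instead of silently leaving the field unfilled.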
A.4 Hyperparameters
This section briefly reports hyperparameters used within our experiments for better reproducibility.
A.4.1 Generation Parameters
We used the same generation hyperparameters to generate responses with all the employed open-source models on all datasets (see Table 4). These are inspired by the default parameters in the “transformers” library (https://huggingface.co/docs/transformers/en/main_classes/text_generation). Because beam search is not supported for CogVLM, we set “num_beams” to . For GPT 4V and Gemini Pro V, we use the default parameters of the respective Python clients.
Parameter | Value |
---|---|
num_beams | |
do_sample | True |
max_new_tokens | |
temperature | |
top_k | |
top_p |
A.4.2 Image Order for Multi-Image Datasets
Most models employed in our benchmark support only a single image per prompt. For datasets where a sample comprises more than one image, i.e., for MaRVL, M5-VGR, and M5-VLOD, we use the following strategy: We first stack the images horizontally with a gutter of pixels, provide them as a single image in the prompt, and generate the response. Then, we do the same again but stack the images vertically. For M5-VLOD, we also create a stacked image with two columns and three rows. The reported scores are the average over all variants.
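The geometry of the three stacking variants can be sketched without any imaging library. The function `grid_layout`, the row-major ordering, the alignment of images at the top-left of their cells, and the gutter value used in the example are assumptions made for illustration; the gutter width actually used is the one stated above.

```python
from typing import List, Tuple

Size = Tuple[int, int]    # (width, height) in pixels
Offset = Tuple[int, int]  # (x, y) of an image's top-left corner


def grid_layout(sizes: List[Size], columns: int,
                gutter: int) -> Tuple[Size, List[Offset]]:
    """Canvas size and paste offsets for a row-major grid of images."""
    if not sizes:
        raise ValueError("at least one image is required")
    columns = max(1, min(columns, len(sizes)))
    rows = [sizes[i:i + columns] for i in range(0, len(sizes), columns)]
    # Each column is as wide as its widest image, each row as tall as
    # its tallest image.
    col_widths = [
        max(row[c][0] for row in rows if c < len(row))
        for c in range(columns)
    ]
    row_heights = [max(height for _, height in row) for row in rows]

    offsets = []
    y = 0
    for row, row_height in zip(rows, row_heights):
        x = 0
        for c in range(len(row)):
            offsets.append((x, y))
            x += col_widths[c] + gutter
        y += row_height + gutter

    canvas = (
        sum(col_widths) + gutter * (columns - 1),
        sum(row_heights) + gutter * (len(rows) - 1),
    )
    return canvas, offsets


sizes = [(100, 50), (80, 60)]
GUTTER = 10  # hypothetical value, for illustration only

horizontal = grid_layout(sizes, columns=len(sizes), gutter=GUTTER)
vertical = grid_layout(sizes, columns=1, gutter=GUTTER)
print(horizontal, vertical)
```

Horizontal stacking corresponds to `columns=len(sizes)`, vertical stacking to `columns=1`, and the additional M5-VLOD variant to `columns=2`. The returned offsets could then be passed to an imaging library to paste each image onto a blank canvas, and the score of a sample would be averaged over the resulting variants.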
Appendix B Dataset Details
B.1 M5-VGR and M5-VLOD Details
B.1.1 M5-VGR Examples
![Refer to caption](x8.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
B.1.2 M5-VLOD Examples
![Refer to caption](x20.png)
![Refer to caption](x21.png)
![Refer to caption](x22.png)
![Refer to caption](x23.png)
![Refer to caption](x24.png)
![Refer to caption](x25.png)
![Refer to caption](x26.png)
![Refer to caption](x27.png)
![Refer to caption](x28.png)
![Refer to caption](x29.png)
![Refer to caption](x30.png)
![Refer to caption](x31.png)
B.1.3 Topics
Topic | Language | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Amharic | Berber | Bengali | German | English | Filipino | Hausa | Hindi | Russian | Swahili | Thai | Zulu | |||||||||||||
A | B | A | B | A | B | A | B | A | B | A | B | A | B | A | B | A | B | A | B | A | B | A | B | |
armchair | ||||||||||||||||||||||||
backyard | ||||||||||||||||||||||||
bathroom privacy | ||||||||||||||||||||||||
bathroom/toilet | ||||||||||||||||||||||||
bed | ||||||||||||||||||||||||
bedroom | ||||||||||||||||||||||||
books | ||||||||||||||||||||||||
ceiling | ||||||||||||||||||||||||
children room | ||||||||||||||||||||||||
cleaning equipment | ||||||||||||||||||||||||
cooking pots | ||||||||||||||||||||||||
cooking utensils | ||||||||||||||||||||||||
couch | ||||||||||||||||||||||||
cups/mugs/glasses | ||||||||||||||||||||||||
cutlery | ||||||||||||||||||||||||
dish racks | ||||||||||||||||||||||||
dish washing brush/cloth | ||||||||||||||||||||||||
dish washing soap | ||||||||||||||||||||||||
drainage | ||||||||||||||||||||||||
drinking water | ||||||||||||||||||||||||
drying | ||||||||||||||||||||||||
everyday shoes | ||||||||||||||||||||||||
family | ||||||||||||||||||||||||
floor | ||||||||||||||||||||||||
front door | ||||||||||||||||||||||||
grains | ||||||||||||||||||||||||
guest bed | ||||||||||||||||||||||||
hair brush/comb | ||||||||||||||||||||||||
hand back | ||||||||||||||||||||||||
hand palm | ||||||||||||||||||||||||
hand washing | ||||||||||||||||||||||||
home | ||||||||||||||||||||||||
jewelry | ||||||||||||||||||||||||
kitchen | ||||||||||||||||||||||||
kitchen sink | ||||||||||||||||||||||||
light source in kitchen | ||||||||||||||||||||||||
light source in livingroom | ||||||||||||||||||||||||
living room | ||||||||||||||||||||||||
lock on front door | ||||||||||||||||||||||||
make up | ||||||||||||||||||||||||
meat or fish | ||||||||||||||||||||||||
medication | ||||||||||||||||||||||||
most loved item | ||||||||||||||||||||||||
most loved toy | ||||||||||||||||||||||||
nicest shoes | ||||||||||||||||||||||||
oven | ||||||||||||||||||||||||
paper | ||||||||||||||||||||||||
pen/pencils | ||||||||||||||||||||||||
phone | ||||||||||||||||||||||||
place where eating dinner | ||||||||||||||||||||||||
plate of food | ||||||||||||||||||||||||
plates | ||||||||||||||||||||||||
play area | ||||||||||||||||||||||||
power outlet | ||||||||||||||||||||||||
refrigerator | ||||||||||||||||||||||||
roof | ||||||||||||||||||||||||
shampoo | ||||||||||||||||||||||||
shower | ||||||||||||||||||||||||
sitting area | ||||||||||||||||||||||||
soap for hands and body | ||||||||||||||||||||||||
social drink | ||||||||||||||||||||||||
sofa | ||||||||||||||||||||||||
source of cool | ||||||||||||||||||||||||
spices | ||||||||||||||||||||||||
storage room | ||||||||||||||||||||||||
stove/hob | ||||||||||||||||||||||||
street detail | ||||||||||||||||||||||||
street view | ||||||||||||||||||||||||
switch on/off | ||||||||||||||||||||||||
table with food | ||||||||||||||||||||||||
teeth | ||||||||||||||||||||||||
toilet | ||||||||||||||||||||||||
toilet paper | ||||||||||||||||||||||||
tooth paste | ||||||||||||||||||||||||
toothbrush | ||||||||||||||||||||||||
toys | ||||||||||||||||||||||||
trash/waste | ||||||||||||||||||||||||
tv | ||||||||||||||||||||||||
vegetables | ||||||||||||||||||||||||
wall | ||||||||||||||||||||||||
wall clock | ||||||||||||||||||||||||
wall decoration | ||||||||||||||||||||||||
wall inside | ||||||||||||||||||||||||
wardrobe | ||||||||||||||||||||||||
washing clothes/cleaning | ||||||||||||||||||||||||
washing detergent | ||||||||||||||||||||||||
water outlet |
B.2 Dataset Language Details
Language | Script | MaXM | xGQA | XVNLI | MaRVL | M5-VLOD | M5-VGR | xFlickrCO | XM3600 |
---|---|---|---|---|---|---|---|---|---|
Amharic | Ethiopic | no | no | no | no | yes | yes | no | no |
Arabic | Arabic | no | no | yes | no | no | no | no | yes |
Bengali | Bengali | no | yes | no | no | yes | yes | no | yes |
Berber | Tifinagh | no | no | no | no | yes | yes | no | no |
Chinese | Hanzi | yes | yes | no | yes | no | no | yes | yes |
Croatian | Latin | no | no | no | no | no | no | no | yes |
Czech | Latin | no | no | no | no | no | no | no | yes |
Danish | Latin | no | no | no | no | no | no | no | yes |
Dutch | Latin | no | no | no | no | no | no | no | yes |
English | Latin | yes | yes | yes | no | yes | yes | yes | yes |
Filipino | Latin | no | no | no | no | yes | yes | no | yes |
Finnish | Latin | no | no | no | no | no | no | no | yes |
French | Latin | yes | no | yes | no | no | no | no | yes |
German | Latin | no | yes | no | no | yes | yes | yes | yes |
Greek | Greek | no | no | no | no | no | no | no | yes |
Hausa | Latin | no | no | no | no | yes | yes | no | no |
Hebrew | Hebrew | yes | no | no | no | no | no | no | yes |
Hindi | Devanagari | yes | no | no | no | yes | yes | no | yes |
Hungarian | Latin | no | no | no | no | no | no | no | yes |
Indonesian | Latin | no | yes | no | yes | no | no | yes | yes |
Italian | Latin | no | no | no | no | no | no | no | yes |
Japanese | Japanese | no | no | no | no | no | no | yes | yes |
Korean | Hangul | no | yes | no | no | no | no | no | yes |
Maori | Latin | no | no | no | no | no | no | no | yes |
Norwegian | Latin | no | no | no | no | no | no | no | yes |
Persian | Perso-Arabic | no | no | no | no | no | no | no | yes |
Polish | Latin | no | no | no | no | no | no | no | yes |
Portuguese | Latin | no | yes | no | no | no | no | no | yes |
Quechua | Latin | no | no | no | no | no | no | no | yes |
Romanian | Latin | yes | no | no | no | no | no | no | yes |
Russian | Cyrillic | no | yes | yes | no | yes | yes | yes | yes |
Spanish | Latin | no | no | yes | no | no | no | yes | yes |
Swahili | Latin | no | no | no | yes | yes | yes | no | yes |
Swedish | Latin | no | no | no | no | no | no | no | yes |
Tamil | Tamil | no | no | no | yes | no | no | no | no |
Telugu | Telugu | no | no | no | no | no | no | no | yes |
Thai | Thai | yes | no | no | no | yes | yes | no | yes |
Turkish | Latin | no | no | no | yes | no | no | yes | yes |
Ukrainian | Cyrillic | no | no | no | no | no | no | no | yes |
Vietnamese | Latin | no | no | no | no | no | no | no | yes |
Zulu | Latin | no | no | no | no | yes | yes | no | no |
Unique Languages | | 7 | 8 | 5 | 5 | 12 | 12 | 8 | 36 |
Unique Scripts | | 4 | 5 | 3 | 3 | 7 | 7 | 4 | 12 |
B.3 Language Details
Language | ISO 639 | Lang. Family | Script | Continent | Subregion | Taxonomy | Speakers (millions) |
---|---|---|---|---|---|---|---|
Arabic | ar | Afro-Asiatic | Arabic | Africa & Asia | North Africa & Middle East | 5 | 630.00 |
Chinese | zh | Sino-Tibetan | Hanzi | Asia | Northeastern Asia | 5 | 1330.00 |
English | en | Indo-European | Latin | America | North America | 5 | 1457.00 |
French | fr | Indo-European | Latin | Europe | Western Europe | 5 | 310.00 |
German | de | Indo-European | Latin | Europe | Western Europe | 5 | 175.00 |
Japanese | ja | Japonic | Japanese | Asia | Northeastern Asia | 5 | 128.00 |
Spanish | es | Indo-European | Latin | Europe | Southern Europe | 5 | 600.00 |
Croatian | hr | Indo-European | Latin | Europe | Central & Eastern Europe | 4 | 6.80 |
Czech | cs | Indo-European | Latin | Europe | Central & Eastern Europe | 4 | 11.00 |
Dutch | nl | Indo-European | Latin | Europe | Western Europe | 4 | 30.00 |
Finnish | fi | Uralic | Latin | Europe | Northern Europe | 4 | 5.80 |
Hindi | hi | Indo-European | Devanagari | Asia | Central & South Asia | 4 | 600.00 |
Hungarian | hu | Uralic | Latin | Europe | Central & Eastern Europe | 4 | 17.00 |
Italian | it | Indo-European | Latin | Europe | Southern Europe | 4 | 68.00 |
Korean | ko | Koreanic | Hangul | Asia | Northeastern Asia | 4 | 82.00 |
Persian | fa | Indo-European | Perso-Arabic | Asia | Middle East | 4 | 130.00 |
Polish | pl | Indo-European | Latin | Europe | Central & Eastern Europe | 4 | 41.00 |
Portuguese | pt | Indo-European | Latin | Europe & America | Southern Europe & South America | 4 | 360.00 |
Russian | ru | Indo-European | Cyrillic | Asia | Central Asia | 4 | 260.00 |
Swedish | sv | Indo-European | Latin | Europe | Northern Europe | 4 | 13.00 |
Turkish | tr | Turkic | Latin | Asia | Middle East | 4 | 90.00 |
Vietnamese | vi | Austroasiatic | Latin | Asia | Southeastern Asia | 4 | 85.00 |
Bengali | bn | Indo-European | Bengali | Asia | Central & South Asia | 3 | 270.00 |
Danish | da | Indo-European | Latin | Europe | Western Europe | 3 | 6.00 |
Filipino | fil | Austronesian | Latin | Asia | Southeastern Asia | 3 | 83.00 |
Greek | el | Indo-European | Greek | Europe | Central & Eastern Europe | 3 | 13.50 |
Hebrew | he & iw | Afro-Asiatic | Hebrew | Asia | Middle East | 3 | 9.00 |
Indonesian | id | Austronesian | Latin | Asia | Southeastern Asia | 3 | 300.00 |
Romanian | ro | Indo-European | Latin | Europe | Central & Eastern Europe | 3 | 28.50 |
Tamil | ta | Dravidian | Tamil | Asia | Central & South Asia | 3 | 86.00 |
Thai | th | Kra-Dai | Thai | Asia | Southeastern Asia | 3 | 80.00 |
Ukrainian | uk | Indo-European | Cyrillic | Europe | Central & Eastern Europe | 3 | 32.80 |
Amharic | am | Afro-Asiatic | Ethiopic | Africa | Eastern Africa | 2 | 57.00 |
Hausa | ha | Afro-Asiatic | Latin | Africa | Western Africa | 2 | 79.00 |
Swahili | sw | Niger-Congo | Latin | Africa | Eastern Africa | 2 | 73.00 |
Zulu | zu | Niger-Congo | Latin | Africa | Southern Africa | 2 | 28.00 |
Maori | mi | Austronesian | Latin | Australia & Oceania | Australia & Oceania | 1 | 0.19 |
Norwegian | no | Indo-European | Latin | Europe | Northern Europe | 1 | 4.32 |
Quechua | quz | Quechuan | Latin | America | South America | 1 | 9.00 |
Telugu | te | Dravidian | Telugu | Asia | Central & South Asia | 1 | 96.00 |
Berber | ber | Afro-Asiatic | Tifinagh | Africa | Northern Africa | 0 | 26.20 |
Appendix C Model Details
Model | LM | VM | MM | \|Total\| | \|LM\| | \|VM\| | \|MM\| |
---|---|---|---|---|---|---|---|
MiniCPM-V [27; 49] | MiniCPM-2B | SigLIP 400M | MLP | ||||
mBliP mT0 [22] | Flan-T5-XL | EVA01 CLIP-ViT-g | QFormer | ||||
Yi-VL 6B [5] | Yi-6B-Chat | CLIP-ViT-H-14 | MLP | ||||
LLaVA 1.6 7B [37] | Vicuna-7B-v1.5 | CLIP-ViT-L | MLP | ||||
LLaVA 1.5 7B [38] | Vicuna-7B-v1.5 | CLIP-ViT-L | MLP | ||||
BakLLaVA [38] | Mistral 7B v0.1 | CLIP-ViT-L | MLP | ||||
mBliP BloomZ [22] | BloomZ 7B | EVA01 CLIP-ViT-g | QFormer | ||||
Qwen-VL [9] | Qwen-7B | CLIP-VIT-bigG | CrossAttn | ||||
OmniLMM 12B [49] | Zephyr 7B | EVA02 CLIP ViT-E | MLP | ||||
LLaVA 1.6 13B [37] | Vicuna-13B-v1.5 | CLIP-ViT-L | MLP | ||||
LLaVA 1.5 13B [38] | Vicuna-13B-v1.5 | CLIP-ViT-L | MLP | ||||
CogVLM [47] | Vicuna-7B-v1.5 | EVA02 CLIP ViT-E | CrossAttn | ||||
InternVL V1.1 [15] | Llama-2-13B | InternViT 6B | MLP | ||||
LLaVA 1.6 34B [37] | Nous-Hermes-2-Yi-34B | CLIP-ViT-L | MLP | ||||
Yi-VL 34B [5] | Yi-34B-Chat | CLIP-ViT-H | MLP | ||||
InternVL V1.2+ [15] | Nous-Hermes-2-Yi-34B | InternViT-6B V1-2 | MLP | ||||
Gemini Pro Vision [7] | ? | ? | ? | ? | ? | ? | ? |
GPT-4 Vision [39] | ? | ? | ? | ? | ? | ? | ? |
Appendix D Results Details
D.1 General Results
D.1.1 xGQA
![Refer to caption](x32.png)
Model | Language | ||||||||
---|---|---|---|---|---|---|---|---|---|
bn | de | en | id | ko | pt | ru | zh | NEA | |
LLaVA 1.5 7B | |||||||||
CogVLM | |||||||||
MiniCPM-V | |||||||||
BakLLaVA | |||||||||
Yi-VL 6B | |||||||||
Qwen-VL | |||||||||
LLaVA 1.6 7B | |||||||||
Gemini Pro V | |||||||||
LLaVA 1.5 13B | |||||||||
OmniLMM 12B | |||||||||
LLaVA 1.6 13B | |||||||||
Yi-VL 34B | |||||||||
mBliP BloomZ | |||||||||
mBliP mT0 | |||||||||
GPT 4V | |||||||||
InternVL V1.2+ | |||||||||
LLaVA 1.6 34B | |||||||||
InternVL V1.1 | |||||||||
Average |
D.1.2 MaXM
![Refer to caption](x33.png)
Model | en | fr | hi | iw | ro | th | zh | NEA |
---|---|---|---|---|---|---|---|---|
CogVLM | ||||||||
BakLLaVA | ||||||||
OmniLMM 12B | ||||||||
LLaVA 1.5 7B | ||||||||
LLaVA 1.6 7B | ||||||||
LLaVA 1.5 13B | ||||||||
MiniCPM-V | ||||||||
Yi-VL 34B | ||||||||
Yi-VL 6B | ||||||||
Qwen-VL | ||||||||
LLaVA 1.6 13B | ||||||||
mBliP BloomZ | ||||||||
LLaVA 1.6 34B | ||||||||
InternVL V1.1 | ||||||||
mBliP mT0 | ||||||||
InternVL V1.2+ | ||||||||
Gemini Pro V | ||||||||
GPT 4V | ||||||||
Average | 0.51 | 0.35 | 0.22 | 0.19 | 0.27 | 0.25 | 0.24 | 0.25 |
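The NEA column is not defined in this appendix. A minimal sketch, assuming NEA denotes the unweighted mean over all non-English languages (an interpretation that is consistent with the Average row above), recomputes it from the per-language averages. The function name `non_english_average` and the variable names are hypothetical and used only for illustration.

```python
# Per-language averages copied from the "Average" row of the MaXM table.
maxm_average = {
    "en": 0.51, "fr": 0.35, "hi": 0.22, "iw": 0.19,
    "ro": 0.27, "th": 0.25, "zh": 0.24,
}
reported_nea = 0.25  # value in the NEA column of the same row


def non_english_average(scores, english_code="en"):
    """Unweighted mean over all languages except English.

    Assumption: this is what the NEA column reports.
    """
    values = [v for lang, v in scores.items() if lang != english_code]
    if not values:
        raise ValueError("no non-English scores given")
    return sum(values) / len(values)


nea = non_english_average(maxm_average)
print(f"recomputed NEA = {nea:.2f}, reported NEA = {reported_nea:.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value may differ from a reported one in the last digit for other rows or tables.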
D.1.3 XVNLI
![Refer to caption](x34.png)
Model | ar | en | es | fr | ru | NEA |
---|---|---|---|---|---|---|
CogVLM | ||||||
BakLLaVA | ||||||
Yi-VL 6B | ||||||
mBliP BloomZ | ||||||
LLaVA 1.6 7B | ||||||
LLaVA 1.5 7B | ||||||
Gemini Pro V | ||||||
LLaVA 1.5 13B | ||||||
MiniCPM-V | ||||||
Yi-VL 34B | ||||||
OmniLMM 12B | ||||||
Qwen-VL | ||||||
LLaVA 1.6 13B | ||||||
InternVL V1.1 | ||||||
mBliP mT0 | ||||||
InternVL V1.2+ | ||||||
LLaVA 1.6 34B | ||||||
GPT 4V | ||||||
Average |
D.1.4 MaRVL
![Refer to caption](x35.png)
Model | en | id | sw | ta | tr | zh | NEA |
---|---|---|---|---|---|---|---|
CogVLM | |||||||
LLaVA 1.5 7B | |||||||
BakLLaVA | |||||||
LLaVA 1.6 7B | |||||||
Qwen-VL | |||||||
Yi-VL 6B | |||||||
MiniCPM-V | |||||||
LLaVA 1.5 13B | |||||||
Gemini Pro V | |||||||
OmniLMM 12B | |||||||
mBliP BloomZ | |||||||
Yi-VL 34B | |||||||
InternVL V1.1 | |||||||
InternVL V1.2+ | |||||||
mBliP mT0 | |||||||
LLaVA 1.6 13B | |||||||
LLaVA 1.6 34B | |||||||
GPT 4V | |||||||
Average |
D.1.5 M5-VGR
![Refer to caption](x36.png)
Model | am | ber | bn | de | en | fil | ha | hi | ru | sw | th | zu | NEA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA 1.5 7B | |||||||||||||
LLaVA 1.6 7B | |||||||||||||
LLaVA 1.5 13B | |||||||||||||
BakLLaVA | |||||||||||||
LLaVA 1.6 13B | |||||||||||||
Yi-VL 34B | |||||||||||||
Qwen-VL | |||||||||||||
CogVLM | |||||||||||||
mBliP BloomZ | |||||||||||||
MiniCPM-V | |||||||||||||
OmniLMM 12B | |||||||||||||
Yi-VL 6B | |||||||||||||
InternVL V1.1 | |||||||||||||
LLaVA 1.6 34B | |||||||||||||
Gemini Pro V | |||||||||||||
InternVL V1.2+ | |||||||||||||
mBliP mT0 | |||||||||||||
GPT 4V | |||||||||||||
Average |
D.1.6 M5-VLOD
![Refer to caption](x37.png)
Model | am | ber | bn | de | en | fil | ha | hi | ru | sw | th | zu | NEA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CogVLM | |||||||||||||
mBliP mT0 | |||||||||||||
Yi-VL 6B | |||||||||||||
Yi-VL 34B | |||||||||||||
MiniCPM-V | |||||||||||||
LLaVA 1.5 7B | |||||||||||||
BakLLaVA | |||||||||||||
LLaVA 1.5 13B | |||||||||||||
InternVL V1.1 | |||||||||||||
LLaVA 1.6 7B | |||||||||||||
LLaVA 1.6 13B | |||||||||||||
Qwen-VL | |||||||||||||
mBliP BloomZ | |||||||||||||
OmniLMM 12B | |||||||||||||
LLaVA 1.6 34B | |||||||||||||
InternVL V1.2+ | |||||||||||||
Gemini Pro V | |||||||||||||
GPT 4V | |||||||||||||
Average |
D.1.7 xFlickrCO
![Refer to caption](x38.png)
Model | de | en | es | id | ja | ru | tr | zh | NEA |
---|---|---|---|---|---|---|---|---|---|
Qwen-VL | |||||||||
Yi-VL 6B | |||||||||
CogVLM | |||||||||
BakLLaVA | |||||||||
Yi-VL 34B | |||||||||
MiniCPM-V | |||||||||
InternVL V1.1 | |||||||||
LLaVA 1.5 7B | |||||||||
LLaVA 1.5 13B | |||||||||
LLaVA 1.6 7B | |||||||||
OmniLMM 12B | |||||||||
LLaVA 1.6 13B | |||||||||
LLaVA 1.6 34B | |||||||||
GPT 4V | |||||||||
mBliP BloomZ | |||||||||
Gemini Pro V | |||||||||
InternVL V1.2+ | |||||||||
mBliP mT0 | |||||||||
Average |
D.1.8 XM3600
![Refer to caption](x39.png)
Model | ar | bn | cs | da | de | el | en | es | fa | fi | fil | fr |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CogVLM | ||||||||||||
BakLLaVA | ||||||||||||
Qwen-VL | ||||||||||||
Yi-VL 6B | ||||||||||||
Yi-VL 34B | ||||||||||||
MiniCPM-V | ||||||||||||
Gemini Pro V | ||||||||||||
LLaVA 1.5 7B | ||||||||||||
mBliP mT0 | ||||||||||||
InternVL V1.1 | ||||||||||||
mBliP BloomZ | ||||||||||||
OmniLMM 12B | ||||||||||||
LLaVA 1.6 7B | ||||||||||||
LLaVA 1.5 13B | ||||||||||||
InternVL V1.2+ | ||||||||||||
LLaVA 1.6 13B | ||||||||||||
LLaVA 1.6 34B | ||||||||||||
GPT 4V | - | - | - | - | - | - | ||||||
Average |
Model | he | hi | hr | hu | id | it | ja | ko | mi | nl | no | pl |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CogVLM | ||||||||||||
BakLLaVA | ||||||||||||
Qwen-VL | ||||||||||||
Yi-VL 6B | ||||||||||||
Yi-VL 34B | ||||||||||||
MiniCPM-V | ||||||||||||
Gemini Pro V | ||||||||||||
LLaVA 1.5 7B | ||||||||||||
mBliP mT0 | ||||||||||||
InternVL V1.1 | ||||||||||||
mBliP BloomZ | ||||||||||||
OmniLMM 12B | ||||||||||||
LLaVA 1.6 7B | ||||||||||||
LLaVA 1.5 13B | ||||||||||||
InternVL V1.2+ | ||||||||||||
LLaVA 1.6 13B | ||||||||||||
LLaVA 1.6 34B | ||||||||||||
GPT 4V | - | - | - | - | - | - | - | - | - | |||
Average |
Model | pt | quz | ro | ru | sv | sw | te | th | tr | uk | vi | zh | NEA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CogVLM | |||||||||||||
BakLLaVA | |||||||||||||
Qwen-VL | |||||||||||||
Yi-VL 6B | |||||||||||||
Yi-VL 34B | |||||||||||||
MiniCPM-V | |||||||||||||
Gemini Pro V | |||||||||||||
LLaVA 1.5 7B | |||||||||||||
mBliP mT0 | |||||||||||||
InternVL V1.1 | |||||||||||||
mBliP BloomZ | |||||||||||||
OmniLMM 12B | |||||||||||||
LLaVA 1.6 7B | |||||||||||||
LLaVA 1.5 13B | |||||||||||||
InternVL V1.2+ | |||||||||||||
LLaVA 1.6 13B | |||||||||||||
LLaVA 1.6 34B | |||||||||||||
GPT 4V | - | - | - | - | - | - | - | - | - | ||||
Average |
D.2 Language Fidelity Analysis
Dataset | Avg. | zh | en | de | id | ja | ru | es | tr |
---|---|---|---|---|---|---|---|---|---|
xFlickrCO | .91 | .85 | .65 | .86 | .88 | .91 | .92 | .90 | .84 |
XM3600 | .81 | .74 | .63 | .63 | .69 | .74 | .76 | .67 | .82 |
MaXM | .55 | .17 | .43 | - | - | - | - | - | - |
XVNLI | .51 | - | .46 | - | - | - | .47 | .20 | - |
MaRVL | .46 | .21 | .41 | - | .50 | - | - | - | .50 |
M5-VGR | .34 | - | .11 | .15 | - | - | .42 | - | - |
xGQA | .21 | .35 | .47 | .08 | .37 | - | -.04 | - | - |
M5-VLOD | .14 | - | .44 | .20 | - | - | .14 | - | - |