What to do if language models disagree? Black-box model ensembling for
textual and visual question answering

Yuxi Xia    Klim Zaporojets    Benjamin Roth
Abstract

A diverse range of large language models (LLMs), e.g., ChatGPT, and visual question answering (VQA) models, e.g., BLIP, have been developed for solving textual and visual question answering tasks. However, both LLMs and VQA models encounter challenges when applied to task-specific datasets. Fine-tuning these models is either difficult, as it requires access via APIs that render them black boxes, or costly, due to the need to tune a large number of parameters. To address this, we introduce InfoSel, a data-efficient and lightweight ensemble method that learns to dynamically pick the winner from existing black-box models for predictions on both textual and multimodal visual question answering tasks. Unlike traditional ensemble models, InfoSel does not rely on prediction probabilities or confidences, which typically are not available in black-box models. Experimental results on four datasets demonstrate that our approach achieves an absolute increase of up to +5.27% in the F1-score compared to standalone LLMs. Remarkably, this improvement is achieved by utilizing only 1K training instances and 110M model parameters for training task-specific ensemble models.

Machine Learning, ICML

1 Introduction

Large language models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks, predominantly attributed to their ability to comprehend instructions and tap into vast repositories of high-quality data (Bubeck et al., 2023; Laskar et al., 2023). For example, ChatGPT finds extensive utilization in daily textual question answering (TQA) tasks, rendering substantial convenience to a myriad of users (Touvron et al., 2023) (footnote 1: https://chat.openai.com/). For visual question answering (VQA) tasks, VQA models have exhibited exceptional versatility, primarily due to their capability to comprehend both visual and textual context (Gong et al., 2023).

However, recent work (Kocoń et al., 2023; Laskar et al., 2023) indicates that LLMs, such as ChatGPT, fall short of state-of-the-art performance on task-specific datasets. Similarly, VQA models (Li et al., 2022, 2021b; Bao et al., 2022) face challenges when applied to specialized datasets due to the idiosyncrasies in the content, format or structure of these datasets (Arora et al., 2018). Unfortunately, fine-tuning LLMs (e.g., LLaMA-2-70b-chat (Touvron et al., 2023)) on task-specific data requires a large number of GPU hours. Alternatively, training smaller, task-specific models from scratch requires a large amount of labeled data in order to achieve comparable performance. Furthermore, fine-tuning LLMs through proprietary APIs with self-uploaded labeled training data not only requires LLM experts' knowledge but is also expensive (footnote 2: https://platform.openai.com/docs/guides/fine-tuning/). These fine-tuned models further remain black-box, with restricted access to details regarding architectural intricacies, model weights, training data, and even prediction confidences.

In order to address these computational and accessibility challenges associated with fine-tuning, we introduce a scalable ensemble method called InfoSel (Informed Selection). InfoSel allows for training with just a few task-specific samples and is lightweight, with only a few million parameters. Unlike current LM-ensemble methods (e.g., MetaQA (Puerto et al., 2021), LLM-Blender (Jiang et al., 2023)), which depend on confidence scores or log probabilities and thus cannot be applied to black-box models like GPT3.5 text-davinci models, InfoSel does not rely on such information and offers black-box ensembling. Furthermore, our proposed ensemble method incorporates task-specific optimization, allowing it to be easily adapted to different datasets, considering variations of both the inputs and predicted answers from the ensembled LLMs (base models). This contrasts with traditional ensemble methods such as OLA (Woods et al., 1997) and PageRank (Brin & Page, 2012), which are not adapted to task-specific particularities (e.g., different features) of different datasets. Finally, our method efficiently deals with multimodal inputs. Concretely, our results exhibit superior performance on multimodal VQA inputs compared to the state-of-the-art LLM-Blender (Jiang et al., 2023) ensemble method, which is designed to work exclusively with text. Table 1 compares our method with alternatives.

Table 1: Our method (InfoSel) aims to optimize task-specific ensembling of black-box models, where confidences and parameters can not be accessed. We use only a small portion of training data (data-efficient), and employ small-size trainable models (lightweight). Note that we focus on optimizing the performance of the ensemble instead of standalone fine-tuned (FT) models. Finally, our method can be applied not only to textual data but also to multimodal data in the context of visual question answering.

FT    LLM-Blender    OLA/PageRank    InfoSel
Task-specific
Data-efficient
Lightweight
Black-box
Multimodal
Ensemble

At its core, InfoSel (see Figure 1) trains a lightweight ensemble model to dynamically identify the most accurate base model (i.e., LLM or VQA model) for a given input, which we refer to as the winner. This is achieved by designing a meta-level classification task considering all the base models as labels for every input. We designed and implemented two ensemble architectures for textual and visual QA tasks. Our first proposed architecture, InfoSel-TT, uses a textual transformer (TT, 110M parameters) (Devlin et al., 2019) as the backbone to generate a textual representation of the question together with the answers predicted by the base models, and linearly maps it to predict the F1-scores of these answers. Although InfoSel-TT is straightforward and effective, it cannot handle multimodal data. To address this, we propose a second architecture named InfoSel-MT, where we incorporate a multimodal transformer (MT, 115M parameters) (Li et al., 2019) to generate fused contextual representations of a multimodal input (image, question, and the predicted answers). These fused representations are used to train a dense layer for selecting the winner. The challenge with this approach is the lack of exposure of the base models to new (unseen) labels appearing in the task-specific datasets. To address this, we fine-tune TT and MT models (FT-TT and FT-MT) separately to learn these new labels. The predictions of these fine-tuned models are fused with the output from InfoSel using a second, separately trained InfoSel* ensemble model. We experiment with and without InfoSel*, as this component is considered optional when InfoSel is already performing well.

We select three LLMs (ChatGPT, LLaMA-2-70b-chat and GPT3.5 text-davinci-003) and three VQA models (ALBEF (Li et al., 2021a), BLIP (Li et al., 2022) and VLMo (Bao et al., 2022)). These models are used as ensemble base models to provide answers for textual and visual QA tasks respectively. To demonstrate the data efficiency of the proposed architectures, we train them on a subsample of training data from public benchmark datasets and test on the corresponding full test data, a setting referred to as "Mini-*". Specifically, Mini-SDv2 and Mini-NQ are sampled from SQuAD v2 (Rajpurkar et al., 2018) and NQ-Open (Kwiatkowski et al., 2019) respectively, for textual QA tasks. The Mini-GQA and Mini-Viz datasets, in turn, are sampled from the GQA (Hudson & Manning, 2019) and VizWiz (Gurari et al., 2018) VQA datasets (see details in Section 4.1). Experimental results showcase performance improvements of up to +5.27% on Mini-SDv2 with InfoSel and +31.63% on Mini-Viz with InfoSel* when compared to the ensembled base models.

In summary, our contributions are: (1) InfoSel, a novel lightweight and data-efficient approach to ensemble black-box models without relying on access to model architecture, weights or prediction confidences for optimizing on task-specific datasets; (2) Assessment of the performance on textual and multimodal visual QA tasks, demonstrating gains of up to +5.27% with InfoSel and up to +31.63% with InfoSel* compared to ensembled base models on four benchmark datasets; (3) A detailed analysis of data efficiency, demonstrating that InfoSel surpasses the performance of the leading base models with as few as 10 training samples (footnote 3: Code and data will be released upon acceptance).

2 Related Work

Domain Adaptation. These methods aim to improve the performance of a model on a task-specific domain by leveraging knowledge from other domains (Zhou et al., 2022). Methods such as fine-tuning (Yosinski et al., 2014), feature adaptation (Long et al., 2015) and data augmentation (Choi et al., 2019) aim to improve the performance of standalone models and thus typically require large amounts of labeled training data or access to the model architecture and weights. InfoSel addresses this challenge by employing a lightweight ensemble model trained on a small amount of labeled data, and restricted to access exclusively the predictions produced by black-box base models.

Ensemble Learning. Ensemble methods generate and combine multiple learners (ML models) to address a particular ML task (Sagi & Rokach, 2018). Classical ensembling approaches like boosting (Schapire, 2013) and bagging (Breiman, 1996) are designed to train and combine a large number of individual models and are thus computationally expensive. The snapshot ensemble method (Huang et al., 2017) uses several local minima from one single model for ensembling, which requires full access to model weights and architecture. Stacking (Wolpert, 1992) uses a meta-learner to integrate the probabilities of the predictions from base models for providing the final output. A recent LLM ensembling method proposed by Jiang et al. (2023) uses log probabilities generated by LLMs to train LLM-Blender models. Similarly, Puerto et al. (2021) introduce MetaQA to select the best answer from multiple experts, which requires knowledge of both the confidence scores and the base models' training data. Other methods train their base models to avoid dataset biases (Han et al., 2021), while Xu et al. (2019) aim to learn joint feature embeddings across different domains. However, these methods require at least one piece of knowledge that black-box models cannot provide, including base models' confidence scores or even training data (not available for LLMs like ChatGPT). InfoSel is designed to not rely on any of the above elements.

Black-box Models Ensembling. Black-box ensembling can be achieved by classical dynamic classifier selection methods, most notably OLA (Woods et al., 1997), which learns to rank the best classifier dynamically by its overall local accuracy in the nearest region of the input. Alternatively, majority voting (Chan & van der Schaar, 2022) and PageRank (Brin & Page, 2012) weight the predictions by their internal agreements. Yet, these methods are not designed for task-specific optimization and do not consider the information about inputs and predicted answers from base models simultaneously. To address this, InfoSel proposes a transformer-based setup that utilizes all the information available in the black-box setting, but not more, to enhance task-specific performance.
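To make the agreement-based baselines concrete, the following is a minimal sketch of majority voting over the answers of black-box models. It is an illustration only, not the implementation evaluated in this paper: the function names `normalize` and `majority_vote`, the lowercase/whitespace normalization, and the tie-breaking rule (earliest model wins) are all assumptions made for the example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings agree."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the answer most base models agree on.

    Ties are broken in favor of the earliest model in the list
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(normalize(a) for a in answers)
    best = max(counts.values())  # raises ValueError if `answers` is empty
    for a in answers:
        if counts[normalize(a)] == best:
            return a
    raise ValueError("answers must be non-empty")


# Three hypothetical base-model answers to the same question.
print(majority_vote(["Paris", "paris ", "Lyon"]))
```

Note that this rule looks only at the predicted answers and never at the question or context, which is the limitation the paragraph above points out.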

3 InfoSel Ensemble Training

Figure 1 illustrates the proposed InfoSel and InfoSel* frameworks to ensemble LLMs for TQA tasks (left), and VQA models for VQA tasks (right). We differentiate TQA components using LLMs and VQA components using VQA models by denoting them with superscripts $l$ and $v$ respectively. Similarly, to distinguish between the InfoSel, InfoSel* and FT models used in TQA and VQA tasks, we add the suffixes "-TT" and "-MT" respectively. For example, the InfoSel-MT model in Figure 1 refers to the InfoSel model for the VQA task.

Figure 1: Architecture of our InfoSel, fine-tuned (FT) and InfoSel* models. $M^l_*$ and $M^v_*$ refer to black-box LLMs and VQA base models respectively, which are not trainable. The number of these base models is flexible, and is not restricted to 3 as in the figure. The models on the left (suffixed with -TT) are trained for the TQA tasks, while the models on the right (suffixed with -MT) are trained for the VQA tasks. All our models are trained independently. Note that FT and InfoSel* are optional if the task-specific datasets do not contain a high percentage of unseen labels.

3.1 InfoSel Training for TQA

Before training InfoSel*, we first perform the data preparation (top of Figure 1) for both training and testing. Next, we train the InfoSel and FT models, which is a necessary step before training InfoSel*.

Data Preparation. First, we randomly sample $N$ context-question pairs $\{(C_i, Q_i)\}_{i=1}^{N}$ and the corresponding ground truth answers $\{\widetilde{A}^l_i\}_{i=1}^{N}$ from various benchmark datasets (refer to Section 4.1). Next, we build prompts $P_i$ following specific prompt rules $P_i = R(C_i, Q_i)$ (refer to Section 4.1). Using these prompts instead of plain $(C_i, Q_i)$ text improves the LLMs' answer quality (Bach et al., 2022).

We select $K$ ($K=3$) black-box LLMs $\{M^l_j\}_{j=1}^{K}$ to generate answers on the $N$ prompts. The answer generated by $M^l_j$ on $P_i$ is denoted as $A^l_{ij}$ ($A^l_{ij} = M^l_j(P_i)$). Thereby, $K$ LLMs provide $N * K$ candidate answers for $N$ prompts. We calculate the word-level F1-scores (Rajpurkar et al., 2018) of all the candidate answers $\{A^l_{ij}\}_{j=1}^{K}$ for $P_i$. These F1-scores serve as the target $Y^l_i$ to optimize the ensemble model:

$$Y^l_i = \{F1(A^l_{ij}, \widetilde{A}^l_i)\}_{j=1}^{K}, \quad Y^l_i \in \mathbb{R}^{K}.$$
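As an illustration of how such a target vector could be computed, the sketch below implements a simplified word-level F1 and applies it to $K$ candidate answers. It is a sketch under stated assumptions rather than the exact scoring used here: the helper names `word_f1` and `f1_targets` are hypothetical, tokenization is plain lowercase whitespace splitting, and the punctuation and article stripping of the official SQuAD evaluation is omitted.

```python
from collections import Counter


def word_f1(prediction: str, truth: str) -> float:
    """Simplified word-level F1 between a predicted and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Unanswerable case: an empty answer scores 1 only if both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_targets(candidates: list[str], truth: str) -> list[float]:
    """Target vector with one F1-score per base-model answer (length K)."""
    return [word_f1(answer, truth) for answer in candidates]


# K = 3 hypothetical candidate answers for one prompt.
print(f1_targets(["in Paris", "London", "Paris"], "Paris"))
```

Because the targets are real-valued scores rather than one-hot labels, a partially correct answer such as "in Paris" still contributes a graded training signal.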

The input for the ensemble training consists of $K$ texts. Each text is formed by concatenating $P_i$ with the answer $A^l_{ij}$ predicted by base model $j$. More formally, the input $X^l_i$ is:

$$X^l_i = \{[P_i, A^l_{ij}]\}_{j=1}^{K}, \quad |X^l_i| = K.$$

The inputs $\{X^l_i\}_{i=1}^{N}$ and the corresponding target labels $\{Y^l_i\}_{i=1}^{N}$ are used for ensemble training.
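The pairing of inputs and targets can be sketched as follows. This is only an illustration: the function name `build_training_example` and the `" [SEP] "` separator string are assumptions, since the text specifies only that each prompt is concatenated with each candidate answer.

```python
def build_training_example(
    prompt: str,
    candidate_answers: list[str],
    targets: list[float],
    sep: str = " [SEP] ",  # assumed separator, not specified in the text
) -> tuple[list[str], list[float]]:
    """Build one (X_i, Y_i) pair: K concatenated texts and their K target scores."""
    if len(candidate_answers) != len(targets):
        raise ValueError("need exactly one target score per candidate answer")
    inputs = [prompt + sep + answer for answer in candidate_answers]
    return inputs, list(targets)


# One hypothetical prompt with K = 3 candidate answers and their scores.
x_i, y_i = build_training_example(
    "Context: ... Question: What is the capital of France?",
    ["in Paris", "London", "Paris"],
    [0.67, 0.0, 1.0],
)
print(len(x_i), len(y_i))
```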

InfoSel-TT. We use a textual BERT-base (Devlin et al., 2019) transformer $f^t_\theta$ ($\theta$ denotes the trainable model parameters) as the backbone of InfoSel-TT. To achieve faster convergence, we load the pre-trained weights of the bert-base-uncased model (footnote 4: https://huggingface.co/google-bert/bert-base-uncased). The input $X^l_i$ is passed to $f^t_\theta$ to generate $K$ sentence representations, one for each element of $X^l_i$. Thus, the sentence representation $R^t_{ij}$ of $[P_i, A^l_{ij}]$ from $f^t_\theta$ is:

$$R^t_{ij} = f^t_\theta([P_i, A^l_{ij}]), \quad R^t_{ij} \in \mathbb{R}^{768}.$$

A dense layer ($f^d_\theta$) follows to classify $\{R^t_{ij}\}_{j=1}^{K}$, and is trained to match the target label $Y^l_i$ using the binary cross-entropy loss $\mathcal{L}_{BCE}$. More formally, the training objective of InfoSel-TT is:

$$\underset{\theta}{\mathrm{min}} \sum_{i=1}^{N} \mathcal{L}_{BCE}\big(f^d_\theta([f^t_\theta([P_i, A^l_{ij}])]_{j=1}^{K}), Y^l_i\big).$$
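To illustrate the loss term for a single training instance, the sketch below computes binary cross-entropy between $K$ selection logits and real-valued targets in $[0, 1]$. It is a minimal stand-alone sketch, not the training code: the logits are taken as given (in the method they would come from the dense layer on top of the transformer), the function names are hypothetical, and averaging over the $K$ outputs is an assumed reduction.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logits: list[float], targets: list[float]) -> float:
    """Binary cross-entropy between K logits and K soft targets in [0, 1].

    Averaged over the K outputs (an assumption made for this sketch).
    """
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)


targets = [0.67, 0.0, 1.0]            # e.g. F1-scores of K = 3 answers
good_logits = [0.7, -4.0, 4.0]        # roughly agree with the targets
bad_logits = [-4.0, 4.0, -4.0]        # rank the answers the wrong way round
print(bce_loss(good_logits, targets) < bce_loss(bad_logits, targets))
```

With soft targets the loss is minimized when each predicted probability equals the corresponding score, so the outputs can be read as estimates of answer quality.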

Finally, for the input $P_i$, the trained InfoSel-TT model ($M^{it}$) selects the winner model $M^l_{i,win}$ from $\{M^l_j\}_{j=1}^{K}$ as the one with the highest probability score based on the selection logits produced by $f^d_\theta$. $A^l_{i,win}$ denotes the answer provided by $M^l_{i,win}$.

$$M^l_{i,win} = M^{it}(\{[P_i, A^l_{ij}]\}_{j=1}^{K}), \quad M^l_{i,win} \in \{M^l_j\}_{j=1}^{K},$$
$$A^l_{i,win} = M^l_{i,win}(P_i).$$
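The selection step amounts to an argmax over the $K$ selection scores, as the following sketch shows. The function name `select_winner` and the example logits are hypothetical; ties are resolved in favor of the lowest index, which is an arbitrary choice for this illustration.

```python
def select_winner(logits: list[float], candidate_answers: list[str]) -> tuple[int, str]:
    """Return the index of the winning base model and its answer.

    The sigmoid is monotonic, so the highest logit is also the highest
    probability score and no explicit sigmoid is needed here.
    """
    if len(logits) != len(candidate_answers):
        raise ValueError("need exactly one logit per candidate answer")
    win = max(range(len(logits)), key=lambda j: logits[j])
    return win, candidate_answers[win]


# Hypothetical selection logits for K = 3 base-model answers.
print(select_winner([-1.2, 0.7, 0.1], ["London", "Paris", "in Paris"]))
```

The winning answer is reused from the already collected base-model outputs, so no additional query to the black-box models is needed at this point.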

FT-TT. A potential limitation of using only the InfoSel-TT model is the lack of exposure of the base models to new (unseen) labels appearing in the task-specific datasets. To address this, we fine-tune a separate lightweight TT model directly on the TQA datasets to learn these new labels. Specifically, the training objective is to locate the start and end token positions of the answer in the context $C_i$. We provide the token positions of $\widetilde{A}^l_i$ as the target label, such that the model is optimized to classify each token into two classes (start/end token). This fine-tuned textual transformer model is referred to as FT-TT ($M^{ft}$) (footnote 5: The training scheme is adapted from https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt with the additional option to allow the model to return empty answers for unanswerable questions). We denote the answer predicted by $M^{ft}$ on $P_i$ as $A^{ft}_i$.

InfoSel*-TT. This model performs a further ensemble training of the FT-TT and InfoSel-TT models with the same training scheme and labeled training data as InfoSel-TT. We anticipate that the InfoSel*-TT model ($M^{it+}$), trained on the outputs of InfoSel-TT and the fine-tuned FT-TT, will improve the ability to handle labels unseen by the base models. As a result, we expect an improvement in the overall task-specific performance. The winner model selected by $M^{it+}$ belongs to $\{M^{it}, M^{ft}\}$.
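A rough sketch of how one training example for this second-stage ensemble could be assembled is given below. It assumes, following the description above, that the two candidates are the answer chosen by the first-stage ensemble and the answer of the fine-tuned model, and that they are scored against the ground truth in the same way as before. The function names, the `" [SEP] "` separator, and the exact-match scorer used in the demonstration are hypothetical stand-ins.

```python
from typing import Callable


def second_stage_example(
    prompt: str,
    infosel_answer: str,
    ft_answer: str,
    truth: str,
    score_fn: Callable[[str, str], float],
    sep: str = " [SEP] ",  # assumed separator
) -> tuple[list[str], list[float]]:
    """Build inputs and targets over the two second-stage candidates."""
    candidates = [infosel_answer, ft_answer]  # order: first-stage winner, FT model
    inputs = [prompt + sep + answer for answer in candidates]
    targets = [score_fn(answer, truth) for answer in candidates]
    return inputs, targets


def exact_match(prediction: str, truth: str) -> float:
    """Stand-in scorer for the demonstration; any answer metric could be plugged in."""
    return float(prediction.strip().lower() == truth.strip().lower())


inputs, targets = second_stage_example(
    "Question: What is the capital of France?", "London", "Paris", "Paris", exact_match
)
print(targets)
```

In this hypothetical instance the fine-tuned model is right where the first-stage winner is wrong, which is the situation the second stage is meant to exploit.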

3.2 InfoSel Training for VQA

Data Preparation. Given $N$ image-question pairs $\{(I_i, Q_i)\}_{i=1}^{N}$ from dev data of VQA benchmark datasets, we use $K$ ($K=3$) pre-trained VQA models to predict answers $A^v_{ij}$ as follows: $\{M^v_j((I_i, Q_i)) \rightarrow A^v_{ij}\}_{j=1}^{K}$. We denote the ground truth answer for image-question pair $(I_i, Q_i)$ as $\widetilde{A}^v_i$. Target labels $Y^v_i$ for ensemble training are given by the accuracy scores of the $K$ candidate answers evaluated on $\widetilde{A}^v_i$:

$$Y^v_i = \{Acc(A^v_{ij}, \widetilde{A}^v_i)\}_{j=1}^{K}, \quad Y^v_i \in \mathbb{R}^{K}.$$
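The sketch below shows one way such accuracy targets could be computed. The precise definition of $Acc$ is not given at this point in the text, so exact string match after lowercase and whitespace normalization is assumed here purely for illustration; benchmark-specific accuracy variants may differ. The function names are hypothetical.

```python
def exact_match_acc(prediction: str, truth: str) -> float:
    """Assumed accuracy: 1.0 if the normalized strings match exactly, else 0.0."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return float(norm(prediction) == norm(truth))


def acc_targets(candidates: list[str], truth: str) -> list[float]:
    """Target vector with one accuracy score per VQA base-model answer (length K)."""
    return [exact_match_acc(answer, truth) for answer in candidates]


# K = 3 hypothetical VQA answers for one image-question pair.
print(acc_targets(["Red", "blue", " red "], "red"))
```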

The concatenation of the question ($Q_i$) with each of the candidate answers ($A^v_{ij}$) obtained from the base models, together with the corresponding image ($I_i$), serves as the input to our ensemble model InfoSel-MT:

$$X_i^v = \{(I_i, [Q_i, A^v_{ij}])\}_{j=1}^{K}, \quad |X_i^v| = K.$$
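The construction of $X_i^v$ and $Y_i^v$ can be sketched as follows. This is a minimal illustration, not the authors' code: `exact_match_acc` is an assumed stand-in for $Acc(\cdot,\cdot)$ (the official VQA metrics may be softer), the image is represented by a hypothetical identifier string, and concatenation with a single space is an assumption. The example answers are taken from the Mini-GQA case in Table 7.

```python
from typing import List, Tuple

def exact_match_acc(pred: str, truth: str) -> float:
    """Assumed stand-in for Acc(.,.): 1.0 on a case-insensitive exact match."""
    return float(pred.strip().lower() == truth.strip().lower())

def build_instance(image_id: str, question: str,
                   candidates: List[str], truth: str
                   ) -> Tuple[List[Tuple[str, str]], List[float]]:
    """Return (X_i, Y_i) for one image-question pair with K candidate answers."""
    # X_i: the image paired with the question concatenated to each candidate.
    x = [(image_id, f"{question} {answer}") for answer in candidates]
    # Y_i: one accuracy score per candidate, so len(Y_i) == K.
    y = [exact_match_acc(answer, truth) for answer in candidates]
    return x, y

# K = 3 candidates (ALBEF, BLIP, VLMo); "img_001" is a hypothetical image id.
x_i, y_i = build_instance("img_001", "What appliance is it?",
                          ["blender", "toaster", "microwave"], "toaster")
```

Because several candidates can be correct at once, $Y_i^v$ is a multi-hot vector rather than a single class index, which is consistent with the binary cross-entropy objective used for training.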

InfoSel-MT. A Multimodal Transformer (MT, $f^m_\theta$) (Li et al., 2021b) is employed as the backbone for InfoSel-MT. Specifically, we first generate visual features $V_i$ of $I_i$ using a pre-trained R-CNN model ($M^r$) (Anderson et al., 2018). $V_i$ is composed of a vector of the image region features $v_i$ and the detected tags (i.e., object labels of the image) (Li et al., 2021b). The concatenated question-answer pair $[Q_i, A^v_{ij}]$ is then passed together with $V_i$ to the MT ($f^m_\theta$) to generate a fused contextual representation $R^m_{ij}$:

$$V_i = (v_i, tags) = M^r(I_i),$$
$$R^m_{ij} = f^m_\theta(V_i, [Q_i, A^v_{ij}]), \quad R^m_{ij} \in \mathbb{R}^{768}.$$

Finally, we use an additional dense layer ($f^d_\theta$) to map $R^m_{ij}$ to the target label $Y^v_i$. The training is optimized using binary cross-entropy loss:

$$\underset{\theta}{\mathrm{min}} \sum_{i=1}^{N} \mathcal{L}_{BCE}(f^d_\theta([f^m_\theta(V_i, [Q_i, A^v_{ij}])]_{j=1}^{K}), Y^v_i).$$
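To make the objective concrete, the per-instance loss can be sketched in plain Python. The sketch assumes that the dense layer emits one logit per candidate and that the loss is averaged over the $K$ slots; the text does not specify the reduction, and the logit values below are invented for illustration.

```python
import math
from typing import List

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy over the K candidate slots of one instance."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
    return total / len(logits)

targets = [0.0, 1.0, 0.0]                                  # Y_i: only candidate 2 is correct
loss_good = bce_with_logits([-2.0, 3.0, -1.0], targets)    # logits agree with Y_i
loss_bad = bce_with_logits([3.0, -2.0, 1.0], targets)      # logits contradict Y_i
```

Logits that rank the correct candidate highest yield the smaller loss, which is the signal that drives the selector during training.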

We denote the trained InfoSel-MT model by $M^{im}$. $M^{im}$ selects the winner model $M^v_{i,win}$ from $\{M^v_j\}_{j=1}^{K}$ to predict the answer $A^v_{i,win}$ for the image-question pair $(I_i, Q_i)$ based on the selection logits produced by $f^d_\theta$.

$$M^v_{i,win} = M^{im}(\{(I_i, [Q_i, A^v_{ij}])\}_{j=1}^{K}), \quad M^v_{i,win} \in \{M^v_j\}_{j=1}^{K},$$
$$A^v_{i,win} = M^v_{i,win}(I_i, Q_i).$$

FT-MT. Similar to FT-TT, FT-MT, composed of a trainable MT and a Multilayer Perceptron (MLP), is fine-tuned with the same training data as InfoSel-MT. In contrast to InfoSel-MT, FT-MT solves a multi-label classification task: it classifies the fused contextual representation of $Q_i$ (instead of $[Q_i, A^v_{ij}]$) and $V_i$ into a predefined answer list (labels). This list contains frequent answers from the training data. As a result, a trained FT-MT model ($M^{fm}$) can learn to predict new labels (answers) that appear in the task-specific datasets but not in the pre-training data of the base models. $A_i^{fm}$ denotes the answer predicted by $M^{fm}$ for $(I_i, Q_i)$. The training scheme is adapted from Li et al. (2021b).

InfoSel-MT ($M^{im+}$). Similar to its textual counterpart, the model $M^{im+}$ ensembles the FT-MT model ($M^{fm}$) and the InfoSel-MT model ($M^{im}$) using the same training scheme as in InfoSel-MT. The winner model selected by $M^{im+}$ belongs to $\{M^{im}, M^{fm}\}$.

4 Experiments and Analysis

4.1 Datasets

To demonstrate the data efficiency of our approach, we subsampled four publicly available benchmark datasets. This resulted in four Mini datasets, amounting to ~1% of the original size of the TQA datasets and ~10% of the original size of the VQA datasets. Table 2 presents the details of these datasets.

TQA datasets. We generated two Mini datasets, Mini-SDv2 and Mini-NQ, consisting of 1,000 randomly sampled instances from the SQuAD-V2 (Rajpurkar et al., 2018) and NQ-Open (Kwiatkowski et al., 2019) train splits, respectively. For Mini-NQ, we followed Fisch et al. (2019) and used long answers as the context and short answers as the ground truth answers. The 1,000 samples are divided into train and validation data using an 8:2 ratio, while the trained models are tested on the dev data of the original datasets because the original test data is unavailable. We use the setup proposed by Laskar et al. (2023) to generate the answers from LLMs. Concretely, this setting relies on prompts generated by PromptSource (Bach et al., 2022), which we apply to our Mini-SDv2 dataset (unavailable for Mini-NQ). The prompts take two forms depending on the context: (1) “What is the answer? [Context]; [Question]; If you can’t find the answer, please respond ‘unanswerable’. Answer:”; (2) “Answer the question depending on the context. [Context]; [Question]; If you can’t find the answer, please respond ‘unanswerable’. Answer:”. In contrast, Mini-NQ does not contain unanswerable questions, and thus we use the prompt “Answer the question depending on the context without explanation. [Context]; [Question]; Answer:”. Note that the answers of LLMs can be greatly influenced by factors such as the choice of prompt or the sampling temperature. However, our study does not focus on prompt engineering but rather on selecting the optimal base model to generate an answer. We will publicly release our prompts as well as the answers from LLMs for reproducibility.
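The prompt templates quoted above can be assembled programmatically, as in the following sketch. The function name `build_prompt`, the dataset keys, and the `variant` argument are hypothetical; the text does not say how one of the two Mini-SDv2 forms is chosen for a given context, so the caller picks it here. Straight quotes replace the typographic ones for simplicity.

```python
# Prompt templates as quoted in the text.
SDV2_TEMPLATES = (
    "What is the answer? {context}; {question}; "
    "If you can't find the answer, please respond 'unanswerable'. Answer:",
    "Answer the question depending on the context. {context}; {question}; "
    "If you can't find the answer, please respond 'unanswerable'. Answer:",
)
NQ_TEMPLATE = ("Answer the question depending on the context without explanation. "
               "{context}; {question}; Answer:")

def build_prompt(dataset: str, context: str, question: str, variant: int = 0) -> str:
    """Fill the template for one instance; `variant` selects the Mini-SDv2 form."""
    if dataset == "mini-sdv2":
        template = SDV2_TEMPLATES[variant]
    elif dataset == "mini-nq":
        template = NQ_TEMPLATE
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    return template.format(context=context, question=question)

prompt = build_prompt("mini-nq", "ctx", "q?")
```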

VQA datasets. Our results (Figure 2) reveal that VQA tasks demand more training samples than TQA tasks. Therefore, we constructed the Mini-GQA and Mini-Viz datasets from a larger fraction (the dev data) of the GQA (Hudson & Manning, 2019) and VizWiz (Gurari et al., 2018) datasets than we used for the TQA datasets. The resulting Mini-GQA and Mini-Viz were divided into train and validation subsets using an 8:2 ratio, while the test subset remained the same as in the original datasets.

Table 2: Details of the Mini datasets used for InfoSel ensemble training.

| Mini Dataset | Source Dataset | Num. instances | % of original dataset |
|---|---|---|---|
| Mini-SDv2 train | SQuAD-V2 train | 800 | 0.56 |
| Mini-SDv2 validation | SQuAD-V2 train | 200 | 0.14 |
| Mini-SDv2 test | SQuAD-V2 dev | 11,873 | 8.39 |
| Mini-NQ train | NQ-Open train | 800 | 0.87 |
| Mini-NQ validation | NQ-Open train | 200 | 0.22 |
| Mini-NQ test | NQ-Open dev | 3,499 | 3.83 |
| Mini-GQA train | GQA dev | 105,640 | 9.80 |
| Mini-GQA validation | GQA dev | 26,422 | 2.45 |
| Mini-GQA test | GQA test | 12,578 | 1.17 |
| Mini-Viz train | VizWiz dev | 3,456 | 10.5 |
| Mini-Viz validation | VizWiz dev | 863 | 2.63 |
| Mini-Viz test | VizWiz test | 8,000 | 24.39 |

4.2 Base Models

We experiment with ensembling GPT-3.5-turbo-0613 (ChatGPT), LLaMA-2-70b-chat (hereinafter referred to as “LLaMA”) (Touvron et al., 2023) and GPT-3.5 text-davinci-003 (hereinafter referred to as “Davinci”) to generate answers for TQA tasks. (Footnote 6: GPT-3.5 text-davinci-003 was deprecated after our experiments, but this does not influence the effectiveness of our method.) To tackle VQA tasks, we employ three VQA models (VLMo (Bao et al., 2022), ALBEF (Li et al., 2021a) and BLIP (Li et al., 2022)), which are pre-trained on the VQA v2 dataset (Antol et al., 2015). Note that we use publicly accessible VQA models to save experimental costs, but we treat them as black boxes. That is, we assume they provide only predictions, without any logits or confidence scores, which is the restriction adopted for the purpose of our study.

Finally, we use Oracle to represent the maximum capability of a combination of base models. Specifically, for each input, the Oracle selects the answer with the highest agreement to the ground truth among all the answers predicted by the base models. Thus, the Oracle score represents the performance of an ideal ensemble model.
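The Oracle upper bound described above can be computed as in this sketch. It assumes that a per-answer score (e.g., F1 or accuracy against the ground truth) is already available for every base model; the function name and the toy scores are illustrative only.

```python
from typing import List

def oracle_score(scores_per_input: List[List[float]]) -> float:
    """Average, in percent, of the best base-model score for each input."""
    best = [max(row) for row in scores_per_input]  # ideal selection per input
    return 100.0 * sum(best) / len(best)

# Four toy inputs, three base models each (scores in [0, 1]).
toy_scores = [
    [0.0, 1.0, 0.0],   # only model 2 is right
    [0.0, 0.0, 0.0],   # no model is right: the Oracle cannot help
    [1.0, 1.0, 0.0],
    [0.5, 0.2, 0.0],   # partial credit, e.g. token-level F1
]
upper_bound = oracle_score(toy_scores)
```

The second toy input shows why the Oracle stays below 100: a selection-based ensemble can never do better than the best answer already present among the candidates.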

4.3 Baselines

Majority Voting (MV). MV makes a collective decision by treating the predicted answers as votes cast by a group of individuals on a particular input. The answer that receives the most votes is the winner; ties are broken randomly.
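A minimal sketch of this baseline follows. Lower-casing answers before counting is an assumption (the text does not describe answer normalization), and a seeded random generator is used only to make the tie-break reproducible.

```python
import random
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[str], rng: Optional[random.Random] = None) -> str:
    """Return the most frequent (normalized) answer; break ties at random."""
    rng = rng or random.Random(0)
    counts = Counter(a.strip().lower() for a in answers)
    top = max(counts.values())
    tied = sorted(a for a, c in counts.items() if c == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)

voted = majority_vote(["2006", "2006", "2005"])
```

With only three base models that often produce free-form text, exact agreement is rare, so many inputs end in a three-way tie and the vote degenerates into random selection.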

Weighted Voting (WV). We adopt a strategy similar to Schick et al. (2020), where the accuracy of each model on the training data, measured before training, is used as its weight in a weighted average. In our case, we use the corresponding accuracy of each base model as its voting weight.
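Weighted voting can be sketched as below. The answers are those of the Mini-SDv2 case in Table 6; the weights are placeholders borrowed from the base models' test F1-scores in Table 3 (divided by 100), whereas the method as described would use accuracies measured on the training data.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_vote(answers: List[str], weights: List[float]) -> str:
    """Sum each model's weight onto its (normalized) answer; return the heaviest."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer.strip().lower()] += weight
    return max(totals, key=totals.get)

answers = ["unanswerable", "10,006,721", "unanswerable"]   # LLaMA, Davinci, ChatGPT
weights = [0.1134, 0.5844, 0.4495]                         # placeholder weights
chosen = weighted_vote(answers, weights)
```

In this example the single strongest model outweighs the other two combined, so the vote simply follows that model. When one base model dominates the weights, weighted voting can collapse into always picking that model.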

PageRank (Brin & Page, 2012). We adapt PageRank as a baseline to determine the most suitable answer in a graph where all the answers to one question are connected by their BLEURT (Sellam et al., 2020) similarities.
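The graph-based selection can be sketched with a small power iteration. Two substitutions are made for the sake of a self-contained example and should be read as assumptions: token-level Jaccard overlap stands in for BLEURT similarity, and the damping factor of 0.85 and 50 iterations are conventional defaults rather than values from the text.

```python
from typing import List

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for BLEURT."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pagerank_select(answers: List[str], damping: float = 0.85, iters: int = 50) -> str:
    """Return the answer with the highest PageRank in the similarity graph."""
    n = len(answers)
    if n == 1:
        return answers[0]
    # Edge weights between distinct answers (small floor so no node is isolated).
    w = [[0.0 if i == j else jaccard(answers[i], answers[j]) + 1e-6
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]          # total outgoing weight per node
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n
                + damping * sum(rank[i] * w[i][j] / out[i] for i in range(n))
                for j in range(n)]
    return answers[max(range(n), key=lambda j: rank[j])]

central = pagerank_select(["in 2006", "2006", "the year 2006", "1999"])
```

The answer that is most similar to the others accumulates the most rank, so the method favors consensus among candidates rather than correctness with respect to the task.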

Overall Local Accuracy (OLA) (Woods et al., 1997). Following (Cruz et al., 2018), we use the k-nearest neighbors algorithm to divide the input space (representations of prompts for TQA, representations of images and questions for VQA) of training data into 7 regions. The overall local accuracy of each base model in different regions is computed as its region competence. The model presenting the highest competence level is selected to predict the label of inputs that fall in the same region.
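One way to picture OLA is the classic formulation in which the region of competence of a query is its set of nearest training neighbors. The sketch below follows that formulation and is an assumption about the details: it uses squared Euclidean distance on already-computed input representations, and its default `k=7` merely echoes the number 7 in the text, which may instead refer to a fixed partition of the input space.

```python
from typing import List, Sequence

def ola_select(query: Sequence[float],
               train_points: List[Sequence[float]],
               train_correct: List[List[int]],
               k: int = 7) -> int:
    """Return the index of the base model with the best accuracy near `query`.

    train_correct[i][m] is 1 if base model m answered training input i correctly.
    """
    def dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(query, train_points[i]))

    region = sorted(range(len(train_points)), key=dist)[:k]   # k nearest neighbors
    n_models = len(train_correct[0])
    competence = [sum(train_correct[i][m] for i in region) / len(region)
                  for m in range(n_models)]
    return max(range(n_models), key=lambda m: competence[m])

# Toy 1-D representations: model 0 is reliable near 0, model 1 near 5.
points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
correct = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
picked_low = ola_select([0.05], points, correct, k=3)
picked_high = ola_select([5.05], points, correct, k=3)
```

Unlike InfoSel, this selector never looks at the candidate answers themselves; it relies only on where the input falls in representation space.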

PairRanker and LLM-Blender (Jiang et al., 2023). Both methods were originally designed for text generation tasks (e.g., machine translation and speech recognition). The PairRanker model (DeBERTa (He et al., 2023), 400M parameters) is trained to rank a pair of candidate predictions from two LLMs using multiple optimization objectives (e.g., log probabilities, BART score, BERTScore). A bubble sort over the pairwise comparison results yields the top k predictions (we use k=2), which are fed to a GENFUSER model (Flan-T5-XL (Chung et al., 2022), 3B parameters) to generate the final fused prediction. LLM-Blender is the composite of the PairRanker and the GENFUSER model. As baselines, we test the pre-trained (0-shot) PairRanker and LLM-Blender models, which have been trained on massive data (105k) including TQA data, on our data.

4.4 Experimental Setup

Evaluation Metric. LLMs tend to generate contextual answers that lead to lower exact match (EM) scores. Therefore, we use the (per-answer) token-level F1-score from the official evaluation guidance of the datasets as the main evaluation metric for TQA performance. Our results differ from the ones reported in (Laskar et al., 2023; Kocoń et al., 2023) because we do not apply any post-processing, human evaluation or output constraints to the generated answers.
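The gap between EM and token-level F1 for verbose answers can be seen in a short sketch written in the style of the SQuAD evaluation script. The normalization steps (lower-casing, removing punctuation and English articles) follow that convention and are assumed to match the official guidance used here; the example prediction is invented.

```python
import re
import string
from collections import Counter
from typing import List

PUNCT = set(string.punctuation)

def normalize(text: str) -> List[str]:
    """Lower-case, drop punctuation and English articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, truth: str) -> float:
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction: str, truth: str) -> float:
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

verbose = "The album was released in 2006"
em = exact_match(verbose, "2006")
f1 = token_f1(verbose, "2006")
```

A correct but wordy answer receives zero EM yet keeps partial F1 credit (precision 1/5, recall 1 in this example), which is why F1 is the more informative metric for unconstrained LLM output.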

Setup. We fixed the batch size to the upper limit of the server capacity, while the learning rates and epochs were selected after a grid search over a set of values (learning rates: {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}, epochs: {3, 5, 10, 15, 20}). Models for TQA are trained for 5 epochs using a learning rate of $5 \times 10^{-5}$ and a batch size of 4. Models for VQA use the same learning rate but a batch size of 16 for 20 epochs. We spent ~74 and ~290 seconds training 1 epoch on 1,000 samples for TQA and 4,319 samples for VQA, respectively. The training was performed on 1 GPU with 16GB memory of a DGX1 server (Pascal Tesla P100).
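The hyperparameter search can be sketched as an exhaustive loop over the grid. The learning-rate values are read here as negative powers of ten (1e-3 down to 1e-6), which is an interpretation consistent with the selected rate of $5 \times 10^{-5}$; `evaluate` is a hypothetical callback that would train a model and return its validation score.

```python
from itertools import product
from typing import Callable, Tuple

LEARNING_RATES = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6]
EPOCHS = [3, 5, 10, 15, 20]
CONFIGS = list(product(LEARNING_RATES, EPOCHS))   # 7 x 5 = 35 configurations

def grid_search(evaluate: Callable[[float, int], float]) -> Tuple[float, int]:
    """Return the (learning_rate, epochs) pair with the best validation score."""
    return max(CONFIGS, key=lambda cfg: evaluate(*cfg))

# Toy stand-in for training plus validation: peaks at lr=5e-5, epochs=5.
def toy_evaluate(lr: float, epochs: int) -> float:
    return -abs(lr - 5e-5) - 1e-9 * abs(epochs - 5)

best_lr, best_epochs = grid_search(toy_evaluate)
```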

Table 3: Test performance comparison on textual and visual QA datasets. The overall best results are highlighted in bold, and the best results of base models are underlined. The test data of Mini-Viz is not publicly accessible and thus the Oracle cannot be reported.

Textual Question Answering

| Model | Mini-SDv2 EM | Mini-SDv2 F1 | Mini-NQ EM | Mini-NQ F1 |
|---|---|---|---|---|
| LLaMA-2-70b-chat | 0.24 | 11.34 | 28.07 | 46.47 |
| text-davinci-003 | 52.37 | 58.44 | 52.24 | 69.44 |
| ChatGPT | 30.89 | 44.95 | 57.53 | 71.54 |
| Oracle | 58.61 | 66.20 | 64.02 | 79.21 |
| MV | 26.95 | 37.75 | 46.07 | 62.43 |
| WV | 52.37 | 58.44 | 57.53 | 71.54 |
| PageRank | 25.39 | 37.31 | 51.76 | 68.53 |
| OLA | 47.90 | 55.59 | 54.70 | 70.05 |
| (0-shot) PairRanker | 7.28 | 19.63 | 35.30 | 53.05 |
| (0-shot) LLM-Blender | 4.90 | 21.20 | 1.03 | 25.06 |
| FT-TT | 46.80 | 47.68 | 36.52 | 40.60 |
| InfoSel-TT (ours) | 57.74 | 63.63 | 58.45 | 73.37 |
| InfoSel-TT (ours) | 49.09 | 49.85 | 48.16 | 53.70 |

Visual Question Answering

| Model | Mini-GQA ACC | Mini-Viz ACC |
|---|---|---|
| ALBEF | 50.60 | 21.28 |
| BLIP | 48.08 | 20.80 |
| VLMo | 48.21 | 19.77 |
| Oracle | 65.03 | - |
| MV | 51.05 | 21.47 |
| WV | 52.10 | 19.43 |
| PageRank | 51.47 | 21.66 |
| OLA | 48.65 | 20.32 |
| (0-shot) PairRanker | 47.69 | 20.74 |
| (0-shot) LLM-Blender | 0.0 | 0.0 |
| FT-MT | 50.48 | 51.76 |
| InfoSel-MT (ours) | 55.16 | 23.16 |
| InfoSel-MT (ours) | 52.54 | 52.91 |

Figure 2: Test performance of our InfoSel and FT models compared to the OLA baseline over increasing size of training data. The yellow dot highlights the point when InfoSel models outperform the best base LLM model.

4.5 Performance Comparison

In this section, we analyze the performance of our method, taking into account its distinctive characteristics as described in Table 1. Concretely, we focus on comparing our models in terms of task-specific performance, data efficiency, lightweight design, and multimodal capabilities.

Task-specific Performance. Table 3 reports the task-specific performance of InfoSel, the base models and the baselines on textual and visual QA datasets. For TQA, we observe that LLaMA underperforms the other base models. Upon closer examination, we found that LLaMA generates longer explanation text which, although often accurate, decreases the EM and F1-score values. Conversely, the base models perform more consistently on VQA. All the models perform better on Mini-NQ than on Mini-SDv2. This is because 50% of the questions in Mini-SDv2 are unanswerable, written adversarially and specifically designed to be challenging for the QA task (Rajpurkar et al., 2018). Similarly, 28% of the questions in Mini-Viz are unanswerable, and the label “unanswerable” has never been seen by the base models. This lack of exposure leads to significantly lower performance scores. The Oracle scores of the base models indicate the performance of an ideal ensemble method.

Baseline methods such as WV, PageRank and OLA achieve only marginal improvements compared to base models ($\leq$ +1.5%) on VQA datasets. These results highlight the limitations of these methods when applied to task-specific datasets (see also Table 1). Although the 0-shot PairRanker and LLM-Blender models have been trained on massive amounts of data (including question answering data), they exhibit very low performance. LLM-Blender tends to generate longer answers than PairRanker, which leads to a lower score, especially when evaluated in the exact-match settings (EM, ACC).

InfoSel-TT achieves a 5.19% (63.63-58.44) F1 improvement over the best individual base model and reaches 96.12% (63.63/66.20) of the Oracle on Mini-SDv2. Similarly, the corresponding improvement in Mini-NQ performance is 1.83%, reaching 93.06% of the performance achieved by the Oracle. In contrast, FT-TT, despite outperforming two base models on Mini-SDv2, underperforms InfoSel-TT by more than 15% due to the small training set (refer to Figure 2). We hypothesize that this poor performance of FT-TT directly affects the InfoSel-TT variant that additionally ensembles FT-TT, which is why that variant brings no further improvement over InfoSel-TT. Furthermore, Table 3 shows that InfoSel-MT improves accuracy by 4.56% (55.16-50.60) over the best base model on Mini-GQA, reaching 84.81% (55.16/65.03) of the Oracle performance. FT-MT improves accuracy by 30.48% (51.76-21.28) on Mini-Viz thanks to the additional new labels (e.g., “unanswerable”) introduced during fine-tuning. The base models in InfoSel-MT lacked exposure to these labels, leading to lower scores compared to FT-MT. Finally, the InfoSel-MT variant that additionally ensembles FT-MT ($M^{im+}$) performs best on Mini-Viz, which demonstrates the effectiveness of the proposed blending approach: it improves accuracy by 31.63% (52.91-21.28) over the best base model. We conclude that InfoSel not only improves the task-specific performance of the base models on all four datasets, but also outperforms all the baselines.
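As a worked check of the Mini-SDv2 figures quoted above, using the F1 values from Table 3 (the symbols $\Delta$ and $\rho$ are introduced here for illustration only and do not appear elsewhere in the paper):

```latex
\begin{align*}
\Delta_{\text{Mini-SDv2}} &= F1_{\text{InfoSel-TT}} - \max_{j} F1_{M_j} = 63.63 - 58.44 = 5.19, \\
\rho_{\text{Mini-SDv2}} &= \frac{F1_{\text{InfoSel-TT}}}{F1_{\text{Oracle}}} = \frac{63.63}{66.20} \approx 0.9612 .
\end{align*}
```

The improvement $\Delta$ is therefore an absolute difference in percentage points, while $\rho$ expresses how much of the Oracle's headroom the selector recovers.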

Data Efficiency. The experimental results shown in Figure 2 demonstrate the data efficiency of our method by evaluating the model’s performance across varying training data sizes. We observe that InfoSel-TT achieves a higher F1-score than the base models when trained on as few as 10 samples from Mini-SDv2. This result has been further verified with the mean F1-score of 10 test results using different seeds for sampling the training data. Conversely, the number of training samples needed to surpass the performance of base models is higher for VQA datasets: 5% (6,603 samples) for Mini-GQA and 20% (864 samples) for Mini-Viz. We hypothesize that this is due to the inherent complexity of the VQA task.

Additionally, we find that a larger training data size benefits FT-TT more than InfoSel-TT and OLA. For example, the F1-score of FT-TT increases by ~200% and ~500% from 10 to 10,000 training samples on Mini-SDv2 and Mini-NQ respectively, while that of InfoSel-TT increases by only ~3% and ~4%. However, FT-TT still underperforms the best base model, which suggests that fine-tuning a small-size model requires more training data to reach performance comparable with LLMs or InfoSel. Finally, we observe that a larger training data size does not necessarily lead to improved performance for the fine-tuned FT-MT model (e.g., when increasing the training data size from 80% to 100% on Mini-GQA). In contrast, OLA does not benefit as much as InfoSel and FT from a larger training set, outperforming InfoSel-TT on Mini-NQ only when 10 and 20 training samples are used.

Lightweight Model. Table 4 reports the parameter size of the base models and their ensemble models (InfoSel-TT and InfoSel-MT). InfoSel provides an efficient method for ensembling LLMs such as ChatGPT (175B parameters) using only 110M trainable parameters. Even though only 37% ((182M-115M)/182M) of the trainable parameters are saved for the VQA task relative to the smallest base model (VLMo), we still demonstrate that InfoSel can effectively enhance the task-specific performance of small-size black-box VQA models, offering a lightweight solution.

Table 4: Parameter size of the models used in our experiments.

| LLM | #Param | VQA Model | #Param |
|---|---|---|---|
| LLaMA-2-70b-chat | 70B | ALBEF | 290M |
| text-davinci-003 | 175B | BLIP | 361M |
| ChatGPT | 175B | VLMo | 182M |
| InfoSel-TT | 110M | InfoSel-MT | 115M |

Multimodal Data. InfoSel is able to utilize multimodal data (image and text) for VQA tasks, and thus outperforms the latest text-exclusive LLM ensemble methods (PairRanker and LLM-Blender), as evidenced in Table 3. In contrast, LLM-Blender cannot process image features, thereby lacking crucial information in the multimodal setting, leading to an accuracy of 0.0 on VQA datasets. (Footnote 7: The most frequent answer of LLM-Blender on VQA datasets is “I’m sorry, I don’t have enough context to answer that question.”) Further insights into the significance of modality information are elaborated in Section 4.6 and Table 5.

4.6 Ablation Studies

Figure 3: The portions of answers selected from different base models by InfoSel models on Mini-SDv2 and Mini-NQ test data. The upper row represents the results of the InfoSel model ensembled with all three LLMs, and the model in the lower row excluded the worst base model (LLaMA).

Is InfoSel robust to the base models’ individual performances? We carry out this study to assess whether InfoSel can effectively utilize the predictions obtained from various base models, regardless of their individual performance levels. In Figure 3, we observe a minor F1-score difference (0.07%) on the Mini-SDv2 dataset between the InfoSel model ensembled with and without the lowest performing base model (LLaMA). This finding suggests that InfoSel is robust, and not significantly affected by the individual model’s performance. In a more detailed analysis, we observe that InfoSel selects 4% of answers from LLaMA, resulting in an overall gain of +0.28% of the F1-score. This observation highlights the effectiveness of InfoSel, as it can leverage the knowledge contained in the answers provided even by the lowest performing base model to some extent.

Which modality information helps the most for ensembling? In Table 5, we compare the effect of providing different modality information to InfoSel-MT during ensemble training. Notice that even with just the question and answer (Q+A) information, our model surpasses the performance of the 0-shot PairRanker and LLM-Blender. The setting that yields the lowest accuracy on Mini-GQA uses only the image (V) as the signal. This can be explained by the fact that a single image often corresponds to multiple questions in VQA datasets, making it challenging for the model to acquire discriminative features. Furthermore, the superior performance of our model when utilizing image, question, and answer (V+Q+A) data demonstrates its effectiveness in the multimodal setting.

Table 5: Accuracy of InfoSel-MT models when using different input information for training compared to baseline models. V, Q, and A represent visual, question, and answer information respectively.

| Model | Mini-GQA | Mini-Viz |
|---|---|---|
| (0-shot) PairRanker (Q+A) | 47.69 | 20.74 |
| (0-shot) LLM-Blender (Q+A) | 0.0 | 0.0 |
| InfoSel-MT (V) | 50.56 | 20.79 |
| InfoSel-MT (Q) | 51.11 | 21.21 |
| InfoSel-MT (V+Q) | 50.83 | 20.06 |
| InfoSel-MT (V+A) | 52.38 | 22.66 |
| InfoSel-MT (Q+A) | 54.76 | 22.89 |
| InfoSel-MT (V+Q+A) | 55.16 | 23.16 |

4.7 Case Study

Table 6: Case study of our models on Mini-SDv2 and Mini-NQ test data. Answers of LLMs are shortened to keywords for better demonstration. Ground truth answers are bolded, and one incorrect ground truth answer is colored red.

| | Mini-SDv2 | Mini-NQ |
|---|---|---|
| Context | …Derrick Norman Lehmer’s list of primes up to 10,006,721… | …in 2005 and the release of her eponymous debut album the following year… |
| Question | How many primes were included in Derrick Norman Lehmer’s list of prime numbers? | When did Taylor Swift’s first album release? |
| LLaMA | unanswerable | 2006 |
| Davinci | 10,006,721 | 2006 |
| ChatGPT | unanswerable | 2006 |
| FT-TT | unanswerable | 2005 |
| InfoSel-TT | 10,006,721 | 2006 |
| InfoSel-TT | unanswerable | 2005 |

Table 7: Case study of our models on Mini-GQA test and Mini-Viz validation data. Ground truth answers are bolded.

| | Mini-GQA | Mini-Viz |
|---|---|---|
| Image | [Uncaptioned image] | [Uncaptioned image] |
| Question | What appliance is it? | What is this product? |
| ALBEF | blender | refrigerator |
| BLIP | toaster | toilet |
| VLMo | microwave | door |
| FT-MT | coffee maker | unanswerable |
| InfoSel-MT | toaster | toilet |
| InfoSel-MT | coffee maker | unanswerable |

Table 6 illustrates two insightful cases from the predictions of different models on the textual Mini-SDv2 and Mini-NQ datasets. The first case showcases the ability of InfoSel-TT to select the right model (Davinci) when the rest of the models are incorrect. However, the InfoSel-TT variant that additionally ensembles FT-TT (last row of Table 6) selects the wrong answers from the FT-TT model and underperforms InfoSel-TT. The second case illustrates the ability of LLMs to generate the correct answer (“2006”) despite the ground truth annotation error (“2005”). This demonstrates the advantage of ensembling highly expressive LLMs instead of relying only on fine-tuning small-size models such as FT-TT.

The first case of Table 7 further indicates that InfoSel-MT captures the only correct answer (“toaster”). The second case demonstrates the ability of the variant that additionally ensembles FT-MT ($M^{im+}$, last row of Table 7) to recognize task-specific labels (i.e., “unanswerable”) introduced by FT-MT. InfoSel-MT struggles with such labels as they are unfamiliar to the base models. This showcases the benefits of training InfoSel models on datasets containing a high percentage of task-specific labels.

5 Conclusion

In this paper, we propose InfoSel, a novel lightweight and task-specific ensemble method that learns to dynamically select the optimal model from a range of distinct black-box base LLMs. We find that, using only 110M trainable parameters, our method substantially improves upon the best performing base LLM. Additionally, our analysis reveals that InfoSel remains robust regardless of the incorrect predictions of the lowest performing LLM. Our findings also show that our solution is highly data-efficient. Concretely, it requires only a fraction of the instances (as few as 10) from the training set to outperform base LLMs. Finally, our experimental results reveal that InfoSel can be adapted to the multimodal setting, showing a substantial increase in performance compared to state-of-the-art alternatives.

6 Acknowledgements

This research has been funded by the Vienna Science and Technology Fund (WWTF) [10.47379/VRG19008] “Knowledge-infused Deep Learning for Natural Language Processing”.

References

  • Anderson et al. (2018) Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6077–6086, 2018.
  • Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp.  2425–2433, 2015.
  • Arora et al. (2018) Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pp.  254–263. PMLR, 2018.
  • Bach et al. (2022) Bach, S. H., Sanh, V., Yong, Z.-X., Webson, A., Raffel, C., Nayak, N. V., Sharma, A., Kim, T., Bari, M. S., Fevry, T., Alyafeai, Z., Dey, M., Santilli, A., Sun, Z., Ben-David, S., Xu, C., Chhablani, G., Wang, H., Fries, J. A., Al-shaibani, M. S., Sharma, S., Thakker, U., Almubarak, K., Tang, X., Radev, D., Jiang, M. T.-J., and Rush, A. M. Promptsource: An integrated development environment and repository for natural language prompts, 2022.
  • Bao et al. (2022) Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  • Breiman (1996) Breiman, L. Bagging predictors. Machine learning, 24:123–140, 1996.
  • Brin & Page (2012) Brin, S. and Page, L. Reprint of: The anatomy of a large-scale hypertextual web search engine. Comput. Networks, 56:3825–3833, 2012. URL https://api.semanticscholar.org/CorpusID:911040.
  • Bubeck et al. (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.
  • Chan & van der Schaar (2022) Chan, A. J. and van der Schaar, M. Synthetic model combination: An instance-wise approach to unsupervised ensemble learning. arXiv preprint arXiv:2210.05320, 2022.
  • Choi et al. (2019) Choi, J., Kim, T., and Kim, C. Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  6830–6840, 2019.
  • Chung et al. (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models, 2022.
  • Cruz et al. (2018) Cruz, R. M., Sabourin, R., and Cavalcanti, G. D. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195–216, 2018.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • Fisch et al. (2019) Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., and Chen, D. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP, 2019.
  • Gong et al. (2023) Gong, T., Lyu, C., Zhang, S., Wang, Y., Zheng, M., Zhao, Q., Liu, K., Zhang, W., Luo, P., and Chen, K. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  • Gurari et al. (2018) Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3608–3617, 2018.
  • Han et al. (2021) Han, X., Wang, S., Su, C., Huang, Q., and Tian, Q. Greedy gradient ensemble for robust visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1584–1593, 2021.
  • He et al. (2023) He, P., Gao, J., and Chen, W. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing, 2023.
  • Huang et al. (2017) Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.
  • Hudson & Manning (2019) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6700–6709, 2019.
  • Jiang et al. (2023) Jiang, D., Ren, X., and Lin, B. Y. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023.
  • Kocoń et al. (2023) Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., Bielaniewicz, J., Gruza, M., Janz, A., Kanclerz, K., Kocoń, A., Koptyra, B., Mieleszczenko-Kowszewicz, W., Miłkowski, P., Oleksy, M., Piasecki, M., Radliński, Ł., Wojtasik, K., Woźniak, S., and Kazienko, P. ChatGPT: Jack of all trades, master of none. Information Fusion, 99:101861, nov 2023. doi: 10.1016/j.inffus.2023.101861. URL https://doi.org/10.1016%2Fj.inffus.2023.101861.
  • Kwiatkowski et al. (2019) Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  • Laskar et al. (2023) Laskar, M. T. R., Bari, M. S., Rahman, M., Bhuiyan, M. A. H., Joty, S., and Huang, J. A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. In Findings of the Association for Computational Linguistics: ACL 2023, pp.  431–469, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.29. URL https://aclanthology.org/2023.findings-acl.29.
  • Li et al. (2021a) Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., and Hoi, S. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, October 2021a. URL http://arxiv.org/abs/2107.07651. arXiv:2107.07651 [cs].
  • Li et al. (2022) Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, February 2022. URL http://arxiv.org/abs/2201.12086. arXiv:2201.12086 [cs].
  • Li et al. (2019) Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • Li et al. (2021b) Li, L. H., You, H., Wang, Z., Zareian, A., Chang, S.-F., and Chang, K.-W. Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions, April 2021b. URL http://arxiv.org/abs/2010.12831. arXiv:2010.12831 [cs].
  • Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In International conference on machine learning, pp.  97–105. PMLR, 2015.
  • Puerto et al. (2021) Puerto, H., Şahin, G. G., and Gurevych, I. Metaqa: Combining expert agents for multi-skill question answering. arXiv preprint arXiv:2112.01922, 2021.
  • Rajpurkar et al. (2018) Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad, 2018.
  • Sagi & Rokach (2018) Sagi, O. and Rokach, L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
  • Schapire (2013) Schapire, R. E. Explaining adaboost. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, pp.  37–52. Springer, 2013.
  • Sellam et al. (2020) Sellam, T., Das, D., and Parikh, A. P. Bleurt: Learning robust metrics for text generation. In Proceedings of ACL, 2020.
  • Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Wolpert (1992) Wolpert, D. H. Stacked generalization. Neural networks, 5(2):241–259, 1992.
  • Woods et al. (1997) Woods, K., Kegelmeyer, W. P., and Bowyer, K. Combination of multiple classifiers using local accuracy estimates. IEEE transactions on pattern analysis and machine intelligence, 19(4):405–410, 1997.
  • Xu et al. (2019) Xu, Y., Chen, L., Cheng, Z., Duan, L., and Luo, J. Open-ended visual question answering by multi-modal domain adaptation. arXiv preprint arXiv:1911.04058, 2019.
  • Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014.
  • Zhou et al. (2022) Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.