License: arXiv.org perpetual non-exclusive license
arXiv:2403.04343v1 [cs.AI] 07 Mar 2024

CoTBal: Comprehensive Task Balancing for Multi-Task
Visual Instruction Tuning

Yanqi Dai Gaoling School of Artificial Intelligence
Renmin University of China
Beijing, China
Beijing Key Laboratory of Big Data Management and Analysis Methods
Beijing, China
Dong Jing Gaoling School of Artificial Intelligence
Renmin University of China
Beijing, China
Beijing Key Laboratory of Big Data Management and Analysis Methods
Beijing, China
Nanyi Fei School of Information
Renmin University of China
Beijing, China
Beijing Key Laboratory of Big Data Management and Analysis Methods
Beijing, China
Zhiwu Lu Beijing Key Laboratory of Big Data Management and Analysis Methods
Beijing, China
Abstract

Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance due to different instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work that explores multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task potentially enhances the performance in other tasks, attributable to the overlapping knowledge domains, and (2) Intra-Task Difficulty, which refers to the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is thus enabled by assigning more weights to tasks that offer substantial contributions to others, receive minimal contributions from others, and also have great intra-task difficulties. Experiments show that our CoTBal leads to superior overall performance in multi-task visual instruction tuning.

1 Introduction

Large multimodal models (LMMs) such as GPT-4V (Yang et al., 2023) and Gemini (Team et al., 2023) have attracted growing attention for their ability to comprehend and reason across both visual and textual modalities. A key advancement in this field is visual instruction tuning (Liu et al., 2023b), which integrates visual encoders with large language models (LLMs) through specialized visual instructions and alignment modules. This technique extends the general-purpose capacities of LLMs to the visual modality, significantly enhancing the training efficiency and effectiveness of LMMs. Approaches such as LLaVA (Liu et al., 2023b, a) and MiniGPT-4 (Zhu et al., 2023) have shown remarkable achievements through visual instruction tuning.

[Figure 1 panels: (a) Inter-Task Contribution; (b) Intra-Task Difficulty]
Figure 1: Schematic illustrations of inter-task contributions and intra-task difficulties. (a) The red words indicate the overlapping knowledge domains among tasks, thereby enabling inter-task contributions. (b) The different curves correlating performance with training data amount reveal varying degrees of intra-task difficulties.

Typically, instruction-following data from various tasks are indiscriminately mixed for visual instruction tuning. However, simultaneous optimization across multiple tasks can lead to gradient conflicts (Yu et al., 2020) due to different instruction formats and knowledge domains across tasks, resulting in suboptimal overall performance. To mitigate this issue, Gou et al. (2023) build on a mixture of LoRA experts and use distinct experts to learn conflicting tasks, which appears to be the only existing work on multi-task visual instruction tuning. Note that previous works explore multi-task learning (MTL) mainly by designing either model structures or optimization algorithms (Liu et al., 2019). The work of Gou et al. (2023) clearly falls into the first category of MTL. In contrast, we concentrate on applying the second category of MTL to visual instruction tuning in this paper.

Specifically, we propose a Generic Task Weighting (GTW) paradigm where losses are task-specific weighted and averaged at the token level. Based on the paradigm, we devise Comprehensive Task Balancing (CoTBal), a novel algorithm that balances multi-task visual instruction tuning according to both the inter-task contribution and the intra-task difficulty. On one hand, Figure 1(a) exemplifies that different tasks have overlapping knowledge domains, so that learning one task potentially enhances the performance in other tasks. The extent of this overlap varies, leading to differing degrees of inter-task contributions, which are quantified by the normalized validation performance of a model trained on one task and applied to others. On the other hand, Figure 1(b) shows that tasks exhibit distinct patterns of performance improvement with increasing training data amount. Tasks achieving near-optimal performance with a limited dataset are relatively simpler, while those requiring the full dataset for optimal performance have greater inherent learning difficulties. These intra-task difficulties are measured by the normalized validation performance gap between models trained on the full dataset and those trained on a mini subset of the same task. To achieve comprehensive task balancing for visual instruction tuning, we thus propose to assign more weights to three types of tasks: (1) tasks offering substantial contributions to others, (2) tasks receiving minimal contributions from others, and (3) tasks having great difficulties. These criteria are employed together in our CoTBal to obtain more balanced overall performance.

Briefly, our main contributions are three-fold:
(1) We propose the Generic Task Weighting (GTW) paradigm for multi-task visual instruction tuning. This is the first work that explores multi-task optimization in visual instruction tuning.
(2) We devise the Comprehensive Task Balancing (CoTBal) algorithm, which balances multi-task visual instruction tuning based on both the inter-task contribution and the intra-task difficulty.
(3) Experiments show that CoTBal outperforms existing methods, significantly improving overall performance while ensuring task balance.

2 Related Work

Multi-Task Learning.  The purpose of multi-task learning (MTL) is to jointly train a single model that can perform multiple tasks (Caruana, 1998; Ruder, 2017; Zhang and Yang, 2021; Vandenhende et al., 2021). Research in MTL is broadly divided into two categories: the first learns the correlations among tasks through model structures (Misra et al., 2016; Ma et al., 2018; Liu et al., 2019), and the second balances the joint training process of all tasks through optimization algorithms (Kendall et al., 2018; Lin et al., 2022; Sener and Koltun, 2018; Liu et al., 2021; Navon et al., 2022; Dai et al., 2023b). These two approaches are not mutually exclusive and can effectively complement each other (Liu et al., 2019). In this paper, we primarily focus on multi-task optimization algorithms, which either sum the weighted losses or aggregate the update gradients of all tasks.

Visual Instruction Tuning.  Instruction tuning (Wei et al., 2021) was first explored in natural language processing, enabling large language models (LLMs) to follow textual instructions and accomplish unseen tasks (Zhang et al., 2023a; Ouyang et al., 2022; Wang et al., 2022). To extend the powerful capabilities of LLMs into the multimodal domain, Liu et al. (2023b) introduce visual instruction tuning. This technique integrates visual encoders (Dosovitskiy et al., 2020) with LLMs (Touvron et al., 2023a, b) through specialized visual instructions and alignment modules, effectively constructing large multimodal models (LMMs) that can engage with vision-language information. Subsequently, a range of advanced approaches show robust performance on various visual tasks, focusing on two components: (1) training setting, which encompasses the selection of the alignment module (Zhu et al., 2023; Dai et al., 2023a; Bai et al., 2023) and the determination of trainable modules (Liu et al., 2023a; Ye et al., 2023), and (2) training data, characterized by its larger scale (Zhao et al., 2023), increased versatility (Zhang et al., 2023b; Li et al., 2023), and superior quality (Chen et al., 2023; Wang et al., 2023). However, Gou et al. (2023) observe that diverse tasks for visual instruction tuning focus on different perspectives, resulting in conflicts when trained together. To mitigate this, they propose the mixture of LoRA experts. In this paper, we tackle this challenge from a different angle by employing multi-task optimization, which assigns a specific weight to each task.

3 Methodology

In this section, we start with a Generic Task Weighting (GTW) paradigm tailored for multi-task visual instruction tuning. Based on this paradigm, we elaborate on two key dimensions for task balancing: inter-task contribution balancing and intra-task difficulty balancing. These two dimensions are then integrated to formulate the final Comprehensive Task Balancing (CoTBal) algorithm.

3.1 Generic Task Weighting Paradigm

In current works involving visual instruction tuning, instruction-following data from various tasks are typically indiscriminately mixed for fine-tuning LMMs. The training loss is obtained by averaging the cross-entropy losses calculated across all valid tokens, as represented by the following formula:

$$L=\frac{\sum^{N}_{i=1}\sum^{S_{i}}_{j=1}\sum^{T_{ij}}_{k=1}-\log(p(t_{ijk}))}{\sum^{N}_{i=1}\sum^{S_{i}}_{j=1}T_{ij}}, \tag{1}$$

where $N$ is the total number of tasks, $S_{i}$ is the number of samples for Task $i$, $T_{ij}$ is the number of valid tokens in the $j$th sample for Task $i$, and $t_{ijk}$ is the $k$th valid token in the $j$th sample for Task $i$. However, this approach is incompatible with the task weighting paradigm of traditional multi-task optimization algorithms, where single-task losses are individually computed and aggregated through weighted summation to get the total loss. Therefore, we introduce the GTW paradigm, specifically tailored for multi-task visual instruction tuning. The training loss of GTW is defined as:

$$L_{GTW}=\frac{\sum^{N}_{i=1}\sum^{S_{i}}_{j=1}\sum^{T_{ij}}_{k=1}-\lambda_{i}\log(p(t_{ijk}))}{\sum^{N}_{i=1}\sum^{S_{i}}_{j=1}\lambda_{i}T_{ij}}, \tag{2}$$

where $\lambda_{i}$ denotes the weight of Task $i$. The losses are assigned task-specific weights and aggregated at the token level rather than at the sample or task level. GTW allows for more equitable consideration of each valid token, ensuring that the model is not biased towards certain tasks due to variations in sample sequence length or data amount across tasks. Besides, we also perform weighting in the denominator to enable a fair comparison with the indiscriminate data mixing approach (see Equation 1), where the weights are uniformly set to $1$. The GTW paradigm is employed in our CoTBal algorithm, while also laying a solid foundation for subsequent studies.
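As a minimal sketch of Equation 2, the token-level weighting can be written in plain Python. The function name `gtw_loss`, its argument layout, and the toy log-probabilities are illustrative assumptions and are not taken from the paper; a real implementation would typically operate on batched tensors instead.

```python
import math


def gtw_loss(token_log_probs, task_ids, task_weights):
    """Token-level weighted loss in the spirit of Equation 2.

    token_log_probs: one list per sample, holding log p(t) for each valid token.
    task_ids:        task index i of each sample.
    task_weights:    lambda_i for each task.
    (All names here are hypothetical; the paper gives only the formula.)
    """
    numerator = 0.0
    denominator = 0.0
    for log_probs, task in zip(token_log_probs, task_ids):
        weight = task_weights[task]
        numerator += -weight * sum(log_probs)    # weighted negative log-likelihood
        denominator += weight * len(log_probs)   # weighted count of valid tokens
    return numerator / denominator


# Toy batch (made-up values): two samples from task 0, one sample from task 1.
log_probs = [
    [math.log(0.5), math.log(0.25)],
    [math.log(0.5)],
    [math.log(0.125)] * 3,
]
tasks = [0, 0, 1]

uniform = gtw_loss(log_probs, tasks, [1.0, 1.0])   # all weights 1: plain token average (Equation 1)
weighted = gtw_loss(log_probs, tasks, [0.5, 1.5])  # puts more emphasis on task 1
```

Because the weights also appear in the denominator, rescaling all weights by a common factor leaves the loss unchanged; only the ratios between task weights matter.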

3.2 Inter-Task Contribution Balancing

Although the focal points of distinct tasks vary in multi-task visual instruction tuning, a key shared objective exists: achieving more accurate comprehension and reasoning of visual information. As shown in Figure 1(a), the data of detailed image captioning on ShareGPT-4V (Chen et al., 2023) and visual question answering on VQAv2 (Goyal et al., 2017) both involve color information (pink and yellow dishes) in the image, which exemplifies the overlapping knowledge domains among tasks. Therefore, it is reasonable to hypothesize that different visual tasks could potentially provide mutual enhancement in their performance, which can be defined as the inter-task contribution. The extent of the overlapping knowledge domains varies, leading to differing degrees of inter-task contributions.

In practice, the inter-task contribution of Task $i$ to Task $j$ can be quantified by the validation performance for Task $j$ of the model trained on Task $i$, which is normalized by the validation performance for Task $j$ of the model trained on Task $j$ itself. However, a model trained exclusively on one task may struggle to adhere to the instruction demands of other tasks. To address this, we incorporate mini subsets from all tasks into the training set, enabling the model to understand the instruction demands of each task. Therefore, the inter-task contribution of Task $i$ to Task $j$ can be calculated as:

$$C_{ij}=\frac{V_{j}(i+mini)-V_{j}(mini)}{V_{j}(j+mini)-V_{j}(mini)}, \tag{3}$$

where $V_{j}(i+mini)$ represents the validation performance for Task $j$ of a model trained on the full dataset from Task $i$ alongside mini subsets from other tasks, and $V_{j}(mini)$ signifies the validation performance for Task $j$ of a model trained on mini subsets from all tasks. In the formula, $V_{j}(mini)$ is subtracted from both the numerator and the denominator to mitigate the impact of incorporating mini subsets from all tasks into the training set on the validation performance for Task $j$.
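Equation 3 can be illustrated with a short sketch that builds the full contribution matrix from validation scores. The function name, the data layout (`v_full[i][j]` for $V_{j}(i+mini)$ and `v_mini[j]` for $V_{j}(mini)$), and the scores themselves are assumptions made for illustration only. The sketch also assumes that training on Task $j$ improves its own score over the mini-only baseline, so the denominator is non-zero.

```python
def inter_task_contribution(v_full, v_mini):
    """Contribution matrix C following Equation 3.

    v_full[i][j]: validation score on task j of the model trained on the
                  full data of task i plus mini subsets of the other tasks.
    v_mini[j]:    validation score on task j of the mini-subsets-only model.
    (Hypothetical layout; the paper defines only the formula.)
    """
    n = len(v_mini)
    return [
        [
            (v_full[i][j] - v_mini[j]) / (v_full[j][j] - v_mini[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Made-up validation scores for three tasks.
v_mini = [50.0, 40.0, 20.0]
v_full = [
    [70.0, 44.0, 21.0],
    [55.0, 60.0, 22.0],
    [51.0, 42.0, 60.0],
]
C = inter_task_contribution(v_full, v_mini)
```

By construction the diagonal entries equal 1, since each task's own full-data model defines the normalizer; only the off-diagonal entries enter the averages used for weighting.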

Furthermore, based on the accurate quantification of the inter-task contribution, we propose two task weighting strategies for inter-task contribution balancing. Firstly, we examine the average inter-task contribution of one given task to all other tasks as $C_{one2all}$, representing the extent to which this task assists all other tasks. The greater the assistance provided by one task to all other tasks, the more substantial its overall contribution to the entire training process of multi-task visual instruction tuning. Therefore, tasks that have greater $C_{one2all}$ should be assigned more weights to enhance overall performance. The specific task weights $\bm{\lambda_{one2all}}$ can be computed as:

$$C_{one2all,i}=\frac{1}{N-1}\sum_{j\neq i}C_{ij}, \tag{4}$$
$$\bm{\lambda_{one2all}}=N\times\text{softmax}(\frac{\bm{C_{one2all}}}{T}), \tag{5}$$

where $C_{one2all,i}$ signifies $C_{one2all}$ for Task $i$ and $\bm{C_{one2all}}$ represents the $N$-dimensional vector of $C_{one2all}$ for all tasks. $T$ denotes the temperature hyperparameter that controls the degree of smoothness in the weight vector. Secondly, we consider the average inter-task contribution of all other tasks to one given task as $C_{all2one}$, denoting the degree to which this task receives benefits from all other tasks. If one task receives minimal benefits from other tasks, it tends to exhibit poorer performance compared to tasks that receive greater benefits. To maintain balanced overall performance, tasks that have lower $C_{all2one}$ should also be assigned more weights. The specific task weights $\bm{\lambda_{all2one}}$ can be computed as:

$$C_{all2one,i}=\frac{1}{N-1}\sum_{j\neq i}C_{ji}, \tag{6}$$
$$\bm{\lambda_{all2one}}=N\times\text{softmax}(-\frac{\bm{C_{all2one}}}{T}), \tag{7}$$

where $C_{all2one,i}$ signifies $C_{all2one}$ for Task $i$ and $\bm{C_{all2one}}$ represents the $N$-dimensional vector of $C_{all2one}$ for all tasks. $T$ denotes the same temperature hyperparameter as in Equation 5. Subsequently, we integrate the aforementioned two strategies to formulate the task weighting strategy for inter-task contribution balancing, where the task weights $\bm{\lambda_{C}}$ can be calculated as:

$$\bm{\lambda_{C}}=\frac{1}{2}(\bm{\lambda_{one2all}}+\bm{\lambda_{all2one}}). \tag{8}$$
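The chain from Equation 4 to Equation 8 can be sketched as follows. The helper names, the toy contribution matrix, and the temperature value 0.5 are hypothetical choices for illustration; the paper's actual hyperparameter settings are not assumed here.

```python
import math


def softmax(values, temperature):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contribution_weights(c, temperature):
    """Task weights lambda_C from a contribution matrix c (Equations 4 to 8)."""
    n = len(c)
    # Equation 4: average contribution that task i gives to the other tasks.
    one2all = [sum(c[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    # Equation 6: average contribution that task i receives from the other tasks.
    all2one = [sum(c[j][i] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    # Equations 5 and 7: more weight for generous givers and for poor receivers.
    lam_one2all = [n * w for w in softmax(one2all, temperature)]
    lam_all2one = [n * w for w in softmax([-x for x in all2one], temperature)]
    # Equation 8: average of the two weight vectors.
    return [(a + b) / 2 for a, b in zip(lam_one2all, lam_all2one)]


# Made-up contribution matrix: task 0 helps others the most and is helped the least.
C = [
    [1.0, 0.4, 0.3],
    [0.1, 1.0, 0.2],
    [0.05, 0.1, 1.0],
]
lam_c = contribution_weights(C, temperature=0.5)  # temperature value is an assumption
```

Each softmax output sums to 1, so multiplying by $N$ keeps the average task weight at 1; uniform contributions therefore recover the unweighted setting.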

3.3 Intra-Task Difficulty Balancing

In addition to the inter-task contribution, another critical aspect in multi-task visual instruction tuning is the intra-task difficulty, which refers to the inherent learning difficulty within each task. Tasks that achieve near-optimal performance with a limited dataset are considered to have low intra-task difficulty. Conversely, tasks that require the full dataset to reach optimal performance are deemed to have high intra-task difficulty. As illustrated in Figure 1(b), different tasks exhibit distinct patterns of performance improvement with increasing training data amount. Arranged by increasing intra-task difficulty, the sequence of these three tasks is as follows: visual question answering on VQAv2 (Goyal et al., 2017), detailed image captioning on ShareGPT-4V (Chen et al., 2023) and visual grounding on RefCOCO (Kazemzadeh et al., 2014; Mao et al., 2016).

Practically, the intra-task difficulty for Task $i$ is measured by the validation performance gap between a model trained on the full dataset and that trained on a mini subset from Task $i$, which is normalized by the validation performance of the former model. This metric offers a precise measure of potential performance degradation when using the mini subset of training data, thereby reflecting the inherent learning difficulty of the task. Notably, to ensure a fair measurement across each task, the ratio between the number of samples in the mini subset and the total number of samples in the full dataset should be kept consistent.

However, obtaining the intra-task difficulty in this way requires training extra models on both the full dataset and the mini subset from each task, which takes additional time comparable to the training time of the final model. To alleviate this, we repurpose the models trained for computing inter-task contributions. Specifically, we substitute the model trained on the mini subset from Task $i$ with that trained on mini subsets from all tasks, and replace the model trained solely on the full dataset from Task $i$ with that trained on the full dataset from Task $i$ alongside mini subsets from other tasks. Because the inter-task contributions of other tasks to Task $i$ are minimal compared to the contribution of Task $i$ to itself, the impact of mini subsets from other tasks on the validation performance for Task $i$ is negligible. Therefore, this approach significantly reduces training time with minimal error. The intra-task difficulty for Task $i$ is calculated as:

$$D_{i}=1-\frac{V_{i}(mini)}{V_{i}(i+mini)}, \tag{9}$$

where $V_{i}(mini)$ represents the validation performance for Task $i$ of a model trained on mini subsets from all tasks, and $V_{i}(i+mini)$ denotes the validation performance for Task $i$ of a model trained on the full dataset from Task $i$ coupled with mini subsets from other tasks.

Moreover, owing to the varying intra-task difficulties across different tasks, treating each task equally during the training process may result in underfitting of the more challenging tasks, despite the simpler ones being adequately trained or even overfitted. Therefore, we propose a weighting strategy that assigns more weights to tasks with greater intra-task difficulties. The task weights $\bm{\lambda_{D}}$ can be calculated as:

$$\bm{\lambda_{D}}=N\times\text{softmax}(\frac{\bm{D}}{T}), \tag{10}$$

where $\bm{D}$ represents the $N$-dimensional vector of intra-task difficulties for all tasks, and $T$ is the same temperature hyperparameter used in Section 3.2.
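Equations 9 and 10 can be sketched together in a few lines. The function names, the validation scores, and the temperature value are illustrative assumptions; the sketch further assumes that validation performance is a positive, higher-is-better score.

```python
import math


def intra_task_difficulty(v_mini, v_full_self):
    """Equation 9: relative performance drop when only mini subsets are used.

    v_mini[i]:      score on task i of the mini-subsets-only model.
    v_full_self[i]: score on task i of the model trained on the full data of
                    task i plus mini subsets of the other tasks.
    (Hypothetical names and layout.)
    """
    return [1.0 - mini / full for mini, full in zip(v_mini, v_full_self)]


def difficulty_weights(difficulties, temperature):
    """Equation 10: N times a temperature-scaled softmax over difficulties."""
    n = len(difficulties)
    scaled = [d / temperature for d in difficulties]
    peak = max(scaled)  # shift by the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [n * e / total for e in exps]


# Made-up validation scores for three tasks.
v_mini = [50.0, 40.0, 20.0]
v_full_self = [70.0, 60.0, 60.0]

D = intra_task_difficulty(v_mini, v_full_self)
lam_d = difficulty_weights(D, temperature=0.5)  # temperature value is an assumption
```

In this toy setting the third task loses the largest share of its performance without its full dataset, so it receives the largest weight.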

3.4 Comprehensive Task Balancing

Algorithm 1 Overall Training Process of CoTBal
Input: $N$ visual tasks, a pretrained LMM.
Output: a fine-tuned LMM.
Train a model on mini subsets from all tasks;
for $i=1$ to $N$ do
     Train a model on the full dataset from Task $i$ and mini subsets from other tasks;
end for
for each Task $i$ do
     for each other Task $j$ do
         Compute inter-task contribution $C_{ij}$;
     end for
end for
for each Task $i$ do
     Compute outward contribution $C_{one2all,i}$;
     Compute inward contribution $C_{all2one,i}$;
end for
Compute task weights $\bm{\lambda_{C}}$ using $\bm{C_{one2all}}$ and $\bm{C_{all2one}}$ for inter-task contribution balancing;
for each Task $i$ do
     Compute intra-task difficulty $D_{i}$;
end for
Compute task weights $\bm{\lambda_{D}}$ using $\bm{D}$ for intra-task difficulty balancing;
Combine $\bm{\lambda_{C}}$ and $\bm{\lambda_{D}}$ to get final task weights $\bm{\lambda_{CoTBal}}$ for comprehensive task balancing;
Apply $\bm{\lambda_{CoTBal}}$ to fine-tune the final LMM using the GTW paradigm.

After individually establishing the strategies for inter-task contribution balancing and intra-task difficulty balancing, the final step involves integrating them to create the CoTBal algorithm. The algorithm is designed to leverage the strengths of both balancing methods, thereby ensuring a more comprehensive and effective multi-task optimization process in visual instruction tuning. The specific task weights $\bm{\lambda_{CoTBal}}$ for comprehensive task balancing can be calculated as:

$$\bm{\lambda_{CoTBal}}=\alpha\bm{\lambda_{C}}+(1-\alpha)\bm{\lambda_{D}}, \tag{11}$$

where $\alpha$ is a hyperparameter that controls the relative influence of inter-task contribution balancing and intra-task difficulty balancing. The training process of CoTBal is summarized in Algorithm 1.
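
The combination in Equation 11 can be illustrated with a minimal sketch. The function name, the three-task weight vectors and the range check on $\alpha$ are assumptions made for illustration; they are not values or code from the paper.

```python
def combine_task_weights(lambda_c, lambda_d, alpha=0.5):
    """Eq. (11): per-task combination alpha * lambda_C + (1 - alpha) * lambda_D."""
    if len(lambda_c) != len(lambda_d):
        raise ValueError("both weight vectors need one entry per task")
    # Assumption: alpha is treated as a mixing coefficient in [0, 1].
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return [alpha * c + (1.0 - alpha) * d for c, d in zip(lambda_c, lambda_d)]


# Hypothetical weights for three tasks (illustrative values only).
lambda_c = [1.2, 0.9, 0.9]
lambda_d = [0.6, 1.0, 1.4]
lambda_cotbal = combine_task_weights(lambda_c, lambda_d, alpha=0.5)
```

Setting `alpha=1.0` in this sketch recovers the contribution-only weights and `alpha=0.0` the difficulty-only weights, which correspond to the two single-strategy settings of the ablation.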

4 Experiments

4.1 Experimental Setup

Datasets.  The training data of CoTBal comes from several datasets: ShareGPT4V (Chen et al., 2023), VQAv2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), ChartQA (Masry et al., 2022), OCRVQA (Mishra et al., 2019), RefCOCO (Kazemzadeh et al., 2014; Mao et al., 2016) and ShareGPT (sha, 2023). The visual datasets cover different image domains and task types, so we treat each visual dataset as a distinct task, except for RefCOCO, which is split into two tasks: RefCOCO-caption and RefCOCO-grounding (listed as RefCOCO-bbox in the tables). The former generates descriptions for image regions defined by bounding boxes (bbox), while the latter produces the bbox of a described image region. In addition, the ShareGPT dataset, which contains only language conversation data, is used as a training task to keep the model from forgetting its general language conversation capabilities.

Inspired by Liu et al. (2023a), we add response format instructions to the data to clarify task requirements for the model, and apply several data processing strategies to reduce training costs and ensure fairness:
(1) For ShareGPT4V, the data is randomly partitioned into a validation set of 2k samples and a test set of 2k samples, with the remainder used for training.
(2) For all VQA datasets and RefCOCO, data from the same training image are shuffled and merged into a single conversation.
(3) For RefCOCO, training conversations are segmented into parts, each with fewer than 10 turns.
(4) For OCRVQA, 80k conversations are sampled from the training set.
(5) For VQAv2, GQA and OCRVQA, 20k samples are drawn from the validation set.
(6) For ShareGPT, invalid conversations are filtered out following Zheng et al. (2023), and conversations longer than 2048 tokens are truncated.
The training data sizes and response format instructions for each task are presented in Table 1.
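
Strategies (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`image`, `question`, `answer`), the function names and the reading of one "turn" as one question-answer pair are hypothetical and are not taken from the released code.

```python
import random
from collections import defaultdict


def merge_by_image(samples, seed=0):
    """Strategy (2): shuffle the QA pairs of each image and merge them into one conversation."""
    rng = random.Random(seed)
    by_image = defaultdict(list)
    for sample in samples:
        by_image[sample["image"]].append((sample["question"], sample["answer"]))
    conversations = []
    for image, turns in by_image.items():
        rng.shuffle(turns)
        conversations.append({"image": image, "turns": turns})
    return conversations


def segment(conversation, max_turns=9):
    """Strategy (3): split a conversation into parts with fewer than 10 turns each."""
    turns = conversation["turns"]
    return [
        {"image": conversation["image"], "turns": turns[i:i + max_turns]}
        for i in range(0, len(turns), max_turns)
    ]


# Toy data: 12 hypothetical QA pairs on one image and 1 on another.
samples = [{"image": "img_a.jpg", "question": f"q{i}", "answer": f"a{i}"} for i in range(12)]
samples.append({"image": "img_b.jpg", "question": "q", "answer": "a"})
parts = [part for conv in merge_by_image(samples) for part in segment(conv)]
```

In this toy run the 12-turn conversation is split into parts of 9 and 3 turns, and the single-turn conversation is kept as is.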

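| Tasks | Data Sizes | Response Format Instructions |
|---|---|---|
| ShareGPT | 41k | - |
| ShareGPT4V | 98k | - |
| VQAv2 | 83k | Answer the question using a single word or phrase. |
| GQA | 72k | Answer the question using a single word or phrase. |
| ChartQA | 18k | Answer the question using a single word or phrase. |
| OCRVQA | 80k | Answer the question using a single word or phrase. |
| RefCOCO-caption | 41k | Provide a short description for this region. |
| RefCOCO-bbox | 41k | Provide the bounding box coordinate of the region this sentence describes. |
| Total | 475k | |

Table 1: Summary of training data sizes and response format instructions for each task.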

Evaluation Metrics.  In the experiments, we first report the common evaluation metrics for each task: CIDEr (Vedantam et al., 2015) for image captioning tasks, Exact Match (EM) for visual question answering tasks, and Intersection over Union (IoU) for visual grounding tasks. Moreover, since multi-task visual instruction tuning aims to jointly improve performance across all tasks, we consider two metrics to comprehensively evaluate the effectiveness of methods: (1) $\Delta I\%$, the average per-task improvement, and (2) $\Delta E\%$, the average per-task error in test performance compared with models trained on individual tasks. These two metrics are calculated as:

$$I_i = \frac{1}{K_i}\sum_{j=1}^{K_i}(-1)^{\delta_{ij}}\frac{M_{e,ij}-M_{b,ij}}{M_{b,ij}}, \tag{12}$$
$$\Delta I\% = \frac{1}{N}\sum_{i=1}^{N} I_i, \tag{13}$$
$$\Delta E\% = \frac{1}{N}\sum_{i=1}^{N}\min(0, I_i), \tag{14}$$

where $N$ is the total number of tasks, $I_i$ is the test performance improvement for Task $i$, $K_i$ is the number of evaluation metrics for Task $i$, $M_{e,ij}$ is the value on the $j$-th metric for Task $i$ of the model trained by the evaluated method, and $M_{b,ij}$ is that of the baseline model trained individually on Task $i$. $\delta_{ij}$ is an indicator that is set to $0$ if a higher value is better on the $j$-th metric for Task $i$, and $1$ otherwise. The metric $\Delta E\%$ serves as an indicator of imbalance in model performance by focusing on the negative part of the performance improvement, i.e., where there is no improvement or even a decline in performance compared to baseline models; the tables report it as a non-negative magnitude, so lower is better. By aggregating these negative impacts across all tasks, $\Delta E\%$ provides a concise measure of how far a method benefits some tasks at the expense of others, thus revealing the degree of performance imbalance.
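
As a worked check of Equations 12 to 14, the sketch below recomputes both metrics for the EW row of Table 2 against the STL baseline. The function names are illustrative. Two conventions are assumptions made to match the reported numbers: results are multiplied by 100 to express them in percent, and $\Delta E\%$ is taken as the magnitude of the $\min(0, I_i)$ term, since the tables list it as a positive quantity.

```python
# Test results copied from Table 2; each inner list holds the metrics of one task.
# All of these metrics are higher-is-better, so delta_ij = 0 throughout.
STL = [[0.1285], [0.4330, 0.4658, 0.6019], [77.73], [61.23], [17.76], [68.22], [65.02, 51.58, 50.78]]
EW = [[0.1411], [0.4738, 0.5591, 0.5937], [78.27], [62.20], [19.60], [67.73], [76.05, 61.63, 62.80]]


def task_improvement(evaluated, baseline, lower_is_better=None):
    """Eq. (12): mean signed relative change over the metrics of one task."""
    flags = lower_is_better or [False] * len(evaluated)
    terms = [
        (-1.0 if flag else 1.0) * (m_e - m_b) / m_b
        for m_e, m_b, flag in zip(evaluated, baseline, flags)
    ]
    return sum(terms) / len(terms)


def delta_metrics(evaluated_tasks, baseline_tasks):
    """Eqs. (13)-(14), returned in percent."""
    improvements = [task_improvement(e, b) for e, b in zip(evaluated_tasks, baseline_tasks)]
    n_tasks = len(improvements)
    delta_i = 100.0 * sum(improvements) / n_tasks
    # Assumption: report the magnitude of min(0, I_i) to match the positive values in Table 2.
    delta_e = 100.0 * sum(-min(0.0, imp) for imp in improvements) / n_tasks
    return delta_i, delta_e


delta_i, delta_e = delta_metrics(EW, STL)
print(f"{delta_i:.2f} {delta_e:.2f}")
```

For the EW row this prints `7.30 0.10`, in agreement with Table 2. Only OCRVQA falls below its single-task baseline, so it is the only task that contributes to $\Delta E\%$.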

Compared Methods.  We compare the following methods: (1) our CoTBal algorithm; (2) the Single-Task Learning (STL) baseline, which trains and tests an independent model for each task; (3) Equal Weighting (EW), the indiscriminate data mixing approach, which minimizes the loss in Equation 1 without task weighting; (4) Task-Level Aggregation (TLA), which averages the losses of valid tokens within each task and then takes the mean loss across all tasks; (5) Random Loss Weighting (RLW) (Lin et al., 2021), which randomly assigns task weights; (6) Dynamic Weight Average (DWA) (Liu et al., 2019), which assigns larger weights to tasks whose training loss decreases more slowly; (7) Improvable Gap Balancing version 1 (IGBv1) (Dai et al., 2023b), which assigns larger weights to tasks with greater training losses. Methods (5)-(7) are optimization algorithms from the traditional MTL framework that dynamically update task weights in each training iteration. We apply the GTW paradigm to them for multi-task visual instruction tuning.

Note that we do not compare against traditional gradient-aggregation multi-task optimization algorithms. Such algorithms compute update gradients via backpropagation for each task separately in each iteration and then aggregate the gradients across all tasks. In multi-task visual instruction tuning, the large number of tasks and the massive volume of model parameters make this process impractical and excessively time-consuming.

| Methods | ShareGPT4V test (CIDEr ↑) | RefCOCO-caption Ref-test (CIDEr ↑) | RefCOCO-caption Ref-testB (CIDEr ↑) | RefCOCO-caption Refg-test (CIDEr ↑) | VQAv2 test-dev (EM ↑) | GQA test-bal (EM ↑) | ChartQA test (EM ↑) | OCRVQA test (EM ↑) | RefCOCO-bbox Ref-test (IoU ↑) | RefCOCO-bbox Ref-testB (IoU ↑) | RefCOCO-bbox Refg-test (IoU ↑) | $\Delta I\%$ ↑ | $\Delta E\%$ ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STL | 0.1285 | 0.4330 | 0.4658 | 0.6019 | 77.73 | 61.23 | 17.76 | 68.22 | 65.02 | 51.58 | 50.78 | - | - |
| EW | 0.1411 | 0.4738 | 0.5591 | 0.5937 | 78.27 | 62.20 | 19.60 | 67.73 | 76.05 | 61.63 | 62.80 | 7.30 | 0.10 |
| TLA | 0.1144 | 0.5083 | 0.5770 | 0.5327 | 77.72 | 60.42 | 22.36 | 67.80 | 71.79 | 56.58 | 58.40 | 4.94 | 1.85 |
| RLW | 0.1388 | 0.4810 | 0.5571 | 0.5538 | 77.28 | 60.61 | 18.20 | 66.73 | 70.78 | 55.86 | 57.01 | 3.44 | 0.54 |
| DWA | 0.1225 | 0.4659 | 0.5470 | 0.6006 | 78.28 | 61.82 | 19.88 | 67.87 | 76.74 | 61.12 | 63.88 | 5.35 | 0.74 |
| IGBv1 | 0.1349 | 0.4267 | 0.4824 | 0.6017 | 77.00 | 60.92 | 17.20 | 65.96 | 70.39 | 55.47 | 55.99 | 0.92 | 1.13 |
| CoTBal | 0.1437 | 0.4649 | 0.5724 | 0.5874 | 77.99 | 61.81 | 20.16 | 67.48 | 82.62 | 67.38 | 69.19 | 9.45 | 0.15 |

Table 2: Comparative results for multi-task visual instruction tuning. ↑ (↓) indicates that the higher (lower) the result, the better the performance. Ref-test and Ref-testB represent two test sets of Kazemzadeh et al. (2014), and Refg-test denotes the test set of Mao et al. (2016).

Implementation Details.  In the experiments, we fine-tune the pretrained LLaVA-v1.5-7B model on 8$\times$A100 (80G) GPUs using the same training setting and code as Liu et al. (2023a). For our CoTBal algorithm, we perform task balancing across all seven visual instruction-following tasks, while directly assigning a weight of $1.0$ to ShareGPT. The mini subset of each task is obtained by randomly sampling $1/32$ of that task's full dataset. Both the temperature hyperparameter $T$ and the control hyperparameter $\alpha$ are set to $0.5$.

4.2 Multi-Task Evaluations

Figure 2: Performance comparison radar chart of the CoTBal method and the EW method.

Figure 3: Numerical visualizations of inter-task contributions and intra-task difficulties in the training process of the CoTBal algorithm. (a) Heatmap of inter-task contributions. (b) Histogram of intra-task difficulties.

Table 2 presents the comparative results for multi-task visual instruction tuning. With the same foundation model and training data, CoTBal achieves the best average per-task performance improvement ($\Delta I\%$ of 9.45) while keeping the second-lowest average per-task performance error ($\Delta E\%$ of 0.15, against 0.10 for EW). As shown in Figure 2, compared to the most commonly employed EW method, CoTBal improves performance on the ShareGPT4V, ChartQA and RefCOCO-bbox tasks, most markedly on RefCOCO-bbox, while remaining competitive on the other tasks. This validates the effectiveness of our algorithm in terms of both overall performance and the degree of performance imbalance. Specifically, as depicted in Figure 3, CoTBal captures the variation in mutual contributions and inherent learning difficulties among these visual tasks, thereby providing appropriate task weights for the final model training and mitigating task conflicts.

Besides, we observe that TLA is significantly inferior to EW in both $\Delta I\%$ and $\Delta E\%$. TLA guarantees equality for each task in the final loss computation. However, variations in sample sequence length and data amount across tasks implicitly assign inappropriate weights to the losses of valid tokens: the implicit weight of a token is inversely related to the total number of valid tokens in its task, leading to poorer overall performance and a marked performance imbalance. This indicates that maintaining equality at the token level is more reasonable than preserving it at the task level, which supports the viability of the GTW paradigm in multi-task visual instruction tuning.
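
The difference between token-level and task-level averaging can be made concrete with a small sketch. It assumes that EW amounts to a plain average over all valid tokens (Equation 1 is not reproduced in this section); the function names and the toy loss values are hypothetical.

```python
def equal_weighting_loss(token_losses_by_task):
    """Token-level average: every valid token counts equally (assumed form of EW)."""
    all_losses = [loss for losses in token_losses_by_task.values() for loss in losses]
    return sum(all_losses) / len(all_losses)


def task_level_aggregation_loss(token_losses_by_task):
    """TLA: average the valid-token losses within each task, then average across tasks."""
    task_means = [sum(losses) / len(losses) for losses in token_losses_by_task.values()]
    return sum(task_means) / len(task_means)


def implicit_token_weights(token_losses_by_task):
    """Weight that TLA places on one token of each task: 1 / (N * n_i)."""
    n_tasks = len(token_losses_by_task)
    return {task: 1.0 / (n_tasks * len(losses)) for task, losses in token_losses_by_task.items()}


# Hypothetical batch: a captioning task with 8 valid tokens and a VQA task with 2.
batch = {"caption": [1.0] * 8, "vqa": [3.0] * 2}
ew = equal_weighting_loss(batch)          # (8 * 1.0 + 2 * 3.0) / 10
tla = task_level_aggregation_loss(batch)  # (1.0 + 3.0) / 2
weights = implicit_token_weights(batch)
```

In this toy batch each VQA token receives four times the weight of a captioning token under TLA, solely because the VQA task has fewer valid tokens.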

As for the compared traditional multi-task optimization algorithms (RLW, DWA and IGBv1), directly applying them to visual instruction tuning yields suboptimal results in both $\Delta I\%$ and $\Delta E\%$. We contend that assigning task weights based on training losses is imprecise, because the fine-tuning losses of large models fail to accurately reflect training progress. This is also why both the inter-task contribution and the intra-task difficulty are quantified by performance-based metrics in our CoTBal algorithm.

4.3 Ablation Studies

| Methods | $\Delta I\%$ ↑ | $\Delta E\%$ ↓ |
|---|---|---|
| EW | 7.30 | 0.10 |
| CoTBal ($T=2.0$) | 8.25 | 0.12 |
| CoTBal ($T=1.0$) | 8.41 | 0.11 |
| CoTBal ($T=0.5$) | 9.45 | 0.15 |
| CoTBal ($\bm{\lambda_{one2all}}$) | 7.87 | 0.10 |
| CoTBal ($\bm{\lambda_{all2one}}$) | 7.05 | 0.05 |
| CoTBal ($\bm{\lambda_{C}}$) | 7.09 | 0.06 |
| CoTBal ($\bm{\lambda_{D}}$) | 10.39 | 0.30 |
| CoTBal ($\bm{\lambda_{CoTBal}}$) | 9.45 | 0.15 |
| CoTBal (precise Difficulty) | 9.20 | 0.16 |
| CoTBal (real Difficulty) | 9.45 | 0.15 |

Table 3: Ablation results for multi-task visual instruction tuning. $T$ is the temperature hyperparameter, CoTBal ($\bm{\lambda}$) denotes the exclusive use of the specific $\bm{\lambda}$ for task weighting, and CoTBal (precise / real Difficulty) signifies the employment of the precise or real calculation approach for the intra-task difficulty.

As shown in Table 3, we analyze the impact of different training settings on model performance from three aspects: the temperature hyperparameter configuration, the task weighting strategy selection and the calculation approach for intra-task difficulties. The complete ablation results are presented in Appendix A. The compared methods include: EW; CoTBal ($T=2.0/1.0/0.5$), where the temperature hyperparameter $T$ is set to $2.0$, $1.0$ or $0.5$; CoTBal ($\bm{\lambda_{one2all}}/\bm{\lambda_{all2one}}/\bm{\lambda_{C}}/\bm{\lambda_{D}}/\bm{\lambda_{CoTBal}}$), where task weights are set to $\bm{\lambda_{one2all}}$, $\bm{\lambda_{all2one}}$, $\bm{\lambda_{C}}$, $\bm{\lambda_{D}}$ or $\bm{\lambda_{CoTBal}}$; and CoTBal (precise / real Difficulty), where the precise or real calculation approach for the intra-task difficulty is employed. Specifically, the precise calculation approach trains extra models using the full dataset and the mini subset from each task, while the real calculation approach repurposes the models trained for computing inter-task contributions to reduce additional training time.

In terms of the temperature hyperparameter configuration: CoTBal consistently outperforms EW in $\Delta I\%$ across all $T$ values, and its advantage grows as $T$ decreases (8.25, 8.41 and 9.45 for $T=2.0$, $1.0$ and $0.5$, against 7.30 for EW). The degree of task balancing increases as $T$ decreases, leading to an improved $\Delta I\%$, which demonstrates the efficacy of comprehensive task balancing. Conversely, $\Delta E\%$ rises slightly at the smallest temperature (0.15 at $T=0.5$, against 0.12 and 0.11 at $T=2.0$ and $1.0$). When the task weights become too uneven, tasks with significantly smaller weights inevitably underperform, resulting in a slight imbalance in performance.
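
The effect of the temperature on how uneven the task weights become can be illustrated with a generic sketch. The weighting formula itself is defined earlier in the paper and is not reproduced in this section, so the temperature-scaled softmax below (rescaled so that the weights average to 1), the function names and the score values are assumptions chosen purely for illustration.

```python
import math


def temperature_weights(scores, temperature):
    """Temperature-scaled softmax over task scores, rescaled so the weights average to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def spread(weights):
    """Gap between the largest and the smallest task weight."""
    return max(weights) - min(weights)


scores = [0.2, 0.5, 0.9]  # hypothetical per-task scores
for t in (2.0, 1.0, 0.5):
    w = temperature_weights(scores, t)
    print(t, [round(x, 3) for x in w], round(spread(w), 3))
```

Under this assumed form, lowering the temperature widens the gap between the largest and smallest weights, which is the mechanism the paragraph above refers to.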

In terms of the task weighting strategy selection: On the one hand, compared to the EW method, CoTBal ($\bm{\lambda_{one2all}}$) improves $\Delta I\%$ (7.87 against 7.30) while keeping $\Delta E\%$ unchanged, due to its preference for tasks that offer substantial contributions to other tasks. On the other hand, CoTBal ($\bm{\lambda_{all2one}}$) significantly reduces $\Delta E\%$ (0.05 against 0.10), mitigating the performance imbalance issue by prioritizing tasks that receive minimal benefits from other tasks. CoTBal ($\bm{\lambda_{C}}$) integrates these two strategies, achieving more balanced $\Delta I\%$ and $\Delta E\%$. Moreover, CoTBal ($\bm{\lambda_{D}}$) markedly improves $\Delta I\%$ (10.39) by assigning larger weights to tasks that are harder to learn, yet it also exacerbates the performance imbalance issue ($\Delta E\%$ of 0.30). Finally, CoTBal ($\bm{\lambda_{CoTBal}}$) integrates all three strategies, retaining most of the $\Delta I\%$ gain of $\bm{\lambda_{D}}$ (9.45) while halving its $\Delta E\%$ (0.15).

In terms of the calculation approach for the intra-task difficulty: The precise approach and the real approach perform similarly, with the real one even marginally surpassing the precise one in both $\Delta I\%$ (9.45 against 9.20) and $\Delta E\%$ (0.15 against 0.16). When calculating the intra-task difficulty of Task $i$, training with mini subsets from the other tasks has negligible impact on performance in Task $i$; hence our CoTBal algorithm employs the real calculation approach to significantly reduce training time while preserving performance.

5 Conclusion

In this paper, we devise Comprehensive Task Balancing (CoTBal), the first multi-task optimization algorithm tailored for visual instruction tuning of LMMs. Specifically, we first propose the Generic Task Weighting (GTW) paradigm. Based on this paradigm, we then design three task weighting strategies according to the inter-task contribution and the intra-task difficulty. Our experiments demonstrate that CoTBal outperforms existing methods, including the indiscriminate data mixing approach, significantly improving overall performance while ensuring task balance.

Limitations

Although the proposed CoTBal algorithm enhances the performance of multi-task visual instruction tuning, it still has two small drawbacks. Firstly, CoTBal requires extra time to compute the inter-task contribution and the intra-task difficulty. Specifically, the extra time is approximately $(1+(N-1)/32)$ times the duration needed to train the final model, where $N$ is the number of tasks. Secondly, the measurement of the inter-task contribution and the intra-task difficulty could be further improved: both are measured indirectly through validation performance, which may introduce slight noise. In our ongoing research, we will make further efforts on multi-task visual instruction tuning to overcome these drawbacks.
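
As a rough worked instance of this overhead, assume $N = 7$, i.e. the seven visual tasks balanced in Section 4.1 (whether ShareGPT should also be counted is not stated here, so this value is an assumption):

$$1 + \frac{N-1}{32} = 1 + \frac{6}{32} = 1.1875.$$

Under that assumption, computing the inter-task contributions and intra-task difficulties would take roughly 1.19 times as long as training the final model.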

References

  • sha (2023) 2023. Sharegpt. https://sharegpt.com/.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv Preprint ArXiv:2308.12966.
  • Caruana (1998) Rich Caruana. 1998. Multitask learning. Springer.
  • Chen et al. (2023) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023. Sharegpt4v: Improving large multi-modal models with better captions. ArXiv Preprint ArXiv:2311.12793.
  • Dai et al. (2023a) W Dai, J Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. 2023a. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv Preprint ArXiv:2305.06500.
  • Dai et al. (2023b) Yanqi Dai, Nanyi Fei, and Zhiwu Lu. 2023b. Improvable gap balancing for multi-task learning. In Uncertainty in Artificial Intelligence, pages 496–506. PMLR.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv Preprint ArXiv:2010.11929.
  • Gou et al. (2023) Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2023. Mixture of cluster-conditional lora experts for vision-language instruction tuning. ArXiv Preprint ArXiv:2312.12379.
  • Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
  • Hudson and Manning (2019) Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709.
  • Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 787–798.
  • Kendall et al. (2018) Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491.
  • Li et al. (2023) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. ArXiv Preprint ArXiv:2306.00890.
  • Lin et al. (2022) Baijiong Lin, YE Feiyang, Yu Zhang, and Ivor Tsang. 2022. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. Transactions on Machine Learning Research.
  • Lin et al. (2021) Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor W Tsang. 2021. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. ArXiv Preprint ArXiv:2111.10603.
  • Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning. ArXiv Preprint ArXiv:2310.03744.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. ArXiv Preprint ArXiv:2304.08485.
  • Liu et al. (2021) Liyang Liu, Yi Li, Zhanghui Kuang, J Xue, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. 2021. Towards impartial multi-task learning. In International Conference on Learning Representations.
  • Liu et al. (2019) Shikun Liu, Edward Johns, and Andrew J Davison. 2019. End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1871–1880.
  • Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1930–1939.
  • Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.
  • Masry et al. (2022) Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. ArXiv Preprint ArXiv:2203.10244.
  • Mishra et al. (2019) Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. 2019. Ocr-vqa: Visual question answering by reading text in images. In International Conference on Document Analysis and Recognition, pages 947–952. IEEE.
  • Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003.
  • Navon et al. (2022) Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. 2022. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pages 16428–16446. PMLR.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Ruder (2017) Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. ArXiv Preprint ArXiv:1706.05098.
  • Sener and Koltun (2018) Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. ArXiv Preprint ArXiv:2312.11805.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. ArXiv Preprint ArXiv:2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. ArXiv Preprint ArXiv:2307.09288.
  • Vandenhende et al. (2021) Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. 2021. Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
  • Wang et al. (2023) Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. 2023. Vigc: Visual instruction generation and correction. ArXiv Preprint ArXiv:2308.12714.
  • Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. ArXiv Preprint ArXiv:2212.10560.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. ArXiv Preprint ArXiv:2109.01652.
  • Yang et al. (2023) Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023. The dawn of lmms: Preliminary explorations with gpt-4v (ision). ArXiv Preprint ArXiv:2309.17421, 9(1).
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. ArXiv Preprint ArXiv:2304.14178.
  • Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836.
  • Zhang et al. (2023a) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023a. Instruction tuning for large language models: A survey. ArXiv Preprint ArXiv:2308.10792.
  • Zhang et al. (2023b) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. 2023b. Llavar: Enhanced visual instruction tuning for text-rich image understanding. ArXiv Preprint ArXiv:2306.17107.
  • Zhang and Yang (2021) Yu Zhang and Qiang Yang. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609.
  • Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. 2023. Svit: Scaling up visual instruction tuning. ArXiv Preprint ArXiv:2307.04087.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv Preprint ArXiv:2306.05685.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv Preprint ArXiv:2304.10592.

Appendix A Complete Results of Ablation Studies

We report the complete results of ablation studies for multi-task visual instruction tuning in Table 4.

| Methods | ShareGPT4V test (CIDEr ↑) | RefCOCO-caption Ref-test (CIDEr ↑) | RefCOCO-caption Ref-testB (CIDEr ↑) | RefCOCO-caption Refg-test (CIDEr ↑) | VQAv2 test-dev (EM ↑) | GQA test-bal (EM ↑) | ChartQA test (EM ↑) | OCRVQA test (EM ↑) | RefCOCO-bbox Ref-test (IoU ↑) | RefCOCO-bbox Ref-testB (IoU ↑) | RefCOCO-bbox Refg-test (IoU ↑) | $\Delta I\%$ ↑ | $\Delta E\%$ ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EW | 0.1411 | 0.4738 | 0.5591 | 0.5937 | 78.27 | 62.20 | 19.60 | 67.73 | 76.05 | 61.63 | 62.80 | 7.30 | 0.10 |
| CoTBal ($T = 2.0$) | 0.1433 | 0.4540 | 0.5642 | 0.5973 | 78.10 | 62.08 | 20.16 | 67.67 | 78.01 | 63.06 | 64.73 | 8.25 | 0.12 |
| CoTBal ($T = 1.0$) | 0.1369 | 0.4605 | 0.5752 | 0.5948 | 78.15 | 62.09 | 20.32 | 67.71 | 80.61 | 65.00 | 66.78 | 8.41 | 0.11 |
| CoTBal ($T = 0.5$) | 0.1437 | 0.4649 | 0.5724 | 0.5874 | 77.99 | 61.81 | 20.16 | 67.48 | 82.62 | 67.38 | 69.19 | 9.45 | 0.15 |
| CoTBal ($\bm{\lambda_{one2all}}$) | 0.1448 | 0.4528 | 0.5763 | 0.6036 | 78.34 | 62.16 | 19.52 | 67.72 | 77.59 | 61.81 | 63.47 | 7.87 | 0.10 |
| CoTBal ($\bm{\lambda_{all2one}}$) | 0.1333 | 0.4617 | 0.5520 | 0.5961 | 78.25 | 62.12 | 20.20 | 68.00 | 76.98 | 62.61 | 64.12 | 7.05 | 0.05 |
| CoTBal ($\bm{\lambda_{C}}$) | 0.1340 | 0.4645 | 0.5626 | 0.5934 | 78.39 | 62.27 | 20.04 | 67.92 | 77.46 | 61.92 | 63.67 | 7.09 | 0.06 |
| CoTBal ($\bm{\lambda_{D}}$) | 0.1455 | 0.4783 | 0.5706 | 0.5963 | 77.46 | 61.30 | 20.08 | 67.04 | 85.13 | 71.52 | 72.88 | 10.39 | 0.30 |
| CoTBal ($\bm{\lambda_{CoTBal}}$) | 0.1437 | 0.4649 | 0.5724 | 0.5874 | 77.99 | 61.81 | 20.16 | 67.48 | 82.62 | 67.38 | 69.19 | 9.45 | 0.15 |
| CoTBal (precise Difficulty) | 0.1345 | 0.4767 | 0.5604 | 0.5952 | 78.00 | 61.87 | 21.04 | 67.46 | 82.06 | 67.83 | 69.08 | 9.20 | 0.16 |
| CoTBal (real Difficulty) | 0.1437 | 0.4649 | 0.5724 | 0.5874 | 77.99 | 61.81 | 20.16 | 67.48 | 82.62 | 67.38 | 69.19 | 9.45 | 0.15 |
Table 4: Complete results of ablation studies for multi-task visual instruction tuning. ↑ (↓) indicates that the higher (lower) the result, the better the performance. Ref-test and Ref-testB represent two test sets of Kazemzadeh et al. (2014), and Refg-test denotes the test set of Mao et al. (2016).
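
As a reading aid for Table 4, the following minimal sketch compares the EW row with the CoTBal ($\bm{\lambda_{CoTBal}}$) row column by column. The values are copied from the table; the per-column relative change and the names `relative_change`, `ew`, and `cotbal` are illustrative assumptions and do not reproduce the paper's $\Delta I\%$ or $\Delta E\%$ definitions.

```python
# Values copied from Table 4: EW and CoTBal with the full lambda_CoTBal weights.
# The relative-change measure below is an illustrative assumption, not the
# paper's Delta I% or Delta E% metric.
columns = [
    "ShareGPT4V test",
    "RefCOCO-caption Ref-test",
    "RefCOCO-caption Ref-testB",
    "RefCOCO-caption Refg-test",
    "VQAv2 test-dev",
    "GQA test-bal",
    "ChartQA test",
    "OCRVQA test",
    "RefCOCO-bbox Ref-test",
    "RefCOCO-bbox Ref-testB",
    "RefCOCO-bbox Refg-test",
]
ew = [0.1411, 0.4738, 0.5591, 0.5937, 78.27, 62.20, 19.60, 67.73, 76.05, 61.63, 62.80]
cotbal = [0.1437, 0.4649, 0.5724, 0.5874, 77.99, 61.81, 20.16, 67.48, 82.62, 67.38, 69.19]


def relative_change(baseline, candidate):
    """Percentage change of candidate over baseline (every metric here is higher-is-better)."""
    return [100.0 * (c - b) / b for b, c in zip(baseline, candidate)]


changes = relative_change(ew, cotbal)
for name, pct in zip(columns, changes):
    print(f"{name:28s} {pct:+6.2f}%")

# Columns where CoTBal scores above EW.
improved = [name for name, pct in zip(columns, changes) if pct > 0]
print("Improved columns:", improved)
```

On these numbers, the largest relative gains fall in the three RefCOCO-bbox columns, while VQAv2, GQA, and OCRVQA each move by less than one percent.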