Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

Zhiyang Xu^♠ Minqian Liu^♠ Ying Shen^♠
Joy Rimchala^♢ Jiaxin Zhang^♢ Qifan Wang^♡ Yu Cheng^♣ Lifu Huang^♠
^♠Virginia Tech ^♢Intuit AI Research ^♡Meta AI
^♣The Chinese University of Hong Kong
^♠{zhiyangx, lifuh}@vt.edu

Abstract

Recent advancements in Vision-Language Models (VLMs) have led to the development of Vision-Language Generalists (VLGs) capable of understanding and generating interleaved images and text. Despite these advances, VLGs still struggle to follow user instructions for interleaved text and image generation. To address this issue, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning dataset, with over 30,000 high-quality instances across more than 10 domains. Due to the extensive size of existing VLGs, we opt for parameter-efficient tuning. However, we observe that VLGs tuned with a standard LoRA typically exhibit inferior performance in interleaved text-image generation. We attribute this problem to modality interference and the lack of modality-specialized adaptation design. Hence, we propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization. Lateralization LoRA employs a hybrid approach, combining the traditional linear LoRA for generating text with a Convolutional LoRA for generating images, enabling the generation of high-quality text and images by leveraging modality-specific structures and parameter sets. We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset. Extensive experiments demonstrate that EMU2 tuned with Lateralization LoRA achieves state-of-the-art performance, significantly surpassing baseline models in complex interleaved tasks.

1 Introduction

Figure 1: Example outputs of EMU2 and EMU2 instruction tuned with LoRA on the LeafInstruct dataset. <IMG> tokens indicate where to insert images.

Recent advancements in Vision-Language Models (VLMs) Alayrac et al. (2022); Li et al. (2023c); Liu et al. (2023c); Bai et al. (2023); Liu et al. (2023b); Lin et al. (2024), which integrate pretrained image encoders with pretrained language models, have demonstrated great promise as versatile visual assistants. However, despite their ability to process inputs composed of interleaved images and text (i.e., multiple images and text segments arranged in arbitrary sequences), these models are mostly limited to generating only textual responses. This restricts their utility in the wide array of applications that require the simultaneous generation of both images and text, such as script generation Qi et al. (2024), visual storytelling Huang et al. (2016), and many others. To address this limitation, some initial efforts Sun et al. (2023c, a); Koh et al. (2023); Aghajanyan et al. (2022); Li et al. (2023b); Team (2024) have been made towards developing Vision-Language Generalists (VLGs) capable of both accepting and generating images and text in an interleaved fashion, by integrating a CLIP image encoder Radford et al. (2021), an LLM Touvron et al. (2023), and a diffusion-based image decoder Rombach et al. (2022).

Despite these advancements, existing VLGs struggle to follow user instructions to generate interleaved text and images. Current models Sun et al. (2023c, a) are often pretrained on interleaved documents such as MMC4 Zhu et al. (2023c) but are only instruction-tuned for single-modality generation, and therefore fail to adhere to human instructions when performing interleaved generation tasks. Moreover, the absence of a large-scale, open-source dataset specifically designed for interleaved instruction tuning significantly impedes the training of VLGs to proficiently produce interleaved text and images as directed by users. In response to this scarcity of interleaved instruction tuning data, we propose LeafInstruct, the first open-sourced, high-quality interleaved instruction tuning dataset, with over 30,000 instances spanning more than 10 domains. Leveraging open-source Large Language Models (LLMs) and various toolboxes, we have developed a rigorous automatic pipeline that filters extensive amounts of noisy web data, annotates detailed instructions, and rewrites text content to enhance quality.

Considering the substantial computational costs associated with full-parameter finetuning of large VLGs, we explore parameter-efficient instruction tuning using LeafInstruct. However, we find that merely applying parameter-efficient instruction tuning on VLGs tends to yield low-quality outputs for both text and images. Figure 1 shows examples of the interleaved generation from the pre-trained EMU2 and from the EMU2 finetuned with a standard LoRA Hu et al. (2021) on LeafInstruct. We observe that, after instruction tuning with standard LoRA, the quality of generated images degrades, with local inconsistency and distortion. We hypothesize that the inferior performance stems from using a single LoRA to handle different modalities. Previous studies indicate that while the standard transformer architecture Vaswani et al. (2017) employed in LLMs excels at NLP tasks, it is less effective at modeling the local priors of image patches, which are crucial for various vision tasks Zhong et al. (2024); Chen et al. (2023d). This architectural inadequacy can cause VLGs to produce images with local inconsistency and distortion among adjacent patches, underscoring the need for distinct, optimal structures tailored to each modality in interleaved VLGs.

To better accommodate the distinct requirements of interleaved text and image generation, we propose integrating modality-specialized adaptations within the state-of-the-art VLG, EMU2 Sun et al. (2023a). Specifically, for image patch embeddings, we utilize low-rank convolutional adaptation layers to better model the local prior in images. For text tokens, we employ a separate set of linear Low-Rank Adaptation (LoRA) layers, acknowledging the distinct sequential modeling process of text compared to images. During training, both LoRA architectures are zero-initialized and progressively fine-tuned from their initial state, while the LLM parameters remain frozen. Our design allows each modality to have its own specialized parameters and optimal adaptation design, aligning with recent findings Shen et al. (2023): training modality-specific experts in VLMs can significantly reduce modality interference and enhance model performance. We name our novel modality-specialized adaptation method Lateralization LoRA, drawing intuition from the theory of brain lateralization Halpern et al. (2005); Rogers (2021), which states that (1) one hemisphere of the brain is better at performing certain functions than the other and (2) although the hemispheres look similar, their neuronal networks are different, allowing for specialized functions.

To validate the effectiveness of our methods and dataset, we conduct extensive experiments using InterleavedBench Liu et al. (2024), a recently introduced interleaved evaluation dataset. The results demonstrate that EMU2 instruction-tuned with Lateralization LoRA achieves state-of-the-art performance across multiple evaluation aspects. In summary, our contributions are threefold:

• We leverage open-source LLMs to automatically generate training data for interleaved instruction tuning. To fill the gap in existing resources, we introduce the first publicly available instruction-tuning dataset for interleaved generation across diverse domains.

• We introduce Lateralization LoRA, a novel parameter-efficient adaptation method that incorporates two types of LoRAs, enhancing the ability of autoregressive VLGs to generate interleaved text and images by allowing each modality to have its own specialized parameters and optimal adaptation architecture.

• EMU2 instruction-tuned with Lateralization LoRA on the LeafInstruct dataset achieves significant performance improvements on all aspects of InterleavedBench, outperforming existing open-source baselines.

2 Related Work

Interleaved Vision-Language Models

There are two popular formulations for VLGs. The first leverages VQGAN Esser et al. (2021) to quantize an image into a long sequence of discrete tokens and adds the entries of VQGAN’s codebook to the vocabulary of the LLM Aghajanyan et al. (2022); Yu et al. (2023); Yasunaga et al. (2023); Team (2024); Jin et al. (2023). In this way, the LLM is trained with a unified autoregressive objective to predict image tokens or text tokens. The predicted image tokens are fed into a VQGAN decoder to reconstruct images. The second employs the CLIP image encoder to transform images into sequences of continuous embeddings Koh et al. (2023); Tang et al. (2023); Zhu et al. (2023b); Sun et al. (2023c, a); Li et al. (2024); Wu et al. (2023); Tian et al. (2024), which are then concatenated with text embeddings in their original sequence order. Compared to the first approach, this formulation often requires shorter sequences to represent an image and generally yields superior performance. Our proposed method requires minimal assumptions on the VLG’s architecture and can be applied to many of the existing transformer-based VLGs.

Visual Instruction Tuning

Xu et al. (2023a) propose MultiInstruct, the first human-labeled visual instruction tuning dataset to improve the generalizability of VLMs. LLaVA Liu et al. (2023c) leverages GPT-4 to convert image captions from existing annotations into three tasks: visual dialogues, visual question answering, and detailed captions. Subsequent studies either utilize proprietary LLMs Dai et al. (2023); Ye et al. (2023); Yin et al. (2023); Liu et al. (2023b); Li et al. (2023a); Lyu et al. (2023); Zhu et al. (2023a); Wang et al. (2023); Chen et al. (2023b) or human efforts Liu et al. (2023b); Xu et al. (2024) to augment visual instruction tuning tasks. Several studies target specific aspects of VLMs’ capability, such as domain and instruction bias Avrahami et al. (2022); Liu et al. (2023a), object grounding Chen et al. (2023a), and OCR Zhang et al. (2023b); Hu et al. (2023). Instruction tuning has also been widely applied to other vision-language tasks, such as image editing Brooks et al. (2023a) and interleaved text-image understanding Jiang et al. (2024). Hu et al. (2024) finetune a model that can follow multimodal instructions to generate desired images. However, most existing instruction-tuning datasets only consider tasks where the outputs are in a single modality, i.e., either text or image. To facilitate the training and enhance the instruction-following capabilities of VLGs, we curated LeafInstruct, the first instruction-tuning dataset tailored for interleaved text-image generation across diverse domains, where the inputs and outputs can contain interleaved text and multiple images.

Parameter-Efficient Finetuning (PEFT)

PEFT methods Hu et al. (2021); Li and Liang (2021); Karimi Mahabadi et al. (2021); Zaken et al. (2022); Jia et al. (2022); Lian et al. (2022); Jie and Deng (2022); Liu et al. (2022); Chen et al. (2023d); Zhong et al. (2024) aim to adapt pretrained large models to various downstream tasks and have become prevalent in instruction tuning. Typically, these methods involve freezing the pretrained large models while finetuning a minimal set of newly introduced parameters. Recent studies Wang et al. (2022); Zadouri et al. (2023); Lin et al. (2024); Shen et al. (2024) propose to combine PEFT methods with Mixture-of-Experts to mitigate task interference and enhance performance, particularly in visual instruction tuning where models need to process inputs from two modalities. Our proposed Lateralization LoRA is the first PEFT method that utilizes two distinct LoRA architectures—linear and convolutional—for text and image generation within autoregressive interleaved generation models.

3 Background: Vision-Language Generalist

Base Model

The base VLG we leverage is Emu2 due to its strong performance. Emu2 consists of a CLIP image encoder, EVA-02-CLIP-E-plus Sun et al. (2023b); a decoder-only large language model, LLaMA-33B Touvron et al. (2023); and an image decoder, SDXL Podell et al. (2023). Given a sequence of interleaved text segments and images, the CLIP encoder encodes each image into a sequence of continuous image embeddings. The image embeddings are further mapped by a linear projector into the semantic space of the LLM. The embeddings of images and text segments are then concatenated in their original order and fed into the transformer layers. The language-modeling head of the LLM maps the hidden states of text tokens from the last transformer layer into probability distributions over the vocabulary. The image-regression head projects the hidden states of image patches back to the latent space of the CLIP encoder. Finally, the image decoder takes in the predicted image embeddings and decodes them into the target image.

Training

The training objective of VLGs can be loosely defined in the following unified autoregressive manner.

$$\operatorname*{arg\,max}_{\theta}\sum^{\mathcal{D}}\sum_{n=1}^{N}P_{\theta}(s_{n}|s_{1},s_{2},...,s_{n-1})\tag{1}$$

where $\theta$ denotes the model parameters, $N$ denotes the input sequence length, $\mathcal{D}$ denotes the training dataset, and $s_{n}$ denotes a text token or an image-patch embedding. The unified objective is optimized using two types of losses: (1) for text tokens, the cross-entropy loss minimizes the distance between the probability distribution over the vocabulary predicted by the language-modeling head and the true distribution of the sequence of text tokens; (2) for image embeddings, the mean-squared-error (MSE) loss minimizes the distance between the image embeddings predicted by the image-regression head and the real image embeddings. Although the EMU2 model is pre-trained with interleaved documents, it is instruction-tuned to generate either text or an image, i.e., output in a single modality.
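As an illustration of the two loss types described above, the following is a minimal sketch in plain Python. The function names, the step format, and the equal weighting of the two losses are assumptions made for this example; the source does not specify how the losses are weighted or reduced.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def mse(pred, target):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def sequence_loss(steps):
    """Sum per-position losses over one interleaved sequence.

    Each step is ("text", logits, target_index) or ("image", pred, target).
    Equal weighting of both loss types is an assumption of this sketch.
    """
    total = 0.0
    for kind, prediction, target in steps:
        if kind == "text":
            total += cross_entropy(prediction, target)
        else:
            total += mse(prediction, target)
    return total

# Toy sequence: one text position followed by one image-patch position.
steps = [
    ("text", [2.0, 0.5, -1.0], 0),
    ("image", [0.1, 0.2, 0.3], [0.1, 0.0, 0.3]),
]
loss = sequence_loss(steps)
```

Text positions contribute a classification loss over the vocabulary, while image positions contribute a regression loss in the embedding space, so a single sequence can mix both.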

4 Method

4.1 Lateralization LoRA

In this subsection, we first briefly recap the mechanism of the original LoRA. Then we introduce Convolutional LoRA and explain its structures. Finally, we explain how to combine two LoRAs and apply them to autoregressive VLGs.

Low-Rank Adaptation (LoRA)

LoRA Hu et al. (2021) is a parameter-efficient finetuning method that freezes the pretrained model weights and injects low-rank decomposable matrices into the layers of transformers. Formally, given the weights in a linear layer $\mathbf{W}\in\mathbb{R}^{d_{out}\times d_{in}}$, LoRA modifies the weights by adding a decomposable weight matrix $\Delta\mathbf{W}$ to $\mathbf{W}$. Thus, for a vector $\mathbf{h}\in\mathbb{R}^{d_{in}}$, the modified linear transformation $T:\mathbb{R}^{d_{in}}\rightarrow\mathbb{R}^{d_{out}}$ becomes:

$$T(\mathbf{h})=\mathbf{h}(\mathbf{W}+\Delta\mathbf{W})^{\intercal}=\mathbf{h}\mathbf{W}^{\intercal}+\mathbf{h}\Delta\mathbf{W}^{\intercal}\tag{2}$$

$\Delta\mathbf{W}$ is decomposed into two low-rank matrices, i.e., LoRA $A$: $\mathbf{W}_{A}\in\mathbb{R}^{r\times d_{in}}$ and LoRA $B$: $\mathbf{W}_{B}\in\mathbb{R}^{d_{out}\times r}$, satisfying the low-rank constraint $r\ll\min(d_{out},d_{in})$. The final expression is

$$T(\mathbf{h})=\mathbf{h}\mathbf{W}^{\intercal}+\alpha\mathbf{h}\mathbf{W}_{A}^{\intercal}\mathbf{W}_{B}^{\intercal}\tag{3}$$

where $\alpha\in\mathbb{R}$ is a hyper-parameter.
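Equation 3 can be traced on a small example. The sketch below uses plain Python lists and a column-vector convention (computing $\mathbf{W}\mathbf{h}$, which equals $\mathbf{h}\mathbf{W}^{\intercal}$ for a single vector). The helper names and the toy values are illustrative assumptions, not taken from the paper.

```python
def matvec(W, h):
    """Compute W h for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def lora_linear(h, W, W_A, W_B, alpha):
    """Equation 3 for one vector h: W h + alpha * W_B (W_A h)."""
    base = matvec(W, h)             # frozen pretrained path, length d_out
    low_rank = matvec(W_A, h)       # LoRA A projects down to rank r
    update = matvec(W_B, low_rank)  # LoRA B projects back up to d_out
    return [b + alpha * u for b, u in zip(base, update)]

# Toy sizes (assumed for illustration): d_in = 3, d_out = 2, r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W_A = [[1.0, 1.0, 1.0]]
W_B = [[0.5], [-1.0]]
h = [1.0, 2.0, 3.0]

adapted = lora_linear(h, W, W_A, W_B, alpha=2.0)            # [13.0, -10.0]
unchanged = lora_linear(h, W, W_A, [[0.0], [0.0]], alpha=2.0)  # [7.0, 2.0]
```

With $\mathbf{W}_{B}$ set to zero (a common LoRA initialization, assumed here), the low-rank branch contributes nothing and the layer reproduces the frozen output $\mathbf{h}\mathbf{W}^{\intercal}$.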

Convolutional LoRA

We propose Convolutional LoRA, a variant of LoRA specifically designed for modeling the local structure of image hidden states, obtained by simplifying the architecture proposed in Zhong et al. (2024). It consists of a convolutional LoRA $A$ layer, $\text{Conv}_{k\times k}$, with kernel size $k\times k$, $C_{in}$ input channels, and $r$ output channels, and a LoRA $B$ matrix $\mathbf{W}_{B}\in\mathbb{R}^{C_{out}\times r}$. Given the 2D feature $\mathbf{I}\in\mathbb{R}^{H\times W\times C_{in}}$ of an image, where $H$ denotes the height, $W$ denotes the width, and $C_{in}$ denotes the number of channels of $\mathbf{I}$, the convolutional LoRA $A$ projects the number of channels down to $r$ while simultaneously performing the convolution operation. LoRA $B$ then projects the number of channels up to $C_{out}$. Equation 3 becomes:

$$\tilde{T}(\mathbf{I})=\mathbf{I}\mathbf{W}^{\intercal}+\alpha\text{Conv}_{k\times k}(\mathbf{I})\mathbf{W}_{B}^{\intercal}\tag{4}$$

where $\alpha$ is a hyper-parameter.

Combine Two LoRAs in VLGs

As shown in Figure 2, we propose to combine the two types of LoRA in VLGs, i.e., using linear LoRA for text generation and Convolutional LoRA for image generation. Formally, let $\mathbf{W}\in\mathbb{R}^{d_{out}\times d_{in}}$ be the weights of any linear layer in an LLM, and let $\mathbf{H}=[\mathbf{h}_{1}^{t},...,\mathbf{h}_{m}^{t},\mathbf{h}_{m+1}^{i},\mathbf{h}_{m+2}^{i},...,\mathbf{h}_{m+(H\times W)}^{i},...,\mathbf{h}_{N}^{t}]\in\mathbb{R}^{N\times d_{in}}$ denote the hidden states of a sequence of interleaved text and images, where the subscript indicates the position of a hidden state and the superscript indicates whether a hidden state is decoded into a text token ( $t$ ) or into an image-patch embedding ( $i$ ). We separate $\mathbf{H}$ into text hidden states $\mathbf{H}^{t}=[\mathbf{h}_{1}^{t},\mathbf{h}_{2}^{t},...,\mathbf{h}_{m}^{t},\mathbf{h}_{m+(H\times W)+1}^{t},...,\mathbf{h}_{N}^{t}]$ and image hidden states $\mathbf{H}^{i}=[[\mathbf{h}_{m+1}^{i},...,\mathbf{h}_{m+(H\times W)}^{i}],[\mathbf{h}_{n+1}^{i},...,\mathbf{h}_{n+(H\times W)}^{i}],...]$, where $m+1$ and $n+1$ denote the starting positions of two subsequences of image hidden states. Each subsequence of a single image has a fixed length of $H\times W$, and we reshape the hidden states of each image in $\mathbf{H}^{i}$ into a 2D structure. Hence, the dimension of $\mathbf{H}^{i}$ becomes $B\times H\times W\times C_{in}$, where $B$ denotes the number of images in the sequence $\mathbf{H}$ and $C_{in}=d_{in}$. We feed $\mathbf{H}^{t}$ into Equation 3 to get $\hat{\mathbf{H}}^{t}=T(\mathbf{H}^{t})$ and $\mathbf{H}^{i}$ into Equation 4 to get $\hat{\mathbf{H}}^{i}=\tilde{T}(\mathbf{H}^{i})$.
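The separation, reshaping, and reassembly of hidden states can be sketched as follows. This is an illustrative sketch only: the function name, the `"t"`/`"i"` tag encoding, and the callables standing in for Equations 3 and 4 are assumptions, and every image block is assumed to be complete (exactly $H\times W$ consecutive states).

```python
def route_by_modality(hidden, tags, H, W, text_fn, image_fn):
    """Apply text_fn to text states and image_fn to each H x W image block.

    hidden: list of N hidden-state vectors
    tags:   list of N labels, "t" for text or "i" for image
    Returns the transformed states reassembled in their original order.
    """
    out = [None] * len(hidden)
    n = 0
    while n < len(hidden):
        if tags[n] == "t":
            out[n] = text_fn(hidden[n])  # stand-in for the linear LoRA path
            n += 1
            continue
        block = hidden[n:n + H * W]  # one image: H*W consecutive states
        grid = [block[y * W:(y + 1) * W] for y in range(H)]  # reshape to H x W
        new_grid = image_fn(grid)  # stand-in for the Convolutional LoRA path
        for y in range(H):
            for x in range(W):
                out[n + y * W + x] = new_grid[y][x]
        n += H * W
    return out

# Toy example: one text state, a 2 x 2 image, then one more text state.
hidden = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
tags = ["t", "i", "i", "i", "i", "t"]
double = lambda h: [2 * v for v in h]
negate = lambda grid: [[[-v for v in h] for h in row] for row in grid]
routed = route_by_modality(hidden, tags, 2, 2, double, negate)
```

The toy callables only make the routing visible: text positions are doubled and image positions are negated, so the output shows which path each position took.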

Figure 3 visualizes the convolutional operation applied to a sequence of image patches. The yellow squares on the left denote the reshaped input image patches and the larger green squares denote the convolution kernels. The number on each square denotes the original position of a patch in the image sequence. For demonstration purposes, we draw image patches with $H=5$ and $W=5$. Notably, since this is an autoregressive model, the current hidden state can only depend on previous hidden states. Thus, when applying the convolution operation on an image patch, the kernel only covers neighbouring patches on the top and left sides of that patch. For example, the new hidden state of patch $19$ is computed from patches $13$, $14$, $18$, and $19$. To preserve the shape ( $H\times W$ ) of the input image patches, we pad the reshaped image hidden states with zero vectors on the top and left sides, as shown by the grey squares in Figure 3. Finally, we assemble $\hat{\mathbf{H}}^{i}$ and $\hat{\mathbf{H}}^{t}$ back into their original sequence order to form $\hat{\mathbf{H}}$.
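The causal convolution with top and left zero padding can be sketched in plain Python as below. This computes only the low-rank update term $\alpha\,\text{Conv}_{k\times k}(\mathbf{I})\mathbf{W}_{B}^{\intercal}$ of Equation 4, not the frozen path. The nested-list layout, the function names, and the absence of a bias term are assumptions of this sketch; the patch example assumes 1-indexed row-major numbering and a $2\times 2$ kernel, which is consistent with the patch-19 example in the text.

```python
def causal_conv_lora(I, kernel, W_B, alpha):
    """Low-rank update alpha * Conv_kxk(I) W_B^T with top/left zero padding.

    I:      H x W x C_in nested lists (reshaped image hidden states)
    kernel: r x k x k x C_in nested lists (convolutional LoRA A)
    W_B:    C_out x r nested lists (LoRA B)
    """
    H, W, k = len(I), len(I[0]), len(kernel[0])
    out = []
    for y in range(H):
        row = []
        for x in range(W):
            low = []
            for filt in kernel:  # one filter per rank channel
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y - (k - 1) + dy, x - (k - 1) + dx
                        if yy < 0 or xx < 0:
                            continue  # zero padding contributes nothing
                        acc += sum(w * v for w, v in zip(filt[dy][dx], I[yy][xx]))
                low.append(acc)
            row.append([alpha * sum(b * z for b, z in zip(b_row, low)) for b_row in W_B])
        out.append(row)
    return out

def receptive_field(patch, W, k):
    """1-indexed raster positions covered by the kernel for a given patch."""
    y, x = divmod(patch - 1, W)
    return sorted(
        yy * W + xx + 1
        for yy in range(max(0, y - k + 1), y + 1)
        for xx in range(max(0, x - k + 1), x + 1)
    )

field = receptive_field(19, W=5, k=2)  # [13, 14, 18, 19]

# Tiny numeric check: 2 x 2 image, one channel, all-ones 2 x 2 kernel, rank 1.
I = [[[1.0], [2.0]], [[3.0], [4.0]]]
ones = [[[[1.0], [1.0]], [[1.0], [1.0]]]]
update = causal_conv_lora(I, ones, [[1.0]], alpha=1.0)
```

Because the kernel window ends at the current patch, each output position depends only on patches at the same or earlier raster positions, which is what keeps the operation compatible with autoregressive decoding.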

4.2 Interleaved Inference

In this subsection, we explain our designs to enable the interleaved inference of VLGs with the proposed Lateralization LoRA and the automatic routing of hidden states. Trained on our LeafInstruct dataset, the VLG learns to automatically decide whether to generate a text segment or an image based on the interleaved context. The generation process begins with the VLG generating a textual response. At each generation step, the VLG predicts a new text token, and this token becomes the input for generating the next token. The hidden states of newly generated tokens are always routed to the standard LoRA specialized for text. If the textual response concludes with the special image generation token <IMG>, the generated text is concatenated with the original context. Subsequently, the VLG takes in the updated context ending with <IMG> and generates a fixed-length ( $H\times W$ ) sequence of image patch embeddings. Notably, the hidden state of the <IMG> token is used to predict the first image patch embedding, and hence it is routed to the Convolutional LoRA. At each step, the VLG takes in the newly generated image patch embedding and predicts the next image patch embedding. The hidden states of all newly generated image embeddings are routed to the Convolutional LoRA. The generated image, followed by an end-of-image token </IMG>, is appended to the context to serve as a new input, prompting the model to resume textual response generation. This iterative process terminates when the VLG produces the end-of-generation token </s> at the end of a textual response. It is important to note that a textual response may contain only an <IMG> token, indicating that image generation starts immediately, without preceding text. Likewise, a textual response may contain only a </s> token, indicating that the generation should end immediately, without any new text at the end. These two special cases enable the VLG to generate images and text segments in an arbitrary order.
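The control flow of this inference procedure can be sketched as a simple loop. The two callables `next_text_token` and `next_patch` are hypothetical stand-ins for the VLG's text and image pathways, and `max_steps` is a safety bound added for this sketch; none of these names come from the paper.

```python
IMG, END_IMG, EOS = "<IMG>", "</IMG>", "</s>"

def interleaved_generate(next_text_token, next_patch, context, H, W, max_steps=1000):
    """Alternate between text and image generation until </s> is produced.

    next_text_token(context) -> str    (hypothetical: text path, linear LoRA)
    next_patch(context) -> embedding   (hypothetical: image path, Convolutional LoRA)
    """
    context = list(context)
    for _ in range(max_steps):
        token = next_text_token(context)
        context.append(token)
        if token == EOS:
            break  # end of generation
        if token == IMG:
            for _ in range(H * W):  # fixed-length sequence of patch embeddings
                context.append(next_patch(context))
            context.append(END_IMG)  # resume text generation afterwards
    return context

# Scripted stand-in for a model: text, an image, more text, then stop.
script = iter(["A cat.", IMG, "The end.", EOS])
output = interleaved_generate(
    lambda ctx: next(script), lambda ctx: "patch", ["prompt"], H=1, W=2
)
```

The two special cases in the text fall out of the same loop: a response consisting only of `<IMG>` starts an image immediately, and a response consisting only of `</s>` ends generation without further text.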

In this autoregressive generation process, a challenge arises with the one-by-one generation of image patches due to the shape requirements of the convolutional operation. To address this issue, we pad the image patch embeddings with zero vectors to achieve a sequence length of $H\times W$ .
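A minimal sketch of this padding step is shown below. The function name and the list-of-vectors representation are assumptions for illustration; the source states only that zero vectors are used to reach a length of $H\times W$.

```python
def pad_to_grid(patches, H, W, dim):
    """Pad generated patch embeddings with zero vectors and reshape to H x W."""
    if len(patches) > H * W:
        raise ValueError("more patches than grid cells")
    missing = H * W - len(patches)
    padded = list(patches) + [[0.0] * dim for _ in range(missing)]
    return [padded[y * W:(y + 1) * W] for y in range(H)]

# Three of four patches generated so far for a 2 x 2 image.
grid = pad_to_grid([[1.0], [2.0], [3.0]], H=2, W=2, dim=1)
```

Since the convolution kernel covers only top and left neighbours, the zero placeholders at not-yet-generated positions should not influence the hidden states of patches that have already been generated.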

5 Dataset: LeafInstruct

Existing interleaved vision-language models Sun et al. (2023c, a); Dong et al. (2024) predominantly follow a two-stage training procedure: they are first pretrained on massive corpora of interleaved data, such as MMC4 Zhu et al. (2023c) and other resources, and then finetuned on a mix of high-quality datasets, such as the visual instruction tuning data in Liu et al. (2023c) and InstructPix2Pix Brooks et al. (2023b). However, one significant limitation of these instruction-tuning datasets is that the outputs are typically in a single modality, i.e., either text or image. This potentially hinders further advancement in interleaved generation, since the models cannot learn how to generate coherent content with interleaved text and images as specified by a given instruction.

To bridge the gap between limited existing resources and the practical need for improving interleaved generation models, we curated LeafInstruct, the first comprehensive instruction tuning dataset for interleaved text-and-image generation. Each instance in our dataset consists of (1) a detailed instruction, (2) an optional input context interleaved with text and images, and (3) a ground-truth output that is also interleaved with text and images. Figure 4 shows an example from LeafInstruct and how it compares with other representative instruction-based datasets, i.e., InstructPix2Pix Brooks et al. (2023b) and Mantis-Instruct Jiang et al. (2024). Table 3 compares LeafInstruct with existing training datasets in terms of which modalities are covered in the input and output and whether they are publicly available. Compared with existing datasets, LeafInstruct covers the most complete set of modalities, i.e., both text and images, in the inputs and outputs.

Dataset construction

We construct a diverse instruction-tuning data collection from large-scale web resources, i.e., MMC4, and academic datasets, i.e., VIST Huang et al. (2016) and YouCook2 Zhou et al. (2018). Since the original data sources can be noisy, especially the website collection in MMC4, we devised a pipeline to filter and re-annotate the source data to ensure the high quality of our curated data. First, we filter the samples based on the text length, the number of images, and the coherence between text and images (measured by CLIPScore). We keep only the instances that have 3 to 6 images in total. We also discard the instances with more than 12 sentences to ensure a balanced ratio between the amount of text and the number of images. Second, we leverage a state-of-the-art open-sourced LLM (i.e., Llama3-8B-Instruct) as a text filter to discard the instances with poor text quality. Third, we apply an image filter that uses the LPIPS score Zhang et al. (2018) to remove the instances with duplicate or perceptually identical images, ensuring the diversity of the images. Finally, we apply Llama3 to annotate the task instruction for each instance based on the text content and to rewrite the text if it is too verbose, preventing the context from becoming too long. We include more details on dataset construction in Appendix B.
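The rule-based parts of this pipeline can be sketched as a single predicate. The image-count range (3 to 6) and the sentence limit (12) come from the text above; everything else is an assumption for illustration, including the dictionary keys, the punctuation-based sentence counter, the precomputed `clip_score` and `min_pairwise_lpips` fields, and the two threshold values in the usage example (the source does not report them). The LLM-based text filter and instruction annotation are omitted.

```python
import re

def count_sentences(text):
    """Rough sentence count based on terminal punctuation (an assumption)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def keep_instance(inst, clip_threshold, lpips_threshold):
    """Return True if an instance passes the rule-based filters.

    inst is a dict with hypothetical keys:
      "text", "num_images", "clip_score", "min_pairwise_lpips"
    """
    if not 3 <= inst["num_images"] <= 6:  # keep 3 to 6 images
        return False
    if count_sentences(inst["text"]) > 12:  # discard more than 12 sentences
        return False
    if inst["clip_score"] < clip_threshold:  # low text-image coherence
        return False
    if inst["min_pairwise_lpips"] < lpips_threshold:  # near-duplicate images
        return False
    return True

example = {
    "text": "Chop the onions. Heat the pan. Add the onions and stir.",
    "num_images": 3,
    "clip_score": 0.31,
    "min_pairwise_lpips": 0.45,
}
# Threshold values below are placeholders, not values from the paper.
kept = keep_instance(example, clip_threshold=0.25, lpips_threshold=0.2)
```

Ordering the cheap count-based checks before the score-based ones mirrors the staged structure described in the text, where inexpensive filters prune most of the noisy web data first.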

Statistics

After applying our rigorous data processing pipeline, we obtain a total of 38,272 high-quality instances out of more than 7 million source samples. We use Llama3 to tag the domain of each instance and show the distribution of the top 10 domains in our dataset in Figure 5.

6 Experiment

6.1 Baselines

We compare the EMU2 model tuned with our proposed Lateralization LoRA and LeafInstruct to state-of-the-art open-source VLGs, including GILL Koh et al. (2023), Mini-GPT5 Zheng et al. (2023), and pretrained EMU2 Sun et al. (2023a), and to proprietary models, including Gemini 1.5 Reid et al. (2024) + SDXL Podell et al. (2023) and GPT-4o (https://openai.com/index/hello-gpt-4o/) + DALL·E 3 (https://openai.com/index/dall-e-3/).

6.2 Evaluation Benchmarks

We evaluate the interleaved generation capability of our method on the recently introduced InterleavedBench benchmark. InterleavedBench is a comprehensive dataset covering a diverse array of 10 tasks, such as multimodal script generation, visual storytelling, and activity generation, carefully curated by human annotators. The InterleavedBench has two splits: a context-based split in which the input of each instance is equipped with interleaved text and images; and a context-free split with text-only inputs. The context-based split contains 465 instances and the text-only split contains 350 instances. We only use the context-based split as the testing set, since we mainly focus on tasks with interleaved inputs and outputs.

6.3 Evaluation Metrics

Recent studies in interleaved evaluation An et al. (2023) point out that, due to the vast output space of interleaved generation, reference-based evaluation metrics often fail to accurately assess the generation quality. Thus, we adopt InterleavedEval Liu et al. (2024), a strong reference-free evaluation metric based on GPT-4o that is flexible and capable of evaluating arbitrarily interleaved generation. InterleavedEval prompts GPT-4o to score an interleaved generation on five aspects: Text Quality, Perceptual Quality, Image Coherence, Text-Image Coherence (TIC), and Helpfulness. For each aspect, GPT-4o outputs a discrete score from {0, 1, 2, 3, 4, 5}. We refer readers to the original paper for the detailed definition of each score in each aspect. The implementation details of our method can be found in the Appendix.

7 Results and Discussion

Model	Text Quality	Perceptual Quality	Image Coherence	TIC	Helpfulness
Proprietary Models
Gemini1.5 + SDXL	3.37	4.34	3.34	3.98	3.28
GPT-4o + DALL·E 3	3.16	4.44	3.13	4.39	3.46
Open-Source Models
MiniGPT-5	1.31	3.46	2.06	2.66	1.76
GILL	1.44	4.17	2.12	2.69	1.53
EMU-2	1.33	2.29	1.71	1.22	1.87
EMU2 + Lateralization LoRA	1.95	2.41	2.64	2.81	2.05

Table 1: Main results of baseline models and our methods. The top part shows the performance of proprietary models, the middle part shows the performance of open-source VLGs, and the bottom part shows the performance of EMU2 instruction tuned with our proposed methods and dataset.

Model	Text Quality	Perceptual Quality	Image Coherence	TIC	Helpfulness
EMU-2	1.33	2.29	1.71	1.22	1.87
+ LoRA	1.25	1.43	1.61	1.79	1.30
+ MoE-LoRA	1.94	2.22	2.42	2.54	1.90
+ Lateralization LoRA	1.95	2.41	2.64	2.81	2.05

Table 2: Performance of our Lateralization LoRA, traditional linear LoRA, and Mixture-of-Expert (MoE) LoRA. Mixture-of-Expert LoRA uses two different sets of linear LoRA for images and text, respectively.

7.1 Main Results

Table 1 presents the main results of our method in comparison to the baselines. Instruction tuning the EMU2 model with Lateralization LoRA improves its performance across all evaluated aspects, particularly in Image Coherence and Text-Image Coherence. Furthermore, our method surpasses existing open-source VLGs in four out of five aspects, with a significant improvement in the comprehensive Helpfulness aspect, which measures how well the interleaved responses follow the task instructions and provide helpful information to achieve tasks. To better interpret these results, we select three representative examples and show them in Figure 6. First, Mini-GPT5 often fails to generate explanatory text, resulting in its poor performance in Text Quality. Second, both GILL and Mini-GPT5 frequently produce images and text that are irrelevant to the input context. In contrast, our model generates interleaved images and text that adhere closely to the user’s instructions and input context, which explains its superior performance in the Helpfulness aspect ( $9.6\%$ improvement compared to EMU2). Finally, neither GILL nor Mini-GPT5 can preserve the visual appearance of the entities and scenes in the input images. Our model, however, faithfully retains these visual characteristics, leading to significantly better Image Coherence ( $28.2\%$ improvement compared to Mini-GPT5).
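The two percentages quoted above are relative improvements computed from the Table 1 scores. The short check below reproduces them; the helper name is an illustrative choice.

```python
def relative_improvement(ours, baseline):
    """Percentage improvement of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline

# Helpfulness: EMU2 + Lateralization LoRA (2.05) vs. EMU-2 (1.87), Table 1.
helpfulness_gain = relative_improvement(2.05, 1.87)
# Image Coherence: EMU2 + Lateralization LoRA (2.64) vs. MiniGPT-5 (2.06), Table 1.
coherence_gain = relative_improvement(2.64, 2.06)

print(f"{helpfulness_gain:.1f}% {coherence_gain:.1f}%")  # prints "9.6% 28.2%"
```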

In Table 1, we also present the performance of proprietary models for reference. However, we argue that a direct comparison between our model and proprietary models is not entirely fair due to the potential differences in model sizes, the number of training instances, and model architectures.

An interesting observation is that the EMU-2 model performs significantly worse in Perceptual Quality compared to other baselines. Even after being instruction-tuned with Lateralization LoRA and LeafInstruct, its performance in this aspect remains relatively low. Upon manually examining the models’ predictions, we discovered that GILL and Mini-GPT5 might exploit this evaluation criterion, a problem we discuss in detail in Subsection 7.3.

7.2 Effect of Lateralization LoRA

To isolate the effect of the instruction tuning data and directly demonstrate the performance improvement brought by our proposed Lateralization LoRA, we fine-tuned EMU2 using (1) traditional linear LoRA Hu et al. (2021) and (2) Mixture-of-Expert (MoE) LoRA Shen et al. (2024), with the results presented in Table 2. In the traditional linear LoRA, text and images share the same low-rank adaptation parameters, while in the MoE LoRA, two different sets of linear LoRA are used for images and text, respectively. The routing strategy in MoE LoRA is based on the output modality of each hidden state, i.e., whether the hidden state is used to predict a text token or an image patch embedding. By comparing the performance of traditional linear LoRA and MoE LoRA in Table 2, we can clearly identify the benefits of using separate parameters for image and text. Although our proposed Lateralization LoRA performs similarly to the MoE LoRA in the Text Quality aspect, it outperforms the MoE LoRA in aspects related to image quality, such as Image Coherence and Text-Image Coherence (TIC). This result supports our claim that Convolutional LoRA can better model the local priors of images, thereby improving the performance of image generation.

7.3 Why Does the EMU2 Model Perform Worse on the Perceptual Quality Aspect than Other Baselines?

In this subsection, we investigate why the EMU2 model trained with Lateralization LoRA has significantly lower Perceptual Quality scores than the other baselines. As illustrated in Figure 7, both GILL and Mini-GPT5 often disregard the instructions and context images, generating images with minimal constraints. In contrast, the EMU2 model trained with Lateralization LoRA strives to adhere to the instructions and to condition its generation on the provided images. However, the complexity of the editing task leads our model to produce images with lower quality and noticeable distortions. For example, in the first row of Figure 7, the images generated by GILL and Mini-GPT5 receive Perceptual Quality scores of 5 and 4, respectively, while the image generated by our model receives a score of 2. Similarly, in the second row, the images generated by GILL and Mini-GPT5 each receive a score of 4, whereas the image generated by our model receives a score of 1. Nevertheless, when Helpfulness is taken into account, the images generated by our model are better than those generated by GILL and Mini-GPT5.

8 Conclusion

We propose Lateralization LoRA, a modality-specialized low-rank adaptation method tailored for VLGs. Lateralization LoRA dedicates a set of linear LoRA to processing text and a set of Convolutional LoRA to processing images, allowing each modality to have its own optimal adaptation design. Further, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning dataset, which we build by leveraging LLMs to automatically generate instructions and by applying a rigorous filtering process to ensure data quality. Extensive experiments on InterleavedBench showcase the effectiveness of our method and dataset.

9 Limitations

Our method focuses on EMU2 Sun et al. (2023a) due to its strong performance. Future work can apply our method to more architectures, including models that use discrete image tokens Aghajanyan et al. (2022). Our method also has the potential to be applied to VLMs, a direction that future work can explore.

Our method uses two sets of LoRA for images and text, respectively. However, we do not experiment with the setting in which a large group of LoRA is applied to the model. Future work should explore the possibility of increasing the number of LoRA modules.

References

Aghajanyan et al. (2022) Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. 2022. CM3: A causal masked multimodal model of the internet. CoRR, abs/2201.07520.
Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. 2022. Flamingo: a visual language model for few-shot learning.
An et al. (2023) Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, and Jiebo Luo. 2023. Openleaf: Open-domain interleaved image-text generation and evaluation. Preprint, arXiv:2310.07749.
Avrahami et al. (2022) Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for text-driven editing of natural images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18187–18197. IEEE.
Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. Preprint, arXiv:2308.12966.
Brooks et al. (2023a) Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023a. Instructpix2pix: Learning to follow image editing instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 18392–18402. IEEE.
Brooks et al. (2023b) Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023b. Instructpix2pix: Learning to follow image editing instructions. In CVPR.
Chen et al. (2023a) Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023a. Shikra: Unleashing multimodal llm’s referential dialogue magic. CoRR, abs/2306.15195.
Chen et al. (2023b) Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023b. Sharegpt4v: Improving large multi-modal models with better captions. CoRR, abs/2311.12793.
Chen et al. (2023c) Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. 2023c. Subject-driven text-to-image generation via apprenticeship learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Chen et al. (2023d) Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. 2023d. Vision transformer adapter for dense predictions. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500.
Dong et al. (2024) Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. 2024. DreamLLM: Synergistic multimodal comprehension and creation. In The Twelfth International Conference on Learning Representations.
Esser et al. (2021) Patrick Esser, Robin Rombach, and Björn Ommer. 2021. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 12873–12883. Computer Vision Foundation / IEEE.
Halpern et al. (2005) Marnie E. Halpern, Onur Güntürkün, William D. Hopkins, and Lesley J. Rogers. 2005. Lateralization of the vertebrate brain: Taking the side of model systems. Journal of Neuroscience, 25(45):10351–10357.
Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685.
Hu et al. (2024) Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William W. Cohen, Ming-Wei Chang, and Xuhui Jia. 2024. Instruct-imagen: Image generation with multi-modal instruction. CoRR, abs/2401.01952.
Hu et al. (2023) Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2023. BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. CoRR, abs/2308.09936.
Huang et al. (2016) Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross B. Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1233–1239. The Association for Computational Linguistics.
Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIII, volume 13693 of Lecture Notes in Computer Science, pages 709–727. Springer.
Jiang et al. (2024) Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. 2024. MANTIS: interleaved multi-image instruction tuning. CoRR, abs/2405.01483.
Jie and Deng (2022) Shibo Jie and Zhi-Hong Deng. 2022. Convolutional bypasses are better vision transformer adapters. CoRR, abs/2207.07039.
Jin et al. (2023) Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, and Yadong Mu. 2023. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. CoRR, abs/2309.04669.
Karimi Mahabadi et al. (2021) Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
Koh et al. (2023) Jing Yu Koh, Daniel Fried, and Russ Salakhutdinov. 2023. Generating images with multimodal language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Li et al. (2023a) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. 2023a. MIMIC-IT: multi-modal in-context instruction tuning. CoRR, abs/2306.05425.
Li et al. (2023b) Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, and Shuming Shi. 2023b. Textbind: Multi-turn interleaved multimodal instruction-following in the wild. CoRR, abs/2309.08637.
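Li et al. (2023c) Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023c. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 19730–19742. PMLR.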
Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
Li et al. (2024) Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2024. Mini-gemini: Mining the potential of multi-modality vision language models. CoRR, abs/2403.18814.
Lian et al. (2022) Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. 2022. Scaling & shifting your features: A new baseline for efficient model tuning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Lin et al. (2024) Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. Moe-llava: Mixture of experts for large vision-language models. CoRR, abs/2401.15947.
Liu et al. (2023a) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2023a. Aligning large multi-modal model with robust instruction tuning. CoRR, abs/2306.14565.
Liu et al. (2022) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. CoRR, abs/2205.05638.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023b. Improved baselines with visual instruction tuning. CoRR, abs/2310.03744.
Liu et al. (2023c) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023c. Visual instruction tuning. CoRR, abs/2304.08485.
Liu et al. (2024) Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. 2024. Holistic evaluation for interleaved text-and-image generation.
Lyu et al. (2023) Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. 2023. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093.
Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: improving latent diffusion models for high-resolution image synthesis. CoRR, abs/2307.01952.
Qi et al. (2024) Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, and Lifu Huang. 2024. MULTISCRIPT: multimodal script learning for supporting open domain everyday tasks. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pages 18888–18896. AAAI Press.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
Reid et al. (2024) Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, James Molloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy, Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, Zhen Yang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand, Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, Pranav Shyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR, abs/2403.05530.
Rogers (2021) Lesley J. Rogers. 2021. Brain lateralization and cognitive capacity. Animals, 11(7).
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE.
Shen et al. (2023) Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. 2023. Scaling vision-language models with sparse mixture of experts. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11329–11344, Singapore. Association for Computational Linguistics.
Shen et al. (2024) Ying Shen, Zhiyang Xu, Qifan Wang, Yu Cheng, Wenpeng Yin, and Lifu Huang. 2024. Multimodal instruction tuning with conditional mixture of lora. CoRR, abs/2402.15896.
Sun et al. (2023a) Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. 2023a. Generative multimodal models are in-context learners. CoRR, abs/2312.13286.
Sun et al. (2023b) Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023b. EVA-CLIP: improved training techniques for CLIP at scale. CoRR, abs/2303.15389.
Sun et al. (2023c) Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. 2023c. Generative pretraining in multimodality. CoRR, abs/2307.05222.
Tang et al. (2023) Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, and Mohit Bansal. 2023. Codi-2: In-context, interleaved, and interactive any-to-any generation. CoRR, abs/2311.18775.
Team (2024) Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation models. Preprint, arXiv:2405.09818.
Tian et al. (2024) Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, and Jifeng Dai. 2024. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. CoRR, abs/2401.10208.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Wang et al. (2023) Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. 2023. To see is to believe: Prompting GPT-4V for better visual instruction tuning. CoRR, abs/2311.07574.
Wang et al. (2022) Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. 2022. Adamix: Mixture-of-adaptations for parameter-efficient model tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5744–5760. Association for Computational Linguistics.
Wu et al. (2023) Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023. Next-gpt: Any-to-any multimodal LLM. CoRR, abs/2309.05519.
Xu et al. (2024) Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, and Lifu Huang. 2024. Vision-flan: Scaling human-labeled tasks in visual instruction tuning. Preprint, arXiv:2402.11690.
Xu et al. (2023a) Zhiyang Xu, Ying Shen, and Lifu Huang. 2023a. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 11445–11465. Association for Computational Linguistics.
Xu et al. (2023b) Zhiyang Xu, Ying Shen, and Lifu Huang. 2023b. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 11445–11465. Association for Computational Linguistics.
Yasunaga et al. (2023) Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. 2023. Retrieval-augmented multimodal language modeling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 39755–39769. PMLR.
Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. mplug-owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178.
Yin et al. (2023) Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, and Wanli Ouyang. 2023. LAMM: language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. CoRR, abs/2306.06687.
Yu et al. (2023) Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, and Armen Aghajanyan. 2023. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. CoRR, abs/2309.02591.
Zadouri et al. (2023) Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermis, Acyr Locatelli, and Sara Hooker. 2023. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. CoRR, abs/2309.05444.
Zaken et al. (2022) Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9.
Zhang et al. (2023a) Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. 2023a. Magicbrush: A manually annotated dataset for instruction-guided image editing. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595.
Zhang et al. (2023b) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. 2023b. Llavar: Enhanced visual instruction tuning for text-rich image understanding. CoRR, abs/2306.17107.
Zheng et al. (2023) Kaizhi Zheng, Xuehai He, and Xin Eric Wang. 2023. Minigpt-5: Interleaved vision-and-language generation via generative vokens. CoRR, abs/2310.02239.
Zhong et al. (2024) Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, and Chun Yuan. 2024. Convolution meets lora: Parameter efficient finetuning for segment anything model. CoRR, abs/2401.17868.
Zhou et al. (2018) Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7590–7598. AAAI Press.
Zhu et al. (2023a) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023a. Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR, abs/2304.10592.
Zhu et al. (2023b) Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, and Ying Shan. 2023b. VL-GPT: A generative pre-trained transformer for vision and language understanding and generation. CoRR, abs/2312.09251.
Zhu et al. (2023c) Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. 2023c. Multimodal c4: An open, billion-scale corpus of images interleaved with text. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Appendix A More Details of Lateralization LoRA

Dataset Name	Input Text	Input Images	Output Text	Output Images	Publicly Available
LLaVA Liu et al. (2023c)	Yes	Single	Single	No	Yes
MultiInstruct Xu et al. (2023b)	Yes	Single	Single	No	Yes
Vision-Flan Xu et al. (2024)	Yes	Single	Single	No	Yes
InstructPix2Pix Brooks et al. (2023a)	Yes	Single	No	Single	Yes
MagicBrush Zhang et al. (2023a)	Yes	Single	No	Single	Yes
SuTI Chen et al. (2023c)	Yes	Multiple	No	Single	No
Instruct-Imagen Hu et al. (2024)	Yes	Multiple	No	Single	No
Mantis-Instruct Jiang et al. (2024)	Yes	Multiple	Yes	No	Yes
LeafInstruct (Ours)	Yes	Multiple	Yes	Multiple	Yes

Table 3: Comparison between our LeafInstruct and existing instruction tuning datasets.

A.1 Implementation Details

We use the EMU2 model Sun et al. (2023a) as our base model. It consists of EVA-02-CLIP-E-plus Sun et al. (2023b) as the image encoder, LLaMA-33B Touvron et al. (2023) as the language model, and SDXL Podell et al. (2023) as the image decoder. EVA-02-CLIP-E-plus and LLaMA-33B are connected by a linear project-up layer, and LLaMA-33B and SDXL are connected by a linear project-down layer. All the LoRA variants in Section 7.2, including our Lateralization LoRA, are trained on LeafInstruct for one epoch on $8\times$ A100 GPUs with a learning rate of $2\times 10^{-5}$, a batch size of 1 per GPU, and 16 gradient accumulation steps. All the LoRA modules have a rank of 128 and a dropout rate of $0.05$, and the LoRA $\alpha$ in Section 4.1 is set to $2\times 128$. The convolution kernel in Lateralization LoRA has a size of $2\times 2$ and a stride of 1. During training, all parameters of the EMU2 model are kept frozen and only the LoRA parameters are updated.
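To make the arithmetic implied by these settings explicit, the sketch below collects them in a small configuration object. The class and field names are hypothetical. Two assumptions are built in: training is data-parallel, so the effective batch size is the product of GPU count, per-GPU batch size, and accumulation steps; and the LoRA update is scaled by $\alpha$ divided by the rank, the convention of Hu et al. (2021), which Section 4.1 is assumed to follow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TuningConfig:
    """Hyperparameters reported above; the field names are illustrative."""
    num_gpus: int = 8
    per_gpu_batch_size: int = 1
    grad_accum_steps: int = 16
    learning_rate: float = 2e-5
    epochs: int = 1
    lora_rank: int = 128
    lora_alpha: int = 2 * 128
    lora_dropout: float = 0.05
    conv_kernel_size: int = 2
    conv_stride: int = 1

    @property
    def effective_batch_size(self) -> int:
        # Assumes plain data parallelism across all GPUs.
        return self.num_gpus * self.per_gpu_batch_size * self.grad_accum_steps

    @property
    def lora_scale(self) -> float:
        # Assumes the usual alpha / rank scaling of the low-rank update.
        return self.lora_alpha / self.lora_rank

cfg = TuningConfig()
print(cfg.effective_batch_size)  # 128
print(cfg.lora_scale)            # 2.0
```

Under these assumptions, each optimizer step sees 128 instances and the low-rank update is multiplied by 2 before being added to the frozen projection.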

Appendix B More Details of LeafInstruct

B.1 More Details in Dataset Construction

Details of Text Quality Filter

We use the Llama-8B-Instruct model to rate the text quality of each instance with the following prompt: “Imagine you are an expert data annotator. You are given a text material and you need to evaluate its quality in terms of whether it is coherent, fluent, easy to understand, and helpful to humans. Please be critical and rate the quality as good only when the text quality is good in all four aspects. Output 1 if you think the material is good after you consider all four aspects. Output 0 if you think the material is not good enough. Here is the text material to be evaluated: {TEXT} Only output 0 or 1 and do not output anything else. Your evaluation is:” We discard an instance if the output from Llama is 0.
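The filtering step can be sketched as follows. This is an illustration rather than the released pipeline: `generate` is a hypothetical callable that wraps the rating model and returns its raw reply, `prompt_template` is the prompt above with its `{TEXT}` placeholder, and dropping replies that are neither 0 nor 1 is an added assumption, since the text only specifies the behavior for an output of 0.

```python
from typing import Callable, List

def keep_instance(text: str, prompt_template: str,
                  generate: Callable[[str], str]) -> bool:
    """Return True when the rater judges the text to be good (reply is 1)."""
    # str.replace avoids str.format pitfalls if the text itself contains braces.
    prompt = prompt_template.replace("{TEXT}", text)
    reply = generate(prompt).strip()
    # A reply of 0 discards the instance; unparseable replies are also
    # dropped here, which is an assumption of this sketch.
    return reply == "1"

def filter_by_text_quality(texts: List[str], prompt_template: str,
                           generate: Callable[[str], str]) -> List[str]:
    """Keep only the texts that the rater accepts."""
    return [t for t in texts if keep_instance(t, prompt_template, generate)]

# Toy usage with a stand-in rater that rejects very short texts.
template = "Rate this material: {TEXT} Your evaluation is:"

def toy_rater(prompt: str) -> str:
    return "1" if len(prompt) > 60 else "0"

kept = filter_by_text_quality(
    ["too short", "a longer passage that the toy rater will accept as good"],
    template,
    toy_rater,
)
```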

Details of Image Filter

We empirically found that if the images within a training instance are nearly identical, the trained models tend to take a shortcut and simply copy the image during generation. We therefore design a filter that discards instances with duplicate images to improve data quality. Specifically, we leverage the LPIPS score Zhang et al. (2018), which measures the perceptual similarity between images: for each instance, we enumerate every pair of images and compute their LPIPS score. If any pair has a score higher than 0.6, we discard the instance. We determined the threshold of 0.6 empirically.
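The pairwise check can be sketched as below. The sketch is agnostic about how the score is computed: `pair_score` is a hypothetical callable standing in for the LPIPS computation, and the comparison direction simply follows the rule stated above (discard when any pair scores higher than the threshold). The toy scores in the usage example are made up for illustration and are not LPIPS values.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")

def has_flagged_pair(images: Sequence[Image],
                     pair_score: Callable[[Image, Image], float],
                     threshold: float = 0.6) -> bool:
    """True if any unordered pair of images scores above the threshold."""
    return any(pair_score(a, b) > threshold for a, b in combinations(images, 2))

def filter_instances(instances: Sequence[Sequence[Image]],
                     pair_score: Callable[[Image, Image], float],
                     threshold: float = 0.6) -> List[Sequence[Image]]:
    """Drop every instance that contains at least one flagged pair."""
    return [imgs for imgs in instances
            if not has_flagged_pair(imgs, pair_score, threshold)]

# Toy usage: images are string IDs and scores come from a lookup table.
toy_scores = {("a", "b"): 0.2, ("a", "c"): 0.7, ("b", "c"): 0.3}

def toy_pair_score(x: str, y: str) -> float:
    return toy_scores[(x, y)]

kept = filter_instances([["a", "b"], ["a", "b", "c"], ["a"]], toy_pair_score)
```

An instance with n images requires n(n-1)/2 score computations, and `any` stops at the first flagged pair, so instances with duplicates are often rejected early.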

Details of Instruction Annotation

We also adopt Llama-8B-Instruct to annotate the task instruction for each instance. We use the following prompt: “Imagine you are an expert instruction annotator. You are given a material. You need to read its content and output a brief task instruction with one sentence such that another person can recover the given the material given the instruction. The instruction you predict should be specifically tailored for creative interleaved content generation that consists of both text and images. Now you need to annotate a concise, accurate instruction for the following instance. Please only predict the instruction and do not output anything else. Please design the instruction for the multi-modal generation task interleaved with both text and images. Text: {TEXT} Instruction:”.