Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

Zhiyang Xu  Minqian Liu  Ying Shen
Joy Rimchala  Jiaxin Zhang  Qifan Wang  Yu Cheng  Lifu Huang
Virginia Tech  Intuit AI Research  Meta AI
The Chinese University of Hong Kong
{zhiyangx, lifuh}@vt.edu
Abstract

Recent advancements in Vision-Language Models (VLMs) have led to the development of Vision-Language Generalists (VLGs) capable of understanding and generating interleaved images and text. Despite these advances, VLGs still struggle to follow user instructions for interleaved text and image generation. To address this issue, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning dataset with over 30,000 high-quality instances across more than 10 domains. Due to the extensive size of existing VLGs, we opt for parameter-efficient tuning. However, we observe that VLGs tuned with a standard LoRA typically exhibit inferior performance in interleaved text-image generation. We attribute this problem to modality interference and the lack of modality-specialized adaptation design. Hence, we propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization. Lateralization LoRA employs a hybrid approach, combining the traditional linear LoRA and a Convolutional LoRA for generating text and images, enabling the generation of high-quality text and images by leveraging modality-specific structures and parameter sets. We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset. Extensive experiments demonstrate that EMU2 tuned with Lateralization LoRA achieves state-of-the-art performance, significantly surpassing baseline models in complex interleaved tasks.


1 Introduction

Figure 1: Example outputs of EMU2 and EMU2 instruction tuned with LoRA on the LeafInstruct dataset. <IMG> tokens indicate where to insert images.

Recent advancements in Vision-Language Models (VLMs) Alayrac et al. (2022); Li et al. (2023c); Liu et al. (2023c); Bai et al. (2023); Liu et al. (2023b); Lin et al. (2024), which integrate pretrained image encoders with pretrained language models, have demonstrated great promise as versatile visual assistants. However, despite their ability to process inputs composed of interleaved images and text (i.e., multiple images and text segments arranged in arbitrary sequences), these models are mostly limited to generating only textual responses, which restricts their utility in a wide array of applications that require the simultaneous generation of both images and text, such as script generation Qi et al. (2024), visual storytelling Huang et al. (2016), and many others. To address this limitation, some initial efforts Sun et al. (2023c, a); Koh et al. (2023); Aghajanyan et al. (2022); Li et al. (2023b); Team (2024) have been made towards developing Vision-Language Generalists (VLGs) capable of both accepting and generating images and text in an interleaved fashion, by integrating a CLIP image encoder Radford et al. (2021), an LLM Touvron et al. (2023), and a diffusion-based image decoder Rombach et al. (2022).

Despite these advancements, existing VLGs struggle to follow user instructions to generate interleaved text and images. Current models Sun et al. (2023c, a) are often pretrained on interleaved documents such as MMC4 Zhu et al. (2023c) but are only instruction-tuned for single-modality generation, failing to adhere to human instructions to perform interleaved generation tasks. Moreover, the absence of a large-scale, open-source dataset specifically designed for interleaved instruction tuning significantly impedes the training of VLGs to proficiently produce interleaved text and images as directed by users. In response to this scarcity of interleaved instruction tuning data, we propose LeafInstruct, the first open-sourced high-quality interleaved instruction tuning data with over 30,000 high-quality instances spanning more than 10 domains. Leveraging open-source Large Language Models (LLMs) and various toolboxes, we have developed a rigorous automatic pipeline that filters extensive amounts of noisy web data, annotates detailed instructions, and rewrites text content to enhance quality.

Considering the substantial computational costs associated with full-parameter finetuning of large VLGs, we explore parameter-efficient instruction tuning using LeafInstruct. However, we find that merely applying parameter-efficient instruction tuning on VLGs tends to yield low-quality outputs for both text and images. Figure 1 shows examples of the interleaved generation from the pre-trained EMU2, and the EMU2 finetuned with a standard LoRA Hu et al. (2021) on LeafInstruct. We observe that, after instruction tuning with standard LoRA, the quality of generated images becomes worse, with local inconsistency and distortion. We hypothesize that the inferior performance stems from using a single LoRA to handle different modalities. Previous studies indicate that while the standard transformer architecture Vaswani et al. (2017) employed in LLMs excels at NLP tasks, it is less effective at modeling the local priors of image patches, which are crucial for various vision tasks Zhong et al. (2024); Chen et al. (2023d). This architecture inadequacy can cause VLGs to produce images with local inconsistency and distortion among adjacent patches, underscoring the need for distinct, optimal structures tailored to each modality in interleaved VLGs.

To better accommodate the distinct requirements of interleaved text and image generation, we propose integrating modality-specialized adaptations within the state-of-the-art VLG, EMU2 Sun et al. (2023a). Specifically, for image patch embeddings, we utilize low-rank convolutional adaptation layers to better model the local prior in images. For text tokens, we employ a separate set of linear Low-Rank Adaptation (LoRA) layers, acknowledging the distinct sequential modeling process of text compared to images. During training, both LoRA architectures are zero-initialized and progressively fine-tuned from their initial state, while the LLM parameters remain frozen. Our design allows each modality to have its own specialized parameters and optimal adaptation design, aligning with recent findings Shen et al. (2023): training modality-specific experts in VLMs can significantly reduce modality interference and enhance model performance. We name our novel modality-specialized adaptation method Lateralization LoRA, drawing intuition from the theory of brain lateralization Halpern et al. (2005); Rogers (2021), which states that (1) one hemisphere of the brain is better at performing certain functions than the other and (2) although the hemispheres look similar, their neuronal networks are different, allowing for specialized functions.

To validate the effectiveness of our methods and dataset, we conduct extensive experiments using InterleavedBench Liu et al. (2024), a recently introduced interleaved evaluation dataset. The results demonstrate that EMU2 instruction-tuned with Lateralization LoRA achieves state-of-the-art performance across multiple evaluation aspects. In summary, our contributions are threefold:

  • We leverage open-source LLMs to automatically generate training data for interleaved instruction tuning. To fill the gap in existing resources, we introduce the first publicly available instruction-tuning dataset for interleaved generation across diverse domains.

  • We introduce Lateralization LoRA, a novel parameter-efficient adaptation method that incorporates two types of LoRAs, enhancing the ability of autoregressive VLGs to generate interleaved text and images by allowing each modality to have its own specialized parameters and an optimal adaptation architecture.

  • EMU2 instruction-tuned with Lateralization LoRA on the LeafInstruct dataset achieves significant performance improvements on all aspects of InterleavedBench, outperforming existing open-source baselines.

2 Related Work

Interleaved Vision-Language Models

There are two popular formulations for VLGs. The first leverages VQGAN Esser et al. (2021) to quantize an image into a long sequence of discrete tokens and adds the entries of VQGAN’s codebook to the vocabulary of the LLM Aghajanyan et al. (2022); Yu et al. (2023); Yasunaga et al. (2023); Team (2024); Jin et al. (2023). In this way, the LLM is trained with a unified autoregressive objective to predict image tokens or text tokens. The predicted image tokens are fed into a VQGAN decoder to reconstruct images. The second employs the CLIP image encoder to transform images into sequences of continuous embeddings Koh et al. (2023); Tang et al. (2023); Zhu et al. (2023b); Sun et al. (2023c, a); Li et al. (2024); Wu et al. (2023); Tian et al. (2024), which are then concatenated with text embeddings in their original sequence order. Compared to the first approach, this formulation often requires shorter sequences to represent an image and generally yields superior performance. Our proposed method requires minimal assumptions on the VLG’s architecture and can be applied to many of the existing transformer-based VLGs.

Visual Instruction Tuning

Xu et al. (2023a) propose MultiInstruct, the first human-labeled visual instruction tuning dataset to improve the generalizability of VLMs. LLaVA Liu et al. (2023c) leverages GPT-4 to convert image captions from existing annotations into three tasks: visual dialogues, visual question answering, and detailed captions. Subsequent studies either utilize proprietary LLMs Dai et al. (2023); Ye et al. (2023); Yin et al. (2023); Liu et al. (2023b); Li et al. (2023a); Lyu et al. (2023); Zhu et al. (2023a); Wang et al. (2023); Chen et al. (2023b) or human efforts Liu et al. (2023b); Xu et al. (2024) to augment visual instruction tuning tasks. Several studies target specific aspects of VLMs’ capability, such as domain and instruction bias Avrahami et al. (2022); Liu et al. (2023a), object grounding Chen et al. (2023a), and OCR Zhang et al. (2023b); Hu et al. (2023). Instruction tuning has also been widely applied to other vision-language tasks, such as image editing Brooks et al. (2023a) and interleaved text-image understanding Jiang et al. (2024). Hu et al. (2024) finetune a model that can follow multimodal instructions to generate desired images. However, most existing instruction-tuning datasets only consider tasks where the outputs are in a single modality, i.e., either text or image. To facilitate the training and enhance the instruction-following capabilities of VLGs, we curated LeafInstruct, the first instruction-tuning dataset tailored for interleaved text-image generation across diverse domains, where the inputs and outputs can contain interleaved text and multiple images.

Parameter-Efficient Finetuning (PEFT)

PEFT methods Hu et al. (2021); Li and Liang (2021); Karimi Mahabadi et al. (2021); Zaken et al. (2022); Jia et al. (2022); Lian et al. (2022); Jie and Deng (2022); Liu et al. (2022); Chen et al. (2023d); Zhong et al. (2024) aim to adapt pretrained large models to various downstream tasks and have become prevalent in instruction tuning. Typically, these methods involve freezing the pretrained large models while finetuning a minimal set of newly introduced parameters. Recent studies Wang et al. (2022); Zadouri et al. (2023); Lin et al. (2024); Shen et al. (2024) propose to combine PEFT methods with Mixture-of-Experts to mitigate task interference and enhance performance, particularly in visual instruction tuning where models need to process inputs from two modalities. Our proposed Lateralization LoRA is the first PEFT method that utilizes two distinct LoRA architectures—linear and convolutional—for text and image generation within autoregressive interleaved generation models.

3 Background: Vision-Language Generalist

Base Model

The base VLG we leverage is Emu2 due to its strong performance. Emu2 consists of a CLIP image encoder: EVA-02-CLIP-E-plus Sun et al. (2023b), a decoder-only large language model: LLaMA-33B Touvron et al. (2023), and an image decoder: SDXL Podell et al. (2023). Given a sequence of interleaved text segments and images, the CLIP encoder encodes each image into a sequence of continuous image embeddings. The image embeddings are further mapped by a linear projector into the semantic space of the LLM. The embeddings of images and text segments are then concatenated in their original order and fed into the transformer layers. The language-modeling head of the LLM maps the hidden states of text tokens from the last transformer layer into probability distributions over the vocabulary. The image-regression head projects the hidden states of image patches back to the latent space of the CLIP encoder. Finally, the image decoder takes in the predicted image embeddings and decodes them into the target image.

Training

The training objective of VLGs can be loosely defined in the following unified autoregressive manner.

$$\operatorname*{arg\,max}_{\theta}\sum^{\mathcal{D}}\sum_{n=1}^{N}P_{\theta}(s_{n}\mid s_{1},s_{2},\ldots,s_{n-1}) \tag{1}$$

where $\theta$ denotes the model parameters, $N$ denotes the input sequence length, $\mathcal{D}$ denotes the training dataset, and $s_{i}$ denotes a text token or an image-patch embedding. The unified objective is optimized using two types of losses: (1) for text tokens, the cross-entropy loss minimizes the distance between the probability distribution over the vocabulary predicted by the language-modeling head and the true distribution of the text tokens; (2) for image embeddings, the mean-squared-error (MSE) loss minimizes the distance between the image embeddings predicted by the image-regression head and the real image embeddings. Although the EMU2 model is pre-trained on interleaved documents, it is instruction-tuned to generate either text or an image, i.e., outputs in a single modality.
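To make the two-loss setup concrete, the following is a minimal sketch in plain Python, not the authors' training code. It assumes the per-position losses are simply summed with equal weight, which the text above does not specify; the function names and the toy inputs are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def interleaved_loss(steps):
    """Sum the per-position losses over one interleaved sequence.

    Each step is ("text", logits, target_index) or ("image", pred, target).
    Equal weighting of the two loss types is an assumption of this sketch.
    """
    total = 0.0
    for kind, prediction, target in steps:
        if kind == "text":
            total += cross_entropy(prediction, target)
        else:
            total += mse(prediction, target)
    return total

steps = [
    ("text", [0.0, 0.0], 1),            # uniform over 2 tokens -> loss = ln 2
    ("image", [1.0, 2.0], [1.0, 4.0]),  # squared errors 0 and 4 -> MSE = 2
]
loss = interleaved_loss(steps)
```

In practice the relative weight of the two terms is a tunable choice; the sketch only illustrates that each position contributes a loss matching its modality.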

4 Method

Figure 2: EMU2 model with our proposed Lateralization LoRA added to its linear layers. The yellow squares represent image patch embeddings and the green squares represent text token embeddings. The linear LoRA on the left side is specialized to generate text tokens and the Convolutional LoRA on the right side is specialized to generate image patches, i.e., if the output is a text token, then the hidden state goes through linear LoRA and if the output is an image patch, then the hidden state goes through the Convolutional LoRA.

4.1 Lateralization LoRA

In this subsection, we first briefly recap the mechanism of the original LoRA. Then we introduce Convolutional LoRA and explain its structures. Finally, we explain how to combine two LoRAs and apply them to autoregressive VLGs.

Low-Rank Adaptation (LoRA)

LoRA Hu et al. (2021) is a parameter-efficient finetuning method that freezes the pretrained model weights and injects low-rank decomposable matrices into the layers of transformers. Formally, given the weights of a linear layer $\mathbf{W}\in\mathbb{R}^{d_{out}\times d_{in}}$, LoRA modifies the weights by adding a decomposable weight matrix $\Delta\mathbf{W}$ to $\mathbf{W}$. Thus, for a vector $\mathbf{h}\in\mathbb{R}^{d_{in}}$, the modified linear transformation $T:\mathbb{R}^{d_{in}}\rightarrow\mathbb{R}^{d_{out}}$ becomes:

$$T(\mathbf{h})=\mathbf{h}(\mathbf{W}+\Delta\mathbf{W})^{\intercal}=\mathbf{h}\mathbf{W}^{\intercal}+\mathbf{h}\Delta\mathbf{W}^{\intercal} \tag{2}$$

$\Delta\mathbf{W}$ is decomposed into two low-rank matrices, i.e., LoRA $A$: $\mathbf{W}_{A}\in\mathbb{R}^{r\times d_{in}}$ and LoRA $B$: $\mathbf{W}_{B}\in\mathbb{R}^{d_{out}\times r}$, satisfying the low-rank constraint $r\ll\min(d_{out},d_{in})$. The final expression is

$$T(\mathbf{h})=\mathbf{h}\mathbf{W}^{\intercal}+\alpha\mathbf{h}\mathbf{W}_{A}^{\intercal}\mathbf{W}_{B}^{\intercal} \tag{3}$$

where $\alpha\in\mathbb{R}$ is a hyper-parameter.
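As an illustration of Equation 3, here is a small sketch in plain Python with toy dimensions. The helper names and numeric values are hypothetical; the zero-initialized $\mathbf{W}_{B}$ follows the convention of the original LoRA, under which the adapted layer initially reproduces the frozen layer exactly.

```python
def matvec_t(h, W):
    """Compute h W^T for a row vector h and a matrix W of shape (out, in)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def lora_forward(h, W, W_A, W_B, alpha):
    """Equation 3: h W^T + alpha * (h W_A^T) W_B^T."""
    base = matvec_t(h, W)      # frozen path, length d_out
    down = matvec_t(h, W_A)    # project down to rank r
    up = matvec_t(down, W_B)   # project back up to d_out
    return [b + alpha * u for b, u in zip(base, up)]

# Toy sizes: d_in = 3, d_out = 2, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
W_A = [[1.0, 1.0, 1.0]]        # r x d_in
W_B_zero = [[0.0], [0.0]]      # d_out x r, zero-initialized
W_B = [[0.5], [-1.0]]          # d_out x r after some hypothetical updates
h = [1.0, 2.0, 3.0]

y_init = lora_forward(h, W, W_A, W_B_zero, alpha=2.0)
y_tuned = lora_forward(h, W, W_A, W_B, alpha=2.0)
```

With these sizes the adapter adds $r(d_{in}+d_{out})=5$ trainable values next to the $d_{in}d_{out}=6$ frozen ones; the saving only becomes meaningful when $r$ is much smaller than both dimensions.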

Convolutional LoRA

We propose Convolutional LoRA, a variant of LoRA specifically designed for modeling the local structure of image hidden states, by simplifying the architecture proposed in Zhong et al. (2024). It consists of a convolutional LoRA $A$ layer, $\text{Conv}_{k\times k}$, with kernel size $k\times k$, $C_{in}$ input channels, and $r$ output channels, and a LoRA $B$: $\mathbf{W}_{B}\in\mathbb{R}^{C_{out}\times r}$. Given the 2D feature $\mathbf{I}\in\mathbb{R}^{H\times W\times C_{in}}$ of an image, where $H$ denotes the height, $W$ denotes the width, and $C_{in}$ denotes the number of channels of $\mathbf{I}$, the convolutional LoRA $A$ projects the number of channels down to $r$ while performing the convolution operation. LoRA $B$ then projects the number of channels up to $C_{out}$. Equation 3 becomes:

$$\tilde{T}(\mathbf{I})=\mathbf{I}\mathbf{W}^{\intercal}+\alpha\,\text{Conv}_{k\times k}(\mathbf{I})\mathbf{W}_{B}^{\intercal} \tag{4}$$

where $\alpha$ is a hyper-parameter.
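As a quick shape check on the adapter path, the block below tracks the tensor dimensions and counts trainable parameters. It assumes stride 1, shape-preserving padding, and no bias terms, none of which are stated explicitly above.

```latex
\begin{aligned}
\mathbf{I} &\in \mathbb{R}^{H\times W\times C_{in}}
  \;\xrightarrow{\ \text{Conv}_{k\times k}\ }\;
  \mathbb{R}^{H\times W\times r}
  \;\xrightarrow{\ \mathbf{W}_{B}^{\intercal}\ }\;
  \mathbb{R}^{H\times W\times C_{out}},\\
\#\text{params} &= \underbrace{k^{2}\,C_{in}\,r}_{\text{convolutional LoRA } A}
  + \underbrace{r\,C_{out}}_{\text{LoRA } B}.
\end{aligned}
```

Under these assumptions, setting $k=1$ recovers the parameter count of the linear LoRA, $r(C_{in}+C_{out})$, so the extra cost of the convolutional variant is confined to the factor $k^{2}$ on the down-projection.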

Figure 3: Visualization of the convolutional operation on a sequence of image patches. Yellow squares represent input image patches, and green squares represent output patches. The number on each square denotes its original position in the image-patch sequence, and the bigger green squares are the convolution kernels. The grey squares are paddings to preserve the shape of image patches.

Combine Two LoRAs in VLGs

As shown in Figure 2, we propose to combine the two types of LoRA in VLGs, i.e., using the linear LoRA for text generation and the Convolutional LoRA for image generation. Formally, let $\mathbf{W}\in\mathbb{R}^{d_{out}\times d_{in}}$ be the weights of any linear layer in an LLM, and let $\mathbf{H}=[\mathbf{h}_{1}^{t},\ldots,\mathbf{h}_{m}^{t},\mathbf{h}_{m+1}^{i},\mathbf{h}_{m+2}^{i},\ldots,\mathbf{h}_{m+(H\times W)}^{i},\ldots,\mathbf{h}_{N}^{t}]\in\mathbb{R}^{N\times d_{in}}$ denote the hidden states of a sequence of interleaved text and images, where the subscript indicates the position of a hidden state and the superscript indicates whether a hidden state is decoded into a text token ($t$) or an image-patch embedding ($i$). We separate $\mathbf{H}$ into text hidden states $\mathbf{H}^{t}=[\mathbf{h}_{1}^{t},\mathbf{h}_{2}^{t},\ldots,\mathbf{h}_{m}^{t},\mathbf{h}_{m+(H\times W)+1}^{t},\ldots,\mathbf{h}_{N}^{t}]$ and image hidden states $\mathbf{H}^{i}=[[\mathbf{h}_{m+1}^{i},\ldots,\mathbf{h}_{m+(H\times W)}^{i}],[\mathbf{h}_{n+1}^{i},\ldots,\mathbf{h}_{n+(H\times W)}^{i}],\ldots]$, where $m+1$ and $n+1$ denote the starting positions of two subsequences of image hidden states. Each subsequence of a single image has a fixed length of $H\times W$, and we reshape the hidden states of each image in $\mathbf{H}^{i}$ into a 2D structure. Hence, the dimension of $\mathbf{H}^{i}$ becomes $B\times H\times W\times C_{in}$, where $B$ denotes the number of images in the sequence $\mathbf{H}$ and $C_{in}=d_{in}$. We feed $\mathbf{H}^{t}$ into Equation 3 to get $\hat{\mathbf{H}}^{t}=T(\mathbf{H}^{t})$ and $\mathbf{H}^{i}$ into Equation 4 to get $\hat{\mathbf{H}}^{i}=\tilde{T}(\mathbf{H}^{i})$.
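The split-and-reassemble step described above can be sketched as follows. This is an illustrative outline rather than the authors' implementation: hidden states are reduced to scalars for readability, `route` and its arguments are hypothetical names, and the two lambdas merely stand in for Equations 3 and 4.

```python
def route(hidden, is_image, patches_per_image, text_fn, image_fn):
    """Apply text_fn to text states and image_fn to each image's block of
    states, then write the results back to their original positions."""
    out = [None] * len(hidden)
    text_pos = [p for p, img in enumerate(is_image) if not img]
    image_pos = [p for p, img in enumerate(is_image) if img]

    for p, y in zip(text_pos, text_fn([hidden[q] for q in text_pos])):
        out[p] = y

    # Every image contributes a fixed-length block of H * W consecutive states.
    for start in range(0, len(image_pos), patches_per_image):
        block = image_pos[start:start + patches_per_image]
        for p, y in zip(block, image_fn([hidden[q] for q in block])):
            out[p] = y
    return out

# Toy sequence: 2 text states, one 2x2 image (4 states), 1 text state.
hidden = [1, 2, 10, 20, 30, 40, 3]
is_image = [False, False, True, True, True, True, False]

routed = route(
    hidden, is_image, patches_per_image=4,
    text_fn=lambda hs: [h + 0.5 for h in hs],   # stand-in for Equation 3
    image_fn=lambda hs: [-h for h in hs],       # stand-in for Equation 4
)
```

Because each output is written back to the index it came from, the reassembled sequence keeps the original interleaved order regardless of how many images it contains.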

Figure 3 visualizes the convolutional operation applied to a sequence of image patches. The yellow squares on the left denote the reshaped input image patches and the larger green squares denote the convolution kernels. The number on each square denotes the original position of a patch in the image sequence. For demonstration purposes, we draw image patches with $H=5$ and $W=5$. Notably, since this is an autoregressive model, the current hidden state can only depend on previous hidden states. Thus, when applying the convolution operation on an image patch, the kernel only covers neighbouring patches on the top and left sides of that patch. For example, the new hidden state of patch $19$ is computed from patches $13$, $14$, $18$, and $19$. To preserve the shape ($H\times W$) of the input image patches, we pad the reshaped image hidden states with zero vectors on the top and left sides, as shown by the grey squares in Figure 3. Finally, we assemble $\hat{\mathbf{H}}^{i}$ and $\hat{\mathbf{H}}^{t}$ back into their original sequence order to form $\hat{\mathbf{H}}$.
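The top-left padded convolution can be checked on the $5\times 5$ example with a short plain-Python sketch. It assumes a $2\times 2$ kernel, stride 1, a single channel, and 1-based row-major patch indices, which matches the patch-19 example but is otherwise an illustrative choice; both function names are hypothetical.

```python
def causal_neighbours(index, H, W, k=2):
    """1-based patch indices covered by a k x k kernel anchored at `index`
    when the grid is padded only on the top and left."""
    row, col = divmod(index - 1, W)
    covered = []
    for r in range(row - k + 1, row + 1):
        for c in range(col - k + 1, col + 1):
            if 0 <= r < H and 0 <= c < W:   # positions outside are zero padding
                covered.append(r * W + c + 1)
    return covered

def causal_conv2d(grid, kernel):
    """Single-channel k x k convolution with top/left zero padding, stride 1."""
    H, W, k = len(grid), len(grid[0]), len(kernel)
    pad = k - 1
    padded = [[0.0] * (W + pad) for _ in range(pad)]
    padded += [[0.0] * pad + list(row) for row in grid]
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            out[r][c] = sum(
                kernel[i][j] * padded[r + i][c + j]
                for i in range(k) for j in range(k)
            )
    return out

H = W = 5
# Use the patch ids 1..25 as the feature values so the sums are easy to read.
grid = [[float(r * W + c + 1) for c in range(W)] for r in range(H)]
summed = causal_conv2d(grid, [[1.0, 1.0], [1.0, 1.0]])
neighbours_of_19 = causal_neighbours(19, H, W)
```

With an all-ones kernel, the output at patch 19 is the sum of the four covered patch ids, and the output at patch 1 sees only itself plus padding, so no position ever reads a patch that comes later in the sequence.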

Figure 4: Comparison between the existing benchmarks and our LeafInstruct. In existing datasets such as Mantis-Instruct and InstructPix2Pix, the outputs are in a single modality, either text or image. In contrast, the inputs and outputs of our LeafInstruct cover multiple modalities.

4.2 Interleaved Inference

In this subsection, we explain our design for enabling interleaved inference of VLGs with the proposed Lateralization LoRA and the automatic routing of hidden states. Trained on our LeafInstruct dataset, the VLG learns to automatically decide whether to generate a text segment or an image based on the interleaved context. The generation process begins with the VLG generating a textual response. At each generation step, the VLG predicts a new text token, and this token becomes the input for generating the next token. The hidden states of newly generated tokens are always routed to the standard LoRA specialized for text. If the textual response concludes with the special image generation token <IMG>, the generated text is concatenated with the original context. Subsequently, the VLG takes in the updated context ending with <IMG> and generates a fixed-length ($H\times W$) sequence of image patch embeddings. Notably, the hidden state of the <IMG> token is used to predict the first image patch embedding and hence is routed to the Convolutional LoRA. At each step, the VLG takes in the newly generated image patch embedding and predicts the next image patch embedding. The hidden states of all newly generated image embeddings are routed to the Convolutional LoRA. The generated image, followed by an end-of-image token </IMG>, is appended to the context to serve as a new input, prompting the model to resume textual response generation. This iterative process terminates when the VLG produces the end-of-generation token </s> at the end of a textual response. It is important to note that a textual response may contain only an <IMG> token, indicating that image generation starts immediately, without preceding text. Likewise, a textual response may contain only a </s> token, indicating that generation should end immediately, without any new text at the end. These two special cases enable the VLG to generate images and text segments in an arbitrary order.
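The control flow of this procedure can be outlined as below. This is a sketch only: `next_text_token` and `next_image_patch` are hypothetical callables standing in for the VLG with the text-side and image-side adapters active, and the scripted stub replaces a real model so that the loop can be run end to end.

```python
IMG, END_IMG, EOS = "<IMG>", "</IMG>", "</s>"

def generate_interleaved(context, next_text_token, next_image_patch, H, W):
    """Alternate between text decoding and fixed-length image decoding."""
    context = list(context)
    while True:
        # Text phase: decode until <IMG> or </s>.
        while True:
            token = next_text_token(context)
            context.append(token)
            if token in (IMG, EOS):
                break
        if token == EOS:
            return context
        # Image phase: exactly H * W patch embeddings, then the closing token.
        patches = []
        for _ in range(H * W):
            patches.append(next_image_patch(context, patches))
        context.append(("image", patches))
        context.append(END_IMG)

# Scripted stand-in for the model: emits a fixed token sequence.
script = iter(["A", "cat", IMG, "sits", EOS])
output = generate_interleaved(
    ["Describe", "and", "draw"],
    next_text_token=lambda ctx: next(script),
    next_image_patch=lambda ctx, patches: len(patches),  # dummy "embedding"
    H=2, W=2,
)
```

The two special cases mentioned above fall out of the same loop: a text phase that emits `<IMG>` first starts an image immediately, and one that emits `</s>` first ends generation right after an image.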

In this autoregressive generation process, a challenge arises with the one-by-one generation of image patches due to the shape requirements of the convolutional operation. To address this issue, we pad the image patch embeddings with zero vectors to achieve a sequence length of $H\times W$.
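A minimal sketch of this padding step, assuming row-major patch order and zero vectors appended after the patches generated so far (the function name and toy values are hypothetical):

```python
def pad_to_grid(patches, H, W, dim):
    """Place the patches generated so far into an H x W grid (row-major),
    filling the not-yet-generated positions with zero vectors."""
    if len(patches) > H * W:
        raise ValueError("more patches than grid positions")
    missing = H * W - len(patches)
    flat = list(patches) + [[0.0] * dim for _ in range(missing)]
    return [flat[r * W:(r + 1) * W] for r in range(H)]

# Three of four patches have been generated for a 2x2 image.
partial = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
padded_grid = pad_to_grid(partial, H=2, W=2, dim=2)
```

Since the kernel described earlier only covers patches above and to the left of the current one, the zero-filled future positions should not influence the hidden states of the patches that already exist.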

5 Dataset: LeafInstruct

Existing interleaved vision-language models Sun et al. (2023c, a); Dong et al. (2024) predominantly follow a two-stage training procedure: they are first pretrained on massive corpora of interleaved data such as MMC4 Zhu et al. (2023c) and other resources, and then finetuned on a mix of high-quality datasets, such as the visual instruction tuning data in Liu et al. (2023c) and InstructPix2Pix Brooks et al. (2023b). However, one significant limitation of these instruction-tuning datasets is that the outputs are typically in a single modality, i.e., either text or image. This potentially hinders further advancement in interleaved generation, since the models cannot learn how to generate coherent content with interleaved text and images as specified by a given instruction.

To bridge the gap between limited existing resources and the practical need for improving interleaved generation models, we curated LeafInstruct, the first comprehensive instruction tuning dataset for interleaved text-and-image generation. Each instance in our dataset consists of (1) a detailed instruction, (2) an optional input context interleaved with text and images, and (3) a ground-truth output that is also interleaved with text and images. Figure 4 shows an example from LeafInstruct and how it compares with other representative instruction-based datasets, i.e., InstructPix2Pix Brooks et al. (2023b) and Mantis-Instruct Jiang et al. (2024). In Table 3, we also compare LeafInstruct with existing training datasets in terms of which modalities are covered in the input and output and whether they are publicly available. Compared with existing datasets, LeafInstruct covers the most complete set of modalities, i.e., both text and images, in the inputs and outputs.

Figure 5: Top-10 domain distribution in LeafInstruct.

Dataset construction

We construct a diverse instruction-tuning data collection from large-scale web resources, i.e., MMC4, and academic datasets, i.e., VIST Huang et al. (2016) and YouCook2 Zhou et al. (2018). Since the original data sources can be noisy, especially the website collection in MMC4, we devised a pipeline to filter and re-annotate the source data to ensure the high quality of our curated data. Firstly, we filter the samples based on the text length, the number of images, and the coherence between text and images (measured by CLIPScore). We only keep instances that have 3 to 6 images in total. We also discard instances with more than 12 sentences to ensure a balanced ratio between the amount of text and the number of images. Secondly, we leverage a state-of-the-art open-source LLM (i.e., Llama3-8B-Instruct) as a text filter to discard instances with poor text quality. Thirdly, we apply an image filter that uses the LPIPS score Zhang et al. (2018) to remove instances with duplicate or perceptually identical images, ensuring the diversity of the images. Finally, we use Llama3 to annotate the task instruction for each instance based on the text content and to rewrite the text if it is too verbose, which prevents the context from becoming too long. We include more details on dataset construction in Appendix B.
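The first filtering stage can be sketched as below. The image-count and sentence-count bounds come from the text; everything else is an assumption for illustration: the `Instance` record, the naive regex sentence splitter, and the `min_clip_score` parameter (the paper does not report the CLIPScore threshold it uses, so the caller must supply one).

```python
import re
from dataclasses import dataclass


@dataclass
class Instance:
    text: str
    num_images: int
    clip_score: float  # assumed to be precomputed text-image coherence


def count_sentences(text: str) -> int:
    """Naive sentence count: split on runs of '.', '!' or '?' (an assumption)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def passes_basic_filter(inst: Instance, min_clip_score: float) -> bool:
    """Keep instances with 3 to 6 images, at most 12 sentences, and enough coherence."""
    return (
        3 <= inst.num_images <= 6
        and count_sentences(inst.text) <= 12
        and inst.clip_score >= min_clip_score  # hypothetical threshold
    )
```

The text-quality and image-duplication filters described next would run only on instances that survive this inexpensive check.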

Statistics

After applying our rigorous data processing pipeline, we obtain a total of 38,272 high-quality instances out of more than 7 million source samples. We use Llama3 to tag the domain of each instance and show the distribution of the top 10 domains in our dataset in Figure 5.

6 Experiment

6.1 Baselines

We compare the EMU2 model tuned with our proposed Lateralization LoRA and LeafInstruct to state-of-the-art open-source VLGs, including GILL Koh et al. (2023), Mini-GPT5 Zheng et al. (2023), and pretrained EMU2 Sun et al. (2023a), and to proprietary models, including Gemini 1.5 Reid et al. (2024) + SDXL Podell et al. (2023) and GPT-4o (https://openai.com/index/hello-gpt-4o/) + DALL·E 3 (https://openai.com/index/dall-e-3/).

6.2 Evaluation Benchmarks

We evaluate the interleaved generation capability of our method on the recently introduced InterleavedBench benchmark. InterleavedBench is a comprehensive dataset covering a diverse array of 10 tasks, such as multimodal script generation, visual storytelling, and activity generation, carefully curated by human annotators. InterleavedBench has two splits: a context-based split, in which the input of each instance contains interleaved text and images, and a context-free split with text-only inputs. The context-based split contains 465 instances and the context-free split contains 350 instances. We only use the context-based split as the test set, since we mainly focus on tasks with interleaved inputs and outputs.

6.3 Evaluation Metrics

Recent studies in interleaved evaluation An et al. (2023) point out that, due to the vast output space of interleaved generation, reference-based evaluation metrics often fail to accurately assess generation quality. Thus, we adopt InterleavedEval Liu et al. (2024), a strong reference-free evaluation metric based on GPT-4o that is flexible enough to evaluate arbitrarily interleaved generation. InterleavedEval prompts GPT-4o to score an interleaved generation on five aspects: Text Quality, Perceptual Quality, Image Coherence, Text-Image Coherence (TIC), and Helpfulness. For each aspect, GPT-4o outputs a discrete score from {0, 1, 2, 3, 4, 5}. We refer readers to the original paper for the detailed definition of each score in each aspect. The implementation details of our method can be found in Appendix A.1.

Figure 6: Qualitative results of Lateralization LoRA and open-source baselines. The <IMG> tokens denote where to insert the images.

7 Results and Discussion

| Model | Text Quality | Perceptual Quality | Image Coherence | TIC | Helpfulness |
| --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | |
| Gemini1.5 + SDXL | 3.37 | 4.34 | 3.34 | 3.98 | 3.28 |
| GPT-4o + DALL·E 3 | 3.16 | 4.44 | 3.13 | 4.39 | 3.46 |
| Open-Source Models | | | | | |
| MiniGPT-5 | 1.31 | 3.46 | 2.06 | 2.66 | 1.76 |
| GILL | 1.44 | 4.17 | 2.12 | 2.69 | 1.53 |
| EMU-2 | 1.33 | 2.29 | 1.71 | 1.22 | 1.87 |
| EMU2 + Lateralization LoRA | 1.95 | 2.41 | 2.64 | 2.81 | 2.05 |

Table 1: Main results of baseline models and our methods. The top part shows the performance of proprietary models, the middle part shows the performance of open-source VLGs, and the bottom part shows the performance of EMU2 instruction tuned with our proposed methods and dataset.

| Model | Text Quality | Image Quality | Image Coherence | TIC | Helpfulness |
| --- | --- | --- | --- | --- | --- |
| EMU-2 | 1.33 | 2.29 | 1.71 | 1.22 | 1.87 |
| + LoRA | 1.25 | 1.43 | 1.61 | 1.79 | 1.30 |
| + MoE-LoRA | 1.94 | 2.22 | 2.42 | 2.54 | 1.90 |
| + Lateralization LoRA | 1.95 | 2.41 | 2.64 | 2.81 | 2.05 |

Table 2: Performance of our Lateralization LoRA, traditional linear LoRA, and Mixture-of-Expert (MoE) LoRA. Mixture-of-Expert LoRA uses two different sets of linear LoRA for images and text, respectively.
Figure 7: Examples to show why EMU2 model has worse Perceptual Quality than other baselines. The <IMG> tokens denote where to insert the images.

7.1 Main Results

Table 1 presents the main results of our method in comparison to the baselines. Instruction tuning the EMU2 model with Lateralization LoRA significantly enhances its performance across all evaluated aspects, particularly in Image Coherence and Text-Image Coherence. Furthermore, our method surpasses existing open-source VLGs in four out of five aspects, with a significant improvement in the comprehensive Helpfulness aspect, which measures how well the interleaved responses follow the task instructions and provide helpful information to achieve tasks. To better interpret these results, we select three representative examples and show them in Figure 6. Firstly, Mini-GPT5 often fails to generate explanatory text, resulting in its poor performance in Text Quality. Secondly, both GILL and Mini-GPT5 frequently produce images and text that are irrelevant to the input context. In contrast, our model generates interleaved images and text that adhere closely to the user's instructions and input context, which explains the superior performance of our model in the Helpfulness aspect (9.6% improvement compared to EMU2). Finally, neither GILL nor Mini-GPT5 can preserve the visual appearance of the entities and scenes in the input images. Our model, however, faithfully retains these visual characteristics, leading to significantly better Image Coherence (28.2% improvement compared to Mini-GPT5).
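For reference, both percentages quoted above can be reproduced from the Table 1 scores, assuming they are relative improvements over the respective baseline (the text does not state the formula explicitly):

```latex
\[
\frac{2.05 - 1.87}{1.87} \approx 9.6\% \quad \text{(Helpfulness, vs. EMU2)},
\qquad
\frac{2.64 - 2.06}{2.06} \approx 28.2\% \quad \text{(Image Coherence, vs. Mini-GPT5)}.
\]
```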

In Table 1, we also present the performance of proprietary models for reference. However, we argue that a direct comparison between our model and proprietary models is not entirely fair due to the potential differences in model sizes, the number of training instances, and model architectures.

An interesting observation is that the EMU-2 model performs significantly worse in Perceptual Quality compared to other baselines. Even after being instruction-tuned with Lateralization LoRA and LeafInstruct, its performance in this aspect remains relatively low. Upon manually examining the models’ predictions, we discovered that GILL and Mini-GPT5 might exploit this evaluation criterion, a problem we discuss in detail in Subsection 7.3.

7.2 Effect of Lateralization LoRA

To isolate the effect of the instruction tuning data and directly demonstrate the performance improvement brought by our proposed Lateralization LoRA, we fine-tuned EMU2 using (1) traditional linear LoRA Hu et al. (2021) and (2) Mixture-of-Expert (MoE) LoRA Shen et al. (2024), with the results presented in Table 2. In the traditional linear LoRA, text and images share the same low-rank adaptation parameters, while in the MoE LoRA, two different sets of linear LoRA are used for images and text, respectively. The routing strategy in MoE LoRA is based on the output modality of each hidden state, i.e., whether the hidden state is used to predict a text token or an image patch embedding. Comparing the performance of traditional linear LoRA and MoE LoRA in Table 2 clearly shows the benefit of using separate parameters for images and text. Although our proposed Lateralization LoRA performs similarly to the MoE LoRA in the Text Quality aspect, it significantly outperforms the MoE LoRA in aspects related to image quality, such as Image Coherence and Text-Image Coherence (TIC). This result supports our claim that Convolutional LoRA can better model the local priors of images, thereby improving the performance of image generation.
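The modality-based routing used by the MoE LoRA baseline can be sketched in a few lines. This is an illustrative toy in pure Python, not the authors' code: the class and function names are hypothetical, and the `alpha / r` scaling is the standard LoRA convention, which we assume applies here. Lateralization LoRA keeps the same routing rule but replaces the image adapter with a convolutional one, which is not shown.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LinearLoRA:
    """Low-rank update: delta(h) = (alpha / r) * up @ (down @ h)."""

    def __init__(self, down: Matrix, up: Matrix, alpha: float):
        self.down, self.up = down, up    # down: r x d_in, up: d_out x r
        self.scale = alpha / len(down)   # assumed standard LoRA scaling

    def delta(self, h: Vector) -> Vector:
        return [self.scale * y for y in matvec(self.up, matvec(self.down, h))]


def routed_forward(
    h: Vector,
    base_weight: Matrix,             # frozen weight of the base model
    adapters: Dict[str, LinearLoRA],  # one adapter per output modality
    target_modality: str,             # "text" or "image": what this state predicts
) -> Vector:
    """Frozen projection plus the adapter chosen by the output modality."""
    base = matvec(base_weight, h)
    delta = adapters[target_modality].delta(h)
    return [b + d for b, d in zip(base, delta)]
```

Routing on the modality being predicted, rather than the modality of the input token, is what sends the hidden state of `<IMG>` to the image adapter.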

7.3 Why Does the EMU2 Model Perform Worse on the Perceptual Quality Aspect than Other Baselines?

In this subsection, we investigate why the EMU2 model trained with Lateralization LoRA has significantly lower Perceptual Quality scores compared to other baselines. As illustrated in Figure 7, both GILL and Mini-GPT5 often disregard the instructions and context images, generating images with minimal constraints. In contrast, the EMU2 model trained with Lateralization LoRA strives to adhere to the instructions and accurately condition its generation on the provided images. However, the complexity of the editing task results in our model producing images with lower quality and noticeable distortions. For example, in the first row of Figure 7, the images generated by GILL and Mini-GPT5 receive Perceptual scores of 5 and 4, respectively, while the image generated by our model receives a score of 2. Similarly, in the second row, the images generated by GILL and Mini-GPT5 receive Perceptual scores of 4 each, whereas our model receives a score of 1. But when taking Helpfulness into account, the images generated by our model are better than images generated by GILL and Mini-GPT5.

8 Conclusion

We propose Lateralization LoRA, a modality-specialized low-rank adaptation method tailored for VLGs. Lateralization LoRA dedicates a set of linear LoRA for processing text and a set of Convolutional LoRA for images, allowing each modality to have its own optimal adaptation design. Further, we propose the first interleaved instruction tuning dataset LeafInstruct by leveraging LLMs to automatically generate instructions and develop a rigorous filtering process to ensure the data quality. Extensive experiments on InterleavedBench showcase the effectiveness of our method and dataset.

9 Limitations

Our method focuses on EMU2 Sun et al. (2023a) due to its strong performance. Future work can apply our method to more architectures, including models that use discrete image tokens Aghajanyan et al. (2022). Our method also has the potential to be applied to VLMs, which future work can explore.

Our method uses two sets of LoRA for images and text, respectively. However, we do not experiment with the setting in which a large group of LoRA is applied to the model. Future work should explore the possibility of increasing the number of LoRA modules.

References

Appendix A More Details of Lateralization LoRA

| Dataset Name | Input Text | Input Images | Output Text | Output Images | Publicly Available |
| --- | --- | --- | --- | --- | --- |
| LLaVA Liu et al. (2023c) | Yes | Single | Single | No | Yes |
| MultiInstruct Xu et al. (2023b) | Yes | Single | Single | No | Yes |
| Vision-Flan Xu et al. (2024) | Yes | Single | Single | No | Yes |
| InstructPix2Pix Brooks et al. (2023a) | Yes | Single | No | Single | Yes |
| MagicBrush Zhang et al. (2023a) | Yes | Single | No | Single | Yes |
| SuTI Chen et al. (2023c) | Yes | Multiple | No | Single | No |
| Instruct-Imagen Hu et al. (2024) | Yes | Multiple | No | Single | No |
| Mantis-Instruct Jiang et al. (2024) | Yes | Multiple | Yes | No | Yes |
| LeafInstruct (Ours) | Yes | Multiple | Yes | Multiple | Yes |

Table 3: Comparison between our LeafInstruct and existing instruction tuning datasets.

A.1 Implementation Details

We use the EMU2 model Sun et al. (2023a) as our base model. It consists of EVA-02-CLIP-E-plus Sun et al. (2023b) as the image encoder, the LLaMA-33B Touvron et al. (2023) language model, and SDXL Podell et al. (2023) as the image decoder. EVA-02-CLIP-E-plus and LLaMA-33B are connected by a linear project-up layer, and LLaMA-33B and SDXL are connected by a linear project-down layer. All the LoRA variants in Section 7.2, including our Lateralization LoRA, are trained on LeafInstruct for one epoch on $8 \times$ A100 GPUs with a learning rate of $2 \times 10^{-5}$, a batch size of 1 per GPU, and 16 gradient accumulation steps. All LoRA modules have a rank of 128 and a dropout rate of 0.05, and the LoRA $\alpha$ in Section 4.1 is set to $2 \times 128$. The convolution kernel size of Lateralization LoRA is $2 \times 2$, and the stride is set to 1. During training, all parameters of the EMU2 model are kept frozen and only the LoRA parameters are updated.
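The hyperparameters above can be collected in a small configuration object, which also makes two derived quantities explicit. The field names are ours, and both derived values rest on assumptions: the effective batch size assumes plain data parallelism across the GPUs, and the scaling factor assumes the standard LoRA convention of $\alpha / r$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    num_gpus: int = 8
    per_gpu_batch_size: int = 1
    grad_accum_steps: int = 16
    learning_rate: float = 2e-5
    epochs: int = 1
    lora_rank: int = 128
    lora_alpha: int = 2 * 128
    lora_dropout: float = 0.05
    conv_kernel_size: int = 2  # 2 x 2 kernel of the Convolutional LoRA
    conv_stride: int = 1

    @property
    def effective_batch_size(self) -> int:
        # Assumes data parallelism: every GPU processes its own micro-batch.
        return self.num_gpus * self.per_gpu_batch_size * self.grad_accum_steps

    @property
    def lora_scale(self) -> float:
        # Assumes the standard LoRA scaling factor alpha / r.
        return self.lora_alpha / self.lora_rank
```

Under these assumptions, each optimizer step sees 128 instances and the low-rank update is scaled by 2.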

Appendix B More Details of LeafInstruct

B.1 More Details in Dataset Construction

Details of Text Quality Filter

We use the Llama3-8B-Instruct model to rate the text quality of an instance with the following prompt: “Imagine you are an expert data annotator. You are given a text material and you need to evaluate its quality in terms of whether it is coherent, fluent, easy to understand, and helpful to humans. Please be critical and rate the quality as good only when the text quality is good in all four aspects. Output 1 if you think the material is good after you consider all four aspects. Output 0 if you think the material is not good enough. Here is the text material to be evaluated: {TEXT} Only output 0 or 1 and do not output anything else. Your evaluation is:” We discard an instance if the output from Llama is 0.
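A minimal sketch of how this filter could be wired up. The `generate` callable is a hypothetical stand-in for a call to the LLM, and treating any reply other than "1" as a rejection is our assumption; the paper only specifies that instances rated 0 are discarded.

```python
from typing import Callable


def keep_instance(text: str, prompt_template: str, generate: Callable[[str], str]) -> bool:
    """Fill the {TEXT} slot of the rating prompt and keep the instance only on a '1'."""
    # str.replace is used instead of str.format so braces inside `text` are harmless.
    prompt = prompt_template.replace("{TEXT}", text)
    reply = generate(prompt).strip()
    # Assumption: malformed replies (neither "0" nor "1") are also discarded.
    return reply == "1"
```

For example, `keep_instance(sample_text, quality_prompt, llm_call)` returns `True` only when the model replies with a bare `1`, where `quality_prompt` holds the prompt quoted above and `llm_call` is whatever inference function is available.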

Details of Image Filter

We empirically found that if the images within a training instance are too similar, the trained models tend to take a shortcut and simply copy the image during generation. To this end, we design a filter that discards instances with duplicate images to improve data quality. Specifically, we leverage the LPIPS score Zhang et al. (2018), which measures the perceptual similarity between images. For each instance, we enumerate every pair of images and compute their LPIPS score. If any pair has a score higher than 0.6, we discard the instance. We determined the threshold of 0.6 empirically.
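The pairwise check can be sketched as follows. `pair_score` is a hypothetical callable standing in for the LPIPS-based score, and the rule mirrors the text: an instance is rejected as soon as one pair exceeds the threshold. Note that common LPIPS implementations return a distance in which smaller values mean more similar images, so this sketch assumes `pair_score` is oriented the way the text describes, with larger values flagging pairs to reject.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def has_near_duplicates(
    images: Sequence[Image],
    pair_score: Callable[[Image, Image], float],  # hypothetical LPIPS-based score
    threshold: float = 0.6,
) -> bool:
    """Return True if any unordered image pair scores above the threshold."""
    return any(pair_score(a, b) > threshold for a, b in combinations(images, 2))
```

Since each instance holds at most 6 images, the check evaluates at most 15 pairs per instance.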

Details of Instruction Annotation

We also use Llama3-8B-Instruct to annotate the task instruction for each instance, with the following prompt: “Imagine you are an expert instruction annotator. You are given a material. You need to read its content and output a brief task instruction with one sentence such that another person can recover the given the material given the instruction. The instruction you predict should be specifically tailored for creative interleaved content generation that consists of both text and images. Now you need to annotate a concise, accurate instruction for the following instance. Please only predict the instruction and do not output anything else. Please design the instruction for the multi-modal generation task interleaved with both text and images. Text: {TEXT} Instruction:”.