InterleavedEval: Holistic Evaluation for Interleaved Text-and-Image Generation

Created by Virginia Tech's NLP Lab.







Interleaved text-and-image generation is an intriguing research direction, where models are required to generate both images and text pieces in an arbitrary order. Despite emerging advancements in interleaved generation, progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they cover only a limited number of domains and use cases. In addition, current works predominantly use similarity-based metrics, which fall short of assessing quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks that cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o that delivers accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate existing models, achieving a strong correlation with human judgments that surpasses previous reference-based metrics. We also provide substantial findings and insights to foster future research on interleaved generation and its evaluation.
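To make the metric concrete, the sketch below shows how a reference-free, GPT-4o-based judge over the five aspects could be implemented with the OpenAI Python SDK. The prompt wording, the `evaluate_interleaved` helper, and the JSON output format are illustrative assumptions and are not the exact prompt or interface used by InterleavedEval.

```python
# Minimal sketch of a GPT-4o-as-judge evaluator for interleaved outputs.
# Illustrative approximation only; NOT the exact InterleavedEval prompt.
import json
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASPECTS = ["text quality", "perceptual quality", "image coherence",
           "text-image coherence", "helpfulness"]

def evaluate_interleaved(instruction: str, text_pieces: list[str],
                         image_urls: list[str]) -> dict:
    """Ask GPT-4o to rate an interleaved text-and-image output on a 0-5 scale
    for each aspect, returning a dict of aspect -> score."""
    prompt = (
        "You are evaluating an interleaved text-and-image response.\n"
        f"Instruction: {instruction}\n"
        f"Generated text pieces: {text_pieces}\n"
        "Rate each of the following aspects from 0 (worst) to 5 (best): "
        + ", ".join(ASPECTS)
        + ". Respond with a JSON object mapping each aspect to an integer score."
    )
    content = [{"type": "text", "text": prompt}]
    # Attach the generated images so the judge can inspect them directly.
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Because the judge inspects the generated text and images directly, no gold-reference outputs are needed, which is what makes the metric reference-free.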





InterleavedBench

We introduce INTERLEAVEDBENCH, the first comprehensive benchmark meticulously constructed to evaluate interleaved text-and-image generation.

Our dataset includes two subsets:

Comparison with Existing Benchmarks

| Dataset Name | Detailed Instruction | Image Input | Text Output | Image Output |
|---|---|---|---|---|
| MagicBrush | No | Single | No | Single |
| DreamBench | No | Multiple | No | Single |
| CustomDiffusion | No | Multiple | No | Single |
| DreamEditBench | No | Multiple | No | Single |
| Mantis-Eval | Yes | Multiple | Yes | No |
| InterleavedBench (Ours) | Yes | Multiple | Yes | Multiple |

We highlight the following key differences and unique challenges introduced by our INTERLEAVEDBENCH compared with existing benchmarks:

Main Results

Baselines


Automatic Evaluation


Note: TIC means "Text-Image Coherence" and we use a scale of 0-5 for this evaluation.


| Model | Text Quality | Perceptual Quality | Image Coherence | TIC | Helpfulness | AVG |
|---|---|---|---|---|---|---|
| MiniGPT-5 | 1.22 | 2.45 | 1.62 | 2.03 | 1.77 | 1.82 |
| GILL | 0.75 | 3.21 | 2.25 | 1.53 | 1.48 | 1.84 |
| EMU-2 | 1.26 | 2.28 | 1.89 | 1.34 | 1.64 | 1.68 |
| EMU-2 (Gold Text) | 1.56 | 3.35 | 2.89 | 1.43 | 2.10 | 2.27 |
| Gemini1.5 + SDXL | 4.40 | 3.99 | 3.64 | 4.13 | 3.62 | 3.96 |
| GPT-4o + DALLE3 | 4.37 | 4.36 | 3.51 | 4.55 | 3.88 | 4.13 |
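Consistent with the numbers above, the AVG column appears to be the unweighted mean of the five aspect scores, rounded to two decimals. The small sketch below (variable names are illustrative) reproduces the reported average for GPT-4o + DALLE3.

```python
# Reproduce the AVG column as the plain mean of the five aspect scores,
# rounded to two decimals, using the GPT-4o + DALLE3 row above.
scores = {
    "Text Quality": 4.37,
    "Perceptual Quality": 4.36,
    "Image Coherence": 3.51,
    "TIC": 4.55,
    "Helpfulness": 3.88,
}
avg = round(sum(scores.values()) / len(scores), 2)
print(avg)  # 4.13, matching the table
```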


Human Evaluation


Note: TIC means "Text-Image Coherence" and we use a scale of 0-3 for this evaluation.


| Model | Text Quality | Perceptual Quality | Image Coherence | TIC | Helpfulness | AVG |
|---|---|---|---|---|---|---|
| GILL | 1.35 | 1.89 | 1.72 | 1.43 | 1.19 | 1.52 |
| EMU-2 | 1.23 | 1.74 | 1.87 | 1.24 | 1.2 | 1.46 |
| Gemini1.5 + SDXL | 2.59 | 2.36 | 2.13 | 2.27 | 2.08 | 2.28 |
| GPT-4o + DALLE3 | 2.49 | 2.51 | 2.02 | 2.31 | 2.13 | 2.29 |
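The abstract reports a strong correlation between InterleavedEval and human judgments. As a rough illustration of how such agreement can be quantified, the sketch below computes Spearman's rank correlation over the Helpfulness columns of the two tables for the four models rated in both settings; since Spearman's rho is rank-based, the differing 0-5 and 0-3 scales do not matter. The paper's actual correlation protocol (e.g., per-example scoring) may differ, so this is purely for illustration.

```python
# Illustrative sketch: rank correlation between automatic (0-5) and human (0-3)
# Helpfulness scores for the four models evaluated in both tables above.
from scipy.stats import spearmanr

models = ["GILL", "EMU-2", "Gemini1.5 + SDXL", "GPT-4o + DALLE3"]
auto_helpfulness = [1.48, 1.64, 3.62, 3.88]    # InterleavedEval (0-5 scale)
human_helpfulness = [1.19, 1.2, 2.08, 2.13]    # human ratings (0-3 scale)

rho, _ = spearmanr(auto_helpfulness, human_helpfulness)
print(f"Spearman rho = {rho:.2f}")
```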


Evaluation results on each evaluation aspect for each task



Qualitative Analysis





Citation

If you use InterleavedEval in your research, please cite the following paper.


@article{liu_holistic_2024,
  author       = {Minqian Liu and
                  Zhiyang Xu and
                  Zihao Lin and
                  Trevor Ashby and
                  Joy Rimchala and
                  Jiaxin Zhang and
                  Lifu Huang},
  title        = {Holistic Evaluation for Interleaved Text-and-Image Generation},
  journal      = {CoRR},
  volume       = {abs/2406.14643},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2406.14643},
  doi          = {10.48550/ARXIV.2406.14643},
  eprinttype    = {arXiv},
  eprint       = {2406.14643},
  timestamp    = {Tue, 16 Jul 2024 16:17:50 +0200}
}
                




Acknowledgement

The InterleavedEval dataset is for research purposes only. Please carefully check the licenses of the original datasets before using InterleavedEval. We provide the URLs to the original datasets and their BibTeX entries on this page. The images and tasks may be taken down at any time upon request from the original dataset owners or the owners of the referenced images. If you would like any tasks or images to be taken down, please contact Minqian Liu and Lifu Huang.