Paragraph-to-Image Generation with Information-Enriched Diffusion Model

Wu, Weijia; Li, Zhuang; He, Yefei; Shou, Mike Zheng; Shen, Chunhua; Cheng, Lele; Li, Yan; Gao, Tingting; Zhang, Di; Wang, Zhongyuan

arXiv:2311.14284 (cs)

[Submitted on 24 Nov 2023 (v1), last revised 29 Nov 2023 (this version, v2)]

Abstract: Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which transfers the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is the use of a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data and a large-scale synthetic dataset whose long text descriptions are generated by a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.
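
The abstract mentions fine-tuning with LoRA to align the text and image feature spaces. As a rough illustration of the general LoRA idea only (not the authors' implementation, which is not described on this page), the sketch below keeps a weight matrix W frozen and adds a trainable low-rank update, W' = W + (alpha / r) * B A. The class name `LoRALinear`, the helper `matmul`, and all sizes are hypothetical and chosen for readability; a real setup would use a deep-learning framework rather than Python lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Toy linear layer: frozen weight W (out x in) plus a low-rank update.

    Effective weight: W + (alpha / rank) * B @ A, where A is (rank x in)
    and B is (out x rank). Only A and B would be trained.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight (assumed given)
        self.scale = alpha / rank
        # Standard LoRA initialisation: A small random, B zero, so the
        # adapted layer starts out identical to the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)  # (out x in) low-rank update
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """x: batch of input vectors (batch x in) -> (batch x out)."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # (in x out)
        return matmul(x, w_t)

    def num_trainable(self):
        """Number of parameters in A and B (the only trainable ones)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 3x2 frozen weight with a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([[1.0, 1.0]]))  # equals the frozen layer's output at init
print(layer.num_trainable())        # rank * (in + out) parameters
```

Because B starts at zero, the adapter does not change the layer's output before training. The number of trainable parameters grows with rank * (in + out) instead of in * out, which is the main reason a low-rank adapter is attractive when adapting a large language model.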

Comments:	The project website is at: this https URL. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.14284 [cs.CV]
	(or arXiv:2311.14284v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.14284

Submission history

From: Weijia Wu [view email]
[v1] Fri, 24 Nov 2023 05:17:01 UTC (26,395 KB)
[v2] Wed, 29 Nov 2023 12:01:35 UTC (26,191 KB)
