LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Zhang, Yanzhe; Zhang, Ruiyi; Gu, Jiuxiang; Zhou, Yufan; Lipka, Nedim; Yang, Diyi; Sun, Tong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2306.17107 (cs)

[Submitted on 29 Jun 2023 (v1), last revised 2 Feb 2024 (this version, v2)]

Abstract: Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at this https URL.

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2306.17107 [cs.CV]
	(or arXiv:2306.17107v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.17107

Submission history

From: Yanzhe Zhang
[v1] Thu, 29 Jun 2023 17:08:16 UTC (19,086 KB)
[v2] Fri, 2 Feb 2024 19:44:14 UTC (20,644 KB)
