Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

Zhu, Lei; Wei, Fangyun; Lu, Yanye


arXiv:2403.07874 (cs)

[Submitted on 12 Mar 2024]



Abstract: In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity and translates it into a set of discrete words drawn from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion, crucially without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks such as image recognition, image captioning, and visual question answering, as well as image denoising tasks such as inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
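
The idea of translating an image into words from a fixed vocabulary can be illustrated with a generic nearest-neighbour lookup. The abstract does not specify how the V2T Tokenizer performs this mapping, so the sketch below is only an assumption-laden illustration and not the authors' method: the function name quantize_to_vocab, the toy vocabulary, and the toy feature vectors are all hypothetical, and cosine similarity is assumed as the matching criterion.

```python
import numpy as np

def quantize_to_vocab(patch_features, vocab_embeddings, vocab_words):
    """Map each patch feature to the vocabulary word with the most similar embedding.

    Hypothetical helper for illustration only; it is NOT the V2T Tokenizer.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    p = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    similarity = p @ v.T                 # shape: (num_patches, vocab_size)
    ids = similarity.argmax(axis=1)      # best-matching word index per patch
    return ids, [vocab_words[i] for i in ids]

# Toy stand-ins (assumed): a 4-word "vocabulary" with one-hot embeddings
# and three image-patch features living in the same 4-dimensional space.
vocab_words = ["sky", "grass", "dog", "wall"]
vocab_embeddings = np.eye(4)
patch_features = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.1],
    [0.1, 0.7, 0.1, 0.0],
])

token_ids, tokens = quantize_to_vocab(patch_features, vocab_embeddings, vocab_words)
print(tokens)  # ['sky', 'dog', 'grass']
```

In this toy setting the resulting word sequence could be placed in a prompt for a frozen LLM; how the actual method uses the encoder-decoder and CLIP model alongside the vocabulary is described in the paper, not here.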

Comments:	Accepted by CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.07874 [cs.CV]
	(or arXiv:2403.07874v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.07874

Submission history

From: Fangyun Wei
[v1] Tue, 12 Mar 2024 17:59:51 UTC (3,993 KB)
