Make A Long Image Short: Adaptive Token Length for Vision Transformers

Zhu, Yichen; Zhu, Yuqin; Du, Jie; Wang, Yi; Ou, Zhicai; Feng, Feifei; Tang, Jian

arXiv:2112.01686v2 (cs)

[Submitted on 3 Dec 2021 (v1), last revised 6 Dec 2021 (this version, v2)]

Abstract: The vision transformer (ViT) splits each image into a fixed-length sequence of tokens and processes them in the same way as words in natural language processing. More tokens normally lead to better performance but at considerably higher computational cost. Motivated by the proverb "A picture is worth a thousand words," we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach that assigns token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with diverse token lengths. Then, we retrieve the "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length label of an image is the smallest number of tokens into which the image can be split such that ReViT still makes the correct prediction, and the TLA is trained to allocate the optimal token length based on these labels. The TLA enables ReViT to process each image with the minimum sufficient number of tokens during inference, so inference is accelerated by reducing the number of tokens in the ViT model. Our approach is general, compatible with modern vision transformer architectures, and can significantly reduce computational expense. We verify the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).
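The label-retrieval and inference steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_predict` is a hypothetical stand-in for a trained ReViT forward pass, the candidate token lengths `(49, 100, 196)` are illustrative values only, and falling back to the longest candidate when no length gives a correct prediction is an assumption made here, since the abstract does not say how such images are handled.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical candidate token lengths (illustrative, not taken from the paper).
CANDIDATE_LENGTHS: Tuple[int, ...] = (49, 100, 196)

# A predictor maps (image, number of tokens) to a predicted class id.
Predictor = Callable[[Any, int], int]


def token_length_label(
    predict: Predictor,
    image: Any,
    true_class: int,
    candidate_lengths: Sequence[int] = CANDIDATE_LENGTHS,
) -> int:
    """Return the smallest candidate token length with a correct prediction.

    Assumption: if no candidate yields the correct class, fall back to the
    longest candidate.
    """
    ordered = sorted(candidate_lengths)
    for num_tokens in ordered:
        if predict(image, num_tokens) == true_class:
            return num_tokens
    return ordered[-1]


def adaptive_predict(
    assigner: Callable[[Any], int],
    predict: Predictor,
    image: Any,
) -> Tuple[int, int]:
    """Run inference with the token length chosen by an assigner (TLA stand-in).

    Returns (predicted class, token length used).
    """
    num_tokens = assigner(image)
    return predict(image, num_tokens), num_tokens


def toy_predict(image: dict, num_tokens: int) -> int:
    """Hypothetical stand-in for ReViT: correct only when given enough tokens."""
    return image["label"] if num_tokens >= image["min_tokens"] else -1


# Toy "dataset": each image records its class and how many tokens it needs.
images = [
    {"label": 3, "min_tokens": 49},   # easy image
    {"label": 1, "min_tokens": 150},  # harder image
    {"label": 7, "min_tokens": 500},  # never classified correctly
]

# Step 1: retrieve token-length labels from the (stand-in) ReViT.
labels = [token_length_label(toy_predict, img, img["label"]) for img in images]

# Step 2: a real TLA would be a lightweight network trained on these labels;
# here a lookup table plays that role purely for illustration.
label_by_id = {id(img): lab for img, lab in zip(images, labels)}
oracle_assigner = lambda img: label_by_id[id(img)]

# Step 3: inference with the assigned token length.
results = [adaptive_predict(oracle_assigner, toy_predict, img) for img in images]
```

In this toy setup the easy image is processed with the shortest token length and the harder one with the longest, which is the source of the claimed savings: compute is spent on long token sequences only where shorter ones fail.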

Comments:	10 pages, Technical report
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2112.01686 [cs.CV]
	(or arXiv:2112.01686v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.01686

Submission history

From: Jie Du
[v1] Fri, 3 Dec 2021 02:48:51 UTC (8,782 KB)
[v2] Mon, 6 Dec 2021 03:24:53 UTC (8,780 KB)
