Learning More May Not Be Better: Knowledge Transferability in Vision-and-Language Tasks

J Imaging. 2024 Nov 22;10(12):300. doi: 10.3390/jimaging10120300.

Abstract

Is learning more knowledge always better for vision-and-language models? In this paper, we study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that joining multiple datasets from different tasks improves overall performance. However, we show that not all knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conducted an exhaustive analysis based on hundreds of cross-experiments on twelve vision-and-language tasks categorized into four groups. Although tasks in the same group tend to improve each other, the results show that this is not always the case. In addition, other factors, such as dataset size or the pre-training stage, may have a great impact on how well the knowledge is transferred.

Keywords: knowledge transferability analysis; multi-modal learning; vision and language.