Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Kohei Uehara¹,² Nabarun Goswami¹,² Hanqin Wang¹ Toshiaki Baba¹,² Kohtaro Tanaka¹,² Tomohiro Hashimoto¹,² Kai Wang¹,² Rei Ito¹ Takagi Naoya¹,² Ryo Umagami¹ Yingyi Wen¹ Tanachai Anakewat¹,² Tatsuya Harada¹,²
¹The University of Tokyo, ²RIKEN
{uehara, nabarungoswami, wang, baba, k-tanaka, hashimoto, wang-kai, ito, takagi, umagami, wenyy, anakewat, harada}@mi.t.u-tokyo.ac.jp

Abstract

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the models to generate questions and perform iterative reasoning during inference. The results demonstrated a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.

1 Introduction

Figure 1: An example of the explicit reasoning steps we aim to achieve, also representing a sample from our constructed dataset. It demonstrates the thought process in response to a given question. Notably, we incorporate a question generation step into the reasoning process (as seen on the right side of the figure), allowing the model to interactively acquire knowledge and refine its reasoning steps.

In recent years, LLMs have gained significant popularity in the field of artificial intelligence [1, 24, 31]. Building upon this, research has expanded into large Vision-and-Language Models (VLMs) [11, 5, 18, 17]. Our study specifically focuses on tasks involving both vision and language modalities. Generally, VLMs learn the alignment between images and texts using large datasets sourced by web crawling. They are then fine-tuned with smaller, high-quality datasets for instruction tuning to enhance their text generation capabilities for various tasks. However, these models often struggle with hallucination, where they produce outputs not aligned with the given input. For instance, VLMs may mention objects that do not exist in the given images [15, 14].

A critical limitation of these models is their inability to explain the reasoning behind their outputs, especially when hallucination occurs. This is a significant drawback, as it makes the model’s mistakes difficult to identify and correct. We argue that these issues stem from the models being trained to directly output answers to instructions without including a reasoning process. By explicitly learning and outputting the reasoning process, models can provide more concrete and explanatory responses. Another limitation is that the models cannot correct their mistakes by themselves, as they lack the ability to reason about their own outputs. Humans, on the other hand, often ask questions when uncertain in order to acquire new knowledge and refine their answers. By explicitly outputting the reasoning process, models can be trained to generate questions during the reasoning steps, pause to acquire the necessary knowledge from a knowledgeable person, and then continue the reasoning process, thereby arriving at more accurate and reliable conclusions.

Therefore, our research aims to address these challenges by incorporating an explicit reasoning process and the ability to generate questions during reasoning. This approach is akin to the Chain-of-Thought reasoning seen in LLMs [36, 4], where models are prompted to explicitly reason out problems. However, due to VLMs’ relatively weaker text generation capabilities compared to LLMs, achieving Chain-of-Thought solely through prompting is challenging. To overcome this, we have created a novel dataset that includes the explicit reasoning process. Following mainstream methods in LLM-based dataset creation [18], we generate this data using LLMs, combining image annotations and a few manually created examples. This data includes scenarios where the model needs to generate questions when uncertain, training it to ask questions during ambiguous reasoning situations. We refer to this reasoning process as “Chain-of-Reasoning (CoR).”

To demonstrate the effectiveness of our approach, we fine-tune an existing VLM on the dataset we created. Trained on our dataset, the model acquires the ability to generate explicit reasoning steps and ask questions when uncertain, thereby improving the reliability of its inferences. Our contributions are summarized as follows:

• We present an approach that incorporates an explicit reasoning process and question-generation capability into VLMs, promoting more reliable inferences.
• We crafted a new dataset and utilized it for model training, setting a precedent for future VLM advancements.
• With our novel dataset and model, we gave the model the ability to generate explicit reasoning steps and to ask questions.

2 Related Work

2.1 Large-scale Vision-and-Language Models

The development of LLMs like ChatGPT [23], GPT-4 [24], and LLaMA [31] has paved the way for research into large-scale multi-modal models that extend beyond text-based modalities. This section highlights studies particularly relevant to our research, focusing on models integrating vision and language modalities.

BLIP-2 [11] and InstructBLIP [5] utilize pre-trained image encoders and text decoders, employing a Transformer model known as Q-Former as an adapter. The training of these models is typically divided into two stages. In the first stage, Q-Former learns the alignment between image and text using large-scale image-text paired datasets. The second stage involves training the Q-Former alongside the text decoder to enable the decoder to produce outputs according to given instructions. Generally, while the first training stage prioritizes the scale of the dataset, the second stage emphasizes the quality of smaller, high-quality instruction and answer datasets. This two-stage learning approach is widely adopted in the training of VLMs, as seen in models like BLIVA [9], MiniGPT (v1&2) [39, 2], and LLaVA (v1&1.5) [18, 17].

Additionally, there’s growing interest in VLMs that not only produce outputs aligned with overall image instructions but also focus on specific regions within images. This research direction is evident in models such as KOSMOS-2 [26], Shikra [3], MiniGPT v2 [2], and LLaVA v1.5 [17]. These models transform bounding box coordinates, representing input regions, into special tokens for textual representation. Another noteworthy model is GPT4RoI [38], which applies an RoIAlign based on the target region to the features outputted by the image encoder, thus obtaining features focused on the specified region.

2.2 Explicit Reasoning in V&L Tasks

In the field of natural language processing tasks, LLMs have shown the capability to perform explicit reasoning in a format known as Chain-of-Thought [4], thanks to their advanced language generation abilities. However, in vision-and-language (V&L) tasks, the adoption of explicit reasoning using large-scale models is still in its developmental phase compared to language-only tasks. The closest research in this area involves using LLMs to predict necessary reasoning steps in a programmatic format and subsequently calling APIs specialized for V&L tasks, such as in Visual Programming [7] and ViperGPT [30]. A significant limitation of these approaches is that the LLM predicting the reasoning steps does not have direct access to the images, leading to a lack of assurance that the reasoning is genuinely based on the image content. Our model addresses this gap by allowing the V&L model to directly predict reasoning steps, ensuring that the reasoning is firmly grounded in the image content.

2.3 Visual Question Generation

A related field to the concept of acquiring information through question generation is Visual Question Generation (VQG). In the early stages of VQG, the primary focus was on generating questions related to images, without any specific goal of acquiring knowledge [22, 12, 16, 35]. K-VQG [33, 34] does involve generating questions for knowledge acquisition; however, there has been no emphasis on utilizing the acquired knowledge in further reasoning steps.

Learning by Asking (LBA) is another relevant area. LBA is an approach where question generation is used to acquire data for learning purposes. For instance, LBA has been employed in tasks like VQA [21, 32, 34], caption generation [29], and scene graph completion [37]. However, our research significantly diverges from LBA, as we focus on generating questions as part of explicit reasoning during the inference process. The acquired answers are then immediately used to update and refine the reasoning quality, differentiating our approach from traditional LBA work.

3 Method

First, we introduce how we constructed our CoR dataset. Then, we explain the model details and training process.

3.1 Dataset Construction

You are an AI visual assistant, and you are seeing a single image (1000x1000 pixels). What you see are provided with captions, describing the same image you are looking at, and the coordinates of the bounding box of objects in the image. Provide the detailed reasoning steps for the given question as you are seeing the image. Please consider you cannot access the answer, but you can use a knowledgeable person to ask a question.

You need to generate reasoning steps in three settings:

1. You are not allowed to ask questions to a knowledgeable person. In this case, you may not reach the correct answer in the end.

2. You can ask questions to a knowledgeable person to gain additional knowledge. You must keep the same reasoning steps as the case 1, until you ask questions.

3. You are not allowed to ask questions, but you can see the answer to the question. So, you must construct your reasoning step to definitely reach the answer.

Note that the captions and bounding boxes are provided as a reference, so you need to behave like you are seeing the image.

IMPORTANT: The presence of these captions or descriptions, bboxes, and the answer must not be included in reasoning steps. Behave like you are actually seeing the image.

Thus, you must not use the word "caption", "description", or "bounding box" in the reasoning steps.

Especially, do not mention the coordinates of each object in the image. If you want to mention the position of the object, please refer to it by relative position, such as "at the left of the image" or "in front of xx".

Please provide reasoning steps only, do not include other sentences or contents.

The reasoning step should include only the information necessary to infer the answer to the question, and should not refer to anything that is not directly related to the question or answer.

Each reasoning step must contain only one sentence, and the total size of the reasoning steps must not be over seven.

Please add an uncertainty score [1-10, higher is more uncertain] to each reasoning step.

Avoid leaps of logic between each step, but if there are any, please increase the uncertainty score in that step.

IMPORTANT: In case 2, if there is any chance to decrease the uncertain score by asking questions to other knowledgeable people, you must ask questions to obtain additional information. Before asking the question, you need to imagine what kind of knowledge is needed to decrease the uncertainty, and the question must be one that you can get the knowledge. The question must not be a paraphrase of the given question. Include a specific statement of your imagined knowledge, question, and the answer to your question. Please continue the reasoning as if you could get the answer to the questions. Note that the question and the answer should not be counted as the reasoning steps.

Listing 1: Full example of our prompt given to GPT-4.

To realize CoR, we developed a novel dataset. This dataset was created using a combination of a specially designed prompt, manually curated examples, and the text generation capabilities of an LLM. We show an overview of the dataset construction process in Figure 2 and an example of the dataset in Figure 3.

The data structure of our dataset consists of the image, an instruction text of the task, a question (optional, if the task is in QA format), the answer, and the reasoning process required to derive the answer. We used different approaches to build the dataset depending on the types of annotations provided with the base dataset.
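As a minimal sketch of this record layout, one sample could be represented as follows. The class and field names, the file path, and the example reasoning step are hypothetical and chosen only for illustration; the actual storage format of the dataset is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningStep:
    text: str         # a single-sentence reasoning step
    uncertainty: int  # 1-10, higher means more uncertain


@dataclass
class CoRSample:
    image_path: str   # path or ID of the image (hypothetical field name)
    instruction: str  # instruction text of the task
    answer: str       # answer (or caption) for the task
    reasoning: List[ReasoningStep] = field(default_factory=list)
    question: Optional[str] = None  # present only for QA-format tasks


sample = CoRSample(
    image_path="images/000001.jpg",  # hypothetical path
    instruction="Answer the question about the image.",
    question="What holiday do we use this for?",
    answer="Christmas",
    reasoning=[ReasoningStep("A decorated tree stands in a living room.", 2)],
)
```

Keeping `question` optional lets captioning samples and VQA samples share one record type.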

(1) If the dataset had rich annotations, including captions and object detection annotations, such as labels and bounding boxes, we followed the LLaVA approach [18]. In this case, we used an LLM (GPT-4 [25]) that does not process images to build the dataset. Coupling these inputs with thoughtfully designed prompts (Listing 1) enables the LLM to generate the appropriate reasoning steps. It is crucial that the prompts make the LLM simulate the act of observing the image directly, without any reference to captions, descriptions, or bounding boxes. This keeps the reasoning process consistent with that of the VLMs being trained, which do view the image directly. To achieve this, the prompt given to the LLM explicitly states: “you must not use the word ‘caption’, ‘description’, or ‘bounding box’ in the reasoning steps. Especially, do not mention the coordinates of each object in the image. If you want to mention the position of the object, please refer to it by relative position, such as ‘at the left of the image’ or ‘in front of xx’.” Furthermore, in the prompt, we instruct the LLM to suppose that the image is of size 1000 $\times$ 1000 pixels. This ensures accurate position estimations based on the given textual coordinates of the bounding boxes. The prompts also emphasize succinct reasoning steps without redundancy, and the need to add an uncertainty score at the end of each reasoning step.
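The text-only view of an image can be sketched as below. This is an illustration under stated assumptions rather than the pipeline actually used: the `(x_min, y_min, x_max, y_max)` box format, the independent scaling of each axis onto the 1000 $\times$ 1000 canvas, and the layout of the rendered text are all hypothetical.

```python
from typing import List, Tuple

CANVAS = 1000  # the prompt tells the LLM to assume a 1000x1000 image

BBox = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def scale_bbox(bbox: BBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a pixel-space box onto the 1000x1000 canvas, scaling each axis independently."""
    x0, y0, x1, y1 = bbox
    sx, sy = CANVAS / width, CANVAS / height
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))


def build_context(captions: List[str],
                  objects: List[Tuple[str, BBox]],
                  width: int, height: int) -> str:
    """Render captions and scaled boxes as text for an LLM that cannot see the image."""
    lines = ["Captions:"] + [f"- {c}" for c in captions] + ["Objects:"]
    for label, bbox in objects:
        lines.append(f"- {label}: {list(scale_bbox(bbox, width, height))}")
    return "\n".join(lines)


context = build_context(
    captions=["A decorated tree in a living room."],
    objects=[("tree", (64, 48, 320, 240))],
    width=640,
    height=480,
)
```

Placing every image on the same canvas means a coordinate such as 100 always refers to the same relative position, which is what lets the LLM translate boxes into phrases like "at the left of the image".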

(2) If the dataset did not have such rich annotations, we used a model that accepts image input (GPT-4V) to build the dataset. We used prompts similar to those in the previous case, excluding the parts that required GPT to imagine the image from textual information.

We devised three variants of reasoning step data: without QA, with QA, and with GT. In the “without QA” setting, the reasoning process leading to the answer is conducted without generating any questions midway. This setting aims to mimic the process a VLM undergoes when attempting to answer a question directly. In this scenario, the LLM is instructed to generate reasoning steps without seeing the correct answer to the given task. We instruct the model to generate an uncertainty score along with each reasoning step, indicating how uncertain the model is about that step. This score plays a crucial role in the “with QA” setting described below.

In contrast, the “with QA” setting incorporates the generation of questions as part of the reasoning steps. This setup is crucial for training VLMs to perform explicit reasoning while simultaneously generating questions. As the model generates the reasoning steps, it also generates an uncertainty score for each step. If the score significantly rises at any point in the “without QA” setting, it triggers the generation of a question just before the step. This mechanism helps the model to identify moments of uncertainty and seek external information, thereby simulating a more interactive and dynamic reasoning process.
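The trigger can be illustrated with a simple rule. In our pipeline this behavior is elicited from the LLM through the prompt, not computed by code, so the sketch below is only an approximation: the function name and the threshold `min_rise` are hypothetical, since "significantly rises" is not quantified.

```python
from typing import List, Optional


def question_insertion_index(uncertainty: List[int], min_rise: int = 3) -> Optional[int]:
    """Return the index of the first step whose uncertainty score exceeds the
    previous step's score by at least `min_rise`, or None if no such step exists.
    A question would be inserted just before the returned step."""
    for i in range(1, len(uncertainty)):
        if uncertainty[i] - uncertainty[i - 1] >= min_rise:
            return i
    return None


# Scores follow the prompt's convention: 1-10, higher is more uncertain.
scores = [2, 2, 7, 3]
insert_before = question_insertion_index(scores)
```

Here the score jumps from 2 to 7 at the third step, so the question would be placed before that step. A rule of this form does not cover a high score on the very first step, which would need separate handling.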

Lastly, the “with GT” setting involves the LLM generating reasoning steps while having access to the correct answers to the tasks (i.e., the ground-truth answers for the given questions or captions for the given images). This approach differs from the previous two as the reasoning steps invariably lead to the correct answer, serving as a ground truth for correct reasoning without question generation. This data is primarily used to train VLMs that do not involve question generation in their reasoning process.

Dataset	Num. of samples	Num. of images	Avg. steps (without QA)	Avg. steps (with QA)	Avg. steps (with GT)
COCO Caption	5,857	5,782	6.27	8.75	6.20
VQA v2	5,755	5,633	4.68	7.36	4.54
OK-VQA	5,793	5,792	4.77	7.35	4.55
A-OKVQA	5,736	5,718	4.87	7.43	4.63
Visual Genome	5,883	5,609	4.34	7.31	4.12
Encyclopedic VQA	6,521	6,186	7.00	9.45	7.00
OVEN	5,685	5,685	6.99	9.51	6.99
Total	41,230	39,272	5.58	8.19	5.46

Table 1: Statistics of our dataset. Each row shows the statistics for one base dataset used to craft our dataset. “Num. of images” indicates the number of unique images in the dataset. “without QA”, “with QA”, and “with GT” indicate the average number of reasoning steps when the data is generated without questions and answers, with questions and answers, and with access to the ground-truth answers, respectively.
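As a consistency check, the "Total" row of Table 1 can be reproduced as a sample-weighted mean of the per-dataset averages. The sketch below assumes this is how the totals were aggregated; the figures are copied from the table and the variable names are illustrative.

```python
# (num_samples, avg_steps_without_qa, avg_steps_with_qa, avg_steps_with_gt), from Table 1
rows = {
    "COCO Caption":     (5857, 6.27, 8.75, 6.20),
    "VQA v2":           (5755, 4.68, 7.36, 4.54),
    "OK-VQA":           (5793, 4.77, 7.35, 4.55),
    "A-OKVQA":          (5736, 4.87, 7.43, 4.63),
    "Visual Genome":    (5883, 4.34, 7.31, 4.12),
    "Encyclopedic VQA": (6521, 7.00, 9.45, 7.00),
    "OVEN":             (5685, 6.99, 9.51, 6.99),
}

total_samples = sum(r[0] for r in rows.values())


def weighted_mean(column: int) -> float:
    """Average a per-dataset column, weighting each dataset by its sample count."""
    return sum(r[0] * r[column] for r in rows.values()) / total_samples


totals = [round(weighted_mean(c), 2) for c in (1, 2, 3)]
```

The resulting values agree with the 5.58, 8.19, and 5.46 reported in the "Total" row, and the sample counts add up to 41,230. The image counts do not add up to the reported total, which is expected when the same image appears in more than one base dataset.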

Each question step is designed to consist of three key elements: Imagined Knowledge Needed, the Question, and the Answer. Firstly, the “Imagined Knowledge Needed” component represents the knowledge the model predicts it needs to acquire during the reasoning process. Let us consider the example shown in Figure 2. Here, the model needs to answer the question “What holiday do we use this for?” given an image of Christmas decorations. The model might need more specific information about different holidays and their associated symbols or decorations.

Secondly, we include the “Question” that would facilitate the acquisition of this knowledge. For instance, to acquire the imagined knowledge, the model might generate a question such as “Which holiday is generally associated with a decorated tree placed in a living room?” This step is crucial as it reflects the model’s ability to formulate relevant questions based on the required knowledge.

The third component is the “Answer”, which is essentially the answer to the generated question. For example, the answer to the question above would be “Christmas”, and then the model is able to successfully acquire the knowledge it needs to answer the original question.

3.2 Dataset Statistics

We provide a comprehensive overview of our dataset statistics in Table 1. In our research, we based our dataset generation on seven datasets: COCO Caption [13], Visual Genome Caption [10], VQA v2 [6], OK-VQA [19], A-OKVQA [28], OVEN [8], and Encyclopedic-VQA [20]. We can categorize these datasets into three groups:
(1) Visual understanding tasks.
This group includes COCO Caption, Visual Genome Caption, and VQA v2, which are captioning tasks and VQA tasks that require the model to understand or describe the content of the image. Usually, these tasks do not require the model to have expert knowledge that cannot be directly observed from the image.
(2) Vision + common-sense understanding tasks.
This group includes OK-VQA and A-OKVQA, which require the model to have common-sense knowledge to answer the questions. Here, common-sense knowledge refers to general knowledge about topics such as geography, brands, and vehicles [19].
(3) Vision + encyclopedic knowledge tasks.
This group includes Encyclopedic-VQA and OVEN, which require the model to have encyclopedic knowledge to answer the questions. Encyclopedic knowledge refers to more in-depth and specific knowledge compared to common-sense knowledge. This includes knowledge such as the names of animal and plant species, or details about when and by whom a building in an image was constructed [20].

We assume that the difficulty of the reasoning process increases as we move from visual understanding tasks to vision + encyclopedic knowledge tasks, as the latter requires detailed and expert knowledge to answer the questions. For each of these datasets, we generated approximately 6,000 data samples, randomly selecting from their training sets. We show examples of each category in Figure 4.

From Table 1, we can see that when generating reasoning data without questions and answers (without QA) and with access to the ground-truth answers (with GT), the average number of reasoning steps is smaller than in the scenario where both questions and answers are included (with QA). This is attributed to the fact that the QA generation process inherently requires more reasoning steps.

The “with GT” setting shows slightly fewer reasoning steps than the without QA setting. This can be attributed to providing the correct answers to GPT-4, which likely allows for concise reasoning steps leading to the correct answer.

The COCO Caption, Encyclopedic-VQA, and OVEN datasets tend to have longer reasoning steps compared to the others. For COCO Caption, this is likely due to the nature of the task, which involves generating captions based on the entire content of the image, requiring a more comprehensive reasoning process. The Encyclopedic-VQA and OVEN datasets often contain questions that require very detailed observations or expert knowledge to answer, leading to longer reasoning steps. In contrast, the other tasks focus on specific questions or target regions within the images, thereby requiring less information to be incorporated into the reasoning steps.

3.3 Model Architecture

Our model fundamentally builds upon the LLaVA architecture [18], consisting of an image encoder, a text decoder, and an adapter component based on two-layer MLP blocks. We show an overview of our model in Figure 5. The input image is first resized to 336 $\times$ 336 pixels and passed through a pre-trained image encoder (e.g., CLIP-ViT [27]). The image encoder extracts the image features, which are then fed into the Adapter MLPs to transform the image features into a format that can be processed by the LLM. The instruction text is also fed into the LLM, and the model generates a text response based on the image features and instruction text.
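The adapter's role can be sketched in plain Python as follows. This is a toy illustration, not our implementation: the feature dimensions are deliberately tiny placeholders, the weights are random, and the GELU activation between the two linear layers is an assumption. The patch count assumes a ViT with patch size 14, as in the CLIP checkpoint named in Section 4.1.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]

VISION_DIM = 8  # toy size; placeholder for the image encoder's feature width
LLM_DIM = 16    # toy size; placeholder for the LLM's embedding width


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(n_in: int, n_out: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Random weights of shape (n_out, n_in) and a zero bias."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def adapter(patches: List[Vector], w1: Matrix, b1: Vector,
            w2: Matrix, b2: Vector) -> List[Vector]:
    """Two-layer MLP applied to each patch feature independently."""
    return [linear([gelu(h) for h in linear(p, w1, b1)], w2, b2) for p in patches]


rng = random.Random(0)
w1, b1 = init_linear(VISION_DIM, LLM_DIM, rng)
w2, b2 = init_linear(LLM_DIM, LLM_DIM, rng)

# A 336x336 input with patch size 14 yields a 24x24 grid of patch features.
num_patches = (336 // 14) ** 2

toy_patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]
image_tokens = adapter(toy_patches, w1, b1, w2, b2)
```

Each output vector has the width of the LLM's embeddings, so the projected patches can be placed in the input sequence alongside the embedded instruction text.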

Model	COCO Caption	Visual Genome	OK-VQA	A-OKVQA	VQA v2	Encyclopedic VQA	OVEN	Average
LLaVA (original)	2.258	1.424	2.414	2.332	2.604	1.606	1.521	2.023
CoR w/o question	1.621	1.653	2.224	2.054	2.524	1.520	1.582	1.883
CoR w/ uncertainty	1.748	[missing]	2.536	2.408	2.715	1.902	1.761	2.117
CoR w/ uncertainty, w/o knowledge	1.775	1.739	2.526	2.376	2.717	1.909	1.839	2.126
Ours CoR	1.769	1.782	2.631	2.459	2.737	1.925	1.836	2.163

Table 2: Evaluation results of our model on various datasets. The evaluation is conducted using GPT-4 to score the reasoning steps generated by the model. The scores range from 1 to 4, with 4 indicating a correct answer. The average score across all datasets is shown in the last column.

3.4 Training

We use our dataset to teach the model to generate reasoning steps and questions. In the training process, we took a pre-trained VLM that is already instruction-tuned and fine-tuned it on our dataset. During fine-tuning, we froze the parameters of the image encoder and updated the parameters of the text decoder and the adapter MLP. Along with the image and original question, we provided a specially designed prompt, such as “Analyze the image and outline your reasoning process step by step before providing your final answer,” to let the model generate reasoning steps and questions.

When the model is applied to test data, it performs a two-stage reasoning process to reach the final answer. The first stage encompasses the reasoning that leads up to the generation of a question. The second stage occurs after an external response to the question is obtained; the model is given the original inputs, the intermediate reasoning steps, and the external response. Then, the model generates the remaining reasoning steps based on this new information.
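This two-stage procedure can be sketched as a small control loop. The sketch makes several assumptions that are not part of our description: `generate` and `answerer` are hypothetical callables standing in for the fine-tuned VLM and the external answerer, and the `Question:` and `Answer:` markers are an illustrative output format.

```python
from typing import Any, Callable, Optional

QUESTION_TAG = "Question:"  # hypothetical marker for a generated question


def extract_question(text: str) -> Optional[str]:
    """Return the question if the partial reasoning ends with one, else None."""
    stripped = text.strip()
    last_line = stripped.splitlines()[-1] if stripped else ""
    if last_line.startswith(QUESTION_TAG):
        return last_line[len(QUESTION_TAG):].strip()
    return None


def two_stage_reasoning(image: Any, instruction: str,
                        generate: Callable[[Any, str], str],
                        answerer: Callable[[Any, str], str]) -> str:
    # Stage 1: reason until the model either finishes or asks a question.
    first = generate(image, instruction)
    question = extract_question(first)
    if question is None:
        return first
    # Obtain the external response to the generated question.
    reply = answerer(image, question)
    # Stage 2: resume from the original inputs, the intermediate steps, and the reply.
    context = f"{instruction}\n{first}\nAnswer: {reply}\n"
    rest = generate(image, context)
    return f"{first}\nAnswer: {reply}\n{rest}"


# Stub callables for illustration only; a real system would call the models here.
def fake_generate(image: Any, prompt: str) -> str:
    if "Answer:" in prompt:
        return "Step 2: The holiday is Christmas."
    return ("Step 1: A decorated tree is in the room.\n"
            "Question: Which holiday uses a decorated tree?")


def fake_answerer(image: Any, question: str) -> str:
    return "Christmas"


trace = two_stage_reasoning(None, "What holiday do we use this for?",
                            fake_generate, fake_answerer)
```

If the model never asks a question, the first stage already contains the full reasoning and no external call is made.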

In this study, we leverage GPT-4o as the external answerer. It is noteworthy that the questions generated by the model typically demand an understanding of the image content as well as common-sense and encyclopedic knowledge. Given its extensive training data and large number of parameters, the answerer is expected to handle the knowledge required to answer these questions.

4 Experiments

4.1 Implementation Details

Our experimental setup incorporates the following models: for image processing, we use CLIP-ViT-Large (https://huggingface.co/openai/clip-vit-large-patch14-336), and for text generation, we employ vicuna-7b-v1.5 (https://huggingface.co/lmsys/vicuna-7b-v1.5). The entire model is trained starting from a publicly available pre-trained checkpoint (https://huggingface.co/llava-hf/llava-1.5-7b-hf).

In the fine-tuning stage, we used 8 $\times$ NVIDIA H100 (80GB) GPUs for about 1 hour. We trained the model for one epoch with a total batch size of 128. The learning rate was increased linearly to a maximum of 2e-5 over the first 3% of the training steps, followed by a cosine decay schedule. Training was conducted using bfloat16 precision.
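The learning-rate schedule described above can be written as a short function. This is a sketch under assumptions: the cosine phase is taken to decay to zero by the final step, and the warmup length is rounded up to a whole number of steps; neither detail is specified in the text, and `total_steps` in the example is a placeholder.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to `peak_lr`, then cosine decay (assumed to reach zero)."""
    warmup_steps = max(1, math.ceil(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # placeholder; depends on dataset size and batch size
schedule = [learning_rate(s, total_steps) for s in range(total_steps + 1)]
```

The schedule starts at zero, reaches the peak of 2e-5 at the end of warmup, and then decreases monotonically.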

4.2 Evaluation Settings

In our experiments, we investigated how the presence or absence of CoR affects the behavior of the model when solving tasks. We also evaluated several ablations related to the CoR method. These included listing the necessary knowledge before question generation and generating an uncertainty score along with the reasoning steps. We used the validation sets of seven datasets for this evaluation: MS COCO, Visual Genome, VQAv2, OK-VQA, A-OKVQA, OVEN, and Encyclopedic VQA. Because evaluating on the entire validation sets is computationally expensive, we randomly sampled 1,000 instances from each dataset. Our experiments compared the model trained with CoR to a baseline model trained without question generation, specifically using the “with GT” setting from the dataset (referred to as “w/o Question”).

Currently, there are no established metrics for evaluating results that include long reasoning steps. Therefore, in this study, we used GPT-4 for scoring the answers. We instructed GPT-4 to extract the final answer from the reasoning steps and then evaluate the answer with a score from 1 to 4 (1: incorrect, 2: partially correct, 3: mostly correct, 4: correct).
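The aggregation of these scores can be sketched as follows. The function names are illustrative, and treating the reported average as an unweighted mean over the seven datasets is an assumption, although it is consistent with the figures in Table 2; the per-dataset values in the example are the "Ours CoR" row of that table.

```python
from statistics import mean
from typing import Dict, List

SCORE_LABELS = {1: "incorrect", 2: "partially correct", 3: "mostly correct", 4: "correct"}


def dataset_score(scores: List[int]) -> float:
    """Mean of per-sample scores for one dataset, after validating the 1-4 scale."""
    for s in scores:
        if s not in SCORE_LABELS:
            raise ValueError(f"score out of range: {s}")
    return mean(scores)


def macro_average(per_dataset: Dict[str, float]) -> float:
    """Unweighted mean over datasets (each is evaluated on the same number of samples)."""
    return mean(list(per_dataset.values()))


ours_cor = {
    "COCO Caption": 1.769, "Visual Genome": 1.782, "OK-VQA": 2.631,
    "A-OKVQA": 2.459, "VQA v2": 2.737, "Encyclopedic VQA": 1.925, "OVEN": 1.836,
}
average = round(macro_average(ours_cor), 3)
```

The computed value matches the 2.163 reported in the last column of Table 2 for this row.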

4.3 Results and Discussions

The results of our experiments are shown in Table 2. Also, we show an example of the predicted reasoning steps in Figure 6.

The evaluation results show that “Ours CoR” performs the best on average across various datasets. In contrast, the CoR model without questions (CoR w/o question) has the lowest performance. This might be because the current model cannot consistently provide coherent outputs over long reasoning processes. The challenge of developing VLMs that can produce coherent and consistent long reasoning steps in line with the given tasks remains an important area for future research.

Comparing “Ours CoR” with the original LLaVA, the latter performs relatively well on VQA v2 and COCO Caption, which are part of LLaVA’s training data. However, for datasets not included in LLaVA’s training data, like OVEN and Encyclopedic VQA, which require specialized knowledge, models that ask questions (such as “Ours CoR”) show significantly better performance. This indicates that asking questions helps the model acquire necessary specialized knowledge that it does not inherently possess.

Furthermore, outputting uncertainty did not significantly improve performance (CoR w/ uncertainty vs. Ours CoR). This is likely because generating uncertainty outputs makes the model’s responses longer and less consistent, leading to performance degradation. However, if the model can handle uncertainty effectively, it could ask questions during the most uncertain reasoning steps, potentially improving performance.

Comparing models with and without prior knowledge output (CoR w/o knowledge vs. Ours CoR), it’s evident that pre-outputting necessary knowledge improves performance. This prevents the model from asking irrelevant questions and ensures that it focuses on helpful inquiries, leading to better results.

5 Conclusion

In this paper, we have presented a novel approach to improving the capabilities of VLMs. This was achieved by incorporating a structured CoR and the ability to generate questions during the reasoning process. The model was trained on a dataset specifically designed to include an explicit reasoning and question-asking process.

We proposed an architecture for a VLM that leverages pre-trained components such as an image encoder and text decoder, all of which are fine-tuned using our novel dataset. Our experimental results suggest that the model’s ability to generate questions contributes to its performance, compared to a baseline model that does not generate questions. This underscores the potential utility of integrating explicit reasoning processes into VLMs.

Acknowledgements

This work was partially supported by JST Moonshot R&D Grant Number JPMJPS2011, CREST Grant Number JPMJCR2015 and Basic Research Grant (Super AI) of Institute for AI and Beyond of the University of Tokyo.

References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, volume 33, 2020.
[2] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
[3] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[4] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. A survey of chain of thought reasoning: Advances, frontiers and future. arXiv preprint arXiv:2309.15402, 2023.
[5] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
[6] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. IJCV, 127:398–414, 2016.
[7] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14953–14962, June 2023.
[8] Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In ICCV, 2023.
[9] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. BLIVA: A simple multimodal llm for better handling of text-rich visual questions. arXiv preprint arXiv:2308.09936, 2023.
[10] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[11] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
[12] Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, and Xiaogang Wang. Visual question generation as dual task of visual question answering. In CVPR, 2018.
[13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.
[15] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
[16] Feng Liu, Tao Xiang, Timothy M. Hospedales, Wankou Yang, and Changyin Sun. IVQA: Inverse visual question answering. In CVPR, 2018.
[17] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[19] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
[20] Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories. In ICCV, 2023.
[21] Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens van der Maaten. Learning by asking questions. In CVPR, June 2018.
[22] Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. In ACL, 2016.
[23] OpenAI. OpenAI: Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
[24] OpenAI. GPT-4 Technical Report. ArXiv, abs/2303.08774, 2023.
[25] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[26] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. ArXiv, abs/2306, 2023.
[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[28] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
[29] Kevin Shen, Amlan Kar, and Sanja Fidler. Learning to caption images through a lifetime by asking questions. In ICCV, 2019.
[30] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In ICCV, 2023.
[31] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
[32] Kohei Uehara, Nan Duan, and Tatsuya Harada. Learning to ask informative sub-questions for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4681–4690, June 2022.
[33] Kohei Uehara and Tatsuya Harada. K-VQG: Knowledge-aware visual question generation for common-sense acquisition. In WACV, 2023.
[34] Kohei Uehara and Tatsuya Harada. Learning by asking questions for knowledge-based novel object recognition. International Journal of Computer Vision, 132(6):2290–2309, 2024.
[35] Kohei Uehara, Antonio Tejero-De-Pablos, Yoshitaka Ushiku, and Tatsuya Harada. Visual question generation for class acquisition of unknown objects. In ECCV, 2018.
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
[37] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Visual curiosity: Learning to ask questions to learn visual recognition. In CoRL, 2018.
[38] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. GPT4RoI: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.
[39] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

A. Prompt Details

You are a helpful assistant to a user who is trying to answer the question about an image.

The user has provided a question and a reasoning process that led to their answer.

The user asks you a question to collect knowledge that will help them answer the question.

You should provide a response that helps the user answer the question.

Keep the answer as concise as possible; single-sentence answers are best.

Your answer must be in JSON formatted dictionary with the key "answer".

Generate json-formatted dictionary only, do not include any other information like formatting, etc.

example:

{"answer": "Gothic architecture"}

Listing 2: Full example of our prompt given to GPT-4 to answer the question generated by our model.

You are a helpful assistant to a person who is trying to answer the question about an image from a examiner.

The person has provided a reasoning process to reach the final answer, and then provide a final answer.

Your task is to evaluate the answer in score 1-4.

1: The answer is wrong.

2: The answer is somewhat correct; meaning of the answer is similar to the correct answer but details are wrong.

3: The answer is mostly correct; meaning of the answer is correct but some details are wrong.

4: The answer is completely correct.

Before you evaluate the answer, you need to extract the final answer from the user’s full response.

When extracting the answer, you should not modify the user’s wording; just extract the answer as it is.

Your answer must be in JSON formatted dictionary with the key "answer" (str) and "score" (int).

Generate json-formatted dictionary only, do not include any other information like formatting, etc.

example:

{"answer": "Gothic architecture", "score": 4}

{"answer": "Cathedral", "score": 2}

{"answer": "School", "score": 1}

{"answer": "Baroque architecture", "score": 3}

Listing 3: Full example of our prompt given to GPT-4 for answer evaluation.

Here, we provide the full prompts given to GPT-4 for answer generation and answer evaluation. The prompt for answer generation is shown in Listing 2, and the prompt for answer evaluation is shown in Listing 3. Each prompt consists of an instruction followed by example outputs that guide GPT-4 in generating or evaluating the answer, respectively. We instructed GPT-4 to return its output as a JSON dictionary, which enables us to parse the response and evaluate it automatically.
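As an illustration of how such JSON replies could be parsed automatically, the following is a minimal sketch in Python. It assumes only the output format stated in the evaluation prompt (a dictionary with the keys "answer" and "score", with scores from 1 to 4). The function names `parse_evaluation` and `mean_score` are hypothetical and are not taken from our implementation; the sample replies are the example outputs from the evaluation prompt, not experimental results.

```python
import json
from typing import List, Tuple

# Scores allowed by the evaluation prompt (1: wrong ... 4: completely correct).
VALID_SCORES = {1, 2, 3, 4}


def parse_evaluation(reply: str) -> Tuple[str, int]:
    """Parse one evaluator reply such as '{"answer": "Cathedral", "score": 2}'.

    Raises ValueError if the reply is not valid JSON or does not follow
    the format requested in the prompt.
    """
    data = json.loads(reply)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON dictionary")
    answer = data.get("answer")
    score = data.get("score")
    if not isinstance(answer, str):
        raise ValueError("'answer' must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in VALID_SCORES:
        raise ValueError("'score' must be an integer from 1 to 4")
    return answer, score


def mean_score(replies: List[str]) -> float:
    """Average the scores over a list of evaluator replies (hypothetical helper)."""
    scores = [parse_evaluation(reply)[1] for reply in replies]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # The example outputs listed in the evaluation prompt.
    examples = [
        '{"answer": "Gothic architecture", "score": 4}',
        '{"answer": "Cathedral", "score": 2}',
        '{"answer": "School", "score": 1}',
        '{"answer": "Baroque architecture", "score": 3}',
    ]
    print(mean_score(examples))
```

Rejecting malformed replies with an exception, rather than silently assigning a default score, is one possible design choice; it keeps formatting failures of the evaluator separate from genuinely low scores.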

B. More Examples of the Dataset

In Figures 7 and 8, we show more examples from the dataset we constructed from the following source datasets: OK-VQA, A-OKVQA, MS COCO caption, Visual Genome caption, OVEN, and Encyclopedic-VQA. Each example consists of an image, a question, and a reasoning chain that explains how to answer the question or generate a caption for the image. We provide three types of reasoning chains: (1) reasoning steps without answer and without question, in which the data is generated without access to the ground-truth answer and there is no question generation step; (2) reasoning steps with answer and with question, in which the data is generated with access to the ground-truth answer and there is a question generation step; and (3) reasoning steps with answer and without question, which is the same as the second type but omits the question generation step.

C. More Qualitative Results

We show more qualitative results of the reasoning steps generated by our model for VQA, A-OKVQA, COCO Caption, Encyclopedic VQA, OVEN, OK-VQA, and Visual Genome in Figures 10, 11, 12, and 13.