ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data

Yufan Shen^∗ (Zhejiang University), Chuwei Luo^∗† (Alibaba Group), Zhaoqing Zhu^∗ (Alibaba Group), Yang Chen (Zhejiang University), Qi Zheng (Alibaba Group), Zhi Yu^† (Zhejiang University), Jiajun Bu (Zhejiang University), and Cong Yao (Alibaba Group)

Abstract.

Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on the document visual question answering (VQA) task, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial in constructing instruction data with high efficacy, which, in turn, facilitates the training of LLMs and MLLMs for document VQA. However, most existing evaluation methods for instruction data are limited to the textual content of the instructions themselves, thereby hindering the effective assessment of document instruction datasets and constraining their construction. In this paper, we propose ProcTag, a data-oriented method that assesses the efficacy of document instruction data. ProcTag performs tagging on the execution process of instructions rather than on the instruction text itself. By leveraging the diversity and complexity of these tags to assess the efficacy of the given dataset, ProcTag enables selective sampling or filtering of document instructions. Furthermore, DocLayPrompt, a novel semi-structured layout-aware document prompting strategy, is proposed for effectively representing documents. Experiments demonstrate that sampling existing open-source and generated document VQA/instruction datasets with ProcTag significantly outperforms current methods for evaluating instruction data. Notably, with ProcTag-based sampling in the generated document datasets, only 30.5% of the document instructions are required to achieve 100% efficacy compared to the complete dataset. The code and generated instruction data will be made publicly available.

^∗ Equal contribution. ^† Corresponding author.

1. Introduction

Document visual question answering (VQA), which closely aligns with general document artificial intelligence (Cui et al., 2021; Han et al., 2023), is currently a significant research and application area. With the remarkable successes achieved by Large Language Models (LLMs) (OpenAI, 2022; Brown et al., 2020; Touvron et al., 2023a, b; Bai et al., 2023a) and Multimodal Large Language Models (MLLMs) (OpenAI, 2023; Bai et al., 2023b; Liu et al., 2023c, a; Ye et al., 2023b) across various real-world tasks, the application of these models (Zhang et al., 2023; Ye et al., 2023a; Yang et al., 2023; Shi et al., 2023; Liu et al., 2023b) to document VQA has seen substantial advancements.

To achieve good document VQA results with LLMs/MLLMs, recent works (Ye et al., 2023a; Zhang et al., 2023) perform instruction tuning (Wei et al., 2022a; Ouyang et al., 2022) on document instruction datasets. To facilitate effective instruction tuning, the instruction datasets should have high efficacy, which means that the instructions in the dataset are expected to be sufficiently diverse and complex (Xu et al., 2023; Mukherjee et al., 2023; Wang et al., 2024). Instruction datasets are often collected through manual annotation or generated using LLMs/MLLMs (Taori et al., 2023; Ding et al., 2023). Building upon this, recent works (Lu et al., 2023; Li et al., 2023) have proposed several data-oriented methods to evaluate the efficacy of instruction datasets, thereby facilitating the selection of data that is both diverse and complex. These methods have been proven effective in guiding the assessment of instruction data quality.

Figure 1. The same instruction text can lead to entirely different execution processes when applied to different types of documents.

However, existing evaluation methods for instruction data are biased toward the general domain and primarily focus on the textual content of the instructions, resulting in suboptimal performance in the document understanding domain. In the document VQA task, effectively assessing the efficacy of instruction datasets requires a comprehensive understanding and differentiation of the document’s content, layout, instructions, and the instruction execution process. As illustrated in Figure 3, when the data method InsTag (Lu et al., 2023), which assesses the efficacy of instruction datasets by the diversity and complexity of instruction text, is applied to the DocVQA dataset for training LLMs, the performance of LLMs trained on the filtered dataset differs minimally from that of models trained on randomly sampled data. The primary issue is that, in document understanding, the same instruction text may lead to entirely different data categories depending on the instruction execution process. Data methods that focus only on the instruction text cannot effectively distinguish these differences. Consider the instruction “What is Tom’s phone number?” shown in Figure 1. When evaluated using methods like InsTag, which consider only the text of the instruction, this instruction would be tagged as “information extraction/entity extraction”. In the document understanding scenarios of Figure 1 (a), (b), and (c), however, the same instruction entails three entirely different instruction execution processes. In (a), the model must locate and understand a paragraph within the document; in (b), it must find key-value regions for extraction before comprehension; and in (c), it must locate the relevant table region for table understanding. The same instruction text in different document scenarios therefore represents entirely different instruction types. Currently, there is a notable lack of comprehensive research on assessing the efficacy of document instruction data, which hinders the development of high-quality instruction datasets for training LLMs/MLLMs for document VQA. Therefore, it is crucial to explore effective methods for assessing the efficacy of document instruction datasets.

To this end, we propose ProcTag, a metric for assessing the efficacy of document instruction data by instruction execution process tagging. ProcTag focuses on modeling the process of document instruction execution rather than the instruction text itself. Empowered by the high-performing LLM GPT-3.5, including its chain-of-thought (Wei et al., 2022b; Kojima et al., 2022) reasoning capability, ProcTag introduces a framework that prompts GPT-3.5 to generate the process of document instruction execution. To better represent the process, ProcTag requires GPT-3.5 to express the process in code and then assign tags to the generated process. These tags are used to measure the diversity and complexity, representing the efficacy of the instruction data. Furthermore, to comprehensively represent document content and utilize GPT-3.5 more effectively, we also devise DocLayPrompt, a novel semi-structured prompting strategy based on document layout information for document representation. Compared to existing layout-aware prompts (Lamott et al., 2024), DocLayPrompt outperforms them in capturing the layout information for document representation. Given the relative scarcity of document instruction datasets, we apply the ProcTag method to an existing open-source document VQA dataset and several generated document instruction datasets. After applying ProcTag for data selection, the datasets are used to train both LLMs and MLLMs for experiments. Experimental results demonstrate the effectiveness of the proposed ProcTag method in assessing document instruction data with efficacy when compared to existing data methods and random sampling.

Our contributions are summarized as follows:

1) This paper introduces a framework, ProcTag, which models the instruction execution process instead of the instruction text itself, as a method for assessing the efficacy of document instruction data. To the best of our knowledge, it is the first to explore the instruction execution process as a data quality assessment method for document understanding.

2) To effectively model the instruction execution process, ProcTag expresses it for document VQA problems in code and distinguishes the process through tagging. Additionally, a semi-structured layout prompting strategy named DocLayPrompt, which incorporates document layout information for effective document representation, is proposed.

3) The proposed ProcTag method, when applied to an existing open-source document VQA dataset and generated document understanding instruction datasets, significantly outperforms existing data methods when training LLMs and MLLMs. Additionally, by guiding the generation and filtering of document understanding instruction datasets using ProcTag, only 30.5% of the document instructions are required to achieve 100% efficacy compared to the complete dataset.

2. Related Works

LLMs/MLLMs for Document VQA. Recently, there have been broad discussions within the community regarding LLMs such as ChatGPT (OpenAI, 2022), LLaMA (Touvron et al., 2023a, b) and Qwen (Bai et al., 2023a), and MLLMs like GPT-4V (OpenAI, 2023), Gemini (Team et al., 2023) and Qwen-VL (Bai et al., 2023b). These models have achieved considerable success in a wide range of downstream AI applications. Concurrently, leveraging LLMs/MLLMs for document AI (Yang et al., 2023; Liu et al., 2023b; Perot et al., 2023; Shi et al., 2023), especially document VQA tasks (Zhang et al., 2023; Ye et al., 2023a; Bai et al., 2023b), has demonstrated remarkable performance over previous document pre-trained models (Xu et al., 2020, 2021; Huang et al., 2022; Li et al., 2021a; Luo et al., 2022; Li et al., 2021b; Appalaraju et al., 2021; Gu et al., 2021; Peng et al., 2022; Luo et al., 2023). LLaVAR (Zhang et al., 2023) enhances the text comprehension capabilities of LLaVA (Liu et al., 2023c) by gathering text-rich images and constructing a corresponding instruction tuning dataset. Moreover, building upon the foundation of mPLUG-Owl (Ye et al., 2023b), mPLUG-DocOwl (Ye et al., 2023a) creates an instruction tuning dataset for various visual-text understanding tasks. It also develops an OCR-free document instruction comprehension evaluation set, LLMDoc, to better compare models in terms of instruction compliance and document understanding. Likewise, Qwen-VL (Bai et al., 2023b) utilizes high-quality and fine-grained visual-language (VL) instruction datasets, including document-related instruction fine-tuning datasets, to achieve its multimodal multitask understanding capabilities. LayoutLLM (Luo et al., 2024) proposes a layout instruction tuning method and a document instruction tuning dataset for document understanding. It also devises a module called layout chain-of-thought (LayoutCoT), which is effective for document understanding, to represent the document-related instruction processes. The success of the aforementioned efforts demonstrates that, for document-related tasks, document instruction datasets are indispensable.

Instruction Data Methods. Instruction tuning datasets (Wei et al., 2022a; Ouyang et al., 2022) are crucial for calibrating LLMs/MLLMs to align accurately with human instructions. The majority of these datasets are sourced by manual annotation or generated by LLMs and MLLMs (Taori et al., 2023; Ding et al., 2023). Recent efforts in the community have recognized the importance of establishing robust instruction dataset evaluation methods to optimize the utility of these datasets (Li et al., 2023; Lu et al., 2023). Li et al. (2023) introduce the Instruction-Following Difficulty (IFD) metric, which autonomously screens for high-quality instruction data by identifying discrepancies between a model’s expected responses and its generative outputs. Despite its innovative approach, this method is hindered by the need for additional training on a pre-experienced model. In contrast, InsTag (Lu et al., 2023) proposes a more cost-efficient alternative that forgoes the need for model retraining. By leveraging existing LLMs to annotate instruction data, InsTag assesses quality along two dimensions: diversity and complexity. This approach also highlights that the quality of instruction datasets strongly affects model performance. For document AI tasks, publicly available document instruction datasets are scarce. Furthermore, the above evaluation metrics are biased towards the general domain and focus primarily on textual instructions. These metrics cannot be directly applied to complex multimodal tasks like document VQA, which relies not only on instruction text but also on the understanding of layout and different problem-solving processes. To employ document instruction datasets more effectively and enhance the document understanding capability of LLMs/MLLMs, it is necessary to develop an evaluation metric suitable for assessing instruction data in document VQA.

3. ProcTag

Leveraging tags to measure diversity and complexity is an effective way to assess the efficacy of instruction datasets (Lu et al., 2023). As shown in Figure 2, ProcTag tags document instruction datasets by modeling the instruction execution process in three steps: document representation (3.1), instruction execution process generation (3.2), and process tagging (3.3).

3.1. Document Representation

For tagging document instruction datasets, effectively representing document content with rich layout information plays an essential role. A natural idea is to input document images directly into MLLMs; however, employing models such as GPT-4V is both costly and inefficient. Meanwhile, existing research (Lamott et al., 2024) has demonstrated that prompting LLMs with a purely textual representation of documents can also achieve satisfactory results in various document understanding tasks. For instance, LATIN-Prompt (Wang et al., 2023) and the SpatialFormat-prompt (Lamott et al., 2024), widely acknowledged schemes for textual representation, utilize Optical Character Recognition (OCR) results to reconstruct layout information using spaces and line breaks. However, these approaches employ only the text content of documents and fail to consider layout details in document representation, which can lead to ambiguities during the document comprehension process. For example, if a document is represented solely by its text content, elements such as Page Head and Title become difficult to distinguish, as they both reside at the top of the page and share the same font once converted to text.

Considering the aforementioned constraint, a novel semi-structured approach for representing layout-aware documents, termed DocLayPrompt, is proposed. As Figure 2 (a) shows, DocLayPrompt incorporates layout information into the document representation process, which is vital for tagging document understanding instruction datasets, as it aids in the comprehensive modeling of a document’s content.

In DocLayPrompt, given an input document image $\mathcal{D}$ , the OCR and layout detection tools are used to extract its corresponding structural information: OCR results $\mathcal{O}=<\mathcal{S},\mathcal{B}_{O}>$ and layout results $\mathcal{L}=<\mathcal{T},\mathcal{B}_{L}>$ . Specifically, $\mathcal{S}=\{s_{1},s_{2},\ldots,s_{n}\}$ enumerates the extracted text sequences while $\mathcal{B}_{O}=\{b_{o1},b_{o2},\ldots,b_{on}\}$ corresponds to their associated bounding box coordinates within the OCR results. Concurrently, $\mathcal{T}=\{t_{1},t_{2},\ldots,t_{n}\}$ denotes the identified layout types with $\mathcal{B}_{L}=\{b_{l1},b_{l2},\ldots,b_{ln}\}$ specifying the respective bounding box coordinates for these layout components. Then, these inputs are used to construct DocLayPrompt as follows:

1) Reorder and Clean Inputs: To accurately represent real documents, it is imperative that their structural information (OCR and layout results) be both ordered and clean. This necessitates the reordering of OCR and layout outcomes, which is achieved by employing bounding box coordinates $\mathcal{B}_{O}$ and $\mathcal{B}_{L}$ , respectively. Moreover, to eliminate redundant layout instances, Non-Maximum Suppression (NMS) is utilized, yielding the cleaned inputs: $\mathcal{O}^{C}$ and $\mathcal{L}^{C}$ .

$$\mathcal{O}^{C},\mathcal{L}^{C}\leftarrow\textit{getCleanInputs}(\mathcal{O},\mathcal{L}). \tag{1}$$
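A minimal sketch of what a routine like getCleanInputs could look like is given below. The paper does not specify the ordering rule, the NMS scoring, or the overlap threshold, so the `(x0, y0, x1, y1)` box format, the top-to-bottom then left-to-right ordering, the per-component confidence score, and the IoU threshold of 0.5 are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reading_order(items: List[tuple]) -> List[tuple]:
    """Sort (payload, box) pairs top-to-bottom, then left-to-right (assumed rule)."""
    return sorted(items, key=lambda item: (item[1][1], item[1][0]))


def nms(layouts: List[tuple], iou_threshold: float = 0.5) -> List[tuple]:
    """Greedy NMS over (layout_type, box, score) triples; threshold is hypothetical."""
    kept = []
    for cand in sorted(layouts, key=lambda l: l[2], reverse=True):
        if all(iou(cand[1], k[1]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


def get_clean_inputs(ocr, layouts, iou_threshold: float = 0.5):
    """Return reordered OCR pairs and de-duplicated, reordered layout pairs."""
    clean_layouts = [(t, b) for t, b, _ in nms(layouts, iou_threshold)]
    return reading_order(ocr), reading_order(clean_layouts)
```

Here `ocr` is a list of `(text, box)` pairs and `layouts` is a list of `(layout_type, box, score)` triples; a real pipeline would likely need a more robust reading-order rule for multi-column pages.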

2) Associate Structural Information: $\mathcal{O}^{C}$ and $\mathcal{L}^{C}$ both represent structural information within documents. To achieve a unified representation, it is necessary to associate these disparate sources of information to obtain the associated structural information $\mathcal{A}_{OL}$ . Considering that $\mathcal{L}^{C}$ encompasses a broader structural granularity than $\mathcal{O}^{C}$ , DocLayPrompt integrates a set of OCR results $\{\mathcal{O}^{C}_{i}\}^{n}_{i=m}$ into their encompassing layout component $\mathcal{L}^{C}_{j}$ for tight association. In particular, any $\mathcal{O}^{C}_{i}$ that is not encompassed by a layout component is associated with the nearest layout component $\mathcal{L}^{C}_{j}$ in terms of Euclidean distance $d(\cdot)$ .

$$\begin{split}&\mathcal{A}_{OL}\leftarrow\{(\{\mathcal{O}^{C}_{i}\}^{n}_{i=m},\mathcal{L}^{C}_{j})\},\\ &\text{ where }\{\mathcal{B}_{Oi}\}^{n}_{i=m}\subseteq\mathcal{B}_{Lj}\text{ or }\underset{i,j}{\arg\min}\,d(\{\mathcal{B}_{Oi}\}^{n}_{i=m},\mathcal{B}_{Lj}).\end{split} \tag{2}$$
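The association rule can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes `(x0, y0, x1, y1)` boxes, treats "encompassed" as full containment, and measures the Euclidean distance between box centers, none of which is specified in the text.

```python
import math


def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def associate(ocr, layouts):
    """Group OCR (text, box) pairs under layout (type, box) components.

    An OCR item goes to the first layout box that contains it; otherwise it
    goes to the layout whose center is nearest (assumed distance definition).
    Assumes `layouts` holds at least one component.
    """
    groups = [[] for _ in layouts]
    for text, box in ocr:
        idx = next((j for j, (_, lb) in enumerate(layouts) if contains(lb, box)), None)
        if idx is None:
            idx = min(range(len(layouts)),
                      key=lambda j: math.dist(center(box), center(layouts[j][1])))
        groups[idx].append(text)
    return [(ltype, lbox, texts) for (ltype, lbox), texts in zip(layouts, groups)]
```

In this sketch a stray OCR line just below a table is attached to the table rather than to a distant title, which matches the intent of the nearest-component fallback.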

3) Represent Document: Finally, concatenate $\mathcal{A}_{OL}$ sequentially through LATIN-Prompt and layout type tags to form the document representation $\mathcal{R}$ .

$$\mathcal{R}\leftarrow\textit{getDocRep}(\mathcal{A}_{OL}). \tag{3}$$
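The exact serialization used by getDocRep is not given in the text. The sketch below shows one plausible form under stated assumptions: each layout component is wrapped in a hypothetical `<Type> ... </Type>` tag pair, and the horizontal position of each line is approximated with leading spaces in the spirit of LATIN-Prompt. The tag syntax and the `units_per_space` scale are illustrative choices only.

```python
def get_doc_rep(associated, units_per_space=10):
    """Serialize associated components into a semi-structured text prompt.

    `associated` is a list of (layout_type, lines), where each line is a
    (text, x0) pair and x0 is the left coordinate of the line. The tag format
    and the indentation scale are assumptions, not the paper's specification.
    """
    parts = []
    for layout_type, lines in associated:
        parts.append(f"<{layout_type}>")
        for text, x0 in lines:
            indent = " " * (int(x0) // units_per_space)  # spaces stand in for x-offset
            parts.append(indent + text)
        parts.append(f"</{layout_type}>")
    return "\n".join(parts)
```

With such a representation, a Page Head and a Title at the top of the page remain distinguishable because each carries its own layout type tag.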

3.2. Instruction Execution Process Generation

As shown in Figure 1, the same instruction text can have entirely different instruction execution processes when it is applied to different types of documents. To effectively distinguish between document instruction data, it is critical to differentiate based on the process of instruction execution rather than the textual content of the instructions themselves. To obtain the document instruction execution process, inspired by the chain-of-thought (CoT) (Wei et al., 2022b; Kojima et al., 2022) ability of LLMs, ProcTag utilizes GPT to generate it. To provide a more precise and concise description of the instruction execution process, ProcTag employs a pseudo-code representation. Furthermore, to enhance the accuracy of the pseudo-code, GPT is guided to first output a step-by-step textual description of the instruction execution and then generate the corresponding pseudo-code.

As shown in Figure 2 (b), the structural representation $\mathcal{R}$ and corresponding question $\mathcal{Q}$ for the document instruction dataset are employed for prompting GPT to generate instruction execution process content $\mathcal{P}$ . To ensure sufficient and clear discriminability for subsequent tagging, the generated $\mathcal{P}$ includes a chain-of-thought (CoT) for instruction execution and the corresponding tightly coupled and distinct pseudo-code execution process, where the input of each step is the output of the previous step.
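To make the chaining requirement concrete, the following is a hypothetical example of the kind of process that could be produced for the table scenario of Figure 1 (c). It is not actual GPT output from the paper: the function names, the toy document structure, and the phone numbers are all invented for illustration.

```python
# Hypothetical execution process for "What is Tom's phone number?" on a
# document whose answer lives in a table. Each step consumes the output of
# the previous step, which is the property the generated process must have.

def find_table(document):
    """Step 1: locate the table region in the (toy) document."""
    return document["table"]


def locate_row(table, name):
    """Step 2: find the row that mentions the queried person."""
    return next(row for row in table if row["Name"] == name)


def extract_cell(row, column):
    """Step 3: read the requested field from that row."""
    return row[column]


def execute(document):
    table = find_table(document)
    row = locate_row(table, "Tom")
    return extract_cell(row, "Phone")
```

For the paragraph and key-value scenarios of Figure 1 (a) and (b), the same question would yield a different chain of functions, and it is this difference that the later tagging stage is meant to capture.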

3.3. Process Tagging

To obtain distinctive and denoised process tags for assessment, ProcTag normalizes process tags through 3 stages: Function Name Extraction, Frequency Filtering, and Aggregation.

As shown in Figure 2 (c), Function Name Extraction is used to obtain the function names of the pseudo-code in $\mathcal{P}$ as the raw process tags. Then, following InsTag, Frequency Filtering and Aggregation are applied to the raw process tags to filter out tags that appear too rarely and to aggregate similar tags (e.g., “find_table” and “extract_table”), respectively. For Frequency Filtering, a threshold on the number of occurrences is established, and tags below it are removed. To effectively aggregate semantically similar tags with code-like formats, a code language model is utilized to obtain embeddings for these code-formatted tags, which are then clustered. The final instruction tags obtained through the above stages are used for subsequent data assessment and experiments.
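The first two stages can be sketched with the standard library alone. This sketch assumes that the pseudo-code uses Python-style `def name(...)` definitions and that tag frequency is counted over the whole dataset; both are assumptions, and the embedding-based Aggregation stage is deliberately omitted.

```python
import re
from collections import Counter

# Matches Python-style function definitions; the pseudo-code syntax is assumed.
DEF_PATTERN = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)


def extract_function_names(pseudo_code):
    """Function Name Extraction: return function names in order of appearance."""
    return DEF_PATTERN.findall(pseudo_code)


def frequency_filter(tags_per_sample, min_count):
    """Frequency Filtering: drop tags seen fewer than `min_count` times overall."""
    counts = Counter(tag for tags in tags_per_sample for tag in tags)
    return [[t for t in tags if counts[t] >= min_count] for tags in tags_per_sample]
```

For example, with `min_count=2` a tag such as `read_footnote` that occurs only once in the dataset is removed, while `find_table` is kept if it occurs in two or more processes.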

3.4. Efficacy Assessment by Tags

ProcTag defines the efficacy of an instruction dataset through two attributes of its process tags: complexity and diversity. Here, complexity refers to the number of different tags present within the dataset, while diversity denotes the average number of tags per instruction. Higher levels of complexity and diversity signify a dataset with higher efficacy, whereas lower levels indicate lower efficacy. Following the InsTag approach to data sampling via tags, ProcTag selects subsets by optimizing for the highest complexity under the condition of maximal tag diversity, thereby yielding sub-datasets with superior efficacy.
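As a rough sketch, the two tag statistics and a tag-driven sampler in the spirit of InsTag's sampling can be written as follows. The sampler shown here (sort by tag count, then greedily keep instructions that contribute tags not yet covered in the current pass) is an assumed reading of that procedure, not a verified reproduction of either paper's code.

```python
def tag_statistics(tags_per_sample):
    """Return (number of distinct tags, average number of tags per instruction)."""
    distinct = set().union(*tags_per_sample)
    if not tags_per_sample:
        return 0, 0.0
    average = sum(len(set(tags)) for tags in tags_per_sample) / len(tags_per_sample)
    return len(distinct), average


def sample_by_tags(tags_per_sample, budget):
    """Pick up to `budget` sample indices, favoring many tags and new tags."""
    remaining = sorted(range(len(tags_per_sample)),
                       key=lambda i: -len(set(tags_per_sample[i])))
    selected = []
    while remaining and len(selected) < budget:
        covered, leftover = set(), []
        for i in remaining:
            tags = set(tags_per_sample[i])
            if len(selected) < budget and not tags <= covered:
                selected.append(i)      # contributes at least one uncovered tag
                covered |= tags
            else:
                leftover.append(i)      # reconsidered in the next pass
        if len(leftover) == len(remaining):
            break                       # only untagged samples remain
        remaining = leftover
    return selected
```

Each pass starts with an empty covered set, so once every tag has been covered the sampler begins a new pass over the leftover instructions until the budget is reached.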

Table 1. The quality evaluation of the tags generated by ProcTag. Tagging precision and consistency are utilized for evaluating ProcTag. The Cohen’s kappa score is used to represent the agreement between GPT-4 and human.

Metric	GPT-4	Human	Human-GPT Agreement
Tag Precision	96%	92%	0.65
Tag Consistency	80%	88%	0.87

4. Experimental Setup

4.1. Datasets

Instruction datasets, including document instruction datasets, are often collected through manual annotation or generated by LLMs or MLLMs. In our experiments, both types of document instruction datasets are considered for tagging and evaluation.

Manually Annotated Dataset. The widely-used public dataset DocVQA (Mathew et al., 2021) is employed, consisting of 50,000 questions defined across over 12,000 documents from various industry sources.

Generated Dataset. For document VQA tasks, the availability of publicly accessible instruction datasets is limited. Therefore, motivated by existing works that utilize LLMs/MLLMs for generating instruction datasets (Taori et al., 2023; Ding et al., 2023), we use LLMs/MLLMs to generate an expanded corpus of document instruction datasets. Following LayoutLLM (Luo et al., 2024), to cover a broader spectrum of document types and complex instructions, four document datasets are used for building the instruction tuning dataset: RVL-CDIP (Harley et al., 2015), DocILE (Šimsa et al., 2023), PubLayNet (Zhong et al., 2019), and PubTabNet (Zhong et al., 2020). RVL-CDIP covers a substantial diversity of document types, with a collection of 400,000 images spanning 16 distinct classes, including, but not limited to, letters, forms, and memos. DocILE is specialized in form-based documents that facilitate diverse information extraction tasks. It contains a set of 6,680 labeled business documents, supplemented by a substantial unlabeled set of 932,000 documents and a synthetically generated corpus of 100,000 documents. PubLayNet, with its focus on document layout analysis, comprises over 360,000 PDF documents rich in textual and layout elements. This dataset features annotations that include common layout components such as text blocks, titles, lists, figures, and tables. Lastly, PubTabNet is characterized by the intricate table structures found in academic literature. Encompassing more than 568,000 tabular instances in both image and HTML formats, this dataset offers detailed cell bounding box information, which is instrumental for advanced table recognition and understanding tasks.

4.2. Implementation Details

ProcTag Implementation Details. (1) Document Representation. The Vision Grid Transformer (VGT) (Da et al., 2023) model is utilized as the layout detection tool described in 3.1 for structural document representation. The OCR results for document images are sourced from the official OCR outputs provided by each respective document dataset.

(2) Instruction Execution Process Generation. During the instruction execution process generation in 3.2, if the output format returned from GPT does not meet the generation criteria and cannot be parsed, GPT is invoked again. This process is repeated at most twice, and if it still fails, the data is discarded. The proportion of discarded samples does not exceed 0.1%.
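The retry policy described above can be sketched as a small wrapper. Here `call_model` and `parse` are hypothetical callables standing in for the GPT request and the output parser, and "repeated at most twice" is interpreted as one initial call plus up to two re-invocations, which is an assumption about the intended count.

```python
def generate_process(call_model, parse, prompt, max_retries=2):
    """Call the model, re-invoking it when the output cannot be parsed.

    Returns the parsed process, or None when every attempt fails, in which
    case the caller is expected to discard the sample.
    """
    for _ in range(1 + max_retries):
        raw = call_model(prompt)
        try:
            return parse(raw)
        except ValueError:
            continue  # unparsable output: try again
    return None
```

Returning `None` rather than raising keeps the discard decision with the caller, which makes it easy to count how many samples were dropped.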

(3) Process Tagging. In 3.3, following InsTag, tags that appear fewer than 4 times in DocVQA and fewer than 2 times in the other datasets are regarded as long-tail tags and are removed during “Frequency Filtering”. During the “Aggregation” stage, after the exclusion of lowercase letters and special characters, embeddings for function names are obtained using SantaCoder (Allal et al., 2023) and clustered with DBSCAN (Hahsler et al., 2019). The minimum semantic similarity threshold is set to 0.015. Finally, a minimum support of 40 instances and a minimum confidence of 99% are used for association aggregation. Note that, for instruction execution processes, the same tags in a different order represent different execution processes. Hence, ProcTag aggregates tags only when they co-occur with high frequency and are adjacent in the execution order. For example, “extract_list” and “find_item” commonly co-occur, but if they appear in the first and third steps of the execution process, respectively, and are thus non-adjacent, they are not aggregated. Otherwise, they are combined into “extract_list_item”.

Dataset Details. Instruction Generation. PubLayNet, PubTabNet, RVL-CDIP, and DocILE are utilized for instruction generation. For each dataset, 3,000 document images are randomly sampled to prompt GPT (GPT-3.5) with the same prompt that is used for generating the layout instruction tuning dataset of LayoutLLM (Luo et al., 2024). Ten instruction items are generated for each document. After parsing and filtering, a random selection of 20,000 instances forms the training set, with another 1,000 instances randomly selected to serve as the test set.

Experimental Details. LLMs and MLLMs. Given the superior performance of the Qwen series models on the DocVQA task within open-source models, we choose the Qwen series models as our experimental benchmark. Both LLMs and MLLMs are employed for experimental validation. For LLMs, all experiments are based on the 7B version of QwenChat. For each experiment, the models are trained for 3 epochs with a batch size of 32. As for MLLM models, the 7B version of QwenVL-Chat is employed, trained for 3 epochs, with a batch size of 16. For all experiments, the learning rate is set to 1e-5.

Evaluation Metric. The widely used metric ANLS (Mathew et al., 2021) is utilized for evaluating DocVQA and the four generated document instruction datasets.
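For reference, ANLS can be computed as in the sketch below. It follows the usual definition with a normalized Levenshtein threshold of 0.5 and takes the best score over all reference answers; lowercasing and stripping answers before comparison is a common convention that is assumed here rather than taken from this paper. The function returns a value in [0, 1], whereas the results in this paper appear to be reported on a 0 to 100 scale.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a list of questions.

    `references[i]` is the list of acceptable answers for `predictions[i]`.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The threshold means that a prediction differing from every reference in half or more of its characters scores zero, so near-miss OCR errors earn partial credit while unrelated answers do not.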

4.3. Quality Evaluation

Following the InsTag method, the quality of the tags generated by ProcTag is evaluated from the perspectives of precision and consistency. Precision is defined to measure whether tags correctly relate to the execution processes of their associated instructions, while consistency assesses whether the tags maintain a uniform definition across all corresponding instructions. GPT-4 and manual annotation are utilized to evaluate the tags generated by ProcTag.

Results. Table 1 shows the quality evaluation of the tags generated by ProcTag on the DocVQA dataset. The precision of ProcTag under GPT-4 and human annotation reaches 96% and 92%, respectively, with consistency of 80% and 88%. Furthermore, Cohen’s kappa score is used to compute the agreement between human and GPT-4 annotation, yielding precision and consistency agreements exceeding 0.6, which qualifies as “substantial agreement”. Moreover, an additional comparison with the ground truth of the final answers derived from step-by-step reasoning via ProcTag, as depicted in Figure 2 (b), shows an ANLS increase from 73.84 to 78.68 compared to direct reasoning without ProcTag. In conclusion, the tags produced by ProcTag demonstrate high quality in both precision and consistency.
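For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance, as $\kappa=(p_o-p_e)/(1-p_e)$. The sketch below computes it for two label sequences; the example labels used with it are invented and are not the paper's annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Because chance agreement is subtracted, two annotators who both mark most tags as correct can agree on 80% of items and still obtain a kappa well below 0.8.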

5. Experimental Results

5.1. Main Results

Our experiments are conducted on the widely used open-source DocVQA dataset, which is manually annotated, as well as on four generated datasets, for supervised fine-tuning of LLMs/MLLMs. For each dataset, models are fine-tuned with different data proportions sampled by three methods: ProcTag, the existing instruction data method InsTag, and random sampling. Then, the document understanding performance of the resulting models is evaluated on the test set, with Qwen and Qwen-VL used as the models. As shown in Figure 3, ProcTag consistently outperforms both the InsTag and random variants. Notably, the performance trends of the InsTag and random methods are similar. This is because InsTag, which considers only instruction text, cannot effectively distinguish different document instruction data. This finding is consistent with our observations (see Figure 1) and supports the necessity of modeling the process of document instruction execution rather than the instruction text itself.

Evaluation on Manual Annotation Dataset. Due to the scarcity of existing document instruction datasets, Qwen-VL has been trained on most datasets of this category, including DocVQA. For fair comparison on DocVQA, our experiments avoid validating on Qwen-VL, considering only Qwen.

Evaluation on Generated Dataset. The experimental results on the four generated datasets show consistent trends on both Qwen and Qwen-VL. In particular, unstable performance is observed in the setting with a 1/8 data proportion on certain datasets (such as DocILE and RVL-CDIP), which we hypothesize is caused by the limited amount of data. Moreover, performance is highest on DocILE, which is attributable to its highly diverse document content. Conversely, ProcTag performs least effectively on PubTabNet and PubLayNet, which are confined to tabular and academic paper layouts, respectively. The performance on RVL-CDIP falls between these. In summary, ProcTag exhibits superior performance on datasets with more diverse document data.

Visualization of the original document data features sampled by different methods. In addition, to evaluate the feasibility of direct feature extraction from raw instructional document data to sample effective instances, the widely acknowledged feature extractor CLIP (Radford et al., 2021) is utilized to extract features from images and corresponding instructional texts within document datasets gathered by various sampling methods. Subsequently, these features are visualized using t-SNE (Arora et al., 2018). As depicted in Figure 5, the raw document instruction features originating from diverse sample sources could not be effectively differentiated, despite observable performance variances amongst them. This indicates that for document instruction data, utilizing raw data features does not afford effective discrimination. Hence, it is imperative to employ specialized document representations like modeling the process of document instruction execution for effective modeling rather than relying on raw data.

5.2. ProcTag for Data Efficacy

Existing instruction datasets employed for LLMs/MLLMs are often sourced from generated content, characterized by significant size and redundancy, which can lead to extended training durations and resource inefficiency. To address this challenge, data efficacy, denoted as $E$ , is introduced to serve as a criterion for selecting high-quality data. This efficacy is defined as the ratio of the current performance $P_{cur}$ to the best performance $P_{best}$ , given by $E=P_{cur}/P_{best}$ . To assess data efficacy, we conduct experiments based on data ratio and tag coverage rate. Within these two settings, $P_{best}$ refers to the performance achieved at 100% data ratio and 100% tag coverage rate, respectively.

As shown in Figure 4, when the data ratio is varied, peak performance is observed at 30%, indicating that ProcTag can maintain high effectiveness with only a small subset of the data. The coverage rate represents the percentage of all tags in the dataset that are covered by the sampled data. The experimental results show that performance improves as coverage increases. This suggests that the tags generated by ProcTag are effective and that each tag contributes to the performance gain. Furthermore, these results indicate that tagging with ProcTag is rational and effective while preserving the diversity and complexity of the tags.
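
To illustrate the notion of tag coverage rate, the sketch below computes the fraction of all tags covered by a selected subset and then picks samples greedily until a target coverage is reached. The greedy rule, the function names, and most of the tag names are assumptions made for illustration (only locate_bulleted_list appears in this paper); this is not presented as ProcTag's actual sampling procedure.

```python
from typing import Dict, List, Set


def tag_coverage(selected: List[str], tags_by_sample: Dict[str, Set[str]]) -> float:
    """Fraction of all distinct tags that appear in the selected samples."""
    all_tags = set().union(*tags_by_sample.values())
    if not all_tags:
        return 0.0
    covered = set().union(*(tags_by_sample[s] for s in selected))
    return len(covered) / len(all_tags)


def greedy_select(tags_by_sample: Dict[str, Set[str]], target: float) -> List[str]:
    """Pick samples that add the most uncovered tags until `target` coverage."""
    all_tags = set().union(*tags_by_sample.values())
    remaining = dict(tags_by_sample)
    covered: Set[str] = set()
    chosen: List[str] = []
    while all_tags and remaining and len(covered) / len(all_tags) < target:
        # Sort the ids so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining sample adds a new tag
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical process tags per instruction, for illustration only.
toy_tags = {
    "q1": {"locate_table", "read_cell"},
    "q2": {"locate_bulleted_list"},
    "q3": {"read_cell"},
    "q4": {"locate_table", "compare_values"},
}
subset = greedy_select(toy_tags, target=1.0)
print(subset, tag_coverage(subset, toy_tags))
```

In this toy example, three of the four instructions already cover every tag, which mirrors the idea that a subset with full tag coverage can stand in for the complete dataset.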

Table 2. Ablation study on the impact of different prompts for representing documents in the DocVQA task.

| Verbalizer | GPT-3.5 (direct prompting) | Qwen (fine-tuning) |
| --- | --- | --- |
| PlainText | 63.66 | 77.14 |
| SpatialFormat (Lamott et al., 2024) | 71.03 | 80.93 |
| DocLayPrompt (Ours) | 73.84 | 81.80 |

5.3. Ablation study

5.3.1. Impact of DocLayPrompt

To assess the effectiveness of DocLayPrompt in document text representation, we conducted comparative experiments with two commonly used document representation prompts: PlainText and SpatialFormat (Lamott et al., 2024). In direct prompting tests on the DocVQA test set using GPT-3.5, as shown in Table 2, DocLayPrompt yields an ANLS improvement of 2.81 over SpatialFormat (73.84 vs. 71.03). Furthermore, when Qwen is fine-tuned with the different document prompts, DocLayPrompt outperforms both PlainText and SpatialFormat. These results indicate that DocLayPrompt is a better document representation than PlainText and SpatialFormat, both when directly prompting LLMs and when incorporated into LLM training.
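
For readers unfamiliar with the metric, the sketch below shows how an ANLS (Average Normalized Levenshtein Similarity) score is commonly computed for DocVQA-style answers. The threshold of 0.5 and the case-insensitive comparison follow the usual convention and are assumptions here, since this section does not restate the evaluation settings; the function names are illustrative. The final lines simply re-derive the 2.81 gap from the two GPT-3.5 scores in Table 2.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions: List[str], ground_truths: List[List[str]], tau: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity.

    Similarity is 1 - NL when the normalized distance NL is below `tau`,
    and 0 otherwise. `tau` = 0.5 is the conventional value (an assumption here).
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()
        best = 0.0
        for gt in answers:
            g = gt.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


print(anls(["Invoice", "abc"], [["invoice"], ["xyz"]]))

# Gap between DocLayPrompt and SpatialFormat with GPT-3.5 (values from Table 2).
print(round(73.84 - 71.03, 2))
```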

Figure 6 shows the differences in document instruction process generation between SpatialFormat and DocLayPrompt. With SpatialFormat prompts, the generated instruction execution process fails to lead GPT to an answer consistent with the annotation in the original instruction dataset. GPT does not recognize key layout information from this textual representation and therefore does not generate tags related to the list region. With layout information integrated via DocLayPrompt, the generated instruction execution process produces a final answer that agrees with the annotation in the original instruction dataset, and GPT understands the document layout, as reflected in steps such as locate_bulleted_list. This underscores the efficacy of the generated instruction execution process.

5.3.2. Impact of different representations of the process

Although a straightforward natural-language representation of the process is a practical alternative, ProcTag uses pseudo-code to represent the instruction execution process. Pseudo-code is preferred because it provides a more granular and structured representation, which is expected to yield a more detailed yet succinct description of the instruction execution process and thus to make data sampling more effective. To evaluate the pseudo-code representation, we generate process tags from natural-language descriptions and assess them on the DocVQA dataset. As shown in Figure 7, natural-language tags offer some advantage over random sampling but do not provide the significant improvements observed with pseudo-code tags, confirming the superior efficacy of the pseudo-code format.

6. Limitation

Due to cost constraints, the current approach employs GPT-3.5 and a text-based document prompt, rather than incorporating multimodal large language models like GPT-4V. Clearly, relying on textual document representation results in the loss of certain visual information, thereby hindering the applicability of our approach on visually-rich datasets such as InfographicVQA (Mathew et al., 2022).

7. Conclusion

In this paper, we propose ProcTag, a data-oriented method that assesses the efficacy of document instruction data. ProcTag performs tagging on the execution process of document instructions and utilizes the diversity and complexity of these tags to assess the efficacy of the dataset. Additionally, DocLayPrompt, a novel semi-structured layout-aware document prompt, is proposed for effectively representing documents. Experimental results demonstrate that ProcTag assesses the efficacy of document instruction data more effectively than existing data evaluation methods and random sampling. As modeling the process of instruction execution is a generic approach, in the future we will extend it to the general artificial intelligence domain and explore more effective data evaluation strategies.

References

Allal et al. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 (2023).
Appalaraju et al. (2021) Srikar Appalaraju, Bhavan Jasani, and Bhargava Urala Kota. 2021. DocFormer: End-to-End Transformer for Document Understanding. In ICCV. 4171–4186.
Arora et al. (2018) Sanjeev Arora, Wei Hu, and Pravesh K Kothari. 2018. An analysis of the t-sne algorithm for data visualization. In Conference on learning theory. PMLR, 1455–1462.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023a. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS 33 (2020), 1877–1901.
Cui et al. (2021) Lei Cui, Yiheng Xu, Tengchao Lv, and Furu Wei. 2021. Document ai: Benchmarks, models and applications. arXiv preprint arXiv:2111.08609 (2021).
Da et al. (2023) Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. 2023. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19462–19472.
Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv preprint arXiv:2305.14233 (2023).
Gu et al. (2021) Jiuxiang Gu, Jason Kuen, Vlad I Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, and Tong Sun. 2021. Unified Pretraining Framework for Document Understanding. In NeurIPS.
Hahsler et al. (2019) Michael Hahsler, Matthew Piekenbrock, and Derek Doran. 2019. dbscan: Fast density-based clustering with R. Journal of Statistical Software 91 (2019), 1–30.
Han et al. (2023) Soyeon Caren Han, Yihao Ding, Siwen Luo, Josiah Poon, HeeGuen Yoon, Zhe Huang, Paul Duuring, and Eun Jung Holden. 2023. Workshop on Document Intelligence Understanding. arXiv:2307.16369 [cs.IR]
Harley et al. (2015) Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In ICDAR. IEEE, 991–995.
Huang et al. (2022) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In ACM Multimedia.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS 35 (2022), 22199–22213.
Lamott et al. (2024) Marcel Lamott, Yves-Noel Weweler, Adrian Ulges, Faisal Shafait, Dirk Krechel, and Darko Obradovic. 2024. LAPDoc: Layout-Aware Prompting for Documents. arXiv preprint arXiv:2402.09841 (2024).
Li et al. (2023) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032 (2023).
Li et al. (2021a) Peizhao Li, Jiuxiang Gu, and Jason Kuen. 2021a. SelfDoc: Self-Supervised Document Representation Learning. In CVPR. 5652–5660.
Li et al. (2021b) Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, and Errui Ding. 2021b. StrucTexT: Structured text understanding with multi-modal Transformers. In ACM Multimedia. 1912–1920.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved Baselines with Visual Instruction Tuning. arXiv preprint arXiv:2310.03744 (2023).
Liu et al. (2023c) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023c. Visual Instruction Tuning.
Liu et al. (2023b) Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Yang Liu, Biao Yang, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Xucheng Yin, Cheng lin Liu, Lianwen Jin, and Xiang Bai. 2023b. On the Hidden Mystery of OCR in Large Multimodal Models. arXiv:2305.07895 [cs.CV]
Lu et al. (2023) Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. In The Twelfth International Conference on Learning Representations.
Luo et al. (2023) Chuwei Luo, Changxu Cheng, Qi Zheng, and Cong Yao. 2023. GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction. In CVPR. 7092–7101.
Luo et al. (2024) Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. 2024. LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding. In CVPR.
Luo et al. (2022) Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, and Luo Si. 2022. Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding. arXiv preprint arXiv:2206.13155 (2022).
Mathew et al. (2022) Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. 2022. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1697–1706.
Mathew et al. (2021) Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. In WACV. 2200–2209.
Mukherjee et al. (2023) Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707 (2023).
OpenAI (2022) OpenAI. 2022. Introducing chatgpt. https://openai.com/blog/chatgpt.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS 35 (2022), 27730–27744.
Peng et al. (2022) Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2022. ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding. In EMNLP findings.
Perot et al. (2023) Vincent Perot, Kai Kang, Florian Luisier, Guolong Su, Xiaoyu Sun, Ramya Sree Boppana, Zilong Wang, Jiaqi Mu, Hao Zhang, and Nan Hua. 2023. LMDX: Language Model-based Document Information Extraction and Localization. arXiv preprint arXiv:2309.10952 (2023).
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML). 8748–8763.
Shi et al. (2023) Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, and Lianwen Jin. 2023. Exploring ocr capabilities of gpt-4v (ision): A quantitative and in-depth evaluation. arXiv preprint arXiv:2310.16809 (2023).
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
Wang et al. (2023) Wenjin Wang, Yunhao Li, Yixin Ou, and Yin Zhang. 2023. Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering. arXiv preprint arXiv:2306.00526 (2023).
Wang et al. (2024) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. 2024. How far can camels go? exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems 36 (2024).
Wei et al. (2022a) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022a. Finetuned language models are zero-shot learners. ICLR (2022).
Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35 (2022), 24824–24837.
Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
Xu et al. (2020) Yiheng Xu, Minghao Li, Lei Cui, and Shaohan Huang. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In KDD. 1192–1200.
Xu et al. (2021) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. In ACL.
Yang et al. (2023) Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421 (2023).
Ye et al. (2023a) Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. 2023a. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499 (2023).
Ye et al. (2023b) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023b. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178 [cs.CL]
Zhang et al. (2023) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. 2023. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107 (2023).
Zhong et al. (2020) Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. 2020. Image-based table recognition: data, model, and evaluation. In ECCV. Springer, 564–580.
Zhong et al. (2019) Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. Publaynet: largest dataset ever for document layout analysis. In ICDAR. IEEE, 1015–1022.
Šimsa et al. (2023) Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, and Dimosthenis Karatzas. 2023. DocILE Benchmark for Document Information Localization and Extraction. arXiv:2302.05658 [cs.CL]