MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

Chengguang Gan1 [email protected] Qingyu Yin2 Xinyang He3 Hanjun Wei4 Yunhao Liang3
Younghun Lim1
Shijian Wang5 Hexiang Huang6 Qinghao Zhang7
Shiwen Ni8
Tatsunori Mori1
1Yokohama National University, 2Zhejiang University, 3University of Chinese Academy of Sciences,
4Chengdu Institute of Computer Applications, Chinese Academy of Sciences,
5Southeast University, 6University of Tsukuba, 7Pusan National University,
8Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Abstract

The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained due to the exclusive availability of MRE mix datasets in Japanese, thereby limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets. Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Utilizing this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model demonstrates the capability to effectively process novel MMM datasets, exhibiting significant improvements in performance.

{CJK}

UTF8gbsn

MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models



11footnotetext: †Corresponding author

1 Introduction

Refer to caption
Figure 1: The Mutual Reinforcement Effect between the labels of Word-level labels and text-level label within a same text.

Information extraction (IE) Sarawagi et al. (2008) is a significant area of research within natural language processing (NLP). This field has evolved to encompass a variety of subtasks, including sentence classification , text classification , Named Entity Recognition (NER) (Qu et al., 2023; Nadeau and Sekine, 2007; Lample et al., 2016), sentiment analysis (Tan et al., 2023; Medhat et al., 2014; Rodríguez-Ibánez et al., 2023), relationship extraction (Wadhwa et al., 2023; Mintz et al., 2009; Etzioni et al., 2008), and event extraction (Gao et al., 2023; Xiang and Wang, 2019). Traditionally, these IE subtasks have been segregated into distinct categories for processing. In conventional multi-task IE (Sun et al., 2023; Zhao et al., 2020), datasets from various tasks are typically merged and subsequently fine-tuned using a unified model. This process culminates in the extraction of information from multiple subtasks, each directed by task-specific output heads. While this method effectively leverages the internal knowledge of the model across different IE tasks, it does not address the potential interconnections among the tasks themselves. This omission highlights a gap in understanding how these tasks might benefit from exploring their mutual relationships.

The Mutual Reinforcement Effect (MRE) Gan et al. (2023b) represents an emerging research direction in multitasking Information Extraction (IE). It primarily explores the interconnections among individual subtasks, thereby enhancing the performance of each task. MRE categorizes IE subtasks into two principal groups: text-level tasks and word-level tasks. For instance, conventional tasks such as sentence classification, text classification, and text sentiment analysis fall under text-level tasks. Conversely, tasks like NER, which involve classifying words, are categorized as word-level tasks.

Figure 1 illustrates a specific example of MRE applied to IE subtasks. The left side of the figure depicts labels for sentence classification, while the right side displays words alongside their corresponding labels extracted from the sentence. Thus, the left side corresponds to text-level tasks, and the right side to word-level tasks. Notably, each sentence features both a text-level label and word-level label entities pair. This differs from typical IE multitasking, which extracts information from various sub-tasks across different texts. Instead, MRE simultaneously performs text-level classification and word-level label-entities pairing on the same text.

We can elucidate the MRE with specific examples. Consider the sentence from Figure 1: ’Giant pandas are mammals, endemic to China.’ This sentence is categorized under ’nature,’ and it includes two annotated label-entities pairs: ’Animal Name: pandas’ and ’Nation: China.’ This exemplifies that when a sentence is classified as relating to nature, it likely contains terms associated with natural elements, such as species names and geographical locations. Similarly, when a sentence is classified as nature, there is a high probability that the sentence as a whole describes nature. The interplay between these two levels of text level classification and word-level entity recognition illustrates the MRE, as each level mutually reinforces understanding and classification accuracy. Expanding upon this, MRE is applicable to various subtasks in IE. For instance, in sentiment analysis, the presence of numerous words with positive sentiment typically indicates that the text overall conveys a positive sentiment. Conversely, if a text is overall negative, it likely contains words with negative sentiment. Integrating tasks at both the text and word levels not only aids in the model’s comprehension of the context but also improves performance across both tasks. This methodology probes whether LLMs can interpret texts similarly to human understanding, by assessing the semantics of individual words and then synthesizing these insights to classify the text as a whole (Gan et al., 2023c).

Refer to caption
Figure 2: Multilingual Mutual Reinforcement Effect Mix Datasets Names of all sub-datasets. (The image does not represent a percentage of the actual subdataset size.)

Figure 2 illustrates the composition of the translated Multilingual Mutual Reinforcement Effect Mix (MMM) Datasets. Notably, the SCPOS subdataset within the original Japanese dataset is substantially larger than others, which is why the figure does not depict the proportional sizes of the subdatasets; instead, it merely labels each subdataset within the MMM dataset. This dataset encompasses three languages and includes seven distinct subdatasets for each language, each corresponding to a specific information extraction task.

Among these, SCNM denotes the Sentence Classification and Named Entity Recognition Mix Dataset. SCPOS, which stands for Sentiment Classification and Part-of-Speech Dataset, involves comprehensive sentiment polarity classification at both the text and word levels. This subdataset is further segmented into four categories: sentiment-related words, adjectives and nouns, solely adjectives, and solely nouns. This categorization facilitates an examination of how words with different lexical properties influence the overall sentiment classification of a text.

Additionally, the TCREE dataset is dedicated to Text Classification and Relation & Event Extraction, leveraging mixed data types for enhanced analytical depth. Lastly, TCONER represents the Open-Domain NER and Text Classification, utilizing an existing open-domain dataset for annotations. This structure allows for a nuanced exploration of text classification and entity recognition across various domains.

This work begins by introducing a novel dataset translation framework. Initially, six MRE mix datasets were translated, and subsequently, the TCONER dataset was expanded. Furthermore, to address the suboptimal performance of existing large language models LLMs on IE tasks, we refined the training methods. We propose a new, streamlined, and efficient input-output scheme that standardizes the handling of all tasks, enhancing the model’s training process. The LLM was then trained using the MMM dataset. The outcomes indicate that the optimized LLM, referred to as OIELLM, surpasses the original training method in performance across various datasets. Moreover, expanding the use of more MRE mix datasets is shown to further enhance the knowledge utilization within the LLM, leading to improved performance across all tasks.

This paper presents several contributions are detailed as follows:

  1. 1.

    Development of a Novel Framework for Dataset Translation: We propose a new framework that leverages existing LLMs to translate datasets for underrepresented languages. This framework facilitates the construction of Multilingual MRE Mix Datasets, significantly reducing the reliance on extensive human labor and addressing the issue of dataset scarcity in linguistic research.

  2. 2.

    Expansion of MRE Mix Datasets: We have successfully expanded the MRE Mix datasets to include a newly constructed open-domain NER dataset. This addition serves as a critical component in enhancing the comprehensiveness and utility of the MRE Mix datasets, effectively placing the last piece in the puzzle for open-domain applications.

  3. 3.

    Optimization of Input and Output Schemes for IE LLMs: Our research optimizes the traditional input and output mechanisms of information extraction LLMs. By employing the specially constructed MMM dataset, we not only reject the standard models trained on generic data but also significantly enhance the performance of LLMs within the domain of information extraction.

2 Related Work

Datasets. To begin, the MRE mix dataset primarily originates from the SCNM Gan et al. (2023b) dataset in Japanese, followed by the SCPOS (Gan et al., 2023d) and TCREE Gan et al. (2023a) datasets. However, the exclusive use of the Japanese language across these datasets poses significant challenges for researchers attempting to further explore the MRE. Moreover, there has been a growing interest in employing LLMs for dataset construction (Tan et al., 2024; Wadhwa et al., 2023; Li et al., 2023; Laskar et al., 2023). Pioneering studies Huang et al. (2023) have demonstrated the efficacy of LLMs in data annotation, where LLM-annotated datasets have outperformed manually annotated counterparts. For instance, LLMs have been utilized to generate datasets for mathematical problems Lin et al. (2024) and to develop dataset labeling frameworks, such as FreeAL (Xiao et al., 2023a), where the data is initially labeled by LLMs and subsequently refined by smaller models before undergoing a final, more accurate labeling by LLMs again.

These methodologies leverage instructional learning and in-context learning to guide LLMs to respond to specific queries and from these responses, extract annotated labels, thereby creating a fully labeled dataset. Distinct from previous efforts, the MMM dataset represents the inaugural initiative to translate datasets from lesser-used languages into more widely spoken languages, such as English and Chinese. Furthermore, the newly developed TCONER dataset addresses a critical gap by providing the first open-domain Named Entity Recognition (NER) dataset within the existing framework of the MRE mix dataset.

LLM in Information Extraction. Since the introduction of Pretrained Language Models (PLMs), sequential-to-sequential (seq2seq) based IE models have gained prominence. These developments range from the initial UIE Lu et al. (2022) to later models such as USM Lou et al. (2023) and Mirror (Zhu et al., 2023). All these models are generative in nature, enabling them to handle multiple word-level IE tasks—such as NER, Relation Extraction, and Event Extraction simultaneously. The primary advantage of these generative IE models is their generalizability; they eliminate the need for task-specific fine-tuning across different tasks. Instead, a single model can address all IE subtasks by standardizing the format of inputs and outputs for various tasks. The model is trained across different IE subtasks using these unified formats, aiming to equip a single model with the capability to manage multiple tasks effectively.

Refer to caption
Figure 3: The overview of dataset translation framework.

With the advent of LLMs, new approaches to IE have emerged, which can be broadly divided into two categories. The first involves direct interaction with LLMs using prompts in a zero-shot or few-shot manner, where the model outputs the desired entities either through multi-round dialog-style prompts or through single-command-based prompts that extract entities in one go (Wang et al., 2023; Wei et al., 2023). The second approach involves fine-tuning LLMs using specialized datasets (Zhou et al., 2023; Xiao et al., 2023b).

Our research distinguishes itself by focusing more intensively on the MRE. We go beyond merely aggregating existing IE sub-datasets for model training. Instead, we develop specialized MRE-enhanced datasets, through which we not only demonstrate but also apply the efficacy of MRE in enhancing information extraction capabilities.

3 Multilingual Mutual Reinforcement Effect Mix Datasets

In this chapter we will explain how to translate MRE mix datasets in small languages into other languages. And how to construct TCONER datasets. And how you can minimize the use of manual labor with guaranteed quality.

Refer to caption
Figure 4: The format of MMM datasets.

3.1 Dataset Translation Framework

First, it is essential to understand the format of the Multilingual Mutual Reinforcement Effect Mix (MMM) dataset. As depicted in Figure 4, the MMM dataset comprises inputs and outputs. The input section, highlighted in blue, includes both text and a task instruction word, such as "NER." In the output section, shown in green, the initial output is a text-level classification label, followed by the task instruction word "NER." The labeling follows the start and end symbols (i.e., ":", ";") used in the original MRE mixed dataset. This format allows for consistent generation of label-entity pairs regardless of quantity (e.g., ":label1;entities1:label2;entities2…"). Thus, the task instruction word guides the model in producing various word-level extracted information alongside the text-level classification label.

Figure 3 presents a flowchart of the entire dataset translation framework. The process begins on the leftmost side, where six sub-datasets are initially processed using a rule-based matching method, according to their classifications. The labels at both text and word levels are systematically translated into English and Chinese. Given the consistent labeling across datasets, this translation can proceed directly based on predefined rules. For instance, the Japanese label "ポジティブ" is directly translated as "positive." Employing a rule-based approach for label translation is not only quick and precise but also simplifies the subsequent translation of text and entities. Furthermore, these translated labels are input into a LLM along with the untranslated text and entities, serving an auxiliary role in the translation process.

The process involves two main inputs to the LLM, GPT-3.5-Turbo Ouyang et al. (2022): the part with translated labels and the part with untranslated text and entities. We employ both instruction-based and in-context learning (ICL) methodologies for this translation task. As depicted in the central portion of Figure 3, the selection of the instruction template was refined through multiple iterations. Initially, a simple instruction such as "Translate the following Japanese dataset into English." failed to produce satisfactory translations. Consequently, we introduced several constraints to enhance the output quality. These include stipulating that the model’s output format must align with the example provided below, with a critical requirement being the accurate translation of entities, ensuring they correspond directly to terms found in the original Japanese text. Additional constraints were applied specifically for Japanese-to-Chinese translations, such as informing the model that labels have been pre-translated and only text and entities require translation. We also instructed the model to ensure comprehensive translation into Chinese. Furthermore, a one-shot example of ICL was provided to demonstrate the desired outcome, guiding the model to generate translations strictly adhering to the specified format.

Finally, we obtained the translated dataset. However, due to the inherent unpredictability of LLM outputs, it is not always guaranteed that the outputs will conform to the expected format, even when the inputs are consistent. To address this, we implemented a dual-component rule-based filtering mechanism. The first component involves removing samples containing any residual Japanese characters from the translated data. The second component entails verifying whether the translated entities exactly match words in the text. Samples that do not meet this criterion are excluded. Additionally, this step assesses whether the pairings of labels and entities adhere to the formatting standards of the MMM dataset.

Despite the substantial reduction in dataset size resulting from the first two steps—translation and filtering—the remaining data exhibit exceptionally high translation quality. The final dataset undergoes a manual review and correction process, which ensures maximum accuracy while minimizing the reliance on manual labor. This approach outlines our tailored dataset translation framework, designed to accommodate the specific characteristics of the MMM dataset. With minimal modifications, this framework can be adapted for translating datasets for other tasks, effectively addressing the scarcity of datasets in lesser-used languages.

3.2 Construction of TCONER

In the original MRE mix datasets, relation and event extraction tasks are open-domain, implying that the labels are not predefined. However, the label set is limited to only a dozen options. Given this context, we constructed a new dataset, termed TCONER, based on an open-domain Named Entity Recognition (NER) dataset***https://huggingface.co/datasets/Universal-NER/Pile-NER-type?row=0 (Zhou et al., 2023). The labels at the text level in the TCONER dataset are also open-domain. To annotate this dataset, we initially employed the GPT-3.5-Turbo model to assign open-domain text-level labels. Subsequent manual verification and annotation were conducted to ensure accuracy and consistency, resulting in the finalized TCONER dataset. Similarly, we translated the constructed English TCONER dataset using the dataset translation framework. The TCONER dataset was translated into Japanese and Chinese.

3.3 Results of Datasets Construction

Refer to caption
Figure 5: The input and output of Open-domain Information Extraction Large Language Model (OIELLM).

Appendix A Table 2 presents the statistics of the final translation results. Due to the high costs associated with the use of a premium API, we limited our study to 10,000 samples from each of three sub-datasets within SCPOS and the TCONER dataset, which contains 180,000 entries. These 10,000 samples, retained post-translation, proved to be an ample test set. It was observed that there was a greater data loss when translating into Chinese compared to English. This discrepancy may be attributed to the training data predominance of English in OpenAI’s GPT-3.5-Turbo model, resulting in superior performance in English-related tasks. For instance, in the SCNM and TCREE datasets, the Japanese to English translation accuracy exceeded 80%. Conversely, the translation results from English to Chinese in the TCONER dataset were markedly better than those from English to Japanese. This further confirms that GPT-3.5-Turbo exhibits enhanced effectiveness with major languages compared to lesser-used ones.

4 Open-domain Information Extraction Large Language Model

In this chapter, we outline methodologies to enhance the performance of existing models and techniques for processing MRE mix datasets, aiming to surpass previous benchmarks. Before delving into the specifics of the Open-domain Information Extraction Large Language Model (OIELLM), it is imperative to justify the necessity for a distinct model tailored to MMM datasets.

Firstly, MRE mix datasets differ significantly from traditional IE tasks as they require simultaneous output of text-level labels and word-level label-entity pairs. Consequently, standard sequence labeling models are inadequate for handling these demands directly. Furthermore, existing generative IE models and methodologies have solely focused on producing word-level label-entities, neglecting text-level labels altogether.

The primary objective of MRE mix datasets is to investigate the interplay between text-level and word-level annotations. By leveraging this synergistic relationship, we aim to concurrently enhance the performance of both tasks. This model improves textual understanding by learning both tasks in tandem. Additionally, the MRE framework can contribute to model interpretability, drawing inspiration from cognitive processes that mimic human reasoning.

In summary, this study proposes the development of a novel model specifically designed for processing the MMM dataset. Furthermore, we aim to experimentally investigate whether the MRE positively influences various IE subtasks when using LLMs. Traditionally, IE tasks have been approached through the use of QA dialogues for extraction. However, this research adopts a distinct methodology. This departure is motivated by earlier foundational studies in generic generative IE with PLMs, where dialogue models were not utilized. Instead, these studies implemented a generic framework. Accordingly, we too will employ a modified input and output scheme tailored for the MMM dataset, diverging from the conventional dialogue-based approaches.

Japanese SCNM SCPOS: RW SCPOS: Adj & N
Model TL WL ALL TL WL ALL TL WL ALL
USA-7B - - - 53.27 40.80 7.67 91.33 81.68 9.63
GIELLM-13B-jp 85.47 84.46 54.2 86.01 66.61 17.39 93.23 47.35 0.20
OIELLM-8B 84.73 88.53 61.93 86.50 54.76 12.40 89.13 14.88 0.40
OIELLM-8B* 87.30 89.28 64.00 88.20 53.79 12.30 89.63 15.84 0.73
OIELLM-13B 89.00 86.33 57.70 94.60 52.36 11.90 95.20 11.94 0.20
Japanese SCPOS: Adj SCPOS: N TCREE
Model TL WL ALL TL WL ALL TL WL ALL
USA-7B 91.43 45.51 51.77 92.03 81.30 9.73 - - -
GIELLM-13B-jp 93.67 45.06 55.67 92.83 46.42 0.33 97.47 79.01 77.89
OIELLM-8B 87.13 74.96 53.07 87.77 22.92 0.50 95.07 74.92 83.69
OIELLM-8B* 89.93 75.33 54.93 90.63 23.69 0.63 96.98 74.42 84.19
OIELLM-13B 94.00 60.69 42.50 94.70 18.07 0.60 97.08 73.82 84.19
Englisch SCNM SCPOS: RW SCPOS: Adj & N
Model TL WL ALL TL WL ALL TL WL ALL
OIELLM-8B 82.30 81.36 52.53 72.17 49.60 11.82 76.57 18.00 1.67
OIELLM-8B* 85.43 82.38 55.43 74.75 49.93 12.81 79.77 19.28 2.27
OIELLM-13B 84.80 80.68 50.60 95.07 46.64 12.19 94.30 18.59 3.20
Englisch SCPOS: Adj SCPOS: N TCREE
Model TL WL ALL TL WL ALL TL WL ALL
OIELLM-8B 75.47 51.85 32.33 76.10 28.67 1.27 80.87 21.77 33.67
OIELLM-8B* 76.60 51.95 33.17 78.67 27.45 0.73 80.23 25.90 22.37
OIELLM-13B 94.40 50.56 38.40 95.30 28.36 0.60 89.90 23.50 22.60
Chinese SCNM SCPOS: RW SCPOS: Adj & N
Model TL WL ALL TL WL ALL TL WL ALL
OIELLM-8B 84.90 71.90 46.40 89.29 45.75 9.93 92.33 8.75 0.33
OIELLM-8B* 86.33 69.97 46.77 92.27 46.20 10.60 94.50 8.46 0.40
OIELLM-13B 87.70 68.12 41.60 95.03 43.32 8.72 94.90 8.42 0.50
Chinese SCPOS: Adj SCPOS: N TCREE
Model TL WL ALL TL WL ALL TL WL ALL
OIELLM-8B 93.73 60.96 53.00 92.63 28.32 0.63 91.73 58.12 56.41
OIELLM-8B* 95.80 64.51 57.63 94.97 28.91 1.30 95.06 59.54 58.83
OIELLM-13B 96.00 60.68 54.90 95.20 27.77 1.00 95.26 56.91 56.00
TCONER Englisch Japanese Chinese
Model TL WL ALL TL WL ALL TL WL ALL
OIELLM-8B 24.80 21.12 0.20 27.70 13.83 0.20 33.73 18.87 0
OIELLM-8B* 37.13 23.05 0.30 41.40 14.24 0.17 48.27 18.06 0.17
OIELLM-13B 40.30 19.23 0.30 43.40 13.02 0 47.70 15.72 0.30
Table 1: The F1 score of MMM datasets. TL:Text-Level. WL: Word-level. ALL: TL and WL are correct simultaneously.

Figure 5 illustrates the input and output formats of our enhanced OIELLM. The fundamental unit of analysis in both input and output is words, reflecting our understanding of the tokenization principle utilized by LLMs, which typically focuses on words or phrases. By omitting the dialog prompt, we do not compromise the LLM’s comprehension of the task. This adjustment not only reduces the input-output length but also simplifies the LLM’s processing, thereby enhancing operational speed.

Each text processed is prefixed with task-specific instruction words, which define the task type and guide the model’s subsequent output generation. In our format, all task instruction words in the input are introduced by a special symbol "/", which serves to delineate the task words from the main text. This separation is crucial for distinguishing between text-level labels and word-level label-entity pairs in the output.

The combined text and task instruction words are then fed into the OIELLM, with the output comprising both text-level labels and word-level label-entity pairs. Our labeling convention adheres to the format used in the previous MRE mix datasets, utilizing ":" and ";" to ensure consistency and clarity.

In summary, by standardizing the input and output structures and clearly defining task instruction words, our modified OIELLM effectively processes all sub-datasets within the MMM framework.

5 Experiment

In this chapter, we detail specific experimental procedures, including dataset sizes for the MMM dataset and methodologies for training the OIELLM model, along with the evaluation techniques used.

5.1 Details of OIELLM Training

We began by selecting baselines: USA-7B (IL + ICL)https://huggingface.co/ganchengguang/USA-7B-instruction-incontext-learning and GIELLM-13B-jphttps://huggingface.co/ganchengguang/GIELLM-13B-jpllm, previously utilized for processing the MRE mixed datasets, served as comparative models. For the foundational architecture of OIELLM, we chose the latest Instruct and Base version of LLaMA3-8B§§§https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. Since LLaMA3 does not offer a 13B version, we incorporated the LLaMA2-13B Touvron et al. (2023) model as well.

We attempted to evaluate the MMM dataset using the GPT-3.5-Turbo model; however, this model failed to produce the expected information and was unable to maintain a consistent format, despite being provided with an adequate number of few-shot examples for training. The resulting F1-score was near zero. Consequently, we decided not to select the GPT-3.5-Turbo model for further testing in our study.

OIELLM was fine-tuned using full parameters based on these three models. Training was conducted at BF16 precision, while inference was performed at FP16. The training spanned 3 epochs with a learning rate of 1e-5, utilizing computational resources including three A800 80GB and three RTX 6000 Ada 48GB GPUs, with training durations ranging from 12 to 20 hours. For the training and test sets, Comprehensive statistics on the training and test sets are available in Appendix B.

5.2 Evaluation

We employed the F1 score as our primary metric for evaluation. Initially, the model’s output was bifurcated into two segments based on the task-specific instruct word: the Text-level Label and the Label-entities pairs. Subsequently, Label-entities pairs were delimited using start-end symbols (i.e., ":", ";"). Each Label-entity pair was treated as an individual element within the set. The F1 score was segmented into three categories: Text-level (TL), Word-level (WL), and ALL. These represent the F1 scores at respective levels and the aggregate F1 score when both levels are accurately predicted in an output. For detailed methodologies, including codes and formulas, please refer to Appendix C.

6 Results

Table 1 presents the experimental results of three OIELLM models trained on 21 MMM sub-datasets. Notably, the model designated with an asterisk, OIELLM-8B, was trained using the LLaMA3-8B-Instruct framework, whereas the remaining models were based on the LLaMA3-8B-Base framework. These results demonstrate the enhanced performance of OIELLM in handling Japanese data after incorporating multilingual capabilities. Impressively, OIELLM’s performance surpassed that of GIELLM-13B-jp on half of the datasets, despite GIELLM-13B-jp being a model specifically tailored for Japanese. This observation supports the hypothesis that integrating multilingualism and multitasking can more effectively leverage the knowledge embedded in the pre-training of multilingual LLMs.

However, OIELLM’s performance on the TCONER task was suboptimal, which we attribute to insufficient training data. Given that open-domain tasks require extensive and diverse datasets compared to domain-specific tasks, the limited data may have hindered the model’s performance. This area will be a focus of our future research, aiming to understand and improve the data dependencies of OIELLM in open-domain contexts.

7 Conclusion and Future Work

In this study, we introduce a framework that utilizes LLMs to translate datasets, thereby removing language barriers for research in less-represented languages. To address the deficiency of open-domain IE tasks in the MRE mix dataset, we constructed the TCONER dataset. Additionally, we trained the OIELLM model using the newly created MMM dataset.

Future work will focus on employing the MMM dataset to further explore the Mutual Reinforcement Effect. We will also continue to enhance the performance of the OIELLM model in open-domain information extraction tasks.

8 Limitations

Due to resource constraints, we were unable to employ the higher-performing GPT-4-Turbo OpenAI (2023) model as the base for our dataset translation framework. Consequently, this model was also not utilized during the testing phase on the dataset. In future work, we aim to leverage a more advanced model, such as the GPT-4-Turbo, to evaluate the MMM dataset, provided that the necessary resources become available.

References

  • Etzioni et al. (2008) Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. 2008. Open information extraction from the web. Communications of the ACM, 51(12):68–74.
  • Gan et al. (2023a) Chengguang Gan, Qinghao Zhang, and Tatsunori Mori. 2023a. Giellm: Japanese general information extraction large language model utilizing mutual reinforcement effect. arXiv preprint arXiv:2311.06838.
  • Gan et al. (2023b) Chengguang Gan, Qinghao Zhang, and Tatsunori Mori. 2023b. Sentence-to-label generation framework for multi-task learning of japanese sentence classification and named entity recognition. In International Conference on Applications of Natural Language to Information Systems, pages 257–270. Springer.
  • Gan et al. (2023c) Chengguang Gan, Qinghao Zhang, and Tatsunori Mori. 2023c. Think from words (tfw): Initiating human-like cognition in large language models through think from words for japanese text-level classification. arXiv preprint arXiv:2312.03458.
  • Gan et al. (2023d) Chengguang Gan, Qinghao Zhang, and Tatsunori Mori. 2023d. Usa: Universal sentiment analysis model & construction of japanese sentiment text classification and part of speech dataset. Preprint, arXiv:2309.03787.
  • Gao et al. (2023) Jun Gao, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. Exploring the feasibility of chatgpt for event extraction. arXiv preprint arXiv:2303.03836.
  • Huang et al. (2023) Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068, Singapore. Association for Computational Linguistics.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
  • Laskar et al. (2023) Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, and Jimmy Huang. 2023. Can large language models fix data annotation errors? an empirical study using debatepedia for query-focused text summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10245–10255.
  • Li et al. (2023) Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang. 2023. CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1487–1505, Singapore. Association for Computational Linguistics.
  • Lin et al. (2024) Qingwen Lin, Boyan Xu, Zhengting Huang, and Ruichu Cai. 2024. From large to tiny: Distilling and refining mathematical expertise for math word problems with weakly supervision. arXiv preprint arXiv:2403.14390.
  • Lou et al. (2023) Jie Lou, Yaojie Lu, Dai Dai, Wei Jia, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2023. Universal information extraction as unified semantic matching. Preprint, arXiv:2301.03282.
  • Lu et al. (2022) Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.
  • Medhat et al. (2014) Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4):1093–1113.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011.
  • Nadeau and Sekine (2007) David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Qu et al. (2023) Xiaoye Qu, Yingjie Gu, Qingrong Xia, Zechang Li, Zhefeng Wang, and Baoxing Huai. 2023. A survey on arabic named entity recognition: Past, recent advances, and future trends. IEEE Transactions on Knowledge and Data Engineering.
  • Rodríguez-Ibánez et al. (2023) Margarita Rodríguez-Ibánez, Antonio Casánez-Ventura, Félix Castejón-Mateos, and Pedro-Manuel Cuenca-Jiménez. 2023. A review on sentiment analysis from social media platforms. Expert Systems with Applications, 223:119862.
  • Sarawagi et al. (2008) Sunita Sarawagi et al. 2008. Information extraction. Foundations and Trends® in Databases, 1(3):261–377.
  • Sun et al. (2023) Kai Sun, Richong Zhang, Samuel Mensah, Yongyi Mao, and Xudong Liu. 2023. Learning implicit and explicit multi-task interactions for information extraction. ACM Transactions on Information Systems, 41(2):1–29.
  • Tan et al. (2023) Kian Long Tan, Chin Poo Lee, and Kian Ming Lim. 2023. A survey of sentiment analysis: Approaches, datasets, and future research. Applied Sciences, 13(7):4550.
  • Tan et al. (2024) Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wadhwa et al. (2023) Somin Wadhwa, Silvio Amir, and Byron C Wallace. 2023. Revisiting relation extraction in the era of large language models. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2023, page 15566. NIH Public Access.
  • Wang et al. (2023) Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, et al. 2023. Instructuie: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085.
  • Wei et al. (2023) Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. 2023. Zero-shot information extraction via chatting with chatgpt. arXiv preprint arXiv:2302.10205.
  • Xiang and Wang (2019) Wei Xiang and Bang Wang. 2019. A survey of event extraction from text. IEEE Access, 7:173111–173137.
  • Xiao et al. (2023a) Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, and Haobo Wang. 2023a. FreeAL: Towards human-free active learning in the era of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14520–14535, Singapore. Association for Computational Linguistics.
  • Xiao et al. (2023b) Xinglin Xiao, Yijie Wang, Nan Xu, Yuqi Wang, Hanxuan Yang, Minzheng Wang, Yin Luo, Lei Wang, Wenji Mao, and Daniel Zeng. 2023b. Yayi-uie: A chat-enhanced instruction tuning framework for universal information extraction. arXiv preprint arXiv:2312.15548.
  • Zhao et al. (2020) He Zhao, Longtao Huang, Rong Zhang, Quan Lu, and Hui Xue. 2020. Spanmlt: A span-based multi-task learning framework for pair-wise aspect and opinion terms extraction. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 3239–3248.
  • Zhou et al. (2023) Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. 2023. Universalner: Targeted distillation from large language models for open named entity recognition. arXiv preprint arXiv:2308.03279.
  • Zhu et al. (2023) Tong Zhu, Junfei Ren, Zijian Yu, Mengsong Wu, Guoliang Zhang, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, and Min Zhang. 2023. Mirror: A universal framework for various information extraction tasks. arXiv preprint. ArXiv:2311.05419 [cs].

Appendix A Statistical results of the translated MMM dataset

Dataset SCNM SCPOS: RW SCPOS:
RW Adj & N
Japanese 5343 2000 187528
Englisch 4449 1312 4801
Chinese 3177 1406 3937
Dataset SCPOS: SCPOS: TCREE
Adj N
Japanese 187528 187528 2000
Englisch 9132 5027 1910
Chinese 7413 3920 1491
Sprache Englisch Japanese Chinese
TCONER 45888 6791 9047
Table 2: Statistical results of the translated MMM dataset. (Due to resource constraints, we extracted only 10,000 samples as translation objects from each of the three SCPOS sub-datasets and the TCONER dataset.)
Dataset SCNM SCPOS: RW SCPOS:
RW Adj & N
Japanese 1000 1000 1000
Englisch 1000 500 1000
Chinese 1000 500 1000
Dataset SCPOS: SCPOS: TCREE
Adj N
Japanese 1000 1000 1000
Englisch 1000 1000 500
Chinese 1000 1000 500
Sprache Englisch Japanese Chinese
TCONER 2000 2000 2000
Table 3: Statistical results of train sets of OIELLM.
Dataset SCNM SCPOS: RW SCPOS:
RW Adj & N
Japanese 4343 1000 186528
Englisch 3449 812 3801
Chinese 2177 906 2937
Dataset SCPOS: SCPOS: TCREE
Adj N
Japanese 186528 186528 1000
Englisch 8132 4027 1410
Chinese 6413 2920 991
Sprache Englisch Japanese Chinese
TCONER 43888 4791 7047
Table 4: Statistical results of test sets.

Appendix B Statistical Results of Train and Test Dataset in OIELLM

As shown in Tables 3 and 4, the statistics for the complete training and test sets of the MMM dataset. The MMM dataset was segmented into 21 sub-datasets. Training set sizes were assigned based on the sizes of these sub-datasets, categorized into three groups: 500, 1000, and 2000 samples. Samples beyond these numbers were allocated to the test sets.

Appendix C Calculate Detail of F1 Score

F1=2×precision×recallprecision+recallsubscript𝐹12precisionrecallprecisionrecallF_{1}=2\times\frac{\text{precision}\times\text{recall}}{\text{precision}+\text% {recall}}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 2 × divide start_ARG precision × recall end_ARG start_ARG precision + recall end_ARG (1)
precision=|RealGenerated||Generated|precision𝑅𝑒𝑎𝑙𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑\text{precision}=\frac{|Real\cap Generated|}{|Generated|}precision = divide start_ARG | italic_R italic_e italic_a italic_l ∩ italic_G italic_e italic_n italic_e italic_r italic_a italic_t italic_e italic_d | end_ARG start_ARG | italic_G italic_e italic_n italic_e italic_r italic_a italic_t italic_e italic_d | end_ARG (2)
recall=|RealGenerated||Real|recall𝑅𝑒𝑎𝑙𝐺𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑𝑅𝑒𝑎𝑙\text{recall}=\frac{|Real\cap Generated|}{|Real|}recall = divide start_ARG | italic_R italic_e italic_a italic_l ∩ italic_G italic_e italic_n italic_e italic_r italic_a italic_t italic_e italic_d | end_ARG start_ARG | italic_R italic_e italic_a italic_l | end_ARG (3)
Algorithm 1 Parse Text Label and Entity Pairs
1:procedure parse_output(output, instruct_word, is_tcree)
2:     Input: output (String), instruct_word (String), is_tcree (Boolean)
3:     Output: text_label (String), entity_pairs (Set of Tuples)
4:     
5:     instruct_word𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡_𝑤𝑜𝑟𝑑absentinstruct\_word\leftarrowitalic_i italic_n italic_s italic_t italic_r italic_u italic_c italic_t _ italic_w italic_o italic_r italic_d ← instruct_word
6:     if instruct_wordoutput𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡_𝑤𝑜𝑟𝑑𝑜𝑢𝑡𝑝𝑢𝑡instruct\_word\notin outputitalic_i italic_n italic_s italic_t italic_r italic_u italic_c italic_t _ italic_w italic_o italic_r italic_d ∉ italic_o italic_u italic_t italic_p italic_u italic_t then
7:         return ("",{})""("",\{\})( " " , { } )
8:     end if
9:     text_label,entity_pairs𝑡𝑒𝑥𝑡_𝑙𝑎𝑏𝑒𝑙𝑒𝑛𝑡𝑖𝑡𝑦_𝑝𝑎𝑖𝑟𝑠absenttext\_label,entity\_pairs\leftarrowitalic_t italic_e italic_x italic_t _ italic_l italic_a italic_b italic_e italic_l , italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ← output.split(instruct_word, 1)
10:     text_labeltext_label.strip()formulae-sequence𝑡𝑒𝑥𝑡_𝑙𝑎𝑏𝑒𝑙𝑡𝑒𝑥𝑡_𝑙𝑎𝑏𝑒𝑙𝑠𝑡𝑟𝑖𝑝text\_label\leftarrow text\_label.strip()italic_t italic_e italic_x italic_t _ italic_l italic_a italic_b italic_e italic_l ← italic_t italic_e italic_x italic_t _ italic_l italic_a italic_b italic_e italic_l . italic_s italic_t italic_r italic_i italic_p ( )
11:     if is_tcree then
12:         entity_pairs[entity_pairs.strip()]entity\_pairs\leftarrow[entity\_pairs.strip()]italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ← [ italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s . italic_s italic_t italic_r italic_i italic_p ( ) ]
13:     else
14:         entity_pairs[pair.strip()entity\_pairs\leftarrow[pair.strip()italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ← [ italic_p italic_a italic_i italic_r . italic_s italic_t italic_r italic_i italic_p ( ) for pair𝑝𝑎𝑖𝑟pairitalic_p italic_a italic_i italic_r in entity_pairs.split(":")entity\_pairs.split(":")italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s . italic_s italic_p italic_l italic_i italic_t ( " : " ) if pair]pair]italic_p italic_a italic_i italic_r ]
15:     end if
16:     entity_pairs[tuple(pair.split(";"))entity\_pairs\leftarrow[tuple(pair.split(";"))italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ← [ italic_t italic_u italic_p italic_l italic_e ( italic_p italic_a italic_i italic_r . italic_s italic_p italic_l italic_i italic_t ( " ; " ) ) for pair𝑝𝑎𝑖𝑟pairitalic_p italic_a italic_i italic_r in entity_pairs]entity\_pairs]italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ]
17:     return (text_label,set(entity_pairs))𝑡𝑒𝑥𝑡_𝑙𝑎𝑏𝑒𝑙set𝑒𝑛𝑡𝑖𝑡𝑦_𝑝𝑎𝑖𝑟𝑠(text\_label,\text{set}(entity\_pairs))( italic_t italic_e italic_x italic_t _ italic_l italic_a italic_b italic_e italic_l , set ( italic_e italic_n italic_t italic_i italic_t italic_y _ italic_p italic_a italic_i italic_r italic_s ) )
18:end procedure