LLMaEL: Large Language Models are Good
Context Augmenters for Entity Linking

Amy Xin 1, Yunjia Qi 1 , Zijun Yao 1, Fangwei Zhu 2, Kaisheng Zeng 1,
Bin Xu 1, Lei Hou 1, Juanzi Li 1
1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2School of Computer Science, Peking University, Beijing, China
{xin-x23, qyj23}@mails.tsinghua.edu.cn
Equal contribution.
Abstract

Entity Linking (EL) models are well-trained at mapping mentions to their corresponding entities according to a given context. However, EL models struggle to disambiguate long-tail entities due to their limited training data. Meanwhile, large language models (LLMs) are more robust at interpreting uncommon mentions. Yet, due to a lack of specialized training, LLMs struggle to generate correct entity IDs. Furthermore, training an LLM to perform EL is cost-intensive. Building upon these insights, we introduce LLM-Augmented Entity Linking (LLMaEL), a plug-and-play approach to enhance entity linking through LLM data augmentation. We leverage LLMs as knowledgeable context augmenters, generating mention-centered descriptions as additional input, while preserving traditional EL models for task-specific processing. Experiments on 6 standard datasets show that the vanilla LLMaEL outperforms baseline EL models in most cases, while the fine-tuned LLMaEL sets new state-of-the-art results across all 6 benchmarks. We will release our code and datasets.



1 Introduction

Entity linking (EL) establishes connections between mentions in textual contexts and entities in a target knowledge base (KB). It plays an important role in many applications requiring semantic understanding, such as question answering Yao et al. (2023); Perez-Beltrachini et al. (2023); Xu et al. (2023), dialogue generation Cui et al. (2022); Rückert et al. (2022), and making recommendations Wang et al. (2022); Balloccu et al. (2022).

However, EL is still a challenging task as it requires two distinct capabilities: (a) Task Specification, which encompasses a thorough understanding of the entity linking task and the precise requirement for its output format, and (b) Entity Knowledge, which involves the possession of substantial knowledge about the target entity. Trained specifically for EL, traditional EL models Wu et al. (2020); Cao et al. (2021); Ayoola et al. (2022) excel in task specification, capable of producing results that exactly satisfy the format requirement of the EL task. Meanwhile, extensively pre-trained large language models (LLMs) (Brown et al., 2020; Touvron et al., 2023) are natural repositories of expansive world knowledge, possessing a vast reservoir of information pertinent to any given entity.

However, these two streams of models each present their own limitations for EL. Compared to knowledgeable LLMs, traditional EL models are constrained by their limited knowledge accumulated during model training, resulting in a narrower scope of knowledge about entities. While the training data for EL models equips them to master the specification of the entity linking task, it falls short of providing them with comprehensive knowledge of all entities, especially unpopular entities that scarcely or never appear in the training data.

Similarly, relying exclusively on LLMs for entity linking comes with its own set of drawbacks. With a primary design for language modeling, LLMs struggle to perform tasks that demand precise specifications Peng et al. (2023). More specifically, entity linking requires the production of exact entity IDs within a knowledge base. The correct generation of these IDs, which differ fundamentally from natural language, poses a significant challenge to LLMs. Although LLMs can partially learn the task specification for generating entity IDs via in-context learning (ICL; Brown et al., 2020), we observe that LLMs tend to produce fictional entity IDs, a behavior recognized as hallucination Rawte et al. (2023). This leads to erroneous linkage of mentions to non-existent KB entities.

To address the limitations inherent in traditional EL models and modern LLMs respectively, we design a novel pipeline method that capitalizes on the strengths of both approaches. We present LLM-Augmented Entity Linking (LLMaEL), a plug-and-play method to bolster entity linking through LLM data augmentation. Instead of demanding LLMs to perform EL directly, we leverage LLMs as context enhancers, supplementing EL models with additional context regarding a specific mention. Our method consists of three primary stages: (1) context augmentation, where LLMs are prompted to augment the original mention-context pair by generating supplementary mention descriptions, (2) data fusion, where the LLM-augmented context is integrated into a selected EL model, and (3) EL execution, where the EL model is employed to retrieve the target entity.

LLMaEL enhances entity linking by integrating the broad world knowledge and text generation abilities of LLMs with the specialized KB interaction skills of EL models. First, we leverage LLMs for context augmentation, enriching mentions with LLMs’ world knowledge while also condensing mention-related information from the provided context. Then, we employ an EL model to execute entity linking, thus minimizing the risk of obtaining invalid entity IDs due to LLM hallucination. Our method effectively combines the strengths of EL models and LLMs while addressing their respective shortcomings, leading to a more accurate and reliable EL solution.

For all 3 selected EL models, our vanilla LLMaEL achieves new SOTA performance on 5 out of 6 datasets. Our fine-tuned LLMaEL yields new SOTA results across all 6 datasets, achieving an average $1.21\%$ accuracy gain. Furthermore, we show that employing optional techniques such as context-joining and ensemble further boosts our method’s performance.

Figure 1: The overall architecture of our approach. We mark the traditional entity linking process in pink and our method in yellow. Mentions that need to be executed by entity linking are marked with blue.

2 Preliminaries and Related Work

We give a formal definition of, and notation for, entity linking. We also introduce related work on entity linking, as well as methods that use LLMs either directly or as context augmenters for downstream tasks.

2.1 Task Definition

Entity Linking (EL) is the task of mapping mentions from a given context to KB entities. Formally, a knowledge base $G$ consists of a set of entities $\{e\}$ that are unique objects in the real world. The input of entity linking is a textual context $c$, embedded with multiple entity mentions, denoted as $c = \dots t_1 || m_1 || t_2 || m_2 || t_3 \dots$, where $t_i$ are textual spans and $m_i$ are entity mentions. The goal of entity linking is to obtain a correct list of mention-entity pairs $\{(m_i, e_i)\}_{i \in [1, k]}$.

2.2 Related Work

Entity Linking.

It has been a long-standing goal to develop reliable entity linking solutions. The most widely adopted procedure to tackle EL is a two-stage architecture Sevgili et al. (2022), which divides EL into two sequential phases: candidate generation and entity re-ranking. Most models approach the candidate generation phase as a retrieval problem, aligning mentions to entities according to a specific metric Wu et al. (2020); Logeswaran et al. (2019); Le and Titov (2018). With the development of generative language models, it becomes possible to treat candidate generation as a text generation task Cao et al. (2021), training the model to generate unique entity names in the knowledge base directly based on the contextual information.

Recent work shows that concept information about mentions is useful for EL; fine-grained entity typing has therefore been integrated into the pipeline of numerous EL models Ayoola et al. (2022); Onoe and Durrett (2020); Raiman and Raiman (2018). This suggests that augmenting mentions with additional information about the entities potentially facilitates the entity linking process.

LLMs as Executors for Downstream Tasks.

In-context learning, or few-shot prompting, is a prevalent strategy that directs LLMs to perform specific tasks by providing them with in-prompt demonstrations. With the outstanding accomplishments of LLMs like GPT-3 Brown et al. (2020) and LLaMA2 Touvron et al. (2023), LLMs have achieved impressive results in downstream tasks through in-context learning, including question answering, summarization, and machine translation. However, LLMs still struggle when executing specification-heavy tasks Peng et al. (2023), yielding results that are far from state-of-the-art. Hence, employing LLMs through in-context learning may not be the best solution for every task (e.g., EL).

LLMs as Context Augmenters for Downstream Tasks.

LLMs are primarily designed for text generation, which is their strongest advantage. Multiple studies have demonstrated that LLM-generated contexts are of high quality, outperforming contexts obtained from information retrieval methods Yu et al. (2022); Chen et al. (2023). Furthermore, compared to retrieved contexts, LLM-generated contexts contribute to better downstream task performance Chen et al. (2023). With such insights, a promising solution is to leverage LLMs as context enhancers, generating contexts for downstream tasks as additional input. Liu et al. (2022) propose using LLM context augmentation for commonsense reasoning, achieving state-of-the-art results on multiple reasoning tasks. Similarly, Balkus and Yan (2022) improve text classification with GPT-3 augmented data, yielding higher consistent accuracy on unseen examples.

3 Methodology

LLMaEL is a plug-and-play enhancement method for entity linking using LLMs as context augmenters. It comprises three building blocks. (1) Context augmentation, the most basic element of LLMaEL, prompts LLMs to generate enriched context with more information for entity linking. (2) Data fusion provides multiple strategies to integrate the LLM-generated content with the original context, aiming to improve diverse off-the-shelf EL models. (3) EL execution finally conducts the entity linking task. Figure 1 illustrates the overall workflow of LLMaEL.

3.1 Context Augmentation

In a nutshell, we rely on decoder-only LLMs to map the original context, along with an information-expansion instruction prompt, into an enhanced context. Formally, we denote LLM context generation as a function $c' = \text{LLM}(p, c, m_i)$, where $p$ is a specially designed prompt that instructs LLMs to perform context augmentation and $c'$ is the supplemented textual information for the $i^{\text{th}}$ mention $m_i$.

Prompt Design.

The main strategy for steering LLMs to augment context as expected is in-context learning Brown et al. (2020), which effectively constrains the output format of LLMs. Thus, our prompt includes two parts: (1) task specifications for expanding information, and (2) exemplars of paired original contexts and LLM-generated descriptions.

For task specification, we use the following template to ask LLMs to complete the sentence:

 Consider the following text.
 Text: [CONTEXT]
 Please provide me more descriptive
   information about [MENTION] from
   the text above.
 Make sure to include [MENTION] in
   your description.

where [CONTEXT] and [MENTION] are placeholders to be filled before feeding into LLMs. It is worth noting that our instruction requires LLMs to mention the entity again in the augmented context, which provides flexibility for data fusion.
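The template above can be filled programmatically. The following is a minimal sketch, not the authors' released code: the names `INSTRUCTION_TEMPLATE`, `build_instruction`, and `build_prompt` are hypothetical, and the way exemplars are laid out (each exemplar instruction followed by its description, separated by blank lines) is an assumption, since the exact layout is only given in the appendix tables.

```python
# Hypothetical helper names; the wording follows the template quoted above.
INSTRUCTION_TEMPLATE = (
    "Consider the following text.\n"
    "Text: {context}\n"
    "Please provide me more descriptive information about {mention} "
    "from the text above.\n"
    "Make sure to include {mention} in your description."
)


def build_instruction(context: str, mention: str) -> str:
    """Fill the [CONTEXT] and [MENTION] placeholders of the template."""
    if mention not in context:
        raise ValueError("the mention must occur in the context")
    return INSTRUCTION_TEMPLATE.format(context=context, mention=mention)


def build_prompt(context, mention, exemplars):
    """Prepend in-context exemplars to the instruction for the target mention.

    `exemplars` is a list of (context, mention, description) triples.
    The separator and ordering used here are assumptions.
    """
    parts = []
    for ex_context, ex_mention, ex_description in exemplars:
        parts.append(build_instruction(ex_context, ex_mention))
        parts.append(ex_description)
    parts.append(build_instruction(context, mention))
    return "\n\n".join(parts)
```

Because the instruction asks the LLM to repeat the mention, the returned description can later be searched for the mention span when the mention offset is taken from the LLM-generated context.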

For in-prompt demonstrations, we bootstrap examples via zero-shot prompting. To ensure the quality of these demonstrations, we first generate a sufficiently large amount of contexts via zero-shot prompting, and then manually filter out high-quality completions. The final prompt’s exemplars are then selected from this high-quality sample.

Due to the limited input size of LLMs, the final prompt includes three distinct examples. We show details of our prompt in Tables 6 and 7 in Appendix B.

3.2 Data Fusion

Data fusion designs strategies from multiple perspectives to incorporate LLM-augmented context $c'$. In particular, context-joining strategies fuse $c'$ with the original context $c$; EL model fine-tuning fuses the knowledge in $c'$ into the EL model; and ensemble fuses multiple LLM-augmented contexts.

| ID | Context Order | Mention Offset |
|---|---|---|
| 0 | LLM-only | LLM |
| 1 | LLM + Original | LLM |
| 2 | LLM + Original | Original |
| 3 | Original + LLM | LLM |
| 4 | Original + LLM | Original |

Table 1: Five context-joining strategies of LLMaEL, enumerating the arrangements of the original and LLM-generated contexts in the final augmented context, which is to be inputted to the EL model. Context Order denotes the sequential order of the original and LLM-generated contexts. Since the same mention appears at least twice in the augmented context (one in the original context and one in the LLM-generated context), Mention Offset specifies the final mention-span to be provided to the EL model.
Context-Joining Strategies.

The most direct way to integrate LLM-augmented context $c'$ with original context $c$ is to concatenate them and feed them into off-the-shelf EL models, which we denote as our vanilla implementation strategy for LLMaEL. To this end, we consider the following two design problems: (1) Context order. What sequence order should be adopted for concatenating the contexts $c$ and $c'$? (2) Mention offset. Given that the mention occurs in both $c$ and $c'$, which context’s spans should be utilized to refer to the mention when invoking EL models?

For LLMaEL, we design 5 potential context-joining strategies, as shown in Table 1. Joining strategy 0 uses LLM-generated $c'$ as a direct substitute for $c$, while joining strategies 1 to 4 present all 4 possible combinations over the distinct orders of the two contexts and the choice of the mention offset. We empirically find that different EL models perform best under different joining strategies, so we maintain the choice of context-joining strategy as a hyper-parameter, providing space for user adjustment across different settings.
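The five strategies in Table 1 can be sketched as a single function that concatenates the two contexts and returns the character span the EL model should treat as the mention. This is an illustrative sketch under stated assumptions: `MentionInContext`, `STRATEGIES`, and `join_contexts` are hypothetical names, mention spans are represented as character offsets, and a single space is assumed as the separator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentionInContext:
    text: str   # the context string
    start: int  # character offset where the mention starts
    end: int    # character offset where the mention ends (exclusive)


# strategy ID -> (context order, source of the mention offset), as in Table 1
STRATEGIES = {
    0: ("llm_only", "llm"),
    1: ("llm_first", "llm"),
    2: ("llm_first", "original"),
    3: ("original_first", "llm"),
    4: ("original_first", "original"),
}


def join_contexts(original, augmented, strategy_id, sep=" "):
    """Fuse the original and LLM-generated contexts under one strategy."""
    order, offset_source = STRATEGIES[strategy_id]
    if order == "llm_only":
        return augmented  # strategy 0: the LLM context replaces the original

    llm_first = order == "llm_first"
    first, second = (augmented, original) if llm_first else (original, augmented)
    use_llm_span = offset_source == "llm"
    chosen = augmented if use_llm_span else original

    # The chosen span needs shifting only if its context is placed second.
    chosen_is_first = use_llm_span == llm_first
    shift = 0 if chosen_is_first else len(first.text) + len(sep)
    return MentionInContext(
        text=first.text + sep + second.text,
        start=chosen.start + shift,
        end=chosen.end + shift,
    )
```

Treating the strategy ID as a plain argument mirrors the text's choice to keep it as a hyper-parameter that can be tuned per EL model.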

EL Model Fine-tuning.

While the vanilla LLMaEL already demonstrates performance enhancements, the style and distribution of the augmented context are unfamiliar to EL models. To mitigate the gap between the data distribution that EL models are familiar with and the LLM-augmented contexts, we further fine-tune existing EL models. Specifically, we utilize existing EL training datasets and run the context augmentation step in Section 3.1, generating mention-centered descriptions for the entire dataset using an LLM. Then, we augment the training set with the generated descriptions using the optimal context-joining strategy for the selected EL model. Finally, we apply this augmented training set to fine-tune the EL model.

Ensemble.

Inaccuracies in LLMaEL’s performance may occur when the LLM generates wrong mention descriptions, misguiding the EL model to select a wrong entity. Hence, we also expand LLMaEL with ensemble techniques, attempting to improve our method’s robustness through diversified sampling. We sample mention descriptions across multiple LLMs and evaluate the diversified samples through both hard-voting and soft-voting classifier methods.

3.3 EL Execution

In the final phase of EL execution, the EL model is employed to output the entity ID in the target knowledge base. Compared to directly tasking LLMs to perform entity linking, LLMaEL injects task specification knowledge using EL models. Additionally, it augments EL models with sufficient entity knowledge from the infused LLM data.
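The division of labour described above can be sketched as a three-stage function in which the LLM only writes text and the EL model alone produces the entity ID. This is a schematic sketch, not the authors' implementation: `llmael_link` and the three callables are hypothetical stand-ins for an LLM call, a context-joining strategy, and an off-the-shelf EL model.

```python
from typing import Callable


def llmael_link(
    context: str,
    mention: str,
    augment: Callable[[str, str], str],  # stage 1: LLM writes a description
    fuse: Callable[[str, str], str],     # stage 2: join original and LLM text
    link: Callable[[str, str], str],     # stage 3: EL model returns an entity ID
) -> str:
    """Run the three LLMaEL stages in order and return the linked entity ID."""
    description = augment(context, mention)
    fused_context = fuse(context, description)
    # Only the EL model emits an ID, so the LLM cannot hallucinate one.
    return link(fused_context, mention)


# Toy stand-ins, for illustration only.
def demo_augment(context: str, mention: str) -> str:
    return f"{mention} is a hypothetical description."


def demo_fuse(context: str, description: str) -> str:
    return f"{context} {description}"


def demo_link(fused_context: str, mention: str) -> str:
    return "Q42" if "hypothetical" in fused_context else "NIL"
```

Passing the stages in as callables reflects the plug-and-play framing: either the LLM or the EL model can be swapped without touching the other.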

4 Experiments

4.1 Experimental Setup

Datasets.

We evaluate LLMaEL on 6 standard EL datasets: AIDA-YAGO2 Hoffart et al. (2011), MSNBC Cucerzan (2007), AQUAINT Milne and Witten (2008), ACE04 Ratinov et al. (2011), WNED-CWEB Gabrilovich et al. (2013), and WNED-WIKI Guo and Barbosa (2018). The datasets are first augmented using our chosen LLM, then evaluated on our selected EL models.

Backbone Models for LLMaEL.

For our main experiments, we use Llama-3-70b-instruct as our backbone LLM, considering its outstanding performance on text generation. As LLMaEL is a plug-and-play framework for any EL model, we select three of the most widely adopted EL models as our backbones: BLINK Wu et al. (2020), a classical bi-encoder cross-encoder EL model; GENRE Cao et al. (2021), an autoregressive generative EL solution; and ReFinED Ayoola et al. (2022), an enhanced EL method using entity types and descriptions. To implement these models, we utilize their original implementations: the full BLINK model (BLINK’s full cross-encoder model), the fairseq-AIDA GENRE model (the GENRE model developed using the fairseq toolkit and officially fine-tuned on AIDA-YAGO2), and the AIDA ReFinED model (the ReFinED model officially fine-tuned on AIDA-YAGO2). For unified implementation, we follow BLINK and ReFinED to execute GENRE without the candidate set. Our implementation details are included in Appendix A.

Baselines.

We compare LLMaEL with two categories of baselines: (1) LLMs for EL. We leverage Llama-3-70b-instruct AI@Meta (2024) to execute the EL task directly. We provide the LLM with a few-shot prompt that includes paired examples of mention contexts and gold entity IDs. The concrete prompt is included in Table 8 of Appendix B. (2) Traditional EL models. We run each of our three backbone EL models on the original datasets, without any LLM data augmentation.

Evaluation Metrics.

We use disambiguation accuracy as our evaluation metric. The unweighted macro average over all test sets is also reported.
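As a concrete reading of these two metrics, the sketch below assumes that disambiguation accuracy is the percentage of mentions whose predicted entity ID equals the gold ID, and that the unweighted macro average is the plain mean of the per-dataset accuracies. The function names are hypothetical.

```python
def disambiguation_accuracy(predictions, gold):
    """Percentage of mentions whose predicted entity ID equals the gold ID."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one mention is required")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def macro_average(per_dataset_scores):
    """Unweighted mean over datasets: each dataset counts equally,
    regardless of how many mentions it contains."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


# Per-dataset scores of the "ReFinED only" row reported in Table 2.
refined_only = {
    "AIDA": 92.25, "MSNBC": 87.10, "AQUA": 87.53,
    "ACE04": 87.75, "CWEB": 72.96, "WIKI": 85.18,
}
print(round(macro_average(refined_only), 2))  # 85.46, matching the AVG column
```

Because the average is unweighted, a small dataset such as ACE04 influences the AVG column as much as a large one such as WNED-CWEB.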

The Unified Context-Joining Strategy.

For all implemented LLMaEL variants, we use the development (dev) subset of AIDA-YAGO2 to select the optimal context-joining strategy. In the main experiments detailed in Section 4.2.1, we adopt a unified strategy, strategy 4, that yields the highest average accuracy across all EL models. We hypothesize that this strategy outperforms others because most EL models are more adept at processing original contexts, thus performing better when LLM-generated contexts are placed towards the end. Interestingly, different EL models may show optimal performance with different joining strategies. BLINK’s optimal test-time strategy diverges from that of ReFinED and GENRE, and its optimal testing outcomes are presented in Section 4.2.2.

Fine-tuning.

We select our best-performing EL model, ReFinED, for model fine-tuning. We use the train and dev splits from the AIDA-YAGO2 dataset as our training and evaluation data. To avoid model over-fitting on AIDA-YAGO2, we fine-tune ReFinED’s wikipedia model (the ReFinED model that is not officially fine-tuned on AIDA-YAGO2). Specifically, we first employ Llama-3-70b-instruct to augment the datasets under the model’s optimal context-joining strategy, then apply the augmented datasets to the fine-tuning process.

4.2 Experimental Results

4.2.1 Main Results

| Method | AIDA | MSNBC | AQUA | ACE04 | CWEB | WIKI | AVG |
|---|---|---|---|---|---|---|---|
| LLM only | 78.37 | 80.49 | 73.18 | 83.27 | 65.34 | 64.44 | 74.18 |
| BLINK only | 82.01 | 86.23 | 85.16 | 86.01 | 69.11 | 81.11 | 81.61 |
| GENRE only | 87.92 | 83.54 | 84.32 | 84.82 | 68.75 | 83.02 | 82.06 |
| ReFinED only | 92.25 | <u>87.10</u> | 87.53 | <u>87.75</u> | 72.96 | 85.18 | 85.46 |
| LLMaEL × BLINK | 81.94 | 86.56 | 85.16 | 86.01 | 69.17 | 81.14 | 81.66 |
| LLMaEL × GENRE | 88.27 | 85.67 | 85.14 | 85.21 | 70.67 | 82.95 | 82.99 |
| LLMaEL × ReFinED | **92.38** | 86.94 | <u>88.09</u> | **88.14** | <u>73.16</u> | <u>85.90</u> | <u>85.76</u> |
| LLMaEL × $\text{ReFinED}_{\text{FT}}$ | <u>92.34</u> | **88.79** | **89.06** | **88.14** | **75.07** | **86.62** | **86.67** |

Table 2: Disambiguation accuracy scores across six test sets. The best value is in bold and the second best is underlined. All models that involve BLINK, GENRE, or ReFinED are tested with official scripts provided by each model’s respective authors. AIDA refers to the test split of the AIDA-YAGO2 dataset. $\text{ReFinED}_{\text{FT}}$ corresponds to our custom fine-tuned version of ReFinED. The GENRE model is used without candidate sets.

We compare LLMaEL with baselines and report the results in Table 2. For LLMaEL, we evaluate the vanilla implementation of LLMaEL with the optimal context-joining strategy selected using the dev set of AIDA-YAGO2. We further fine-tune LLMaEL with the best performing backbone EL model (ReFinED).

We find that even the vanilla implementation of LLMaEL uniformly improves the average accuracy of the original EL backbones. For each EL model, the vanilla LLMaEL matches or improves accuracy on 5 of the 6 datasets, with LLMaEL × GENRE outperforming its EL model backbone by $0.93\%$ on average.

The fine-tuned LLMaEL yields new state-of-the-art results over all six datasets, surpassing the original ReFinED backbone with a $1.21\%$ accuracy gain while boosting the vanilla LLMaEL × ReFinED with a $0.91\%$ accuracy gain. This supports our hypothesis that fine-tuning further amplifies our method’s performance, as it better aligns EL models with the distribution characteristics of LLM-augmented contexts.

4.2.2 Ablations

Model-Specific Context-Joining Strategies.
| Method | ID | AVG acc. |
|---|---|---|
| BLINK only | - | 81.61 |
| LLMaEL × BLINK | 4 | 81.66 |
| LLMaEL × BLINK | 1* | 83.53 |

Table 3: Performance of the vanilla LLMaEL combined with BLINK under the unified strategy 4 and BLINK’s own optimal test-time strategy. The ID column refers to the selected context-joining strategy ID, while the AVG acc. column presents the unweighted macro average disambiguation accuracy score over all 6 test sets. The best average accuracy score is in bold. Refer to Table 1 for detailed descriptions of all 5 joining strategies.

In Table 2, we employ the unified context-joining strategy, specifically strategy 4, chosen for its highest average accuracy score on the AIDA-dev dataset across all 3 EL models. Although strategy 4 proves to be the most effective for ReFinED and GENRE during testing, it is not the optimal joining strategy for BLINK.

Table 3 presents the optimal joining strategy for BLINK at test time, alongside its average accuracy score across all six datasets. Adopting this model-specific optimal strategy improves average accuracy by $1.87\%$ over the unified strategy.

Intriguingly, BLINK’s optimal test-time strategy (strategy 1) uses exactly the opposite settings from the unified strategy 4: the LLM-generated context comes first, and the mention offset refers to the LLM-generated context. We hypothesize that the reliance on AIDA-dev for selecting the optimal joining strategy might be a contributing factor. Given that BLINK, unlike the other two EL models, is not fine-tuned on the AIDA dataset, it may be poorly matched to the textual distribution of the AIDA datasets. Consequently, BLINK’s performance on the AIDA-dev dataset does not accurately reflect its true preferences and capabilities.

Choosing Among LLMs.
| Setting | EL Model | LLM(s) | MSNBC | AQUA | ACE04 | WIKI | AVG |
|---|---|---|---|---|---|---|---|
| Single | ReFinED | - | 87.10 | 87.53 | 87.75 | 85.18 | 86.89 |
| Single | ReFinED | Llama-3-70b-instruct | 86.94 | 88.09 | **88.14** | 85.90 | 87.27 |
| Single | ReFinED | GPT-3.5-turbo-instruct | 86.94 | 88.23 | **88.14** | 85.60 | 87.23 |
| Single | ReFinED | GLM-4 | 86.94 | 87.95 | **88.14** | 85.75 | 87.20 |
| Single | $\text{ReFinED}_{\text{FT}}$ | - | 89.40 | 89.20 | **88.14** | 85.93 | 88.17 |
| Single | $\text{ReFinED}_{\text{FT}}$ | Llama-3-70b-instruct | 88.79 | 89.06 | **88.14** | **86.62** | 88.15 |
| Single | $\text{ReFinED}_{\text{FT}}$ | GPT-3.5-turbo-instruct | 89.40 | **89.47** | **88.14** | 86.28 | 88.32 |
| Single | $\text{ReFinED}_{\text{FT}}$ | GLM-4 | **89.40** | 89.20 | **88.14** | 86.24 | 88.25 |
| Multi | ReFinED | Hard-voting ensemble | 86.94 | 87.95 | **88.14** | 85.90 | 87.23 |
| Multi | ReFinED | Soft-voting ensemble | 86.94 | 87.95 | **88.14** | 85.75 | 87.20 |
| Multi | $\text{ReFinED}_{\text{FT}}$ | Hard-voting ensemble | 89.09 | **89.47** | **88.14** | **86.62** | **88.33** |
| Multi | $\text{ReFinED}_{\text{FT}}$ | Soft-voting ensemble | **89.40** | 89.34 | **88.14** | 86.25 | 88.28 |

Table 4: Disambiguation accuracy scores across four selected datasets, where LLMaEL applies different LLMs and ensemble techniques. The Single portion presents the EL model’s performance on its own and after its integration with three individual LLMs. The Multi portion presents results obtained by the ensemble of all four outputs. The best value for each dataset is in bold.

LLMaEL is adaptable to any LLM. In this section, we implement LLMaEL using two other widely recognized LLMs, namely GPT-3.5-turbo OpenAI (2023) and GLM-4 Du et al. (2022). Table 4 presents our results in the upper half labeled Single.

For the vanilla LLMaEL (i.e., ReFinED), all three LLMs yield an average performance improvement. Among them, Llama-3-70b-instruct yields the largest overall improvement, achieving an average accuracy of $87.27\%$. GPT-3.5-turbo-instruct and GLM-4 demonstrate comparable performance, yielding average improvements of at least $0.31\%$.

The fine-tuned LLMaEL (i.e., $\text{ReFinED}_{\text{FT}}$) also demonstrates excellent compatibility across various LLMs. With GPT-3.5-turbo-instruct, $\text{ReFinED}_{\text{FT}}$ achieves an average accuracy of $88.32\%$, a $1.43\%$ improvement over the original context and model.

It is also noteworthy that $\text{ReFinED}_{\text{FT}}$ shows a significant improvement even when applied to the original contexts, with an average gain of $1.28\%$.

Ensemble.

We use both hard-voting and soft-voting classifiers to perform ensemble. The hard-voting classifier is executed by selecting the most frequent outcome among multiple independently-generated results. In instances where multiple results share an equivalent frequency, the result with the highest probability level is selected. Conversely, the soft-voting classifier selects the final answer by aggregating the probabilities of all outcomes.
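The two voting rules described above can be sketched as follows. The sketch assumes each ensemble member contributes a single top prediction as an `(entity_id, probability)` pair and that soft voting aggregates by summing probabilities per entity; the function names and this input format are assumptions rather than details taken from the paper.

```python
from collections import Counter, defaultdict


def hard_vote(predictions):
    """Return the most frequent entity; break ties by highest probability."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(entity for entity, _ in predictions)
    top_count = max(counts.values())
    tied = {entity for entity, n in counts.items() if n == top_count}
    # Among the tied entities, pick the one backed by the most confident vote.
    best_entity, _ = max(
        ((entity, prob) for entity, prob in predictions if entity in tied),
        key=lambda pair: pair[1],
    )
    return best_entity


def soft_vote(predictions):
    """Return the entity with the largest summed probability."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    totals = defaultdict(float)
    for entity, prob in predictions:
        totals[entity] += prob
    return max(totals, key=totals.get)


# Two low-confidence votes for Q1 against one high-confidence vote for Q2.
votes = [("Q1", 0.35), ("Q1", 0.30), ("Q2", 0.95)]
print(hard_vote(votes))  # Q1: wins on vote count
print(soft_vote(votes))  # Q2: wins on summed probability (0.95 > 0.65)
```

The example shows the case where the two rules disagree: hard voting follows the majority, while soft voting lets one confident member outweigh several uncertain ones.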

The Multi half of Table 4 illustrates our ensemble results. Both ReFinED and $\text{ReFinED}_{\text{FT}}$ are improved by the implementation of ensemble techniques. For ReFinED, the hard-voting ensemble achieves an average accuracy of $87.23\%$, higher than the $87.20\%$ obtained by the soft-voting classifier. This is because the hard-voting classifier is particularly effective when the performance of individual models is diverse. For most datasets, the original ReFinED model yields results that are noticeably different from those of the other 3 LLM-enhanced models, contributing to the diversity of model performance. Meanwhile, when the performance of single models is relatively uniform, both ensemble methods (hard- and soft-voting classifiers) exhibit comparable effectiveness. This phenomenon is evident in the results for $\text{ReFinED}_{\text{FT}}$, where the hard- and soft-voting classifiers present nearly equal average accuracies ($88.33\%$ and $88.28\%$). Moreover, the hard-voting classifier under $\text{ReFinED}_{\text{FT}}$ achieves the best average performance of $88.33\%$.

4.3 Discussions

We delve deeper into LLMaEL by examining the following two discussion questions.

4.3.1 Does LLM-Augmented Data Improve EL Performance Over Long-Tail Entities?

Figure 2: EL performance across entities of different frequencies. The green line illustrates the performance of the original ReFinED model applied to the original datasets. The purple lines illustrate the performance of our custom fine-tuned $\text{ReFinED}_{\text{FT}}$ model using LLM-augmented datasets.

Entities vary in frequency, depending on how often they are referenced in the real world. EL models tend to perform better on high-frequency entities and poorly on low-frequency entities due to their limited training data.

LLMs possess more entity knowledge compared to EL models, which can be transferred to EL models through LLM data augmentation. Hence, we hypothesize that a core contributor to LLMaEL’s effectiveness is its ability to enhance EL models over long-tail entities—entities that possess low frequencies.

To investigate this, we evaluate LLMaEL across entities of different frequencies. We select four datasets—MSNBC, AQUAINT, ACE04, and WNED-WIKI—to calculate the cross-dataset accuracy of each contained gold entity. For each entity, we assign its corresponding PageRank value from Wikidata5M Wang et al. (2021) as its frequency. To simplify visualization, we normalize the frequencies using a base-10 logarithmic scale. Finally, we categorize all entities into seven buckets, each bucket comprising the entities that share the same integer part in their normalized frequencies.
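The bucketing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis script behind Figure 2: the function name and input layout are hypothetical, and the "integer part" of a (negative) log-frequency is taken to be its floor, so that a bucket labelled -7 covers PageRank values from $10^{-7}$ up to, but not including, $10^{-6}$.

```python
import math
from collections import defaultdict


def bucket_accuracy(outcomes, pagerank):
    """Average per-entity accuracy within log10-frequency buckets.

    outcomes: list of (entity_id, is_correct) pairs pooled across datasets.
    pagerank: dict mapping entity_id to a positive PageRank value.
    """
    # Step 1: cross-dataset accuracy of each gold entity.
    per_entity = defaultdict(list)
    for entity, is_correct in outcomes:
        per_entity[entity].append(1.0 if is_correct else 0.0)

    # Step 2: group entities by the floor of log10(PageRank).
    buckets = defaultdict(list)
    for entity, results in per_entity.items():
        bucket = math.floor(math.log10(pagerank[entity]))
        buckets[bucket].append(sum(results) / len(results))

    # Step 3: average the entity accuracies inside each bucket.
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

Averaging per entity first, rather than per mention, keeps a single frequently mentioned entity from dominating its bucket.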

Our findings are illustrated in Figure 2, where the horizontal axis presents normalized entity frequencies and the vertical axis presents the average accuracies of each entity bucket. LLMaEL improves the accuracy of entities with mid-to-low frequencies within the range of $10^{-6}$ to $10^{-2}$, and refines the accuracy of entities with extremely low frequencies in the range of $10^{-7}$ to $10^{-6}$.

Such results align with our hypothesis that LLM data augmentation enhances EL performance for long-tail entities. Furthermore, the results indicate that LLMs also improve performance for mid-tail entities. This improvement is likely because the LLM-generated data offers condensed mention information, thereby reducing the noise present in the original contexts.

4.3.2 Is There a Better Way to Leverage LLMs for EL?

Method AIDA MSNBC AQUA ACE04
BLINK only 82.01 86.23 85.16 86.01
LLM only 78.37 80.49 73.18 83.27
Re-rank-100 70.95 82.01 73.18 74.32
Re-rank-10 73.24 80.18 73.45 82.88
Table 5: Performance of BLINK and three configurations of LLM executing EL tasks, where the LLM employed is Llama-3-70b-instruct.

Considering that many EL models, such as BLINK, operate by first retrieving candidate entities and then re-ranking them, a practical approach is to use EL models for candidate retrieval and LLMs for re-ranking. In this section, we explore whether this is a more effective way to leverage LLMs for EL.

We use BLINK’s bi-encoder for candidate retrieval and Llama-3-70b-instruct for re-ranking. We establish two re-ranking settings:

  • Re-rank-100: Extract the top 100 candidate entities of BLINK’s bi-encoder and task the LLM to select the final entity.

  • Re-rank-10: Extract the top 10 candidate entities of BLINK’s bi-encoder and augment each candidate with its Wikipedia abstract. The LLM selects the final entity using the abstracts as supplementary information.
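As a rough sketch of how the two settings differ on the prompt side, the snippet below formats a candidate list into an options block modelled on the layouts shown in Tables 9 and 10 (numbered options without abstracts for Re-rank-100, lettered options with abstracts for Re-rank-10). The function name and the candidate dictionary keys are hypothetical, and retrieval itself is assumed to have already happened.

```python
import string


def format_options(candidates, with_abstracts=False):
    """Render retrieved candidates as an 'Options:' block for the LLM.

    candidates: list of dicts with 'title', 'url' and, when
    with_abstracts is True, 'abstract' (hypothetical key names).
    """
    lines = []
    for i, cand in enumerate(candidates):
        if with_abstracts:
            # Re-rank-10 layout: lettered options, abstract appended.
            label = string.ascii_lowercase[i]
            fields = [cand["title"], cand["url"], cand["abstract"]]
        else:
            # Re-rank-100 layout: numbered options, title and URL only.
            label = str(i + 1)
            fields = [cand["title"], cand["url"]]
        lines.append(f"({label}): {fields}")
    return "Options:\n" + "\n".join(lines)
```

Lettered labels only work for up to 26 candidates, which is sufficient for the 10-candidate setting.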

As shown in Table 5, applying LLMs for entity re-ranking does not enhance EL performance; in fact, it is in most cases even less accurate than directly leveraging LLMs for EL. We observe two primary reasons for this inefficacy. First, the presence of similar candidate names confuses the LLM. Unlike asking LLMs to directly generate entity names for mentions, asking them to perform re-ranking requires them to discern the subtle distinctions among candidates. As highlighted by Peng et al. (2023), LLMs struggle to understand and distinguish complex contexts, leading to diminished performance. Second, presenting the LLM with multiple candidates often causes it to spread its focus across the entire context rather than concentrating on the specific mention. As a result, the LLM prioritizes information that is distant from and unrelated to the mention.

The suboptimal performance of LLMs used as either direct EL executors or entity re-rankers underscores that, among the strategies we examine, LLMaEL’s context augmentation approach is the most effective way to harness LLMs for the EL task.

5 Conclusion

This paper presents LLMaEL, a lightweight and flexible pipeline approach to enhance entity linking through LLM data augmentation. It leverages the strengths of both EL models and LLMs with minimal costs, yielding promising results without the need for any LLM training. Furthermore, it offers advanced data fusion options. We hope our work can provide new insights for entity linking tasks in the LLM era.

Ethical Considerations

Entity linking is a well-established task that aims to bridge textual data and structured data (e.g., knowledge bases). This work follows this setting and aims to provide a more accurate EL method. As the proposed methodology, LLMaEL, is our main contribution, we discuss its potential misuse here.

Potential Misuse.

The risk of misusing LLMaEL is the same as for all other EL models, such as using entity linking models to decorate generated fake content with references that appear correct but are actually wrong. Moreover, we would like to point out that LLMaEL allows a third-party LLM to be used to augment EL data. If that LLM is jailbroken or hacked to produce misinformation, the result would be a cascading failure of LLMaEL.

Possible Biases.

LLMs carry potential risks of generating biased or harmful content. Since our approach relies on LLMs for context generation, our pipeline method and fine-tuned model could inherit existing biases present in the LLMs’ model weights.

Environmental Impact.

The model inference and EL model fine-tuning phases of LLMaEL incur energy and carbon costs. However, compared to methods that leverage LLMs through LLM fine-tuning, our method consumes less energy.

Limitations

As a pipeline method, LLMaEL relies heavily on the abilities of its selected EL models and LLMs. Yet, the most advanced LLMs currently available are commercial products, which incur costs for each API call. Furthermore, some LLMs are subject to access restrictions. For instance, GPT-4 is not included in our experiments due to our limited access to the model.

References

Appendix A Reproducibility Details

Datasets.

For each EL model, we download its official training and testing datasets from its GitHub repository (BLINK: https://github.com/facebookresearch/BLINK; GENRE: https://github.com/facebookresearch/GENRE; ReFinED: https://github.com/amazon-science/ReFinED). Each model provides an official version of our 8 selected datasets (MSNBC, AQUAINT, ACE2004, WNED-CWEB, WNED-WIKI, and the train, dev, and test splits of AIDA-YAGO2). Since ReFinED is the only model that supports NIL entities, ReFinED’s official 8 datasets contain the largest number of entries. Hence, we obtain LLM-generated context for each of ReFinED’s 8 datasets. Then, we map these generated contexts to the official datasets of the other two EL models, ensuring correct alignment for each data entry.

Context-Joining.

For joining strategies that incorporate context combination, contexts are merged using a newline symbol "\n". In the case of BLINK and ReFinED, contexts are fully combined in the specified order, without any truncation applied. In the case of GENRE, the contexts are first fully combined, then trimmed to the model’s maximum input sequence length.
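The joining rule can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the optional limit is expressed in characters as a simple stand-in, whereas trimming for GENRE is described above in terms of the model’s maximum input sequence length.

```python
def join_contexts(contexts, max_chars=None):
    """Join context strings in the given order with a newline.

    max_chars is a character-level stand-in (an assumption) for a
    model-specific input length limit; None means no truncation.
    """
    joined = "\n".join(contexts)
    return joined if max_chars is None else joined[:max_chars]
```

Calling `join_contexts([original, generated])` corresponds to the untruncated BLINK and ReFinED case, while passing `max_chars` mimics the combine-then-trim order used for GENRE.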

Testing Scripts.

For all experiments that incorporate an EL model, we augment the model’s official datasets with LLM-generated context, then use the model’s official testing script to produce our results.

LLM Hyperparameters.

For all LLM-generated responses, we set the maximum number of tokens to 150. We set the temperature of Llama-3-70b-instruct and GLM-4 to 0.01 and the temperature of GPT-3.5-turbo-instruct to 0. We adopt default values for all other hyperparameters.
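For convenience, these settings can be collected in a small configuration mapping. The values come from the paragraph above; the dictionary layout and the key names `temperature` and `max_tokens` are assumptions and may need to be renamed to match whichever client library is used to call each model.

```python
MAX_TOKENS = 150  # shared cap on generated tokens

# Per-model decoding settings; all other hyperparameters stay at defaults.
LLM_SETTINGS = {
    "Llama-3-70b-instruct": {"temperature": 0.01, "max_tokens": MAX_TOKENS},
    "GLM-4": {"temperature": 0.01, "max_tokens": MAX_TOKENS},
    "GPT-3.5-turbo-instruct": {"temperature": 0.0, "max_tokens": MAX_TOKENS},
}
```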

Appendix B LLM Prompts and Example Responses

Tables 6 to 10 illustrate all our LLM prompts along with sample inputs and responses.

Prompt
Consider the following text.
Text: Nearly 17 months after he first issued his call for a “fresh start after a season of cynicism”, Gov. George W. Bush ended his quest for the presidency Monday on a nearly identical note, pledging to purge { Washington } of what he cast as a crippling discord. The Texas governor claimed that Gore’s decades of experience in Washington had estranged him from the rest of the country by making him too trusting of federal government and too fond of federal spending. “My opponent vows to carry his home state”, Bush said. “ He may win Washington, D.C., but he’s not going to win Tennessee. “He forgot his roots”, Bush added. “He forgot where he’s from. He trusts Washington. We trust the people.”
Please provide me more descriptive information about { Washington } from the text above. Make sure to include Washington in your description.
Answer:
Response
Washington is the capital of the United States and the seat of the federal government. It is located on the Potomac River, between Maryland and Virginia. It is home to numerous monuments, memorials, and government buildings, including the White House, the Capitol Building, and the Supreme Court.
Table 6: Our zero-shot prompt with an example input and response. The mention and context inputs are from the MSNBC dataset, and the response is generated by our backbone LLM. Our zero-shot prompt is used to generate exemplars for our three-shot prompt.
Prompt
Example 1. Consider the following text.
Text: Nearly 17 months after he first issued his call for a “fresh start after a season of cynicism”, Gov. George W. Bush ended his quest for the presidency Monday on a nearly identical note, pledging to purge { Washington } of what he cast as a crippling discord. The Texas governor claimed that Gore’s decades of experience in Washington had estranged him from the rest of the country by making him too trusting of federal government and too fond of federal spending. “My opponent vows to carry his home state”, Bush said. “ He may win Washington, D.C., but he’s not going to win Tennessee. “He forgot his roots”, Bush added. “He forgot where he’s from. He trusts Washington. We trust the people.”
Please provide me more descriptive information about { Washington } from the text above.
Answer:
Washington is the capital of the United States and the seat of the federal government. It is located on the Potomac River, between Maryland and Virginia. It is home to numerous monuments, memorials, and government buildings, including the White House, the Capitol Building, and the Supreme Court.
Example 2. Consider the following text.
Text: O’Donnell and Trump have been feuding since he announced last month that Miss USA Tara Conner, whose title had been in jeopardy because of underage drinking, would keep her crown. Trump is the owner of the Miss Universe Organization, which includes Miss USA and Miss Teen USA. The 44-year-old outspoken moderator of “The View”, who joined the show in September, said Trump’s news conference with { Conner } had annoyed her “on a multitude of levels and that the twice-divorced real estate mogul had no right to be “the moral compass for 20-year-olds in America”. Trump fired back, calling O’Donnell a “loser” and a “bully”, among other insults, in various media interviews.
Please provide me more descriptive information about Conner from the text above.
Answer:
Conner is the Miss USA titleholder whose title was in jeopardy due to underage drinking. She was saved from losing her crown by Donald Trump, the owner of the Miss Universe Organization, which includes Miss USA and Miss Teen USA. Tara Conner was given a second chance by Trump and was allowed to keep her crown.
Example 3. Consider the following text.
Text: Scottish Labour Party narrowly backs referendum. STIRLING, Scotland 1996-08-31 British Labour Party leader Tony Blair won a narrow victory on Saturday when the party’s Scottish executive voted 21-18 in favour of his plans for a referendum on a separate parliament for Scotland. Blair once pledged to set up a Scottish parliament if the Labour won the next general election, which must be held by May 1997. Prime Minister John Major says the 300-year-old union of the Scottish and English parliaments will be a main plank in his Conservative Party’s election platform. Conservatives have only 10 of the 72 Scottish seats in parliament and consistently run third in opinion polls in Scotland behind { Labour } and the independence-seeking Scottish National Party.
Please provide me more descriptive information about { Labour } from the text above.
Answer:
The Labour Party is a centre-left political party in the United Kingdom. It is the main opposition party to the Conservative Party and is led by Tony Blair. The Labour Party has traditionally been strong in Scotland, and the Scottish Labour Party is a branch of the UK Labour Party. In the text, the Scottish Labour Party narrowly voted in favour of Tony Blair’s plans for a referendum on a separate parliament for Scotland.
Now consider the following text.
Text: left context + { mention } + right context
Please provide me more descriptive information about { mention } from the text above.
Answer:
Table 7: Our three-shot prompt template, used for our main experiments detailed in Section 4. All three exemplars are obtained by employing our zero-shot prompt presented in Table 6 over our backbone LLM. Due to space limitations, this prompt does not provide an example input and response. In actual use, left context, right context, and mention are filled with their corresponding input entries.
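Filling the template in Table 7 amounts to simple string assembly, sketched below. The function names are hypothetical, the exemplar blocks are assumed to be passed in as ready-made strings, and the whitespace handling around the braces (a single space on each side of the marked mention) is an assumption.

```python
def build_query(left, mention, right):
    """Fill the final (query) block of the three-shot template."""
    marked = "{ " + mention + " }"
    # Drop empty sides so a mention at the start or end has no stray space.
    parts = [p for p in (left.strip(), marked, right.strip()) if p]
    text = " ".join(parts)
    return (
        "Now consider the following text.\n"
        f"Text: {text}\n"
        f"Please provide me more descriptive information about {marked} "
        "from the text above.\n"
        "Answer:"
    )


def build_prompt(exemplars, left, mention, right):
    """Prepend the fixed exemplar blocks to the filled query block."""
    return "\n".join([*exemplars, build_query(left, mention, right)])
```

Keeping the exemplars as a separate argument makes it straightforward to swap in a different set of demonstrations without touching the query formatting.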
Prompt
Gives the text and mentions within the text highlighted by <MENTION> and </MENTION>. Please give which page in Wikipedia this mention is most likely to be? Please answer me directly in this form: "mention":"Wikipedia page url".
Text: Having caught the popular attention and with goodwill at a high-point , Nelsonic was able to obtain licensing from several big-name video game companies such as Sega , Nintendo ,<MENTION> Midway Games </MENTION>, and Mylstar Electronics .
Answer: "Midway Games": "https://en.wikipedia.org/wiki/Midway_Games"
Text: State Highway 110 or SH 110 is a state highway in the U.S. state of Texas that runs from Grand Saline to Rusk . SH 110 begins at an intersection with and in downtown Rusk and leaves the courthouse square north with US 84 , crossing on its way to a split on the northeast side of Rusk where US 84 goes off east and SH 110 turns north , out of town . The road passes <MENTION> Ponta </MENTION> and New Summerfield before crossing the county line into Smith County as it enters Troup . After a brief downtown multiplex with SH 135 , SH 110 leaves Troup going northwest through Whitehouse on its way to Tyler .
Answer: "Ponta": "https://en.wikipedia.org/wiki/Ponta,_Texas"
Text: Messier 49 ( also known as M 49 or NGC 4472 ) is an elliptical galaxy located about away in the equatorial <MENTION> constellation </MENTION> of Virgo . This galaxy was discovered by French astronomer Charles Messier on February 19 , 1771 .
Answer: "constellation": "https://en.wikipedia.org/wiki/Constellation"
Text: <MENTION> Xinhua News Agency </MENTION> , Shanghai , April 3rd , by reporter Jierong Zhou Recently , HSBC has moved its Shanghai branch to the China Shipping Mansion in the Pudong Lujiazui financial trading district , becoming the third foreign capital bank to be approved to operate RMB business and shift to Pudong. …
Answer:
Response
"Xinhua News Agency": "https://en.wikipedia.org/wiki/Xinhua_News_Agency"
Table 8: An example of prompt and response for LLM direct execution of EL tasks, from the ACE04 dataset. The prompt includes an instruction, three demonstrations, and a question to be addressed. Due to space limitations, we have omitted the text in the question part during presentation.
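Scoring a response in the format requested by this prompt requires mapping the returned URL back to a page title. The sketch below shows one possible way to do so; it is an assumption about post-processing rather than a description of the evaluation code, and the function and pattern names are hypothetical.

```python
import re
from urllib.parse import unquote

# Matches: "mention": "https://en.wikipedia.org/wiki/Page_Title"
ANSWER_PATTERN = re.compile(
    r'"(?P<mention>[^"]*)"\s*:\s*'
    r'"https?://en\.wikipedia\.org/wiki/(?P<title>[^"\s]+)"'
)


def parse_direct_answer(response):
    """Return (mention, page_title), or None if the format is not followed."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None
    # Decode percent-escapes and turn URL underscores back into spaces.
    title = unquote(match.group("title")).replace("_", " ")
    return match.group("mention"), title
```

Returning `None` for malformed answers lets the caller count such responses as errors instead of failing mid-evaluation.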
Prompt
Instruction: Gives the text and mentions within the text highlighted by <MENTION> and </MENTION>. Please select from the options below which Wikipedia page this mention is most likely to be from? Please answer me directly in this form: "(letter): Wikipedia entity name and url".And I also want you to give an explanation in the next line.
Options:
(1): [’Time Warner Interactive’, ’https://en.wikipedia.org/wiki?curid=12642915’]
(2): [’TT Games’, ’https://en.wikipedia.org/wiki?curid=49108324’]
(3): [’Atari Games’, ’https://en.wikipedia.org/wiki?curid=304833’]
(4): [’Midway Games’, ’https://en.wikipedia.org/wiki?curid=430266’]
(5): [’Vivendi Games’, ’https://en.wikipedia.org/wiki?curid=6573837’]
Text: Having caught the popular attention and with goodwill at a high-point , Nelsonic was able to obtain licensing from several big-name video game companies such as Sega , Nintendo ,<MENTION> Midway Games </MENTION>, and Mylstar Electronics .
Answer: (4): [’Midway Games’, ’https://en.wikipedia.org/wiki?curid=430266’]
Explanation: The mention "<MENTION> Midway Games </MENTION>" in the provided text is most likely from the Wikipedia page for Midway Games. Midway Games is mentioned in the text as one of the big-name video game companies from which Nelsonic obtained licensing. The description of Midway Games in option (4) matches the context in the text, making it the most likely source.
Options: (1): [’İhlas News Agency’, ’https://en.wikipedia.org/wiki?curid=51484970’]
(2): [’Kuwait News Agency’, ’https://en.wikipedia.org/wiki?curid=3824465’]
(3): [’Vietnam News Agency’, ’https://en.wikipedia.org/wiki?curid=11835464’]
(4): [’Hai Xia’, ’https://en.wikipedia.org/wiki?curid=41850118’]
(5): [’Saudi Press Agency’, ’https://en.wikipedia.org/wiki?curid=2902711’]
(99): [’BNO News’, ’https://en.wikipedia.org/wiki?curid=13403728’]
(100): [’Beat Holdings’, ’https://en.wikipedia.org/wiki?curid=3536753’]
Text: <MENTION> Xinhua News Agency </MENTION> , Shanghai , April 3rd , by reporter Jierong Zhou Recently , HSBC has moved its Shanghai branch to the China Shipping Mansion in the Pudong Lujiazui financial trading district , becoming the third foreign capital bank to be approved to operate RMB business and shift to Pudong. …
Answer:
Response
(53): [’Xinhua News Agency’, ’https://en.wikipedia.org/wiki?curid=263168’]
Explanation: The mention "<MENTION> Xinhua News Agency </MENTION>" in the provided text is most likely from the Wikipedia page for Xinhua News Agency. The text mentions Xinhua News Agency as the source of the news report, and the description of Xinhua News Agency in option (53) matches the context in the text, making it the most likely source.
Table 9: An example of prompt and response for the LLM to re-rank the returned retrieval results in the Re-rank-100 setting, from the ACE04 dataset. This prompt contains an instruction, a demonstration, and a question. For simplicity in presentation, we have omitted some options and text and replaced them with ellipses.
Prompt
Gives the text and mentions within the text highlighted by <MENTION> and </MENTION>. Please select from the options below which Wikipedia page this mention is most likely to be from? Please answer me directly in this form: "(letter): Wikipedia entity name and url".And I also want you to give an explanation in the next line.
Options:
(a): [’TT Games’, ’https://en.wikipedia.org/wiki?curid=49108324’, ’TT Games Limited is a British holding company and a subsidiary of Warner Bros. Games. …’]
(b): [’Atari Games’, ’https://en.wikipedia.org/wiki?curid=304833’, ’Atari Games Corporation, known as Midway Games West Inc. after 1999, was an American producer of arcade games. …’]
(c): [’Midway Games’, ’https://en.wikipedia.org/wiki?curid=430266’, ’Midway Games Inc., known previously as Midway Manufacturing and Bally Midway, and commonly known as simply Midway, was an American video game developer and publisher. …’]
Text: Having caught the popular attention and with goodwill at a high-point , Nelsonic was able to obtain licensing from several big-name video game companies such as Sega , Nintendo ,<MENTION> Midway Games </MENTION>, and Mylstar Electronics .
Answer: (c): [’Midway Games’, ’https://en.wikipedia.org/wiki?curid=430266’]
Explanation: For mention of "<MENTION> Midway Games </MENTION>", the most similar option is option (c) Midway Games. Additionally, the description in option (c) of Midway Games as an American video game developer and publisher matches the context in the text, making it the most likely source.
Options:
(a): [’China Xinhua News Network Corporation’,’https://en.wikipedia.org/wiki?curid=25626750’, ’The China Xinhua News Network Corporation (CNC, Chinese: 中国新华新闻电视网; pinyin: zhōng guó xīn huá xīn wén diàn shì wǎng) is a state-owned television network subsidiary of Xinhua News Agency.’]
(j): [’News agency’,’https://en.wikipedia.org/wiki?curid=47917’, ’A news agency is an organization that gathers news reports and sells them in to subscribing news organizations, such as newspapers, magazines and radio and television broadcasters. News agencies are known for their press releases. A news agency may also be referred to as a wire service, newswire, or news service.’]
Text: <MENTION> Xinhua News Agency </MENTION> , Shanghai , April 3rd , by reporter Jierong Zhou Recently , HSBC has moved its Shanghai branch to the China Shipping Mansion in the Pudong Lujiazui financial trading district , becoming the third foreign capital bank to be approved to operate RMB business and shift to Pudong . …
Answer:
Response
(g): [’Xinhua News Agency’,’https://en.wikipedia.org/wiki?curid=263168’]
Explanation: For mention of "<MENTION> Xinhua News Agency </MENTION>", the most similar option is option (g) Xinhua News Agency. Additionally, the description in option (g) of Xinhua News Agency as the official state news agency of the People’s Republic of China matches the context in the text, making it the most likely source.
Table 10: An example of prompt and response for the LLM to re-rank the returned retrieval results in the Re-rank-10 setting, from the ACE04 dataset. This prompt contains an instruction, a demonstration, and a question. For simplicity in presentation, we have omitted some abstracts, options, and text and replaced them with ellipses.