11institutetext: National University of Defense Technology 22institutetext: Hefei University of Technology
22email: {betterszsong}@gmail.com

DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model

Shezheng Song1    Shasha Li1    Jie Yu1    Shan Zhao2    Xiaopeng Li1    Jun Ma1   
Xiaodong Liu1
   Zhuo Li1    Xiaoguang Mao1
Abstract

Our study delves into Multimodal Entity Linking, aligning the mention in multimodal information with entities in knowledge base. Existing methods are still facing challenges like ambiguous entity representations and limited image information utilization. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method: Dynamically Integrate Multimodal information with knowledge base (DIM), employing the capability of the Large Language Model (LLM) for visual understanding. The LLM, such as BLIP-2, extracts information relevant to entities in the image, which can facilitate improved extraction of entity features and linking them with the dynamic entity representations provided by ChatGPT. The experiments demonstrate that our proposed DIM method outperforms the majority of existing methods on the three original datasets, and achieves state-of-the-art (SOTA) on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+). For reproducibility, our code and collected datasets are released on https://github.com/season1blue/DIM.

——

Keywords:
Multimodal Information Processing Large Language Model Object Recognition Multimodal Entity Linking.

1 Introduction

Multimodal entity linking (Komlósi \BBA Waldbuesser, \APACyear2015) involves linking mentioned entity (i.e. mention) in natural language texts to their corresponding entity (Zhao, Hu, Cai\BCBL \BBA Liu, \APACyear2021) in a knowledge base (Fu \BOthers., \APACyear2020). These entities could be individuals, locations, organizations in the real world, or specific entities within a knowledge base. MEL could help computers better understand user semantics, integrate various information sources, resolve ambiguities, and enhance user experience. It plays an important role in search engines (P. Wu \BBA Xie, \APACyear2024), recommendation systems (Zhou, \APACyear2023), information retrieval(Ma, Zhou\BCBL \BOthers., \APACyear2023), and knowledge graph construction (Ma, Chen\BCBL \BOthers., \APACyear2023; Zhao, Hu, Cai\BCBL \BBA Liu, \APACyear2021). Through MEL, systems can provide more relevant search results, more personalized recommendations, more comprehensive information retrieval, richer knowledge graphs, and smarter conversation and text processing capabilities.

Human cognition (Hu \BOthers., \APACyear2021; Hutchins, \APACyear2020) can be expressed through various modalities of information carriers, such as text or images. Building a connection between natural language and structured knowledge contributes to the unity of human cognition and the knowledge base, enabling artificial intelligence to comprehend human society better. Specifically, multimodal entity linking aids in a deeper understanding of the semantics of text. By linking entities in the expression to a knowledge base, systems can acquire additional information about these entities, elevating the comprehension of textual meanings. Besides fostering the unity of semantic expression, multimodal entity linking also assists in enriching the information in knowledge bases. Through entity linking, new relationships and attributes can be added to the knowledge base, thereby enhancing its completeness and accuracy.

Refer to caption
Figure 1: The process of integrating human cognition with information in the knowledge base. When someone says “This is why Taylor ….”, MEL aims to link the mentioned ’Taylor’ to ’Taylor Alison Swift’ in knowledge base to facilitate further understanding of user semantics.

In recent years, the multimodal entity linking task (Zhao, Hu, Cai, Chen\BCBL \BBA Liu, \APACyear2021) has attracted increasing attention from researchers. P. Wang (\APACyear2022) adopts pretrained models such as BERT and constructs a multi-level multimodal information understanding hierarchy to accomplish entity linking. Both the Wang (\APACyear2022) and Song \BOthers. (\APACyear2023) enhance the entity linking task by proposing new datasets and fine-grained features. However, existing methods face two main challenges: (1) Researchers have not adequately considered that ambiguous entity representations (He \BOthers., \APACyear2013) in datasets may not effectively represent entities in the knowledge base. Typically, textual descriptions are used as representations of entities in the knowledge base, and if the computed features are deemed similar to entity representations, the entity linking task is considered complete. However, in this process, the linking process may fail due to the misalignment between ambiguous entity representations and the semantics of knowledge base entities even with well-learned features. Existing methods overly focus on enhancing the feature learning aspect, overlooking the ambiguity in entity representations (Ji \BOthers., \APACyear2022). (2) Existing work still exhibits limited understanding of images. As a crucial supplementary source to text, image information deserves more attention. Methods like GHMFC (P. Wang, \APACyear2022) and MMEL (Yang \BOthers., \APACyear2023) are trained by encoding images to enhance the linking ability of image information to entity representations. However, the impact of image information on recognizing entity identities is limited. While image encoder can extract racial features, such as “white female, blue dress”, it struggles to identify the identity information due to a lack of comprehensive understanding of image.

In light of the aforementioned challenges, we propose the following optimization methods: (1) Addressing the issue of ambiguous entity representations that hinder the unity of human cognition and knowledge base, we leverage the rapid learning capabilities of ChatGPT on the knowledge base and dynamically extract representations of entities. To facilitate subsequent experiments by researchers, we organize the dynamically extracted entities and enhance three datasets (Wikimel, Richpedia, and Wikidiverse) individually. We designate these enhanced datasets as Wiki+, Rich+, and Diverse+. (2) Addressing the inadequacy in information utilization by existing models, we propose a method to Dynamically Integrate Multimodal information (DIM). Besides, we conduct experiments on both the original datasets(Wikimel, Richpedia, and Wikidiverse) and the enhanced datasets(Wiki+, Rich+, Diverse+) to validate the effectiveness of our datasets. Specifically, we leverage the understanding capabilities of the Large Language Model (Li \BOthers., \APACyear2022, \APACyear2023) on images and design prompts to extract information from images, such as obtaining image titles or inquiring about the identities of individuals in the images. Applying LLM to the human images as shown in Figure 1 will extract information such as “Taylor Swift on wedding”, which contributes to linking the entity.

In summary, our innovations are outlined as follows:

  • We propose a dynamic method for collecting entity representations from the knowledge base, and release the collected enhanced datasets Wiki+, Rich+, and Diverse+.

  • We introduce a dynamic linking method, DIM, to connect human cognition with the knowledge base. DIM utilizes BLIP-2 for better extraction of entity-related features from images and links them with the dynamic entity representations provided by ChatGPT.

  • Extensive experiments are conducted. The DIM method not only outperforms most methods on the original Wikimel, Richpedia, and Wikidiverse datasets but also achieves state-of-the-art performance on Wiki+, Rich+, and Diverse+.

2 Dynamic Building Entity Representation

Multimodal entity linking(MEL) significantly influences and facilitates a profound understanding and cognition of information for humans. MEL serves as a crucial means to unify human cognition with structured knowledge repositories: (1) Assurance of semantic consistency: MEL ensures semantic consistency by aligning entities mentioned in the cognition with knowledge base. It helps in eliminating ambiguity and ensuring that the interpretation of specific entities remains clear despite contextual variations. (2) Enhancement of cognitive information: MEL offers individuals a richer and deeper cognitive experience. By associating entities with background knowledge, individuals can comprehensively grasp the meaning of entities, thereby elevating their cognitive awareness of information. (3) Integrated knowledge acquisition: This contributes to breaking down information silos, enabling people to easily cross different domains, texts, and knowledge sources to acquire information, promoting an overall improvement in cognitive levels.

2.1 Existing Entity Representation

In our investigation and study, we analyze the existing Entity Linking datasets and their methods of entity representation:

  • Wikimel and Richpedia (Zhou, \APACyear2021) employ concise attributes from Wikidata. This representation lacks representativeness for entities, as many entities share similar attributes. It is easy to link accurate cognition to the wrong entity incorrectly.

  • Wikiperson (Sun, \APACyear2022), similar to Wikimel, uses attributes of individuals as representatives, but in a more simplified manner. In this example, the attribute “American President” is inadequate as a representative of Joe Biden, given that there have been multiple American Presidents.

  • Weibo (Zhang, \APACyear2021) utilizes individuals from Weibo as entities, using user-authored personal bios as entity representations. These bios, relying on user-generated content, may contain biases or errors and do not accurately reflect the broader public’s understanding of the entity.

  • Wikidiverse (Wang, \APACyear2022) uses images collected from Wikipedia as entity representation. However, images can deviate from a person’s true appearance due to factors like angles and time, lacking real-time accuracy.

Table 1: Examples of entity representations for the entity Joe Biden in different datasets. Wiki+ is the dataset with dynamic enhancement.
Wikimel
Sex: male. Birth: 1942, Scranton. Religion: Catholicism. Occupation: lawyer, politician. Spouse: Jill Biden, Neilia Hunter. Languages: English. Alma mater: Archmere Academy, Syracuse University College of Law…
Wikiperson
President of the United States
Weibo
Husband to @DrBiden, proud father and grandfather. Ready to finish the job for all Americans.
Wikidiverse
Joe Biden became the presumptive nominee of the Democratic Party for president in April 2020, and formally accepted the nomination the following August…
Wiki+(Ours)
Joe Biden is an American politician who served as the 46th president of the United States. Born on November 20, 1942, in Scranton, Pennsylvania, Biden has had a long and distinguished political career…

In summary, existing multimodal entity linking methods suffer from the limitation that entity representation fails to effectively represent entities. More importantly, these representations are manually collected from Wikipedia or other knowledge bases and can only represent the entity’s state at a specific time. Human understanding of entities changes over time and events. For instance, Donald Trump is no longer the President of the United States in 2023. In such cases, rigid and less adaptable entity representations can lead to errors. Additionally, when a mentioned entity is not in the dataset, there is no corresponding entity representation, causing potential issues in entity linking.

Table 1 shows the diverse representation methods for entities in existing research. (It is worth noting that Weibo dataset (Zhang, \APACyear2021) is primarily designed for Chinese celebrities and does not involve international figures. For the sake of convenient presentation and visual comparison, following the definition of the Weibo dataset, we manually gathered personal profiles of relevant entities from Twitter. ) Therefore, the paper proposes a dynamic method for constructing entity representations, leveraging the capabilities of ChatGPT to enhance real-time and scalable entity representations based on the evolving understanding of the world.

2.2 Leverage Chatgpt to Dynamic Connect

ChatGPT (OpenAI, \APACyear2023) is a powerful large model trained on massive amounts of web data and is continuously updated over time. We plan to utilize the interface provided by ChatGPT to inquire about entities, so as to subsequently link with the entities in the implicit knowledge base of ChatGPT. Candidate entities are input into ChatGPT for inquiries using the prompt: “You are a helpful assistant designed to give a comprehensive introduction about people. Who is this one?” The generated response from ChatGPT is shown in Table 1. For experimental convenience and dataset quality enhancement, we collect the response of ChatGPT to enhance the dataset. In detail, we construct entity representations for 17391, 17804, and 57007 entities from Wikimel, Richpedia, Wikidiverse, respectively. The newly built entity representations better reflect the general public’s understanding of entities, align closely with their inherent semantics, and facilitate a unified approach to cognition and knowledge base.

Taking the Wikimel (Zhou, \APACyear2021) as an example, out of the 17474 entities collected from the dataset, 131 entities did not return any results, and 220 inquiries returned “Sorry, I cannot provide an introduction to this entity.” Besides, ChatGPT provides speculative information based on cultural, regional, or other contextual cues for 462 entities. For example, “John Abbott is a common English given name and surname,” but did not provide specific representation. Additionally, 2997 entities require additional information for verification. For instance, “It is possible that Edward J. Livernash is a private individual without any notable achievements.” Furthermore, 599 entities are speculated to be fictional names, such as “John McDuffie is a fictional name, so there is no information.” The specific reasons and their proportions are as shown in Figure 2. In summary, out of 17391 entities, 5517 entities were unable to be enhanced through ChatGPT. For these entities, we continue to use the original entity representations.

Refer to caption
Figure 2: Statistics of enhanced datasets including Richpedia, Wikimel, and Wikidiverse.
Refer to caption
Figure 3: Model overview. Example is an image with mention m𝑚mitalic_m Trump, text t𝑡titalic_t Trump and his wife Melania at Wedding. c𝑐citalic_c is the result of the expert model. Npairloss in contrastive learning is to ensure close distances for same-category samples and distinct distances for different-category samples.

3 Dynamically Integrate Multimodal Information

To evaluate the effectiveness of our enhanced dataset, we introduce a corresponding baseline, a method to Dynamically Integrate Multimodal information(DIM). DIM was experimented not only on the original Wikimel, Richpedia, and Wikidiverse datasets but also on the enhanced Wiki+, Rich+, and Diverse+ datasets to assess the effectiveness of the enhanced datasets.

In detail, our DIM employs CLIP (Radford \BOthers., \APACyear2021) for feature encoding and utilizes BLIP-2 (Li \BOthers., \APACyear2023) as an expert to extract useful information from images, serving as supplementary information for feature extraction by CLIP. This approach was designed to enhance the representation and understanding of entities in our experiments. The performance comparisons across the original and enhanced datasets aimed to highlight the impact of our dataset augmentation on the efficacy of the proposed DIM model.

3.1 Feature Extractor

Given a sentence xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, mention xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and Wikipedia description of entity xesubscript𝑥𝑒x_{e}italic_x start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT, we follow CLIP (Radford \BOthers., \APACyear2021) to tokenize it into a sequence of word embeddings. Then the special tokens startoftext and endoftext are added at the beginning and end positions of word embeddings. As a result, with N𝑁Nitalic_N sentences and Nesubscript𝑁𝑒N_{e}italic_N start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT candidate entities, we feed sentence representation tN×d𝑡superscript𝑁𝑑t\in\mathbb{R}^{N\times d}italic_t ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_d end_POSTSUPERSCRIPT, mention representation mN×d𝑚superscript𝑁𝑑m\in\mathbb{R}^{N\times d}italic_m ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_d end_POSTSUPERSCRIPT and entity representation eNe×d𝑒superscriptsubscript𝑁𝑒𝑑e\in\mathbb{R}^{N_{e}\times d}italic_e ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT into model. Similarly, we feed image into visual encoder of CLIP to get visual feature vN×d𝑣superscript𝑁𝑑v\in\mathbb{R}^{N\times d}italic_v ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_d end_POSTSUPERSCRIPT. d𝑑ditalic_d is the hidden size of textual and visual features.

3.2 Expert Supplementary Information

We employ BLIP-2 as the expert for extracting information from images, employing various approaches to process BLIP-2 images: 1) Image Captioning: We extract corresponding captions c1subscript𝑐1c_{1}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT for images, such as “A man and a woman on the red carpet.” 2) Prompt-based Inquiry: We utilize prompts to ask the detailed information c2subscript𝑐2c_{2}italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT about the images, with specific prompt designs such as “Question: Who are the characters in the picture? Answer: ”

We concatenate the image-related information obtained through these two methods to obtain expert information c=[CLS]c1[SEP]c2𝑐[CLS]𝑐1[SEP]𝑐2c=\text{[CLS]}c1\text{[SEP]}c2italic_c = [CLS] italic_c 1 [SEP] italic_c 2, where ’[CLS]’ and ’[SEP]’ are special characters used to indicate the beginning and separation of text, respectively. c𝑐citalic_c is fed into the text encoder of CLIP to get expert feature fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT.

3.3 Architecture

We combine the extracted image and text features with the supplementary information provided by the expert, concatenating them to form the final feature representation. To manage the expression of features, we employ multi-head attention for selection. Through multi-head attention, c𝑐citalic_c will interact separately with text feature t𝑡titalic_t and image feature v𝑣vitalic_v to extract useful information and control noise.

ftsubscript𝑓𝑡\displaystyle f_{t}italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT =softmax((WQfc)T(WKt)d)(WVt)absent𝑠𝑜𝑓𝑡𝑚𝑎𝑥superscriptsuperscript𝑊𝑄subscript𝑓𝑐𝑇superscript𝑊𝐾𝑡𝑑superscript𝑊𝑉𝑡\displaystyle=softmax(\frac{(W^{Q}f_{c})^{T}(W^{K}t)}{\sqrt{d}})(W^{V}t)= italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( divide start_ARG ( italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_t ) end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) ( italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT italic_t ) (1)
fvsubscript𝑓𝑣\displaystyle f_{v}italic_f start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT =softmax((WQfc)T(WKv)d)(WVv)absent𝑠𝑜𝑓𝑡𝑚𝑎𝑥superscriptsuperscript𝑊𝑄subscript𝑓𝑐𝑇superscript𝑊𝐾𝑣𝑑superscript𝑊𝑉𝑣\displaystyle=softmax(\frac{(W^{Q}f_{c})^{T}(W^{K}v)}{\sqrt{d}})(W^{V}v)= italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( divide start_ARG ( italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_v ) end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) ( italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT italic_v ) (2)

where WQd×dq,WKd×dk,WVd×dvformulae-sequencesuperscript𝑊𝑄superscript𝑑subscript𝑑𝑞formulae-sequencesuperscript𝑊𝐾superscript𝑑subscript𝑑𝑘superscript𝑊𝑉superscript𝑑subscript𝑑𝑣W^{Q}\in\mathbb{R}^{d\times d_{q}},W^{K}\in\mathbb{R}^{d\times d_{k}},W^{V}\in% \mathbb{R}^{d\times d_{v}}italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUPERSCRIPT are randomly initialized projection matrices. We set dq=dk=dv=d/hsubscript𝑑𝑞subscript𝑑𝑘subscript𝑑𝑣𝑑d_{q}=d_{k}=d_{v}=d/hitalic_d start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = italic_d / italic_h. hhitalic_h is the number of heads of attention layer.

On one hand, text information, image information, and expert features are fused to form the fused feature g𝑔gitalic_g.

g=fv+fc+ft𝑔subscript𝑓𝑣subscript𝑓𝑐subscript𝑓𝑡g=f_{v}+f_{c}+f_{t}italic_g = italic_f start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_f start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (3)

On the other hand, the textual representations of Nesubscript𝑁𝑒N_{e}italic_N start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT candidate entities in the knowledge base also encoded by CLIP’s text encoder to obtain e𝑒eitalic_e. During this process, the limit of text length is imposed to prevent excessively long entity representations. The fused feature g𝑔gitalic_g is then matched with the textual features of the Nesubscript𝑁𝑒N_{e}italic_N start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT entities using cosine similarity, calculating the Match Score. If g𝑔gitalic_g is most similar to the correct entity among the Nesubscript𝑁𝑒N_{e}italic_N start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT entities, it is considered a successful T@1 prediction. Similarly, if the similarity ranks within the top 5, it constitutes a successful T@5 prediction.

3.4 Training Loss

We utilize Npairloss (Sohn, \APACyear2016) as our optimization training objective to enhance the learning process. NpairLoss is applied to networks with multiple outputs, where each output corresponds to a specific task. Its purpose is to optimize the network by maximizing the similarity of sample pairs within the same category, thereby enhancing the effectiveness of multi-task learning. Specifically, for each sample, NpairLoss defines the loss by comparing the similarity between positive sample pairs (belonging to the same category) and negative sample pairs (belonging to different categories). It encourages the network to make sample pairs within the same category more similar while ensuring that sample pairs from different categories are more dissimilar.

The mathematical formulation of NpairLoss Nsubscript𝑁\mathcal{L}_{N}caligraphic_L start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT is typically expressed as the sum of losses over each sample pair. The similarity scores of positive sample pairs are maximized, while those of negative sample pairs are minimized. This helps effectively balance the trade-offs between different tasks in multi-task learning, thereby improving the generalization performance of network.

=i=1N(sim(𝐠i,𝐩i)j=1Ksim(𝐠i,𝐧j)+log(j=1Kesim(𝐠i,𝐧j)))superscriptsubscript𝑖1𝑁simsubscript𝐠𝑖subscript𝐩𝑖superscriptsubscript𝑗1𝐾simsubscript𝐠𝑖subscript𝐧𝑗superscriptsubscript𝑗1𝐾superscript𝑒simsubscript𝐠𝑖subscript𝐧𝑗\mathcal{L}=\sum_{i=1}^{N}\left(-\frac{\text{sim}(\mathbf{g}_{i},\mathbf{p}_{i% })}{\sum_{j=1}^{K}\text{sim}(\mathbf{g}_{i},\mathbf{n}_{j})}+\log\left(\sum_{j% =1}^{K}e^{\text{sim}(\mathbf{g}_{i},\mathbf{n}_{j})}\right)\right)caligraphic_L = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( - divide start_ARG sim ( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT sim ( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG + roman_log ( ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT sim ( bold_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT ) ) (4)

where gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the fused feature of the i𝑖iitalic_i-th sample. pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the representation of the positive sample corresponding to the i𝑖iitalic_i-th sample pair. njsubscript𝑛𝑗n_{j}italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the representation of the j𝑗jitalic_j-th negative sample corresponding to the i𝑖iitalic_i-th sample pair. sim(a,b)𝑠𝑖𝑚𝑎𝑏sim(a,b)italic_s italic_i italic_m ( italic_a , italic_b ) denotes the similarity measure between two representations, typically cosine similarity.

4 Expreriment

DIM method is not only tested on the original dataset, with results presented in Table 3, but also on the dynamically enhanced dataset we introduced in Table 4. For comparative analysis, we reproduce several classic baselines on the enhanced dataset, such as BERT (Devlin \BOthers., \APACyear2019), GHMFC (P. Wang, \APACyear2022) and CLIP (Radford \BOthers., \APACyear2021).

Table 2: The statistics of datasets. Wiki+, Rich+, Diverse+ represent the enhanced versions of Wikimel, Richpedia, and Wikidiverse datasets using ChatGPT. “Text” and “ER” refer to the length of text and entity representation, respectively.
Dataset Sample Entity Mention Text ER
Wiki+ 18880 17391 25846 8.2 1318
Rich+ 17805 17804 18752 13.6 1274
Diverse+ 13765 57007 16097 10.1 902

4.1 Dataset

Our experimental dataset comprises authoritative datasets in the entity linking domain, including Richpedia, WikiMEL, Wikidiverse. We conduct a comprehensive analysis of the augmented dataset, and the statistics are presented in Table 2.

4.2 Baseline

We select several representative methods from the current research community as our baseline: (1) BLINK (L. Wu \BOthers., \APACyear2019) is a two-step entity linking model based on the BERT model. (2) BERT (Devlin \BOthers., \APACyear2019) is a deep learning model based on the attention mechanism and Transformer architecture (3) ARNN (Eshel \BOthers., \APACyear2017) utilizes the attention-RNN structure to establish the relationship link between entities and input information. (4) DZMNED (Moon \BOthers., \APACyear2018) focuses on utilizing a multimodal attention mechanism to analyze information related to mentions in both images and text. (5) JMEL (Adjali \BOthers., \APACyear2020) leverage fully connected layers to project multimodal features into a shared latent space facilitating the representation of features. (6) MEL-HI (Zhang \BOthers., \APACyear2021) employs multiple attention mechanisms to focus on different aspects of multimodal information and decrease the effects of noisy images. (7) HieCoAtt (Lu \BOthers., \APACyear2016) is a multimodal fusion mechanism, using alternating co-attention and three textual levels (tokens, phrases, and sentences) to calculate relationship. (8) GHMFC (P. Wang, \APACyear2022) takes the gated multimodal fusion and novel attention mechanism to link entities in knowledge base. (9) MMEL (Yang \BOthers., \APACyear2023) is a joint feature extraction module to learn the representations of context and entity candidates, from both the visual and textual perspectives. (10) CLIP-text (Radford \BOthers., \APACyear2021) only uses textual information and focuses on the ability to build textual relationships between text and entity. (11) CLIP (Radford \BOthers., \APACyear2021) take both textual and visual features into consideration. The model concatenates multimodal features and calculates the similarity between fused features and entities.

Table 3: Experimental results on the dataset before dynamic enhancement. T@1 denotes Top-1 accuracy, meaning the feature’s similarity to the target entity ranks first among multiple candidate entities. Similarly, T@5 signifies the similarity ranking within the top 5. Calculations are conducted among 100 candidate entities.
Models Richpedia Wikimel Wikidiverse
T@1 T@5 T@10 T@20 T@1 T@5 T@10 T@20 T@1 T@5 T@10 T@20
BLINK 30.8 38.8 44.5 53.6 30.8 44.6 56.7 66.4 - 71.2 - -
DZMNED 29.5 41.6 45.8 55.2 30.9 50.7 56.9 65.1 - 39.1 - -
JMEL 29.6 42.3 46.6 54.1 31.3 49.4 57.9 64.8 21.9 54.5 69.9 76.3
BERT 31.6 42.0 47.6 57.3 31.7 48.8 57.8 70.3 22.2 53.8 69.8 82.8
ARNN 31.2 39.3 45.9 54.5 32.0 45.8 56.6 65.0 22.4 50.5 68.4 76.6
MEL-HI 34.9 43.1 50.6 58.4 38.7 55.1 65.2 75.7 27.1 60.7 78.7 89.2
HieCoAtt 37.2 46.8 54.2 62.4 40.5 57.6 69.6 78.6 28.4 63.5 84.0 92.6
GHMFC 38.7 50.9 58.5 66.7 43.6 64.0 74.4 85.8 - - - -
MMEL - - - - 71.5 91.7 96.3 98.0 - - - -
CLIP 60.4 96.1 98.3 99.2 36.1 81.3 92.8 98.3 42.4 80.5 91.7 96.6
DIM 66.1 97.5 98.9 99.6 64.6 97.6 99.1 99.8 52.7 94.5 98.4 99.5

4.3 Metrics

For metric evaluation, we adopted the T@1, 5, 10, and 20 metrics as employed in GHMFC (P. Wang, \APACyear2022). These metrics represent the ranking of the similarity scores for candidate entities, where the linked entity’s similarity score is within the top 1, 5, 10, and 20 positions, respectively. Following the previous approach (P. Wang, \APACyear2022; Yang \BOthers., \APACyear2023), calculations are conducted among 100 candidate entities.

To be specific, following the definition in DWE (Song \BOthers., \APACyear2023), the formula is as follows:

Acctop-k=1Ni=1Nη{I(cos(g,gt),cos(g,Ce))k}𝐴𝑐subscript𝑐𝑡𝑜𝑝-𝑘1𝑁superscriptsubscript𝑖1𝑁𝜂𝐼𝑐𝑜𝑠𝑔𝑔𝑡𝑐𝑜𝑠𝑔subscript𝐶𝑒𝑘\footnotesize Acc_{top\text{-}k}=\frac{1}{N}\sum_{i=1}^{N}\eta\{I(cos(g,gt),% cos(g,C_{e}))\leq k\}italic_A italic_c italic_c start_POSTSUBSCRIPT italic_t italic_o italic_p - italic_k end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_η { italic_I ( italic_c italic_o italic_s ( italic_g , italic_g italic_t ) , italic_c italic_o italic_s ( italic_g , italic_C start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) ) ≤ italic_k } (5)

where N𝑁Nitalic_N represents the total number of samples, and η𝜂\etaitalic_η is the indicator function. When the receiving condition is satisfied, η𝜂\etaitalic_η is set to 1, and 0 otherwise. gt𝑔𝑡gtitalic_g italic_t is ground truth entity while Cesubscript𝐶𝑒C_{e}italic_C start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT is a set of candidate entities. cos𝑐𝑜𝑠cositalic_c italic_o italic_s means cosine similarity function. I𝐼Iitalic_I is a function to calculate the rank of similarity between joint feature g𝑔gitalic_g and ground truth gt𝑔𝑡gtitalic_g italic_t among a set of candidate entities Cesubscript𝐶𝑒C_{e}italic_C start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT.

4.4 Implement Details

Following previous work(Zhou, \APACyear2021; Wang, \APACyear2022; Song \BOthers., \APACyear2023), we select 100 potential entities as candidates. In the Wiki+, Rich+, and Diverse+ datasets, we utilize fuzzy matching 111https://github.com/seatgeek/fuzzywuzzy technology to recognize candidate entities that resemble the particular mentions.

Our experiments are conducted on RTX 3090 using PyTorch 2.0. ChatGPT is based on the GPT-3.5-turbo version. The version of CLIP employed is Vit-base-patch16-224-in21. The training consisted of 300 epochs, with both image and text hidden layer dimensions set to 512, and the output layer dimension set to 512. We utilized the AdamW optimizer with a learning rate of 5e-5, and the batch size was set to 64.

4.5 Experiment on Original Dataset

We conduct experiments with the DIM method on the datasets before dynamic enhancement to validate its effectiveness. As shown in Table 3, the performance of the DIM method surpasses the majority of existing models on three public datasets. DIM demonstrates T@1 performance improvements of 5.7%, 28.5%, and 10.3% on the Richpedia, Wikimel, and Wikidiverse datasets, respectively. This indicates that the DIM method is more effective in capturing information related to entity identity in image data.

4.6 Experiment on Enhanced Dataset

Table 4: Experimental results on the dataset with dynamic enhancement.
Models Richpedia WikiMEL Wikidiverse
T@1 T@5 T@10 T@20 T@1 T@5 T@10 T@20 T@1 T@5 T@10 T@20
BERT 35.5 77.7 87.8 94.3 32.0 75.7 88.2 95.5 10.3 23.9 33.8 47.3
GHMFC 34.6 77.0 87.1 93.8 33.3 75.9 88.4 95.0 14.8 29.9 39.3 53.8
CLIP 63.5 95.4 97.5 98.6 63.0 96.2 98.6 99.6 45.7 88.9 96.4 99.3
DIM 65.1 96.6 98.3 99.5 68.1 98.4 99.5 99.9 53.4 95.5 99.2 99.7

In Table 2, to illustrate the distinctions between the enhanced datasets and the original datasets, we conducted feature statistics on the datasets and a series of experiments. As shown in Table 4, not only did we experiment with the DIM method on the enhanced datasets, but for comparison, we also replicated several classic baselines, including BERT, CLIP, and GHMFC. Our approach outperforms most existing models on the original dataset without dynamic enhancement, showcasing the effectiveness of our proposed DIM method. Furthermore, on our enhanced dataset(Rich+, Wiki+, and Diverse+), our method continues to demonstrate robust performance, validating the effectiveness of our enhancement approach. The improved entity representation by ChatGPT aligns more coherently with entities in the knowledge base, achieving better semantic consistency.

5 Conclusion

Our study on multimodal entity linking introduces an impactful solution to key challenges. We leverage ChatGPT’s rapid learning to enhance datasets (Wiki+, Rich+, Diverse+), addressing ambiguous entity representations. Furthermore, the dynamically integrate multimodal information with knowledge base (DIM) method validates efficacy and improves information extraction from images, overcoming existing limitations. These innovations contribute to a deeper understanding of human cognition and knowledge bases, advancing natural language processing and artificial intelligence. Experiments show that our DIM not only outperforms most methods on the original dataset(Wikimel, Richpedia, Wikidiverse) but also achieves optimal performance on the newly enhanced dataset(Wiki+, Rich+, Diverse+).

The dataset we collect relies on ChatGPT’s understanding of the knowledge base and the world. Although this allows for dynamic entity information linking, it can lead to biases or omissions in data collection due to ChatGPT’s potential hallucinations or unavailability. We will continue to explore and refine methods for entity data collection based on large models to enhance accuracy and completeness.

References

  • Adjali \BOthers. (\APACyear2020) \APACinsertmetastarjmel{APACrefauthors}Adjali, O., Besançon, R., Ferret, O., Le Borgne, H.\BCBL \BBA Grau, B.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleMultimodal entity linking for tweets Multimodal entity linking for tweets.\BBCQ \BIn \APACrefbtitleEuropean Conference on Information Retrieval European conference on information retrieval (\BPGS 463–478). \PrintBackRefs\CurrentBib
  • Devlin \BOthers. (\APACyear2019) \APACinsertmetastarBERT{APACrefauthors}Devlin, J., Chang, M\BHBIW., Lee, K.\BCBL \BBA Toutanova, K.  \APACrefYearMonthDay2019Jul. \BBOQ\APACrefatitleBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Bert: Pre-training of deep bidirectional transformers for language understanding.\BBCQ \BIn \APACrefbtitleProceedings of the 2019 Conference of the North. Proceedings of the 2019 conference of the north. {APACrefURL} http://dx.doi.org/10.18653/v1/n19-1423 {APACrefDOI} \doi10.18653/v1/n19-1423 \PrintBackRefs\CurrentBib
  • Eshel \BOthers. (\APACyear2017) \APACinsertmetastarARNN{APACrefauthors}Eshel, Y., Cohen, N., Radinsky, K., Markovitch, S., Yamada, I.\BCBL \BBA Levy, O.  \APACrefYearMonthDay2017. \BBOQ\APACrefatitleNamed entity disambiguation for noisy text Named entity disambiguation for noisy text.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:1706.09147. \PrintBackRefs\CurrentBib
  • Fu \BOthers. (\APACyear2020) \APACinsertmetastarfu2020kbsurvey{APACrefauthors}Fu, B., Qiu, Y., Tang, C., Li, Y., Yu, H.\BCBL \BBA Sun, J.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleA survey on complex question answering over knowledge base: Recent advances and challenges A survey on complex question answering over knowledge base: Recent advances and challenges.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2007.13069. \PrintBackRefs\CurrentBib
  • He \BOthers. (\APACyear2013) \APACinsertmetastarhe2013learning{APACrefauthors}He, Z., Liu, S., Li, M., Zhou, M., Zhang, L.\BCBL \BBA Wang, H.  \APACrefYearMonthDay2013. \BBOQ\APACrefatitleLearning entity representation for entity disambiguation Learning entity representation for entity disambiguation.\BBCQ \BIn \APACrefbtitleProceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) Proceedings of the 51st annual meeting of the association for computational linguistics (volume 2: Short papers) (\BPGS 30–34). \PrintBackRefs\CurrentBib
  • Hu \BOthers. (\APACyear2021) \APACinsertmetastarhu2021can{APACrefauthors}Hu, Q., Lu, Y., Pan, Z., Gong, Y.\BCBL \BBA Yang, Z.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleCan AI artifacts influence human cognition? The effects of artificial autonomy in intelligent personal assistants Can ai artifacts influence human cognition? the effects of artificial autonomy in intelligent personal assistants.\BBCQ \APACjournalVolNumPagesInternational Journal of Information Management56102250. \PrintBackRefs\CurrentBib
  • Hutchins (\APACyear2020) \APACinsertmetastarhutchins2020distributed{APACrefauthors}Hutchins, E.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleThe distributed cognition perspective on human interaction The distributed cognition perspective on human interaction.\BBCQ \BIn \APACrefbtitleRoots of human sociality Roots of human sociality (\BPGS 375–398). \APACaddressPublisherRoutledge. \PrintBackRefs\CurrentBib
  • Ji \BOthers. (\APACyear2022) \APACinsertmetastarji2022win{APACrefauthors}Ji, B., Li, S., Yu, J., Ma, J.\BCBL \BBA Liu, H.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleWin-Win Cooperation: Bundling Sequence and Span Models for Named Entity Recognition Win-win cooperation: Bundling sequence and span models for named entity recognition.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2207.03300. \PrintBackRefs\CurrentBib
  • Komlósi \BBA Waldbuesser (\APACyear2015) \APACinsertmetastarcel{APACrefauthors}Komlósi, L\BPBII.\BCBT \BBA Waldbuesser, P.  \APACrefYearMonthDay2015. \BBOQ\APACrefatitleThe cognitive entity generation: Emergent properties in social cognition The cognitive entity generation: Emergent properties in social cognition.\BBCQ \BIn \APACrefbtitle2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom) 2015 6th ieee international conference on cognitive infocommunications (coginfocom) (\BPGS 439–442). \PrintBackRefs\CurrentBib
  • Li \BOthers. (\APACyear2023) \APACinsertmetastarblip2{APACrefauthors}Li, J., Li, D., Savarese, S.\BCBL \BBA Hoi, S.  \APACrefYearMonthDay2023\APACmonth01. \BBOQ\APACrefatitleBLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.\BBCQ \APACjournalVolNumPagesarXiv e-printsarXiv:2301.12597. {APACrefDOI} \doi10.48550/arXiv.2301.12597 \PrintBackRefs\CurrentBib
  • Li \BOthers. (\APACyear2022) \APACinsertmetastarli2022blip{APACrefauthors}Li, J., Li, D., Xiong, C.\BCBL \BBA Hoi, S.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleBlip: Bootstrapping language-image pre-training for unified vision-language understanding and generation Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.\BBCQ \BIn \APACrefbtitleInternational Conference on Machine Learning International conference on machine learning (\BPGS 12888–12900). \PrintBackRefs\CurrentBib
  • Lu \BOthers. (\APACyear2016) \APACinsertmetastarHieCoAtt{APACrefauthors}Lu, J., Yang, J., Batra, D.\BCBL \BBA Parikh, D.  \APACrefYearMonthDay2016Jan. \APACrefbtitleHierarchical Question-Image Co-Attention for Visual Question Answering. Hierarchical question-image co-attention for visual question answering. \PrintBackRefs\CurrentBib
  • Ma, Chen\BCBL \BOthers. (\APACyear2023) \APACinsertmetastarma2023using{APACrefauthors}Ma, W., Chen, Q., Zhou, T., Zhao, S.\BCBL \BBA Cai, Z.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitleUsing multimodal contrastive knowledge distillation for video-text retrieval Using multimodal contrastive knowledge distillation for video-text retrieval.\BBCQ \APACjournalVolNumPagesIEEE Transactions on Circuits and Systems for Video Technology33105486–5497. \PrintBackRefs\CurrentBib
  • Ma, Zhou\BCBL \BOthers. (\APACyear2023) \APACinsertmetastarma2023adaptive{APACrefauthors}Ma, W., Zhou, T., Qin, J., Xiang, X., Tan, Y.\BCBL \BBA Cai, Z.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitleAdaptive multi-feature fusion via cross-entropy normalization for effective image retrieval Adaptive multi-feature fusion via cross-entropy normalization for effective image retrieval.\BBCQ \APACjournalVolNumPagesInformation Processing & Management601103119. \PrintBackRefs\CurrentBib
  • Moon \BOthers. (\APACyear2018) \APACinsertmetastarDZMNED{APACrefauthors}Moon, S., Neves, L.\BCBL \BBA Carvalho, V.  \APACrefYearMonthDay2018. \BBOQ\APACrefatitleMultimodal named entity disambiguation for noisy social media posts Multimodal named entity disambiguation for noisy social media posts.\BBCQ \BIn \APACrefbtitleProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (\BPGS 2000–2008). \PrintBackRefs\CurrentBib
  • OpenAI (\APACyear2023) \APACinsertmetastarGPT4{APACrefauthors}OpenAI.  \APACrefYearMonthDay2023\APACmonth03. \BBOQ\APACrefatitleGPT-4 Technical Report GPT-4 Technical Report.\BBCQ \APACjournalVolNumPagesarXiv e-printsarXiv:2303.08774. {APACrefDOI} \doi10.48550/arXiv.2303.08774 \PrintBackRefs\CurrentBib
  • Radford \BOthers. (\APACyear2021) \APACinsertmetastarclip_model{APACrefauthors}Radford, A., Kim, J\BPBIW., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S.\BDBLSutskever, I.  \APACrefYearMonthDay2021\APACmonth02. \APACrefbtitleLearning Transferable Visual Models From Natural Language Supervision. Learning Transferable Visual Models From Natural Language Supervision. \APACaddressPublisherarXiv. {APACrefURL} [2022-11-16]http://arxiv.org/abs/2103.00020 \APACrefnotearXiv:2103.00020 [cs] \PrintBackRefs\CurrentBib
  • Sohn (\APACyear2016) \APACinsertmetastarnpairloss{APACrefauthors}Sohn, K.  \APACrefYearMonthDay2016. \BBOQ\APACrefatitleImproved deep metric learning with multi-class n-pair loss objective Improved deep metric learning with multi-class n-pair loss objective.\BBCQ \APACjournalVolNumPagesAdvances in neural information processing systems29. \PrintBackRefs\CurrentBib
  • Song \BOthers. (\APACyear2023) \APACinsertmetastardwe{APACrefauthors}Song, S., Zhao, S., Wang, C., Yan, T., Li, S., Mao, X.\BCBL \BBA Wang, M.  \APACrefYearMonthDay2023. \APACrefbtitleA Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking. A dual-way enhanced framework from text matching point of view for multimodal entity linking. \PrintBackRefs\CurrentBib
  • Sun (\APACyear2022) \APACinsertmetastarwikiperson{APACrefauthors}Sun, W.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleVisual Named Entity Linking: A New Dataset and A Baseline Visual named entity linking: A new dataset and a baseline.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2211.04872. \PrintBackRefs\CurrentBib
  • Wang (\APACyear2022) \APACinsertmetastarwikidiverse{APACrefauthors}Wang.  \APACrefYearMonthDay2022\APACmonth04. \APACrefbtitleWikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types. WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types. \APACaddressPublisherarXiv. {APACrefURL} [2022-12-10]http://arxiv.org/abs/2204.06347 \APACrefnotearXiv:2204.06347 [cs] \PrintBackRefs\CurrentBib
  • P. Wang (\APACyear2022) \APACinsertmetastarbaseline_mel{APACrefauthors}Wang, P.  \APACrefYearMonthDay2022\APACmonth07. \BBOQ\APACrefatitleMultimodal Entity Linking with Gated Hierarchical Fusion and Contrastive Training Multimodal Entity Linking with Gated Hierarchical Fusion and Contrastive Training.\BBCQ \BIn \APACrefbtitleProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (\BPGS 938–948). \APACaddressPublisherMadrid SpainACM. {APACrefURL} [2022-11-13]https://dl.acm.org/doi/10.1145/3477495.3531867 {APACrefDOI} \doi10.1145/3477495.3531867 \PrintBackRefs\CurrentBib
  • L. Wu \BOthers. (\APACyear2019) \APACinsertmetastarBLINK{APACrefauthors}Wu, L., Petroni, F., Josifoski, M., Riedel, S.\BCBL \BBA Zettlemoyer, L.  \APACrefYearMonthDay2019Nov. \APACrefbtitleScalable Zero-shot Entity Linking with Dense Entity Retrieval. Scalable zero-shot entity linking with dense entity retrieval. \PrintBackRefs\CurrentBib
  • P. Wu \BBA Xie (\APACyear2024) \APACinsertmetastarwu2024v{APACrefauthors}Wu, P.\BCBT \BBA Xie, S.  \APACrefYearMonthDay2024. \BBOQ\APACrefatitleV?: Guided Visual Search as a Core Mechanism in Multimodal LLMs V?: Guided visual search as a core mechanism in multimodal llms.\BBCQ \BIn \APACrefbtitleProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Proceedings of the ieee/cvf conference on computer vision and pattern recognition (\BPGS 13084–13094). \PrintBackRefs\CurrentBib
  • Yang \BOthers. (\APACyear2023) \APACinsertmetastarMMEL{APACrefauthors}Yang, C., He, B., Wu, Y., Xing, C., He, L.\BCBL \BBA Ma, C.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitleMMEL: A Joint Learning Framework for Multi-Mention Entity Linking Mmel: A joint learning framework for multi-mention entity linking.\BBCQ \BIn \APACrefbtitleUncertainty in Artificial Intelligence Uncertainty in artificial intelligence (\BPGS 2411–2421). \PrintBackRefs\CurrentBib
  • Zhang (\APACyear2021) \APACinsertmetastarweibo{APACrefauthors}Zhang, L.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAttention-based multimodal entity linking with high-quality images Attention-based multimodal entity linking with high-quality images.\BBCQ \BIn \APACrefbtitleInternational Conference on Database Systems for Advanced Applications International conference on database systems for advanced applications (\BPGS 533–548). \PrintBackRefs\CurrentBib
  • Zhang \BOthers. (\APACyear2021) \APACinsertmetastarMELHI{APACrefauthors}Zhang, L., Li, Z.\BCBL \BBA Yang, Q.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAttention-based multimodal entity linking with high-quality images Attention-based multimodal entity linking with high-quality images.\BBCQ \BIn \APACrefbtitleInternational Conference on Database Systems for Advanced Applications International conference on database systems for advanced applications (\BPGS 533–548). \PrintBackRefs\CurrentBib
  • Zhao, Hu, Cai, Chen\BCBL \BBA Liu (\APACyear2021) \APACinsertmetastarzhao2021dynamic2{APACrefauthors}Zhao, S., Hu, M., Cai, Z., Chen, H.\BCBL \BBA Liu, F.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleDynamic modeling cross-and self-lattice attention network for Chinese NER Dynamic modeling cross-and self-lattice attention network for chinese ner.\BBCQ \BIn \APACrefbtitleProceedings of the AAAI Conference on Artificial Intelligence Proceedings of the aaai conference on artificial intelligence (\BVOL 35, \BPGS 14515–14523). \PrintBackRefs\CurrentBib
  • Zhao, Hu, Cai\BCBL \BBA Liu (\APACyear2021) \APACinsertmetastarzhao2021dynamic{APACrefauthors}Zhao, S., Hu, M., Cai, Z.\BCBL \BBA Liu, F.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleDynamic modeling cross-modal interactions in two-phase prediction for entity-relation extraction Dynamic modeling cross-modal interactions in two-phase prediction for entity-relation extraction.\BBCQ \APACjournalVolNumPagesIEEE Transactions on Neural Networks and Learning Systems. \PrintBackRefs\CurrentBib
  • Zhou (\APACyear2021) \APACinsertmetastarbaseline_dataset{APACrefauthors}Zhou, X.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleWeibo-mel, Wikidata-mel and Richpedia-mel: multimodal entity linking benchmark datasets Weibo-mel, wikidata-mel and richpedia-mel: multimodal entity linking benchmark datasets.\BBCQ \BIn \APACrefbtitleKnowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction: 6th China Conference, CCKS 2021, Guangzhou, China, November 4-7, 2021, Proceedings 6 Knowledge graph and semantic computing: Knowledge graph empowers new infrastructure construction: 6th china conference, ccks 2021, guangzhou, china, november 4-7, 2021, proceedings 6 (\BPGS 315–320). \PrintBackRefs\CurrentBib
  • Zhou (\APACyear2023) \APACinsertmetastarzhou2023mmrec{APACrefauthors}Zhou, X.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitleMmrec: Simplifying multimodal recommendation Mmrec: Simplifying multimodal recommendation.\BBCQ \BIn \APACrefbtitleProceedings of the 5th ACM International Conference on Multimedia in Asia Workshops Proceedings of the 5th acm international conference on multimedia in asia workshops (\BPGS 1–2). \PrintBackRefs\CurrentBib