Unified Interpretation of Smoothing Methods for Negative Sampling Loss Functions in Knowledge Graph Embedding

Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
Nara Institute of Science and Technology, The University of Tokyo
{feng.xincan.fy2, kamigaito.h, taro}@is.naist.jp
[email protected]
Abstract

Knowledge Graphs (KGs) are fundamental resources in knowledge-intensive tasks in NLP. Due to the limitation of manually creating KGs, KG Completion (KGC) has an important role in automatically completing KGs by scoring their links with KG Embedding (KGE). To handle many entities in training, KGE relies on Negative Sampling (NS) loss that can reduce the computational cost by sampling. Since the appearance frequencies for each link are at most one in KGs, sparsity is an essential and inevitable problem. The NS loss is no exception. As a solution, the NS loss in KGE relies on smoothing methods like Self-Adversarial Negative Sampling (SANS) and subsampling. However, it is uncertain what kind of smoothing method is suitable for this purpose due to the lack of theoretical understanding. This paper provides theoretical interpretations of the smoothing methods for the NS loss in KGE and induces a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the characteristics of the conventional smoothing methods. Experimental results of TransE, DistMult, ComplEx, RotatE, HAKE, and HousE on FB15k-237, WN18RR, and YAGO3-10 datasets and their sparser subsets show the soundness of our interpretation and performance improvement by our TANS.

1 Introduction

Knowledge Graphs (KGs) represent human knowledge using various entities and their relationships as graph structures. KGs are fundamental resources for knowledge-intensive tasks such as dialog (Moon et al., 2019), question answering (Reese et al., 2020), named entity recognition (Liu et al., 2019), open-domain questions (Hu et al., 2022), and recommendation systems (Gao et al., 2020).

However, creating complete KGs requires considering a large number of entities and all their possible relationships. Given the explosively large number of combinations between entities, relying only on manual approaches to build complete KGs is unrealistic.

Knowledge Graph Completion (KGC) is a task to deal with this problem. KGC involves automatically completing missing links corresponding to relationships between entities in KGs. To complete the KGs, we need to score each link between entities. For this purpose, current KGC commonly relies on Knowledge Graph Embedding (KGE) (Bordes et al., 2011). KGE models predict the missing relations, named link prediction, by learning structural representations. In the current KGE, models need to complete a link (triplet) $(e_i, r_k, e_j)$ of entities $e_i$ and $e_j$, and their relationship $r_k$ by answering $e_i$ or $e_j$ from a given query $(?, r_k, e_j)$ or $(e_i, r_k, ?)$, respectively. Hence, KGE needs to handle a large number of entities and their relationships during its training.

To handle a large number of entities and relationships in KGs, Negative Sampling (NS) loss (Mikolov et al., 2013) is frequently used for training KGE models. The original NS loss was proposed to approximate the softmax cross-entropy loss, reducing computational costs by sampling false labels from its noise distribution in training. Trouillon et al. (2016) import the NS loss from word embedding to KGE, using a uniform distribution as its noise distribution. Sun et al. (2019) extend the NS loss to the Self-Adversarial Negative Sampling (SANS) loss for efficient training of KGE. Unlike the NS loss with uniform distribution, the SANS loss utilizes the training model's prediction as the noise distribution. Since the negative samples in the SANS loss become more difficult for models to discriminate during training, SANS can draw out more of a model's potential than the NS loss with uniform distribution.

Figure 1: Appearance frequencies of queries and answers (entities) in the training data of FB15k-237, WN18RR, and YAGO3-10. Note that the indices are sorted from high frequency to low.
Figure 2: Performances of KGE models HousE, HAKE, RotatE, ComplEx, DistMult, and TransE on datasets FB15k-237, WN18RR, and YAGO3-10 using NS, SANS, and subsampling methods (noted as Base, Freq, Uniq).

One of the problems left for KGE is the sparsity of KGs. Figure 1 shows the appearance frequency of queries and answers (entities) in the training data of FB15k-237, WN18RR and YAGO3-10 datasets. From the long-tail distribution of this figure, we can understand that both queries and answers necessary for training KGE models may suffer from the sparsity problem.

As a solution, several smoothing methods are used in KGE. Sun et al. (2019) import subsampling from word2vec (Mikolov et al., 2013) to KGE. Subsampling can smooth the appearance frequency of triplets and queries in KGs. Kamigaito and Hayashi (2022a) show a general formulation that covers the basic subsampling of Sun et al. (2019) (Base), their frequency-based subsampling (Freq), and unique-based subsampling (Uniq) for KGE. Kamigaito and Hayashi (2021) indicate that SANS has an effect similar to label smoothing (Szegedy et al., 2016), and thus SANS can smooth the frequencies of answers in training. Figure 2 shows the effectiveness of SANS and subsampling in KGC performance. Since FB15k-237 is more sparse (imbalanced) than WN18RR and YAGO3-10 according to Figure 1, the figure suggests that the choice of smoothing method has a more considerable influence than the choice of model when data is sparse.

While SANS and subsampling can improve model performance by smoothing the appearance frequencies of triplets, queries, and answers, their theoretical relationship is not clear, which leaves their capabilities and deficiencies an open question. For example, conventional works (Sun et al., 2019; Zhang et al., 2020b; Kamigaito and Hayashi, 2022a) jointly use SANS and subsampling with no theoretical background. (Note that Sun et al. (2019) and Zhang et al. (2020b) use subsampling in their released implementations without referring to it in their papers.) Thus, there is a call for further interpretability and performance improvement.

To solve the above problem, we theoretically and empirically study the differences between SANS and subsampling on three common datasets and their sparser subsets with six popular KGE models. Our code and data are available at https://github.com/xincanfeng/ss_kge. Our contributions are as follows:

  • By focusing on the smoothing targets, we theoretically reveal the differences between SANS and subsampling and induce a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the smoothing target of both SANS and subsampling.

  • We theoretically show that TANS with subsampling can potentially cover the conventional usages of SANS and subsampling.

  • We empirically verify that TANS improves KGC performance on sparse KGs in terms of MRR.

  • We empirically verify that TANS with subsampling can cover the conventional usages of SANS and subsampling in terms of MRR.

2 Background

In this section, we describe the problem formulation for solving KGC by KGE and explain the conventional NS loss functions in KGE.

2.1 Formulation of KGE

KGC is a research topic for automatically inferring new links in a KG that are likely but not yet known to be true. To infer the new links by KGE, we decompose KGs into a set of triplets (links). By using entities $e_i$, $e_j$ and their relation $r_k$, we represent the triplet as $(e_i, r_k, e_j)$. In a typical KGC task, a KGE model receives a query $(e_i, r_k, ?)$ or $(?, r_k, e_j)$ and predicts the entity corresponding to $?$ as an answer.

In KGE, a KGE model scores a triplet $(e_i, r_k, e_j)$ by using a scoring function $s_{\theta}(x, y)$, where $\theta$ denotes model parameters. Here, using a softmax function, we represent the existence probability $p_{\theta}(y|x)$ for an answer $y$ of the query $x$ as follows:

$$p_{\theta}(y|x) = \frac{\exp(s_{\theta}(x, y))}{\sum_{y' \in Y} \exp(s_{\theta}(x, y'))}, \tag{1}$$

where Y is a set of entities.
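To make Eq. (1) concrete, the following minimal Python sketch normalizes the scores of all candidate answers for one query. The function name `answer_probabilities` and the toy scores are hypothetical and serve only as illustration; in an actual KGE model the scores would come from $s_{\theta}(x, y)$.

```python
import math

def answer_probabilities(scores):
    """Softmax of Eq. (1) over the scores s_theta(x, y') of every candidate answer y'."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    normalizer = sum(exps.values())
    return {y: e / normalizer for y, e in exps.items()}

# Hypothetical scores for one query x over a three-entity set Y.
toy_scores = {"Tokyo": 2.0, "Kyoto": 1.0, "Paris": -1.0}
probs = answer_probabilities(toy_scores)
print(probs)
```

The sum in the denominator runs over every entity in $Y$, which is why this exact computation becomes costly for large KGs.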

2.2 NS Loss in KGE

To train $s_{\theta}(x, y)$, we need to calculate losses for the observables $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ that follow $p_d(x, y)$. Even though Eq. (1) can represent KGC, it does not mean we can tractably perform KGC, because the number of entities in $Y$ is large in KGs. To reduce the computational cost, the NS loss (Mikolov et al., 2013) is used to approximate Eq. (1) by sampling false answers.

By modifying that of Mikolov et al. (2013), the following NS loss (Sun et al., 2019; Ahrabian et al., 2020) is commonly used in KGE:

$$\begin{aligned} \ell_{\text{NS}}(\theta) = -\frac{1}{|D|} \sum_{(x,y) \in D} \Bigl[ & \log(\sigma(s_{\theta}(x,y) + \tau)) \\ & + \frac{1}{\nu} \sum_{y_i \sim U}^{\nu} \log(\sigma(-s_{\theta}(x,y_i) - \tau)) \Bigr], \end{aligned} \tag{2}$$

where $U$ is the noise distribution that follows uniform distribution, $\sigma$ is the sigmoid function, $\nu$ is the number of negative samples per positive sample $(x, y)$, and $\tau$ is a margin term to adjust the value range decided by $s_{\theta}(x, y)$.
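The NS loss of Eq. (2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: `ns_loss`, `log_sigmoid`, and the callable `score` (standing in for $s_{\theta}$) are hypothetical names, and negatives are drawn uniformly from the whole entity list, so a sampled negative may occasionally coincide with the true answer.

```python
import math
import random

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def ns_loss(data, entities, score, nu=4, tau=0.0, rng=None):
    """Eq. (2): average over (x, y) in D of the positive term plus
    the mean of nu negative terms with y_i drawn uniformly from entities."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    total = 0.0
    for x, y in data:
        positive = log_sigmoid(score(x, y) + tau)
        negatives = [rng.choice(entities) for _ in range(nu)]
        negative = sum(log_sigmoid(-score(x, y_i) - tau) for y_i in negatives) / nu
        total += positive + negative
    return -total / len(data)

# Hypothetical toy data: one query with one true answer.
toy_data = [("q1", "a")]
toy_entities = ["a", "b", "c"]
flat_loss = ns_loss(toy_data, toy_entities, lambda x, y: 0.0)
print(flat_loss)
```

With a scoring function that returns 0 everywhere and $\tau = 0$, every term equals $\log \sigma(0) = -\log 2$, so the loss is $2 \log 2$ regardless of which negatives are sampled.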

2.3 Smoothing Methods for the NS Loss in KGE

As shown in Figure 1, KGC needs to deal with the sparsity problem caused by low-frequency queries and answers in KGs. Imposing smoothing on the appearance frequencies of queries and answers can mitigate this problem. The following subsections introduce subsampling (Mikolov et al., 2013; Sun et al., 2019; Kamigaito and Hayashi, 2022a) and SANS (Sun et al., 2019), the conventional smoothing methods for the NS loss in KGE.

2.3.1 Subsampling

Subsampling (Mikolov et al., 2013) is a method to smooth the frequency of triplets or queries in the NS loss. Sun et al. (2019) import this approach from word embedding to KGE. Kamigaito and Hayashi (2022b, a) add some variants to subsampling for KGC and theoretically provide a unified expression of them as follows:

$$\begin{aligned} \ell_{\text{SUB}}(\theta) = -\frac{1}{|D|} \sum_{(x,y) \in D} \Bigl[ & A(x,y;\alpha) \log(\sigma(s_{\theta}(x,y) + \tau)) \\ & + \frac{1}{\nu} \sum_{y_i \sim U}^{\nu} B(x,y;\alpha) \log(\sigma(-s_{\theta}(x,y_i) - \tau)) \Bigr], \end{aligned} \tag{3}$$

where $\alpha$ is a temperature term to adjust the frequency of triplets and queries. Note that we incorporate $\alpha$ into Eq. (3) to consider various loss functions even though Kamigaito and Hayashi (2022b, a) do not consider $\alpha$. In this formulation, we can consider several assumptions for deciding $A(x,y;\alpha)$ and $B(x,y;\alpha)$. We introduce these assumptions in the following paragraphs:

Base

As a basic subsampling approach, Sun et al. (2019) import the one originally used in word2vec (Mikolov et al., 2013) to KGE, defined as follows:

$$A(x,y;\alpha) = B(x,y;\alpha) = \frac{\#(x,y)^{-\alpha} |D|}{\sum_{(x',y') \in D} \#(x',y')^{-\alpha}}, \tag{4}$$

where $\#$ is the symbol for frequency and $\#(x,y)$ represents the frequency of $(x,y)$. In word2vec, subsampling randomly discards a word with probability $1 - \sqrt{t/f}$, where $t$ is a constant value and $f$ is the frequency of the word. This is equivalent to randomly keeping a word with probability $\sqrt{t/f}$. Thus, we can understand that Eq. (4) follows the original use in word2vec. Since the actual $(x,y)$ occurs at most once in KGs, when $(x,y) = (e_i, r_k, e_j)$, they approximate the frequency of $(x,y)$ as:

$$\#(x,y) \approx \#(e_i, r_k) + \#(r_k, e_j), \tag{5}$$

based on the approximation of n-gram language modeling (Katz, 1987).
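As a rough illustration of Eqs. (4) and (5), the sketch below computes the Base weights for a handful of triplets. The helper name `base_weights` and the toy triplets are hypothetical. The default $\alpha = 0.5$ is chosen here only because it mirrors the square root in the word2vec keep probability $\sqrt{t/f}$.

```python
from collections import Counter

def base_weights(triplets, alpha=0.5):
    """A(x, y; alpha) = B(x, y; alpha) of Eq. (4), one weight per triplet,
    with #(x, y) approximated by #(e_i, r_k) + #(r_k, e_j) as in Eq. (5)."""
    head_rel = Counter((h, r) for h, r, _ in triplets)
    rel_tail = Counter((r, t) for _, r, t in triplets)
    freq = [head_rel[(h, r)] + rel_tail[(r, t)] for h, r, t in triplets]
    powered = [f ** -alpha for f in freq]
    normalizer = sum(powered)
    # Multiplying by |D| makes the weights average to 1 over the data.
    return [p * len(triplets) / normalizer for p in powered]

# Hypothetical toy KG given as (head, relation, tail) triplets.
toy_triplets = [("a", "r", "b"), ("a", "r", "c"), ("d", "r", "b"), ("e", "q", "f")]
weights = base_weights(toy_triplets)
print(weights)
```

Triplets whose head-relation and relation-tail pairs are rare receive larger weights, and setting $\alpha = 0$ gives every triplet a weight of 1, which recovers the unweighted NS loss of Eq. (2).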

Freq

Kamigaito and Hayashi (2022a) propose frequency-based subsampling (Freq) by assuming a case where $(x,y)$ originally has a frequency, but the observed one in the KG is at most 1:

$$\begin{aligned} A(x,y;\alpha) &= \frac{\#(x,y)^{-\alpha} |D|}{\sum_{(x',y') \in D} \#(x',y')^{-\alpha}}, \\ B(x,y;\alpha) &= \frac{\#x^{-\alpha} |D|}{\sum_{x' \in D} \#x'^{-\alpha}}. \end{aligned} \tag{6}$$

Uniq

Kamigaito and Hayashi (2022a) also propose unique-based subsampling (Uniq) by assuming a case where the original frequency and the observed one in the KG are both 1:

$$A(x,y;\alpha) = B(x,y;\alpha) = \frac{\#x^{-\alpha} |D|}{\sum_{x' \in D} \#x'^{-\alpha}}. \tag{7}$$
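The difference between Freq (Eq. (6)) and Uniq (Eq. (7)) can be seen in a small sketch. It rests on two simplifying assumptions that are not from the paper: only tail-prediction queries $x = (e_i, r_k, ?)$ are considered, so that $\#x = \#(e_i, r_k)$, and $\#(x,y)$ is approximated as in Eq. (5). All function names are hypothetical.

```python
from collections import Counter

def normalized(values, alpha):
    """value^(-alpha) * |D| / sum of value'^(-alpha), the shared form of Eqs. (6) and (7)."""
    powered = [v ** -alpha for v in values]
    normalizer = sum(powered)
    return [p * len(values) / normalizer for p in powered]

def freq_and_uniq_weights(triplets, alpha=0.5):
    """Return (A, B) weight lists for Freq and Uniq, one entry per triplet."""
    head_rel = Counter((h, r) for h, r, _ in triplets)
    rel_tail = Counter((r, t) for _, r, t in triplets)
    triplet_freq = [head_rel[(h, r)] + rel_tail[(r, t)] for h, r, t in triplets]
    query_freq = [head_rel[(h, r)] for h, r, _ in triplets]  # assumed: tail queries only
    by_triplet = normalized(triplet_freq, alpha)
    by_query = normalized(query_freq, alpha)
    return {
        "freq": (by_triplet, by_query),  # Eq. (6): A uses #(x, y), B uses #x
        "uniq": (by_query, by_query),    # Eq. (7): A and B both use #x
    }

# Hypothetical toy KG given as (head, relation, tail) triplets.
toy_triplets = [("a", "r", "b"), ("a", "r", "c"), ("d", "r", "b"), ("e", "q", "f")]
subsampling = freq_and_uniq_weights(toy_triplets)
print(subsampling["freq"])
print(subsampling["uniq"])
```

In this reading, Freq weights the positive term by triplet frequency and the negative term by query frequency, while Uniq weights both terms by query frequency only.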

2.3.2 SANS Loss

SANS was originally proposed as a kind of NS loss to train KGE models efficiently by considering negative samples close to their corresponding positive ones. Kamigaito and Hayashi (2021) show that using SANS is similar to imposing label smoothing on Eq. (1). Thus, SANS is a method to smooth the frequency of answers in the NS loss. The SANS loss is represented as follows:

$$\begin{aligned} \ell_{\text{SANS}}(\theta) = -\frac{1}{|D|} \sum_{(x,y) \in D} \Bigl[ & \log(\sigma(s_{\theta}(x,y) + \tau)) \\ & + \sum_{y_i \sim U}^{\nu} p_{\theta}(y_i|x;\beta) \log(\sigma(-s_{\theta}(x,y_i) - \tau)) \Bigr], \end{aligned} \tag{8}$$

$$p_{\theta}(y_i|x;\beta) \approx \frac{\exp(\beta s_{\theta}(x,y_i))}{\sum_{j=1}^{\nu} \exp(\beta s_{\theta}(x,y_j))}, \tag{9}$$

where $\beta$ is a temperature to adjust the distribution of negative sampling. Different from subsampling, SANS uses $p_{\theta}(y_i|x;\beta)$ that is predicted by a model $\theta$ to adjust the frequency of the answer $y_i$. Since $p_{\theta}(y_i|x;\beta)$ is essentially a noise distribution, it does not receive any gradient during training.
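A minimal sketch of Eqs. (8) and (9) for a single positive pair is given below. The names `sans_weights` and `sans_example_loss` are hypothetical, and the scores are toy values. Plain Python has no gradients, so the weights are simply constants here; in an automatic-differentiation framework they would presumably be excluded from backpropagation to match the description above.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def sans_weights(neg_scores, beta=1.0):
    """Eq. (9): softmax with temperature beta over the nu sampled negatives."""
    top = max(beta * s for s in neg_scores)
    exps = [math.exp(beta * s - top) for s in neg_scores]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]

def sans_example_loss(pos_score, neg_scores, beta=1.0, tau=0.0):
    """The bracketed term of Eq. (8) for one (x, y), negated."""
    weights = sans_weights(neg_scores, beta)  # treated as constants
    positive = log_sigmoid(pos_score + tau)
    negative = sum(w * log_sigmoid(-s - tau) for w, s in zip(weights, neg_scores))
    return -(positive + negative)

# Hypothetical scores of three sampled negatives for one query.
toy_neg_scores = [0.0, 1.0, 2.0]
neg_weights = sans_weights(toy_neg_scores)
example_loss = sans_example_loss(2.0, toy_neg_scores)
print(neg_weights, example_loss)
```

Higher-scoring (harder) negatives receive larger weights, and $\beta = 0$ reduces the weights to the uniform $1/\nu$ of Eq. (2).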

| Method | Smoothing $p(x,y)$ | Smoothing $p(y \mid x)$ | Smoothing $p(x)$ | Remarks |
|---|---|---|---|---|
| Subsampling (Base) | $\checkmark$ | $\triangle$ | $\triangle$ | $p(y \mid x)$ and $p(x)$ are influenced by $p(x,y)$. |
| Subsampling (Uniq) | $\triangle$ | $\times$ | $\checkmark$ | $p(x,y)$ is indirectly controlled by $p(x)$. |
| Subsampling (Freq) | $\checkmark$ | $\triangle$ | $\checkmark$ | $p(y \mid x)$ is indirectly controlled by $p(x,y)$ or $p(x)$. |
| SANS | $\triangle$ | $\checkmark$ | $\times$ | $p(x,y)$ is indirectly controlled by $p(y \mid x)$. |
| TANS | $\checkmark$ | $\checkmark$ | $\checkmark$ | |

Table 1: The characteristics of each smoothing method for the NS loss in KGE (See §2.3 for the details.) and our proposed TANS. $\checkmark$ and $\triangle$ respectively denote that the method smooths the probability directly and indirectly. $\times$ denotes that the method does not smooth the probability.

3 Triplet Adaptive Negative Sampling

In this section, we explain our proposed Triplet Adaptive Negative Sampling (TANS) in detail. We first show the overview of our TANS through the comparison with the conventional smoothing methods of the NS loss for KGE (See §2.3) in §3.1 and after that we explain the details of TANS through its mathematical formulations in §3.2 and §3.3.

3.1 Overview

TANS is fundamentally different from SANS: SANS only takes into account the conditional probability of negative samples, whereas TANS is a loss function that considers the joint probability of the pair of queries and their answers.

Table 1 shows the characteristics of TANS and the conventional smoothing methods of the NS loss for KGE introduced in §2.3. These characteristics are based on the decomposition of $p_d(x,y)$, the appearance probability for the triplet $(x,y)$, into that of its answer $p_d(y|x)$ and query $p_d(x)$:

$$p_d(x,y) = p_d(y|x)\, p_d(x). \tag{10}$$

In Eq. (10), smoothing both $p_d(y|x)$ and $p_d(x)$ is similar to smoothing $p_d(x,y)$. However, smoothing $p_d(x,y)$ does not ensure that both $p_d(x)$ and $p_d(y|x)$ are smoothed, since only one of them may be smoothed while the other remains sparse. Similarly, smoothing only $p_d(x)$ or $p_d(y|x)$ does not ensure that $p_d(x,y)$ is smoothed, because the other factor may still be sparse. In Table 1, we denote such a case, where the method can influence the probability but there is no guarantee that the probability is smoothed, as $\triangle$.

In TANS, we aim to smooth $p_d(x,y)$ by smoothing both $p_d(y|x)$ and $p_d(x)$ based on Eq. (10).
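The decomposition in Eq. (10) can be checked on empirical counts with a small sketch. The function `empirical_distributions` and the toy query-answer pairs are hypothetical. Exact fractions are used so that the equality holds without rounding.

```python
from collections import Counter
from fractions import Fraction

def empirical_distributions(pairs):
    """Estimate p_d(x, y), p_d(x), and p_d(y | x) from observed (query, answer) pairs."""
    n = len(pairs)
    joint_counts = Counter(pairs)
    query_counts = Counter(x for x, _ in pairs)
    p_joint = {xy: Fraction(c, n) for xy, c in joint_counts.items()}
    p_query = {x: Fraction(c, n) for x, c in query_counts.items()}
    p_cond = {(x, y): Fraction(c, query_counts[x]) for (x, y), c in joint_counts.items()}
    return p_joint, p_query, p_cond

# Hypothetical observations: q1 has three answers, q2 has one.
toy_pairs = [("q1", "a"), ("q1", "b"), ("q1", "c"), ("q2", "a")]
p_joint, p_query, p_cond = empirical_distributions(toy_pairs)
for (x, y), p in p_joint.items():
    # Eq. (10): the joint equals the conditional times the query marginal.
    assert p == p_cond[(x, y)] * p_query[x]
print(p_joint, p_query, p_cond)
```

In this toy data the joint probability is flat (every pair has probability 1/4), yet the query marginal is skewed (3/4 against 1/4) and so are the conditionals (1/3 against 1). This is one concrete instance of a smooth $p_d(x,y)$ that does not imply smooth factors.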

3.2 Formulation

Here, we induce TANS from SANS, aiming to smooth $p_d(x,y)$ by smoothing both $p_d(y|x)$ and $p_d(x)$. First, we assume a simple replacement of $p_{\theta}(y|x)$ with $p_{\theta}(x,y)$ in $\ell_{\text{SANS}}(\theta)$ of Eq. (8):

$$\begin{aligned} -\frac{1}{|D|} \sum_{(x,y) \in D} \Bigl[ & \log(\sigma(s_{\theta}(x,y) + \tau)) \\ & + \sum_{y_i \sim U}^{\nu} p_{\theta}(x,y_i) \log(\sigma(-s_{\theta}(x,y_i) - \tau)) \Bigr]. \end{aligned} \tag{11}$$

However, using Eq. (11) causes an imbalanced loss between the first and second terms since the sum of $p_{\theta}(x,y_i)$ on all negative samples is not always 1. Thus, Eq. (11) is impractical as a loss function.
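The imbalance can be illustrated numerically. The sketch below assumes, purely for illustration, that the joint probability is a softmax over a small table of scored query-answer pairs. The table, the function names, and the choice of negatives are all hypothetical.

```python
import math

def joint_probabilities(score_table):
    """Assumed joint p_theta(x, y): softmax over every scored (x, y) pair."""
    top = max(score_table.values())
    exps = {xy: math.exp(s - top) for xy, s in score_table.items()}
    normalizer = sum(exps.values())
    return {xy: e / normalizer for xy, e in exps.items()}

def conditional_over_negatives(score_table, x, negatives):
    """p_theta(y_i | x) normalized over the sampled negatives, as in Eq. (9) with beta = 1."""
    exps = [math.exp(score_table[(x, y_i)]) for y_i in negatives]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]

# Hypothetical scores for two queries.
toy_table = {
    ("q1", "a"): 1.0, ("q1", "b"): 0.5, ("q1", "c"): -0.5,
    ("q2", "a"): 2.0, ("q2", "b"): 0.0,
}
toy_negatives = ["b", "c"]  # negatives sampled for query q1
p_joint = joint_probabilities(toy_table)
joint_mass = sum(p_joint[("q1", y_i)] for y_i in toy_negatives)
cond_mass = sum(conditional_over_negatives(toy_table, "q1", toy_negatives))
print(joint_mass, cond_mass)
```

The conditional weights sum to 1 by construction, whereas the joint weights of the same negatives sum to a query-dependent value below 1, so the negative term of Eq. (11) is scaled differently from the positive term for each query.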

As a solution, we focus on the decomposition $p_{\theta}(x,y) = p_{\theta}(y|x)\, p_{\theta}(x)$ and the fact that the sum of $p_{\theta}(y|x)$ over all negative samples is always 1. By using $p_{\theta}(x)$ to balance the first and second loss terms, we can modify Eq. (11) and induce our TANS as follows:

\begin{align}
\ell_{\text{TANS}}(\mathbf{\theta}) = & -\frac{1}{|D|}\sum_{(x,y)\in D} p_{\theta}(x;\gamma)\Bigl[\log(\sigma(s_{\theta}(x,y)+\tau)) \nonumber\\
& +\sum_{y_{i}\sim U}^{\nu} p_{\theta}(y_{i}|x;\beta)\log(\sigma(-s_{\theta}(x,y_{i})-\tau))\Bigr], \tag{12}\\
p_{\mathbf{\theta}}(x;\gamma) = & \sum_{y_{i}\in D} p_{\mathbf{\theta}}(x,y_{i};\gamma), \qquad
p_{\mathbf{\theta}}(x,y_{i};\gamma) = \frac{\exp(\gamma s_{\theta}(x,y_{i}))}{\sum_{(x^{\prime},y^{\prime})\in D}\exp(\gamma s_{\theta}(x^{\prime},y^{\prime}))}, \tag{13}
\end{align}

where $\gamma$ is a temperature to smooth the frequency of queries. Since TANS uses a noise distribution decided by $p_{\theta}(x;\gamma)$ and $p_{\theta}(y_{i}|x;\beta)$, it does not propagate gradients through probabilities for negative samples, and thus, memory usage is not increased.
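To make Eqs. (12) and (13) concrete, the following is a minimal sketch in plain Python on precomputed scores. It is not the authors' implementation: the function names, the toy data, and the choice to normalize $p_{\theta}(y_{i}|x;\beta)$ over the sampled negatives are assumptions made for illustration. The weights are computed as plain numbers, mirroring the statement that no gradient flows through them.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def softmax(values, temperature):
    """Softmax of temperature * values; temperature 0 gives the uniform distribution."""
    scaled = [temperature * v for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def query_weights(pos_scores, gamma):
    """p_theta(x; gamma) of Eq. (13): joint softmax over D, summed over the answers of x."""
    pairs = list(pos_scores)
    joint = softmax([pos_scores[p] for p in pairs], gamma)
    weights = {}
    for (x, _), p in zip(pairs, joint):
        weights[x] = weights.get(x, 0.0) + p
    return weights

def tans_loss(pos_scores, neg_scores, beta, gamma, tau):
    """Eq. (12) on precomputed scores; the weights are constants (no gradient)."""
    p_x = query_weights(pos_scores, gamma)
    total = 0.0
    for (x, y), s_pos in pos_scores.items():
        negs = neg_scores[(x, y)]
        p_neg = softmax(negs, beta)  # assumed: normalized over the sampled negatives
        neg_term = sum(p * log_sigmoid(-s - tau) for p, s in zip(p_neg, negs))
        total += p_x[x] * (log_sigmoid(s_pos + tau) + neg_term)
    return -total / len(pos_scores)

# Hypothetical toy data: two queries, three observed triplets, two negatives each.
pos = {("q1", "a"): 2.0, ("q1", "b"): 1.0, ("q2", "c"): 0.5}
neg = {("q1", "a"): [-1.0, 0.3], ("q1", "b"): [0.2, -0.5], ("q2", "c"): [1.5, -2.0]}
loss = tans_loss(pos, neg, beta=1.0, gamma=-0.5, tau=0.0)
print(loss)
```

In an automatic-differentiation framework, the same effect would typically be obtained by detaching the two weight tensors from the computation graph before multiplying them into the loss.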

| Temperature $\alpha$ | Temperature $\beta$ | Temperature $\gamma$ | Induced NS loss |
|---|---|---|---|
| $=0$ | $=0$ | $=0$ | Equivalent to $\ell_{\text{NS}}(\mathbf{\theta})$, the basic NS loss in KGE (Eq. (2)) |
| $=0$ | $=0$ | $\neq 0$ | Currently does not exist |
| $=0$ | $\neq 0$ | $=0$ | Proportional to $\ell_{\text{SANS}}(\mathbf{\theta})$, the SANS loss (Eq. (9)) |
| $=0$ | $\neq 0$ | $\neq 0$ | Equivalent to our $\ell_{\text{TANS}}(\mathbf{\theta})$, the TANS loss (Eq. (12)) |
| $\neq 0$ | $=0$ | $=0$ | Proportional to $\ell_{\text{NS}}(\mathbf{\theta})$, the basic NS loss in KGE (Eq. (2)) with subsampling in §2.3 |
| $\neq 0$ | $=0$ | $\neq 0$ | Currently does not exist |
| $\neq 0$ | $\neq 0$ | $=0$ | Proportional to $\ell_{\text{SANS}}(\mathbf{\theta})$, the SANS loss (Eq. (9)) with subsampling in §2.3 |
| $\neq 0$ | $\neq 0$ | $\neq 0$ | Equivalent to our $\ell_{\text{UNI}}(\mathbf{\theta})$, the unified NS loss in KGE (Eq. (16)), and also equivalent to our $\ell_{\text{TANS}}(\mathbf{\theta})$, the TANS loss (Eq. (12)) with subsampling in §2.3 |

Table 2: The relationship among the loss functions from the viewpoint of the unified NS loss, $\ell_{\text{UNI}}(\mathbf{\theta})$ in Eq. (16).

3.3 Theoretical Interpretation

In this subsection, we discuss the differences and similarities between TANS and other smoothing methods for the NS loss in KGE. As shown in Table 1, the subsampling methods, Base and Freq, can smooth triplet frequencies in a way similar to our TANS. To investigate TANS from the viewpoint of subsampling, we reformulate Eq. (12) as follows:

\begin{align}
\ell_{\text{TANS}}(\mathbf{\theta}) = & -\frac{1}{|D|}\sum_{(x,y)\in D}\Bigl[A(x,y;\gamma)\log(\sigma(s_{\theta}(x,y)+\tau)) \nonumber\\
& +\sum_{y_{i}\sim U}^{\nu} B(x,y;\beta,\gamma)\log(\sigma(-s_{\theta}(x,y_{i})-\tau))\Bigr], \tag{14}\\
A(x,y;\gamma) = & \; p_{\theta}(x;\gamma), \qquad
B(x,y;\beta,\gamma) = p_{\theta}(y_{i}|x;\beta)\,p_{\theta}(x;\gamma). \tag{15}
\end{align}

Apart from the temperature terms $\alpha$, $\beta$, and $\gamma$, the general formulation of subsampling in Eq. (3) and Eq. (14) above have the same form. Thus, TANS is not merely an extension of SANS but also a novel subsampling method.
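The step from Eq. (12) to Eq. (14) only distributes the query weight over the bracket. Written out for a single observed pair $(x,y)$ as a sketch, using the symbols defined in Eqs. (12), (13), and (15):

```latex
\begin{align*}
& p_{\theta}(x;\gamma)\Bigl[\log(\sigma(s_{\theta}(x,y)+\tau))
   + \sum_{y_{i}\sim U}^{\nu} p_{\theta}(y_{i}|x;\beta)\log(\sigma(-s_{\theta}(x,y_{i})-\tau))\Bigr] \\
&\quad = \underbrace{p_{\theta}(x;\gamma)}_{A(x,y;\gamma)}\log(\sigma(s_{\theta}(x,y)+\tau))
   + \sum_{y_{i}\sim U}^{\nu} \underbrace{p_{\theta}(y_{i}|x;\beta)\,p_{\theta}(x;\gamma)}_{B(x,y;\beta,\gamma)}\log(\sigma(-s_{\theta}(x,y_{i})-\tau)).
\end{align*}
```

The weight on the positive term depends only on the query, while the weight on each negative term additionally carries the SANS-style factor $p_{\theta}(y_{i}|x;\beta)$.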

Despite their similar characteristics, TANS and subsampling differ in an essential way: TANS smooths the frequencies by model-predicted distributions as in Eq. (13), whereas the subsampling methods smooth them by counting appearance frequencies on the observed data as in Eqs. (4), (5), (6), and (7). For instance, TANS can work even when the entity or relations included in the target triplet appear more than once, which is theoretically different from conventional approaches.

Since the superiority of using either model-based or count-based frequencies depends on the model and dataset, we empirically investigate this point through our experiments.
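The contrast between the two kinds of frequency can be illustrated with a small sketch. The count-based weight below, proportional to `count(x) ** (-alpha)`, is a hypothetical stand-in chosen for illustration and does not reproduce Eqs. (4) to (7); the model-based weight follows Eq. (13). All names and toy scores are assumptions.

```python
import math
from collections import Counter

# Hypothetical observed (query, answer) pairs with model scores s_theta(x, y).
pos_scores = {("q1", "a"): 2.0, ("q1", "b"): 1.0, ("q2", "c"): 0.5, ("q3", "d"): -1.0}

def count_based_weights(pairs, alpha):
    """Illustrative count-based smoothing: weight proportional to count(x) ** (-alpha)."""
    counts = Counter(x for x, _ in pairs)
    raw = {x: c ** (-alpha) for x, c in counts.items()}
    total = sum(raw.values())
    return {x: w / total for x, w in raw.items()}

def model_based_weights(pos_scores, gamma):
    """p_theta(x; gamma) as in Eq. (13), computed from model scores."""
    z = sum(math.exp(gamma * s) for s in pos_scores.values())
    weights = {}
    for (x, _), s in pos_scores.items():
        weights[x] = weights.get(x, 0.0) + math.exp(gamma * s) / z
    return weights

count_w = count_based_weights(pos_scores, alpha=0.5)
model_w = model_based_weights(pos_scores, gamma=-0.5)
# q2 and q3 both occur once, so counting cannot tell them apart;
# the model-based weights differ because their scores differ.
print(count_w["q2"] == count_w["q3"], model_w["q2"] != model_w["q3"])
```

Which of the two behaviours helps more is exactly the empirical question raised above, so the sketch only shows that they can disagree, not that one is preferable.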

4 Unified Interpretation of SANS and Subsampling

In the previous section, we showed that our TANS can smooth triplets, queries, and answers, which SANS and subsampling methods cover only partially. On the other hand, TANS relies only on model-predicted frequencies for smoothing. Neubig and Dyer (2016) point out the benefits of combining count-based and model-predicted frequencies in language modeling. This section integrates smoothing methods for the NS loss in KGE under a unified interpretation.

4.1 Formulation

We formulate the unified loss function by introducing subsampling (Eq. (3)) into our TANS (Eq. (12)) as follows:

\begin{align}
\ell_{\text{UNI}}(\mathbf{\theta}) = & -\frac{1}{|D|}\sum_{(x,y)\in D} p_{\theta}(x;\gamma)\Bigl[A(x,y;\alpha)\log(\sigma(s_{\theta}(x,y)+\tau)) \nonumber\\
& +\eta\sum_{y_{i}\sim U}^{\nu} B(x,y;\alpha)\,p_{\theta}(y_{i}|x;\beta)\log(\sigma(-s_{\theta}(x,y_{i})-\tau))\Bigr], \tag{16}
\end{align}

where $\eta$ is a hyperparameter that can take any value, introduced to absorb the differences among the three subsampling methods, Base, Uniq, and Freq.

Here, we can induce the NS losses shown in our paper from Eq. (16) by changing the temperature parameters $\alpha$, $\beta$, and $\gamma$. Table 2 shows the losses induced from our $\ell_{\text{UNI}}(\mathbf{\theta})$. Note that since $p_{\theta}(x;\gamma)$ only appears in our TANS, canceling $p_{\theta}(x;\gamma)$ by setting $\gamma=0$ yields a loss that is proportional, rather than equivalent, to the conventional NS loss.
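The reduction to a proportional NS loss can be checked on toy data. The sketch below rests on several assumptions that are not stated in this section: the subsampling weights $A$ and $B$ are passed in as plain dictionaries and set to 1 to mimic $\alpha=0$, the basic NS loss is taken to weight each of the $\nu$ negatives by $1/\nu$, and every query in the toy data has exactly one observed answer so that $p_{\theta}(x;0)$ is the constant $1/|D|$. All function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def softmax(values, temperature):
    """Softmax of temperature * values; temperature 0 gives the uniform distribution."""
    scaled = [temperature * v for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def uni_loss(pos_scores, neg_scores, A, B, beta, gamma, eta, tau):
    """Eq. (16) on precomputed scores; A and B map each triplet to a supplied weight."""
    pairs = list(pos_scores)
    joint = softmax([pos_scores[p] for p in pairs], gamma)
    p_x = {}
    for (x, _), p in zip(pairs, joint):
        p_x[x] = p_x.get(x, 0.0) + p
    total = 0.0
    for pair in pairs:
        negs = neg_scores[pair]
        p_neg = softmax(negs, beta)
        neg_term = sum(p * log_sigmoid(-s - tau) for p, s in zip(p_neg, negs))
        total += p_x[pair[0]] * (A[pair] * log_sigmoid(pos_scores[pair] + tau)
                                 + eta * B[pair] * neg_term)
    return -total / len(pairs)

def ns_loss(pos_scores, neg_scores, tau):
    """Assumed basic NS form: every sampled negative is weighted 1/nu."""
    total = 0.0
    for pair, s in pos_scores.items():
        negs = neg_scores[pair]
        total += log_sigmoid(s + tau) + sum(log_sigmoid(-n - tau) for n in negs) / len(negs)
    return -total / len(pos_scores)

# Hypothetical toy data: three queries with one observed answer each.
pos = {("q1", "a"): 2.0, ("q2", "b"): 1.0, ("q3", "c"): 0.5}
neg = {("q1", "a"): [-1.0, 0.3], ("q2", "b"): [0.2, -0.5], ("q3", "c"): [1.5, -2.0]}
ones = {pair: 1.0 for pair in pos}

all_zero = uni_loss(pos, neg, ones, ones, beta=0.0, gamma=0.0, eta=1.0, tau=0.0)
ratio = all_zero / ns_loss(pos, neg, tau=0.0)
print(ratio, 1 / len(pos))
```

Under these assumptions the two losses differ only by the constant factor $1/|D|$, so they share the same minimizer; with a nonzero $\gamma$ the per-query weights are no longer constant and the ratio depends on the scores.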

4.2 Theoretical Interpretation

As shown in Table 2, TANS w/ subsampling has the characteristics of all smoothing methods for the NS loss in KGE introduced in this paper. Therefore, we can expect TANS w/ subsampling to perform better than combinations of the conventional methods, i.e., the basic NS, SANS, and subsampling. However, because TANS w/ subsampling uses the subsampling in §2.3, we need to choose one of Base, Uniq, and Freq for it. Since this choice is outside the scope of the theoretical interpretation, we investigate it in the experiments.

(a) Results on datasets FB15k-237, WN18RR, YAGO3-10 using NS, SANS, TANS, and NS with subsampling.
(b) Results on datasets FB15k-237, WN18RR, YAGO3-10 using SANS, TANS, and those with subsampling.
Figure 3: KGC performance on common KGs (Notations are the same as in Figure 2).
Figure 4: KGC performance on filtered sparser KGs, i.e., FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL (Notations are the same as in Figure 2).

5 Experiments

In this section, we investigate our theoretical interpretation in §3.3 and §4.2 through experiments.

5.1 Experimental Settings

Datasets We used three common datasets, FB15k-237 (Toutanova and Chen, 2015), WN18RR, and YAGO3-10 (Dettmers et al., 2018). Table 3 in Appendix A shows the dataset statistics.

Comparison Methods As comparison methods, we used TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), RotatE (Sun et al., 2019), HAKE (Zhang et al., 2020a), and HousE (Li et al., 2022). We followed the original settings of Sun et al. (2019) for TransE, DistMult, ComplEx, and RotatE with their implementation (https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding), the original settings of Zhang et al. (2020a) for HAKE with their implementation (https://github.com/MIRALab-USTC/KGE-HAKE), and the original settings of Li et al. (2022) for HousE with their implementation (https://github.com/rui9812/HousE). We tuned the temperature $\gamma$ on the validation split for each dataset.

Metrics We employed conventional metrics in KGC, i.e., MRR, Hits@1 (H@1), Hits@3 (H@3), and Hits@10 (H@10), and reported the average scores and their standard deviations over three runs with fixed random seeds.
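For reference, the metrics above can be computed from the rank of the gold answer for each test query. The following is a minimal sketch with hypothetical ranks; the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import statistics

def mrr(ranks):
    """Mean reciprocal rank of the gold answers (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose gold answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the gold answer for five test queries in three runs.
runs = [[1, 3, 2, 10, 50], [1, 2, 2, 12, 40], [2, 3, 1, 9, 60]]

mrr_per_run = [100 * mrr(r) for r in runs]          # scores as percentages
h10_per_run = [100 * hits_at_k(r, 10) for r in runs]

mean_mrr = statistics.mean(mrr_per_run)
sd_mrr = statistics.stdev(mrr_per_run)              # sample SD (assumed)
print(round(mean_mrr, 1), round(sd_mrr, 1), h10_per_run)
```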

5.2 Results

Since the result tables are large, we discuss them individually, focusing on important information in the following subsections. The full experimental results are listed in Appendix B: the scores are in Tables 5, 6, and 7 of Appendix B.1, and the training loss curves and validation MRR curves for each smoothing method are in Figures 6, 7, and 8 of Appendix B.2.

5.2.1 Effectiveness of TANS

Figure 3(a) shows the MRR scores of each method. The results show the effectiveness of considering triplet information in SANS, as done in TANS. Thus, the results are consistent with our expectation in §3.3 that TANS can cover the role of subsampling methods. However, as the result of HAKE on WN18RR shows, there are cases in which subsampling methods outperform TANS. As discussed in §3.3, using only TANS does not cover all combinations of NS loss and subsampling. Considering this theoretical fact, we further compare TANS with subsampling and the SANS loss with subsampling in the following section.

5.2.2 Validity of the Unified Interpretation

Figure 3(b) shows the results for each configuration. We can see performance improvements from using subsampling in both SANS and TANS. Furthermore, in almost all cases, TANS with subsampling achieves the highest MRR. This observation is consistent with the theoretical conclusion in §3.3 that TANS with subsampling can cover the characteristics of other NS losses in terms of smoothing. On the other hand, the results of HAKE on YAGO3-10 show a different tendency: SANS with subsampling achieves the best MRR instead of TANS. Because the model prediction estimates the triplet frequencies, TANS is influenced by the selected model. Therefore, carefully choosing the combination of a loss function and a model is still effective in improving KGC performance on the NS loss with subsampling.

6 Analysis

We analyze how TANS mitigates the sparsity problem in imbalanced KGs, which is commonly caused by low-frequency triplets in KGC. Since all triplets in KGs appear at most once, we focus on queries. For the investigation, we extracted the 0.5% of triplets with the highest- or lowest-frequency queries from the training, validation, and test splits of the original data as the sparser subsets FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL, respectively. Figure 5 in Appendix C.1 shows the appearance frequencies of queries and answers in the training data, and Table 4 in Appendix C.2 gives detailed statistics.
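A filtering step of this kind could look like the following sketch. It is not the authors' script: it assumes that a query is the (head, relation) pair, that frequencies are counted on the training split, and that the same fraction is taken from each end of the frequency-sorted list. The function name `extract_high_low` and the toy triplets are hypothetical.

```python
from collections import Counter

def extract_high_low(triplets, train_triplets, fraction):
    """Keep the triplets whose queries are among the most or least frequent.

    Assumptions: a query is (head, relation); frequency is counted on the
    training split; `fraction` of the list is taken from each end.
    """
    freq = Counter((h, r) for h, r, _ in train_triplets)
    ranked = sorted(triplets, key=lambda t: freq[(t[0], t[1])], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    if 2 * k >= len(ranked):
        return ranked
    return ranked[:k] + ranked[-k:]

# Hypothetical toy training split with query frequencies 3, 2, 1, 1, 1.
train = [
    ("e1", "r1", "e2"), ("e1", "r1", "e3"), ("e1", "r1", "e4"),
    ("e2", "r1", "e3"), ("e2", "r1", "e5"),
    ("e3", "r2", "e1"), ("e4", "r2", "e1"), ("e5", "r2", "e2"),
]
subset = extract_high_low(train, train, fraction=0.25)
print(subset)
```

A full version would also treat (relation, tail) queries for head prediction; the sketch handles only one direction to keep the idea visible.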

Figure 4 shows the MRRs for each model on each sparser dataset. The results indicate that TANS performs even better in KGC as KGs become more imbalanced. Further detailed results are given in Tables 8, 9, and 10 of Appendix C.3.

7 Related Work

Knowledge Graph

Knowledge graphs have important roles in various knowledge-intensive NLP tasks like dialog (Moon et al., 2019), question answering (Reese et al., 2020), named entity recognition (Liu et al., 2019), open-domain questions (Hu et al., 2022), recommendation systems (Gao et al., 2020), and commonsense reasoning (Sakai et al., 2024b). In addition to these text-only tasks, knowledge-intensive vision and language (V&L) tasks such as visual question answering (VQA) (Yue et al., 2023), image generation (Kamigaito et al., 2023), explanation generation (Hayashi et al., 2024), and image review generation (Saito et al., 2024) also require external knowledge. Visual KGs (Zhu et al., 2024) have the potential to contribute to solving these tasks. Therefore, KGs are important resources in various fields.

Knowledge Graph Completion

Even though KGs are useful, their sparsity is a fundamental problem. To solve the sparsity of knowledge graphs, we need to complete them by inferring the unseen links between their nodes, which are entities. For that purpose, knowledge graph completion (KGC) and knowledge graph embedding (KGE) (Bordes et al., 2011), which represents KG information in a continuous vector space, are commonly used. As KGE methods, vector space models like TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), RotatE (Sun et al., 2019), HAKE (Zhang et al., 2020a), and HousE (Li et al., 2022), which learn only from task-specific datasets, expanded this field as pioneers. In addition to such approaches, pre-trained language model (PLM)-based approaches like KEPLER (Wang et al., 2021) and SimKGC (Wang et al., 2022) also have an important role in KGC due to their ability to utilize the knowledge obtained in pre-training. However, as pointed out by Sakai et al. (2024a), PLM-based approaches have a leakage issue caused by data contamination in pre-training. Generation-based KGC methods like KGT5 (Saxena et al., 2022) and GenKGC (Xie et al., 2022) are unique in directly generating entity names. In hierarchical text classification (HTC), generation-based approaches contribute to improving performance (Kwon et al., 2023), supported by considering label hierarchies through fusing pre-trained text and label embeddings (Xiong et al., 2021; Zhang et al., 2021) on the decoder. However, Sakai et al. (2024a) point out that commonly used KGC methods conduct link-level prediction, and such generation-based KGC methods make it difficult to use the structure information of KGs directly. Thus, their performance gain is limited. This situation requires investigating the benefits of inferring links by generation-based KGC under predefined entities and relationships.

Negative Sampling

Mikolov et al. (2013) initially propose the NS loss for frequent words to train their word embedding model, word2vec. Trouillon et al. (2016) introduce the NS loss to KGE to speed up training. Melamud et al. (2017) use the NS loss to train a language model. In contextualized pre-trained embeddings, Clark et al. (2020a) indicate that ELECTRA (Clark et al., 2020b), a BERT (Devlin et al., 2019)-like model, uses the NS loss to perform better and faster than language models. Sun et al. (2019) extend the NS loss to the SANS loss for KGE and propose their noise distribution, which is subsampled by a uniform probability $p_{\theta}(y_{i}|x)$. Kamigaito and Hayashi (2021) point out the sparseness problem of KGs through their theoretical analysis of the NS loss in KGE. Furthermore, Kamigaito and Hayashi (2022a, b) reveal that subsampling (Mikolov et al., 2013) can alleviate the sparseness problem in the NS loss for KGE and derive three assumptions for subsampling, i.e., Base, Freq, and Uniq. Feng et al. (2023) incorporate their proposed model-based subsampling, which estimates frequencies for entities and their relationships by a trained KGE model, into the subsampling of the NS loss to mitigate the sparseness issue of counting frequencies, at the cost of the additional computation needed to train the extra KGE model.

Our Work

Through our work, we theoretically clarify the position of previous works on the SANS loss and subsampling from the viewpoint of smoothing methods for the NS loss in KGE. Since our work gives a unified interpretation of the SANS loss and subsampling, our proposed TANS inherits the advantages of conventional works and can deal with the sparsity problem in the NS loss for KGE.

8 Conclusion

We reveal the relationships between the SANS loss and subsampling for the KG completion task through theoretical analysis. We explain that the SANS loss and subsampling under three assumptions, Base, Freq, and Uniq, have similar roles in mitigating the sparseness problem of queries and answers of KGs by smoothing their frequencies. Furthermore, based on our interpretation, we induce a new loss function, Triplet Adaptive Negative Sampling (TANS), by integrating the SANS loss and subsampling. We also introduce a theoretical interpretation that TANS with subsampling can cover all conventional combinations of the SANS loss and subsampling.

We verified our interpretation by empirical experiments in three common datasets, FB15k-237, WN18RR, and YAGO3-10, and six popular KGE models, TransE, DistMult, ComplEx, RotatE, HAKE, and HousE. The experimental results show that our TANS loss can outperform subsampling and SANS loss with many models in terms of MRR as expected by our theoretical interpretation. Furthermore, the combinatorial use of TANS and subsampling achieved comparable or better performance than other combinations and showed the validity of our theoretical interpretation that TANS with subsampling can cover all conventional combinations of SANS loss and subsampling in KGE.

Limitations

Our experiments are conducted exclusively on public datasets, which are relatively well-balanced. Consequently, we anticipate that our TANS will perform better on real-world KGs.

Ethics Statement

We used the publicly available datasets FB15k-237, WN18RR, and YAGO3-10 to train and evaluate KGE models, and we see no ethical concerns.

Reproducibility Statement

We used the publicly available code to implement the KGE models TransE, DistMult, ComplEx, RotatE, HAKE, and HousE with the author-provided hyperparameters as described in §5.1. Regarding the temperature parameter $\gamma$, we tuned it on the validation split for each dataset and reported the values in Tables 5, 6, and 7 of Appendix B. Our code and data are available at https://github.com/xincanfeng/ss_kge.

Acknowledgements

This work was supported by NAIST Granite, i.e., JST SPRING Grant Number JPMJSP2140.

References

Appendix A Dataset Statistics

Table 3 shows the dataset statistics for FB15k-237, WN18RR, and YAGO3-10, introduced in §5.1.

Appendix B Full Experimental Results

B.1 Results Tables

Tables 5, 6, and 7 list all results on FB15k-237, WN18RR, and YAGO3-10, explained in §5.2. In these tables, the bold scores are the best results for each subsampling type (i.e., None, Base, Freq, and Uniq), $\dagger$ indicates the best scores for each model, SD denotes the standard deviation of the three trials, and $\gamma$ denotes the temperature chosen on the development data.

B.2 Training Loss and Validation MRR Curve

Figures 6, 7, and 8 show the training loss curves and validation MRR curves for each smoothing method. From these figures, we can see that the TANS loss converges as well as the SANS and NS losses on FB15k-237, WN18RR, and YAGO3-10 for each KGE model. Meanwhile, the time complexity of TANS is the same as that of the SANS and NS losses.

Figure 5: Appearance frequencies of queries and answers (entities) in the training data of the sparser subsets FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL. Note that the indices are sorted from high frequency to low.
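| Dataset | Split | Tuple | Query | Entity | Relation |
|---|---|---|---|---|---|
| FB15k-237 | Total | 310,116 | 150,508 | 14,541 | 237 |
| FB15k-237 | #Train | 272,115 | 138,694 | 14,505 | 237 |
| FB15k-237 | #Valid | 17,535 | 19,750 | 9,809 | 223 |
| FB15k-237 | #Test | 20,466 | 22,379 | 10,348 | 224 |
| WN18RR | Total | 93,003 | 77,479 | 40,943 | 11 |
| WN18RR | #Train | 86,835 | 74,587 | 40,559 | 11 |
| WN18RR | #Valid | 3,034 | 5,431 | 5,173 | 11 |
| WN18RR | #Test | 3,134 | 5,565 | 5,323 | 11 |
| YAGO3-10 | Total | 1,089,040 | 372,775 | 123,182 | 37 |
| YAGO3-10 | #Train | 1,079,040 | 371,077 | 123,143 | 37 |
| YAGO3-10 | #Valid | 5,000 | 8,534 | 7,948 | 33 |
| YAGO3-10 | #Test | 5,000 | 8,531 | 7,937 | 34 |

Table 3: Statistics for each public dataset.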
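| Dataset | Split | Tuple | Query | Entity | Relation |
|---|---|---|---|---|---|
| FB15k-237-HL | Total | 111,631 | 63,330 | 11,828 | 155 |
| FB15k-237-HL | #Train | 95,244 | 55,923 | 11,600 | 155 |
| FB15k-237-HL | #Valid | 7,571 | 6,918 | 4,933 | 90 |
| FB15k-237-HL | #Test | 8,816 | 7,830 | 5,406 | 89 |
| WN18RR-HL | Total | 14,697 | 14,675 | 12,973 | 10 |
| WN18RR-HL | #Train | 13,758 | 13,785 | 12,275 | 10 |
| WN18RR-HL | #Valid | 465 | 619 | 613 | 9 |
| WN18RR-HL | #Test | 474 | 623 | 619 | 8 |
| YAGO3-10-HL | Total | 366,079 | 182,274 | 95,788 | 29 |
| YAGO3-10-HL | #Train | 362,728 | 181,196 | 95,432 | 29 |
| YAGO3-10-HL | #Valid | 1,662 | 2,316 | 2,113 | 13 |
| YAGO3-10-HL | #Test | 1,689 | 2,359 | 2,135 | 14 |

Table 4: Statistics of the filtered sparser datasets.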

Appendix C Sparse Queries

C.1 Appearance Frequencies of Queries and Answers

Figure 5 shows the appearance frequencies of queries and answers in the training set of our filtered sparser data FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL, explained in §6.

C.2 Data Statistics

Table 4 shows detailed statistics of our filtered sparser data FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL, explained in §6.

C.3 Detailed Results

Tables 8, 9, and 10 show the detailed results on our filtered sparser data FB15k-237-HL, WN18RR-HL, and YAGO3-10-HL, explained in §6. Notations are the same as those described in §B.1.

FB15k-237
Model Subsampling MRR H@1 H@3 H@10 $\gamma$
Assumption Loss Mean SD Mean SD Mean SD Mean SD
ComplEx None NS 23.9 0.2 15.8 0.1 26.1 0.3 40.0 0.2 -
SANS 22.3 0.1 13.8 0.1 24.2 0.0 39.5 0.2 -
TANS 32.8 0.2 23.2 0.1 36.2 0.2 52.2 0.1 -2
Base NS 27.2 0.1 19.1 0.1 29.5 0.1 43.0 0.2 -
SANS 32.3 0.0 23.0 0.1 35.4 0.1 51.2 0.1 -
TANS 33.3 0.0 23.8 0.1 36.9 0.1 52.7 0.0 -1
Freq NS 25.1 0.2 17.1 0.3 27.4 0.2 41.0 0.2 -
SANS 32.7 0.1 23.6 0.1 36.0 0.1 51.2 0.1 -
TANS 33.3 0.0 23.8 0.0 36.8 0.1 52.1 0.2 -0.5
Uniq NS 22.8 0.4 14.7 0.5 24.7 0.4 39.0 0.1 -
SANS 32.6 0.0 23.5 0.1 35.8 0.1 51.2 0.1 -
TANS 33.0 0.1 23.5 0.1 36.5 0.1 52.1 0.1 -0.5
DistMult None NS 23.3 0.1 15.6 0.1 25.7 0.1 38.4 0.1 -
SANS 22.3 0.1 14.0 0.2 24.1 0.1 39.2 0.0 -
TANS 31.0 0.1 21.7 0.1 34.0 0.1 49.6 0.1 -1
Base NS 25.4 0.1 17.9 0.1 27.6 0.1 40.4 0.1 -
SANS 30.8 0.1 21.9 0.1 33.6 0.1 48.4 0.1 -
TANS 31.5 0.1 22.4 0.1 34.6 0.1 49.7 0.0 -0.5
Freq NS 24.0 0.1 16.7 0.2 25.9 0.1 38.4 0.1 -
SANS 29.9 0.0 21.2 0.1 32.8 0.0 47.5 0.1 -
TANS 30.7 0.0 21.6 0.0 34.0 0.0 49.0 0.0 -1
Uniq NS 21.0 0.1 13.5 0.2 22.8 0.2 36.3 0.2 -
SANS 29.2 0.0 20.5 0.1 31.9 0.0 46.7 0.0 -
TANS 30.7 0.1 21.5 0.1 33.8 0.1 49.3 0.1 -2
TransE None NS 30.4 0.0 21.3 0.1 33.4 0.1 48.5 0.0 -
SANS 33.0 0.1 22.9 0.1 37.2 0.1 53.0 0.1 -
TANS 33.6 0.0 23.9 0.0 37.3 0.0 53.0 0.1 -0.5
Base NS 29.4 0.1 20.0 0.1 32.8 0.0 48.1 0.0 -
SANS 33.0 0.1 23.1 0.1 36.8 0.1 52.7 0.1 -
TANS 33.0 0.0 23.1 0.0 36.8 0.1 52.7 0.1 -0.1
Freq NS 29.3 0.1 20.0 0.1 32.8 0.1 47.8 0.1 -
SANS 33.5 0.0 23.9 0.1 37.2 0.1 52.8 0.1 -
TANS 33.5 0.1 23.9 0.1 37.2 0.0 52.8 0.1 -0.1
Uniq NS 30.1 0.1 21.0 0.1 33.6 0.0 48.0 0.0 -
SANS 33.5 0.0 23.9 0.0 37.3 0.2 52.7 0.1 -
TANS 34.0 0.1 24.5 0.1 37.7 0.1 53.0 0.1 0.5
RotatE None NS 30.3 0.0 21.4 0.1 33.2 0.1 48.4 0.1 -
SANS 32.9 0.1 22.8 0.1 36.8 0.0 53.1 0.2 -
TANS 34.1 0.1 24.6 0.1 37.7 0.1 53.3 0.1 -0.5
Base NS 29.5 0.0 20.3 0.0 32.7 0.1 47.9 0.0 -
SANS 33.6 0.1 23.9 0.1 37.3 0.1 53.1 0.0 -
TANS 33.8 0.0 24.2 0.0 37.4 0.0 53.0 0.1 -0.5
Freq NS 29.4 0.1 20.2 0.1 32.6 0.1 47.6 0.1 -
SANS 34.0 0.1 24.6 0.0 37.7 0.0 53.0 0.0 -
TANS 34.1 0.0 24.6 0.0 37.7 0.0 53.1 0.1 -0.01
Uniq NS 30.1 0.0 21.2 0.1 33.3 0.1 47.7 0.1 -
SANS 33.9 0.1 24.4 0.1 37.6 0.1 52.9 0.1 -
TANS 34.2 0.0 24.7 0.1 37.8 0.0 53.1 0.1 0.5
HAKE None NS 30.8 0.1 21.8 0.1 33.8 0.1 48.6 0.1 -
SANS 32.8 0.2 22.7 0.3 36.9 0.1 52.8 0.1 -
TANS 34.4 0.1 24.9 0.1 37.9 0.2 53.6 0.0 -0.5
Base NS 30.4 0.1 21.6 0.1 33.3 0.1 48.2 0.0 -
SANS 34.1 0.1 24.4 0.1 37.9 0.1 53.6 0.2 -
TANS 34.1 0.0 24.4 0.0 37.9 0.0 53.7 0.0 -0.05
Freq NS 30.2 0.1 21.5 0.0 33.1 0.0 47.7 0.1 -
SANS 34.7 0.0 25.2 0.1 38.2 0.0 53.8 0.1 -
TANS 34.6 0.0 25.0 0.1 38.2 0.2 53.7 0.1 0.05
Uniq NS 30.7 0.1 22.2 0.1 33.5 0.1 48.0 0.1 -
SANS 34.7 0.1 25.1 0.1 38.3 0.1 53.9 0.1 -
TANS 34.9 0.0 25.4 0.0 38.6 0.1 54.0 0.1 0.5
HousE None NS 29.1 0.1 20.6 0.1 31.6 0.1 46.3 0.1 -
SANS 34.7 0.2 24.8 0.2 38.5 0.3 54.4 0.2 -
TANS 35.6 0.1 26.1 0.1 39.4 0.1 54.5 0.1 -1
Base NS 28.1 0.1 19.6 0.1 30.9 0.2 45.1 0.2 -
SANS 35.2 0.2 25.6 0.2 39.0 0.2 54.4 0.3 -
TANS 35.6 0.1 26.1 0.1 39.4 0.2 54.5 0.1 -0.5
Freq NS 27.9 0.1 19.2 0.1 30.7 0.2 45.2 0.1 -
SANS 35.9 0.2 26.4 0.2 39.5 0.2 54.7 0.1 -
TANS 35.8 0.2 26.4 0.2 39.6 0.2 54.7 0.1 -0.01
Uniq NS 28.8 0.1 20.2 0.2 31.9 0.1 45.7 0.0 -
SANS 36.1 0.1 26.7 0.2 39.8 0.1 54.8 0.2 -
TANS 36.2 0.1 26.7 0.2 39.9 0.1 54.8 0.1 0.1
Table 5: Results on FB15k-237.
WN18RR
Model Subsampling MRR H@1 H@3 H@10 $\gamma$
Assumption Loss Mean SD Mean SD Mean SD Mean SD
ComplEx None NS 44.5 0.1 38.1 0.2 48.3 0.2 55.5 0.1 -
SANS 45.0 0.1 41.0 0.1 46.5 0.3 53.3 0.3 -
TANS 47.3 0.0 43.3 0.0 49.1 0.1 55.7 0.1 -2
Base NS 45.0 0.1 38.9 0.1 48.6 0.2 55.7 0.1 -
SANS 46.9 0.1 42.7 0.2 48.5 0.2 55.5 0.2 -
TANS 47.7 0.2 43.6 0.1 49.3 0.2 55.9 0.3 -2
Freq NS 45.1 0.1 38.9 0.1 48.8 0.2 56.0 0.2 -
SANS 47.4 0.1 43.2 0.1 49.2 0.2 56.0 0.2 -
TANS 48.0 0.1 43.9 0.1 49.7 0.1 56.1 0.1 -2
Uniq NS 45.0 0.1 38.7 0.1 48.8 0.1 56.0 0.3 -
SANS 47.5 0.1 43.3 0.1 49.1 0.2 56.2 0.2 -
TANS 48.3 0.1 44.4 0.2 49.6 0.1 56.3 0.2 -1
DistMult None NS 38.5 0.2 30.6 0.3 42.9 0.2 52.5 0.1 -
SANS 42.4 0.0 38.2 0.1 43.7 0.0 51.0 0.2 -
TANS 44.2 0.1 40.1 0.1 45.3 0.1 53.2 0.2 -2
Base NS 39.3 0.2 31.9 0.2 43.3 0.1 53.0 0.2 -
SANS 43.9 0.1 39.4 0.1 45.2 0.1 53.3 0.2 -
TANS 44.6 0.0 40.5 0.2 45.7 0.1 53.9 0.1 -2
Freq NS 39.0 0.2 31.2 0.2 43.2 0.1 52.9 0.2 -
SANS 44.5 0.1 40.0 0.1 46.0 0.1 54.2 0.2 -
TANS 44.7 0.1 40.5 0.2 45.8 0.0 54.0 0.2 -2
Uniq NS 38.8 0.2 30.8 0.2 43.1 0.1 53.0 0.2 -
SANS 44.7 0.1 40.1 0.1 46.2 0.3 54.3 0.0 -
TANS 45.0 0.1 40.7 0.1 46.1 0.2 54.5 0.2 -0.5
TransE None NS 21.1 0.0 2.1 0.1 36.5 0.2 50.4 0.2 -
SANS 22.5 0.1 1.7 0.1 40.2 0.1 52.5 0.2 -
TANS 22.7 0.0 2.5 0.0 39.5 0.2 53.4 0.1 0.5
Base NS 20.3 0.1 1.6 0.1 35.1 0.2 49.9 0.2 -
SANS 22.3 0.0 1.3 0.1 40.2 0.1 52.9 0.1 -
TANS 22.4 0.1 1.4 0.1 40.1 0.1 53.0 0.1 0.1
Freq NS 21.0 0.1 1.8 0.1 36.4 0.2 51.0 0.2 -
SANS 23.0 0.0 1.9 0.1 40.9 0.2 53.6 0.0 -
TANS 23.1 0.0 2.1 0.0 41.0 0.1 53.8 0.0 0.1
Uniq NS 21.5 0.1 2.2 0.0 37.2 0.1 51.4 0.2 -
SANS 23.2 0.0 2.3 0.1 40.9 0.2 53.6 0.1 -
TANS 23.3 0.1 3.0 0.0 40.2 0.2 54.4 0.1 0.5
RotatE None NS 47.0 0.1 42.5 0.2 48.6 0.2 55.8 0.3 -
SANS 47.2 0.1 42.6 0.1 49.1 0.1 56.7 0.0 -
TANS 47.3 0.1 42.6 0.1 49.1 0.1 56.7 0.1 -0.01
Base NS 47.0 0.0 42.2 0.1 48.7 0.1 56.3 0.1 -
SANS 47.5 0.1 42.7 0.2 49.3 0.1 57.2 0.1 -
TANS 47.5 0.1 42.7 0.2 49.3 0.1 57.1 0.1 0.01
Freq NS 47.1 0.1 42.3 0.1 48.7 0.1 56.4 0.1 -
SANS 47.7 0.1 42.9 0.2 49.6 0.0 57.4 0.1 -
TANS 47.7 0.1 42.8 0.2 49.7 0.1 57.4 0.1 0.1
Uniq NS 47.2 0.2 42.7 0.2 48.7 0.1 56.3 0.1 -
SANS 47.7 0.1 42.9 0.1 49.6 0.1 57.2 0.1 -
TANS 47.8 0.2 42.8 0.3 49.8 0.1 57.6 0.1 0.5
HAKE None NS 48.8 0.1 44.5 0.1 50.5 0.2 57.3 0.1 -
SANS 48.9 0.0 44.5 0.2 50.6 0.3 57.7 0.1 -
TANS 48.9 0.0 44.4 0.1 50.5 0.3 57.8 0.1 0.01
Base NS 49.2 0.0 44.6 0.1 51.1 0.1 57.9 0.2 -
SANS 49.5 0.1 45.0 0.2 51.2 0.2 58.2 0.2 -
TANS 49.5 0.1 45.0 0.2 51.2 0.3 58.4 0.2 0.1
Freq NS 49.3 0.1 44.8 0.1 51.3 0.2 58.0 0.2 -
SANS 49.7 0.1 45.2 0.2 51.5 0.1 58.4 0.2 -
TANS 49.7 0.0 45.2 0.2 51.6 0.3 58.4 0.2 -0.01
Uniq NS 49.4 0.2 44.9 0.2 51.3 0.2 57.8 0.2 -
SANS 49.9 0.0 45.3 0.1 51.8 0.2 58.6 0.2 -
TANS 49.9 0.1 45.4 0.1 51.8 0.2 58.5 0.2 0.05
HousE None NS 47.4 0.1 41.7 0.1 50.2 0.1 57.3 0.1 -
SANS 49.7 0.1 44.8 0.2 51.5 0.1 59.5 0.1 -
TANS 50.2 0.1 45.3 0.1 52.0 0.1 60.0 0.1 -0.5
Base NS 48.1 0.1 42.4 0.1 50.9 0.1 58.5 0.2 -
SANS 51.2 0.1 46.7 0.1 53.0 0.2 60.3 0.1 -
TANS 51.3 0.1 46.7 0.2 53.0 0.0 60.4 0.1 0.05
Freq NS 48.1 0.2 42.5 0.3 50.9 0.2 58.5 0.2 -
SANS 51.4 0.1 46.8 0.1 53.2 0.3 60.5 0.1 -
TANS 51.3 0.2 46.7 0.2 53.1 0.3 60.5 0.1 0.05
Uniq NS 48.1 0.1 42.5 0.1 50.8 0.2 58.1 0.1 -
SANS 51.2 0.2 46.8 0.2 52.7 0.1 60.1 0.1 -
TANS 51.1 0.3 46.7 0.5 52.7 0.1 60.0 0.1 -0.1
Table 6: Results on WN18RR.
YAGO3-10
Model Subsampling MRR H@1 H@3 H@10 γ
Assumption Loss Mean SD Mean SD Mean SD Mean SD
RotatE None NS 43.5 0.1 32.8 0.2 49.1 0.2 63.7 0.3 -
SANS 49.6 0.2 39.9 0.1 55.3 0.3 67.3 0.2 -
TANS 49.6 0.2 40.0 0.2 55.4 0.5 67.2 0.3 -0.05
Base NS 44.8 0.1 34.5 0.3 50.0 0.2 64.7 0.2 -
SANS 49.6 0.3 40.1 0.3 55.2 0.4 67.4 0.3 -
TANS 49.5 0.3 40.1 0.3 55.0 0.5 67.3 0.3 -0.05
Freq NS 44.8 0.2 34.5 0.3 50.0 0.1 64.7 0.2 -
SANS 49.9 0.2 40.5 0.3 55.5 0.5 67.4 0.3 -
TANS 49.9 0.2 40.5 0.3 55.5 0.5 67.4 0.2 0.01
Uniq NS 44.4 0.2 34.0 0.3 49.8 0.2 64.3 0.2 -
SANS 50.0 0.3 40.6 0.2 55.6 0.3 67.5 0.2 -
TANS 50.1 0.2 40.7 0.1 55.7 0.3 67.6 0.3 0.05
HAKE None NS 47.4 0.3 36.6 0.5 53.9 0.1 67.0 0.1 -
SANS 53.5 0.2 44.6 0.3 59.1 0.4 69.0 0.2 -
TANS 53.7 0.1 45.3 0.3 59.0 0.1 68.8 0.1 0.05
Base NS 48.8 0.3 38.4 0.4 55.0 0.2 68.1 0.3 -
SANS 54.6 0.2 46.2 0.3 59.9 0.2 69.6 0.2 -
TANS 54.5 0.2 45.9 0.3 59.9 0.2 69.9 0.1 -0.1
Freq NS 49.3 0.2 39.1 0.3 55.4 0.1 68.1 0.2 -
SANS 54.6 0.4 46.0 0.7 60.2 0.1 69.6 0.3 -
TANS 54.8 0.2 46.4 0.3 60.1 0.1 69.6 0.3 0.05
Uniq NS 45.2 0.1 34.3 0.1 51.1 0.1 65.8 0.3 -
SANS 55.2 0.3 46.8 0.5 60.5 0.2 70.0 0.3 -
TANS 55.1 0.2 46.8 0.3 60.3 0.1 69.9 0.2 -0.1
HousE None NS 29.2 0.0 18.3 0.1 33.6 0.2 50.1 0.2 -
SANS 54.8 1.3 46.8 1.3 59.7 1.2 68.9 1.2 -
TANS 54.8 1.2 46.9 1.2 59.6 1.2 68.8 1.1 0.01
Base NS 29.6 0.1 19.8 0.1 33.6 0.2 48.9 0.1 -
SANS 56.7 0.1 48.6 0.2 61.7 0.2 71.3 0.1 -
TANS 57.0 0.2 49.0 0.4 61.9 0.3 71.5 0.2 -0.1
Freq NS 27.3 0.8 17.5 0.9 31.0 0.8 46.6 0.8 -
SANS 57.0 0.1 49.0 0.2 62.0 0.1 71.4 0.1 -
TANS 57.2 0.1 49.3 0.1 62.3 0.1 71.4 0.1 -0.1
Uniq NS 28.1 0.2 18.2 0.4 31.8 0.1 47.6 0.0 -
SANS 57.2 0.1 49.3 0.2 62.0 0.0 71.4 0.2 -
TANS 57.3 0.2 49.5 0.3 62.2 0.1 71.5 0.1 -0.05
Table 7: Results on YAGO3-10.
FB15k-237-HL
Model Subsampling MRR H@1 γ
Assumption Loss Mean SD Mean SD
HAKE None NS 38.1 0.3 28.4 0.5 -
SANS 35.2 0.2 24.5 0.3 -
TANS 41.1 0.1 33.0 0.1 -1
Base NS 40.5 0.1 31.8 0.2 -
SANS 38.4 0.2 28.9 0.2 -
TANS 41.8 0.1 33.6 0.2 -1
Freq NS 41.1 0.1 32.8 0.1 -
SANS 40.2 0.0 31.5 0.1 -
TANS 42.0 0.1 33.7 0.1 -1
Uniq NS 41.5 0.1 33.2 0.1 -
SANS 41.1 0.0 32.8 0.0 -
TANS 41.9 0.2 33.5 0.2 -0.1
RotatE None NS 40.0 0.1 30.8 0.1 -
SANS 36.3 0.1 25.3 0.2 -
TANS 41.5 0.0 33.1 0.1 -1
Base NS 41.8 0.1 33.6 0.1 -
SANS 40.7 0.1 31.7 0.2 -
TANS 42.0 0.1 33.8 0.1 -0.5
Freq NS 41.3 0.1 33.2 0.1 -
SANS 42.0 0.2 33.6 0.3 -
TANS 42.3 0.0 34.1 0.1 -0.5
Uniq NS 41.7 0.1 33.7 0.2 -
SANS 42.2 0.1 33.8 0.2 -
TANS 42.1 0.1 33.8 0.2 -0.05
HousE None NS 39.1 0.2 29.8 0.2 -
SANS 37.0 0.2 26.2 0.4 -
TANS 42.3 0.1 34.1 0.2 -2
Base NS 40.3 0.1 31.3 0.2 -
SANS 40.5 0.4 31.3 0.4 -
TANS 42.4 0.2 34.2 0.3 -2
Freq NS 39.8 0.3 31.0 0.3 -
SANS 42.1 0.2 33.8 0.2 -
TANS 42.8 0.3 34.8 0.4 -1
Uniq NS 40.5 0.2 31.9 0.2 -
SANS 42.4 0.2 34.4 0.2 -
TANS 42.5 0.1 34.5 0.0 -1
Table 8: Results on FB15k-237-HL.
WN18RR-HL
Model Subsampling MRR H@1 γ
Assumption Loss Mean SD Mean SD
HAKE None NS 10.8 0.1 8.7 0.2 -
SANS 10.3 0.1 7.8 0.1 -
TANS 13.9 0.2 12.1 0.2 -2
Base NS 12.1 0.2 9.5 0.3 -
SANS 11.1 0.1 9.1 0.1 -
TANS 13.7 0.1 11.7 0.3 -2
Freq NS 12.4 0.1 10.4 0.1 -
SANS 11.9 0.2 9.5 0.2 -
TANS 14.2 0.5 11.9 0.4 -2
Uniq NS 13.3 0.3 11.3 0.3 -
SANS 11.9 0.2 9.7 0.2 -
TANS 14.1 0.2 11.7 0.2 -2
RotatE None NS 14.2 0.2 11.8 0.3 -
SANS 13.9 0.3 11.7 0.3 -
TANS 14.4 0.1 11.8 0.2 -2
Base NS 13.9 0.2 11.5 0.2 -
SANS 14.1 0.3 11.7 0.3 -
TANS 14.5 0.1 11.7 0.1 -2
Freq NS 14.4 0.1 12.0 0.1 -
SANS 14.3 0.4 12.0 0.3 -
TANS 15.1 0.1 12.2 0.1 -2
Uniq NS 14.4 0.2 12.2 0.1 -
SANS 14.2 0.2 11.9 0.2 -
TANS 15.1 0.2 12.3 0.3 -2
HousE None NS 10.7 1.8 8.4 1.4 -
SANS 11.7 1.1 9.5 0.9 -
TANS 13.4 0.4 11.0 0.4 -2
Base NS 9.9 0.4 8.4 0.4 -
SANS 11.5 0.2 9.5 0.2 -
TANS 13.4 0.2 11.3 0.3 -2
Freq NS 13.9 0.1 11.8 0.2 -
SANS 13.8 0.2 11.9 0.3 -
TANS 13.9 0.3 12.0 0.2 0.1
Uniq NS 13.7 0.1 11.6 0.1 -
SANS 13.8 0.2 11.6 0.2 -
TANS 13.8 0.2 11.7 0.3 -0.05
Table 9: Results on WN18RR-HL.
YAGO3-10-HL
Model Subsampling MRR H@1 γ
Assumption Loss Mean SD Mean SD
HAKE None NS 45.9 0.0 36.9 0.1 -
SANS 47.8 0.4 40.0 0.6 -
TANS 49.2 0.4 39.8 0.7 -0.5
Base NS 50.2 0.3 43.0 0.3 -
SANS 47.7 0.4 40.5 0.7 -
TANS 50.1 0.3 41.4 0.3 -0.5
Freq NS 50.8 0.3 43.3 0.2 -
SANS 48.8 0.1 41.3 0.2 -
TANS 49.7 0.3 41.0 0.2 -0.5
Uniq NS 49.4 0.2 40.8 0.2 -
SANS 46.9 0.4 39.8 0.5 -
TANS 49.4 0.6 40.6 0.8 -0.5
RotatE None NS 38.0 0.1 28.7 0.3 -
SANS 41.3 0.1 32.3 0.2 -
TANS 43.5 0.1 34.8 0.2 -0.5
Base NS 40.6 0.2 31.8 0.5 -
SANS 43.8 0.2 35.1 0.1 -
TANS 43.8 0.2 35.2 0.1 -0.05
Freq NS 40.3 0.2 31.4 0.4 -
SANS 43.5 0.2 34.6 0.1 -
TANS 43.7 0.0 35.1 0.1 -0.1
Uniq NS 40.2 0.0 31.3 0.2 -
SANS 43.9 0.1 35.1 0.2 -
TANS 44.1 0.1 35.4 0.3 -0.1
HousE None NS 37.8 0.3 26.9 0.4 -
SANS 50.3 0.1 40.7 0.3 -
TANS 52.5 0.5 45.4 0.3 -0.5
Base NS 42.8 1.2 34.3 1.9 -
SANS 51.9 0.3 44.4 0.2 -
TANS 51.9 0.6 44.3 0.8 0.05
Freq NS 39.7 0.8 29.9 1.5 -
SANS 48.6 1.7 40.0 1.4 -
TANS 52.0 0.1 44.5 0.3 -1
Uniq NS 41.0 0.1 31.6 0.1 -
SANS 49.4 0.3 41.1 1.1 -
TANS 52.2 0.1 44.7 0.1 -0.05
Table 10: Results on YAGO3-10-HL.
Figure 6: Training loss and validation MRR curves on FB15k-237.
Figure 7: Training loss and validation MRR curves on WN18RR.
Figure 8: Training loss and validation MRR curves on YAGO3-10.