License: CC BY-NC-SA 4.0
arXiv:2306.00899v2 [cs.LG] 18 Dec 2023

Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices

Jing Zhu* (University of Michigan, Ann Arbor), Yuhang Zhou* (University of Maryland, College Park), Vassilis N. Ioannidis (AWS AI Research and Education), Shengyi Qian (University of Michigan, Ann Arbor), Wei Ai (University of Maryland, College Park), Xiang Song (AWS AI Research and Education), and Danai Koutra (University of Michigan, Ann Arbor)
Abstract.

While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test time have an outsized impact on the performance of low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance across different degrees. Specifically, we explore three issues that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test leakage. The first two issues lead to poor generalizability to the test data, while the third leads to overestimation of the model’s performance and directly impacts the deployment of GNNs. To address these issues in a systematic way, we introduce an effective and efficient GNN training framework, SpotTarget, which leverages our insight on low-degree nodes: (1) at training time, it excludes a (training) edge to be predicted if it is incident to at least one low-degree node; and (2) at test time, it excludes all test edges to be predicted (thus mimicking real scenarios of using GNNs, where the test data is not included in the graph). SpotTarget helps researchers and practitioners adhere to best practices for learning from graph data, which are frequently overlooked even by the most widely-used frameworks. Our experiments on various real-world datasets show that SpotTarget makes GNNs up to 15× more accurate in sparse graphs, and significantly improves their performance for low-degree nodes in dense graphs.

Keywords: graph neural network; link prediction; shortcut learning
Published in: Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24), March 4–8, 2024, Merida, Mexico. DOI: 10.1145/3616855.3635786. ISBN: 979-8-4007-0371-3/24/03. CCS concepts: Computing methodologies → Machine learning.
* Equal contribution.

1. Introduction

Training Target Edges | Test Target Edges | Resulting Issues
Include (a1)          | -                 | (I1) Overfitting
Include (a1)          | Exclude (b2)      | (I2) Distribution shift
-                     | Include (b1)      | (I3) Leakage
Exclude (a2)          | Exclude (b2)      | -

Figure 1. The pitfalls of including target links as message-passing edges during training or test time, and the issues that arise from these practices. [Left] Training time: Given a toy train graph and training target edge $e_{12}$ in (a), we illustrate the impact of the inclusion (a1) and exclusion (a2) of $e_{12}$ on the 1-hop induced train graph for nodes 1 and 2, which is used for message passing. [Right] Test time: We give the same illustration for a test graph and test target edge $e_{BC}$ in (b). [Table] Overview of the three main issues and when they arise: (I1) When including train targets, GNNs overfit them instead of making predictions based on the graph structure and node features. (I2) When train target links are present but test target edges are absent, there is a distribution shift between training and testing. (I3) The presence of test target links causes implicit test leakage.

Graphs or networks are key representations for relational data that occur in many scientific and industrial applications. Link prediction, the task of predicting whether a link is likely to form between two nodes or entities in a graph, has many downstream applications such as drug repurposing, recommendation systems, and knowledge graph completion (Liben-Nowell and Kleinberg, 2003; Adamic and Adar, 2003; Koren et al., 2009; Bordes et al., 2013; Zeng et al., 2020; Martínez et al., 2016). It is also widely used as a pre-training method to produce high-quality entity representations that can be used in various business applications (Hu et al., 2020a, 2019). Techniques to solve this task range from heuristics (e.g., predicting links based on the number of common neighbors between a pair of nodes) to graph neural network (GNN) models, which rely on message passing and leverage both the graph structure and node features. In recent years, GNN-based methods, which formulate the link prediction problem as a binary classification problem over node pairs, have led to state-of-the-art performance in many high-impact applications and have become the go-to approach both in industrial settings and academia (Zhang and Chen, 2018; Kipf and Welling, 2016b; Zhang et al., 2021; Ioannidis et al., 2022).

In this work, we focus on key pitfalls when training GNN models for the link prediction task, which we have found to cause significant disparities in node-level performance. Specifically, we investigate the common practices of including in the graph the target links (i.e., the edges for which the existence or absence is being predicted) at training and/or test time, and considering them during message passing (Dong et al., 2022; Zhang and Chen, 2018). The inclusion of (training) target links at training time leads to two major issues, overfitting (I1) and distribution shift (I2), while the inclusion of target edges in the test graph data causes implicit data leakage (I3) through neighborhood aggregation. In turn, these issues lead to poor performance for GNN models and inability to effectively generalize to (truly) unobserved links at test time. We give an illustrative example of the issues in message passing in Fig. 1.

Illustrative Example.

In Fig. 1(a), $e_{12}$ is a training target link for which we want to predict the existence. When this edge is not excluded from the training graph, GNNs would use the message-passing graph shown in Fig. 1(a1) for nodes $1$ and $2$, which leads to overfitting on $e_{12}$ and memorizing its existence instead of learning to predict it based on the graph structure and node features. Moreover, in a realistic testing scenario as in Fig. 1(b), where the goal is to predict whether the edge $e_{BC}$ exists or not, GNNs would use the message-passing graph shown in Fig. 1(b2) for nodes $B$ and $C$, where the two nodes are disconnected. This leads to distribution shift: there is a discrepancy between the message-passing graphs used during training and testing despite the similarity between the target links.

On the other hand, at test time, including the test target links in the test graph (e.g., edge $e_{BC}$ in Fig. 1(b1)) causes data leakage. In our example, during neighborhood aggregation, the target node $B$ would aggregate the messages from $C$ and vice versa, resulting in a higher likelihood of predicting the existence of edge $e_{BC}$ compared to the case where the link does not exist in the message-passing graph. However, in real-world applications, the goal is to predict future links that are not observed in the data, so the inclusion of test target links corresponds to implicit data leakage.
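The three situations in this example can be sketched with plain 1-hop neighbor sets. The snippet below is a minimal illustration, not the authors' code: the toy edge lists are hypothetical (only the target edges (1, 2) and (B, C) follow the naming in the text), and `neighbors` is an illustrative helper.

```python
from collections import defaultdict

def neighbors(edges):
    """Build an undirected adjacency map {node: set of 1-hop neighbors}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

# Hypothetical toy graphs; only the target edges follow the text's naming.
train_context = [(1, 3), (1, 4), (2, 5)]
train_target = (1, 2)
test_context = [("B", "A"), ("B", "D"), ("C", "E")]
test_target = ("B", "C")

# (a1) Training with the target edge included in message passing.
train_incl = neighbors(train_context + [train_target])
# (a2) Training with the target edge excluded.
train_excl = neighbors(train_context)
# (b2) Realistic test graph: the target edge is not observed.
test_excl = neighbors(test_context)
# (b1) Leaky test graph: the target edge is used for message passing.
test_incl = neighbors(test_context + [test_target])

print(2 in train_incl[1])     # node 1 aggregates from node 2 (overfitting risk)
print("C" in test_excl["B"])  # B does not aggregate from C (mismatch with a1)
print("C" in test_incl["B"])  # B aggregates from C (implicit leakage)
```

Comparing (a1) against (b2) shows the train/test mismatch, while (b1) shows the answer being visible to the model at test time.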

The pitfalls of including the target links in the graph at train and/or test time are commonplace in many GNN-based frameworks. For example, PyTorch Geometric (PyG) (Fey and Lenssen, 2019), a commonly-used repository, does not support excluding target edges when constructing the mini-batch graphs for training. Another popular library, DGL (Wang et al., 2019), for the first four years of its existence, did not include the function of excluding training target links in the official code examples that have been used by numerous researchers and practitioners. The majority of papers fail to reference the exclusion of target links as a consideration in their empirical analyses, and, anecdotally, multiple authors with both industry and academic experience have observed that these pitfalls often occur in practice. Although there have been efforts to deliberately eliminate the test-time pitfall in some popular benchmarks (Hu et al., 2020b), it is still a commonly overlooked problem in applications that rely on proprietary data. Data contamination has also been a major issue in model evaluation, especially in the era of large language models (Zhou et al., 2023b; Sainz et al., 2023; Golchin and Surdeanu, 2023; Zhou et al., 2023a).

We demonstrate theoretically and empirically that low-degree nodes suffer more from the inclusion of target edges as it causes more significant relative degree changes for them compared to other nodes. Intuitively, for high-degree nodes, the target links that are erroneously considered have limited impact on the performance since they are only a small fraction of the edges considered during message passing. Thus, these practices significantly impact real-world applications, where the observed data are often incomplete and very sparse, with many low-degree nodes (Reddy et al., 2022; Faloutsos et al., 1999; Leskovec et al., 2020, 2005). To address the three issues (I1-I3), we introduce a GNN training framework, SpotTarget, to systematically and efficiently exclude the target links at training and test time, as well as check if target test edges are excluded for any user-defined dataset. Although excluding all training target links is an ideal solution, our analysis indicated that it significantly corrupts the mini-batch graph and renders learning with GNNs challenging. Our theoretical and empirical analysis shows that excluding the target links that are incident to at least one low-degree node achieves the best trade-off between avoiding the issues (I1, I2) and learning powerful node representations at training time. At test time, we argue that it is important to mimic real scenarios and avoid leakage for all target edges by excluding them from the test graph. Our key contributions are:

  • Systematic Analysis of the Target Link Inclusion Practices: Focusing on link prediction, we perform the first thorough theoretical and empirical analysis on the effect of including target edges as message-passing edges at training and test time. Our key insight is that low-degree nodes tend to suffer more from the issues that arise from these pitfalls.

  • Efficient Unified Framework: We introduce the first unified GNN training framework, SpotTarget, which automatically tackles these issues at training and test time. During training, for efficiency, SpotTarget leverages our theoretical insight and excludes target links incident to at least one low-degree node. At test time, it excludes all target edges. These strategies ensure generalizable and robust model training without any data leakage issues. SpotTarget is also easy-to-use and scalable, and helps researchers and practitioners adhere to best practices, which are frequently overlooked even by the most widely-used GNN frameworks. We integrated it as a plug-and-play module in DGL.

  • Extensive Experiments: To quantify the effect of including the target links as message-passing edges during training and test time, we conduct extensive experiments on various datasets, spanning from commonly-used link prediction benchmarks to real-world datasets. We show that SpotTarget makes GNN models up to 15× more accurate on sparse graphs, and significantly improves their performance for low-degree nodes on dense graphs.

2. Related Work

Link Prediction using GNNs. Graph neural networks (GNNs) are popular neural network architectures that learn representations by capturing the interactions between objects. While perhaps most often used for node- or graph-level classification, the applications of GNNs have expanded to include edge-level inference tasks like link prediction. Methods that use GNNs for link prediction mainly fall into two categories: Graph Autoencoder (GAE)-based methods and enclosing subgraph-based methods. GAE-based methods use GNNs as the encoder of nodes, and edges are decoded by their nodes’ encoding vectors using score functions (Kipf and Welling, 2016b; Davidson et al., 2018; Vashishth et al., 2019; Zhu et al., 2021; You et al., 2019; Zhu et al., 2023). Enclosing subgraph-based methods, including SEAL (Zhang and Chen, 2018; Zhang et al., 2021), IGMC (Zhang and Chen, 2019), GraIL (Teru et al., 2020), and TCL-GNN (Yan et al., 2021), first extract an enclosing subgraph for the target edge, then apply GNNs to encode the representations of the nodes in the enclosing subgraph, and finally aggregate the node representations by pooling methods. The learned subgraph features are fed into a classifier to predict the existence or absence of the target edge. Even though enclosing subgraph-based methods such as SEAL give more accurate predictions, GAE-based methods are typically orders of magnitude faster to compute and require fewer computation resources. In real-world applications, graphs are often massive, with millions or even billions of nodes, so typically GAE-based methods are employed (Zheng et al., 2020).

Issues in Link Prediction using GNNs. Unlike node classification where edges are solely used as message-passing edges, edges in link prediction have two separate roles: (1) message passing and (2) prediction objectives. This distinction is often overlooked; GNNs designed for node classification tasks are often adapted for link prediction by simply stacking a decoder function, without explicitly handling the message passing and target links separately. The training pitfalls caused by the existence of target edges were initially identified by SEAL (Zhang and Chen, 2018; Wang et al., 2023), which made efforts to mitigate them through negative injection. Building upon it, FakeEdge (Dong et al., 2022) discussed the distribution shift issue that occurs due to the presence of target links during training and the absence of target links at test time. They further proposed to always add or remove the target links, or combine the strategies for subgraph-based methods like SEAL. Unlike these works, we focus on performing a thorough and systematic analysis of all the issues caused by including target links at training and/or test time, and characterizing the disparate impact of these practices on the performance of nodes of varying degrees. Moreover, unlike SEAL and FakeEdge that only apply to subgraph-based models, our SpotTarget aims to systematically and efficiently address the issues for more scalable GAE-based models, which are commonly-used in real-world applications (e.g., web-scale recommender systems). For example, our training framework is 10x faster than FakeEdge (2 hours for one epoch vs. 20 hours on Ogbl-Citation2).

3. Preliminaries

In this section, we formally define key notions and the problem that we seek to solve. The major symbols we use are defined in Tab. 1.

Table 1. Major symbols and their definitions.
Symbol           | Definition
$G$              | Graph
$d_i$            | Degree of node $i$
$e_{ij}$         | The target edge between nodes $i, j$ to be predicted
$T_{\text{tr}}$  | The set of train target edges
$T_{\text{tst}}$ | The set of test target edges
$T_{\text{low}}$ | Set of target edges incident to at least one low-degree node
$\delta$         | Degree threshold to filter edges in $T_{\text{low}}$

3.1. Definitions

Graphs. We consider a graph $G = (V, E, \mathbf{X})$, where $V$ is the set of vertices, $E$ is the set of edges, and $\mathbf{X} \in \mathbb{R}^{|V| \times d}$ represents the $d$-dimensional input node features. We denote as $N_k(u)$ the $k$-hop neighbors of node $u$, i.e., the set of nodes at a distance less than or equal to $k$ from $u$. The degree $d_u$ of node $u$ is defined as the number of its 1-hop neighbors or adjacent nodes, i.e., $d_u = |N_1(u)|$.
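As a small illustration of these definitions, the sketch below computes $N_k(u)$ with a breadth-first search and derives the degree from $N_1(u)$. It assumes that $u$ itself is not counted in $N_k(u)$, so that $|N_1(u)|$ equals the number of adjacent nodes; the helper name `k_hop_neighbors` and the example graph are hypothetical.

```python
from collections import deque

def k_hop_neighbors(adj, u, k):
    """Return N_k(u): nodes within distance k of u, excluding u itself."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n in dist if n != u}

# Hypothetical graph: path 0 - 1 - 2 - 3 with an extra edge 1 - 4.
adj = {0: {1}, 1: {0, 2, 4}, 2: {1, 3}, 3: {2}, 4: {1}}

# Degree d_u = |N_1(u)|.
degree = {u: len(k_hop_neighbors(adj, u, 1)) for u in adj}
print(k_hop_neighbors(adj, 0, 2))
print(degree)
```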

Link Prediction. Given a graph $G = (V, E, \mathbf{X})$, the link prediction task aims to determine whether there is or will be a link $e_{ij}$ between nodes $i$ and $j$, where $i, j \in V$ and $e_{ij} \notin E$. We refer to $e_{ij}$, the edge for which we want to predict the existence or absence, as the target edge or link. We adopt the widely-used train-validate-test setting, where only the epoch that achieves the best performance on validation links is evaluated on test edges.

In this paper, we distinguish different types of target links:

(1) Training vs. test target links: The training target edges, $T_{\text{tr}}$, are used to train a supervised link prediction model, while the test target links, $T_{\text{tst}}$, are the links for which we want to predict the existence or absence at test time (e.g., when evaluating the test performance or making predictions in real-world applications).

(2) Target links that are incident to at least one low-degree node: Based on our theoretical insights in Sec. 5.1, our framework leverages target links incident to at least one low-degree node (i.e., edges $e_{uv}$ for which $\min(d_u, d_v)$ is small), denoted as $T_{\text{low}}$.
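A minimal sketch of how $T_{\text{low}}$ could be selected is given below. Two points are assumptions rather than statements from the text: "small" is read as the strict comparison $\min(d_u, d_v) < \delta$, and degrees are counted on the full edge list (target edges included). The function name and the toy graph are hypothetical.

```python
from collections import Counter

def low_degree_targets(target_edges, graph_edges, delta):
    """Return target edges (u, v) with min(d_u, d_v) < delta.

    Assumes a simple undirected graph with no duplicate edges.
    """
    degree = Counter()
    for u, v in graph_edges:
        degree[u] += 1
        degree[v] += 1
    return [(u, v) for u, v in target_edges
            if min(degree[u], degree[v]) < delta]

# Hypothetical graph; degrees are 0:1, 1:3, 2:2, 3:3, 4:1.
graph_edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]
targets = [(0, 1), (1, 3)]

t_low = low_degree_targets(targets, graph_edges, delta=2)
print(t_low)
```

Here only the target edge touching the degree-1 node 0 is selected; the edge between the two degree-3 nodes is left in place.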

Graph Neural Networks. GNNs utilize a neighborhood aggregation scheme to learn a representation $h_v$ for each node $v$. Node representation is formulated as a $k$-round neighborhood aggregation schema: $h_v^{(k)} = \text{COMBINE}^{(k)}\left(\left\{h_v^{(k-1)}, \text{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in N_k(v)\right\}\right)\right\}\right)$, where AGGREGATE(.) is typically mean or max pooling, and COMBINE(.) can be a sum/concatenation/attention on nodes’ ego- and neighbor-embeddings. Given a set of target links, we define the $k$-hop message-passing graph of a GNN model as the induced subgraph that contains all the endpoint nodes of the target links, their $k$-hop neighbors, and the edges of the original graph that connect these nodes. Examples of (train and test) 1-hop message-passing graphs are given in Fig. 1.
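The definition of the $k$-hop message-passing graph can be sketched as follows. This is an illustrative implementation under the stated definition only; `message_passing_graph` and the example edge list are hypothetical and do not correspond to any library API.

```python
def message_passing_graph(edges, targets, k):
    """Return (nodes, edges) of the k-hop message-passing graph for `targets`."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Start from the endpoints of all target links and expand k hops.
    nodes = {n for e in targets for n in e}
    frontier = set(nodes)
    for _ in range(k):
        frontier = {w for n in frontier for w in adj.get(n, ())} - nodes
        nodes |= frontier
    # Induced subgraph: keep original edges whose endpoints are both selected.
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, sub_edges

# Hypothetical graph in which the target edge (1, 2) is part of the edge list.
edges = [(1, 2), (1, 3), (2, 4), (4, 5), (5, 6)]
nodes_1hop, edges_1hop = message_passing_graph(edges, [(1, 2)], k=1)
print(nodes_1hop, edges_1hop)
```

Note that, because the target edge (1, 2) is part of the input edge list, it also appears in the resulting message-passing graph unless it is explicitly removed beforehand.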

3.2. Problem Statement

More formally, we tackle the following problem: Given a graph $G$, a link prediction task, and a base GNN model in a mini-batch training setting, we seek to: (1) investigate the issues that arise from the common practices of including the target links $T_{\text{tr}}$ and $T_{\text{tst}}$ as message-passing edges at training and test time, respectively, and (2) propose an efficient, unified and easy-to-use solution that automatically addresses these issues.

4. Issues of Target-link Inclusion

In this section, we aim to explore the three issues that occur in link prediction with GNNs due to the practices of including the target links as message-passing edges at training and/or test time.

4.1. Issues during Training Time

The presence of the training target links in the train graph data and their use as message-passing edges cause overfitting as well as distribution shift.

(I1) Overfitting. Suppose that we have an original train graph $G$, as shown in Fig. 1(a). GAE-based methods first generate the embeddings of node 1 and node 2 by aggregating their 1-hop neighbors’ information, and then decode the likelihood of node 1 and node 2 forming an edge using a dot product decoder. When the target edge $e_{12}$ is present, node 1’s embedding aggregates node 2’s features, and vice versa. Since the training objective is to assign as high a probability as possible to a link existing between nodes 1 and 2, GNNs would learn to overfit the training objective in order to predict the existence of the training target link. Similarly, subgraph-based models first find an enclosing subgraph for target edges $T_{\text{tr}}$ and then apply GNNs upon the enclosing subgraph to predict the link existence. When a target link is present in the enclosing subgraph as a message-passing edge, these models also suffer from overfitting issues. The overfitting issue leads to poor model generalizability to test data.
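The effect described above can be reproduced with a weight-free stand-in for a GAE-style model. In this sketch, the "encoder" is a single mean aggregation over a node and its 1-hop neighbors (no learnable parameters, which is a simplifying assumption), and the decoder is a dot product followed by a sigmoid. The features and edges are hypothetical.

```python
import math

def mean_aggregate(features, adj, v):
    """Embed v as the mean of its own and its 1-hop neighbors' features."""
    group = [v] + sorted(adj[v])
    dim = len(features[v])
    return [sum(features[n][i] for n in group) / len(group) for i in range(dim)]

def link_probability(features, adj, u, v):
    """Dot-product decoder followed by a sigmoid."""
    hu = mean_aggregate(features, adj, u)
    hv = mean_aggregate(features, adj, v)
    score = sum(a * b for a, b in zip(hu, hv))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical features: nodes 1 and 3 share one profile, nodes 2 and 4 another.
features = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 0.0], 4: [0.0, 1.0]}

# Message-passing graphs without and with the target edge (1, 2).
adj_excl = {1: {3}, 2: {4}, 3: {1}, 4: {2}}
adj_incl = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}

p_excl = link_probability(features, adj_excl, 1, 2)
p_incl = link_probability(features, adj_incl, 1, 2)
print(p_excl, p_incl)
```

With the target edge included, each endpoint's embedding absorbs the other's features, so the decoded probability for the pair rises even though nothing has been learned about the surrounding structure.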

(I2) Distribution Shift. In typical GNN training processes for link prediction, the train target links $T_{\text{tr}}$ are present and used during message passing, while the test target links $T_{\text{tst}}$ are absent and never used at test time. This practice poses a distribution shift problem. As an example, we consider the train graph in Fig. 1(a) along with $e_{12}$ as the train target link, and the test graph with $e_{BC}$ as the test target link in Fig. 1(b). As shown in Fig. 1(a1), at training time, when node 1 aggregates the messages from its neighbors, node 2 is among its direct neighbors, and the message from node 2 contributes to the computation of node 1’s embedding. In a realistic test scenario (Fig. 1(b2)), future links are not observed in the test data; so, when node $B$ aggregates the messages from its neighbors, it does not include any message from node $C$ as the latter is not a direct neighbor. This poses a distribution shift between training and testing, and also results in poor GNN model generalizability.

4.2. Issues during Test Time

At test time, including the test target links, $T_{\text{tst}}$, in message passing results in implicit data leakage.

(I3) Data Leakage. As shown in Fig. 1(b1), when the test target $e_{BC}$ exists in the test message-passing graph, the target node $B$ would aggregate messages from $C$ and vice versa, which results in a higher likelihood of predicting a link between nodes $B$ and $C$ during inference, compared to the case when $e_{BC}$ does not exist in the test graph in Fig. 1(b2). This leads to overestimation of the model’s predictive performance and directly impacts the deployment of GNN models, since future links that need to be predicted are never observed in real-world applications.
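A simple sanity check for this kind of leakage is to test whether any test target edge also appears among the message-passing edges, and to drop such edges if so. The sketch below is a hypothetical utility written for illustration (it is not the SpotTarget or DGL API); edges are treated as undirected.

```python
def find_leaked_targets(message_passing_edges, test_targets):
    """Return test target edges that also appear as message-passing edges."""
    # Treat edges as undirected by normalizing each pair to a frozenset.
    observed = {frozenset(e) for e in message_passing_edges}
    return [e for e in test_targets if frozenset(e) in observed]

def exclude_targets(message_passing_edges, targets):
    """Drop every target edge (in either direction) from the edge list."""
    blocked = {frozenset(e) for e in targets}
    return [e for e in message_passing_edges if frozenset(e) not in blocked]

# Hypothetical test graph in which the target (B, C) is stored as (C, B).
mp_edges = [("A", "B"), ("C", "B"), ("C", "D")]
test_targets = [("B", "C")]

leaked = find_leaked_targets(mp_edges, test_targets)
clean_edges = exclude_targets(mp_edges, test_targets)
print(leaked, clean_edges)
```

Normalizing edge direction matters in practice, since an undirected target edge is often stored in both directions or in the opposite order from the target list.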

5. Proposed Framework: SpotTarget

Figure 2. Example 2-hop message-passing graph for a mini-batch of size 4. Red lines are train target links and black lines correspond to other message-passing edges induced by the target edges. As shown on the left, excluding all target (red) links $T_{\text{tr}}$ during training results in three disconnected components. As shown on the right, if only the target edges incident to low-degree nodes, $T_{\text{low}}$, are excluded (e.g., degree $\leq 2$), the graph connectivity is preserved. Our proposed solution avoids significant corruption of the graph structure while simultaneously avoiding issues (I1) and (I2).

Figure 3. Average degree change for nodes when excluding training target links. Panels: (a) Ogbl-Collab, (b) Ogbl-Citation2, (c) USAir, (d) E-commerce. The Y-axis corresponds to the relative change in degree before and after excluding all of the train target links in each mini-batch. Lower-degree nodes have higher relative degree change; for nodes with degree less than 5, the relative degree change is as high as 100%.
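The quantity plotted in Fig. 3 can be sketched in a few lines. The formula $(d_{\text{before}} - d_{\text{after}}) / d_{\text{before}}$ is one natural reading of "relative change in degree" and is an assumption here; the star graph is a hypothetical example, not one of the datasets.

```python
from collections import Counter

def relative_degree_change(edges, batch_targets):
    """Per-node (d_before - d_after) / d_before after dropping batch targets."""
    removed = {frozenset(e) for e in batch_targets}
    before, after = Counter(), Counter()
    for u, v in edges:
        before[u] += 1
        before[v] += 1
        if frozenset((u, v)) not in removed:
            after[u] += 1
            after[v] += 1
    return {n: (before[n] - after[n]) / before[n] for n in before}

# Hypothetical star graph: hub 0 connected to leaves 1..10.
star_edges = [(0, leaf) for leaf in range(1, 11)]
change = relative_degree_change(star_edges, batch_targets=[(0, 1)])
print(change[0], change[1])
```

Removing the single target edge changes the hub's degree by one tenth but removes the leaf's only neighbor, which mirrors the trend reported in the figure.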

In this section, we present SpotTarget, the first framework that systematically resolves the issues arising from the presence of target links in the message-passing graph for link prediction. We propose separate solutions that are tailored to training and inference time.

5.1. Training-time Solution: Exclude Target Links Incident to Low-degree Nodes

As discussed in Sec. 4.1, the practice of including train target links $T_{\text{tr}}$ as message-passing edges causes overfitting and distribution shift (I1-I2). One straightforward solution is to exclude all train target edges during training. However, this poses several challenges for both mini-batch and full-batch settings:

  • First, for mini-batch training, an excluded link could be a message-passing edge of another target edge. For example, in Fig. 2, $e_{14}$ is both a target edge and a message-passing edge for node 4 and node 1. The existence of $e_{14}$ affects the message-passing graphs, and, in turn, the learning of target edges $e_{46}$, $e_{01}$ and $e_{12}$. Excluding all the target edges causes significant corruption of the graph structure. In an extreme case, some nodes become isolated nodes, such as node 2 in Fig. 2. As a result, GNN models may fail to learn good representations when all target edges are excluded.

  • Second, in full-batch training, if all edges are used as training target edges, then excluding all edges will result in a graph with only nodes and no edges, which is impractical. If only a portion of edges are used as training edges, the graph structure corruption caused by excluding all target edges still applies to full-batch settings. Moreover, full-batch training requires iterating over all edges in the graph to remove the target edges at every training step, which is especially time-consuming. In practice, full-batch training for link prediction on massive graphs is rare, since it is inefficient in terms of time and space complexity. As a result, we only consider mini-batch training in our proposed framework, SpotTarget.

  • Third, although setting the batch size to 1 can solve the structure corruption, the mini-batch message-passing graph would become too small, causing inefficiency and instability for GNN training.

The question then becomes: How can we achieve the best trade-off between avoiding issues (I1, I2) caused by the presence of train target links and preserving the graph structure in mini-batch training as much as possible? The key insight to tackle this problem lies in identifying which nodes are most affected by issues (I1, I2), and only excluding target links incident to those nodes. At a high level, we show theoretically and empirically that low-degree nodes are impacted most by the inclusion of target links, as it causes more significant relative degree changes for them compared to other nodes. Excluding the target links incident to low-degree nodes achieves the best trade-off: since they have few neighbors, there is generally a small probability that the excluded target links that are incident to them are message-passing edges of another node in the mini-batch training. Next, we provide a theoretical and quantitative analysis to show that low-degree nodes are affected most by the issues, and that target links in $T_{\text{low}}$ should be excluded during training.
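The trade-off can be illustrated by comparing the two exclusion strategies on one mini-batch. The sketch below is a simplified illustration, not the SpotTarget implementation: the graph, the batch of target edges, the strict threshold $\min(d_u, d_v) < \delta$, and all function names are assumptions.

```python
from collections import Counter

def exclude_edges(edges, excluded):
    """Remove the given (undirected) edges from the edge list."""
    blocked = {frozenset(e) for e in excluded}
    return [e for e in edges if frozenset(e) not in blocked]

def isolated_nodes(nodes, edges):
    """Nodes that have no remaining message-passing edge."""
    touched = {n for e in edges for n in e}
    return sorted(set(nodes) - touched)

def low_degree_exclusions(edges, batch_targets, delta):
    """Batch targets incident to at least one node with degree < delta."""
    degree = Counter(n for e in edges for n in e)
    return [(u, v) for u, v in batch_targets
            if min(degree[u], degree[v]) < delta]

# Hypothetical mini-batch graph; degrees are 0:2, 1:3, 2:2, 3:3, 4:3, 5:1.
edges = [(0, 1), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
nodes = range(6)
batch_targets = [(1, 2), (2, 4), (4, 5), (0, 1)]

# Strategy 1: exclude every target edge in the batch.
kept_all = exclude_edges(edges, batch_targets)
# Strategy 2: exclude only targets incident to a low-degree node.
t_low = low_degree_exclusions(edges, batch_targets, delta=2)
kept_low = exclude_edges(edges, t_low)

print(len(kept_all), isolated_nodes(nodes, kept_all))
print(len(kept_low), isolated_nodes(nodes, kept_low))
```

In this toy case, excluding all targets leaves three message-passing edges and isolates node 2, whose two edges are both targets. Excluding only the low-degree targets keeps six edges and removes just the edge of the degree-1 node 5, which is the node most exposed to issues (I1, I2).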

Theoretical Analysis. We begin by explaining from a theoretical perspective why primarily low-degree nodes suffer from the issues caused by the inclusion of train targets compared to high-degree nodes. Intuitively, we compare the change in influence that a random node $v_k$ has on a high-degree node $v_h$ and a low-degree node $v_l$ before and after excluding an edge incident to $v_h$ and $v_l$, respectively. We leverage the notion of influence/effect functions in statistics (Xu et al., 2018; Tang et al., 2020) to measure the relative influence of a node on another node through a specific train edge.

Theorem 1.

Let $v_h$ and $v_l$ be two nodes in a graph with degrees $d_h > d_l$, and let $v_k$ be an arbitrary node in the graph. Assume that ReLU is the activation function, the $\Lambda$-layer GNN is untrained, and all random walk paths have a return probability of 0. We denote the effect of node $v_k$ on node $v_h$ after the $\Lambda$-th GNN layer as $\partial x_h^{\Lambda} / \partial x_k$, where $x_h, x_k$ are $n$-dimensional vectors indicating the embeddings of nodes $v_h, v_k$, respectively.
Further, we denote the effect of node $v_k$ on node $v_h$ after removing an edge incident to $v_h$ as $\partial \tilde{x}_h^{\Lambda} / \partial x_k$. We define the change in effect of $v_k$ on $v_h$ before and after removing an edge incident to $v_h$ as the distance function $D(k,h) = 1 - \mathbb{E}\left( \frac{\partial \tilde{x}_{h,s}^{\Lambda}}{\partial x_{k,t}} \Big/ \frac{\partial x_{h,s}^{\Lambda}}{\partial x_{k,t}} \right)$ for any entries $1 \leq s, t \leq n$ of $x_h$ and $x_k$.
Similarly, we define the change in effect of node $v_k$ on $v_l$ as $D(k,l) = 1 - \mathbb{E}\left( \frac{\partial \tilde{x}_{l,s}^{\Lambda}}{\partial x_{k,t}} \Big/ \frac{\partial x_{l,s}^{\Lambda}}{\partial x_{k,t}} \right)$ for any entries $1 \leq s, t \leq n$ of $x_l$ and $x_k$. Then, $D(k,h) < D(k,l)$.

Theorem 1 states that the change in influence of a random node $v_k$ on another node $v$ caused by excluding a target link is larger for low-degree nodes $v_l$. This suggests that low-degree nodes benefit more from excluding target edges: when all target edges are present, low-degree nodes are more vulnerable to the issues brought by the inclusion of target edges. This statement holds for any GNN model relying on message passing. Detailed proofs are provided in the full version of the paper (https://arxiv.org/abs/2306.00899).
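As a rough illustration of the intuition (not the proof), consider a single mean-aggregation layer without self-loops and without nonlinearity. This simplified setting is an assumption made here for exposition and is not the setting of Theorem 1. Each neighbor $v_u$ of a node $v$ with degree $d_v$ then carries an equal share of the aggregated message:

```latex
x_v^{1} = \frac{1}{d_v} \sum_{u \in \mathcal{N}(v)} x_u
\quad\Longrightarrow\quad
\frac{\partial x_v^{1}}{\partial x_u} = \frac{1}{d_v}\, I_n
\quad \text{for each } u \in \mathcal{N}(v).
```

Under this assumption, removing one incident edge removes a share $1/d_v$ of the incoming neighbor weight: $1/2$ for a node with $d_l = 2$, but only $1/20$ for a node with $d_h = 20$. The same single edge therefore perturbs the low-degree node's input far more.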

Quantitative Analysis: Average degree change. We further support our claim that low-degree nodes are affected more by providing a quantitative analysis on the relative degree changes before and after excluding the train target links. We analyze four datasets of various sparsity levels, as shown in Tab. 2. For each dataset, we sort its nodes by their degrees and report the average degree change before and after excluding the train targets for each mini-batch epoch. As shown in Fig. 3, for low-degree nodes, the relative change is near 100%, while for high-degree nodes it is less than 10%.
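The relative degree change can be sketched on a toy graph as follows. This is a minimal illustration, not the measurement code behind Fig. 3; the function name `relative_degree_change` and the toy edges are hypothetical, and the sketch assumes an undirected graph without duplicate edges in which every target edge is also present in `edges`.

```python
from collections import Counter


def relative_degree_change(edges, target_edges):
    """Return {node: fraction of its degree lost when target_edges are removed}."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    lost = Counter()
    for u, v in target_edges:
        lost[u] += 1
        lost[v] += 1
    return {node: lost[node] / degree[node] for node in degree}


# Toy graph: node 0 is a hub with five neighbors; node 5 has a single neighbor.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]
targets = [(0, 5)]  # hypothetical train target link
change = relative_degree_change(edges, targets)
print("low-degree endpoint:", change[5])
print("high-degree endpoint:", change[0])
```

Removing the single target link wipes out the entire neighborhood of the degree-1 node, while the hub loses only one fifth of its neighbors.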

Proposed Solution: Exclude $T_{\text{low}}$. To achieve the best trade-off between avoiding issues (I1)-(I2) and minimizing the corruption of the graph structure in mini-batch training, SpotTarget excludes $T_{\text{low}}$, the train target edges where at least one incident node has degree lower than a degree threshold $\delta$. Implementation-wise, to ensure the scalability and usability of our proposed solution in large-scale, real-world applications, we implemented it as a subclass of DGL’s edge sampler, which is comparable to DGL’s original edge sampler and can be readily combined with other DGL functions.
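The selection rule for $T_{\text{low}}$ can be sketched in plain Python, independent of DGL. This is a minimal sketch and not the DGL sampler subclass described above; the helper names are hypothetical, and computing degrees on the full training graph (target edges included) is an assumption of this example.

```python
from collections import Counter


def select_low_degree_targets(edges, target_edges, delta):
    """Return target edges with at least one endpoint of degree < delta."""
    degree = Counter()
    for u, v in edges:  # degrees on the full training graph (assumption)
        degree[u] += 1
        degree[v] += 1
    return [(u, v) for u, v in target_edges
            if min(degree[u], degree[v]) < delta]


def message_passing_edges(edges, target_edges, delta):
    """Drop only the low-degree target edges from the message-passing graph."""
    excluded = set(select_low_degree_targets(edges, target_edges, delta))
    return [e for e in edges if e not in excluded]


# Degrees: node 0 -> 3, nodes 1, 2, 3 -> 2, node 4 -> 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
targets = [(0, 1), (3, 4)]  # hypothetical mini-batch of train targets
t_low = select_low_degree_targets(edges, targets, delta=2)
mp_edges = message_passing_edges(edges, targets, delta=2)
print(t_low, mp_edges)
```

Only the target edge touching the degree-1 node is removed; the target edge between better-connected nodes stays in the message-passing graph, which limits structural corruption.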

5.2. Test-time Best Practice: Exclude All Test Target Links

As we have discussed, including the test target links in the test message-passing graph causes test data leakage. This may occur inadvertently—for example, when adapting GNNs designed for node classification tasks for the link prediction task by simply stacking a decoder function—or when test target links are explicitly added into the graph to ensure that there is no distribution shift issue. We argue that under no circumstance should the test edges be used as message-passing edges. This would ensure more accurate estimation of GNN’s predictive performance.

Algorithm 1 SpotTarget: Leakage Check($\mathbf{G}$)
Input: An input graph $\mathbf{G}$, edge splits $\mathbf{S}$, and a flag $\mathbf{K} \in \{T, F\}$ indicating whether validation target edges are used as inference inputs
Output: The desired inference graph $\mathbf{G}_{\text{infer}}$
// STEP 1. Check if the input graph contains validation and test edges
$C_{\text{valid}}$ = CheckExistence($\mathbf{G}$, $\mathbf{S}_{\text{valid}}$)
$C_{\text{test}}$ = CheckExistence($\mathbf{G}$, $\mathbf{S}_{\text{test}}$)
// STEP 2. Delete test and validation edges according to user needs
if $C_{\text{test}}$ is True then
    $\mathbf{G}_{\text{infer}}$ = RemoveEdge($\mathbf{G}$, $\mathbf{S}_{\text{test}}$)
else
    $\mathbf{G}_{\text{infer}}$ = $\mathbf{G}$
// If validation edges exist in the inference graph and this is not desired
if $C_{\text{valid}}$ is True and $\mathbf{K}$ is False then
    $\mathbf{G}_{\text{infer}}$ = RemoveEdge($\mathbf{G}_{\text{infer}}$, $\mathbf{S}_{\text{valid}}$)
return $\mathbf{G}_{\text{infer}}$
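The leakage check can be sketched with plain Python sets. This is an illustrative sketch and not the released SpotTarget code; the function name `leakage_check`, the `splits` dictionary layout, and the treatment of edges as undirected pairs of comparable node IDs are assumptions of this example.

```python
def _canon(edge):
    """Store an undirected edge with its endpoints in sorted order."""
    u, v = edge
    return (u, v) if u <= v else (v, u)


def leakage_check(graph_edges, splits, use_valid_edges):
    """Return the inference edge set with test (and optionally validation) edges removed."""
    graph = {_canon(e) for e in graph_edges}
    valid = {_canon(e) for e in splits["valid"]}
    test = {_canon(e) for e in splits["test"]}

    # STEP 1: check whether validation/test edges are present in the graph.
    contains_valid = bool(graph & valid)
    contains_test = bool(graph & test)

    # STEP 2: test edges are always removed; validation edges only on request.
    infer = graph - test if contains_test else set(graph)
    if contains_valid and not use_valid_edges:
        infer -= valid
    return infer


graph_edges = [(0, 1), (1, 2), (2, 3), (4, 3)]
splits = {"valid": [(1, 2)], "test": [(3, 4)]}
with_valid = leakage_check(graph_edges, splits, use_valid_edges=True)
without_valid = leakage_check(graph_edges, splits, use_valid_edges=False)
print(sorted(with_valid), sorted(without_valid))
```

The test edge is removed in both calls even though it was stored in the opposite direction, and the validation edge is kept only when the caller explicitly asks for it.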

Proposed Solution. SpotTarget excludes all the test target links from the test message-passing graph. Moreover, it supports automatically checking for data leakage in user-specified data splits and rectifying the issues as needed (Alg.  1). In prior work (Hu et al., 2020b), validation edges are sometimes used in the message-passing graphs to obtain more information, especially for data that is split into training/validation/test sets according to time. Including the validation target edges is typically not seen as data leakage. The decision of whether or not to use the validation edges as message-passing edges depends on the application of interest. SpotTarget requires the user to deliberately define this design choice, and generates the inference graph that complies with the user requirements.

6. Experiments

Through our extensive empirical analysis, we aim to address the following research questions:

  • RQ1: How well does SpotTarget address issues (I1) and (I2) on commonly-used graph benchmarks, which are dense?

  • RQ2: How well does SpotTarget perform on sparse graphs with very skewed degree distributions?

  • RQ3: How well does SpotTarget address issues (I1)-(I2) for edges incident to low-degree nodes on popular benchmarks?

  • RQ4: At test time, how much is the performance of GNN models overestimated due to implicit data leakage (I3)?

Before presenting our results, we describe the experiment setup.

Data. We evaluate our framework on four real-world datasets on the link prediction task and give their statistics in Tab. 2. Ogbl-Collab and Ogbl-Citation2 (Hu et al., 2020b) are author collaboration and citation networks. USAir (Ribeiro et al., 2017) is a network of US airlines. We note that these datasets are relatively dense, with average node degree of 8-20. In real-world applications, the observed data is typically incomplete and sparse, with skewed degree distributions and many low-degree nodes. For this reason, we also consider E-commerce (Reddy et al., 2022), a sparse real-world dataset of queries and related products that are exact matches in Amazon Search.

Table 2. Dataset statistics based on the training splits.

| Dataset | # Nodes | # Edges | Avg. node deg. | Attr. dim. |
|---|---|---|---|---|
| ogbl-collab (Hu et al., 2020b) | 235,868 | 2,358,104 | 8.20 | 128 |
| ogbl-citation2 (Hu et al., 2020b) | 2,927,963 | 30,387,995 | 20.73 | 128 |
| USAir (Ribeiro et al., 2017) | 332 | 3,402 | 10.25 | 332 |
| E-commerce (Reddy et al., 2022) | 346,439 | 238,818 | 1.38 | 768 |

Table 3. RQ1-Training Issues: Results on dense graphs. Test performance of different training frameworks across GNNs and datasets. SpotTarget has the best overall performance (lowest rank) across all datasets. *OOM = out of GPU memory.

| Model | ExcludeNone(Tr) | ExcludeAll | ExcludeRandom | SpotTarget |
|---|---|---|---|---|
| Ogbl-Collab (H@50 ↑) | | | | |
| SAGE | 48.57 ± 0.74 | 45.82 ± 0.41 | 45.74 ± 1.33 | 49.00 ± 0.65 |
| MB-GCN | 43.03 ± 0.50 | 37.75 ± 1.42 | 41.43 ± 2.25 | 39.58 ± 1.06 |
| GATv2 | 45.61 ± 0.85 | 45.71 ± 0.87 | 45.87 ± 0.64 | 45.46 ± 0.19 |
| SEAL | 61.27 ± 0.28 | 64.11 ± 0.30 | 64.40 ± 0.57 | 64.57 ± 0.30 |
| Ogbl-Citation2 (MRR ↑) | | | | |
| SAGE | 82.06 ± 0.06 | 81.47 ± 0.17 | 82.06 ± 0.13 | 82.18 ± 0.18 |
| MB-GCN | 79.70 ± 0.25 | 79.06 ± 0.30 | 80.39 ± 0.15 | 79.88 ± 0.14 |
| GATv2 | OOM | OOM | OOM | OOM |
| SEAL | 86.75 ± 0.20 | 86.74 ± 0.23 | 86.61 ± 0.39 | 86.93 ± 0.55 |
| USAir (AUC ↑) | | | | |
| SAGE | 95.97 ± 0.17 | 95.71 ± 0.12 | 96.42 ± 0.18 | 96.19 ± 0.53 |
| MB-GCN | 94.00 ± 0.14 | 94.09 ± 0.11 | 93.98 ± 0.06 | 94.28 ± 0.15 |
| GATv2 | 95.05 ± 0.66 | 95.66 ± 0.24 | 95.80 ± 0.24 | 95.87 ± 0.46 |
| SEAL | 95.36 ± 0.24 | 95.94 ± 0.04 | 95.76 ± 0.24 | 96.39 ± 0.09 |
| Rank ↓ | 2.81 | 3.09 | 2.45 | 1.64 |

Table 4. RQ2-Training Issues: Results on the sparse E-commerce dataset. SpotTarget achieves consistently better performance than the baseline across metrics and models. For SAGE and GATv2, SpotTarget is up to 15× more accurate.

| Metrics | SAGE: ExcludeNone(Tr) | SAGE: SpotTarget | MB-GCN: ExcludeNone(Tr) | MB-GCN: SpotTarget | GATv2: ExcludeNone(Tr) | GATv2: SpotTarget |
|---|---|---|---|---|---|---|
| MRR ↑ | 4.40 ± 0.31 | 65.85 ± 0.31 | 17.07 ± 7.38 | 69.67 ± 0.52 | 5.98 ± 0.56 | 69.44 ± 0.55 |
| H@10 ↑ | 6.55 ± 0.37 | 89.67 ± 0.19 | 28.35 ± 7.47 | 89.79 ± 0.25 | 9.64 ± 1.10 | 90.52 ± 0.26 |
| H@1 ↑ | 3.04 ± 0.31 | 52.84 ± 0.46 | 10.83 ± 5.21 | 57.63 ± 0.57 | 3.94 ± 0.81 | 57.11 ± 1.03 |

Metrics. Following prior works, we use Mean Reciprocal Rank (MRR) on Ogbl-Citation2 and Hits@50 on Ogbl-Collab (Hu et al., 2020b). Area Under the Curve (AUC) is used for USAir  (Zhang and Chen, 2018). For E-commerce, we choose to report MRR, Hits@10, and Hits@1, the three most commonly-used evaluation metrics for link prediction (Hu et al., 2020b; Zhang et al., 2021; Vashishth et al., 2019). For all evaluation metrics, the higher the value is, the better.
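For reference, MRR and Hits@K can be sketched as below. This is a simplified version that ranks each positive edge against its own list of negatives and counts ties against the positive; benchmark evaluators may define the metrics slightly differently (for example, by ranking against a shared pool of negatives), so the function names and the tie rule here are assumptions of the example.

```python
def _rank(pos_score, neg_scores):
    """1-based rank of the positive among its negatives (ties count against it)."""
    return 1 + sum(1 for s in neg_scores if s >= pos_score)


def mrr(pos_scores, neg_scores_per_pos):
    """Mean reciprocal rank over all positive edges."""
    ranks = [_rank(p, negs) for p, negs in zip(pos_scores, neg_scores_per_pos)]
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(pos_scores, neg_scores_per_pos, k):
    """Fraction of positive edges ranked within the top k."""
    ranks = [_rank(p, negs) for p, negs in zip(pos_scores, neg_scores_per_pos)]
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical model scores: the first positive ranks 1st, the second ranks 3rd.
pos = [0.9, 0.4]
negs = [[0.1, 0.5], [0.6, 0.7, 0.2]]
print(mrr(pos, negs), hits_at_k(pos, negs, 1), hits_at_k(pos, negs, 3))
```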

GNN models. We select four GNN models to validate our proposed solutions. SAGE (Hamilton et al., 2017), MB-GCN (Kipf and Welling, 2016a) and GATv2 (Brody et al., 2021) are GAE-based models. MB-GCN is a mini-batch GCN model in which only a portion of the entire graph is seen at each iteration. SEAL (Zhang and Chen, 2018) is a subgraph-based model that extracts an enclosing subgraph for each target edge and predicts the link likelihood based on the subgraph’s embeddings. All GNNs are implemented in DGL (Paszke et al., 2017; Wang et al., 2019). We tune the hyperparameters of each model and report the best-performing configuration.

Baselines. For training-time issues (I1, I2), we use ExcludeNone(Tr), ExcludeAll and ExcludeRandom as our baselines. ExcludeNone(Tr) does not exclude any training target links, while ExcludeAll excludes all target links $T_{\text{Tr}}$. Note that ExcludeAll on SEAL is essentially FakeEdge, which excludes all target edges on subgraph-based models. ExcludeRandom randomly excludes target edges during training, and the proportion of excluded targets is the same as in SpotTarget. For test-time issues (I3), our baseline is ExcludeNone(Tst), which uses the test target links in the inference graph. This approach corresponds to the case where data leakage occurs, which should always be avoided in real-world applications.
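The proportion matching used by ExcludeRandom can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assumes the set of edges excluded by the degree-based rule is already available and simply draws the same number of target edges uniformly at random.

```python
import random


def exclude_random_matched(target_edges, spot_target_excluded, seed=0):
    """Pick as many random target edges as the degree-based rule excluded."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    return rng.sample(list(target_edges), len(spot_target_excluded))


targets = [(0, 1), (0, 2), (1, 2), (3, 4), (5, 6)]
t_low = [(3, 4), (5, 6)]  # hypothetical output of the degree-based rule
random_excluded = exclude_random_matched(targets, t_low)
print(random_excluded)
```

Because both strategies remove the same number of edges, any performance difference between them reflects which edges are removed rather than how many.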

SpotTarget Variants. At training time, we consider two variants of SpotTarget that differ in the degree threshold $\delta$ used to exclude target links $T_{\text{low}}$ for all datasets, and report the best-performing one: $\delta=10$ or $\delta=20$, which correspond to the average degrees of the dense datasets we used. For the E-commerce dataset, since 99.5% of edges are incident to nodes with degree less than 5, SpotTarget excludes almost all target edges and has a similar effect to ExcludeAll and ExcludeRandom. At test time, we consider two variants of SpotTarget: ExcludeValTst excludes both validation and test target edges from the test graph, while ExcludeTst only excludes the test target links. As shown in Alg. 1, whether to use ExcludeValTst or ExcludeTst depends on the user’s input.

6.1. RQ1-Training Issues: Results on Dense Data

Setup. To evaluate SpotTarget’s ability to address training issues (I1) and (I2) on dense graphs, we report the link prediction performance of four GNN models on three popular dense datasets over three trials. We report the recommended metrics for each dataset. For Ogbl-Collab and Ogbl-Citation2, we generate one negative per target edge during training and use the recommended negatives during evaluation. For USAir, we also generate one negative per target edge during training, while during evaluation, we treat all edges that do not appear in the training, validation, or test sets as negative edges. In addition to the performance for each setting, we also report the average rank of our baselines and proposed framework SpotTarget. Our results are summarized in Tab. 3.
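The average-rank summary can be sketched as below, using only the mean AUC values from the USAir block of Tab. 3. The function name is hypothetical and the tie handling (stable order) is an assumption of this example, so the output covers the USAir rows only and is not meant to reproduce the overall rank row of the table.

```python
def average_ranks(results):
    """results: {setting: {method: score}} -> {method: mean rank}; rank 1 = highest score."""
    totals = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(results) for method, total in totals.items()}


# Mean AUC on USAir, copied from Tab. 3.
usair = {
    "SAGE": {"ExcludeNone(Tr)": 95.97, "ExcludeAll": 95.71,
             "ExcludeRandom": 96.42, "SpotTarget": 96.19},
    "MB-GCN": {"ExcludeNone(Tr)": 94.00, "ExcludeAll": 94.09,
               "ExcludeRandom": 93.98, "SpotTarget": 94.28},
    "GATv2": {"ExcludeNone(Tr)": 95.05, "ExcludeAll": 95.66,
              "ExcludeRandom": 95.80, "SpotTarget": 95.87},
    "SEAL": {"ExcludeNone(Tr)": 95.36, "ExcludeAll": 95.94,
             "ExcludeRandom": 95.76, "SpotTarget": 96.39},
}
usair_ranks = average_ranks(usair)
print(usair_ranks)
```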

Results. SpotTarget achieves the best overall performance (lowest average rank) across datasets and models. On Ogbl-Citation2 and USAir, our method achieves the best result for most models. This indicates that SpotTarget successfully addresses the train issues (I1, I2) while also avoiding significant corruption of the structure in the mini-batch graphs. Although the original implementation of SEAL uses ExcludeAll, we find that replacing that strategy with SpotTarget further improves SEAL’s performance.

Moreover, comparing SpotTarget with ExcludeRandom, we can see that SpotTarget gives better performance in most settings (7 of the 11 model-dataset pairs in Tab. 3). This experimentally supports Theorem 1 and shows that specifically excluding the edges incident to low-degree nodes is more beneficial than excluding the same number of edges at random.

We also observe that ExcludeAll typically results in slightly lower performance compared to ExcludeNone(Tr). As discussed in Sec. 5.1, this is mainly because excluding all target edges in one mini-batch causes a significant change in the graph structure and even isolates some nodes. As a result, GNNs cannot learn good node representations.

Observation 1.

(1) Across all datasets and models, SpotTarget achieves the best overall rank compared with ExcludeNone(Tr), ExcludeAll and ExcludeRandom. This indicates that it successfully addresses the issues (I1) and (I2). (2) In many cases (6/11), ExcludeAll leads to performance degradation because it corrupts the structure of mini-batch graphs.

Table 5. RQ3-Training Issues: Results on low-degree nodes. We report the MRR of SAGE on Ogbl-Citation2 on target edges incident to at least one low-degree node ($\min(d_i, d_j)$) or only to low-degree nodes ($\max(d_i, d_j)$). SpotTarget achieves the best performance.

| Exclusion (MRR ↑) | $\max(d_i,d_j)<10$ | $\max(d_i,d_j)<5$ | $\min(d_i,d_j)<10$ | $\min(d_i,d_j)<5$ | $\min(d_i,d_j)=2$ | $\min(d_i,d_j)=1$ |
|---|---|---|---|---|---|---|
| ExcludeNone(Tr) | 73.11 ± 0.25 | 62.15 ± 0.84 | 78.78 ± 0.12 | 69.54 ± 0.37 | 47.02 ± 0.56 | 27.54 ± 0.88 |
| ExcludeAll | 77.45 ± 0.41 | 75.39 ± 1.42 | 79.17 ± 0.12 | 73.86 ± 0.33 | 60.05 ± 1.11 | 48.60 ± 1.11 |
| ExcludeRandom | 76.11 ± 0.12 | 70.79 ± 0.53 | 79.41 ± 0.06 | 72.31 ± 0.04 | 55.21 ± 0.06 | 41.48 ± 0.42 |
| SpotTarget | 78.08 ± 0.06 | 76.23 ± 0.56 | 79.30 ± 0.18 | 73.87 ± 0.18 | 61.48 ± 0.51 | 51.47 ± 2.51 |

Table 6. RQ4-Test Issue: Leakage quantification. We report the test results of four GNNs over three datasets. Note that ExcludeNone(Tst)’s good performance is due to data leakage; the test edges, never observed in real-world applications, are used during inference. Using test target links should be avoided; our framework, SpotTarget, can automatically check and/or enforce this. *OOM = out of GPU memory.

| Models | SpotTarget: ExcludeValTst | SpotTarget: ExcludeTst | Baseline: ExcludeNone(Tst) |
|---|---|---|---|
| Ogbl-Collab (H@50 ↑) | | | |
| SAGE | 48.57 ± 0.74 | 57.61 ± 0.88 | 83.82 ± 0.59 |
| MB-GCN | 43.03 ± 0.50 | 50.53 ± 1.10 | 75.41 ± 0.43 |
| GATv2 | 45.61 ± 0.85 | 54.94 ± 0.19 | 84.16 ± 2.62 |
| SEAL | 57.50 ± 0.31 | 55.16 ± 1.94 | 99.91 ± 0.05 |
| Ogbl-Citation2 (MRR ↑) | | | |
| SAGE | 82.06 ± 0.06 | 82.28 ± 0.11 | 89.22 ± 0.10 |
| MB-GCN | 79.70 ± 0.25 | 81.25 ± 0.22 | 88.32 ± 0.14 |
| GATv2 | OOM | OOM | OOM |
| SEAL | 86.75 ± 0.20 | 87.01 ± 0.39 | 97.14 ± 0.18 |
| USAir (AUC ↑) | | | |
| SAGE | 95.97 ± 0.17 | 95.51 ± 0.53 | 99.15 ± 0.59 |
| MB-GCN | 94.00 ± 0.14 | 94.11 ± 0.13 | 98.66 ± 0.22 |
| GATv2 | 95.05 ± 0.66 | 94.07 ± 0.21 | 98.96 ± 0.11 |
| SEAL | 95.36 ± 0.24 | 95.10 ± 0.76 | 97.20 ± 0.78 |
| No Leakage? | Yes | Yes | No |
| Deployment | | | |

6.2. RQ2-Training Issues: Results on Sparse Data

Setup. In the real-world E-commerce dataset, the graph is incomplete, sparse and full of low-degree nodes. Based on our theoretical analysis in Sec. 5.1, the low-degree nodes suffer more from training issues. To investigate the usefulness of SpotTarget in such settings, we repeat the previous experiments. Note that we do not report the results of ExcludeAll and ExcludeRandom because almost all edges are incident to nodes with degree less than 5, so SpotTarget excludes nearly every target edge. Furthermore, due to the high sparsity, we also do not report the results for SEAL since it is impractical to construct subgraphs for each node. The results are shown in Tab. 4.

Results. On sparse graphs like E-commerce, SpotTarget achieves up to 14.9$\times$ better performance than ExcludeNone(Tr). Since many real-world graphs are very sparse (e.g., commonsense knowledge graphs and biochemical graphs have an average degree of 2 (Malaviya et al., 2020; Dwivedi et al., 2022)), SpotTarget can improve the performance of GNNs across numerous high-impact settings.
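The size of the improvement can be checked directly from the mean MRR values in Tab. 4. The short sketch below only divides the SpotTarget mean by the ExcludeNone(Tr) mean for each model; the variable names are illustrative and standard deviations are ignored.

```python
# Mean MRR on E-commerce from Tab. 4: (ExcludeNone(Tr), SpotTarget).
table4_mrr = {
    "SAGE": (4.40, 65.85),
    "MB-GCN": (17.07, 69.67),
    "GATv2": (5.98, 69.44),
}

factors = {model: spot / base for model, (base, spot) in table4_mrr.items()}
for model, factor in factors.items():
    print(f"{model}: {factor:.2f}x")
```

The SAGE ratio comes out just under 15, while the gain for MB-GCN is roughly fourfold, so the headline factor is an upper end rather than a uniform gain across models.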

Observation 2.

SpotTarget achieves up to 14.9$\times$ better performance compared to ExcludeNone(Tr). This verifies empirically that low-degree nodes suffer more from issues (I1) and (I2), and that excluding $T_{\text{low}}$ works especially well for datasets with many low-degree nodes.

6.3. RQ3-Training Issues: Results on Low-degree Nodes

Setup. To quantify how much low-degree nodes in dense datasets suffer from issues (I1) and (I2), we explore the predictive performance for edges incident to low-degree nodes. We report the performance on two different edge types: (1) edges that are incident to at least one low-degree node, i.e., $\min(d_i, d_j) < \delta$; and (2) edges that are only incident to low-degree nodes, i.e., $\max(d_i, d_j) < \delta$. We report results on Ogbl-Citation2 for SAGE, and compare SpotTarget against three baselines. Results are shown in Tab. 5.
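The two edge groups can be sketched as a simple filter on endpoint degrees. This is an illustrative example with a hypothetical function name; computing degrees from the training edges is an assumption of the sketch.

```python
from collections import Counter


def split_by_endpoint_degree(eval_edges, train_edges, delta):
    """Return (edges with min endpoint degree < delta, edges with max endpoint degree < delta)."""
    degree = Counter()
    for u, v in train_edges:  # degrees taken from the training graph (assumption)
        degree[u] += 1
        degree[v] += 1
    at_least_one_low = [(u, v) for u, v in eval_edges
                        if min(degree[u], degree[v]) < delta]
    only_low = [(u, v) for u, v in eval_edges
                if max(degree[u], degree[v]) < delta]
    return at_least_one_low, only_low


# Degrees: nodes 0 and 1 -> 3, node 2 -> 2, nodes 3 and 4 -> 1.
train_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4)]
eval_edges = [(0, 1), (0, 3), (3, 4)]
at_least_one_low, only_low = split_by_endpoint_degree(eval_edges, train_edges, delta=3)
print(at_least_one_low, only_low)
```

The second group is always a subset of the first, since an edge whose larger endpoint degree is below the threshold also has its smaller endpoint degree below it.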

Results. For edges that are incident to low-degree nodes, ExcludeAll, ExcludeRandom and SpotTarget achieve significantly better performance than ExcludeNone(Tr). This agrees with our theoretical analysis in Sec. 5.1, which shows that low-degree nodes are harmed more by training issues (I1) and (I2) and benefit more from excluding train target edges. Specifically, compared with ExcludeNone(Tr), ExcludeAll and ExcludeRandom, SpotTarget achieves better performance on various types of edges that are incident to low-degree nodes. This indicates that SpotTarget is better at maintaining the graph structure in mini-batch training.

Observation 3.

Better performance on edges adjacent to low-degree nodes in dense graphs indicates that SpotTarget successfully resolves (I1, I2) on low-degree nodes.

6.4. RQ4-Test Issues: Leakage Quantification

Setup. Beyond the training issues (I1, I2), we aim to quantify the performance gap introduced by the data leakage at test time (I3). To achieve this, we report results on excluding different types of edges from the inference graph (validation, test edges). Although we are not evaluating in a deployed system, by excluding different types of edges, we are mimicking what would happen in a real application. All GNNs are trained using train edges only. ExcludeValTst excludes all validation and test target links during inference, and ExcludeTst excludes only the test target links. Both ExcludeValTst and ExcludeTst are variants of SpotTarget. ExcludeNone(Tst) keeps all validation and test target links during testing, which results in data leakage (I3) and should be avoided in practice. The results are shown in Tab. 6.

Results. When validation target links are used as message-passing edges in inference graphs, we observe a slight performance boost, which matches findings in prior work (Hu et al., 2020b). However, the performance boost due to the inclusion of the test target links is undesired, as it can lead to overestimation of the models’ predictive performance. In practice, test links cannot be observed and utilized. Specifically, when test targets are present during inference, SEAL seemingly achieves near-perfect results, which are not indicative of actual performance. SpotTarget successfully resolves issue (I3).
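A toy example makes the mechanism of the leakage concrete. The scorer below is not SEAL or any model from the experiments; it is a hypothetical predictor that simply checks whether the candidate edge already exists in the message-passing graph, which is the kind of shortcut a model can pick up when target links are visible during inference.

```python
def edge_exists_score(mp_edges, candidate):
    """Score 1.0 if the candidate edge is present in the message-passing graph."""
    u, v = candidate
    return 1.0 if (u, v) in mp_edges or (v, u) in mp_edges else 0.0


def accuracy(mp_edges, positives, negatives):
    """Fraction of candidates classified correctly by the shortcut scorer."""
    correct = sum(edge_exists_score(mp_edges, e) == 1.0 for e in positives)
    correct += sum(edge_exists_score(mp_edges, e) == 0.0 for e in negatives)
    return correct / (len(positives) + len(negatives))


train_edges = {(0, 1), (1, 2), (2, 3)}
test_pos = [(0, 2), (1, 3)]  # hypothetical future links
test_neg = [(0, 3), (0, 4)]  # hypothetical non-links

leaky = accuracy(train_edges | set(test_pos), test_pos, test_neg)
clean = accuracy(train_edges, test_pos, test_neg)
print(leaky, clean)
```

With the test links present in the graph, the shortcut separates positives from negatives perfectly; once they are excluded, it carries no signal about the positives, which is the situation a deployed model actually faces.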

Observation 4.

Due to data leakage (I3), using test edges causes a fake performance boost across multiple datasets, especially for those with time-based splits (e.g., Ogbl-Collab). This inflated performance confirms the necessity of SpotTarget, which always excludes the test target links from the inference graphs at test time. In real applications, future (test) links are never observed, so a model that utilizes information from test target edges has its performance overestimated and shows gains that will not materialize in practice.

7. Conclusion

In this work, we focus on the pitfalls in link prediction with GNNs and systematically study the issues that arise from including the target links as message-passing edges. We are the first to show (both theoretically and empirically) that low-degree nodes suffer more from these issues. Our proposed framework, SpotTarget, strikes the best balance between eliminating the issues from the target links, not significantly corrupting the structure of the mini-batch graphs, and being scalable and easy to use. SpotTarget can help researchers and practitioners adhere to best practices, which are frequently overlooked even by the widely-used GNN frameworks.

Acknowledgments

We thank Yongyi Yang for providing constructive feedback on our theoretical proof. This material is based upon work supported by the National Science Foundation under IIS 2212143, CAREER Grant No. IIS 1845491, and AWS Cloud Credits for Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.

Ethical Discussion

In our work, the datasets we used are publicly available; we did not collect or release new datasets. During data pre-processing, we strictly followed ethical principles and did not attempt to infer sensitive attributes. As discussed in Sec. 6.3, our approach is able to improve performance for edges incident to low-degree nodes, which can be used to mitigate the potential bias of current GNNs on marginalized nodes (e.g., individuals) that have few connections.

References

  • Adamic and Adar (2003) Lada A Adamic and Eytan Adar. 2003. Friends and neighbors on the web. Social networks 25, 3 (2003), 211–230.
  • Blakely et al. (2021) Derrick Blakely, Jack Lanchantin, and Yanjun Qi. 2021. Time and Space Complexity of Graph Convolutional Networks. Accessed on: Dec 31 (2021).
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 26 (2013).
  • Brody et al. (2021) Shaked Brody, Uri Alon, and Eran Yahav. 2021. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491 (2021).
  • Davidson et al. (2018) Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. 2018. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891 (2018).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dong et al. (2022) Kaiwen Dong, Yijun Tian, Zhichun Guo, Yang Yang, and Nitesh V Chawla. 2022. FakeEdge: Alleviate Dataset Shift in Link Prediction. arXiv preprint arXiv:2211.15899 (2022).
  • Dwivedi et al. (2022) Vijay Prakash Dwivedi, Ladislav Rampášek, Mikhail Galkin, Ali Parviz, Guy Wolf, Anh Tuan Luu, and Dominique Beaini. 2022. Long range graph benchmark. arXiv preprint arXiv:2206.08164 (2022).
  • Faloutsos et al. (1999) Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships of the internet topology. ACM SIGCOMM computer communication review 29, 4 (1999), 251–262.
  • Fey and Lenssen (2019) Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
  • Golchin and Surdeanu (2023) Shahriar Golchin and Mihai Surdeanu. 2023. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493 (2023).
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017).
  • Hu et al. (2020b) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020b. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118–22133.
  • Hu et al. (2019) Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2019. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019).
  • Hu et al. (2020a) Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020a. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1857–1867.
  • Ioannidis et al. (2022) Vassilis N Ioannidis, Xiang Song, Da Zheng, Houyu Zhang, Jun Ma, Yi Xu, Belinda Zeng, Trishul Chilimbi, and George Karypis. 2022. Efficient and effective training of language and graph neural network models. arXiv preprint arXiv:2206.10781 (2022).
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
  • Leskovec et al. (2005) Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 177–187.
  • Leskovec et al. (2020) Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2020. Mining of massive data sets. Cambridge university press.
  • Liben-Nowell and Kleinberg (2003) David Liben-Nowell and Jon Kleinberg. 2003. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management. 556–559.
  • Malaviya et al. (2020) Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020. Commonsense knowledge base completion with structural and semantic context. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 2925–2933.
  • Martínez et al. (2016) Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. 2016. A survey of link prediction in complex networks. ACM computing surveys (CSUR) 49, 4 (2016), 1–33.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
  • Reddy et al. (2022) Chandan K. Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian. 2022. Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. (2022). arXiv:2206.06588
  • Ribeiro et al. (2017) Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 385–394.
  • Sainz et al. (2023) Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. arXiv preprint arXiv:2310.18018 (2023).
  • Tang et al. (2020) Xianfeng Tang, Huaxiu Yao, Yiwei Sun, Yiqi Wang, Jiliang Tang, Charu Aggarwal, Prasenjit Mitra, and Suhang Wang. 2020. Investigating and mitigating degree-related biases in graph convoltuional networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1435–1444.
  • Teru et al. (2020) Komal Teru, Etienne Denis, and Will Hamilton. 2020. Inductive relation prediction by subgraph reasoning. In International Conference on Machine Learning. PMLR, 9448–9457.
  • Vashishth et al. (2019) Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. 2019. Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082 (2019).
  • Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019).
  • Wang et al. (2023) Xiyuan Wang, Haotong Yang, and Muhan Zhang. 2023. Neural Common Neighbor with Completion for Link Prediction. arXiv preprint arXiv:2302.00890 (2023).
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In International conference on machine learning. PMLR, 5453–5462.
  • Yan et al. (2021) Zuoyu Yan, Tengfei Ma, Liangcai Gao, Zhi Tang, and Chao Chen. 2021. Link prediction with persistent homology: An interactive view. In International Conference on Machine Learning. PMLR, 11659–11669.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In International conference on machine learning. PMLR, 7134–7143.
  • Zeng et al. (2020) Xiangxiang Zeng, Xiang Song, Tengfei Ma, Xiaoqin Pan, Yadi Zhou, Yuan Hou, Zheng Zhang, Kenli Li, George Karypis, and Feixiong Cheng. 2020. Repurpose open data to discover therapeutics for COVID-19 using deep learning. Journal of proteome research 19, 11 (2020), 4624–4636.
  • Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. Advances in neural information processing systems 31 (2018).
  • Zhang and Chen (2019) Muhan Zhang and Yixin Chen. 2019. Inductive matrix completion based on graph neural networks. arXiv preprint arXiv:1904.12058 (2019).
  • Zhang et al. (2021) Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. 2021. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems 34 (2021), 9061–9073.
  • Zheng et al. (2020) Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, and George Karypis. 2020. Distdgl: distributed graph neural network training for billion-scale graphs. In 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 36–44.
  • Zhou et al. (2023b) Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. 2023b. Don’t Make Your LLM an Evaluation Benchmark Cheater. arXiv preprint arXiv:2311.01964 (2023).
  • Zhou et al. (2023a) Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Bang An, Wei Ai, and Furong Huang. 2023a. Explore Spurious Correlations at the Concept Level in Language Models for Text Classification. arXiv preprint arXiv:2311.08648 (2023).
  • Zhu et al. (2023) Jing Zhu, Xiang Song, Vassilis N Ioannidis, Danai Koutra, and Christos Faloutsos. 2023. TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning. arXiv preprint arXiv:2309.13885 (2023).
  • Zhu et al. (2021) Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. 2021. Neural bellman-ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems 34 (2021), 29476–29490.

Appendix A Appendix

A.1. Experimental Details

E-commerce Dataset Construction. The E-commerce dataset is constructed by keeping only links that represent “exact” matches between queries and products (Reddy et al., 2022). The queries are randomly divided into training, validation, and test sets according to a 70%/10%/20% ratio. We use BERT embeddings (Devlin et al., 2018) as node features.
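As a rough illustration of the 70%/10%/20% random split of queries, the sketch below shuffles query identifiers and partitions them. The helper name `split_queries`, the fixed seed, the rounding rule, and the use of integer query IDs are assumptions made for illustration; the actual preprocessing may differ.

```python
import random

def split_queries(query_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Randomly partition query IDs into train/validation/test sets.

    Hypothetical helper: the name, seed, and rounding rule are illustrative
    assumptions, not the authors' preprocessing code.
    """
    ids = list(query_ids)
    rng = random.Random(seed)          # local RNG so the split is reproducible
    rng.shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_valid = round(ratios[1] * len(ids))
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]     # the remainder goes to the test set
    return train, valid, test

train, valid, test = split_queries(range(1000))
print(len(train), len(valid), len(test))
```

Splitting at the query level (rather than at the edge level) keeps all links of a given query in the same partition.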

Hyperparameter Tuning. We conduct extensive hyperparameter tuning using grid search. We search over learning rates $= \{$1e-2, 1e-3, 1e-4, 5e-4, 5e-5$\}$, number of layers $= \{1, 2, 3\}$, and hidden dimension $= \{128, 256, 512, 1024\}$. We report the best-performing hyperparameters for each setting. We used an NVIDIA A40 GPU to train the models and repeated our experiments with three random seeds. Test results are reported at the epoch with the best validation performance. Our results on FakeEdge are lower than those originally reported for two reasons: (1) we use a different split of USAir because no public split is available; and (2) FakeEdge sets the number of hops to 2 and the hidden channels to 128, which we found to be computationally intensive and infeasible on larger datasets such as Ogbl-Citation2. We instead follow the settings from SEAL and set the number of hops to 1 and the hidden channels to 32 (Dong et al., 2022).
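The grid described above can be enumerated as in the following sketch. The search space is taken from the text; `grid_search`, `validate_fn`, and `dummy_validate` are hypothetical names, and the dummy scoring function merely stands in for training a model and measuring its validation metric.

```python
from itertools import product

# Search space as listed in the text.
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 5e-4, 5e-5]
NUM_LAYERS = [1, 2, 3]
HIDDEN_DIMS = [128, 256, 512, 1024]

def grid_search(validate_fn):
    """Return (best_config, best_score) over the full grid.

    `validate_fn` is a hypothetical callable that trains a model with the
    given config and returns its validation metric (higher is better).
    """
    best_config, best_score = None, float("-inf")
    for lr, layers, hidden in product(LEARNING_RATES, NUM_LAYERS, HIDDEN_DIMS):
        config = {"lr": lr, "num_layers": layers, "hidden_dim": hidden}
        score = validate_fn(config)
        if score > best_score:         # strict '>' keeps the first best config
            best_config, best_score = config, score
    return best_config, best_score

# Toy stand-in for training, used only to exercise the loop.
def dummy_validate(config):
    return -abs(config["lr"] - 1e-3) - abs(config["num_layers"] - 2)

best_config, best_score = grid_search(dummy_validate)
print(best_config)
```

In an actual run, `validate_fn` would also be averaged over the random seeds before configurations are compared.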

Ablation: Which Degree to Use? At training time, we only exclude target edges incident to nodes whose degree is smaller than a threshold $\delta$. A natural question is how to determine $\delta$. We conduct experiments on USAir with varying $\delta$; the results are shown in Fig. 4. As we raise the degree threshold (and thus exclude more target edges), the performance of the model first goes up and then goes down, forming an (inverted) U-shaped curve. This indicates that we need to strike a balance between eliminating the training issues and preserving the structure of the mini-batch graph. A sensitivity check is needed to find the optimal degree threshold. In practice, we found that, for dense datasets, choosing $\delta$ as the average degree typically yields good performance.
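The training-time exclusion rule can be sketched as a mask over the edges of a mini-batch graph. This is a minimal illustration, not the SpotTarget implementation: the function name `spot_target_mask`, the edge-array layout, and the choice to compute degrees from the supplied edge list are all assumptions.

```python
import numpy as np

def spot_target_mask(num_nodes, edges, target_idx, delta):
    """Boolean mask over `edges`: True = keep for message passing.

    edges      : (E, 2) array-like of undirected edges in the mini-batch graph
    target_idx : indices into `edges` of the target (to-be-predicted) edges
    delta      : degree threshold; a node is "low-degree" if degree < delta
    (Hypothetical helper; degrees are computed from `edges` as an assumption.)
    """
    edges = np.asarray(edges, dtype=int)
    target_idx = np.asarray(target_idx, dtype=int)
    # Node degrees, counting both endpoints of every edge.
    degree = np.bincount(edges.ravel(), minlength=num_nodes)
    keep = np.ones(len(edges), dtype=bool)
    src, dst = edges[target_idx, 0], edges[target_idx, 1]
    # A target edge is dropped if at least one endpoint is low-degree.
    low = (degree[src] < delta) | (degree[dst] < delta)
    keep[target_idx[low]] = False
    return keep

# Toy graph: node 0 is a hub, node 4 has degree 1.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
keep = spot_target_mask(num_nodes=5, edges=edges, target_idx=[0, 6], delta=2)
print(keep)
```

In this toy example the target edge (0, 4) is removed because node 4 has degree 1, while the target edge (0, 1) stays in the message-passing graph because both endpoints have degree at least 2. The mask is computed in one vectorized pass over the edges.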

Time Complexity Analysis. The additional time complexity of SpotTarget comes from the target edge exclusion step. In each iteration, we iterate over the edges in the mini-batch to check whether they are incident to low-degree nodes. The time complexity of excluding the target edges is $\mathcal{O}(|B|)$, where $|B|$ is the number of edges in the message-passing graph. The training time of the ExcludeNone(Tr), ExcludeAll, ExcludeRandom and SpotTarget frameworks is similar, since the cost of the additional edge exclusion is much smaller than the cost of the model itself. The difference in the number of edges in the message-passing graph changes the training time only marginally (Blakely et al., 2021).

(a) USAir with SAGE model
(b) USAir with GATv2 model
Figure 4. SpotTarget has robust performance for varying (low) degree thresholds across GNN models. We see a slight ‘U-shape’ effect, which works best when excluding the train target links $T_{\text{low}}$. The red star indicates the average degree.

A.2. Extended Theoretical Analysis

We first prove Theorem 1 on GCN and then extend the proof to general message-passing GNN models.

Proof.

We want to prove that when a neighboring edge of a node is removed in order to eliminate the training issues, overfitting (I1) and distribution shift (I2), the change on high-degree nodes is smaller than the change on low-degree nodes. We first define the overall influence of node $v_k$ on node $v_h$ after the $\Lambda$-th GCN layer as $\pdv{x_h^{\Lambda}}{x_k}$ (Xu et al., 2018; Tang et al., 2020). According to (Tang et al., 2020), the partial derivative of $x_h$ with respect to $x_k$ for a $\Lambda$-layer untrained GCN is

(1) $\pdv{x_{h,s}^{\Lambda}}{x_{k,t}} = \sqrt{d_h d_k} \sum_{p=1}^{\Psi} \prod_{\lambda=0}^{\Lambda} \frac{1}{d_{p^{\lambda}}} \text{diag}(\mathbb{1}_{\sigma_{\lambda}})_{s,s} \mathbf{W}^{\lambda}_{s,t}$

for all $1 \leq s, t \leq n$. Here $\text{diag}(\mathbb{1}_{\sigma_{\lambda}})$ is a diagonal mask matrix representing the activation result, $\Psi$ is the set of all $(\Lambda + 1)$-length random-walk paths on the graph from node $v_h$ to $v_k$, and $p^{\lambda}$ represents the $\lambda$-th node on a specific path $p$ ($p^{0}$ and $p^{\Lambda}$ denote nodes $v_h$ and $v_k$, respectively).

Excluding one neighboring edge of node $v_h$ brings two changes: (1) the degree of node $v_h$ decreases to $d_h - 1$, as one of its neighbors is removed; and (2) there are fewer random-walk paths from node $v_h$ to $v_k$, i.e., $|\tilde{\Psi}| < |\Psi|$. Thus we have

(2) $1 - \mathbb{E}\left(\pdv{\tilde{x}_{h,s}^{\Lambda}}{x_{k,t}} / \pdv{x_{h,s}^{\Lambda}}{x_{k,t}}\right) = 1 - \frac{\sqrt{(d_h - 1) d_k} \sum_{p=1}^{\tilde{\Psi}} \mathbb{E}\left(\prod_{\lambda=0}^{\Lambda} \frac{1}{d_{p^{\lambda}}} \text{diag}(\mathbb{1}_{\sigma_{\lambda}})_{s,s} \mathbf{W}^{\lambda}_{s,t}\right)}{\sqrt{d_h d_k} \sum_{p=1}^{\Psi} \mathbb{E}\left(\prod_{\lambda=0}^{\Lambda} \frac{1}{d_{p^{\lambda}}} \text{diag}(\mathbb{1}_{\sigma_{\lambda}})_{s,s} \mathbf{W}^{\lambda}_{s,t}\right)}$

From (Tang et al., 2020), we have that $\sum_{p=1}^{\Psi} \mathbb{E}\left(\prod_{\lambda=0}^{\Lambda-1} \frac{1}{d_{p^{\lambda}}} \text{diag}(\mathbb{1}_{\sigma_{\lambda}})_{s,s} \mathbf{W}^{\lambda}_{s,t}\right) = v$ is a constant. Eq. 2 can be rewritten as

(3) $1 - \mathbb{E}\left(\pdv{\tilde{x}_{h,s}^{\Lambda}}{x_{k,t}} / \pdv{x_{h,s}^{\Lambda}}{x_{k,t}}\right) = 1 - \sqrt{\frac{d_h - 1}{d_h}} \cdot \frac{\frac{1}{d_h - 1} \text{diag}(\mathbb{1}_{\sigma_{\Lambda}})_{s,s} \mathbf{W}^{\Lambda}_{s,t} \sum_{v_n \in \tilde{N}(h)} v}{\frac{1}{d_h} \text{diag}(\mathbb{1}_{\sigma_{\Lambda}})_{s,s} \mathbf{W}^{\Lambda}_{s,t} \sum_{v_n \in N(h)} v}$

Then we have

(4) $1 - \mathbb{E}\left(\pdv{\tilde{x}_{h,s}^{\Lambda}}{x_{k,t}} / \pdv{x_{h,s}^{\Lambda}}{x_{k,t}}\right) = 1 - \sqrt{1 - \frac{1}{d_h}}$

If $d_h > d_l$, we can deduce $\sqrt{1 - \frac{1}{d_h}} > \sqrt{1 - \frac{1}{d_l}}$; thus $1 - \mathbb{E}\left(\pdv{\tilde{x}_{h,s}^{\Lambda}}{x_{k,t}} / \pdv{x_{h,s}^{\Lambda}}{x_{k,t}}\right) < 1 - \mathbb{E}\left(\pdv{\tilde{x}_{l,s}^{\Lambda}}{x_{k,t}} / \pdv{x_{l,s}^{\Lambda}}{x_{k,t}}\right)$ and $D(k,h) < D(k,l)$ hold.
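To make the degree dependence in Eq. (4) concrete, the following sketch evaluates $1 - \sqrt{1 - 1/d}$ for a few degrees. The function name `relative_influence_drop` is ours and is used only for illustration.

```python
import math

def relative_influence_drop(degree):
    """Right-hand side of Eq. (4): 1 - sqrt(1 - 1/d) for a node of degree d >= 1."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return 1.0 - math.sqrt(1.0 - 1.0 / degree)

# The relative change shrinks as the degree grows.
for d in (1, 2, 5, 10, 100):
    print(d, round(relative_influence_drop(d), 4))
```

Under this formula, removing one edge changes the expected influence by roughly 29% for a node of degree 2, but by only about 0.5% for a node of degree 100.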

With the proof for the GCN model, Theorem 1 can be easily extended to general GNN models. For general GNNs, the output node features of the $\Lambda$-th layer are generated as follows:

(5) $x_h^{\Lambda+1} = \sigma\left(W^{\Lambda} \sum_{v_a \in N(h)} \alpha_{a,h} x_a^{\Lambda}\right)$

where $\alpha$ is either a constant, a quantity determined by node attributes (such as node degrees), or a learned parameter (such as attention scores). We calculate the effect of node $v_k$ on $v_h$ as follows:

(6) $\mathbb{E}\left(\frac{\partial x_h^{\Lambda}}{\partial x_k}\right) = v d_k \sum_{v_n \in N(h)} \alpha_{n,h} \cdot \text{diag}(\mathbb{1}_{\sigma_{\Lambda}}) \cdot \mathbf{W}^{\Lambda}$

For the effect of node $v_k$ after excluding one target edge, the number of neighbors of $v_h$ decreases from $N$ to $N - 1$, and the value of $\alpha$ may also change to $\tilde{\alpha}$. The effect ratio is

(7) $\mathbb{E}\left(\frac{\partial \tilde{x}_{h,s}^{\Lambda}}{\partial x_{k,t}} / \frac{\partial x_{h,s}^{\Lambda}}{\partial x_{k,t}}\right) = \frac{\sum_{v_n \in \tilde{N}(h)} \tilde{\alpha}_{n,h}}{\sum_{v_n \in N(h)} \alpha_{n,h}}$

If $\alpha$ is unrelated to the degree of $v_h$, then $\alpha = \tilde{\alpha}$, the expectation of the ratio is $\frac{d_h - 1}{d_h}$, and the theorem holds with $D(k,h) < D(k,l)$. If $\alpha \propto (d_h)^m$, then the ratio is $\frac{(d_h - 1)(d_h - 1)^m}{d_h (d_h)^m}$. If $m \geq -1$, the theorem still holds. To the best of our knowledge, no existing GNN uses $\alpha \propto (d_h)^m$ with $m < -1$, so our theorem holds for general GNNs. ∎
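As a short worked check of the degree-dependent case, the ratio for $\alpha \propto (d_h)^m$ can be simplified and specialized as below. Reading $m = -1/2$ as the GCN-style case is our interpretation, based on its agreement with Eq. (4), rather than a statement made in the proof itself.

```latex
\begin{align*}
\frac{(d_h-1)(d_h-1)^{m}}{d_h\,(d_h)^{m}} &= \left(1-\frac{1}{d_h}\right)^{m+1}, \\
\left.\left(1-\frac{1}{d_h}\right)^{m+1}\right|_{m=0} &= \frac{d_h-1}{d_h}, \\
\left.\left(1-\frac{1}{d_h}\right)^{m+1}\right|_{m=-1/2} &= \sqrt{1-\frac{1}{d_h}}.
\end{align*}
```

For $m > -1$ the exponent $m + 1$ is positive, so the ratio increases toward 1 as $d_h$ grows and the relative change (one minus the ratio) is smaller for higher-degree nodes. At the boundary $m = -1$ the ratio is identically 1, so the relative change is zero for every degree.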