STAGE: Simplified Text-Attributed Graph Embeddings Using Pre-trained LLMs

Aaron Zolnai-Lucas1*, Jack Boylan1*, Chris Hokamp1, Parsa Ghaffari1

1Quantexa
Correspondence: {firstname}{lastname}@quantexa.com
*Authors contributed equally.
Abstract

We present Simplified Text-Attributed Graph Embeddings (STAGE), a straightforward yet effective method for enhancing node features in Graph Neural Network (GNN) models that encode Text-Attributed Graphs (TAGs). Our approach leverages Large Language Models (LLMs) to generate embeddings for textual attributes. STAGE achieves competitive results on various node classification benchmarks while remaining simpler to implement than current state-of-the-art (SoTA) techniques. We show that utilizing pre-trained LLMs as embedding generators provides robust features for ensemble GNN training, enabling pipelines that are simpler than current SoTA approaches, which require multiple expensive training and prompting stages. We also implement diffusion-pattern GNNs in an effort to make this pipeline scalable to graphs beyond academic benchmarks.


1 Introduction

A Knowledge Graph (KG) typically includes entities (represented as nodes), relationships between entities (represented as edges), and attributes of both entities and relationships Ehrlinger and Wöß (2016). These attributes, referred to as metadata, are often governed by a domain-specific ontology, which provides a formal framework for defining the types of entities and relationships as well as their properties. KGs can be used to represent structured information about the world in diverse settings, including medical domain models Koné et al. (2023), words and lexical semantics Miller (1995), and commercial products Chiang et al. (2019).

Text-Attributed Graphs (TAGs) can be viewed as a subset of KGs, where some node and edge metadata is represented by unstructured or semi-structured natural language text Yang et al. (2023). Examples of unstructured data values in TAGs could include the research article text representing the nodes of a citation graph, or the content of social media posts that are the nodes of an interaction graph extracted from a social media platform. Many real-world datasets are naturally represented as TAGs, and studying how to best represent and learn using these datasets has received attention from the fields of graph learning, natural language processing (NLP), and information retrieval.

Graph Learning and LLMs

With the emergence of LLMs as powerful general purpose reasoning agents, there has been increasing interest in integrating KGs with LLMs Pan et al. (2024). Current SoTA approaches combining graph learning with (L)LMs follow either an iterative or a cascading method. Iterative methods involve jointly training an LM and a GNN for the given task. While this approach can produce a task-specific feature space, it may be complex and resource-intensive, particularly for large graphs. In contrast, cascading methods first apply an LM to extract node features which are then used by a downstream GNN model. Cascading models demonstrate excellent performance on TAG tasks He et al. (2024); Duan et al. (2023a), although they often require multiple stages of training targeted at each pipeline component. More recent cascading techniques implement an additional step, known as text-level enhancement Chen et al. (2024), whereby textual features are augmented using an LLM.

Figure 1: Our proposed approach to node classification. Firstly, the textual attributes of the input graph nodes are encoded using an off-the-shelf LLM. The text embeddings will be used alongside the graph adjacency matrix as input to train a downstream ensemble of GNNs. GNN predictions are then mean-pooled to obtain the final prediction.

Simplifying Node Representation Generation

To the best of our knowledge, all existing cascading approaches require multiple rounds of data generation or finetuning to achieve satisfactory results on TAG tasks He et al. (2024); Duan et al. (2023a); Chen et al. (2024). This bottleneck increases the difficulty of applying such methods to real-world graphs. Our proposed method, STAGE, aims to simplify existing approaches by forgoing LM finetuning and using only a single pre-trained LLM as the node embedding model, without data augmentation via prompting. We study possible configurations of this simplified pipeline and demonstrate that this method achieves competitive performance while significantly reducing the complexity of training and data preparation.

Scalable GNN Architectures

The exponentially growing receptive field required during training of most message-passing GNNs is another bottleneck in both cascading and iterative approaches, becoming computationally intractable for large graphs (Duan et al., 2023b; Liu et al., 2024). Because we wish to study approaches that can be applied in real-world settings, we also explore the implementation of diffusion-pattern GNNs, such as Simple-GCN Wu et al. (2019) and SIGN Frasca et al. (2020), which may enable STAGE to be applied to much larger graphs beyond the relatively small academic benchmarks. Our code is available at https://github.com/aaronzo/STAGE.

Concretely, this work studies several ways to make learning on TAGs more efficient and scalable:

  • Single Training Stage: We perform ensemble GNN training with a fixed LLM as the node feature generator, which significantly reduces training time by eliminating the need for multiple large model training runs.

  • No LLM Prompting: We do not prompt an LLM for text-level augmentations such as predictions or explanations. Instead, we use only the text attributes provided in the dataset.

  • Direct Use of LLM as Text Embedding Model: Using an off-the-shelf LLM as the embedding model makes this method adaptable to new models and datasets. We study several alternative base models for embedding generation.

  • Diffusion-pattern GNN implementation: We contribute an investigation into diffusion-pattern GNNs which enable this method to scale to larger graphs.

The rest of the paper is organized as follows: section 2 gives an overview of related work, section 3 discusses our approach in detail, section 4 studies the performance of STAGE in various settings, and section 5 is a discussion of the experimental results.

[Figure 2 plot: accuracy (%) against total training time (min). Finetuned LM: 104 min, 73.61%; STAGE (ours, ensemble GNNs): 4 min, 77.77%; LM-based GLEM: 551 min, 76.57%; LLM-based TAPE: 192 min, 77.50%.]
Figure 2: The performance trade-off between node classification accuracy and total training time on ogbn-arxiv for SoTA LM-GNN methods. The STAGE model uses text embeddings generated from Salesforce-Embedding-Mistral and an ensemble of GNNs (GCN, SAGE and RevGAT) and MLP. The size of each marker indicates the total number of trainable parameters. Figure adapted from He et al. (2024).

2 Background

Text-Attributed Graphs

Yan et al. (2023) suggest that integrating topological data with textual information can significantly improve the learning outcomes on various graph-related tasks. Chien et al. (2022) incorporate graph structural information into the pre-training stage of pre-trained language models (PLMs), achieving improved performance albeit with additional training overhead, while Liu et al. (2023) further adopt sentence embedding models to unify the text-attribute and graph structure feature space, proposing a unified model for diverse tasks across multiple datasets.

LLMs as Text Encoders

General purpose text embedding models, used in both finetuned and zero-shot paradigms, are a standard component of modern NLP pipelines (Mikolov et al., 2013; Pennington et al., 2014; Reimers and Gurevych, 2019). As LLMs have emerged as powerful zero-shot agents, many studies have considered generating text embeddings as an auxiliary output Muennighoff (2022); Mialon et al. (2023). BehnamGhader et al. (2024) introduce LLM2Vec, an unsupervised method to convert LLMs into powerful text encoders by using bidirectional attention, masked next token prediction and contrastive learning, achieving state-of-the-art performance on various text embedding benchmarks.

Language Models and GNNs

Graph Neural Networks have been successfully applied to node classification and link prediction tasks, demonstrating improved performance when combined with textual features from nodes Kipf and Welling (2017); Li et al. (2022b). Several studies show that finetuning pre-trained Language Models (PLMs), such as BERT Devlin et al. (2019) and DeBERTa He et al. (2021), enhances GNN performance by leveraging textual node features Chen et al. (2024); Duan et al. (2023a); He et al. (2024).

Recent research has explored the integration of LLMs with GNNs, particularly for TAGs. LLMs contribute deep semantic understanding and commonsense knowledge, potentially boosting GNNs’ effectiveness on downstream tasks. However, combining LLMs with GNNs poses computational challenges. Techniques like GLEM Zhao et al. (2023) use the Expectation Maximization framework to alternate updates between LM and GNN modules.

Other approaches include the TAPE method, which uses GPT OpenAI (2023); OpenAI et al. (2024) models for data augmentation, enhancing GNN performance through enriched textual embeddings He et al. (2024). SimTeG demonstrates that parameter-efficient finetuning (PEFT) of PLMs can yield competitive results Duan et al. (2023a). Ye et al. (2024) suggest that finetuned LLMs can match or exceed state-of-the-art GNN performance on various benchmarks.

Building on these insights, the STAGE method focuses on efficient and scalable learning for TAGs by utilizing zero-shot capabilities of LLMs to generate representations without extensive task-specific tuning or auxiliary data generation.

3 Approach

Our cascading approach consists of two steps:

  • A zero-shot LLM-based embedding generator is used to encode the title and abstract (or equivalent textual attribute) of each node. We denote the generated node embeddings as $\mathcal{X}$.

  • An ensemble of GNN architectures is trained on $\mathcal{X}$, and their predictions are mean-pooled to obtain the final node predictions.

Ensembling the predictions from multiple GNN architectures was motivated by our observation of strong performance by different models across different datasets.

3.1 Text Embedding Retrieval

For the text embedding model, we select general-purpose embedding LLMs that rank highly on the Massive Text Embedding Benchmark (MTEB) Leaderboard (https://huggingface.co/spaces/mteb/leaderboard). Specifically, we evaluate gte-Qwen1.5-7B-instruct, LLM2Vec-Meta-Llama-3-8B-Instruct, and SFR-Embedding-Mistral. MTEB ranks embedding models based on their performance across a wide variety of information retrieval, classification and clustering tasks. Each model is used out-of-the-box without any finetuning. An appealing aspect of LLM-based embeddings is the possibility of adding instructions alongside the input text to bias the embeddings for a given task. We empirically evaluate the effect of instruction-biased embeddings in Table 2 of section 4.

Node representations $\mathcal{X}$ are generated using only the title and abstract, or equivalent textual node attributes, omitting the LLM predictions and explanations provided by He et al. (2024). $\mathcal{X}$ will then be used as enriched node feature vectors for training a downstream GNN ensemble.

3.2 GNN Training

Using the previously generated embeddings $\mathcal{X}$ as node features, we train an ensemble of GNN models on the node classification task:

$$\text{Loss}_{\text{cls}} = \mathcal{L}_{\theta}\left(\phi(\text{GNN}(\mathcal{X}, \mathcal{A})), \mathbf{Y}\right), \tag{1}$$

where $\phi(\cdot)$ is the classifier, $\mathcal{A}$ is the adjacency matrix of the graph and $\mathbf{Y}$ is the label. For the GNN architectures we choose GCN Kipf and Welling (2017), SAGE Hamilton et al. (2018) and RevGAT Li et al. (2022a). We also evaluate a multi-layer perceptron (MLP) Haykin (1994) among our GNN models. To combine the predictions from each of the $K$ models in the ensemble, we compute the mean prediction as follows:

$$\bar{\mathbf{p}} = \frac{1}{K}\sum_{k=1}^{K}\mathbf{p}_{k}, \tag{2}$$

where $\mathbf{p}_{k}$ denotes the prediction of the $k$-th model. Cross-entropy loss is used to compute the loss value.
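The mean-pooling step in Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `ensemble_mean`, the toy probabilities, and the assumption that each model outputs a matrix of class probabilities are all choices made for the example.

```python
import numpy as np


def ensemble_mean(predictions):
    """Average per-model class probabilities, as in Eq. (2).

    predictions: list of K arrays, each of shape (num_nodes, num_classes).
    Returns the mean prediction with shape (num_nodes, num_classes).
    """
    stacked = np.stack(predictions, axis=0)  # shape (K, num_nodes, num_classes)
    return stacked.mean(axis=0)


# Toy example (hypothetical values): K = 3 models, 2 nodes, 3 classes.
p_gcn = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_sage = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])
p_mlp = np.array([[0.6, 0.3, 0.1], [0.3, 0.1, 0.6]])

p_bar = ensemble_mean([p_gcn, p_sage, p_mlp])
labels = p_bar.argmax(axis=1)  # final predicted class per node
```

Note that the second node is assigned class 2 by the ensemble even though one of the three toy models prefers class 1, which is the kind of disagreement that averaging is meant to smooth out.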

Diffusion-based GNNs

For a graph $G$ with node features $\mathcal{X}$, a diffusion operator is a matrix $A_{\text{OP}}$ with the same dimensions as the adjacency matrix of $G$. Diffused features $\mathcal{H}$ are then calculated via $\mathcal{H} = A_{\text{OP}}\mathcal{X}$.

We explored Simple-GCN Wu et al. (2019) and SIGN Frasca et al. (2020), both of which employ adjacency-based diffusion operators to pre-aggregate features across the graph before training. SIGN generalizes Simple-GCN by extending the set of operators to include Personalized PageRank Page et al. (1998) and triangle-based operators. Because the diffusion is computed once before training, there is no need to back-propagate through graph convolutions, so the expensive computation can be carried out by distributed computing clusters or efficient sparse graph routines such as GraphBLAS Davis (2019). The prediction head can then be a shallow MLP or logistic regression. We provide implementation specifics in appendix section C to ensure repeatability.
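As a rough illustration of feature pre-aggregation, the sketch below diffuses node features with the symmetrically normalized adjacency matrix with self-loops, which is the operator used by Simple-GCN. It is written with dense NumPy arrays for readability; the function names, the toy graph, and the choice of two hops are assumptions for the example, and the operators actually used in this work are those described in appendix section C.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    deg = a_tilde.sum(axis=1)  # always >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def diffuse(adj, features, num_hops):
    """Return [X, A_op X, A_op^2 X, ...], computed once before training."""
    a_op = normalized_adjacency(adj)
    hops = [features]
    for _ in range(num_hops):
        hops.append(a_op @ hops[-1])
    return hops


# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical values).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

hops = diffuse(adj, x, num_hops=2)
h_simple_gcn = hops[-1]                # Simple-GCN style: only the final power
h_sign = np.concatenate(hops, axis=1)  # SIGN style: all powers side by side
```

Either feature matrix can then be passed to a shallow classifier. On large graphs the dense matrices would be replaced by sparse ones, but the structure of the computation is the same.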

3.3 Parameter-efficient Finetuning LLM

Motivated by the node classification performance gains seen by Duan et al. (2023a) using PEFT, we finetune an LLM on the node classification task. Concretely, we use an LLM embedding model with a low-rank adapter (LoRA) Hu et al. (2021a) and a densely connected classifier head. The pre-trained LLM weights remain frozen as the model trains on input text $T$ to reduce loss according to:

$$\text{Loss}_{\text{cls}} = \mathcal{L}(\phi(\text{LLM}(T)), Y) \tag{3}$$

where $\phi(\cdot)$ is the classifier head and $Y$ is the label. Again, we use cross-entropy loss to compute the loss value.
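To make the roles of the frozen weights, the adapter, and the classifier head concrete, the following toy sketch applies a LoRA-style low-rank update to a single frozen weight matrix and evaluates a cross-entropy loss. It is a schematic stand-in, not the finetuning code used in this work: the dimensions, the scaling factor `alpha`, and the random vector standing in for the LLM representation of $T$ are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, num_classes = 8, 2, 3
alpha = 4.0  # LoRA scaling factor (hypothetical value)

w_frozen = rng.normal(size=(d_model, d_model))    # pre-trained weight, never updated
lora_a = rng.normal(size=(rank, d_model)) * 0.01  # trainable, small random init
lora_b = np.zeros((d_model, rank))                # trainable, zero init
w_head = rng.normal(size=(num_classes, d_model))  # trainable classifier head


def forward(x):
    """Frozen projection plus low-rank update, followed by the classifier head."""
    hidden = w_frozen @ x + (alpha / rank) * (lora_b @ (lora_a @ x))
    return w_head @ hidden


def cross_entropy(logits, label):
    """Cross-entropy for one example, using a numerically stable log-softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


x = rng.normal(size=d_model)  # stand-in for the LLM representation of text T
loss = cross_entropy(forward(x), label=1)
```

Because the second adapter matrix starts at zero, the adapted model initially reproduces the frozen model exactly; only `lora_a`, `lora_b` and `w_head` would receive gradient updates during training.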

4 Experiments

| Dataset | Method | $h_{\text{shallow}}$ | $h_{\text{GIANT}}$ | GPT3.5 | $LM_{\text{finetune}}$ | $h_{\text{TAPE}}$ | $h_{\text{STAGE}}$ (OURS) |
|---|---|---|---|---|---|---|---|
| Cora | MLP | 0.6388 ± 0.0213 | 0.7196 ± 0.0000 | 0.6769 | 0.7606 ± 0.0378 | 0.8778 ± 0.0485 | 0.7680 ± 0.0228 |
| | GCN | 0.8911 ± 0.0015 | 0.8423 ± 0.0053 | 0.6769 | 0.7606 ± 0.0378 | 0.9119 ± 0.0158 | 0.8704 ± 0.0105 |
| | SAGE | 0.8824 ± 0.0009 | 0.8455 ± 0.0028 | 0.6769 | 0.7606 ± 0.0378 | 0.9290 ± 0.0307 | 0.8722 ± 0.0063 |
| | RevGAT | 0.8911 ± 0.0000 | 0.8353 ± 0.0038 | 0.6769 | 0.7606 ± 0.0378 | 0.9280 ± 0.0275 | 0.8639 ± 0.0129 |
| | Ensemble | - | - | - | - | - | 0.8824 ± 0.0155 |
| PubMed | MLP | 0.8635 ± 0.0032 | 0.8175 ± 0.0059 | 0.9342 | 0.9494 ± 0.0046 | 0.9565 ± 0.0060 | 0.9142 ± 0.0122 |
| | GCN | 0.8031 ± 0.0425 | 0.8419 ± 0.0050 | 0.9342 | 0.9494 ± 0.0046 | 0.9431 ± 0.0043 | 0.8960 ± 0.0042 |
| | SAGE | 0.8881 ± 0.0002 | 0.8372 ± 0.0082 | 0.9342 | 0.9494 ± 0.0046 | 0.9618 ± 0.0053 | 0.9087 ± 0.0064 |
| | RevGAT | 0.8850 ± 0.0005 | 0.8502 ± 0.0048 | 0.9342 | 0.9494 ± 0.0046 | 0.9604 ± 0.0047 | 0.8654 ± 0.0952 |
| | Ensemble | - | - | - | - | - | 0.9265 ± 0.0068 |
| ogbn-arxiv | MLP | 0.5336 ± 0.0038 | 0.7308 ± 0.0006 | 0.7350 | 0.7361 ± 0.0004 | 0.7587 ± 0.0015 | 0.7517 ± 0.0011 |
| | GCN | 0.7182 ± 0.0027 | 0.7329 ± 0.0010 | 0.7350 | 0.7361 ± 0.0004 | 0.7520 ± 0.0005 | 0.7377 ± 0.0010 |
| | SAGE | 0.7171 ± 0.0017 | 0.7435 ± 0.0014 | 0.7350 | 0.7361 ± 0.0004 | 0.7672 ± 0.0007 | 0.7596 ± 0.0040 |
| | RevGAT | 0.7083 ± 0.0017 | 0.7590 ± 0.0019 | 0.7350 | 0.7361 ± 0.0004 | 0.7750 ± 0.0012 | 0.7638 ± 0.0054 |
| | Ensemble | - | - | - | - | - | 0.7777 ± 0.0019 |
| ogbn-products | MLP | 0.5385 ± 0.0017 | 0.6125 ± 0.0078 | 0.7440 | 0.7297 ± 0.0023 | 0.7878 ± 0.0082 | 0.7277 ± 0.0054 |
| | GCN | 0.7052 ± 0.0051 | 0.6977 ± 0.0042 | 0.7440 | 0.7297 ± 0.0023 | 0.7996 ± 0.0041 | 0.7679 ± 0.0109 |
| | SAGE | 0.6913 ± 0.0026 | 0.6869 ± 0.0011 | 0.7440 | 0.7297 ± 0.0023 | 0.8137 ± 0.0043 | 0.7795 ± 0.0012 |
| | RevGAT | 0.6964 ± 0.0017 | 0.7189 ± 0.0030 | 0.7440 | 0.7297 ± 0.0023 | 0.8234 ± 0.0036 | 0.8083 ± 0.0051 |
| | Ensemble | - | - | - | - | - | 0.8140 ± 0.0033 |
| tape-arxiv23 | MLP | 0.6202 ± 0.0064 | 0.5574 ± 0.0032 | 0.7356 | 0.7358 ± 0.0006 | 0.8385 ± 0.0246 | 0.7940 ± 0.0022 |
| | GCN | 0.6341 ± 0.0062 | 0.5672 ± 0.0061 | 0.7356 | 0.7358 ± 0.0006 | 0.8080 ± 0.0215 | 0.7678 ± 0.0024 |
| | SAGE | 0.6430 ± 0.0037 | 0.5665 ± 0.0032 | 0.7356 | 0.7358 ± 0.0006 | 0.8388 ± 0.0264 | 0.7894 ± 0.0024 |
| | RevGAT | 0.6563 ± 0.0062 | 0.5834 ± 0.0038 | 0.7356 | 0.7358 ± 0.0006 | 0.8423 ± 0.0256 | 0.7880 ± 0.0023 |
| | Ensemble | - | - | - | - | - | 0.8029 ± 0.0020 |
Table 1: Node classification accuracy for the Cora, PubMed, ogbn-arxiv, ogbn-products, and tape-arxiv23 datasets. The experiment is run over four seeds, with mean accuracy and standard deviation shown. The best results are coloured green (first), yellow (second), and orange (third). For $h_{\text{STAGE}}$, we use SFR-Embedding-Mistral as the embedding model on TA features only, and the simple task instruction to bias the embeddings. We adapt the table from He et al. (2024) and include our results.

We investigate the performance of STAGE over five TAG benchmarks: ogbn-arxiv Hu et al. (2021b), a dataset of arXiv papers linked by citations; ogbn-products Hu et al. (2021b), representing an Amazon product co-purchasing network; PubMed Sen et al. (2008), a citation network of diabetes-related scientific publications; Cora McCallum et al. (2000), a dataset of scientific publications categorized into one of seven classes; and tape-arxiv23 He et al. (2024), focusing on arXiv papers published after the 2023 knowledge cut-off for GPT3.5. We use the subset of ogbn-products provided by He et al. (2024). Further details can be found in appendix Table 7.

For each experiment using Cora, PubMed or tape-arxiv23, 60% of the data was allocated for training, 20% for validation, and 20% for testing. For the ogbn-arxiv and ogbn-products datasets, we adopted the standard train/validation/test split provided by the Open Graph Benchmark (OGB, https://ogb.stanford.edu/) Hu et al. (2021b).
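A 60/20/20 split of this kind can be sketched as below. The sketch assumes a uniformly random shuffle of node indices controlled by a seed; the function name `split_indices` and the node count are hypothetical, and the exact splitting procedure used for the reported results may differ.

```python
import random


def split_indices(num_nodes, seed, train_frac=0.6, val_frac=0.2):
    """Shuffle node indices and cut them into train/validation/test parts."""
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = round(train_frac * num_nodes)
    n_val = round(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, about 20%
    return train, val, test


train_idx, val_idx, test_idx = split_indices(num_nodes=1000, seed=0)
```

Repeating the call with a different seed yields a different partition, which is one way to obtain the per-seed variation summarized by a mean and standard deviation.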

Our main results can be seen in Table 1. Multiple GNN models are trained using embeddings from a pre-trained LLM as node features. We ensemble the predictions across model architectures by taking the mean prediction.

| Dataset | Method | $h_{\text{no instruction}}$ | $h_{\text{task instruction}}$ | $h_{\text{graph-aware-instruction}}$ |
|---|---|---|---|---|
| Cora | MLP | 0.7772 ± 0.0205 | 0.7680 ± 0.0228 | 0.7763 ± 0.0193 |
| | GCN | 0.8612 ± 0.0121 | 0.8704 ± 0.0105 | 0.8718 ± 0.0085 |
| | SAGE | 0.8833 ± 0.0125 | 0.8722 ± 0.0063 | 0.8704 ± 0.0109 |
| | RevGAT | 0.8630 ± 0.0119 | 0.8639 ± 0.0129 | 0.8676 ± 0.0125 |
| | Ensemble | 0.8930 ± 0.0086 | 0.8824 ± 0.0155 | 0.8875 ± 0.0118 |
| PubMed | MLP | 0.9305 ± 0.0052 | 0.9142 ± 0.0122 | 0.9185 ± 0.0145 |
| | GCN | 0.9021 ± 0.0034 | 0.8960 ± 0.0042 | 0.8978 ± 0.0046 |
| | SAGE | 0.9268 ± 0.0052 | 0.9087 ± 0.0064 | 0.9126 ± 0.0024 |
| | RevGAT | 0.8637 ± 0.0942 | 0.8654 ± 0.0952 | 0.9211 ± 0.0022 |
| | Ensemble | 0.9358 ± 0.0035 | 0.9265 ± 0.0068 | 0.9313 ± 0.0025 |
| ogbn-arxiv | MLP | 0.7417 ± 0.0015 | 0.7517 ± 0.0011 | 0.7519 ± 0.0028 |
| | GCN | 0.7336 ± 0.0029 | 0.7377 ± 0.0010 | 0.7367 ± 0.0045 |
| | SAGE | 0.7515 ± 0.0027 | 0.7596 ± 0.0040 | 0.7559 ± 0.0039 |
| | RevGAT | 0.7629 ± 0.0035 | 0.7638 ± 0.0054 | 0.7607 ± 0.0011 |
| | Ensemble | 0.7745 ± 0.0013 | 0.7777 ± 0.0019 | 0.7740 ± 0.0019 |
| ogbn-products | MLP | 0.6841 ± 0.0054 | 0.7277 ± 0.0054 | 0.7163 ± 0.0172 |
| | GCN | 0.7367 ± 0.0068 | 0.7679 ± 0.0109 | 0.7729 ± 0.0033 |
| | SAGE | 0.7543 ± 0.0065 | 0.7795 ± 0.0012 | 0.7811 ± 0.0049 |
| | RevGAT | 0.8016 ± 0.0078 | 0.8083 ± 0.0051 | 0.8000 ± 0.0078 |
| | Ensemble | 0.7991 ± 0.0034 | 0.8140 ± 0.0033 | 0.8090 ± 0.0037 |
| tape-arxiv23 | MLP | 0.7803 ± 0.0014 | 0.7940 ± 0.0022 | 0.7948 ± 0.0025 |
| | GCN | 0.7518 ± 0.0044 | 0.7678 ± 0.0024 | 0.7703 ± 0.0025 |
| | SAGE | 0.7702 ± 0.0022 | 0.7894 ± 0.0024 | 0.7917 ± 0.0021 |
| | RevGAT | 0.7880 ± 0.0047 | 0.7880 ± 0.0023 | 0.7906 ± 0.0034 |
| | Ensemble | 0.8013 ± 0.0017 | 0.8029 ± 0.0020 | 0.8054 ± 0.0025 |
Table 2: Node classification accuracy for the Cora, PubMed, ogbn-arxiv, ogbn-products, and tape-arxiv23 datasets, demonstrating the effect of varying an instruction to bias the embeddings from the pre-trained LLM. The experiment is run over four seeds, with mean accuracy and standard deviation shown. The best results are coloured green (first), yellow (second), and orange (third). For all experiments, we use SFR-Embedding-Mistral as the embedding model on TA features only; the instruction used to bias the embeddings varies by column.

Node classification accuracy is provided for various datasets, measured across multiple methods and feature types. Each column represents a specific metric or method:

  • $h_{\text{shallow}}$: Performance using shallow features, indicating basic attributes provided as part of each dataset.

  • $h_{\text{GIANT}}$: Results obtained by using GIANT features as proposed by Chien et al. (2022), designed to incorporate graph structural information into LM training.

  • GPT3.5: Accuracy when using zero-shot predictions from GPT-3.5-turbo, demonstrating the utility of state-of-the-art language models in a zero-shot setting.

  • $LM_{\text{finetune}}$: Performance metrics reported by He et al. (2024) after finetuning the DeBERTa He et al. (2021) model on labeled nodes from the graph, showing the benefits of supervised finetuning.

  • $h_{\text{TAPE}}$: Results for the TAPE features He et al. (2024), which include the original textual attributes of the node, GPT-generated predictions for each node, and GPT-generated explanations of ranked predictions to enrich node features.

  • $h_{\text{STAGE}}$: Performance of models trained with node features generated by a pre-trained LLM.

Instruction-biased Embeddings

Textual attributes for each node are passed to the embedding LLM together with a task description which remains constant for every text, prefixing each input with a task-specific system prompt. We evaluated 3 simple task descriptions:

  1. A short prompt describing the classification task for the text, as used during the pre-training stage of the LLM.

  2. A description of the types of relationships between texts to form a graph, along with the classification task description. Specific graph structure for each node is not included in the prompt, unlike the proposed method from Fatemi et al. (2024).

  3. No task description.

Our findings are summarized in Table 2. Further details of the instructions can be found in appendix Table 8.
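The three settings can be illustrated with a small helper that prefixes each node's text with a fixed instruction. The `Instruct: ...\nQuery: ...` template below is an assumption about how the embedding model expects instructions to be formatted, and both instruction strings are hypothetical placeholders; the actual wording is given in appendix Table 8.

```python
def build_input(text, instruction=None):
    """Prefix a node's text with a fixed instruction, or return it unchanged."""
    if instruction is None:
        return text
    return f"Instruct: {instruction}\nQuery: {text}"


# Hypothetical instructions, for illustration only.
TASK = "Classify the research paper into its subject area."
GRAPH_AWARE = (
    "Papers are nodes in a citation graph, linked when one cites another. "
    + TASK
)

node_text = "Title: An example paper. Abstract: An example abstract."
variants = {
    "no instruction": build_input(node_text),
    "task instruction": build_input(node_text, TASK),
    "graph-aware instruction": build_input(node_text, GRAPH_AWARE),
}
```

The same instruction is reused for every node in a dataset, so the only per-node variation in the input is the node's own text.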

Parameter-efficient Finetuning

In Table 3 we investigate the effect of using parameter-efficient finetuning (PEFT) on the pre-trained LLM, as described in Duan et al. (2023a). We also compare this against finetuning both the LLM (using PEFT) and the GNN in unison.

| Dataset | LLM + GNN Ensemble | $\text{LLM}_{\text{finetuned}}$ | $\text{LLM}_{\text{finetuned}}$ + GNN Ensemble |
|---|---|---|---|
| Cora | 0.8824 ± 0.0155 | 0.8063 | 0.8856 |
| PubMed | 0.9265 ± 0.0068 | 0.9513 | 0.9559 |
| ogbn-arxiv | 0.7777 ± 0.0019 | 0.7666 | 0.7813 |
| ogbn-products | 0.8140 ± 0.0033 | 0.8020 | 0.8257 |
| tape-arxiv23 | 0.8029 ± 0.0020 | 0.8021 | 0.8095 |
Table 3: Effect of using parameter-efficient finetuning (PEFT) on the pre-trained LLM, as described in Duan et al. (2023a). Comparison of GNN-only trained, LLM finetuned without GNNs, and LLM and GNN trained separately. The best results are highlighted in bold.

Embedding Model Type

In Table 4, we compare the results when using different pre-trained LLMs as the text encoder.

| Dataset | Method | SFR-Embedding-Mistral | LLM2Vec | gte-Qwen1.5-7B-instruct |
|---|---|---|---|---|
| Cora | MLP | 0.7680 ± 0.0228 | 0.8026 ± 0.0141 | 0.7389 ± 0.0136 |
| | GCN | 0.8704 ± 0.0105 | 0.8778 ± 0.0046 | 0.8621 ± 0.0105 |
| | SAGE | 0.8722 ± 0.0063 | 0.8773 ± 0.0062 | 0.8658 ± 0.0049 |
| | RevGAT | 0.8639 ± 0.0129 | 0.8810 ± 0.0033 | 0.8408 ± 0.0076 |
| | Ensemble | 0.8824 ± 0.0155 | 0.8898 ± 0.0066 | 0.8686 ± 0.0024 |
| | Simple-GCN | 0.7389 ± 0.0120 | 0.6983 ± 0.0120 | 0.7491 ± 0.0166 |
| | SIGN | 0.8819 ± 0.0074 | 0.8856 ± 0.0083 | 0.8575 ± 0.0157 |
| PubMed | MLP | 0.9142 ± 0.0122 | 0.9321 ± 0.0013 | 0.8808 ± 0.0107 |
| | GCN | 0.8960 ± 0.0042 | 0.8996 ± 0.0011 | 0.8591 ± 0.0041 |
| | SAGE | 0.9087 ± 0.0064 | 0.9231 ± 0.0056 | 0.8733 ± 0.0051 |
| | RevGAT | 0.8654 ± 0.0952 | 0.9312 ± 0.0026 | 0.8754 ± 0.0010 |
| | Ensemble | 0.9265 ± 0.0068 | 0.9357 ± 0.0031 | 0.8941 ± 0.0041 |
| | Simple-GCN | 0.7505 ± 0.0048 | 0.7400 ± 0.0037 | 0.7472 ± 0.0076 |
| | SIGN | 0.8868 ± 0.0062 | 0.9004 ± 0.0038 | 0.8611 ± 0.0084 |
| ogbn-arxiv | MLP | 0.7517 ± 0.0011 | 0.7331 ± 0.0033 | 0.7603 ± 0.0011 |
| | GCN | 0.7377 ± 0.0010 | 0.7324 ± 0.0014 | 0.7369 ± 0.0022 |
| | SAGE | 0.7596 ± 0.0040 | 0.7428 ± 0.0039 | 0.7664 ± 0.0029 |
| | RevGAT | 0.7638 ± 0.0054 | 0.7529 ± 0.0044 | 0.7738 ± 0.0009 |
| | Ensemble | 0.7777 ± 0.0019 | 0.7701 ± 0.0018 | 0.7817 ± 0.0011 |
| | Simple-GCN | 0.3337 ± 0.0107 | 0.3614 ± 0.0039 | 0.3463 ± 0.0181 |
| | SIGN | 0.6150 ± 0.0182 | 0.6035 ± 0.0084 | 0.6285 ± 0.0114 |
| ogbn-products | MLP | 0.7277 ± 0.0054 | 0.6913 ± 0.0052 | 0.7231 ± 0.0050 |
| | GCN | 0.7679 ± 0.0109 | 0.7479 ± 0.0128 | 0.7701 ± 0.0117 |
| | SAGE | 0.7795 ± 0.0012 | 0.7496 ± 0.0163 | 0.7921 ± 0.0069 |
| | RevGAT | 0.8083 ± 0.0051 | 0.7883 ± 0.0014 | 0.7955 ± 0.0096 |
| | Ensemble | 0.8140 ± 0.0033 | 0.7908 ± 0.0045 | 0.8104 ± 0.0041 |
| | Simple-GCN | 0.6216 ± 0.0052 | 0.6040 ± 0.0039 | 0.6219 ± 0.0039 |
| | SIGN | 0.6668 ± 0.0078 | 0.6621 ± 0.0009 | 0.6698 ± 0.0010 |
| tape-arxiv23 | MLP | 0.7940 ± 0.0022 | 0.7772 ± 0.0033 | 0.8008 ± 0.0018 |
| | GCN | 0.7678 ± 0.0024 | 0.7541 ± 0.0042 | 0.7746 ± 0.0025 |
| | SAGE | 0.7894 ± 0.0024 | 0.7677 ± 0.0018 | 0.7975 ± 0.0016 |
| | RevGAT | 0.7880 ± 0.0023 | 0.7840 ± 0.0058 | 0.7954 ± 0.0028 |
| | Ensemble | 0.8029 ± 0.0020 | 0.7967 ± 0.0037 | 0.8065 ± 0.0022 |
| | Simple-GCN | 0.2516 ± 0.0027 | 0.2451 ± 0.0004 | 0.258 ± 0.0011 |
| | SIGN | 0.7186 ± 0.0041 | 0.6804 ± 0.0041 | 0.733 ± 0.0009 |
Table 4: Node classification accuracy for the Cora, PubMed, ogbn-arxiv, ogbn-products, and tape-arxiv23 datasets, demonstrating the effect of changing the pre-trained LLM text encoder. The experiment is run over four seeds, with mean accuracy and standard deviation shown. The best results are coloured green (first), yellow (second), and orange (third). For all experiments, we use TA features only, and the simple task instruction to bias the embeddings.

Diffusion GNNs

Table 4 also reports the performance of the Simple-GCN and SIGN models when used individually. Model selection and implementation details can be found in Appendices C and D.

Ablation Study

To study the impact of each component in the GNN ensemble, we perform a detailed ablation study. The results can be found in Table 6.

5 Analysis

Main Results (Table 1)

We find that ensembling GNNs always leads to superior performance across datasets when taking the STAGE approach.

Despite the reduced computational resources and training data requirements, the STAGE method remains highly competitive across all benchmarks. The ensemble STAGE approach lags behind the TAPE pipeline by roughly 5% on Cora, 3.5% on PubMed, 0.8% on ogbn-products, and 4% on tape-arxiv23. This is a strong result when we consider that STAGE involves training only the GNN ensemble, whereas TAPE also requires two finetuned LMs to generate node features. We see marginally superior results on the ogbn-arxiv dataset using the ensemble STAGE approach.

Instruction-biased Embedding Results (Table 2)

From our findings we conclude that varying the instructions to bias embeddings has little effect on downstream node classification performance for the models we evaluated. We note that while the authors of all embedding models recommend providing instructions along with input text in order to avoid degrading performance, we did not measure a performance improvement in our experiments.

This experiment further supports our claim that an ensemble approach improves robustness across datasets and methods of node feature generation.

PEFT Results (Table 3)

Finetuning each LLM gave marginal performance improvements across all datasets to varying degrees; we see the largest improvement on PubMed (3%). Notably, finetuning significantly increases the number of trainable parameters (see Table 5) and the total training time. Specifically, PEFT for 7B embedding models has over 20 million trainable parameters. On a single A100 GPU, training runs lasted 6 hours on ogbn-arxiv.

LLM Embedding Model Comparison (Table 4)

All three LLM embedding models demonstrated comparable performance on the graph tasks, with each model exhibiting marginally better results on different datasets. Notably, there was no clear winner among them. The LLM2Vec model exhibited slightly weaker performance on the larger datasets (ogbn-arxiv, ogbn-products, tape-arxiv23), while it was marginally stronger on the smaller datasets (Cora, PubMed).

Ensembling the GNN models consistently ranked among the top three models across all three LLM embedding models, delivering an average performance increase of 1%. Among the individual GNN architectures, RevGAT consistently demonstrated superior performance.

Diffusion-pattern GNN Results (Table 4)

The diffusion-based GNNs yielded variable results across datasets. Specifically, SIGN emerged as the second-best performer on the Cora dataset. As expected, SIGN consistently outperformed Simple-GCN, given that it generalizes the latter. Due to its low training time, SIGN is a viable candidate for large datasets, although careful tuning of its hyper-parameters is recommended for optimal performance.

Ablation Study Results (Table 6)

From our ablation study we observe that no individual GNN model outperforms the full ensemble on any dataset. Additionally, we find that the full ensemble of MLP, GCN, SAGE and RevGAT achieves the highest accuracy on ogbn-arxiv, ogbn-products and tape-arxiv23, and is within 0.002 of the best reduced ensemble on Cora and PubMed.

Scalability

An important advantage of STAGE is the lack of finetuning necessary to achieve strong results. This lies in contrast to approaches such as TAPE He et al. (2024) and SimTeG Duan et al. (2023a), both of which require finetuning at least one LM. Training an ensemble of GNNs and MLP head over the ogbn-arxiv dataset can be performed on a single consumer-grade GPU in less than 5 minutes. This is illustrated in Figure 2, where we compare the relationship between training time and accuracy for a number of SoTA node classification approaches. When using SIGN diffusion, training time was under 12 seconds for ogbn-arxiv, but this came at a performance cost. Moreover, TAPE relies on text-level enhancement via LLM API calls, which adds a new dimension of cost and rate-limiting (see https://platform.openai.com/docs/guides/rate-limits) to consider when adapting to other datasets.

6 Conclusions

This work introduces STAGE, a method to use pre-trained LLMs as text encoders in TAG tasks without the need for finetuning, significantly reducing computational resources and training time. Additional gains can be achieved through parameter-efficient finetuning of the LLM. Data augmentation, which is orthogonal to our approach, could improve performance with general-purpose text embedding models. However, it likely remains intractable for many large-scale datasets due to the need to query a large model for each node.

We also demonstrate the effect of diffusion operators Frasca et al. (2020) on node classification performance, decreasing TAG pipeline training time substantially. We aim to examine the scalability of diffusion-pattern GNNs on larger datasets in later work.

Future work may aim to refine the integration of LLM encoders with GNN heads. Potential strategies include an Expectation-Maximization approach or a joint model configuration Zhao et al. (2023). A significant challenge is the requirement for large, variable batch sizes during LLM finetuning due to current neighborhood sampling techniques, which necessitates increased computational power. We anticipate that overcoming these limitations will make future research more accessible and expedite iterations.

References

Appendix A Appendix

Appendix B Negative Results

Co-training LLM and GNN:

In a similar approach to iterative methods, we investigated co-training the LLM and GNN on the ogbn-arxiv node classification task to facilitate a shared representation space. This proved unfeasible due to the memory requirements exceeding the capacity of one A100 GPU.

Appendix C Implementation of Diffusion Operators

We implement diffusion operators from two methods, Simple-GCN Wu et al. (2019) and SIGN Frasca et al. (2020). In the case of SIGN, the authors omit implementation details of the operators, so we include them here.

Let $A$ denote the adjacency matrix of a possibly directed graph $G$, $X$ its node features, and $D$ the diagonal degree matrix of $G$.

We denote the random-walk normalized adjacency $A_{\text{RW}} \coloneqq AD^{-1}$ and the GCN-normalized adjacency Kipf and Welling (2017)

$$A_{\text{GCN}} \coloneqq \left(D+I\right)^{-1/2}\left(A+I\right)\left(D+I\right)^{-1/2} \tag{4}$$

The Personalized PageRank matrix is then given by Gasteiger et al. (2022):

$$A_{\text{PPR}} \coloneqq \alpha\left(I_{n}-\left(1-\alpha\right)A_{\text{RW}}\right)^{-1} \tag{5}$$

And we denote the triangle-based adjacency matrix by $A_{\Delta}$, where $\left(A_{\Delta}\right)_{ij}$ counts the number of directed triangles in $G$ that contain the edge $(i,j)$.

Diffusion is applied to node features $X$ by matrix multiplication. Simple-GCN takes a power $k$ of $A_{\text{GCN}}$ as its diffusion operator, whilst SIGN diffusion generalizes this to concatenate powers of $A_{\text{GCN}}$, $A_{\text{PPR}}$ and $A_{\Delta}$.

Diffusion can be calculated efficiently if sparse-matrix-sparse-matrix multiplication is avoided. For both SIGN and Simple-GCN, the order of operations for applying a power of an operator $A_{\text{op}}$ should be

$$\underbrace{A_{\text{op}}\left(A_{\text{op}}\left(\dots\left(A_{\text{op}}X\right)\dots\right)\right)}_{k\;\text{times}} \tag{6}$$

as opposed to $\left(A_{\text{op}}^{k}\right)X$, where the operator matrix $A_{\text{op}}$ is feasible to calculate, since the former avoids sparse-matrix-sparse-matrix multiplication. In SIGN, the recursive nature of Eq. 6 can be exploited to reuse results when calculating successive powers.
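The ordering described above can be sketched with SciPy sparse matrices. This is a minimal illustration, not the GraphBLAS implementation published with the paper: the helper names `gcn_normalize` and `diffused_powers`, the random example graph, the use of row sums for $D$, and the choice to include $X$ itself as the zeroth power are all assumptions made for the example.

```python
import numpy as np
import scipy.sparse as sp


def gcn_normalize(A):
    """Return (D + I)^{-1/2} (A + I) (D + I)^{-1/2} as a sparse CSR matrix."""
    n = A.shape[0]
    # Assumption: D holds row sums (out-degrees) of the unweighted graph.
    deg = np.asarray(A.sum(axis=1), dtype=float).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg + 1.0))
    return (d_inv_sqrt @ (A + sp.identity(n, format="csr")) @ d_inv_sqrt).tocsr()


def diffused_powers(A_op, X, k):
    """Return [X, A_op X, ..., A_op^k X] using only sparse-dense products.

    Each power reuses the previous result, so A_op^k is never formed.
    """
    feats = [X]
    for _ in range(k):
        feats.append(A_op @ feats[-1])
    return feats


# Hypothetical example data: a random unweighted directed graph with 50 nodes.
rng = np.random.default_rng(0)
A = sp.random(50, 50, density=0.1, random_state=0, format="csr")
A.data[:] = 1.0
X = rng.normal(size=(50, 8))

A_gcn = gcn_normalize(A)
powers = diffused_powers(A_gcn, X, k=3)

simple_gcn_features = powers[2]    # Simple-GCN style: a single power (k = 2)
sign_features = np.hstack(powers)  # SIGN style: concatenate successive powers
```

Every step multiplies a sparse matrix by a dense feature block, so the cost of each step scales with the number of edges times the feature dimension.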

In the case of Personalized PageRank diffusion, we first use a trick from Gasteiger et al. (2022) to approximate the diffused features $A_{\text{PPR}}X$ in linear time and avoid calculating $A_{\text{PPR}}$ directly, by viewing Eq. 5 as topic-sensitive PageRank Haveliwala (2002). We use the random-walk normalized adjacency matrix.

The following power iteration approximates $A_{\text{PPR}}X$ (notation from Gasteiger et al. (2022)):

$$\begin{aligned} Z^{(0)} &\coloneqq X \\ Z^{(k+1)} &\coloneqq (1-\alpha)A_{\text{RW}}Z^{(k)}+\alpha X \end{aligned}$$

To compute the $n$th diffused power, we repeat the process $n$ times:

$$\begin{aligned} Z^{(0)}_{0} &= X \\ Z^{(0)}_{n+1} &= \lim_{k\to\infty} Z^{(k)}_{n} \end{aligned}$$
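The power iteration can be checked numerically on a small graph. The sketch below is illustrative only: the function names, the example graph, the value $\alpha = 0.15$, the fixed iteration count, and the use of column sums for $D$ are assumptions. It also assumes one particular reading of the repetition step, in which each converged result replaces $X$ (both as starting point and as teleport term) for the next power.

```python
import numpy as np
import scipy.sparse as sp


def random_walk_normalize(A):
    """Return A_RW = A D^{-1}; columns of zero-degree nodes are left as zero."""
    # Assumption: D holds column sums, so that A D^{-1} is column-stochastic.
    deg = np.asarray(A.sum(axis=0), dtype=float).ravel()
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    return (A @ sp.diags(inv_deg)).tocsr()


def ppr_diffuse(A_rw, X, alpha=0.15, num_iters=200):
    """Approximate A_PPR X by iterating Z <- (1 - alpha) A_RW Z + alpha X."""
    Z = X.copy()
    for _ in range(num_iters):
        Z = (1.0 - alpha) * (A_rw @ Z) + alpha * X
    return Z


def ppr_powers(A_rw, X, n, alpha=0.15, num_iters=200):
    """Return [A_PPR X, ..., A_PPR^n X], feeding each converged result back in."""
    out, current = [], X
    for _ in range(n):
        current = ppr_diffuse(A_rw, current, alpha, num_iters)
        out.append(current)
    return out


# Hypothetical example: a small directed graph with four nodes.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [0, 0, 1, 1],
                            [1, 0, 0, 0],
                            [0, 0, 1, 0]], dtype=float))
X = np.arange(8, dtype=float).reshape(4, 2)

A_rw = random_walk_normalize(A)
Z1, Z2 = ppr_powers(A_rw, X, n=2)
```

On a graph this small the result can be compared against the closed form in Eq. 5 by inverting the dense matrix; on large graphs only the iterative form is practical.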

Lastly, for triangle-based diffusion, we count triangles using linear algebra. For unweighted $A$ we perform a single sparse matrix multiplication to obtain $A^{2}$, in which element $(i,j)$ counts the directed paths of length two in $G$ from node $i$ to node $j$. We then calculate

$$A_{\Delta}=A^{T}\odot A^{2}$$

where $\odot$ denotes the Hadamard product, which can be efficiently calculated for sparse matrices. We then normalize and diffuse features over powers of $A_{\Delta}$ in the same fashion as for $A_{\text{GCN}}$.
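As a rough illustration of the triangle-counting step, the sketch below computes $A^{T}\odot A^{2}$ with SciPy on a directed 3-cycle. The function name `triangle_adjacency` and the example graph are hypothetical, and the subsequent normalization step is omitted.

```python
import numpy as np
import scipy.sparse as sp


def triangle_adjacency(A):
    """Return A^T (Hadamard) A^2 for an unweighted adjacency matrix A."""
    A = sp.csr_matrix(A, dtype=np.int64)
    A2 = A @ A                       # the single sparse-sparse product: length-2 path counts
    return A.T.multiply(A2).tocsr()  # element-wise product stays sparse


# Hypothetical example: the directed cycle 0 -> 1 -> 2 -> 0 forms one directed triangle.
A_cycle = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])
A_tri = triangle_adjacency(A_cycle)
```

In this example every nonzero entry of `A_tri` equals 1, because each edge of the cycle closes exactly one length-two path into a directed triangle.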

An implementation of these operators as GraphBLAS Davis (2019) code is published alongside this paper.

C.1 Parallelism of diffusion operators

All operations above can be parallelized across columns of $X$, either keeping $A$ in shared memory on one machine or keeping a copy on each executor in a distributed computing infrastructure like Apache Spark.
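The column-wise decomposition can be sketched on a single machine as follows. This is an assumption-laden illustration rather than the paper's setup: it uses a Python thread pool in place of Spark executors, the names `diffuse` and `diffuse_columnwise` are hypothetical, and it is meant to show that splitting $X$ by columns leaves the result unchanged, not to demonstrate a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import scipy.sparse as sp


def diffuse(A_op, X_block, k):
    """Apply A_op to a block of feature columns k times."""
    for _ in range(k):
        X_block = A_op @ X_block
    return X_block


def diffuse_columnwise(A_op, X, k, num_chunks=4, max_workers=4):
    """Diffuse column blocks of X independently and stitch them back together."""
    blocks = np.array_split(X, num_chunks, axis=1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # A_op is shared read-only by all workers; map preserves block order.
        results = list(pool.map(lambda block: diffuse(A_op, block, k), blocks))
    return np.hstack(results)


# Hypothetical example data.
rng = np.random.default_rng(0)
A_op = sp.random(30, 30, density=0.2, random_state=1, format="csr")
X = rng.normal(size=(30, 8))

X_parallel = diffuse_columnwise(A_op, X, k=3)
X_serial = diffuse(A_op, X, k=3)
```

Because each output column depends only on the matching input column, the two results agree up to floating-point rounding.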

Appendix D Preprocessing & Model Selection for Diffusion Operators

For Simple-GCN Wu et al. (2019), we set the degree $k$ by selecting the highest validation accuracy from $k=2,3,4$, of which $k=2$ had the highest accuracy in each case. For SIGN Frasca et al. (2020), we choose $s$, $p$, $t$ from the highest validation accuracy amongst $(3,0,0)$, $(3,0,1)$, $(3,3,0)$, $(4,2,1)$ and $(5,3,0)$. For Cora and PubMed, $(4,2,1)$ was chosen, and for ogbn-arxiv, ogbn-products, and tape-arxiv23 $(3,3,0)$ was chosen. We chose the number of layers for the Inception MLP to match the number of layers in the other GNNs tested, 4. We did not perform additional hyper-parameter tuning. When preprocessing the embeddings, we centered and scaled the data to unit variance for Simple-GCN and SIGN only.
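The centering and scaling step might look like the following NumPy sketch. The function name `standardize`, the `eps` guard, and the choice to compute statistics per feature over all nodes (rather than over the training split only) are assumptions; the text does not specify them.

```python
import numpy as np


def standardize(X, eps=1e-12):
    """Center each embedding dimension and scale it to unit variance."""
    # Assumption: statistics are computed over all nodes, per feature column.
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    # eps guards against division by zero for constant features.
    return (X - mean) / np.maximum(std, eps)


# Hypothetical example: 100 node embeddings with 5 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(loc=3.0, scale=2.0, size=(100, 5))
embeddings_std = standardize(embeddings)
```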

Appendix E Model Trainable Parameters

Model Trainable Parameter Count
RevGAT 3,457,678
GCN 559,111
SAGE 1,117,063
MLP 117,767
Simple-GCN 24,111
SIGN-(3,3,0) 500,271
SIGN-(4,2,1) 582,575
PEFT 7B LLM >20M
Table 5: Trainable parameter counts for different models. 7B LLM refers to all finetuned LLM embedding models used during experiments (see Section 3.1)

Appendix F Ablation Study

To study the effect each model has on the GNN ensemble step of STAGE, we perform a detailed ablation study. The results are shown in Table 6.

Method Cora PubMed ogbn-arxiv ogbn-products tape-arxiv23
Full Ensemble 0.8824 ± 0.0155 0.9265 ± 0.0068 0.7777 ± 0.0019 0.8140 ± 0.0033 0.8029 ± 0.0020
No MLP 0.8838 ± 0.0039 0.9239 ± 0.0036 0.7748 ± 0.0012 0.8093 ± 0.0021 0.8015 ± 0.0010
No GCN 0.8685 ± 0.0209 0.9240 ± 0.0076 0.7731 ± 0.0017 0.8100 ± 0.0038 0.8028 ± 0.0023
No SAGE 0.8759 ± 0.0207 0.9258 ± 0.0110 0.7739 ± 0.0020 0.8116 ± 0.0045 0.8021 ± 0.0035
No RevGAT 0.8764 ± 0.0180 0.9272 ± 0.0052 0.7717 ± 0.0007 0.8029 ± 0.0036 0.7985 ± 0.0018
Best Individual 0.8722 ± 0.0063 0.9142 ± 0.0122 0.7638 ± 0.0054 0.8083 ± 0.0051 0.7880 ± 0.0023
Best Individual Model SAGE MLP RevGAT RevGAT RevGAT
Table 6: Ablation study results for the ensemble model on various datasets. The table shows the accuracy when each component is removed from the ensemble. The experiment is run over four seeds, with mean accuracy and standard deviation shown. The best results are coloured green (first), yellow (second), and orange (third). For all experiments, we use SFR-Embedding-Mistral as the embedding model on TA features only, and the simple task instruction to bias the embeddings.

Appendix G Datasets

In this section, we describe the characteristics of the node classification datasets we used during our work. The statistics are shown in Table 7.

Dataset Node Count Edge Count Task Metric
Cora McCallum et al. (2000) 2,708 5,429 7-class classif. Accuracy
PubMed Sen et al. (2008) 19,717 44,338 3-class classif. Accuracy
ogbn-arxiv Hu et al. (2021b) 169,343 1,166,243 40-class classif. Accuracy
ogbn-products Hu et al. (2021b) (subset) 54,025 74,420 47-class classif. Accuracy
tape-arxiv23 He et al. (2024) 46,198 78,548 40-class classif. Accuracy
Table 7: Statistics of the TAG datasets

Appendix H Instruction-biased Embeddings

In Table 8 we list the specific instructions used to investigate the effect of biasing embeddings.

Dataset Prompt Type Prompt
ogbn-arxiv, arxiv_2023, cora, pubmed Simple Task Identify the main and secondary category of Arxiv papers based on the titles and abstracts.
ogbn-arxiv, arxiv_2023, cora, pubmed Graph-Aware Identify the main and secondary category of Arxiv papers based on the titles and abstracts. Your predictions will be used in a downstream graph-based prediction that for each paper can learn from your predictions of neighboring papers in a graph as well as the predictions for the paper in question. Papers in the graph are connected if one cites the other.
ogbn-products Simple Task Identify the main and secondary category of this product based on the titles and description.
ogbn-products Graph-Aware Identify the main and secondary category of this product based on the titles and description. Your predictions will be used in a downstream graph-based prediction that for each product can learn from your predictions of neighboring products in a graph as well as the predictions for the paper in question. Products in the graph are connected if they are purchased together.
Table 8: Task descriptions for embedding bias across various datasets.