Graph Neural Networks for Text Classification: A Survey

Kunze Wang, Yihao Ding (The University of Sydney, Australia) and Soyeon Caren Han (The University of Western Australia; The University of Sydney, Australia)
Abstract.

Text classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models apply sequential deep learning techniques, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus-level global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, dealing with the graph construction mechanisms and the graph-based learning process. In addition to the technological survey, we look at open issues and future directions in text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design, and present a summary of published performance on the publicly available benchmarks. Finally, we present a comprehensive comparison between different techniques and identify the pros and cons of various evaluation metrics.

Keywords: Graph Neural Networks, Text Classification, Representation Learning
CCS Concepts: Computing methodologies → Natural language processing

1. Introduction

Text classification aims to classify a given document into certain pre-defined classes and is considered a fundamental task in Natural Language Processing (NLP). It includes a large number of downstream tasks, such as topic classification (Zhang et al., 2015) and sentiment analysis (Tai et al., 2015). Traditional text classification methods build representations of the text using N-grams (Cavnar et al., 1994) or Term Frequency-Inverse Document Frequency (TF-IDF) (Hakim et al., 2014) and apply traditional machine learning models, such as SVM (Joachims, 2005), to classify the documents. With the development of neural networks, more deep learning models have been applied to text classification, including convolutional neural networks (CNN) (Kim, 2014), recurrent neural networks (RNN) (Tang et al., 2015), attention-based models (Vaswani et al., 2017) and large language models (Devlin et al., 2018).

However, these methods are either unable to handle the complex relationships between words and documents (Yao et al., 2019) or cannot efficiently explore context-aware word relations (Zhang et al., 2020). To resolve such obstacles, graph neural networks (GNN) are introduced. GNNs operate on graph-structured data, so a graph needs to be built for text classification. There are two main approaches to constructing graphs: corpus-level graphs and document-level graphs. A dataset is built either into one or more corpus-level graphs representing the whole corpus, or into numerous document-level graphs, each of which represents a single document. The corpus-level graph can capture the global structural information of the entire corpus, while the document-level graph can explicitly capture the word-to-word relationships within a document. Both ways of applying graph neural networks to text classification achieve good performance.

This paper mainly focuses on GNN-based text classification techniques, datasets, and their performance. The graph construction approaches for both corpus-level and document-level graphs are addressed in detail. Papers on the following aspects will be reviewed:

  • GNN-based text classification approaches. Papers that design GNN-based frameworks to enhance the feature representation, or that directly apply GNNs to sequence-level text classification tasks, will be summarized, described and discussed. GNNs applied to token-level classification (Natural Language Understanding) tasks, including NER, slot filling, etc., will not be discussed in this work.

  • Text classification benchmark datasets and the performance of GNN-based models on them. The text classification datasets and the metrics commonly used by GNN-based text classification models will be summarized and categorized based on task types, together with the model performance on these datasets.

1.1. Related Surveys and Our Contribution

Before 2019, text classification survey papers (Xing et al., 2010; Khan et al., 2010; Harish et al., 2010; Aggarwal and Zhai, 2012; Vijayan et al., 2017) focused on traditional machine learning-based text classification models. More recently, with the rapid development of deep learning techniques, (Minaee et al., 2021; Zulqarnain et al., 2020; Zhou, 2020; Li et al., 2022) review various deep learning-based approaches. In addition, some papers not only review SoTA model architectures but also summarize the overall workflow (Jindal et al., 2015; Kadhim, 2019; Mirończuk and Protasiewicz, 2018; Kowsari et al., 2019; Bhavani and Kumar, 2021) or specific techniques for text classification, including word embedding (Selva Birunda and Kanniga Devi, 2021), feature selection (Deng et al., 2019; Shah and Patel, 2016; Pintas et al., 2021), and term weighting (Patra and Singh, 2013; Alsaeedi, 2020). Meanwhile, some promising text classification architectures have been surveyed, such as CNNs (Yang et al., 2016) and attention mechanisms (Mariyam et al., 2021). Owing to their powerful ability to represent non-Euclidean relations, GNNs have been used in multiple practical fields, and this work has been reviewed for, e.g., financial applications (Wang et al., 2021), traffic prediction (Liu and Tan, 2021), bioinformatics (Zhang et al., 2021), power systems (Liao et al., 2021a), and recommendation systems (Gao et al., 2022; Liang et al., 2021; Yang et al., 2021). Moreover, (Bronstein et al., 2017; Battaglia et al., 2018; Zhang et al., 2019; Zhou et al., 2020; Wu et al., 2020) comprehensively review the general algorithms and applications of GNNs, while other surveys focus on specific perspectives, including graph construction (Skarding et al., 2021; Thomas et al., 2022), graph representation (Hamilton et al., 2017), training (Xie et al., 2022), pooling (Liu et al., 2022) and more.
However, only (Minaee et al., 2021; Li et al., 2022) briefly introduce certain SoTA GNN-based text classification models. A recent short review paper (Malekzadeh et al., 2021) covers several SoTA models without providing a comprehensive overview of this area. The contributions of this survey include:

  • This is the first survey focused only on graph neural networks for text classification with a comprehensive description and critical discussion on more than twenty GNN text classification models.

  • We categorize the existing GNN text classification models into two main categories with multiple sub-categories; the tree structure of all the models is shown in Figure 1.

  • We compare these models in terms of graph construction, node embedding initialization, and graph learning methods. We also compare the performance of these models on the benchmark datasets and discuss the key findings.

  • We discuss the existing challenges and some potential future work for GNN text classification models.

1.2. Text Classification Tasks

Text classification involves assigning a pre-defined label to a given text sequence. The process typically involves encoding pre-processed raw text into numerical representations and using classifiers to predict the corresponding categories. Typical sub-tasks include sentiment analysis, topic labelling, news categorization, and hate speech detection. Certain frameworks can be extended to advanced applications such as information retrieval, summarising, question answering, and natural language inference. This paper focuses specifically on GNN-based models used for typical text classification.

  • Sentiment Analysis is a task that aims to identify the emotional states and subjective opinions expressed in the input text, such as reviews, micro-blogs, etc. This can be achieved through binary or multi-class classification. Effective sentiment analysis can aid in making informed business decisions based on user feedback.

  • Topic Classification is a supervised learning task that aims to automatically understand the text content and classify it into one of multiple domain-specific categories, typically more than two. The data may be gathered from different domains, including Wikipedia pages, newspapers, scientific papers, etc.

  • Junk Information Detection involves detecting inappropriate social media content. Social media providers commonly use approaches such as hate speech detection, abusive language detection, and advertising or spam detection to remove such content efficiently.

1.3. Text Classification Development

Many traditional machine learning methods and deep learning models are selected as baselines for comparison with GNN-based text classifiers. We group these baselines into three types:

Traditional Machine Learning: In earlier years, traditional methods such as Support Vector Machines (SVM) (Zhang et al., 2011) and Logistic Regression (Genkin et al., 2007) utilized sparse representations like Bag of Words (BoW) and TF-IDF. However, recent advancements (Lilleberg et al., 2015; Yin and Jin, 2015; Ren et al., 2016) have focused on dense representations, such as Word2vec, GloVe, and Fasttext, to mitigate the limitations of sparse representations. These dense representations are also used as inputs for sophisticated methods, such as Deep Averaging Networks (DAN) (Iyyer et al., 2015) and Paragraph Vector (Doc2Vec) (Le and Mikolov, 2014), to achieve new state-of-the-art results.

Sequential Models: RNNs and CNNs have been utilized to capture local-level semantic and syntactic information of consecutive words from input text bodies. Upgraded models, such as LSTM (Graves, 2012) and GRU (Cho et al., 2014), have been proposed to address the vanishing or exploding gradient problems of vanilla RNNs. CNN-based structures, such as Dynamic CNN (Kalchbrenner et al., 2014) and TextCNN (Kim, 2014), have been applied to capture N-gram features by using one or more convolution and pooling layers. However, these models can only capture local dependencies of consecutive words. To capture longer-term or non-Euclidean relations, improved RNN structures, such as Tree-LSTM (Tai et al., 2015) and MT-LSTM (Liu et al., 2015), and models that incorporate global semantic information, like TopicRNN (Dieng et al., 2016), have been proposed. Additionally, graph (Peng et al., 2018) and tree structure (Mou et al., 2015) enhanced CNNs have been proposed to learn more about global and long-term dependencies.

Attentions and Transformers: Attention mechanisms (Bahdanau et al., 2014) have been widely adopted to capture long-range dependencies, for example in hierarchical attention networks (Abreu et al., 2019) and attention-based hybrid models (Yang et al., 2016). Self-attention-based transformer models have achieved state-of-the-art performance on many text classification benchmarks by pre-training on certain tasks to generate strong contextual word representations. However, these models only focus on learning the relation between input text bodies and ignore the global and corpus-level information. Researchers have therefore proposed combining the benefits of attention mechanisms and Graph Neural Networks (GNNs) to learn both the relation between input text bodies and the global and corpus-level information, as in VGCN-BERT (Lu et al., 2020) and BERTGCN (Lin et al., 2021).

1.4. Outline

The outline of this survey is as follows:

  • Section 1 presents the research questions and provides an overview of applying Graph Neural Networks to text classification tasks, along with the scope and organization of this survey.

  • Section 2 provides background information on text classification and graph neural networks and introduces the key concepts of applying GNNs to text classification from a designer’s perspective.

  • Section 3 and Section 4 discuss previous work on Corpus-level Graph Neural Networks and Document-level Graph Neural Networks, respectively, and provide a comparative analysis of the strengths and weaknesses of these two approaches.

  • Section 5 introduces the commonly used datasets and evaluation metrics in GNN for text classification.

  • Section 6 reports the performance of various GNN models on a range of benchmark datasets for text classification and discusses the key findings.

  • The challenges for the existing methods and some potential future works are discussed in Section 7.

  • In Section 8, we present the conclusions of our survey on GNN for text classification and discuss potential directions for future work.

GNN for Text Classification

  • Corpus-level Graph
      • Word and Document nodes
          • PMI+TF-IDF: TextGCN (Yao et al., 2019); SGC (Wu et al., 2019b); S2GC (Zhu and Koniusz, 2020); NMGC (Lei et al., 2021); TG-Transformer (Zhang and Zhang, 2020); BertGCN (Lin et al., 2021)
          • Multi-Graph/Multi-Dimensional Edge: TensorGCN (Liu et al., 2020); ME-GCN (Wang et al., 2022b)
          • Inductive Learning: HeteGCN (Ragesh et al., 2021); InducT-GCN (Wang et al., 2022a); T-VGAE (Xie et al., 2021)
      • Document Nodes: knn-GCN (Benamira et al., 2019); TextGTL (Li et al., 2021)
      • Word Nodes: VGCN-BERT (Lu et al., 2020)
      • Extra Topic Nodes
          • Single Layer topic nodes: HGAT (Linmei et al., 2019); STGCN (Ye et al., 2020)
          • Multi-layer Topic Node: DHTG (Wang et al., 2020)

  • Document-Level Graph
      • Local word consecutive
          • Simple consecutive graph models: Text-Level-GNN (Huang et al., 2019); MPAD (Nikolentzos et al., 2020); TextING (Zhang et al., 2020)
          • Advanced graph models: MLGNN (Liao et al., 2021b); DADGNN (Liu et al., 2021a); TextSSL (Piao et al., 2021)
      • Global Word Co-occurrence
          • Only co-occurrence: DAGNN (Wu et al., 2019a)
          • With Extra Edges: ReGNN (Li et al., 2019); GFN (Dai et al., 2022)
      • Other Word Graphs: HyperGAT (Ding et al., 2020); IGCN (Tang et al., 2020); GTNT (Mei et al., 2021)

Figure 1. Categorizing the graph neural network text classification models.

2. Backgrounds of GNN

2.1. Definition of Graph

A graph in this paper is represented as $G=(V,E)$, where $V$ and $E$ represent the sets of nodes (vertices) and edges of $G$, respectively. A single node in the node set is represented as $v_{i}\in V$, and $e_{ij}=(v_{i},v_{j})\in E$ denotes an edge between nodes $v_{i}$ and $v_{j}$. The adjacency matrix of graph $G$ is represented as $A$, where $A\in\mathbb{R}^{n\times n}$ and $n$ is the number of nodes in graph $G$. If $e_{ij}\in E$, $A_{ij}=1$, otherwise $A_{ij}=0$. In addition, we use $\boldsymbol{X}$ and $\boldsymbol{E}$ to represent the node and edge representations in graph $G$, where $\boldsymbol{X}\in\mathbb{R}^{n\times m}$ and $\boldsymbol{E}\in\mathbb{R}^{n\times c}$. $\boldsymbol{x}_{i}\in\mathbb{R}^{m}$ represents the $m$-dimensional vector of node $v_{i}$ and $\boldsymbol{e}_{ij}\in\mathbb{R}^{c}$ represents the $c$-dimensional vector of edge $e_{ij}$ (most recent studies set $c=1$ to represent a weighting scalar). $\boldsymbol{A}$ denotes the edge-feature-weighted adjacency matrix.
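As a minimal sketch of these definitions, the snippet below builds the binary adjacency matrix and its weighted counterpart (the case c = 1) from an edge list. The function name `adjacency_matrix`, the toy edges, and the weights are hypothetical, and the graph is assumed to be undirected.

```python
def adjacency_matrix(num_nodes, edges, weights=None):
    """Return the n x n adjacency matrix as a list of lists.

    edges   : list of (i, j) node-index pairs
    weights : optional dict mapping (i, j) to a scalar edge feature (c = 1);
              when omitted, every edge gets the value 1.0
    """
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        w = 1.0 if weights is None else weights[(i, j)]
        A[i][j] = w
        A[j][i] = w  # assumption: undirected graph, so A is symmetric
    return A


# Toy graph with three nodes and two edges: v0 - v1 - v2
edges = [(0, 1), (1, 2)]
A = adjacency_matrix(3, edges)
A_weighted = adjacency_matrix(3, edges, {(0, 1): 0.5, (1, 2): 2.0})
```

Here `A` corresponds to the binary matrix with $A_{ij}=1$ for connected nodes, and `A_weighted` to the edge-feature-weighted matrix.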

2.2. Traditional Graph-based Algorithms

Before GNNs were broadly used for representing irregular relations, traditional graph-based algorithms had been applied to model the non-Euclidean structures in text classification, e.g. Random Walk (Szummer and Jaakkola, 2001; Zhou and Li, 2005), Graph Matching (Schenker et al., 2004; Silva et al., 2014), and Graph Clustering (Matsuo et al., 2006), which have been well summarized in (Wu et al., 2021). These traditional graph-based algorithms share three common limitations. Firstly, most of them focus on capturing graph-level structure information without considering the significance of node and edge features. For example, Random Walk based approaches (Zhou and Li, 2005; Szummer and Jaakkola, 2001) mainly use the distance or angle between node vectors to calculate transition probabilities while ignoring the information represented by the node vectors themselves. Secondly, since the traditional graph-based algorithms are only suitable for specific tasks, there is no unified learning framework for addressing various practical tasks. For example, (Kaur and Kumar, 2018) proposes a graph clustering method that requires a domain knowledge-based ontology graph. Lastly, the traditional graph-based methods are comparatively time-inefficient; for instance, Graph Edit Distance-based graph matching methods have exponential time complexity (Silva et al., 2014).

2.3. Foundations of GNN

To tackle the limitations of traditional graph-based algorithms and better represent non-Euclidean relations in practical applications, Graph Neural Networks were proposed by (Scarselli et al., 2008). GNNs have a unified graph-based framework and simultaneously model the graph structure, node, and edge representations. This section provides the general mathematical definitions of Graph Neural Networks. The general forward process of a GNN can be summarised as follows:

(1) $\boldsymbol{H}^{(l)}=\mathcal{F}(\boldsymbol{A},\boldsymbol{H}^{(l-1)})$

where $\boldsymbol{A}\in\mathbb{R}^{n\times n}$ represents the weighted adjacency matrix and $\boldsymbol{H}^{(l)}\in\mathbb{R}^{n\times d}$ denotes the updated node representations at the $l$-th GNN layer, obtained by feeding the $(l-1)$-th layer node features $\boldsymbol{H}^{(l-1)}\in\mathbb{R}^{n\times k}$ ($k$ is the dimension of the previous layer's node representations) into the pre-defined graph filter $\mathcal{F}$.
The most commonly used graph filtering method is defined as follows:

(2) $\boldsymbol{H}^{(l)}=\phi(\tilde{\boldsymbol{A}}\boldsymbol{H}^{(l-1)}\boldsymbol{W})$

where $\tilde{\boldsymbol{A}}=\boldsymbol{D}^{-\frac{1}{2}}\boldsymbol{A}\boldsymbol{D}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix. $\boldsymbol{A}\in\mathbb{R}^{n\times n}$ is the adjacency matrix of graph $G$ and $\boldsymbol{D}$ is the degree matrix of $\boldsymbol{A}$, where $D_{ii}=\Sigma_{j}A_{ij}$. $\boldsymbol{W}\in\mathbb{R}^{k\times d}$ is the weight matrix and $\phi$ is the activation function. Stacking two GNN layers based on the above filter yields a vanilla Graph Convolutional Network (GCN) (Welling and Kipf, 2016) framework for text classification:

(3) $\boldsymbol{Y}=\mathrm{softmax}(\tilde{\boldsymbol{A}}(\mathrm{ReLU}(\tilde{\boldsymbol{A}}\boldsymbol{H}\boldsymbol{W}^{(0)}))\boldsymbol{W}^{(1)})$

where $\boldsymbol{W}^{(0)}$ and $\boldsymbol{W}^{(1)}$ represent the weight matrices of the two GCN layers and $\boldsymbol{H}$ is the input node features. The $\mathrm{ReLU}$ function provides the non-linearity and $\mathrm{softmax}$ is used to generate the predicted categories $\boldsymbol{Y}$.
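The forward pass in Equations (2) and (3) can be sketched in a few lines of plain Python. This is an illustrative toy, not any surveyed model's implementation: the helper names (`matmul`, `normalize_adjacency`, `gcn_forward`), the fixed weight values, and the one-hot input features are all assumptions, and the toy adjacency matrix includes self-loops, which is common practice for GCNs but is not stated in Equation (2).

```python
import math


def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def normalize_adjacency(A):
    """Compute D^{-1/2} A D^{-1/2} with D_ii = sum_j A_ij."""
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def softmax_rows(M):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(A, H, W0, W1):
    """Two-layer GCN: softmax(A_norm ReLU(A_norm H W0) W1)."""
    A_norm = normalize_adjacency(A)
    hidden = relu(matmul(matmul(A_norm, H), W0))
    return softmax_rows(matmul(matmul(A_norm, hidden), W1))


# Toy example: 3 nodes, one-hot input features, 2 hidden units, 2 classes.
A = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]  # self-loops assumed
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]  # arbitrary illustrative weights
W1 = [[0.7, -0.5], [-0.3, 0.9]]
Y = gcn_forward(A, H, W0, W1)
```

Each row of `Y` is a probability distribution over the two classes for one node; in a real model `W0` and `W1` would be learned by gradient descent rather than fixed.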

2.4. GNN for Text Classification

In this paper, we mainly discuss how GNNs are applied in text classification tasks. Before presenting the specific applications in this area, we first introduce the key concepts of applying GNNs to text classification from a designer’s view. Suppose that addressing a text classification task requires designing a graph $G=(V,E)$. The general procedure includes Graph Construction, Initial Node Representation, Edge Representations, and Training Setup.

2.4.1. Graph Construction

Some applications come with explicit graph structures, including constituency or dependency graphs (Tang et al., 2020), knowledge graphs (Ostendorff et al., 2019; Marin et al., 2014), and social networks (Dai et al., 2022), so there is no need to construct a graph structure or define the corresponding nodes and edges. However, for text classification, the most common graph structures are implicit, which means we need to define a new graph structure for a specific task, such as a word-word or word-document co-occurrence graph. In addition, for text classification tasks, the graph structure can be generally classified into two types:

  • Corpus-level/Document-level. Corpus-level graphs are constructed to represent the whole corpus, as in (Yao et al., 2019; Liu et al., 2020; Lin et al., 2021; Wu et al., 2019b), while document-level graphs focus on representing the non-Euclidean relations existing in a single text body, as in (Chen et al., 2020; Nikolentzos et al., 2020; Zhang et al., 2020). Suppose a specific corpus $\mathcal{C}$ contains a set of documents (text bodies) $\mathcal{C}=\{D_{1},D_{2},...,D_{j}\}$ and each $D_{i}$ contains a set of tokens $D_{i}=\{t_{i_{1}},t_{i_{2}},...,t_{i_{k}}\}$. The vocabulary of $\mathcal{C}$ can be represented as $\mathcal{D}=\{t_{1},t_{2},...,t_{l}\}$, where $l$ is the length of $\mathcal{D}$.
For the most commonly adopted corpus-level graph $G_{corpus}=(V_{corpus},E_{corpus})$, a node $v_{i}$ in $V_{corpus}$ follows $v_{i}\in\mathcal{C}\cup\mathcal{D}$ and the edge $e_{ij}\in E_{corpus}$ is one kind of relation between $v_{i}$ and $v_{j}$.
Regarding the document-level graph $G_{doc_{i}}=(V_{doc_{i}},E_{doc_{i}})$, a node $v_{i_{j}}$ in $V_{doc_{i}}$ follows $v_{i_{j}}\in D_{i}$.
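To make the two graph scales concrete, the following sketch builds a document-level graph from a sliding window over consecutive tokens and a corpus-level graph whose nodes are the documents plus the vocabulary. The function names, the `window` parameter, and the unweighted document-word edges are illustrative assumptions; the surveyed models typically attach weights such as TF-IDF or PMI to these edges and may add word-word edges at the corpus level.

```python
def document_graph(tokens, window=2):
    """Document-level graph: nodes are the unique tokens of one document,
    edges link tokens that co-occur within `window` consecutive positions."""
    nodes = sorted(set(tokens))
    edges = set()
    for pos, tok in enumerate(tokens):
        for other in tokens[pos + 1 : pos + window]:
            if other != tok:
                edges.add(tuple(sorted((tok, other))))  # undirected edge
    return nodes, edges


def corpus_graph(corpus):
    """Corpus-level graph: nodes are all documents plus the vocabulary,
    edges link each document to the words it contains (weights omitted)."""
    vocab = sorted({tok for doc in corpus for tok in doc})
    doc_nodes = [f"doc_{i}" for i in range(len(corpus))]
    edges = {(f"doc_{i}", tok) for i, doc in enumerate(corpus) for tok in doc}
    return doc_nodes + vocab, edges


# Toy corpus of two tokenized documents (hypothetical data).
corpus = [["graph", "neural", "network"], ["neural", "text", "classifier"]]
doc_nodes, doc_edges = document_graph(corpus[0])
corpus_nodes, corpus_edges = corpus_graph(corpus)
```

With this toy corpus, the document-level graph of the first document has three word nodes, while the single corpus-level graph has seven nodes (two documents and five vocabulary words).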

After deciding on the graph scale for the specific task, specifying the graph type is also important for determining the nodes and their relations. For text classification tasks, the commonly used graph types can be summarized as:

  • Homogeneous/Heterogeneous Graphs: homogeneous graphs have a single node type and a single edge type, while heterogeneous graphs have various node and edge types. For a graph $G=(V,E)$, we use $\mathcal{N}^{v}$ and $\mathcal{N}^{e}$ to represent the number of types of $V$ and $E$. If $\mathcal{N}^{v}=\mathcal{N}^{e}=1$, $G$ is a homogeneous graph. If $\mathcal{N}^{v}>1$ or $\mathcal{N}^{e}>1$, $G$ is a heterogeneous graph.

  • Static/Dynamic Graphs: Static graphs use a graph structure constructed in advance from various external or internal information to enhance the initial node representation, such as dependency or constituency graphs (Tang et al., 2020), co-occurrence between word nodes (Zhang et al., 2020), TF-IDF between word and document nodes (Yao et al., 2019; Wu et al., 2019b; Lei et al., 2021) and so on. In contrast, the initial representations or graph topology of a dynamic graph change during training, without requiring domain knowledge or human effort. The feature representations or graph structure can be learned jointly with the downstream task so that they are optimised together. For example, (Wang et al., 2020) proposed a novel topic-aware GNN text classification model with dynamically updated edges between topic nodes and other nodes (e.g. document, word). Piao et al. (Piao et al., 2021) also designed a dynamic edge-based graph to update the contextual dependencies between nodes. Additionally, (Chen et al., 2020) propose a dynamic GNN model that jointly updates the edge and node representations. We provide more details about the above-mentioned models in Section 3 and Section 4.

Another widely used pair of graph categories is directed and undirected graphs, depending on whether the edges are bi-directional or not. For text classification, most GNN designs follow the unidirectional way. In addition, these graph type pairs are not mutually exclusive, which means the types can be combined.

2.4.2. Initial Node Representation

Based on the pre-defined graph structure and specified graph type, selecting appropriate initial node representations is the key step to ensure that the proposed graph structure can effectively learn node representations. According to the node entity type, the existing node representation approaches for text classification can be generally summarised into:

  • Word-level Representation: non-contextual word embedding methods such as GloVe (Pennington et al., 2014), Word2vec (Mikolov et al., 2013), and FastText (Bojanowski et al., 2017) are widely adopted by many GNN-based text classification frameworks to numerically represent the node features. However, those embedding methods are restricted to capturing only syntactic similarity and fail to represent the complex semantic relationships between words. They also cannot capture the meaning of out-of-vocabulary (OOV) words, and their representations are fixed. Therefore, some recent studies select ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), or GPT (Radford et al., 2018) to obtain contextual word-level node representations. Notably, even though one-hot encoding is the simplest word representation method, many GNN-based text classifiers use one-hot encoding and achieve state-of-the-art performance. A few frameworks use randomly initialised vectors to represent the word-level node features.

  • Document-level Representation: similar to other NLP applications, document level representations are normally acquired by aggregating the word level representation via some deep learning frameworks. For example, some researchers select by extracting the last-hidden state of LSTM or using the [CLS] token from BERT to numerically represent the input text body. Furthermore, it is also a commonly used document-level node representation way to use TF-IDF based document vectors.

Most GNN-based text classification frameworks compare the performance of different node representation methods as a quantitative analysis, and justify the effectiveness of the selected initial node representation for the defined graph structure.

2.4.3. Edge Features

Well-defined edge features can effectively improve the efficiency and performance of graph representation learning by exploiting more explicit and implicit relations between nodes. Based on the predefined graph types, edge features can be divided into structural and non-structural features. Structural edge features are acquired from explicit relations between nodes, such as dependency or constituency relations between words, word-word adjacency relations, etc. Those relations are explicitly defined and are also widely employed in other NLP applications. However, the more commonly used edge features are non-structural features, which exist implicitly between nodes and are tailored to specific graph-based frameworks. The typical non-structural edge features were first defined by (Kim, 2014) for GNN-based text classification tasks and include:

  • PMI measures the co-occurrence between two words in a sliding window $W$ and is calculated as:

    (4) $\text{PMI}(i,j) = \log\frac{p(i,j)}{p(i)p(j)}$
    (5) $p(i,j) = \frac{\#W(i,j)}{\#W}$
    (6) $p(i) = \frac{\#W(i)}{\#W}$

    where $\#W$ is the total number of windows, and $\#W(i)$ and $\#W(i,j)$ are the numbers of windows containing word $i$ and both words $i$ and $j$, respectively.

  • TF-IDF is broadly used as the weight of the edges between document-level nodes and word-level nodes.

Beyond those two widely used implicit edge features, some specific edge weighting methods have been proposed to meet the demands of particular graph structures and exploit more information from the input text.
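To make Eqs. (4) to (6) concrete, the following is a minimal sketch of PMI edge weighting over sliding windows. The function name `pmi_edges`, the default window size, the treatment of short documents as a single window, and the choice to keep only positive PMI values are assumptions of this sketch rather than details fixed by the definitions above.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=3):
    """Return {(word_i, word_j): PMI} for word pairs with positive PMI."""
    n_windows = 0
    word_count = Counter()   # #W(i): number of windows containing word i
    pair_count = Counter()   # #W(i, j): number of windows containing both i and j
    for tokens in docs:
        # Assumption: a document shorter than the window forms one window.
        if len(tokens) <= window_size:
            spans = [tokens]
        else:
            spans = [tokens[s:s + window_size]
                     for s in range(len(tokens) - window_size + 1)]
        for span in spans:
            n_windows += 1
            unique = sorted(set(span))
            word_count.update(unique)
            pair_count.update(combinations(unique, 2))
    edges = {}
    for (i, j), n_ij in pair_count.items():
        # log[p(i,j) / (p(i) p(j))] with every p(.) = count / #W
        pmi = math.log(n_ij * n_windows / (word_count[i] * word_count[j]))
        if pmi > 0:  # assumption: drop non-positive associations
            edges[(i, j)] = pmi
    return edges

docs = [["graph", "neural", "network"],
        ["graph", "neural", "model"],
        ["text", "model"]]
edges = pmi_edges(docs, window_size=3)
```

Pairs are stored once with the two words in alphabetical order, so the resulting weights describe an undirected word-word graph.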

2.4.4. Training Setup

After specifying the graph structure and types, the graph representation learning task and training setting must also be determined in order to decide how to optimise the designed GNNs. Generally, graph representation learning tasks can be categorised into three levels: node-level, graph-level and edge-level. Node-level and graph-level tasks involve node or graph classification, clustering, regression, etc., while edge-level tasks include link prediction and edge classification, which predict whether a relation exists between two nodes or the category of the corresponding edge.

Similar to other deep learning models, GNN training can be divided into supervised, semi-supervised and unsupervised settings. Supervised training uses labelled training data, while unsupervised training uses unlabelled data. Compared with both, semi-supervised learning methods are more broadly used by GNNs designed for text classification, and they can be classified into two types:

  • Inductive Learning adjusts the weights of the proposed GNN on a labelled training set only, learning overall statistics that yield a general model for subsequent processing. The unlabelled set can then be fed into the trained GNN to compute the expected outputs.

  • Transductive Learning exploits the labelled and unlabelled sets simultaneously during training, leveraging the relations between different samples to improve overall performance.

Table 1. Commonly used notations in Graph Neural Networks
Notations Descriptions
$G$ A graph.
$V$ The set of nodes in a graph.
$E$ The set of edges in a graph.
$e_{ij}$ An edge between node $i$ and node $j$.
$N_{i}$ The neighbors of a node $i$.
$\bm{A}$ The graph adjacency matrix.
$\tilde{\bm{A}}$ The normalized matrix $\bm{A}$.
$\tilde{\bm{A}}^{k}, k \in Z$ The $k^{th}$ power of $\tilde{\bm{A}}$.
$[\bm{A}||\bm{B}]$ The concatenation of $\bm{A}$ and $\bm{B}$.
$\bm{D}$ The degree matrix of $\bm{A}$. $\bm{D}_{ii} = \Sigma_{j=1}^{n}\bm{A}_{ij}$.
$\bm{W^{(l)}}$ The weight matrix of layer $l$.
$\bm{H} \in \bm{R}^{n \times d}$ The feature matrix of a graph.
$\bm{H^{(l)}} \in \bm{R}^{n \times d}$ The feature matrix of a graph at layer $l$.
$\bm{h_{i}} \in \bm{R}^{n}$ The feature vector of the node $i$.
$\bm{h_{i}^{(l)}} \in \bm{R}^{n}$ The feature vector of the node $i$ at layer $l$.
$\bm{Z} \in \bm{R}^{n \times d}$ The output feature matrix of a graph.
$\bm{z_{i}} \in \bm{R}^{n}$ The output feature vector of the node $i$.
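
As an illustration of the notation in Table 1, the sketch below computes the degree matrix entries, a normalized adjacency matrix and one of its powers. Table 1 does not state which normalization is meant; the symmetric form $\bm{D}^{-1/2}\bm{A}\bm{D}^{-1/2}$ commonly used with GCN is assumed here, and the function name `normalize_adjacency` is hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} (an assumed, common choice)."""
    degrees = A.sum(axis=1)                        # D_ii = sum_j A_ij
    inv_sqrt = np.zeros_like(degrees, dtype=float)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = degrees[nonzero] ** -0.5   # isolated nodes keep a zero scale
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

# A small path graph with self-loops on the diagonal.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
A_norm = normalize_adjacency(A)
A_norm_2 = np.linalg.matrix_power(A_norm, 2)       # the k-th power with k = 2
```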

3. Corpus-level GNN for Text Classification

We define a corpus-level Graph Neural Network as “constructing a graph to represent the whole corpus”. Thus, only one or several graphs are built for the given corpus. We categorize corpus-level GNNs into four subcategories based on the types of nodes in the graph.

3.1. Document and Word Nodes as a Graph

Most corpus-level graphs include word nodes and document nodes and there are word-document edges and word-word edges. By applying a $K$-layer GNN (normally $K$ = 2 or 3), word nodes will serve as a bridge to propagate the information from one document node to another.

3.1.1. PMI and TF-IDF as graph edges: TextGCN, SGC, S2GC, NMGC, TG-Transformer, BertGCN

TextGCN(Yao et al., 2019) Yao et al. (2019) build a corpus-level graph with training document nodes, test document nodes and word nodes. Before constructing the graph, a common preprocessing method(Kim, 2014) is applied, and words that occur fewer than 5 times or appear in the NLTK(Bird et al., 2009) stopword list are removed. The edge value between a document node and a word node is TF-IDF, and that between two word nodes is PMI. The adjacency matrix of this graph is defined as follows.

(7) $A_{ij} = \begin{cases} \text{PMI}(i,j) & i,j \text{ are words}, \text{PMI}(i,j) > 0 \\ \text{TF-IDF}_{i,j} & i \text{ is document}, j \text{ is word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases}$

A two-layer GCN is applied to the graph, and the dimension of the second-layer output equals the number of classes in the dataset. Formally, the forward propagation of TextGCN is:

(8) $\bm{Z} = \text{softmax}(\tilde{\bm{A}}(\text{ReLU}(\tilde{\bm{A}}\bm{X}\bm{W}^{(0)}))\bm{W}^{(1)})$

where $\tilde{\bm{A}}$ is the normalized adjacency matrix of $\bm{A}$ and $\bm{X}$ is the one-hot embedding. $\bm{W}^{(0)}$ and $\bm{W}^{(1)}$ are learnable parameters of the model. The representation of the training documents is used to calculate the loss and that of the test documents is used for prediction. TextGCN is the first work that treats a text classification task as a node classification problem by constructing a corpus-level graph, and it has inspired many following works.
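
The forward pass in Eq. (8) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the toy graph, the edge weights, the hidden size and the random weights are made up, symmetric normalization is assumed, and no training loop is shown.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def textgcn_forward(A_norm, X, W0, W1):
    """Two-layer GCN: softmax(A_norm ReLU(A_norm X W0) W1)."""
    H = np.maximum(A_norm @ X @ W0, 0.0)   # first layer with ReLU
    return softmax(A_norm @ H @ W1)        # second layer, one column per class

rng = np.random.default_rng(0)
n_nodes, hidden, n_classes = 5, 4, 2       # nodes 0-1: documents, 2-4: words
A = np.eye(n_nodes)                        # A_ii = 1 (self-loops)
A[0, 2] = A[2, 0] = 0.7                    # document-word TF-IDF (made-up values)
A[0, 3] = A[3, 0] = 0.4
A[1, 3] = A[3, 1] = 0.5
A[1, 4] = A[4, 1] = 0.9
A[2, 3] = A[3, 2] = 0.3                    # word-word positive PMI (made-up value)

d = A.sum(axis=1)                          # degrees are positive due to self-loops
A_norm = A / np.sqrt(np.outer(d, d))       # assumed normalisation D^{-1/2} A D^{-1/2}
X = np.eye(n_nodes)                        # one-hot node features
W0 = rng.normal(size=(n_nodes, hidden))
W1 = rng.normal(size=(hidden, n_classes))
Z = textgcn_forward(A_norm, X, W0, W1)     # row i: class distribution of node i
```

In a transductive setup, only the rows of `Z` that belong to training documents would enter the loss, while the rows of test documents are read off as predictions.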

Based on TextGCN, several works follow the same graph construction method and node initialization but apply different graph propagation models.

SGC(Wu et al., 2019b) To make GCN efficient, SGC (Simple Graph Convolution) removes the nonlinear activation function in GCN layers. Therefore, the $K$-layer propagation of SGC is:

(9) $\bm{Z} = \text{softmax}(\tilde{\bm{A}} \ldots (\tilde{\bm{A}}(\tilde{\bm{A}}\bm{X}\bm{W}^{(0)})\bm{W}^{(1)}) \ldots \bm{W}^{(K)})$

which can be reparameterized into

(10) $\bm{Z} = \text{softmax}(\tilde{\bm{A}}^{K}\bm{X}\bm{W})$

where $K$ is 2 when applied to text classification tasks. With a smaller number of parameters and only one feedforward layer, SGC saves computation time and resources while improving performance.
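
The reparameterization from Eq. (9) to Eq. (10) follows from associativity of matrix multiplication once the activations are removed. The sketch below checks this numerically on random data; it uses one weight matrix per layer, omits the final softmax, and all sizes and values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, c, K = 6, 5, 4, 3, 2

# Random symmetric graph with self-loops, symmetrically normalised (assumption).
A = rng.random((n, n))
A = (A + A.T) / 2 + np.eye(n)
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))

X = rng.normal(size=(n, d))
W_layers = [rng.normal(size=(d, h)), rng.normal(size=(h, c))]

# Layer-by-layer propagation without any nonlinearity.
H = X
for W_k in W_layers:
    H = A_norm @ H @ W_k

# Collapsed form: A_norm^K X can be precomputed once before training,
# leaving a single linear classifier with W = W_layers[0] W_layers[1].
S = np.linalg.matrix_power(A_norm, K) @ X
W = W_layers[0] @ W_layers[1]
logits = S @ W
```

Because `S` does not depend on any learnable parameter, training reduces to fitting a linear classifier on fixed features.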

S2GC(Zhu and Koniusz, 2020) To solve the oversmoothing issues in GCN, Zhu and Koniusz (2020) propose Simple Spectral Graph Convolution (S2GC), which includes self-loops using the Markov Diffusion Kernel. The output of S2GC is calculated as:

(11) $\bm{Z} = \text{softmax}(\frac{1}{K}\Sigma_{k=0}^{K}\tilde{\bm{A}}^{k}\bm{X}\bm{W})$

and can be generalized into:

(12) $\bm{Z} = \text{softmax}(\frac{1}{K}\Sigma_{k=0}^{K}((1-\alpha)\tilde{\bm{A}}^{k}\bm{X} + \alpha\bm{X})\bm{W})$

Similarly, $K$ = 2 on text classification tasks and $\alpha$ denotes the trade-off between self-information of the node and consecutive neighbourhood information. S2GC can also be viewed as introducing skip connections into GCN.
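
As a sketch of Eq. (12), the function below accumulates the propagated features before the linear layer and softmax are applied. It follows the equation as written above (sum from $k=0$ to $K$, scaled by $1/K$); the function name `s2gc_features`, the default value of `alpha` and the toy inputs are assumptions.

```python
import numpy as np

def s2gc_features(A_norm, X, K=2, alpha=0.05):
    """(1/K) * sum_{k=0..K} ((1 - alpha) * A_norm^k X + alpha * X)."""
    total = np.zeros_like(X, dtype=float)
    A_k_X = X.astype(float)                   # A_norm^0 X = X
    for _ in range(K + 1):
        total += (1.0 - alpha) * A_k_X + alpha * X
        A_k_X = A_norm @ A_k_X                # move on to the next power
    return total / K

# Toy example: 3 nodes, 2 features, a row-normalised dense graph.
A_norm = np.full((3, 3), 1.0 / 3.0)
X = np.arange(6, dtype=float).reshape(3, 2)
features = s2gc_features(A_norm, X, K=2, alpha=0.05)
# The class scores would then be softmax(features @ W) for a learnable W.
```

Setting `alpha` to 0 recovers Eq. (11), while a larger `alpha` weights each node's own input features more heavily.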

NMGC(Lei et al., 2021) Rather than summing the GCN layers as in S2GC, NMGC applies min pooling using the Multi-hop neighbour Information Fusion (MIF) operator to address oversmoothing problems. A MIF function is defined as:

(13) $\text{MIF}(K) = \text{min}(\tilde{\bm{A}}\bm{X}\bm{W}, \tilde{\bm{A}}^{2}\bm{X}\bm{W}, \ldots, \tilde{\bm{A}}^{K}\bm{X}\bm{W})$

NMGC-K first applies a MIF($K$) layer and then a GCN layer, where $K$ is 2 or 3. For example, when $K$ = 3, the output is:

(14) $\bm{Z} = \text{softmax}(\tilde{\bm{A}}(\text{ReLU}(\text{min}(\tilde{\bm{A}}\bm{X}\bm{W}^{(0)}, \tilde{\bm{A}}^{2}\bm{X}\bm{W}^{(0)}, \tilde{\bm{A}}^{3}\bm{X}\bm{W}^{(0)})))\bm{W}^{(1)})$

NMGC can also be treated as a skip-connection in Graph Neural Networks, which makes the shallow layers of the GNN contribute to the final representation directly.
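
A minimal sketch of the MIF operator in Eq. (13) is given below. It assumes that the minimum is taken element-wise across the $K$ propagated feature matrices, which the equation suggests but does not state explicitly; the function name `mif` and the random inputs are illustrative.

```python
import numpy as np

def mif(A_norm, X, W, K):
    """Element-wise minimum over A_norm^k X W for k = 1..K (assumed reading of Eq. 13)."""
    hop = A_norm @ X @ W              # k = 1
    fused = hop
    for _ in range(2, K + 1):
        hop = A_norm @ hop            # A^k X W = A (A^{k-1} X W)
        fused = np.minimum(fused, hop)
    return fused

rng = np.random.default_rng(1)
A_norm = rng.random((4, 4))           # stand-in for a normalised adjacency matrix
X = rng.random((4, 3))
W = rng.random((3, 2))
H = np.maximum(mif(A_norm, X, W, K=3), 0.0)   # ReLU applied after the MIF layer
```

A GCN layer and a softmax applied to `H` would complete the NMGC-3 model of Eq. (14).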

TG-Transformer(Zhang and Zhang, 2020) TextGCN treats the document nodes and word nodes as the same type of node during propagation. To introduce heterogeneity into the TextGCN graph, TG-Transformer (Text Graph Transformer) adopts two sets of weights, for document nodes and word nodes respectively. To cope with a large corpus graph, subgraphs are sampled from the TextGCN graph using the PageRank algorithm(Page et al., 1999). The input embedding of each node is the sum of three types of embedding: pretrained GloVe embedding, node type embedding, and Weisfeiler-Lehman structural encoding(Niepert et al., 2016). During propagation, self-attention(Vaswani et al., 2017) with graph residual(Zhang and Meng, 2019) is applied.

BertGCN(Lin et al., 2021) To combine BERT(Kenton and Toutanova, 2019) and TextGCN, BertGCN enhances TextGCN by replacing the document node initialization with the BERT [CLS] output of each epoch and replacing the word input vectors with zeros. BertGCN trains BERT and TextGCN jointly by interpolating the outputs of TextGCN and BERT:

(15) $\bm{Z} = \lambda\bm{Z}_{GCN} + (1-\lambda)\bm{Z}_{BERT}$

where $\lambda$ is the trade-off factor. To optimize memory usage during training, a memory bank is used to track the document inputs, and a smaller learning rate is set for the BERT module to maintain the consistency of the memory bank. BertGCN shows that, with the help of TextGCN, BERT can achieve better performance.
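
The interpolation in Eq. (15) is a convex combination of the two modules' predictions, as the short sketch below illustrates. The function name `bertgcn_interpolate`, the value of `lam` and the probability values are made up for illustration.

```python
import numpy as np

def bertgcn_interpolate(Z_gcn, Z_bert, lam=0.7):
    """Eq. (15): lam * Z_GCN + (1 - lam) * Z_BERT."""
    return lam * Z_gcn + (1.0 - lam) * Z_bert

# Class distributions for two documents from each module (made-up values).
Z_gcn = np.array([[0.9, 0.1], [0.2, 0.8]])
Z_bert = np.array([[0.6, 0.4], [0.4, 0.6]])
Z = bertgcn_interpolate(Z_gcn, Z_bert, lam=0.7)
```

With `lam` = 1 only the GCN prediction is used, and with `lam` = 0 the model reduces to the BERT classifier. When both inputs are probability distributions, each row of `Z` remains one.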

3.1.2. Multi-Graphs/Multi-Dimensional Edges: TensorGCN, ME-GCN

TensorGCN(Liu et al., 2020) Instead of constructing a single corpus-level graph, TensorGCN builds three independent graphs, a Semantic-based graph, a Syntactic-based graph, and a Sequential-based graph, to incorporate semantic, syntactic and sequential information respectively, and combines them into a tensor graph.

The three graphs share the same set of TF-IDF values for the word-document edges but have different values for the word-word edges. The Semantic-based graph extracts semantic features from a trained Long Short-Term Memory (LSTM)(Hochreiter and Schmidhuber, 1997) model and connects words sharing high similarity. The Syntactic-based graph uses the Stanford CoreNLP parser(Manning et al., 2014) and constructs edges between words that have a higher probability of being in a dependency relation. For the Sequential-based graph, PMI values are applied as in TextGCN.

The propagation includes intra-graph propagation and inter-graph propagation. The model first applies a GCN layer on the three graphs separately as intra-graph propagation. Then the copies of the same node on the three graphs are treated as a virtual graph, and another GCN layer is applied as inter-graph propagation.

ME-GCN(Wang et al., 2022b) To fully utilize the corpus information and analyze the rich relational information of the graph, ME-GCN (Multi-dimensional Edge-Embedded GCN) builds a graph with multi-dimensional word-word, word-document and document-document edges. Word2vec and Doc2vec embeddings are first trained on the given corpus, and the similarity in each dimension of the trained embeddings is used to construct the multi-dimensional edges. The trained embeddings also serve as the input embeddings of the graph nodes. During propagation, GCN is first applied on each dimension, and the representations on different dimensions are either concatenated or fed into a pooling method to obtain the final representations of each layer.

3.1.3. Making TextGCN Inductive: HeteGCN, InducT-GCN, T-VGAE

HeteGCN(Ragesh et al., 2021) HeteGCN (Heterogeneous GCN) optimizes TextGCN by decomposing the TextGCN undirected graph into several directed subgraphs. Several subgraphs from the TextGCN graph are combined sequentially as different layers: feature graph (word-word graph), feature-document graph (word-document graph), and document-feature graph (document-word graph). Different combinations were tested and the best model is:

(16) $\bm{Z} = \text{softmax}(\bm{A}_{w-d}(\text{ReLU}(\bm{A}_{w-w}\bm{X}_{w}\bm{W}^{(0)}))\bm{W}^{(1)})$

where $\bm{A}_{w-w}$ and $\bm{A}_{w-d}$ are the adjacency matrices of the word-word subgraph and the word-document subgraph. Since the input of HeteGCN is the word node embeddings without using document nodes, it can also work in an inductive way, while the previous corpus-level graph text classification models are all transductive models.
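
The following sketch of Eq. (16) illustrates why this layer ordering supports inductive use: a document enters the model only through its row of $\bm{A}_{w-d}$. The shape convention (documents as rows, words as columns), the toy matrices and the random weights are assumptions of this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hetegcn_forward(A_wd, A_ww, X_w, W0, W1):
    """softmax(A_wd ReLU(A_ww X_w W0) W1), with A_wd of shape (n_docs, n_words)."""
    H_w = np.maximum(A_ww @ X_w @ W0, 0.0)   # word representations from the word-word graph
    return softmax(A_wd @ H_w @ W1)          # aggregate words into document predictions

rng = np.random.default_rng(0)
n_words, hidden, n_classes = 4, 3, 2
A_ww = np.eye(n_words) + 0.2 * np.ones((n_words, n_words))  # stand-in for PMI weights
X_w = np.eye(n_words)                                       # one-hot word features
W0 = rng.normal(size=(n_words, hidden))
W1 = rng.normal(size=(hidden, n_classes))

A_wd_train = rng.random((3, n_words))   # TF-IDF-like rows of 3 training documents
A_wd_new = rng.random((1, n_words))     # an unseen document, available only at test time

Z_train = hetegcn_forward(A_wd_train, A_ww, X_w, W0, W1)
Z_new = hetegcn_forward(A_wd_new, A_ww, X_w, W0, W1)
```

The prediction for the unseen document requires no change to the word-word graph or to the learned weights.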

InducT-GCN(Wang et al., 2022a) InducT-GCN (InducTive Text GCN) aims to extend the transductive TextGCN into an inductive model. Instead of using the whole corpus to build the graph, InducT-GCN builds a training corpus graph and uses TF-IDF vectors as the input embeddings of the documents, which aligns them with the one-hot word embeddings. The weights are learned following TextGCN, but InducT-GCN builds virtual subgraphs for prediction on new test documents.

T-VGAE(Xie et al., 2021) T-VGAE (Topic Variational Graph Auto-Encoder) applies a Variational Graph Auto-Encoder on the latent topic of each document to make the model inductive. A vocabulary graph $A_{v}$, which connects the words using PMI values, is constructed, while each document is represented by its TF-IDF vector. All the document vectors are stacked into a matrix, which can also be treated as a bipartite graph $A_{d}$. Two graph auto-encoder models are applied on $A_{v}$ and $A_{d}$ respectively. The overall workflow is:

(17) $\bm{Z}_{v} = \text{Encoder}_{GCN}(\bm{A}_{v}, \bm{X}_{v})$
(18) $\bm{Z}_{d} = \text{Encoder}_{UDMP}(\bm{A}_{d}, \bm{Z}_{v})$
(19) $\bm{A}_{v}^{*} = \text{Decoder}(\bm{Z}_{v})$
(20) $\bm{A}_{d}^{*} = \text{Decoder}(\bm{Z}_{d}, \bm{Z}_{v})$

where $\bm{X}_{v}$ is an identity matrix. The $\text{Encoder}_{GCN}$ and the decoders are applied following VGAE(Kipf and Welling, 2016), while $\text{Encoder}_{UDMP}$ is a unidirectional message passing variant of $\text{Encoder}_{GCN}$. The training objective is to minimise the reconstruction error, and $\bm{Z}_{d}$ is used for the classification task.

3.2. Document Nodes as a Graph

To show the global structure of the corpus directly, some models only adopt document nodes in the non-heterogeneous graph.

knn-GCN(Benamira et al., 2019) knn-GCN constructs a k-nearest-neighbours graph by connecting each document with its $K$ nearest neighbours, using the Euclidean distances between the document embeddings. The embedding is generated in an unsupervised way: either using the mean of pretrained GloVe word vectors or applying LDA(Blei et al., 2003). Both GCN and Attention-based GNN(Thekumparampil et al., 2018) are used as the graph model.
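
The graph construction step can be sketched as follows. The document embeddings are assumed to be given (for example, averaged word vectors); the binary edge weights, the symmetrization step and the function name `knn_graph` are assumptions of this sketch rather than details taken from the knn-GCN paper.

```python
import numpy as np

def knn_graph(embeddings, k):
    """Symmetric 0/1 adjacency linking each document to its k nearest neighbours."""
    n = len(embeddings)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                # a document is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest documents
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, neighbours.ravel()] = 1.0
    return np.maximum(A, A.T)                     # make the graph undirected (assumption)

# Four toy document embeddings forming two tight pairs.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
A = knn_graph(emb, k=1)
```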

TextGTL(Li et al., 2021) Similar to TensorGCN, TextGTL (Text-oriented Graph-based Transductive Learning) constructs three different document graphs: Semantics Text Graph, Syntax Text Graph, and Context Text Graph, while all the graphs are non-heterogeneous. Semantics Text Graph uses Generalized Canonical Correlation Analysis(Bach and Jordan, 2002) and trains a classifier to determine the edge values between two document nodes. Syntax Text Graph uses the Stanford CoreNLP dependency parser(Manning et al., 2014) to construct units and also trains a classifier. Context Text Graph defines the edge values by summing up the PMI values of the overlapping words in two documents. Two GCN layers are applied, and the outputs of the three graphs are mixed as the output of this layer and the input for the next layer for all three graphs:

(21) $\bm{H}^{(1)} = \sigma(\bm{A}\bm{H}^{(0)}\bm{W}^{(0)})$
(22) $\bm{H}^{(2)} = \sigma(\bm{A}[\bm{H}_{sem}^{(1)}||\bm{H}_{syn}^{(1)}||\bm{H}_{seq}^{(1)}]\bm{W}^{(1)})$
(23) $\bm{Z} = \text{Pooling}_{mean}(\bm{H}_{sem}^{(2)}, \bm{H}_{syn}^{(2)}, \bm{H}_{seq}^{(2)})$

where $\bm{H}^{(0)}$ is the TF-IDF vector of the documents. Data augmentation with super nodes is also applied in TextGTL to strengthen the information in graph models.

3.3. Word Nodes as a Graph

By neglecting the document nodes in the graph, a graph with only word nodes shows good performance in deriving the graph-based embedding and is used for downstream tasks. Since no document nodes are included, this method can be easily adapted as an inductive learning model.

VGCN-BERT(Lu et al., 2020) VGCN-BERT enhances the input embedding of BERT by concatenating it with the graph embedding. It first constructs a vocabulary graph and uses PMI as the edge value. A variant of the GCN layer called VGCN (Vocabulary GCN) is applied to derive the graph word embedding:

(24) $\bm{X}_{Graph} = \text{ReLU}(\bm{X}_{BERT}\bm{A}\bm{W}^{(0)})\bm{W}^{(1)}$

where the BERT embedding is used as the input. The graph word embeddings are concatenated with the BERT embedding and fed into BERT as extra information.

3.4. Extra Topic Nodes in the Graph

Topic information of each document can also provide extra information in corpus-level graph neural networks. Several models also include topic nodes in the graph.

3.4.1. Single Layer Topic nodes: HGAT, STGCN

HGAT(Linmei et al., 2019) HGAT (Heterogeneous GAT) applies LDA(Blei et al., 2003) to extract topic information for each document; the top $P$ topics with the largest probabilities are connected to the document. Instead of using the words directly, HGAT utilizes external knowledge by applying the entity linking tool TAGME (https://sobigdata.d4science.org/group/tagme/) to identify the entities in the document and connect them. The semantic similarity between entities, computed with pretrained Word2vec and filtered by a threshold, defines the connectedness between entity nodes. Since the graph is heterogeneous, a HIN (heterogeneous information network) model is implemented, which propagates separately on each sub-graph depending on the type of node. An HGAT model is then applied with type-level attention and node-level attention. For a given node, the type-level attention learns the weights of different types of neighbouring nodes, while the node-level attention captures the importance of different neighbouring nodes when ignoring the type. With this dual attention mechanism, HGAT can capture type and node information at the same time.

STGCN(Yan et al., 2013) For short text classification, STGCN (Short-Text GCN) applies BTM to obtain topic information, avoiding the data sparsity problem of LDA. The graph is constructed following TextGCN, with extra topic nodes included. The edge values of word-topic and document-topic edges come from BTM, and a classical two-layer GCN is applied. The word embeddings learned by STGCN are concatenated with BERT embeddings, and a bi-LSTM model is applied for the final prediction.

3.4.2. Multi-layer Topic Nodes: DHTG

DHTG(Wang et al., 2020) To capture different levels of information, DHTG (Dynamic Hierarchical Topic Graph) introduces hierarchical topic-level nodes in the graph, from fine-grained to coarse. The Poisson gamma belief network (PGBN)(Zhou et al., 2015) is used as a probabilistic deep topic model. The first-layer topics are combinations of words, while deeper layers are generated from the previous layers’ topics with the weights of PGBN, and these weights serve as the edge values between layers of topics. For topics on the same layer, the cosine similarity is chosen as the edge value. A two-layer GCN is applied and the model is learned jointly with PGBN, which makes the edges of the topics dynamic.

3.5. Critical Analysis

Compared with sequential models like CNN and LSTM, corpus-level GNNs are able to capture the global corpus structure information, with word nodes acting as bridges between document nodes, and perform well without using external resources like pretrained embeddings or pretrained models. However, the improvement in performance is marginal when pretrained embeddings are included. Another issue is that most corpus-level GNNs are transductive: test documents must already be part of the graph during training, which limits their applicability to real-world settings where new documents arrive after training. Meanwhile, constructing the whole corpus into a graph requires large memory space, especially when the dataset is large.

A detailed comparison of corpus-level GNN is displayed in Table 2.

4. Document-level GNN for Text Classification

By constructing the graph based on each document, a graph classification model can be used as a text classification model. Since each document is represented by one graph and new graphs can be built for test documents, the model can easily work in an inductive way.

4.1. Local Word Consecutive Graph

The simplest way to convert a document into a graph with words as nodes is by connecting the consecutive words within a sliding window.
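
As a minimal sketch of this construction, the following Python function links every pair of distinct words that co-occur within a sliding window. The function name `build_window_graph`, the undirected edge representation and the convention that `window_size` counts the tokens covered by one window are assumptions made for illustration; individual models differ in these details.

```python
def build_window_graph(tokens, window_size=3):
    """Build a document-level word graph from a token sequence.

    Returns (nodes, edges), where nodes is the sorted list of unique tokens
    and edges is a set of undirected pairs (u, v) with u < v for tokens that
    fall inside the same sliding window of `window_size` tokens.
    """
    nodes = sorted(set(tokens))
    edges = set()
    for i, u in enumerate(tokens):
        # Pair token i with the tokens that follow it inside the same window.
        for j in range(i + 1, min(i + window_size, len(tokens))):
            v = tokens[j]
            if u != v:  # no self-loops for repeated words
                edges.add((min(u, v), max(u, v)))
    return nodes, edges


nodes, edges = build_window_graph("the cat sat on the mat".split(), window_size=2)
```

With `window_size=2` only directly adjacent words are linked. Because repeated words share one node, the graph is usually much smaller than the token sequence.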

4.1.1. Simple consecutive graph models: Text-Level-GNN, MPAD, TextING

Text-Level-GNN(Huang et al., 2019) Text-Level-GNN applies a small sliding window and constructs the graph with a small number of nodes and edges in each graph, which saves memory and computation time. The edge value is trainable and shared across the graphs when connecting the same two words, which also brings global information.

Unlike corpus-level graph models, Text-Level-GNN applies a message passing mechanism (MPM)(Gilmer et al., 2017) instead of GCN for graph learning. For each node, the neighbour information is aggregated using max-pooling with trainable edge values as the AGGREGATE function and then the weighted sum is used as the COMBINE function. To get the representation of each graph, sum-pooling and an MLP classifier are applied as the READOUT function. The propagation is formulated as:

(25) $\bm{h}^{(l+1)}_{i} = (1-\alpha)\left(\max_{n\in\mathcal{N}_{i}} e_{ni}\bm{h}^{(l)}_{n}\right) + \alpha\bm{h}^{(l)}_{i}$
(26) $\bm{z}_{i} = \text{softmax}\left(\bm{W}\sum_{i}\bm{h}_{i} + \bm{b}\right)$

where $\bm{h}^{(l)}_{i}$ is the representation of the $i$-th word node at layer $l$ and $e_{ni}$ is the edge weight from node $n$ to node $i$. A two-layer MPM is applied, and the input of each graph is pretrained GloVe vectors.
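
The two equations above can be sketched in plain Python as follows. This is an illustrative reading of Eq. (25) and Eq. (26) only: the function names, the dictionary-based graph representation, the scalar `alpha` and the zero message for isolated nodes are assumptions, and in the actual model the edge weights and classifier parameters are trained rather than fixed.

```python
import math


def mpm_layer(H, neighbours, edge_weight, alpha=0.5):
    """One propagation step following Eq. (25).

    H: list of node vectors; neighbours: dict node -> list of neighbour ids;
    edge_weight: dict (n, i) -> scalar e_ni for the edge from n to i.
    """
    dim = len(H[0])
    H_next = []
    for i, h_i in enumerate(H):
        nbrs = neighbours.get(i, [])
        if nbrs:
            # AGGREGATE: element-wise max over edge-weighted neighbour vectors.
            agg = [max(edge_weight[(n, i)] * H[n][d] for n in nbrs) for d in range(dim)]
        else:
            agg = [0.0] * dim  # assumption: isolated nodes receive a zero message
        # COMBINE: weighted sum of the aggregated message and the node itself.
        H_next.append([(1 - alpha) * a + alpha * h for a, h in zip(agg, h_i)])
    return H_next


def readout(H, W, b):
    """Sum-pool the node vectors and apply a softmax classifier, as in Eq. (26)."""
    dim = len(H[0])
    pooled = [sum(h[d] for h in H) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) + b_c for row, b_c in zip(W, b)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


H = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
neighbours = {0: [1, 2], 1: [0], 2: [0]}
edge_weight = {(1, 0): 1.0, (2, 0): 1.0, (0, 1): 1.0, (0, 2): 1.0}
H1 = mpm_layer(H, neighbours, edge_weight, alpha=0.5)
probs = readout(H1, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```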

MPAD(Nikolentzos et al., 2020) MPAD (Message Passing Attention Networks) connects words within a sliding window of size 2 but also includes an additional master node connecting all nodes in the graph. The edge only shows the connectedness of each pair of word nodes and is fixed. A variant of Gated Graph Neural Networks is applied where the AGGREGATE function is the weighted sum and the COMBINE function is GRU(Chung et al., 2014). Self-attention is applied in the READOUT function.

To learn the high-level information, the master node is directly concatenated with the READOUT output, working as a skip connection mechanism. To get the final representation, each layer’s READOUT results are concatenated to capture multi-granularity information. Pretrained Word2vec is used as the initialization of word nodes input.

TextING(Zhang et al., 2020) To simplify MPAD, TextING omits the master node in the document-level graphs, which makes the graph sparser. Compared with Text-Level-GNN, TextING retains fixed edges. Similar AGGREGATE and COMBINE functions are applied under the concept of Gated Graph Neural Networks (GGNN)(Li et al., 2016), with the weighted sum and GRU. However, for the READOUT function, soft attention is used and both max-pooling and mean-pooling are applied to make sure that "every word plays a role in the text and the keywords should contribute more explicitly".

4.1.2. Advanced graph models: MLGNN, TextSSL, DADGNN

MLGNN(Liao et al., 2021b) MLGNN (Multi-level GNN) builds the same graph as TextING but introduces three levels of MPM: bottom-level, middle-level and top-level. In the bottom-level MPM, the same method as Text-Level-GNN is applied with pretrained Word2vec as the input embedding, but the edges are non-trainable. In the middle level, a larger window size is adopted and Graph Attention Networks (GAT)(Veličković et al., 2018) is applied to learn information from distant word nodes. In the top-level MPM, all word nodes are connected and multi-head self-attention(Vaswani et al., 2017) is applied. By applying three different levels of MPM, MLGNN learns multi-granularity information.

DADGNN(Liu et al., 2021a) DADGNN (Deep Attention Diffusion GNN) constructs the same graph as TextING but uses attention diffusion to overcome the oversmoothing issue. Pretrained word embedding is used as the input of each node and an MLP layer is applied. Then, the graph attention matrix is calculated based on the attention to the hidden states of each node. The diffusion matrix is calculated as

(27) $\bm{T} = \sum_{n=0}^{\infty}\epsilon_{n}\bm{A}^{n}$

where $\bm{A}$ is the graph attention matrix and $\epsilon_{n}$ are the learnable coefficients. $\bm{A}^{n}$ plays the role of connecting $n$-hop neighbours and Liu et al. (2021a) use $n \in [4, 7]$ in practice. A multi-head diffusion matrix is applied for layer propagation.
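
To make Eq. (27) concrete, the sketch below evaluates a truncated version of the sum with fixed coefficients. The helper names, the truncation to `len(coeffs)` terms and the hand-picked coefficient values are assumptions for illustration; in DADGNN the attention matrix and the coefficients are learned.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def attention_diffusion(A, coeffs):
    """Truncated Eq. (27): T = sum over n of coeffs[n] * A^n, for n < len(coeffs)."""
    size = len(A)
    # A^0 is the identity matrix.
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    T = [[0.0] * size for _ in range(size)]
    for eps in coeffs:
        for i in range(size):
            for j in range(size):
                T[i][j] += eps * power[i][j]
        power = matmul(power, A)  # move from A^n to A^(n+1)
    return T


# Toy row-stochastic attention matrix over two nodes.
A = [[0.0, 1.0], [1.0, 0.0]]
T = attention_diffusion(A, coeffs=[0.5, 0.25, 0.25])
```

When `A` is row-stochastic and the coefficients sum to one, each row of `T` also sums to one, so `T` can be read as attention spread over multi-hop neighbours.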

TextSSL(Piao et al., 2021) To solve the word ambiguity problem and to model word synonymity and dynamic contextual dependency, TextSSL (Sparse Structure Learning) learns the graph using intra-sentence neighbours and inter-sentence neighbours simultaneously. The local syntactic neighbours are defined as the consecutive words, and trainable edges across graphs are also included by using Gumbel-softmax. By applying sparse structure learning, TextSSL manages to select edges with dynamic contextual dependencies.

4.2. Global Word Co-occurrence Graph

Similar to the TextGCN graph, document-level graphs can also use PMI as the word-word edge values.
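
A minimal sketch of PMI edge weighting is given below, assuming the usual sliding-window estimate in which PMI(i, j) = log(p(i, j) / (p(i) p(j))) and probabilities are the fraction of windows containing a word or a word pair. The function name `pmi_edges`, the default window size and the choice to keep only positive PMI values (the TextGCN convention) are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(docs, window_size=3):
    """Return {(u, v): pmi} for word pairs with positive PMI over sliding windows."""
    windows = []
    for tokens in docs:
        if len(tokens) <= window_size:
            windows.append(tokens)  # a short document forms a single window
        else:
            windows.extend(tokens[i:i + window_size]
                           for i in range(len(tokens) - window_size + 1))

    word_count = Counter()  # number of windows containing each word
    pair_count = Counter()  # number of windows containing each word pair
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    total = len(windows)
    edges = {}
    for (u, v), count in pair_count.items():
        # log(p(u, v) / (p(u) * p(v))) with p(.) = count / total
        pmi = math.log(count * total / (word_count[u] * word_count[v]))
        if pmi > 0:
            edges[(u, v)] = pmi
    return edges


edges = pmi_edges([["a", "b", "c"], ["a", "b", "d"], ["c", "d", "e"]])
```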

4.2.1. Only global word co-occurrence: DAGNN

DAGNN(Wu et al., 2019a) To address the long-distance dependency, hierarchical information and cross-domain learning challenges in domain-adversarial text classification tasks, Wu et al. (2019a) propose DAGNN (Domain-Adversarial Graph Neural Network). Each document is represented by a graph with content words as nodes and PMI values as edge values, which can capture long-distance dependency information. Pretrained FastText is chosen as the input word embeddings to handle the out-of-vocabulary issue and a GCN model with skip connection is used to address the oversmoothing problem. The propagation is formulated as:

(28) $\bm{H}^{(l+1)} = (1-\alpha)\tilde{\bm{A}}\bm{H}^{(l)} + \alpha\bm{H}^{(0)}$

To learn the hierarchical information of documents, DiffPool(Ying et al., 2018) is applied to assign each document into a set of clusters. Finally, adversarial training is used to minimize the loss on source tasks and maximize the differentiation between source and target tasks.
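
The skip-connected propagation in Eq. (28) can be sketched as below. The function name, the list-of-rows matrix format and the toy values are assumptions for illustration; DiffPool and the adversarial training step are not covered.

```python
def propagate(A_norm, H0, num_layers=2, alpha=0.1):
    """Apply Eq. (28) repeatedly: H <- (1 - alpha) * A_norm @ H + alpha * H0."""
    n, dim = len(H0), len(H0[0])
    H = H0
    for _ in range(num_layers):
        # Neighbourhood averaging with the normalized adjacency matrix.
        AH = [[sum(A_norm[i][k] * H[k][d] for k in range(n)) for d in range(dim)]
              for i in range(n)]
        # Skip connection back to the input features H0.
        H = [[(1 - alpha) * AH[i][d] + alpha * H0[i][d] for d in range(dim)]
             for i in range(n)]
    return H


A_norm = [[0.5, 0.5], [0.5, 0.5]]  # toy normalized adjacency of two linked nodes
H0 = [[2.0], [0.0]]
with_skip = propagate(A_norm, H0, num_layers=5, alpha=0.5)
without_skip = propagate(A_norm, H0, num_layers=5, alpha=0.0)
```

In this toy case the two node features become identical when `alpha` is 0 but remain distinct when `alpha` is 0.5, which illustrates how the skip connection counteracts oversmoothing.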

4.2.2. Combine with Extra Edges: ReGNN, GFN

ReGNN(Li et al., 2019) To capture both global and local information, ReGNN (Recursive Graphical Neural Network) uses PMI together with consecutive words as the word edges. The graph propagation function is the same as in GGNN, while additive attention(Bahdanau et al., 2015) is applied in aggregation. Pretrained GloVe is the input embedding of each word node.

GFN(Dai et al., 2022) GFN (Graph Fusion Network) builds four types of graphs using the word co-occurrence statistics, PMI, the similarity of pretrained embedding and Euclidean distance of pretrained embedding. Although four corpus-level graphs are built, the graph learning actually happens on subgraphs of each document, making the method a document-level GNN. For each subgraph, each type of graph is learned separately using the graph convolutional method and then a fusion method of concatenation is used. After an MLP layer, average pooling is applied to get the document representation.

4.3. Other word graphs

Some other ways of connecting words in a document have been explored.

HyperGAT(Ding et al., 2020) Ding et al. (2020) propose HyperGAT (Hypergraph Attention Networks), which builds a hypergraph for each document to capture high-level interactions between words. Two types of hyperedges are included: sequential hyperedges connecting all words in a sentence, and semantic hyperedges connecting the top-K words after getting the topic of each word using LDA. Like traditional hypergraph propagation, HyperGAT follows the same two update steps but adds an attention mechanism to highlight the key information: node-level attention is applied to learn hyperedge representations and edge-level attention is used for updating node representations.
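
The sequential hyperedges can be illustrated with a node-by-hyperedge incidence matrix, as in the sketch below. The function name and the dense 0/1 matrix are assumptions for illustration; the semantic (LDA-based) hyperedges and the attention layers are omitted.

```python
def sequential_hyperedges(sentences):
    """Build an incidence matrix with one hyperedge per sentence.

    Returns (vocab, incidence), where incidence[i][e] is 1 if word vocab[i]
    occurs in sentence e and 0 otherwise.
    """
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    incidence = [[0] * len(sentences) for _ in vocab]
    for e, sentence in enumerate(sentences):
        for word in sentence:
            incidence[index[word]][e] = 1
    return vocab, incidence


vocab, incidence = sequential_hyperedges([["graphs", "help"],
                                          ["graphs", "scale", "well"]])
```

Unlike an ordinary edge, each column of the incidence matrix can contain any number of ones, which is how a hyperedge joins all words of a sentence at once.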

IGCN(Tang et al., 2020) Contextual dependency helps in understanding a document and the graph neural network is no exception. IGCN constructs the graph with the dependency graph to show the connectedness of each pair of words in a document. Then, the word representation learned from Bi-LSTM using POS embedding and word embedding is used to calculate the similarity between each pair of nodes. Attention is used for the output to find the important relevant semantic features.

GTNT(Mei et al., 2021) Based on the idea that words with higher TF-IDF values should connect to more word nodes, GTNT (Graph Transformer Networks based Text representation) uses the sorted TF-IDF values to determine the degree of each node and applies the Havel-Hakimi algorithm(Hakami, 1962) to determine the edges between word nodes. A variant of GAT is applied during model learning. Although GAT’s attention score is mutual for two nodes, GTNT uses relevant importance to adjust the attention score from one node to another. Pretrained Word2vec is applied as the input of each node.
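
For reference, the Havel-Hakimi procedure itself can be sketched as follows. The function takes a target degree sequence as given, so how GTNT maps sorted TF-IDF values to degrees is outside this sketch; the function name and the tie-breaking by node id are assumptions for illustration.

```python
def havel_hakimi(degrees):
    """Return an edge list of a simple graph realising `degrees`, or None.

    Repeatedly connects the node with the largest remaining degree d to the
    d nodes with the next-largest remaining degrees.
    """
    remaining = [[d, v] for v, d in enumerate(degrees)]
    edges = []
    if not remaining:
        return edges
    while True:
        # Largest remaining degree first; ties are broken by node id.
        remaining.sort(key=lambda item: (-item[0], item[1]))
        d, v = remaining[0]
        if d == 0:
            return edges  # every degree requirement is satisfied
        if d > len(remaining) - 1:
            return None   # not enough other nodes to connect to
        for other in remaining[1:d + 1]:
            if other[0] == 0:
                return None  # the sequence is not graphical
            other[0] -= 1
            edges.append((v, other[1]))
        remaining[0][0] = 0  # node v is now fully connected


edges = havel_hakimi([3, 2, 2, 1])
```

A node with a larger target degree, which corresponds to a higher TF-IDF value in GTNT, ends up with more neighbours in the resulting graph.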

4.4. Critical Analysis

Most document-level GNNs connect consecutive words as edges in the graph and apply a graph neural network model, which makes them similar to CNN where the receptive field enlarges when graph models go deeper. Also, the major differences among document-level GNNs are the details of graph models, e.g. different pooling methods, and different attention calculations, which diminishes the impact of the contribution of these works. Compared with corpus-level GNN, document-level GNN adopts more complex graph models and also suffers from the out-of-memory issue when the number of words in a document is large.

A detailed comparison of document-level GNN is displayed in Table 2.

Table 2. Detailed comparison of models: whether external resources are used, how the edges and node inputs are constructed, and whether learning is transductive or inductive. GloVe and Word2vec are pretrained if not specified. “emb sim” is short for “embedding similarity”, “dep graph” is short for “dependency graph”.

| Graph | Model | External Resource | Edge Construction | Node Initialization | Learning |
| Corpus-level | TextGCN(Yao et al., 2019) | N/A | pmi, tf–idf | one-hot | transductive |
| Corpus-level | SGC(Wu et al., 2019b) | N/A | pmi, tf–idf | one-hot | transductive |
| Corpus-level | S2GC(Zhu and Koniusz, 2020) | N/A | pmi, tf–idf | one-hot | transductive |
| Corpus-level | NMGC(Lei et al., 2021) | N/A | pmi, tf–idf | one-hot | transductive |
| Corpus-level | TG-transformer(Zhang and Zhang, 2020) | GloVe | pmi, tf–idf | GloVe | transductive |
| Corpus-level | BERTGCN(Lin et al., 2021) | BERT | pmi, tf–idf | doc: 0; word: BERT emb | transductive |
| Corpus-level | TensorGCN(Liu et al., 2020) | GloVe, CoreNLP | emb sim, dep graph, pmi, tf–idf | one-hot | transductive |
| Corpus-level | ME-GCN(Wang et al., 2022b) | N/A | emb sim, tf–idf | Trained Word2vec/doc2vec | transductive |
| Corpus-level | HeteGCN(Ragesh et al., 2021) | N/A | pmi, tf–idf | one-hot | inductive |
| Corpus-level | InducT-GCN(Wang et al., 2022a) | N/A | pmi, tf–idf | one-hot, tf–idf vectors | inductive |
| Corpus-level | T-VGAE(Xie et al., 2021) | N/A | pmi | one-hot | inductive |
| Corpus-level | VGCN-BERT(Lu et al., 2020) | BERT | pmi | BERT emb | transductive |
| Corpus-level | knn-GCN(Benamira et al., 2019) | GloVe | emb sim | GloVe | transductive |
| Corpus-level | TextGTL(Li et al., 2021) | CoreNLP | dep graph, pmi | tf–idf vectors | transductive |
| Corpus-level | HGAT(Linmei et al., 2019) | TAGME, Word2vec | LDA, entity link, emb sim | tf-idf, LDA, Word2vec | transductive |
| Corpus-level | STGCN(Ye et al., 2020) | BERT | pmi, tf–idf, BTM | BERT emb | transductive |
| Corpus-level | DHTG(Wang et al., 2020) | N/A | PGBN, pmi, tf–idf | one-hot | transductive |
| Doc-level | Text-Level-GNN(Huang et al., 2019) | GloVe | consecutive words | GloVe | inductive |
| Doc-level | MPAD(Nikolentzos et al., 2020) | Word2vec | consecutive words | Word2vec | inductive |
| Doc-level | TextING(Zhang et al., 2020) | GloVe | consecutive words | GloVe | inductive |
| Doc-level | MLGNN(Liao et al., 2021b) | Word2vec | consecutive words | Word2vec | inductive |
| Doc-level | DADGNN(Liu et al., 2021a) | Word2vec/GloVe | consecutive words | Word2vec/GloVe | inductive |
| Doc-level | TextSSL(Piao et al., 2021) | GloVe | consecutive words | GloVe | inductive |
| Doc-level | DAGNN(Wu et al., 2019a) | GloVe | pmi | GloVe | inductive |
| Doc-level | ReGNN(Li et al., 2019) | GloVe | consecutive words, pmi | GloVe | inductive |
| Doc-level | GFN(Dai et al., 2022) | GloVe | pmi, emb sim | GloVe | inductive |
| Doc-level | HyperGAT(Ding et al., 2020) | N/A | LDA, consecutive words | one-hot | inductive |
| Doc-level | IGCN(Tang et al., 2020) | spaCy | dep graph | LSTM emb | inductive |
| Doc-level | GTNT(Mei et al., 2021) | Word2vec/GloVe | tf–idf sorted value | Word2vec/GloVe | inductive |

5. Datasets and Metrics

5.1. Datasets

Table 3. Commonly Used Text Classification Datasets by GNN-based models

| Task | Name | Domain | # Cat. | # Docs | # Train | # Test | # Words | Ave Len. | Models |
| Topic Classification | Ohsumed | Bibliography | 23 | 7,400 | 3,357 | 4,043 | 14,157 | 135.82 | (Linmei et al., 2019; Wang et al., 2020; Xie et al., 2021; Yao et al., 2019; Lin et al., 2021; Piao et al., 2021; Liu et al., 2020; Lei et al., 2021; Zhang and Zhang, 2020; Ding et al., 2020; Zhang et al., 2020; Li et al., 2021, 2019) |
| Topic Classification | R8 | News | 8 | 7,674 | 5,485 | 2,189 | 7,688 | 65.72 | (Liu et al., 2021a; Wu et al., 2019a; Zhang and Zhang, 2020; Yao et al., 2019; Wang et al., 2020; Xie et al., 2021; Lin et al., 2021; Piao et al., 2021; Liu et al., 2020; Lei et al., 2021; Linmei et al., 2019; Ding et al., 2020; Zhang et al., 2020; Li et al., 2021; Huang et al., 2019; Li et al., 2019) |
| Topic Classification | R52 | News | 52 | 9,100 | 6,532 | 2,568 | 8,892 | 69.82 | (Liu et al., 2021a; Wu et al., 2019a; Zhang and Zhang, 2020; Yao et al., 2019; Wang et al., 2020; Xie et al., 2021; Lin et al., 2021; Piao et al., 2021; Liu et al., 2020; Lei et al., 2021; Ding et al., 2020; Zhang et al., 2020; Linmei et al., 2019; Huang et al., 2019; Li et al., 2019, 2021) |
| Topic Classification | 20NG | News | 20 | 18,846 | 11,314 | 7,532 | 42,757 | 221.26 | (Wang et al., 2020; Wu et al., 2019a; Xie et al., 2021; Yao et al., 2019; Lin et al., 2021; Piao et al., 2021; Liu et al., 2020; Lei et al., 2021; Linmei et al., 2019; Ding et al., 2020; Li et al., 2021, 2019) |
| Topic Classification | AG-News | News | 4 | 127,600 | 120,000 | 7,600 | 128,515 | 44.03 | (Liao et al., 2021b; Liu et al., 2021a; Linmei et al., 2019) |
| Topic Classification | WebKB | Web Page | 7 | 4,199 | 2,803 | 1,396 | 7,771 | 133.37 | (Liu et al., 2021a) |
| Topic Classification | TREC | Questions | 6 | 5,952 | 5,452 | 500 | 9,593 | 10.06 | (Liu et al., 2021a) |
| Topic Classification | DBLP | Bibliography | 6 | 81,479 | 61,479 | 20,000 | 25,549 | 8.51 | (Liu et al., 2021a) |
| Topic Classification | DBpedia | Wikipedia | 14 | 630,000 | 560,000 | 70,000 | - | - | (Linmei et al., 2019) |
| Sentiment Analysis | MR | Movie review | 2 | 10,662 | 7,108 | 3,554 | 18,764 | 20.39 | (Lu et al., 2020; Wang et al., 2020; Xie et al., 2021; Yao et al., 2019; Liu et al., 2021a; Lin et al., 2021; Piao et al., 2021; Liu et al., 2020; Lei et al., 2021; Linmei et al., 2019; Ding et al., 2020; Zhang et al., 2020; Li et al., 2021; Huang et al., 2019; Li et al., 2019; Ye et al., 2020) |
| Sentiment Analysis | AAR | Product review | 2 | 3,150 | 1,575 | 1,575 | - | - | (Tang et al., 2020) |
| Sentiment Analysis | TUA | Airline comments | 2 | 14,640 | 7,320 | 7,320 | - | - | (Tang et al., 2020) |
| Sentiment Analysis | SST-1 | Movie review | 5 | 11,855 | 9,465 | 2,210 | 19,524 | 20.17 | (Liu et al., 2021a) |
| Sentiment Analysis | SST-2 | Movie review | 2 | 9,613 | 7,792 | 1,821 | 17,539 | 19.67 | (Liu et al., 2021a; Lu et al., 2020) |
| Sentiment Analysis | IMDB | Movie review | 2 | 50,000 | 25,000 | 25,000 | 71,278 | 232.77 | (Liu et al., 2021a; Tang et al., 2020; Zhang and Zhang, 2020) |
| Sentiment Analysis | Yelp 2014 | Review rating | 5 | 1,125,386 | 900,309 | 225,077 | 476,191 | 148.8 | (Zhang and Zhang, 2020) |
| Sentiment Analysis | Twitter | Twitter | 2 | 10,000 | - | - | - | - | (Linmei et al., 2019) |
| Sentiment Analysis | SenTube-A | Youtube Comments | 2 | 7,400 | 3,357 | 4,043 | 14,157 | 28.54 | (Liao et al., 2021b) |
| Sentiment Analysis | SenTube-T | Youtube Comments | 2 | 6,664 | 4,997+333 | 1,334 | 20,276 | 28.73 | (Liao et al., 2021b) |
| Other | ArangoHate | Twitter posts | 2 | 7,006 | - | - | - | 13.3 | (Lu et al., 2020) |
| Other | FountaHate | Twitter posts | 4 | 99,996 | - | - | - | 15.7 | (Lu et al., 2020) |
| Other | CoLA | grammar check | 2 | 9,594 | 8,551 | 1,043 | - | 7.7 | (Lu et al., 2020) |

There are many popular text classification benchmark datasets; this paper mainly focuses on the datasets used by GNN-based text classification applications. Based on the purpose of the applications, we divide the commonly adopted datasets into three types: Topic Classification, Sentiment Analysis and Other. Most of these text classification datasets contain a single target label for each text body. The key information of each dataset is listed in Table 3.

5.1.1. Topic Classification

Topic classification models aim to classify input text bodies from diverse sources into predefined categories. News categorization is a typical topic classification task to obtain key information from news and classify them into corresponding topics. The input text bodies normally are paragraphs or whole documents especially for news categorization, while there are still some short text classification datasets from certain domains such as micro-blogs, bibliography, etc. Some typical datasets are listed:

  • Ohsumed (Joachims, 1998) is acquired from the MEDLINE database and further processed by (Yao et al., 2019) via selecting certain documents (abstracts) and filtering out the documents belonging to multiple categories. Those documents are classified into 23 cardiovascular diseases. The statistics of the Ohsumed dataset as processed by (Yao et al., 2019) are presented in Table 3; this version is directly employed by other related works.

  • R8 / R52 are two subsets of the Reuters 21578 dataset (for the original dataset, please refer to http://www.daviddlewis.com/resources/testcollections/reuters21578), which contain 8 and 52 news topics from Reuters financial news services, respectively.

  • 20NG is another widely used news categorization dataset that contains 20 newsgroups. It was originally collected by (Lang, 1995), but the procedures are not explicitly described.

  • AG News (Zhang et al., 2015) is a news categorization dataset that is large-scale compared with other commonly used datasets. It is constructed by selecting the 4 largest categories from the AG corpus. Each news topic contains 30,000 samples for training and 1,900 samples for testing.

  • Database systems and Logic Programming (DBLP) is a topic classification dataset for classifying computer science paper titles into six topics (Mei et al., 2021). Unlike paragraph- or document-based topic classification datasets, DBLP categorises scientific paper titles, so its average input length is much lower than that of the others.

  • DBpedia (Lehmann et al., 2015) is a large-scale multilingual knowledge base that contains 14 non-overlapping categories. Each category contains 40,000 samples for training and 5,000 samples for testing.

  • WebKB (Craven et al., 1998) is a long corpus web page topic classification dataset.

  • TREC (Li and Roth, 2002) is a question topic classification dataset to categorise one question sentence into 6 question categories.

5.1.2. Sentiment Analysis

The purpose of sentiment analysis is to analyse and mine the opinion expressed in textual content, which can be treated as a binary or multi-class classification problem. Existing sentiment analysis datasets are sourced from movie reviews, product reviews or user comments, social media posts, etc. Most sentiment analysis datasets aim to predict people’s opinions from one or two input sentences, with an average input length of around 25 tokens.

  • Movie Review (MR) (Pang and Lee, 2005) is a binary sentiment classification dataset for movie review which contains positive and negative data equally distributed. Each review only contains one sentence.

  • Stanford Sentiment Treebank (SST) (Socher et al., 2013) is an upgraded version of MR which contains two subsets SST-1 and SST-2. SST-1 provides five fine-grained labels while SST-2 is a binary sentiment classification dataset.

  • Internet Movie DataBase (IMDB) (Maas et al., 2011) is also an equally distributed binary classification dataset for sentiment analysis. Different from other short text classification datasets, the average number of words of each review is around 221.

  • Yelp 2014 (Tang et al., 2015) is a large scale binary category based sentiment analysis dataset for longer user reviews collected from Yelp.com.

Certain binary sentiment classification benchmark datasets are also used by GNN-based text classifiers. Most of them are gathered from shorter user reviews or comments (normally one or two sentences) from different websites, including Amazon Alexa Reviews (AAR), Twitter US Airline (TUA) and Youtube comments (SenTube-A and SenTube-T) (Uryupina et al., 2014).

5.1.3. Other Datasets

There are some datasets targeting other tasks, including hate detection, grammaticality checking, etc. For example, ArangoHate (Arango et al., 2019) is a hate detection dataset, a sub-task of intent detection, which contains 2,920 hateful documents and 4,086 normal documents obtained by resampling the merged datasets from (Davidson et al., 2017) and (Waseem, 2016). In addition, (Founta et al., 2018) propose another large-scale hate language detection dataset, namely FountaHate, which classifies tweets into four categories with 53,851, 14,030, 27,150, and 4,965 samples of normal, spam, hateful and abusive, respectively. Since there is no officially provided training and testing split ratio for the above datasets, the numbers presented in Table 3 follow the ratios (train/development/test is 85:5:10) defined by (Lu et al., 2020).

5.1.4. Dataset Summary

Since an obvious limitation of corpus-level GNN models is high memory consumption (Zhang and Zhang, 2020; Huang et al., 2019; Ding et al., 2020), datasets with a smaller number of documents and smaller vocabulary sizes, such as Ohsumed, R8/R52, 20NG or MR, are widely used so that corpus-level graphs can feasibly be built and evaluated. For document-level GNN based models, some larger datasets like AG-News can be adopted without considering the memory consumption problem. From Table 3, we can see that most related works focus on GNNs applied to topic classification and sentiment analysis, which means the role of GNNs in other text classification tasks, such as spam detection, intent detection and abstractive question answering, needs to be further explored. Another observed trend is that short text classification has gained less attention than long document classification. In this case, GNN in short text classification may be an .

5.2. Evaluation Methods

5.2.1. Performance Metrics

In terms of evaluating and comparing the performance of proposed models with other baselines, accuracy and F1 are the most commonly used metrics to conduct overall performance analysis, ablation studies and breakdown analysis. We use $TP$, $FP$, $TN$ and $FN$ to represent the number of true positive, false positive, true negative and false negative samples. $N$ is the total number of samples.

  • Accuracy and Error Rate are basic evaluation metrics adopted by many GNN-based text classifiers such as (Li et al., 2021; Liu et al., 2016; Wang et al., 2020; Yao et al., 2019; Zhang and Zhang, 2020). Most of the related papers run all baselines and their models 10 times or 5 times and report the mean ± standard deviation of accuracy to give more convincing results. They can be defined as:

    (29) $Accuracy = \frac{TP+TN}{N},$
    (30) $ErrorRate = 1 - Accuracy = \frac{FP+FN}{N}.$
  • Precision, Recall and F1 are metrics for measuring performance, especially on imbalanced datasets. Precision measures the relevancy of the results, while recall measures how many of the truly relevant results are retrieved. F1 is the harmonic mean of Precision and Recall. These three measurements can be defined as:

    (31) $Precision = \frac{TP}{TP+FP},$
    (32) $Recall = \frac{TP}{TP+FN},$
    (33) $F1 = \frac{2\times Precision\times Recall}{Precision+Recall}.$
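
The definitions above translate directly into code. The sketch below computes them from the four confusion counts; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, error rate, precision, recall and F1 from counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced toy example: 13 positive and 87 negative samples.
scores = binary_metrics(tp=8, fp=2, tn=85, fn=5)
```

In this toy example accuracy is 0.93 while F1 is noticeably lower, because accuracy is dominated by the large negative class. This is why F1 is preferred for imbalanced datasets.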

A few papers utilise only recall or precision to evaluate the performance (Mei et al., 2021). However, precision and recall are more commonly used together with F1 or Accuracy to evaluate and analyse the performance from different perspectives, e.g. (Li et al., 2019; Linmei et al., 2019; Lu et al., 2020; Xie et al., 2021). In addition, based on different application scenarios, different F1 averaging methods are adopted by those papers to measure the overall F1 score of multi-class classification tasks (with $C$ classes), including:

  • Macro-F1 applies the same weight to all categories to get the overall $F1_{macro}$ by taking the arithmetic mean:

    (34) $F1_{macro} = \frac{1}{C}\sum_{i=1}^{C} F1_{i},$
  • Micro-F1 is calculated by considering the overall $P_{micro}$ and $R_{micro}$. It can be defined as:

    (35) $F1_{micro} = \frac{2\times P_{micro}\times R_{micro}}{P_{micro}+R_{micro}}$

    where:

    (36) $P_{micro} = \frac{\sum_{i=1}^{C} TP_{i}}{\sum_{i=1}^{C} (TP_{i}+FP_{i})}, \quad R_{micro} = \frac{\sum_{i=1}^{C} TP_{i}}{\sum_{i=1}^{C} (TP_{i}+FN_{i})},$
  • Weighted-F1 is the weighted mean of the F1 of each category, where the weight $W_{i}$ is related to the number of occurrences of the corresponding $i$-th class. It can be defined as:

    (37) $F1_{weighted} = \sum_{i=1}^{C} F1_{i}\times W_{i}.$
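
The three averaging schemes can be compared with a short sketch. The function names and the input format (one `(TP, FP, FN)` tuple per class) are assumptions for illustration, and the weight $W_{i}$ is taken to be the fraction of samples belonging to class $i$, which is one common choice rather than the only possible one.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def averaged_f1(per_class):
    """Return (macro, micro, weighted) F1 from per-class (TP, FP, FN) counts."""
    f1s = []
    for tp, fp, fn in per_class:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(f1_score(precision, recall))

    # Macro: plain arithmetic mean of the per-class F1 scores.
    macro = sum(f1s) / len(f1s)

    # Micro: pool the counts over all classes before computing P and R.
    tp_sum = sum(tp for tp, _, _ in per_class)
    fp_sum = sum(fp for _, fp, _ in per_class)
    fn_sum = sum(fn for _, _, fn in per_class)
    p_micro = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    r_micro = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    micro = f1_score(p_micro, r_micro)

    # Weighted: assumption, W_i is the share of true samples in class i.
    support = [tp + fn for tp, _, fn in per_class]
    weighted = sum(f * s for f, s in zip(f1s, support)) / sum(support)
    return macro, micro, weighted


# Two classes with 10 and 3 true samples respectively.
macro, micro, weighted = averaged_f1([(6, 1, 4), (2, 4, 1)])
```

Macro-F1 treats the small class as equally important as the large one, whereas Micro-F1 and Weighted-F1 are dominated by the larger class, so the three values generally differ on imbalanced data.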

5.2.2. Other Evaluation Aspects

Because time and memory consumption are two limitations of GNN-based models, many studies go beyond the usual performance comparison and also report the GPU or CPU memory consumption and the training time of the proposed models to demonstrate their practicality in real-world applications. In addition, depending on the novelty of each model, specific evaluations are conducted to demonstrate the proposed contributions.

  • Memory Consumption: (Ding et al., 2020; Huang et al., 2019; Liu et al., 2021a) list the memory consumption of different models to evaluate the computational efficiency of the proposed models.

  • Time Measurement: (Ragesh et al., 2021; Pasa et al., 2021) compare the training time of their proposed models against baselines on different benchmarks. Given the doubts about the efficiency of applying GNNs to text classification, this is an effective way to demonstrate that a model balances performance and time efficiency well.

  • Parameter Sensitivity analysis is commonly conducted in GNN studies to investigate the effect of different hyperparameters of the proposed models, e.g. varying sliding window sizes and embedding dimensions, with the sensitivity typically shown as line charts, as in (Linmei et al., 2019; Ding et al., 2020; Liu et al., 2021a).

  • Number of Labelled Documents is a widely adopted evaluation for GNN-based text classification models (Li et al., 2021; Wang et al., 2020; Linmei et al., 2019; Mei et al., 2021; Yao et al., 2019; Ragesh et al., 2021; Ding et al., 2020), which analyses the performance trend under different proportions of training data to test whether the proposed model works well with limited labelled training data.

  • Vocabulary Size is similar to the number of labelled documents, but investigates the effect of using different vocabulary sizes during the GNN training stage, as adopted by (Wang et al., 2020).
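The time and memory measurements listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the protocol of any cited paper: `measure` and `dummy_train` are hypothetical names, and `tracemalloc` only tracks memory allocated by the Python interpreter on the CPU, so GPU memory would have to be read from the deep learning framework's own counters.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, int]:
    """Run fn once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def dummy_train(n: int) -> int:
    """Hypothetical stand-in for a training loop; allocates a list of n ints."""
    data = [i * i for i in range(n)]
    return sum(data)


_, seconds, peak_bytes = measure(dummy_train, 100_000)
print(f"time: {seconds:.4f}s, peak Python memory: {peak_bytes / 1e6:.2f} MB")
```

Running the same wrapper around each model's training routine on the same hardware gives comparable numbers; a single run is noisy, so repeating the measurement and averaging is advisable.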

5.2.3. Metrics Summary

For general text classification tasks, Accuracy, Precision, Recall and the F1 variants are the commonly used evaluation metrics for comparison with baselines. However, for GNN-based models, task performance alone cannot capture all aspects of the proposed models. Many papers therefore conduct additional analyses to evaluate the GNN-based classifier from multiple views, including time and memory consumption, parameter sensitivity and the amount of data used (number of labelled documents and vocabulary size).

6. Performance

Table 4. Performance Table. - indicates unavailability. * refers to replication from HyperGAT (Ding et al., 2020).
| Type | Method | External Resource | 20NG | R8 | R52 | Ohsumed | MR |
|---|---|---|---|---|---|---|---|
| Corpus-level | TextGCN (Yao et al., 2019) | N/A | 86.34 ± 0.09 | 97.07 ± 0.10 | 93.56 ± 0.18 | 68.36 ± 0.56 | 76.74 ± 0.20 |
| Corpus-level | SGC (Wu et al., 2019b) | N/A | 88.5 ± 0.1 | 97.2 ± 0.1 | 94.0 ± 0.2 | 68.5 ± 0.3 | 75.9 ± 0.3 |
| Corpus-level | S2GC (Zhu and Koniusz, 2020) | N/A | 88.6 ± 0.1 | 97.4 ± 0.1 | 94.5 ± 0.2 | 68.5 ± 0.1 | 76.7 ± 0.0 |
| Corpus-level | TG-transformer (Zhang and Zhang, 2020) | GloVe | - | 98.1 ± 0.1 | 95.2 ± 0.2 | 70.4 ± 0.4 | - |
| Corpus-level | DHTG (Wang et al., 2020) | N/A | 87.13 ± 0.07 | 97.33 ± 0.06 | 93.93 ± 0.10 | 68.80 ± 0.33 | 77.21 ± 0.11 |
| Corpus-level | TensorGCN (Liu et al., 2020) | GloVe, CoreNLP | 87.74 ± 0.05 | 98.04 ± 0.08 | 95.05 ± 0.11 | 70.11 ± 0.24 | 77.91 ± 0.07 |
| Corpus-level | STGCN (Ye et al., 2020) | BERT | - | 98.5 | - | - | 82.5 |
| Corpus-level | NMGC (Lei et al., 2021) | N/A | 86.61 ± 0.06 | 97.31 ± 0.09 | 94.35 ± 0.06 | 69.21 ± 0.17 | 76.21 ± 0.25 |
| Corpus-level | BertGCN (Lin et al., 2021) | BERT | 89.3 | 98.1 | 96.6 | 72.8 | 86 |
| Corpus-level | RobertaGCN (Lin et al., 2021) | RoBERTa | 89.5 | 98.2 | 96.1 | 72.8 | 89.7 |
| Corpus-level | T-VGAE (Xie et al., 2021) | N/A | 88.08 ± 0.06 | 97.68 ± 0.14 | 95.05 ± 0.10 | 70.02 ± 0.14 | 78.03 ± 0.11 |
| Doc-level | ReGNN (Li et al., 2019) | GloVe | - | 97.93 ± 0.31 | 95.17 ± 0.17 | 67.93 ± 0.33 | 78.71 ± 0.56 |
| Doc-level | Text-Level-GNN (Huang et al., 2019) | GloVe | 84.16 ± 0.25* | 97.8 ± 0.2 | 94.6 ± 0.3 | 69.4 ± 0.6 | 75.47 ± 0.06* |
| Doc-level | TextING (Zhang et al., 2020) | GloVe | - | 98.13 ± 0.12 | 95.68 ± 0.35 | 70.84 ± 0.52 | 80.19 ± 0.31 |
| Doc-level | HyperGAT (Ding et al., 2020) | N/A | 86.62 ± 0.16 | 97.97 ± 0.23 | 94.98 ± 0.27 | 69.90 ± 0.34 | 78.32 ± 0.27 |
| Doc-level | TextSSL (Piao et al., 2021) | GloVe | 85.26 ± 0.28 | 97.81 ± 0.14 | 95.48 ± 0.26 | 70.59 ± 0.38 | 79.74 ± 0.19 |

While different GNN text classification models may be evaluated on different datasets, some datasets are commonly used across many of these models, including 20NG, R8, R52, Ohsumed and MR. The accuracy of various models on these five datasets is presented in Table 4. Some of the results are reported as the average accuracy and standard deviation over ten runs, while others report only the average accuracy. Several conclusions can be drawn:

  • Models that use external resources usually achieve better performance than those that do not, especially models with BERT and RoBERTa(Lin et al., 2021; Ye et al., 2020).

  • Under the same setting, such as using GloVe as the external resource, Corpus-level GNN models (e.g. TG-Transformer(Zhang and Zhang, 2020), TensorGCN(Liu et al., 2020)) typically outperform Document-level GNN models (e.g. TextING(Zhang et al., 2020), TextSSL(Piao et al., 2021)). This is because Corpus-level GNN models can work in a transductive way and make use of the test input, whereas Document-level GNN models can only use the training data.

  • The advantage of Corpus-level GNN models over Document-level GNN models only applies to topic classification datasets and not to sentiment analysis datasets such as MR. This is because sentiment analysis involves analyzing the order of words in a text, which is something that most Corpus-level GNN models cannot do.
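The "average ± standard deviation" entries in Table 4 summarise repeated runs of the same model. A minimal sketch of how such an entry can be produced is given below; it assumes the sample standard deviation (the cited papers do not always state which estimator they use), and the accuracy values are made up for illustration only.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); std is 0.0 for a single run."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


def format_result(accuracies: Sequence[float], digits: int = 2) -> str:
    """Format repeated-run accuracies in the 'mean ± std' style of Table 4."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Hypothetical accuracies (in %) from ten runs with different random seeds.
runs = [97.1, 97.0, 97.2, 96.9, 97.1, 97.0, 97.3, 97.0, 97.1, 96.9]
print(format_result(runs))
```

When comparing two rows of the table, differences smaller than the reported standard deviations should be read with caution, and entries without a deviation cannot be assessed this way.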

7. Challenges and Future Work

7.1. Model Performance

With the development of pre-trained models(Kenton and Toutanova, 2019; Liu et al., 2019) and prompt learning methods(Gao et al., 2021; Liu et al., 2021b), great performance has been achieved on text classification. GNNs applied to text classification without this pre-training style cannot be expected to achieve such good performance. For both corpus-level and document-level GNN text classification models, researching how to combine GNN models with these pre-trained models to further improve their performance is a direction for future work. Meanwhile, more advanced graph models can be explored, e.g. more heterogeneous graph models on word and document graphs, to improve model performance.

7.2. Graph Construction

Most GNN text classification methods construct graphs with a single type of edge whose static weight is based on document statistics. This applies to both corpus-level and document-level GNNs. However, to better explore the complex relationships between words and documents, dynamic edges and hyperedges can be utilized. Dynamic edges in GNNs can be learned from various sources, such as the graph structure, document semantic information, or other models, while hyperedges can be built for a more expressive representation of the complex relationships between nodes in the graph.
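To make the notion of a static, statistics-based edge concrete, the sketch below builds word-word edges from sliding-window co-occurrence and scores them with pointwise mutual information (PMI), keeping only positive values. PMI is used here as one illustrative document statistic; the function name, window size and toy corpus are assumptions, and the weights are computed once and never updated during training, which is exactly the limitation discussed above.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pmi_edges(docs: List[List[str]], window_size: int = 3) -> Dict[Tuple[str, str], float]:
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI."""
    window_count = 0
    word_freq: Counter = Counter()  # number of windows containing each word
    pair_freq: Counter = Counter()  # number of windows containing each pair
    for tokens in docs:
        if not tokens:
            continue
        # A document shorter than the window still yields one window.
        n_windows = max(1, len(tokens) - window_size + 1)
        for start in range(n_windows):
            window = set(tokens[start:start + window_size])
            window_count += 1
            word_freq.update(window)
            pair_freq.update(combinations(sorted(window), 2))

    edges = {}
    for (a, b), n_ab in pair_freq.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-level probabilities.
        pmi = math.log(n_ab * window_count / (word_freq[a] * word_freq[b]))
        if pmi > 0:  # discard pairs that co-occur no more often than chance
            edges[(a, b)] = pmi
    return edges


toy_docs = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
for pair, weight in sorted(pmi_edges(toy_docs, window_size=2).items()):
    print(pair, round(weight, 3))
```

A dynamic alternative would replace the fixed dictionary of weights with values that are recomputed or learned as training proceeds, for example from node representations.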

7.3. Application

While corpus-level GNN text classification models have demonstrated good performance without using external resources, these models are mostly transductive. To apply them in real-world settings, an inductive learning approach should be explored. Although some inductive corpus-level GNNs have been introduced, the large amount of space required to construct the graph and the inconvenience of incremental training still present barriers to deployment. Improving the scalability of online training and testing for inductive corpus-level GNNs represents a promising area for future work.

8. Conclusion

This survey introduces how Graph Neural Networks have been applied to text classification in two different ways, corpus-level GNNs and document-level GNNs, with a detailed structural figure. Details of these models have been introduced and discussed, along with the datasets commonly used by these methods. Compared with traditional machine learning and sequential deep learning models, graph neural networks can explore the relationships between words and documents in the global structure (corpus-level GNNs) or within the local document (document-level GNNs), giving good performance. A detailed performance comparison is conducted to investigate the influence of external resources, model learning methods, and types of datasets. Furthermore, we discuss the challenges for GNN text classification models and potential future work.

References

  • Abreu et al. (2019) Jader Abreu, Luis Fred, David Macêdo, and Cleber Zanchettin. 2019. Hierarchical attentional hybrid neural networks for document classification. In International Conference on Artificial Neural Networks. Springer, 396–402.
  • Aggarwal and Zhai (2012) Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text classification algorithms. In Mining text data. Springer, 163–222.
  • Alsaeedi (2020) Abdullah Alsaeedi. 2020. A survey of term weighting schemes for text classification. International Journal of Data Mining, Modelling and Management 12, 2 (2020), 237–254.
  • Arango et al. (2019) Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: A closer look at model validation. In Proceedings of the 42nd international acm sigir conference on research and development in information retrieval. 45–54.
  • Bach and Jordan (2002) Francis R Bach and Michael I Jordan. 2002. Kernel independent component analysis. Journal of machine learning research 3, Jul (2002), 1–48.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
  • Benamira et al. (2019) Adrien Benamira, Benjamin Devillers, Etienne Lesot, Ayush K Ray, Manal Saadi, and Fragkiskos D Malliaros. 2019. Semi-supervised learning and graph neural networks for fake news detection. In 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 568–569.
  • Bhavani and Kumar (2021) A Bhavani and B Santhosh Kumar. 2021. A Review of State Art of Text Classification Algorithms. In 2021 5th International Conference on Computing Methodologies and Communication (ICCMC). IEEE, 1484–1490.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the association for computational linguistics 5 (2017), 135–146.
  • Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
  • Cavnar et al. (1994) William B Cavnar, John M Trenkle, et al. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, Vol. 161175. Las Vegas, NV.
  • Chen et al. (2020) Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. Advances in Neural Information Processing Systems 33 (2020), 19314–19326.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • Craven et al. (1998) Mark Craven, Andrew McCallum, Dan DiPasquo, Tom Mitchell, and Dayne Freitag. 1998. Learning to extract symbolic knowledge from the World Wide Web. Technical Report. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.
  • Dai et al. (2022) Yong Dai, Linjun Shou, Ming Gong, Xiaolin Xia, Zhao Kang, Zenglin Xu, and Daxin Jiang. 2022. Graph fusion network for text classification. Knowledge-Based Systems 236 (2022), 107659.
  • Davidson et al. (2017) Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the international AAAI conference on web and social media, Vol. 11. 512–515.
  • Deng et al. (2019) Xuelian Deng, Yuqing Li, Jian Weng, and Jilian Zhang. 2019. Feature selection for text classification: A review. Multimedia Tools and Applications 78, 3 (2019), 3797–3816.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dieng et al. (2016) Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. Topicrnn: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702 (2016).
  • Ding et al. (2020) Kaize Ding, Jianling Wang, Jundong Li, Dingcheng Li, and Huan Liu. 2020. Be more with less: Hypergraph attention networks for inductive text classification. arXiv preprint arXiv:2011.00387 (2020).
  • Founta et al. (2018) Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of twitter abusive behavior. In Twelfth International AAAI Conference on Web and Social Media.
  • Gao et al. (2022) Chen Gao, Xiang Wang, Xiangnan He, and Yong Li. 2022. Graph neural networks for recommender system. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1623–1625.
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3816–3830.
  • Genkin et al. (2007) Alexander Genkin, David D Lewis, and David Madigan. 2007. Large-scale Bayesian logistic regression for text categorization. technometrics 49, 3 (2007), 291–304.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In International conference on machine learning. PMLR, 1263–1272.
  • Graves (2012) Alex Graves. 2012. Long short-term memory. Supervised sequence labelling with recurrent neural networks (2012), 37–45.
  • Hakami (1962) S Hakami. 1962. On the realizability of a set of integers as degrees of the vertices of a graph. SIAM Journal Applied Mathematics 10 (1962), 496–506.
  • Hakim et al. (2014) Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium, and Wahyu Muliady. 2014. Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach. In 2014 6th international conference on information technology and electrical engineering (ICITEE). IEEE, 1–4.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584 (2017).
  • Harish et al. (2010) Bhat S Harish, Devanur S Guru, and Shantharamu Manjunath. 2010. Representation and classification of text documents: A brief review. IJCA, Special Issue on RTIPPR (2) 110 (2010), 119.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Huang et al. (2019) Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2019. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356 (2019).
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers). 1681–1691.
  • Jindal et al. (2015) Rajni Jindal, Ruchika Malhotra, and Abha Jain. 2015. Techniques for text classification: Literature review and current trends. webology 12, 2 (2015).
  • Joachims (1998) Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning. Springer, 137–142.
  • Joachims (2005) Thorsten Joachims. 2005. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98: 10th European Conference on Machine Learning Chemnitz, Germany, April 21–23, 1998 Proceedings. Springer, 137–142.
  • Kadhim (2019) Ammar Ismael Kadhim. 2019. Survey on supervised machine learning techniques for automatic text classification. Artificial Intelligence Review 52, 1 (2019), 273–292.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).
  • Kaur and Kumar (2018) Rajinder Kaur and Mukesh Kumar. 2018. Domain ontology graph approach using Markov clustering algorithm for text classification. In International Conference on Intelligent Computing and Applications. Springer, 515–531.
  • Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
  • Khan et al. (2010) Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. A review of machine learning algorithms for text-documents classification. Journal of advances in information technology 1, 1 (2010), 4–20.
  • Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014. 1746–1751.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. NIPS Workshop on Bayesian Deep Learning (2016).
  • Kowsari et al. (2019) Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information 10, 4 (2019), 150.
  • Lang (1995) Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning. 331–339.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning. PMLR, 1188–1196.
  • Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia-A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
  • Lei et al. (2021) Fangyuan Lei, Xun Liu, Zhengming Li, Qingyun Dai, and Senhong Wang. 2021. Multihop neighbor information fusion graph convolutional network for text classification. Mathematical Problems in Engineering 2021 (2021).
  • Li et al. (2021) Chen Li, Xutan Peng, Hao Peng, Jianxin Li, and Lihong Wang. 2021. TextGTL: Graph-based Transductive Learning for Semi-Supervised Text Classification via Structure-Sensitive Interpolation. In IJCAI.
  • Li et al. (2022) Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S Yu, and Lifang He. 2022. A Survey on Text Classification: From Traditional to Deep Learning. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 2 (2022), 1–41.
  • Li et al. (2019) Wei Li, Shuheng Li, Shuming Ma, Yancheng He, Deli Chen, and Xu Sun. 2019. Recursive graphical neural networks for text classification. arXiv preprint arXiv:1909.08166 (2019).
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated Graph Sequence Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Liang et al. (2021) Zijian Liang, Hui Ding, and Wenlong Fu. 2021. A Survey on Graph Neural Networks for Recommendation. In 2021 International Conference on Culture-oriented Science & Technology (ICCST). IEEE, 383–386.
  • Liao et al. (2021a) Wenlong Liao, Birgitte Bak-Jensen, Jayakrishnan Radhakrishna Pillai, Yuelong Wang, and Yusen Wang. 2021a. A review of graph neural networks and their applications in power systems. Journal of Modern Power Systems and Clean Energy (2021).
  • Liao et al. (2021b) Wenxiong Liao, Bi Zeng, Jianqi Liu, Pengfei Wei, Xiaochun Cheng, and Weiwen Zhang. 2021b. Multi-level graph neural network for text sentiment analysis. Computers & Electrical Engineering 92 (2021), 107096.
  • Lilleberg et al. (2015) Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. 2015. Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 136–140.
  • Lin et al. (2021) Yuxiao Lin, Yuxian Meng, Xiaofei Sun, Qinghong Han, Kun Kuang, Jiwei Li, and Fei Wu. 2021. BertGCN: Transductive Text Classification by Combining GNN and BERT. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 1456–1462.
  • Linmei et al. (2019) Hu Linmei, Tianchi Yang, Chuan Shi, Houye Ji, and Xiaoli Li. 2019. Heterogeneous graph attention networks for semi-supervised short text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 4821–4830.
  • Liu et al. (2022) Chuang Liu, Yibing Zhan, Chang Li, Bo Du, Jia Wu, Wenbin Hu, Tongliang Liu, and Dacheng Tao. 2022. Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities. arXiv preprint arXiv:2204.07321 (2022).
  • Liu et al. (2015) Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuan-Jing Huang. 2015. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of the 2015 conference on empirical methods in natural language processing. 2326–2335.
  • Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent Neural Network for Text Classification with Multi-Task Learning. In IJCAI.
  • Liu et al. (2020) Xien Liu, Xinxin You, Xiao Zhang, Ji Wu, and Ping Lv. 2020. Tensor graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 8409–8416.
  • Liu et al. (2021b) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT Understands, Too. arXiv:2103.10385 (2021).
  • Liu et al. (2021a) Yonghao Liu, Renchu Guan, Fausto Giunchiglia, Yanchun Liang, and Xiaoyue Feng. 2021a. Deep Attention Diffusion Graph Neural Networks for Text Classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8142–8152.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu and Tan (2021) Zhanghui Liu and Huachun Tan. 2021. Traffic Prediction with Graph Neural Network: A Survey. In CICTP 2021. 467–474.
  • Lu et al. (2020) Zhibin Lu, Pan Du, and Jian-Yun Nie. 2020. VGCN-BERT: augmenting BERT with graph embedding for text classification. In European Conference on Information Retrieval. Springer, 369–382.
  • Maas et al. (2011) Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 142–150.
  • Malekzadeh et al. (2021) Masoud Malekzadeh, Parisa Hajibabaee, Maryam Heidari, Samira Zad, Ozlem Uzuner, and James H Jones. 2021. Review of graph neural network in text classification. In 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). IEEE, 0084–0091.
  • Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations). The Association for Computer Linguistics, 55–60. http://dblp.uni-trier.de/db/conf/acl/acl2014-d.html#ManningSBFBM14
  • Marin et al. (2014) Alex Marin, Roman Holenstein, Ruhi Sarikaya, and Mari Ostendorf. 2014. Learning phrase patterns for text classification using a knowledge graph and unlabeled data. In Fifteenth annual conference of the international speech communication association.
  • Mariyam et al. (2021) Ayesha Mariyam, SK Althaf Hussain Basha, and S Viswanadha Raju. 2021. A literature survey on recurrent attention learning for text classification. In IOP Conference Series: Materials Science and Engineering, Vol. 1042. IOP Publishing, 012030.
  • Matsuo et al. (2006) Yutaka Matsuo, Takeshi Sakaki, Kôki Uchiyama, and Mitsuru Ishizuka. 2006. Graph-based word clustering using a web search engine. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 542–550.
  • Mei et al. (2021) Xin Mei, Xiaoyan Cai, Libin Yang, and Nanxin Wang. 2021. Graph transformer networks based text representation. Neurocomputing 463 (2021), 91–100.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Minaee et al. (2021) Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning–based text classification: a comprehensive review. ACM Computing Surveys (CSUR) 54, 3 (2021), 1–40.
  • Mirończuk and Protasiewicz (2018) Marcin Michał Mirończuk and Jarosław Protasiewicz. 2018. A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications 106 (2018), 36–54.
  • Mou et al. (2015) Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422 (2015).
  • Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International conference on machine learning. PMLR, 2014–2023.
  • Nikolentzos et al. (2020) Giannis Nikolentzos, Antoine Tixier, and Michalis Vazirgiannis. 2020. Message passing attention networks for document understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8544–8551.
  • Ostendorff et al. (2019) Malte Ostendorff, Peter Bourgonje, Maria Berger, Julian Moreno-Schneider, Georg Rehm, and Bela Gipp. 2019. Enriching bert with knowledge graph embeddings for document classification. arXiv preprint arXiv:1909.08402 (2019).
  • Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075 (2005).
  • Pasa et al. (2021) Luca Pasa, Nicolò Navarin, Wolfgang Erb, and Alessandro Sperduti. 2021. Simple Graph Convolutional Networks. (2021). https://doi.org/10.48550/ARXIV.2106.05809
  • Patra and Singh (2013) Anuradha Patra and Divakar Singh. 2013. A survey report on text classification with different term weighing methods and comparison between classification algorithms. International Journal of Computer Applications 75, 7 (2013).
  • Peng et al. (2018) Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 world wide web conference. 1063–1072.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. https://doi.org/10.18653/v1/N18-1202
  • Piao et al. (2021) Yinhua Piao, Sangseon Lee, Dohoon Lee, and Sun Kim. 2021. Sparse Structure Learning via Graph Neural Networks for Inductive Document Classification. arXiv preprint arXiv:2112.06386 (2021).
  • Pintas et al. (2021) Julliano Trindade Pintas, Leandro AF Fernandes, and Ana Cristina Bicharra Garcia. 2021. Feature selection methods for text classification: a systematic literature review. Artificial Intelligence Review 54, 8 (2021), 6149–6200.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Ragesh et al. (2021) Rahul Ragesh, Sundararajan Sellamanickam, Arun Iyer, Ramakrishna Bairi, and Vijay Lingam. 2021. Hetegcn: Heterogeneous graph convolutional networks for text classification. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 860–868.
  • Ren et al. (2016) Yafeng Ren, Ruimin Wang, and Donghong Ji. 2016. A topic-enhanced word embedding for Twitter sentiment classification. Information Sciences 369 (2016), 188–198.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20, 1 (2008), 61–80.
  • Schenker et al. (2004) Adam Schenker, Mark Last, Horst Bunke, and Abraham Kandel. 2004. Classification of web documents using graph matching. International Journal of Pattern Recognition and Artificial Intelligence 18, 03 (2004), 475–496.
  • Selva Birunda and Kanniga Devi (2021) S Selva Birunda and R Kanniga Devi. 2021. A review on word embedding techniques for text classification. Innovative Data Communication Technologies and Application (2021), 267–281.
  • Shah and Patel (2016) Foram P Shah and Vibha Patel. 2016. A review on feature selection and feature extraction for text classification. In 2016 international conference on wireless communications, signal processing and networking (WiSPNET). IEEE, 2264–2268.
  • Silva et al. (2014) Fernanda B Silva, Salvatore Tabbone, and Ricardo da S Torres. 2014. Bog: A new approach for graph matching. In 2014 22nd International Conference on Pattern Recognition. IEEE, 82–87.
  • Skarding et al. (2021) Joakim Skarding, Bogdan Gabrys, and Katarzyna Musial. 2021. Foundations and modeling of dynamic networks using dynamic graph neural networks: A survey. IEEE Access 9 (2021), 79143–79168.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing. 1631–1642.
  • Szummer and Jaakkola (2001) Martin Szummer and Tommi Jaakkola. 2001. Partially labeled classification with Markov random walks. Advances in neural information processing systems 14 (2001).
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
  • Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing. 1422–1432.
  • Tang et al. (2020) Hengliang Tang, Yuan Mi, Fei Xue, and Yang Cao. 2020. An Integration Model Based on Graph Convolutional Network for Text Classification. IEEE Access 8 (2020), 148865–148876.
  • Thekumparampil et al. (2018) Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735 (2018).
  • Thomas et al. (2022) Josephine M Thomas, Alice Moallemy-Oureh, Silvia Beddar-Wiesing, and Clara Holzhüter. 2022. Graph Neural Networks Designed for Different Graph Types: A Survey. arXiv preprint arXiv:2204.03080 (2022).
  • Uryupina et al. (2014) Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, and Alessandro Moschitti. 2014. SenTube: A corpus for sentiment analysis on YouTube social media. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). 4244–4249.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations (2018). https://openreview.net/forum?id=rJXMpikCZ
  • Vijayan et al. (2017) Vikas K Vijayan, KR Bindu, and Latha Parameswaran. 2017. A comprehensive study of text classification algorithms. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 1109–1113.
  • Wang et al. (2021) Jianian Wang, Sheng Zhang, Yanghua Xiao, and Rui Song. 2021. A review on graph neural network methods in financial applications. arXiv preprint arXiv:2111.15367 (2021).
  • Wang et al. (2022b) Kunze Wang, Soyeon Caren Han, Siqu Long, and Josiah Poon. 2022b. ME-GCN: multi-dimensional edge-embedded graph convolutional networks for semi-supervised text classification. arXiv preprint arXiv:2204.04618 (2022).
  • Wang et al. (2022a) Kunze Wang, Soyeon Caren Han, and Josiah Poon. 2022a. InducT-GCN: Inductive graph convolutional networks for text classification. In 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 1243–1249.
  • Wang et al. (2020) Zhengjue Wang, Chaojie Wang, Hao Zhang, Zhibin Duan, Mingyuan Zhou, and Bo Chen. 2020. Learning dynamic hierarchical topic graph with graph convolutional network for document classification. In International Conference on Artificial Intelligence and Statistics. PMLR, 3959–3969.
  • Waseem (2016) Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the first workshop on NLP and computational social science. 138–142.
  • Welling and Kipf (2016) Max Welling and Thomas N Kipf. 2016. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR 2017).
  • Wu et al. (2019b) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019b. Simplifying graph convolutional networks. In International conference on machine learning. PMLR, 6861–6871.
  • Wu et al. (2021) Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, Hanning Gao, Shucheng Li, Jian Pei, and Bo Long. 2021. Graph neural networks for natural language processing: A survey. arXiv preprint arXiv:2106.06090 (2021).
  • Wu et al. (2019a) Man Wu, Shirui Pan, Xingquan Zhu, Chuan Zhou, and Lei Pan. 2019a. Domain-adversarial graph neural networks for text classification. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 648–657.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
  • Xie et al. (2021) Qianqian Xie, Jimin Huang, Pan Du, Min Peng, and Jian-Yun Nie. 2021. Inductive topic variational graph auto-encoder for text classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4218–4227.
  • Xie et al. (2022) Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. 2022. Self-supervised learning of graph neural networks: A unified review. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Xing et al. (2010) Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A brief survey on sequence classification. ACM Sigkdd Explorations Newsletter 12, 1 (2010), 40–48.
  • Yan et al. (2013) Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web. 1445–1456.
  • Yang et al. (2021) Yuxuan Yang, Yingmei Wei, and Tianrui Shen. 2021. A Review of Graph Neural Networks for Recommender Applications. In 2021 IEEE International Conference on Unmanned Systems (ICUS). IEEE, 602–607.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 1480–1489.
  • Yao et al. (2019) Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 7370–7377.
  • Ye et al. (2020) Zhihao Ye, Gongyao Jiang, Ye Liu, Zhiyong Li, and Jin Yuan. 2020. Document and word representations generated by graph convolutional network and bert for short text classification. In ECAI 2020. IOS Press, 2275–2281.
  • Yin and Jin (2015) Yanping Yin and Zhong Jin. 2015. Document sentiment classification based on the word embedding. In 2015 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering. Atlantis Press, 456–461.
  • Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems 31 (2018).
  • Zhang and Zhang (2020) Haopeng Zhang and Jiawei Zhang. 2020. Text Graph Transformer for Document Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 8322–8327.
  • Zhang and Meng (2019) Jiawei Zhang and Lin Meng. 2019. Gresnet: Graph residual network for reviving deep gnns from suspended animation. arXiv preprint arXiv:1909.05729 (2019).
  • Zhang et al. (2019) Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. 2019. Graph convolutional networks: a comprehensive review. Computational Social Networks 6, 1 (2019), 1–23.
  • Zhang et al. (2011) Wen Zhang, Taketoshi Yoshida, and Xijin Tang. 2011. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert systems with applications 38, 3 (2011), 2758–2765.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems 28 (2015).
  • Zhang et al. (2021) Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. 2021. Graph neural networks and their current applications in bioinformatics. Frontiers in genetics 12 (2021).
  • Zhang et al. (2020) Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 334–339.
  • Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
  • Zhou et al. (2015) Mingyuan Zhou, Yulai Cong, and Bo Chen. 2015. The Poisson Gamma Belief Network. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 3043–3051.
  • Zhou and Li (2005) Xueyuan Zhou and Chunping Li. 2005. Text Classification by Markov Random Walks with Reward. In DMIN. Citeseer, 275–278.
  • Zhou (2020) Yifan Zhou. 2020. A review of text classification based on deep learning. In Proceedings of the 2020 3rd international conference on geoinformatics and data analysis. 132–136.
  • Zhu and Koniusz (2020) Hao Zhu and Piotr Koniusz. 2020. Simple spectral graph convolution. In International Conference on Learning Representations.
  • Zulqarnain et al. (2020) Muhammad Zulqarnain, Rozaida Ghazali, Yana Mazwin Mohmad Hassim, and Muhammad Rehan. 2020. A comparative review on deep learning models for text classification. Indones. J. Electr. Eng. Comput. Sci 19, 1 (2020), 325–335.