Finding core labels for maximizing generalization of graph neural networks

Sichao Fu; Xueqi Ma; Yibing Zhan; Fanyu You; Qinmu Peng; Tongliang Liu; James Bailey; Danilo Mandic

doi:10.1016/j.neunet.2024.106635

Neural Netw. 2024 Dec:180:106635. doi: 10.1016/j.neunet.2024.106635. Epub 2024 Aug 14.

Authors

Sichao Fu¹, Xueqi Ma², Yibing Zhan³, Fanyu You⁴, Qinmu Peng⁵, Tongliang Liu⁶, James Bailey⁷, Danilo Mandic⁸

Affiliations

¹ School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.
² School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia.
³ JD Explore Academy, Beijing 100176, China.
⁴ University of Southern California, Los Angeles 90005, USA.
⁵ School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.
⁶ Trustworthy Machine Learning Lab, School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia.
⁷ School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia.
⁸ Department of Electrical Engineering, Imperial College London, London SW7 2BX, UK.

PMID: 39173205

Abstract

Graph neural networks (GNNs) have become a popular approach for semi-supervised graph representation learning. GNN research has generally focused on improving methodological details, while less attention has been paid to the importance of data labeling. However, for semi-supervised learning, the quality of the training data is vital. In this paper, we first introduce and elaborate on the problem of training data selection for GNNs. More specifically, focusing on node classification, we aim to select representative nodes from a graph to serve as training data, so that GNNs trained on them achieve the best performance. To solve this problem, we draw inspiration from the popular lottery ticket hypothesis, typically used for sparse architectures, and propose the following subset hypothesis for graph data: "There exists a core subset when selecting a fixed-size dataset from the dense training dataset, that can represent the properties of the dataset, and GNNs trained on this core subset can achieve a better graph representation". Equipped with this subset hypothesis, we present an efficient algorithm to identify the core data in the graph for GNNs. Extensive experiments demonstrate that the selected data, used as a training set, yield performance improvements across various datasets and GNN architectures.
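To make the problem setting concrete, the following is a minimal sketch of fixed-size training-node selection for node classification. It is not the algorithm from the paper, which the abstract does not describe. Node degree is used here purely as a stand-in for "representativeness", and the function name `select_by_degree`, the `per_class` parameter, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def select_by_degree(edges, labels, per_class):
    """Pick `per_class` labeled nodes per class, preferring high-degree nodes.

    Degree is a placeholder notion of "representative"; it is NOT the
    selection criterion used in the paper.
    """
    # Count how many edges touch each node (undirected graph assumed).
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Group candidate nodes by their class label.
    by_class = defaultdict(list)
    for node, label in labels.items():
        by_class[label].append(node)

    selected = []
    for label in sorted(by_class):
        # Rank by descending degree; break ties by node id for determinism.
        ranked = sorted(by_class[label], key=lambda n: (-degree[n], n))
        selected.extend(ranked[:per_class])
    return selected


# Toy graph with six nodes and two classes (illustrative data only).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
train_nodes = select_by_degree(edges, labels, per_class=1)
```

The selected nodes would then form the training set for a GNN, while the remaining nodes are used only as unlabeled context. Any other scoring rule can be swapped in for degree without changing the fixed-size selection structure.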

Keywords: Data-centric; Graph neural networks; Node classification; Semi-supervised learning.

MeSH terms

Algorithms*
Humans
Neural Networks, Computer*
Supervised Machine Learning