OAG-Bench: A Human-Curated Benchmark for
Academic Graph Mining

(2024)

Abstract.

With the rapid proliferation of scientific literature, versatile academic knowledge services increasingly rely on comprehensive academic graph mining. Despite the availability of public academic graphs, benchmarks, and datasets, these resources often fall short in multi-aspect and fine-grained annotations, are constrained to specific task types and domains, or lack underlying real academic graphs. In this paper, we present OAG-Bench, a comprehensive, multi-aspect, and fine-grained human-curated benchmark based on the Open Academic Graph (OAG). OAG-Bench covers $10$ tasks, $20$ datasets, $70+$ baselines, and $120+$ experimental results to date. We propose new data annotation strategies for certain tasks and offer a suite of data pre-processing code, algorithm implementations, and standardized evaluation protocols to facilitate academic graph mining. Extensive experiments reveal that even advanced algorithms like large language models (LLMs) encounter difficulties in addressing key challenges in certain tasks, such as paper source tracing and scholar profiling. We also introduce the Open Academic Graph Challenge (OAG-Challenge) to encourage community input and sharing. We envisage that OAG-Bench can serve as a common ground for the community to evaluate and compare algorithms in academic graph mining, thereby accelerating algorithm development and advancement in this field. OAG-Bench is accessible at https://www.aminer.cn/data/.

Keywords: academic knowledge graph; benchmark; academic graph mining

Published in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. DOI: 10.1145/3637528.3672354. ISBN: 979-8-4007-0490-1/24/08. CCS concepts: Information systems → Digital libraries and archives; Information systems → Data mining.

1. Introduction

The overarching goal of academic data mining is to deepen our comprehension of the development, nature, and trends of science. It offers the potential to unlock enormous scientific, technological, and educational value (Wang and Barabási, 2021). For example, deep mining from academic data can assist governments in making scientific policies, support companies in talent discovery, and help researchers acquire new knowledge more efficiently.

Table 1. Comparison between academic knowledge graphs (AKG) and academic benchmarks. Biomed: biomedicine.

AKG / Benchmark	Multiple Tasks	Domain	Task Type	Baseline Code	Leaderboard
MAG (Sinha et al., 2015)	-	All	-	-	-
OAG (Zhang et al., 2019, 2023a)	-	All	-	-	-
AceKG (Wang et al., 2018)	Partial	All	Graph	-	-
S2ORC (Lo et al., 2020)	✓	All	NLP	Partial	✓
BLURB (Gu et al., 2021)	✓	Biomed.	NLP	✓	✓
OAG-Bench	✓	All	Diverse	✓	✓

The landscape of academic data mining is rich with entity-centric applications, such as paper recommendation, expert finding, and venue recommendation. Several popular academic mining systems, such as Semantic Scholar (https://www.semanticscholar.org/), ResearchGate (https://www.researchgate.net/), and AMiner (https://www.aminer.cn/), are all powered by academic knowledge graphs (AKG; in this paper, we use academic knowledge graph and academic graph interchangeably). Based on different data sources, there have been multiple public academic graphs and academic benchmarks, such as Microsoft Academic Graph (MAG) (Sinha et al., 2015) and S2ORC (Lo et al., 2020). A comparative overview of these academic resources is presented in Table 1. However, existing popular datasets still have several shortcomings that may hinder promising explorations, summarized as follows:

• Public academic graphs, such as MAG and OAG (Zhang et al., 2019), lack multi-aspect and fine-grained annotations, impeding potential evaluation of downstream tasks on top of them.

• Academic benchmarks, such as S2ORC and BLURB (Gu et al., 2021), are limited to specific task types (e.g., NLP tasks) and domains (e.g., biomedicine), which may not cover the full spectrum of academic tasks, such as various graph-based tasks.

• Separate academic datasets, such as PubMedQA (Jin et al., 2019) and concept taxonomy datasets (Shen et al., 2020), often do not include or align with large-scale and comprehensive academic graphs, resulting in a divergence from real-world scenarios.

Present Work. To this end, we introduce OAG-Bench, a meticulously human-annotated academic benchmark for academic graph mining. OAG-Bench currently includes ten tasks, $20$ datasets, $70+$ baseline methods, and $120+$ experimental results. Figure 1 provides an overview of OAG-Bench. Specifically,

(1) For the design principles of OAG-Bench, we aim to conduct comprehensive and fine-grained annotations on the large-scale OAG for the full life cycle of academic graph mining. First, we annotate the nodes and edges of the academic knowledge graph and identify valuable and challenging tasks during this process, such as author name disambiguation. Then, powered by the academic graph, academic applications explore tasks beyond the academic graph itself and study knowledge acquisition and cognitive impact, such as paper source tracing (cf. Section 3).

(2) For the datasets in OAG-Bench, we construct various human-curated datasets for diverse tasks. We also propose new annotation strategies for certain tasks, such as checking inconsistent paper assignments across sources for incorrect paper-author assignment detection and marking the sources of papers via online paper reading groups. Notably, ten datasets in eight tasks are newly constructed. The dataset sizes in OAG-Bench range from thousands to millions.

(3) For the evaluation of OAG-Bench, we provide corresponding data processing methods, evaluation metrics, and at least three baseline methods for each task. OAG-Bench implements a wide range of methods, covering traditional machine learning methods, shallow convolutional/recurrent/graph neural networks, LLMs, etc. Experimental investigations show that advanced generation-based LLMs hold promise in some tasks like author name disambiguation, but they still struggle with tasks like scholar profiling and paper source tracing.

To sum up, OAG-Bench makes the following contributions. First, we provide multi-aspect and fine-grained human-curated datasets that cover the full life cycle of academic graph mining. Second, we release data pre-processing code, algorithm implementations, and standardized evaluation protocols to help researchers get started quickly in academic graph mining. Finally, based on OAG-Bench, interested researchers or practitioners can develop advanced AKG-based algorithms, study foundation models for academic graph mining, and so forth.

Figure 1. OAG-Bench overview.

2. Background

This section first gives the formal definition of academic knowledge graphs and then introduces related academic datasets.

2.1. Academic Knowledge Graph

An academic knowledge graph (AKG) is defined as a graph $AKG=\{E,R\}$ where each entity $e\in E$ and each relation $r\in R$ are associated with type mapping functions $\tau(e):E\to C$ and $\phi(r):R\to D$ , respectively. $C$ and $D$ represent the sets of entity and relation types with $|C|>1$ and $|D|>1$ . Each entity pair $e_{1}$ and $e_{2}$ is linked by a specific relation $r\in R$ to form a tuple $(e_{1},r,e_{2})$ .

For instance, an academic graph is a heterogeneous entity graph that encompasses multiple types of entities, such as authors, papers, and venues. The relation set, represented by $D$ , includes several key relationships: the authorship relation, which connects authors and papers; the paper-publish-in-venue relation that links papers to the venues where they are published; the co-authorship relation, indicating collaborations between authors, etc.
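To make the definition concrete, the following is a minimal sketch of an AKG stored as typed triples $(e_{1},r,e_{2})$. It is not the OAG data model: the type vocabularies, the class names `Entity` and `Triple`, and the helper `build_akg` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical type sets: C (entity types) and D (relation types).
# Both contain more than one type, as the definition requires.
ENTITY_TYPES = {"author", "paper", "venue"}
RELATION_TYPES = {"writes", "published_in", "coauthor_with"}


@dataclass(frozen=True)
class Entity:
    id: str
    type: str  # plays the role of tau(e)


@dataclass(frozen=True)
class Triple:
    head: Entity
    relation: str  # relation type, plays the role of phi(r)
    tail: Entity


def build_akg(triples):
    """Validate entity/relation types and index triples by head entity id."""
    index = {}
    for t in triples:
        if t.head.type not in ENTITY_TYPES or t.tail.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type in {t}")
        if t.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type in {t}")
        index.setdefault(t.head.id, []).append(t)
    return index


# Toy graph: one author writes one paper, which is published in one venue.
alice = Entity("a1", "author")
paper = Entity("p1", "paper")
kdd = Entity("v1", "venue")
akg = build_akg([
    Triple(alice, "writes", paper),
    Triple(paper, "published_in", kdd),
])
```

Indexing by head entity is only one possible layout; a graph of the scale described here would normally live in a dedicated store rather than an in-memory dictionary.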

2.2. Academic Datasets

Some organizations have made their academic graphs available, including MAG (Sinha et al., 2015), OAG (Zhang et al., 2019, 2023a), AceKG (Wang et al., 2018), OpenAlex (https://openalex.org/), and CrossRef (https://www.crossref.org/). These graphs are typically large-scale, but are rarely carefully annotated to benchmark a wide range of academic tasks. Additionally, some benchmarks based on academic corpora have been proposed, such as S2ORC (Lo et al., 2020), SciDocs (Cohan et al., 2020), and BLURB (Gu et al., 2021), but they mainly target NLP tasks and overlook the intricate structure of academic graphs.

To bridge the gap, our objective is to meticulously annotate large-scale academic graphs to benchmark various tasks for academic graph mining. Our initiative, OAG-Bench, leverages the Open Academic Graph (OAG) (https://www.aminer.cn/open-academic-graph), which was initially generated by linking two large academic graphs: MAG and AMiner. OAG aligned large-scale entities in MAG and AMiner, including papers, authors, affiliations, and venues, with an accuracy of over $97\%$. It has made available the alignment relations between these two graphs alongside their metadata. As MAG shut down its service at the end of 2021, OAG has expanded its data sources to include PubMed, ArXiv, CrossRef, and so forth. To date, five versions of OAG have been released, amassing around 700 million entities and 2 billion relations.

3. OAG-Bench Framework

In this section, we first present the overall design principles of OAG-Bench and then describe the detailed workflow for constructing comprehensive, high-quality datasets.

As depicted in Figure 2, we host a series of data collection and annotation efforts to conduct multi-aspect and fine-grained labeling based on the OAG. The framework aims to leverage high-quality academic knowledge graphs (AKG) to facilitate academic data mining. Therefore, the framework is structured into two types of tasks: AKG construction and AKG-empowered academic applications. AKG construction focuses on disambiguating or enriching graph nodes and correcting or completing graph edges, consisting of Academic Entity Construction and Academic Graph Completion. Beyond basic academic relationships, academic applications delve into knowledge and cognition, consisting of Academic Knowledge Acquisition and Academic Trace and Prediction.

(1) Academic Entity Construction. The construction of academic entities is fundamental to the construction of academic graphs. This stage mainly identifies the identical real-world entities across data sources. For notoriously ambiguous entities, i.e., authors, we further incorporate the author name disambiguation task.

(2) Academic Graph Completion. Building upon the conflated academic entities, this stage aims to establish connections between different entities to complete and enrich academic graphs. Specifically, we engage in fine-grained scholar profiling labeling and attach concepts to entities, such as authors, papers, and concepts.

(3) Academic Knowledge Acquisition. On top of high-quality academic graphs, this stage focuses on the acquisition of academic knowledge and models the multifaceted relations between users and papers. We gather user behavior records from real academic systems to build corresponding datasets.

(4) Academic Trace and Prediction. Besides the correlation between academic knowledge and users, this stage aims to further explore the cognitive influence exerted by papers and authors. It involves retrospective analysis to pinpoint the pivotal references that have inspired a research paper. Looking forward, the challenge lies in forecasting impactful papers or authors.

Table 2 summarizes the specifics of the datasets in OAG-Bench. OAG-Bench includes diverse tasks and datasets since the construction of academic graphs is complex and cannot be conducted end-to-end. Furthermore, the applications of academic graphs also involve diverse paper-centric, author-centric, and user-centric services. Although OAG-Bench currently includes ten different tasks, these tasks can facilitate each other. For instance, profiling scholars precisely can help to attach concept tags to scholars. These tasks can also foster other academic tasks. For example, paper recommendation datasets are also valuable assets for similar paper search. In the following, we present the design choices of the tasks in each module, the corresponding task definitions, and the construction methods of the related datasets.

Table 2. Dataset overview in OAG-Bench. The #Datasets column reports #Datasets/#New datasets for each task.

Task	Data Source	#Datasets	Data Volume	#Baselines	Data characteristics
Entity alignment	AMiner/DBLP/MAG	3/2	1K-10K	5	Matching heterogeneous entities
Author name disambiguation	AMiner	3/1	1M	10	Million-scale human-annotated data
Scholar profiling	AMiner	2/1	2K-9K	13	Long attribute extraction for long texts
Entity tagging	AMiner	2/1	11K-900K	12	A large number of class labels
Concept taxonomy completion	MAG/AMiner	3/1	1K-300K	3	Professionally-labeled data for the AI field
Paper recommendation	AMiner	1/0	10K	4	User click records in a real academic system
Reviewer recommendation	Frontiers	1/1	200K	4	Authentic public review records
Academic question answering	Zhihu/StackExchange	1/0	18K	6	Automatic QA for academic domain
Paper source tracing	AMiner	1/1	2K	11	Careful annotations by researchers
Academic influence prediction	AMiner	3/2	1K-1M	12	Summarizing Test-of-Time award in CS*

* CS: computer science.

3.1. Academic Entity Construction

To integrate academic data from multiple sources, various types of entities need to be aligned. Thus, we first include the entity alignment task. In view of the severe ambiguity of author names, we further add the author name disambiguation task.

Entity Alignment. Given two entity sets $E_{1}$ and $E_{2}$, the goal of entity alignment is to generate entity matchings $L=\{(e_{1},e_{2})|e_{1}\in E_{1},e_{2}\in E_{2}\}$ such that $e_{1}$ and $e_{2}$ refer to the same real-world entity. Specifically, we consider three types of entities, i.e., authors, affiliations, and venues. For dataset annotation, we randomly sample a venue set and an affiliation set. We then manually label venue pairs that have high similarity as calculated by the Jaccard index, and construct affiliation alignment pairs from the aliases or former names listed in the information box of their Wikipedia entries. We use Wikipedia because of its high data quality, which also allows us to obtain accurate positive affiliation alignment pairs without manual labeling. For author alignment, we sample top-viewed computer science authors from AMiner, and then manually pair them with DBLP authors according to their affiliations, published venues, and papers. As a result, we construct $1{,}200$ venue pairs, $5{,}000$ affiliation pairs, and $10{,}000$ author pairs.
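The venue pre-selection step can be sketched as follows. This is an illustration only, not the OAG-Bench annotation pipeline: the text does not state which sets the Jaccard index is computed over, so the word-level tokenization, the function names, and the `threshold` value are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over lowercase word sets (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_pairs(venues_1, venues_2, threshold=0.5):
    """Return cross-source venue pairs at or above the threshold, most similar first.

    The surviving pairs are candidates for manual labeling, not final matches.
    """
    scored = [(v1, v2, jaccard(v1, v2)) for v1 in venues_1 for v2 in venues_2]
    return sorted((p for p in scored if p[2] >= threshold), key=lambda p: -p[2])


source_a = ["ACM SIGKDD Conference on Knowledge Discovery and Data Mining"]
source_b = [
    "SIGKDD Conference on Knowledge Discovery and Data Mining",
    "International Conference on Machine Learning",
]
pairs = candidate_pairs(source_a, source_b)
```

The all-pairs comparison is quadratic in the number of venues, which is acceptable for a sampled venue set but would need blocking or indexing at full graph scale.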

Author Name Disambiguation (AND). Aiming to disambiguate the same-name authors, AND is a key and challenging task in academic knowledge graph construction. We adopt the WhoIsWho (Chen et al., 2023b) dataset, a million-scale human-annotated dataset for author name disambiguation. WhoIsWho breaks down the task into three subtasks: (1) From-scratch Name Disambiguation (SND), (2) Real-time Name Disambiguation (RND), and (3) Incorrect Assignment Detection (IND). While existing research primarily concentrates on SND and RND, the IND task has received less attention despite its growing importance with the expansion of academic databases. Given an author profile with paper lists, IND aims to detect papers incorrectly assigned to this author. If we randomly selected author profiles and annotated their paper assignments, we would likely encounter many profiles with little ambiguity. Thus, we propose an effective cross-checking annotation strategy. Specifically, we utilize existing paper alignments and author alignments between AMiner and DBLP, and then gather inconsistent paper-author assignments for further expert checking. This strategy ensures that the profiles under checking have a high likelihood of inaccuracy (the error rate exceeds 30% for these inconsistencies within AMiner). Subsequently, all papers associated with AMiner authors that have incorrect assignments are manually checked by experts. Finally, the refined IND dataset includes $1{,}691$ authors and $326{,}738$ papers, with an assignment error rate of $11.32\%$, reaching $1.5$ times the number of papers of the IND task in WhoIsWho.
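The cross-checking strategy can be illustrated with a small sketch. It is a simplified reading of the description above, not the released annotation code: the dictionary layouts, the identifiers, and the function name `inconsistent_assignments` are hypothetical. A paper-author assignment in AMiner is flagged when both the paper and the author have an aligned DBLP counterpart, but DBLP does not assign that paper to that author.

```python
def inconsistent_assignments(aminer_papers, dblp_papers, author_map, paper_map):
    """Collect AMiner (author, paper) assignments that DBLP does not confirm.

    aminer_papers: AMiner author id -> set of AMiner paper ids
    dblp_papers:   DBLP author id   -> set of DBLP paper ids
    author_map:    AMiner author id -> aligned DBLP author id
    paper_map:     AMiner paper id  -> aligned DBLP paper id
    """
    flagged = []
    for author_id, papers in aminer_papers.items():
        dblp_author = author_map.get(author_id)
        if dblp_author is None:
            continue  # no aligned author, so nothing to cross-check
        confirmed = dblp_papers.get(dblp_author, set())
        for paper_id in sorted(papers):
            dblp_paper = paper_map.get(paper_id)
            # Only aligned papers can be compared across the two sources.
            if dblp_paper is not None and dblp_paper not in confirmed:
                flagged.append((author_id, paper_id))
    return flagged


flagged = inconsistent_assignments(
    aminer_papers={"am_a1": {"am_p1", "am_p2"}},
    dblp_papers={"db_a1": {"db_p1"}},
    author_map={"am_a1": "db_a1"},
    paper_map={"am_p1": "db_p1", "am_p2": "db_p2"},
)
```

Flagged pairs are only candidates: as the text notes, the inconsistency may come from either source, which is why expert checking follows.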

3.2. Academic Graph Completion

Academic graph completion aims to enrich academic graphs from two aspects — entities and relations. To enrich entities, we include the scholar profiling task to extract multidimensional attributes for authors. To enrich relations, we include the entity tagging task to attach concepts to entities. Concepts are abstract entities that can endow semantics to entities. To further build a hierarchical knowledge structure, we also include the concept taxonomy completion task to identify hypernyms and hyponyms for new concepts.

Scholar Profiling. Profiling scholars from big data is a vital task in scholar mining, and it is becoming increasingly difficult due to data fragmentation, lengthy texts, data noise, etc. Previous works on scholar profiling usually extract attributes from scholars’ homepages or search engines. In OAG-Bench, besides profiling scholars from search engines, we introduce a new complex setting, Multidimensional Scholar Profiling from Long Texts, which aims to extract multiple attributes from lengthy texts. Each extracted attribute is identified by its starting and ending positions in the text. Importantly, long attributes are also taken into consideration, such as work experience and education experience. These attributes can often exceed 100 tokens. Traditional scholar profiling or named entity recognition tasks seldom focus on extracting such long attributes. For data annotation, scholars with detailed biographical descriptions are randomly sampled. Then, we manually label the starting and ending positions of each attribute in the texts. Finally, we construct a dataset of $2{,}099$ scholars with $12$ attributes.

Entity Tagging. Aiming at associating entities with concept labels, entity tagging is an important step in building semantic and hierarchical academic graphs. We introduce scholar interest extraction and paper topic classification to attach concepts to scholars and papers, respectively. Scholar interest extraction aims to extract scholars’ research interests from their publications. Derived from the 2017 Open Academic Data Challenge (https://www.biendata.xyz/competition/scholar/), the dataset for this task contains $789$ manually annotated research interest tags for $11{,}357$ scholars and their papers. Paper topic classification aims to classify papers into several topics based on the paper citation network. For dataset construction, based on the DBLP paper citation network (https://originalstatic.aminer.cn/misc/dblp.v12.7z), each paper is assigned one of nine computer science topics (https://numbda.cs.tsinghua.edu.cn/~yuwj/TH-CPL.pdf) based on its publication venue. The topics are high-performance computing; computer networks; network and information security; theoretical computer science; system software and software engineering; database and data mining; artificial intelligence and pattern recognition; computer graphics and multimedia; and human-computer interaction and pervasive computing.

Concept Taxonomy Completion. Concept taxonomies are typically created manually by experts, for example by defining that “deep learning” falls under “machine learning”. The automatic construction of concept taxonomies is a critical challenge in the fast-evolving landscape of knowledge concepts, and it benefits the organization of knowledge in the realm of big data. Given an existing concept hierarchy tree (taxonomy) $T_{0}$ and a set of new concepts $C$, the goal of concept taxonomy completion is to predict, for each new concept $c\in C$, its hypernym $pa(c)\in T_{0}$ and hyponym $ch(c)\in T_{0}$ to complete and expand the existing concept hierarchy tree. We adopt two MAG taxonomies (Sinha et al., 2015) (MAG-Full and MAG-CS) as two taxonomy datasets. These two datasets are large-scale but not carefully verified by experts. Thus, we introduce a new dataset covering AI sub-fields, manually curated by AI researchers, with $1{,}335$ concepts and $1{,}283$ edges. The guidelines for edge construction refer to relevant textbooks and the ACM Computing Classification System.
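The effect of a single completion step on $T_{0}$ can be sketched as below. The sketch covers only how a predicted pair $(pa(c), ch(c))$ would be written into the taxonomy; predicting the pair is the actual task and is not shown. The edge-set representation, the function name `insert_concept`, and the choice to drop the direct hypernym-to-hyponym edge are assumptions made for illustration.

```python
def insert_concept(edges, concept, hypernym, hyponym=None):
    """Insert a new concept below `hypernym` and, optionally, above `hyponym`.

    edges is a set of (parent, child) pairs describing the existing taxonomy.
    A new edge set is returned; the input is left unchanged.
    """
    new_edges = set(edges)
    new_edges.add((hypernym, concept))
    if hyponym is not None:
        # Assumption: the new concept sits between the two, so the direct
        # hypernym -> hyponym edge is replaced rather than kept alongside.
        new_edges.discard((hypernym, hyponym))
        new_edges.add((concept, hyponym))
    return new_edges


taxonomy = {("machine learning", "convolutional neural network")}
completed = insert_concept(
    taxonomy,
    concept="deep learning",
    hypernym="machine learning",
    hyponym="convolutional neural network",
)
```

Here the new concept “deep learning” ends up between its predicted hypernym and hyponym, which mirrors the example given in the text.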

3.3. Academic Knowledge Acquisition

Academic services based on academic graphs make it convenient for researchers to acquire knowledge actively or passively. For passive academic recommendation, we include paper recommendation and reviewer recommendation. For active knowledge acquisition, we include the academic question answering task.

Paper Recommendation. As the volume of papers surges, researchers face increasing challenges in locating relevant literature. Given a user-paper bipartite graph $G=\{U,P,R\}$, where $U$ is the user set, $P$ is the paper set, and $R$ signifies interactions (e.g., clicks) between users and papers, the goal of paper recommendation is to predict the next paper a user will interact with (Zhang et al., 2023c). We collect user behavior data based on the real AMiner system. AMiner provides a real-time paper recommendation service for researchers on the homepage. Researchers can offer several keywords to subscribe to relevant research papers. The back-end recommendation engine makes recommendations based on the users’ historical click records. This dataset includes $5{,}340$ users, $14{,}967$ papers, and $163{,}084$ interactions as of October 2021. To ensure quality, only users with more than $10$ clicks and papers clicked more than $10$ times are included.

Reviewer Recommendation. As the volume of submissions to academic journals and conferences increases, reviewer recommendation becomes increasingly hard. Unlike paper recommendation, reviewer recommendation aims to pair papers with proficient and willing reviewers. Given a paper submission set $S$, a reviewer set $A$, and known paper-reviewer matches $R\subseteq S\times A$, the task is to predict the reviewer $a\in A$ for a new submission record $s_{i}\in S$. Additional information, including paper metadata and reviewer expertise, is available. For data collection, we extract real paper-reviewing records from the open-access platform Frontiers. After processing, the dataset includes $210{,}069$ reviewers and $225{,}478$ papers, with each paper having at least $2$ reviewers. Furthermore, we match reviewers to authors in OAG using names, affiliations, and research interests, linking approximately half of the reviewers to the OAG. These reviewers are associated with their respective publications.

Academic Question Answering. Traditional keyword-based information retrieval cannot satisfy professional knowledge retrieval in the era of artificial intelligence. For instance, consider the question, “Can neural networks be used to prove conjectures?”. How can answers and evidence be retrieved from scholarly literature? Given an academic question $q$ and a paper set $P^{q}=\{p^{q}_{1},p^{q}_{2},\ldots,p^{q}_{N}\}$, the goal of academic question answering is to select the most relevant papers from the candidate set $P^{q}$. We adopt the OAG-QA (Tam et al., 2022) dataset, which is derived from academic question-answering platforms. Question posts are retrieved from the StackExchange and Zhihu websites, the paper URLs mentioned in the answers are extracted, and each is matched with the corresponding paper in OAG (Zhang et al., 2019). The dataset comprises $17{,}948$ question-paper pairs. Questions cover $22$ disciplines and $87$ topics, forming a two-level hierarchical structure; that is, each topic belongs to a discipline. For each topic, $10{,}000$ candidate papers, including the ground-truth papers in the answers, are included.
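The quality filter on the click log can be sketched as follows. This is a hedged illustration rather than the actual preprocessing script: the text does not say whether the filter is applied once or repeated until stable, so a single pass over the raw counts is assumed, and the function name and the `(user, paper)` tuple layout are hypothetical.

```python
from collections import Counter


def filter_interactions(interactions, min_clicks=10):
    """Keep clicks whose user and paper each occur more than min_clicks times.

    interactions is a list of (user_id, paper_id) click records.
    Counts are taken once on the raw log (single-pass assumption).
    """
    user_counts = Counter(user for user, _ in interactions)
    paper_counts = Counter(paper for _, paper in interactions)
    return [
        (user, paper)
        for user, paper in interactions
        if user_counts[user] > min_clicks and paper_counts[paper] > min_clicks
    ]


# Toy log with a lowered threshold so the effect is visible.
log = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"), ("u3", "p1")]
kept = filter_interactions(log, min_clicks=1)
```

With a single pass, some remaining users or papers can fall below the threshold after filtering; repeating the function until the log stops shrinking would enforce the threshold strictly.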

3.4. Academic Trace and Prediction

Understanding the evolution of science on the cognitive level offers the potential to predict, change, and finally invent the future. Looking back, we include the paper source tracing task to identify the sources of research papers. To predict future academic impact, we include two tasks, i.e., paper influence prediction and author influence prediction.

Paper Source Tracing (PST). Tracing the sources of papers is crucial for understanding technological essence and uncovering innovation patterns. Given a paper $p$ (including its full text) and its references, the goal of PST is to identify the most important references (termed ref-source) that largely inspired the paper $p$ in terms of ideas or methods. The source papers of a given paper are defined by the following principles: (1) the main idea of the paper $p$ is inspired by the reference; or (2) the main method of the paper $p$ comes from the reference. In other words, this paper would not come into being without these source papers. We carefully build a dataset PST-Bench for this task. Given the specialized knowledge required for paper source tracing, dozens of computer science graduate students were employed to mark the sources of papers in their familiar fields. The annotation process was organized in an online paper reading group, where each student needed to share two papers and mark their source papers each week. After collection, expert-checking, and preprocessing, $2{,}141$ labeled computer science papers were obtained. We conducted a human evaluation on the test set, with senior researchers double-checking 100 papers. The accuracy rate was 94%.

Academic Influence Prediction. Paper influence prediction aims to forecast a paper’s impact $\Delta yr$ years later based on its metadata and citation relationships. Regarding the “Test-of-Time Paper Award” (TOT award) as an indicator of high impact, we collate TOT awards in computer science venues (https://numbda.cs.tsinghua.edu.cn/~yuwj/TH-CPL.pdf). Awards with similar meanings include the Most Influential Award, Sustained Influence Award, etc. At present, a total of $1{,}063$ papers awarded by 2022 have been collected. Similarly, author influence prediction seeks to predict an author’s influence $\Delta yr$ years later using their papers and citation relationships. For this task, we provide two datasets. First, we adopt AuthPred-2017 (https://www.biendata.xyz/competition/Tsinghua_course3/). This dataset contains a subset of AMiner authors and papers published by these authors before 2011, intending to predict the citations of these authors as of 2016. This dataset uses AMiner’s citation statistics and provides $1{,}112{,}931$ authors for training and $123{,}823$ authors for testing. In addition, we construct a new dataset AuthPred-2022. This dataset contains a subset of AMiner’s authors in the field of computer science and papers published by these authors before 2017, with the goal of predicting the citations of these authors as of early 2022 (as calculated by Google Scholar). This dataset contains $26{,}797$ AMiner authors with Google Scholar links.

4. Task Evaluations

This section delves into representative tasks of each module in OAG-Bench, highlighting selected experiments. Additional experiments are detailed in Appendix A. All code is available at https://github.com/zfjsail/OAG-Bench.

4.1. Author Name Disambiguation

Since SND and RND are two widely studied tasks, we take incorrect assignment detection (IND) as an illustration for evaluation.

Baselines. We adopt graph-based anomaly detection methods and LLM-based methods as baselines. For each author, graph-based methods first construct a paper similarity graph based on attribute similarity (e.g., co-authorship, co-organization) and then detect anomalies in the graph. (1) Logistic Regression (LR): injects top eigenvectors of each graph as features to perform node classification. (2) GCN (Kipf and Welling, 2016): employs graph convolutional networks as the encoder, and then uses fully-connected layers to classify normal/abnormal nodes. (3) GCCAD (Chen et al., 2023a): leverages graph contrastive learning and contrasts abnormal nodes with normal ones in terms of their distances to the global context. (4) ChatGLM (Du et al., 2022): finetunes ChatGLM-6B model by inputting each author’s paper list and asking the model whether one given paper is an anomaly or not.

Evaluation Metrics. Due to the imbalance between positive and negative instances, we adopt the widely-used metric AUC. Furthermore, we choose mean average precision (MAP) as another metric, which pays more attention to the rankings of incorrect instances. We take a macro average of each metric for each author.
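The two metrics and their per-author macro average can be sketched in plain Python as follows. This is an illustrative implementation, not the benchmark's evaluation script: it assumes label `1` marks an incorrectly assigned paper, that a higher score means "more likely incorrect", and that authors for whom a metric is undefined (for example, no incorrect paper) are skipped when averaging.

```python
def auc(labels, scores):
    """Probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of every positive instance."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else None


def macro_average(per_author, metric):
    """Average a metric over authors, skipping authors where it is undefined."""
    values = [metric(labels, scores) for labels, scores in per_author]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Two toy authors: (labels, scores) per author.
authors = [
    ([1, 0, 0, 1], [0.9, 0.8, 0.1, 0.7]),
    ([0, 1], [0.2, 0.6]),
]
macro_auc = macro_average(authors, auc)                # 0.875
macro_map = macro_average(authors, average_precision)  # 11/12, about 0.917
```

The pairwise AUC is quadratic in the number of papers per author, which keeps the sketch short; a rank-based formulation would be preferable for very large profiles.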

Table 3. Performance of incorrect assignment detection (%).

Method	AUC	MAP
LR	58.46	69.56
GCN	62.48	71.18
GCCAD	70.15	74.17
ChatGLM	77.92	79.54

Experimental Results. Table 3 shows the performance of incorrect assignment detection. We observe that graph neural network-based methods (GCN and GCCAD) outperform the traditional method (LR) based on eigenvalue decomposition. GCCAD explicitly contrasts abnormal paper nodes with other nodes, yielding better performance than GCN. Surprisingly, ChatGLM outperforms graph-based anomaly detection methods, indicating the potential of the attention mechanisms in LLMs to capture the complex correlations between the target paper and the overall author profile. The best performance of IND is not that satisfactory compared with that of SND and RND tasks (Chen et al., 2023b), suggesting that more attention should be paid to the IND task for author name disambiguation in the future.

4.2. Scholar Profiling

In this subsection, for entity attribute enrichment, we present the evaluation of multidimensional scholar profiling from long texts.

Baselines. We select the latest NER methods based on pre-trained models: (1) Han et al. (Yan et al., 2023): use a Biaffine decoder to generate features for each start and end position and then employ CNN to classify locations based on spatial position dependence. (2) GlobalPointer (Su et al., 2022): uses a multiplicative attention mechanism to incorporate relative positional encodings of start and end positions and alleviates class imbalance via modified loss functions. (3) UIE (Lu et al., 2022): is a generative pre-trained model based extraction framework with structure extraction languages and template-specific prompts.

Evaluation Metrics. Precision, Recall, and F1 are computed by comparing predicted and annotated text segments for each attribute. These individual attribute results are then averaged to obtain the overall evaluation result.
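A minimal sketch of this per-attribute evaluation is given below. It assumes exact matching of `(start, end)` spans, which the text does not specify, and the function names and attribute keys are hypothetical.

```python
def prf(pred_spans, gold_spans):
    """Precision, recall, and F1 for one attribute under exact span match."""
    pred, gold = set(pred_spans), set(gold_spans)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_prf(pred_by_attr, gold_by_attr):
    """Average per-attribute (precision, recall, F1) over all gold attributes."""
    attrs = sorted(gold_by_attr)
    scores = [prf(pred_by_attr.get(a, []), gold_by_attr[a]) for a in attrs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Spans are (start, end) character offsets in a scholar's biography.
gold = {"education": [(0, 120)], "work": [(130, 260), (300, 380)]}
pred = {"education": [(0, 120)], "work": [(130, 260), (270, 380)]}
overall = macro_prf(pred, gold)  # (0.75, 0.75, 0.75)
```

Because each metric is averaged over attributes separately, the overall F1 is generally not the harmonic mean of the overall precision and recall.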

Table 4. Extraction performance of multidimensional scholar profiling from long texts (%).

Method	Precision	Recall	F1
UIE	43.14	35.86	39.15
GlobalPointer	51.87	33.09	40.39
Han et al.	50.33	43.76	45.09

Experimental Results. Table 4 displays the extraction results for long text-based scholar profiling. Span-based methods like Han et al. surpass generation-based methods like UIE. We also conduct preliminary experiments by providing some demonstrations and calling the GPT-4 (Achiam et al., 2023) API on a subset of the test set, which achieves an F1 score below $5\%$. This performance disparity likely stems from the challenges that language models face when directly generating accurate lengthy texts for attributes such as education/work experiences. However, with the highest F1 score in Table 4 being 45.09%, there is still room for improvement in extraction performance. Exploring the fusion of large language models (LLMs) and span-based methods presents a promising research avenue.

4.3. Entity Tagging

For relation enrichment, this subsection presents the results of scholar interest extraction.

Baselines. For scholar interest extraction, we employ competition-winning solutions and methods relying on pre-trained models. These approaches follow a common principle: they gauge the similarity between authors in the test and training sets, and use the weighted interest tags of training authors as predictions for the authors in the test set. The baselines differ in how they compute author similarity. (1) LSI (Deerwester et al., 1990): employs bag of words and TF-IDF for paper texts, reducing dimensions with the LSI model. (2) ACA (https://github.com/geekinglcq/aca): utilizes more paper attributes, including titles, citations, and venues, for a more nuanced author similarity calculation. (3) Pre-trained models: leverage models like Sentence-BERT (S-BERT) (Reimers and Gurevych, 2019), SimCSE (Gao et al., 2021), E5 (Wang et al., 2022), BERT (Devlin et al., 2019), GTE (Li et al., 2023), BGE (https://github.com/FlagOpen/FlagEmbedding), and Sentence-T5 (S-T5) (Ni et al., 2022) to encode paper texts for similarity measurement.
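The common principle shared by these baselines can be illustrated with a small sketch. It assumes that each author is already represented by a vector (from LSI or a pre-trained encoder), that similarity is cosine similarity, and that each training author's tags are weighted by that similarity; these choices and the names `cosine` and `predict_tags` are illustrative assumptions rather than the exact procedure of any baseline.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two author vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def predict_tags(test_vec: Sequence[float],
                 train: List[Tuple[Sequence[float], List[str]]],
                 k: int = 3) -> List[str]:
    """Score each tag by the summed similarity of the training authors holding it."""
    scores: Dict[str, float] = defaultdict(float)
    for vec, tags in train:
        sim = cosine(test_vec, vec)
        for tag in tags:
            scores[tag] += sim
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy training set: (author vector, interest tags).
train = [([1.0, 0.0], ["ml", "nlp"]),
         ([0.0, 1.0], ["db"]),
         ([1.0, 1.0], ["ml", "ir"])]
top_tags = predict_tags([1.0, 0.0], train, k=3)
```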

Evaluation Metrics. For scholar interest extraction, we calculate the overlap ratio between predicted and ground-truth tags.

\text{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\frac{|T_{i}\cap T_{i}^{*}|}{|T_{i}^{*}|}

where $N$ is the number of scholars, $T_{i}^{*}$ is the annotated interest set of the $i$-th scholar, and $T_{i}$ is the predicted interest set of the $i$-th scholar. We pick the $3$ closest tags to the author for evaluation. For paper topic classification, we measure multi-classification accuracy.
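As a worked instance of the accuracy defined above, the following sketch computes the overlap ratio with the top-3 predicted tags. It assumes every scholar has a non-empty annotated tag set; the function name `interest_accuracy` is hypothetical.

```python
from typing import List, Sequence, Set


def interest_accuracy(predicted: List[Sequence[str]],
                      annotated: List[Set[str]],
                      k: int = 3) -> float:
    """Mean of |T_i ∩ T_i*| / |T_i*|, where T_i is the top-k predicted tags."""
    total = 0.0
    for ranked_tags, gold in zip(predicted, annotated):
        top_k = set(ranked_tags[:k])
        total += len(top_k & gold) / len(gold)
    return total / len(annotated)


# Two scholars: ranked predictions and annotated interests.
predicted = [["ml", "db", "nlp", "ir"], ["hci", "vis", "ml"]]
annotated = [{"ml", "nlp", "graph"}, {"vis"}]
acc = interest_accuracy(predicted, annotated)  # (2/3 + 1/1) / 2
```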

Experimental Results. Figure 3 presents the results of scholar interest extraction. Initial attempts to classify scholars’ paper texts using research interest tags as labels yielded unsatisfactory results, likely due to the large number of tags and limited training data. The methods compared in Figure 3 rely on author similarity to calculate interest tags, which proves more effective than text classification. Notably, encoding with pre-trained models directly is less effective than LSI, highlighting the effectiveness of shallow semantic models. Additionally, models focused on sentence embedding outperform general pre-trained models like BERT. The ACA method, which leverages various author attributes such as venues and citing papers, yields improved prediction results. However, the overall accuracy remains low, indicating a challenge in accurate classification with a large number of interest tags.

4.4. Academic Recommendation

For academic knowledge acquisition, this subsection presents the results of paper recommendation and reviewer recommendation.

Baselines. We compare various recommendation algorithms: (1) TF-IDF, (2) the Toronto paper matching system (TPMS) (Charlin and Zemel, 2013) for the reviewer recommendation task, (3) the variational autoencoder (VAE)-based collaborative filtering method Mult-VAE (Liang et al., 2018), (4) the graph filter-based collaborative filtering method GF-CF (Shen et al., 2021), and (5) the graph neural network (GNN)-based collaborative filtering methods NGCF (Wang et al., 2019) and LightGCN (He et al., 2020).

Evaluation Metrics. Like the standard recommendation task (Zhang et al., 2023c, 2024), we adopt Recall@20 and NDCG@20 as evaluation metrics for paper/reviewer recommendation.
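For reference, the two metrics can be sketched per user as below. This sketch assumes binary relevance, a Recall@K denominator equal to the number of relevant items, and the usual $1/\log_2(\text{rank}+1)$ discount for NDCG; some toolkits use slightly different conventions, so it should be read as an illustration rather than the benchmark's exact implementation.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One user: ranked recommendations and held-out relevant papers.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
r3 = recall_at_k(ranked, relevant, k=3)
n3 = ndcg_at_k(ranked, relevant, k=3)
```

The reported score is then the mean of these per-user values.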

Table 5. Performance of paper recommendation.

Method	Recall@20	NDCG@20
Mult-VAE	0.1088	0.0282
GF-CF	0.2067	0.1044
NGCF	0.1651	0.0823
LightGCN	0.1950	0.0985

Experimental Results. Table 5 shows the paper recommendation performance. GF-CF outperforms other methods, highlighting the effectiveness of graph filters. GNN-based methods exceed Mult-VAE, demonstrating the value of high-order graph structures. LightGCN performs better than NGCF, which confirms the redundancy of some GNN modules in NGCF. However, there is still room for improvement in recommendation accuracy. A key open difficulty is how to use the attributes of papers and users to capture users’ dynamically changing interests.

Table 6. Performance of reviewer recommendation.

Method	Recall@20	NDCG@20
TF-IDF	0.0016	0.0001
TPMS	0.0220	0.008
GF-CF	0.0382	0.0203
LightGCN	0.0371	0.0234

Table 6 reports the reviewer recommendation performance. GF-CF and LightGCN outperform TF-IDF, underscoring the importance of leveraging graph structures. The performance of TPMS is unsatisfactory because TPMS is also mainly based on TF-IDF text similarity. However, most methods fall short of fully utilizing the multidimensional attributes of papers and reviewers’ research interests. This highlights the need for further research in the reviewer recommendation task.

4.5. Academic Question Answering

For academic knowledge acquisition, this subsection introduces the results of academic question answering.

Baselines. We adopt sparse and dense retrieval methods: (1) Sparse retrieval methods: BM25, (2) Dense retrieval methods: DPR-FT (full fine-tuning of Dense Passage Retriever (DPR) (Karpukhin et al., 2020)), DPR-PT2 (parameter-efficient fine-tuning of DPR with P-Tuning v2 (Liu et al., 2022a)), ColBERT-FT (full fine-tuning of ColBERT (Khattab and Zaharia, 2020)), ColBERT-PT2 (parameter-efficient fine-tuning of ColBERT with P-Tuning v2), and LLM-Embedder (Zhang et al., 2023b) (a fine-tuned LLM based on various retrieval-related tasks).

Evaluation Metrics. Hit@K is used to measure retrieval accuracy, reporting if the top $K$ retrieved papers contain the correct answer. The average Hit@K across all questions is reported.
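A minimal sketch of this metric follows; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], answers: Set[str], k: int) -> float:
    """1.0 if any correct paper appears among the top-k retrieved papers."""
    return 1.0 if any(paper in answers for paper in ranked[:k]) else 0.0


def mean_hit_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Hit@K over all questions."""
    return sum(hit_at_k(ranked, answers, k) for ranked, answers in runs) / len(runs)


# Each entry: (retrieved paper ids in rank order, ids of correct papers).
runs = [(["p3", "p7", "p1"], {"p1"}),
        (["p2", "p9", "p4"], {"p5"})]
hit_at_2 = mean_hit_at_k(runs, k=2)
hit_at_3 = mean_hit_at_k(runs, k=3)
```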

Experimental Results. Figure 4 presents the results on the OAG-QA dataset. Generally speaking, dense retrieval methods outperform sparse retrieval methods. ColBERT-based methods are significantly better than DPR-based ones. This shows that by employing late interaction patterns and multi-vector representations, ColBERT models the correlation between questions and papers better. Interestingly, parameter-efficient fine-tuning methods outperform full fine-tuning, possibly due to better knowledge retention and generalization from the pre-trained model. The performance of LLM-Embedder suggests that there still exists a noticeable gap between LLMs and academic retrieval. Overall, the retrieval performance of these methods is suboptimal, suggesting room for improvement.

4.6. Paper Source Tracing

For academic source tracing, this subsection presents the results of paper source tracing.

Baselines. We compare three types of methods. (1) Statistical methods: Rule (employing regular expressions to extract references appearing near signal words like “motivated by” or “inspired by”), and Random Forest (RF) (following Valenzuela et al. (2015), extracting statistical features about citations, citing positions, text similarity, etc., and using RF to predict the importance of references). (2) Graph-based methods: LINE (Tang et al., 2015) and NetSMF (Qiu et al., 2019) train paper embeddings in citation networks and then calculate the cosine similarity between the paper embedding and the reference embedding to measure the importance of references. (3) Pre-training methods: extract the contextual text where each reference appears in the full texts, encode the text with the pre-trained models, and use the reference annotation results in the training set for fine-tuning. The pre-trained models considered include BERT (Devlin et al., 2019), SciBERT (Beltagy et al., 2019), Galactica-standard (Taylor et al., 2022), and GLM (Du et al., 2022). We also adopt three SOTA closed-source models: GPT-3.5 (OpenAI, 2022), GPT-4 (Achiam et al., 2023), and Claude-instant (Anthropic, 2023). For both open-source and closed-source LLMs, we input the context of a referenced paper and query the model to assess the reference’s significance. For instance, we ask, “Given the context …, is the current reference important?” Closed-source LLMs perform this task using zero-shot evaluation.
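To make the Rule baseline concrete, the sketch below shows one way such a signal-word rule could look. It is only an illustration under several assumptions: citations are numeric markers like `[12]`, "near" means within a fixed character window after the signal phrase, and the window size of 30 and the name `rule_based_sources` are hypothetical. The actual regular expressions used in the benchmark may differ.

```python
import re
from typing import List

# Signal phrases named in the text; the window-based matching is an assumption.
SIGNAL = re.compile(r"\b(?:motivated by|inspired by)\b", re.IGNORECASE)
CITATION = re.compile(r"\[(\d+)\]")


def rule_based_sources(text: str, window: int = 30) -> List[int]:
    """Return reference numbers cited within `window` characters after a signal phrase."""
    found: List[int] = []
    for match in SIGNAL.finditer(text):
        span = text[match.end(): match.end() + window]
        for cite in CITATION.finditer(span):
            ref = int(cite.group(1))
            if ref not in found:
                found.append(ref)
    return found


text = ("Our model is inspired by the biaffine parser [7]. "
        "We also compare with [3]. "
        "Motivated by recent work [12], we add a filter.")
sources = rule_based_sources(text)
```

A rule of this kind only fires when a signal phrase is present, so important references introduced without such phrases are missed.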

Evaluation Metrics. A paper may have one or more ref-sources. For each reference of a paper $p$, the method must output an importance score in $[0,1]$. For each paper $p$ to be traced, its reference list is encoded as a 0-1 vector based on the labeling results (1 if the reference is a ref-source, 0 otherwise). By comparing the predicted score of each reference with its label, we compute the average precision (AP) for each paper. The mean of AP across different papers, i.e., the Mean Average Precision (MAP), serves as the evaluation metric.
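The metric can be sketched as follows: references of each paper are ranked by predicted importance, average precision is computed against the 0-1 labels, and the values are averaged over papers. The function names are hypothetical, and ties in the scores are broken by the original reference order, which is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of one paper's reference ranking."""
    # Stable sort: references with equal scores keep their original order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(papers: List[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Mean of per-paper average precision."""
    return sum(average_precision(s, l) for s, l in papers) / len(papers)


# Each entry: (predicted importance scores, 0-1 ref-source labels).
papers = [([0.9, 0.2, 0.7, 0.1], [0, 1, 1, 0]),
          ([0.8, 0.3], [1, 0])]
map_score = mean_average_precision(papers)
```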

Experimental Results. Figure 5 presents the results of paper source tracing. Among all methods, SciBERT delivers the best performance, indicating the efficacy of pre-trained language models. RF outperforms the Rule method, underscoring the effectiveness of feature engineering. Graph-based methods achieve average performance, possibly because they ignore the contextual information of references. The Rule-based approach’s performance is subpar, likely because many important references lack surrounding signal words like “inspired by”, resulting in low recall. Surprisingly, fine-tuned SciBERT and BERT-base outperform larger models like GLM-2B, Galactica-standard, and closed-source LLMs. The reason may lie in two aspects. First, the training objective of the masked language model is more suitable for this context understanding task. Second, API-based models may not be well-trained on similar tasks. A potential future direction could involve merging graph-based and text-based methods for paper source tracing. Note that the results of current methods are not yet satisfactory, indicating ample room for further exploration.

4.7. Paper Influence Prediction

For academic influence prediction, this subsection presents the results of paper influence prediction.

Baselines. We select the following methods: (1) Citation: is based on the paper citation number of known years; (2) Random Forest (RF) (Breiman, 2001): defines features as the paper citation number per year and the total number of citations; (3) GBDT (Friedman, 2001): uses the same features as RF; (4) PageRank (Page et al., 1999): calculates papers’ PageRank score based on paper citation networks; (5) GraphSAGE (Hamilton et al., 2017): performs semi-supervised classification on the paper citation network. Additionally, we consider graph-based node importance prediction methods: (6) GENI (Park et al., 2019) and (7) RGTN (Huang et al., 2021).
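As an illustration of the PageRank baseline, the sketch below runs power iteration on a small citation network with edges pointing from the citing paper to the cited paper. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from papers that cite nothing are common defaults assumed here, not settings reported for the benchmark.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank on a directed citation graph (citing -> cited)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links: Dict[str, List[str]] = {v: [] for v in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by papers with no outgoing citations is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Papers a and b cite c; c cites d.
scores = pagerank([("a", "c"), ("b", "c"), ("c", "d")])
```

In this toy graph, c has two citations and d has only one, yet d receives the higher score because its single citation comes from a highly ranked paper. The Citation baseline, which counts citations, would rank c above d.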

Evaluation Metrics. For each venue, we predict whether a paper will be awarded, with labels $0$ or $1$ indicating whether the paper is awarded or not. Average precision is calculated by comparing the predicted probability of winning the award with the ground truth label, and the mean across different venues (MAP) is used as the evaluation metric.

Table 7. Results of paper influence prediction.

Method	MAP
Citation	0.6413
RF	0.5409
GBDT	0.5725
PageRank	0.6504
GraphSAGE	0.0811
GENI	0.1262
RGTN	0.0279

Experimental Results. Table 7 presents the results for paper influence prediction. PageRank performs best, as it considers the influence of citing papers, unlike the Citation method, which treats each citing paper equally. Traditional classifiers (RF and GBDT) are inferior to methods using only total citations, indicating that total citations are a very important indicator. The features added by the classifiers may dilute the effect of total citations. GraphSAGE’s poor performance may be due to its inability to capture paper-influence factors like citation count. GENI outperforms GraphSAGE, but both methods are less effective than the Citation and PageRank methods. This could be due to their implicit incorporation of citation statistics and the severe class imbalance problem (positive vs. negative $<1:100$). Thus, identifying factors beyond citation count remains a challenge in predicting papers’ breakthrough innovation.

5. OAG-Challenge

To promote the engagement of the research community and the development of OAG-Bench, we also introduce the Open Academic Graph Challenge (OAG-Challenge) and set up a regularly updated leaderboard for OAG-Bench (https://www.biendata.xyz/kdd2024/). OAG-Challenge currently contains three challenging academic tasks: incorrect assignment detection for author name disambiguation (IND), academic question answering (OAG-AQA), and paper source tracing (PST).

Specifically, given the paper assignments of each author and paper metadata, IND aims to detect paper assignment errors for each author. Given professional questions and a pool of candidate papers, OAG-AQA hopes to retrieve the most relevant papers to answer these questions. As mentioned earlier, given the full texts of each paper, PST aims to automatically trace the most significant references that have inspired a given paper.

OAG-Challenge was deployed at KDD Cup 2024 and attracted more than 800 team registrations globally. Following the previous successful conventions, submissions are required to provide source codes, technical reports, and contact information for better knowledge sharing and iteration. We are periodically updating the datasets, including annotating new assignment errors, crawling new academic question and answer pairs, and collecting new reading records for PST.

6. Conclusion

The attention of the research community to academic benchmarks remains limited, even though academic tasks offer various challenges and applications of immense impact. Thus, this paper introduces OAG-Bench to carefully annotate large-scale OAG for the full life cycle of academic graph mining. OAG-Bench now includes 10 tasks, $20$ datasets, $70+$ baseline models, and $120+$ experimental results. In the future, we plan to continually maintain and enhance OAG-Bench by regularly updating datasets from real scenarios, adding more practical tasks, and exploring interactive evaluation metrics. OAG-Bench is always open for contributions from communities by adding new tasks or datasets, developing cutting-edge algorithms or foundation models for various tasks, etc.

Acknowledgements.

This work is supported by the Natural Science Foundation of China (NSFC) 62425601 and 62276148, the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400, and the New Cornerstone Science Foundation through the XPLORER PRIZE. We also thank Weibin Liao, Chao Yu, Kai Yu, and Zheng Jiang for their contribution to code reproducibility.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Anthropic (2023) Anthropic. 2023. Introducing Claude. https://www.anthropic.com/news/introducing-claude.
Beltagy et al. (2019) Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3615–3620.
Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
Charlin and Zemel (2013) Laurent Charlin and Richard Zemel. 2013. The Toronto paper matching system: an automated paper-reviewer assignment system. (2013).
Chen et al. (2023b) Bo Chen, Jing Zhang, Fanjin Zhang, Tianyi Han, Yuqing Cheng, Xiaoyan Li, Yuxiao Dong, and Jie Tang. 2023b. Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit. In Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3817–3828.
Chen et al. (2023a) Bo Chen, Jing Zhang, Xiaokang Zhang, Yuxiao Dong, Jian Song, Peng Zhang, Kaibo Xu, Evgeny Kharlamov, and Jie Tang. 2023a. GCCAD: Graph Contrastive Learning for Anomaly Detection. IEEE Transactions on Knowledge & Data Engineering 01 (2023), 1–14.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
Cohan et al. (2020) Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2270–2282.
Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
Dorogush et al. (2018) Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. (2018). arXiv:1810.11363
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 320–335.
Feng et al. (2022) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 878–891.
Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics 29, 5 (2001), 1189–1232.
Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 6894–6910.
Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
Gu et al. (2021) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3, 1 (2021), 1–23.
Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
He et al. (2022) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2022. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In Proceedings of the 11th International Conference on Learning Representations.
He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
Huang et al. (2021) Han Huang, Leilei Sun, Bowen Du, Chuanren Liu, Weifeng Lv, and Hui Xiong. 2021. Representation learning on knowledge graphs for node importance estimation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 646–655.
Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. (2015). arXiv:1508.01991
Jiang et al. (2022) Minhao Jiang, Xiangchen Song, Jieyu Zhang, and Jiawei Han. 2022. TaxoEnrich: Self-Supervised Taxonomy Completion via Structure-Semantic Representations. In Proceedings of the ACM Web Conference 2022. 925–934.
Jin et al. (2019) Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2567–2577.
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 6769–6781.
Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: a highly efficient gradient boosting decision tree. (2017), 3149–3157.
Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48.
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 4th International Conference on Learning Representations.
Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the 8th International Conference on Learning Representations.
Li et al. (2020) Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment 14, 1 (2020), 50–60.
Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv preprint arXiv:2308.03281 (2023).
Liang et al. (2018) Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 world wide web conference. 689–698.
Liu et al. (2022a) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022a. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 61–68.
Liu et al. (2022b) Xiao Liu, Da Yin, Jingnan Zheng, Xingjian Zhang, Peng Zhang, Hongxia Yang, Yuxiao Dong, and Jie Tang. 2022b. OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3418–3428.
Liu et al. (2020) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A Robustly Optimized BERT Pretraining Approach. In Proceedings of the 9th International Conference on Learning Representations.
Lo et al. (2020) Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4969–4983.
Lu et al. (2022) Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. 2022. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5755–5772.
Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2. 3111–3119.
Ni et al. (2022) Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. In Findings of the Association for Computational Linguistics: ACL 2022. 1864–1874.
OpenAI (2022) OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
Pareja et al. (2020) Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao Schardl, and Charles Leiserson. 2020. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. 5363–5370.
Park et al. (2019) Namyong Park, Andrey Kan, Xin Luna Dong, Tong Zhao, and Christos Faloutsos. 2019. Estimating node importance in knowledge graphs using graph neural networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 596–606.
Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532–1543.
Qiu et al. (2019) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and Jie Tang. 2019. Netsmf: Large-scale network embedding as sparse matrix factorization. In The World Wide Web Conference. 1509–1520.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3982–3992.
Rossi et al. (2020) Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti. 2020. Sign: Scalable inception graph neural networks. (2020). arXiv:2004.11198
Shen et al. (2020) Jiaming Shen, Zhihong Shen, Chenyan Xiong, Chi Wang, Kuansan Wang, and Jiawei Han. 2020. TaxoExpan: Self-supervised taxonomy expansion with position-enhanced graph neural network. In Proceedings of The Web Conference 2020. 486–497.
Shen et al. (2021) Yifei Shen, Yongji Wu, Yao Zhang, Caihua Shan, Jun Zhang, B Khaled Letaief, and Dongsheng Li. 2021. How Powerful is Graph Convolution for Recommendation?. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1619–1629.
Sinha et al. (2015) Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang. 2015. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th international conference on world wide web. 243–246.
Su et al. (2022) Jianlin Su, Ahmed Murtadha, Shengfeng Pan, Jing Hou, Jun Sun, Wanwei Huang, Bo Wen, and Yunfeng Liu. 2022. Global Pointer: Novel Efficient Span-based Approach for Named Entity Recognition. (2022). arXiv:2208.03054
Sutskever et al. (2009) Ilya Sutskever, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2009. Modelling relational data using Bayesian Clustered Tensor Factorization. In Proceedings of the 22nd International Conference on Neural Information Processing Systems. 1821–1828.
Tam et al. (2022) Weng Lam Tam, Xiao Liu, Kaixuan Ji, Lilong Xue, Xingjian Zhang, Yuxiao Dong, Jiahua Liu, Maodi Hu, and Jie Tang. 2022. Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers. arXiv preprint arXiv:2207.07087 (2022).
Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web. 1067–1077.
Taylor et al. (2022) Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Valenzuela et al. (2015) Marco Valenzuela, Vu Ha, and Oren Etzioni. 2015. Identifying Meaningful Citations.. In AAAI workshop: Scholarly big data, Vol. 15. 13.
Wang and Barabási (2021) Dashun Wang and Albert-László Barabási. 2021. The science of science. Cambridge University Press.
Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022).
Wang et al. (2018) Ruijie Wang, Yuchen Yan, Jialu Wang, Yuting Jia, Ye Zhang, Weinan Zhang, and Xinbing Wang. 2018. Acekg: A large-scale knowledge graph for academic data mining. In Proceedings of the 27th ACM international conference on information and knowledge management. 1487–1490.
Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval. 165–174.
Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning. 6861–6871.
Xiong et al. (2017) Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval. 55–64.
Yan et al. (2023) Hang Yan, Yu Sun, Xiaonan Li, and Xipeng Qiu. 2023. An Embarrassingly Easy but Strong Baseline for Nested Named Entity Recognition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 1442–1452.
Zhang et al. (2024) Dan Zhang, Yangliao Geng, Wenwen Gong, Zhongang Qi, Zhiyu Chen, Xing Tang, Ying Shan, Yuxiao Dong, and Jie Tang. 2024. RecDCL: Dual Contrastive Learning for Recommendation. In Proceedings of the ACM on Web Conference 2024. 3655–3666.
Zhang et al. (2023c) Dan Zhang, Yifan Zhu, Yuxiao Dong, Yuandong Wang, Wenzheng Feng, Evgeny Kharlamov, and Jie Tang. 2023c. ApeGNN: Node-Wise Adaptive Aggregation in GNNs for Recommendation. In Proceedings of the ACM Web Conference 2023. 759–769.
Zhang et al. (2023a) Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Evgeny Kharlamov, Bin Shao, et al. 2023a. OAG: Linking Entities across Large-scale Heterogeneous Knowledge Graphs. IEEE Transactions on Knowledge and Data Engineering 35, 9 (2023), 9225–9239.
Zhang et al. (2019) Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, et al. 2019. OAG: Toward linking large-scale heterogeneous entity graphs. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2585–2595.
Zhang et al. (2023b) Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. 2023b. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554 (2023).

Appendix A Results of Additional Tasks

A.1. Entity Alignment

We provide the experimental results of venue alignment, affiliation alignment, and author alignment in this subsection.

Baselines. Venue alignment and affiliation alignment are short text matching tasks. We compare different types of matching methods: (1) traditional machine learning methods (SVM) using the Jaccard index and TF-IDF similarity as input features; (2) shallow neural network-based matching methods, including a CNN-based matching model (LinKG_C) (Zhang et al., 2019) and an RNN-based matching model (LinKG_L) (Zhang et al., 2019); (3) matching models based on pre-trained models (Ditto (Li et al., 2020) with the pre-trained models BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2020), DeBERTa (He et al., 2022), LaBSE (Feng et al., 2022), and GLM (Du et al., 2022)).
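As an illustration of the hand-crafted features fed to the SVM baseline, the following minimal sketch computes a token-level Jaccard index between two names. The whitespace tokenizer and the function names `tokenize` and `jaccard` are assumptions made for this example, not the benchmark's actual pre-processing.

```python
from typing import Set


def tokenize(name: str) -> Set[str]:
    """Lowercase a venue/affiliation name and split it into word tokens.

    Assumption: simple whitespace tokenization with commas stripped.
    """
    return set(name.lower().replace(",", " ").split())


def jaccard(a: str, b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the token sets of two names."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # two empty names: define the similarity as 0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Tokens {proceedings, of, kdd} vs. {kdd, proceedings}: 2 shared, 3 in total.
    print(jaccard("Proceedings of KDD", "KDD Proceedings"))
```

Because the score depends only on token sets, it ignores word order, which is consistent with the limitation of SVM features discussed in the experimental results.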

Author alignment can further take authors’ structural information into account. Apart from SVM, LinKG_C, and LinKG_L, we additionally select LinKG_G (Zhang et al., 2019) model for author alignment, which constructs a subgraph for each candidate pair, and then uses heterogeneous graph attention networks to learn hidden representations for classification. For each candidate author pair, we leverage published papers and venues of authors as features.

Evaluation Metrics. Entity alignment is a binary classification problem. We take F1 and AUC as the evaluation metrics.

Table 8. Alignment results of different types of entities (%).

	Venue		Affiliation
Method	F1	AUC	F1	AUC
SVM	82.63	91.87	68.47	70.93
LinKG_C	83.46	94.07	69.06	69.12
LinKG_L	85.03	95.33	67.76	72.37
Ditto-BERT-base	89.33	95.00	70.38	78.65
Ditto-ALBERT-base	88.05	96.06	61.90	70.96
Ditto-RoBERTa-large	89.47	97.22	71.98	79.02
Ditto-DeBERTa-base	94.27	96.73	72.78	80.37
Ditto-DeBERTa-large	89.44	97.94	82.18	89.76
Ditto-LaBSE	90.79	96.93	71.06	78.55
Ditto-GLM-RoBERTa	78.82	93.04	61.11	67.93

Table 9. Performance of author alignment (%).

Method	F1	AUC
SVM	90.14	93.67
LinKG_C	67.72	66.78
LinKG_L	74.80	77.58
LinKG_G	89.62	97.32

Experimental Results. Table 8 presents the results of various matching models for venue alignment and affiliation alignment. The methods for author alignment utilize more structural information, and their results are presented in Table 9. Venue/affiliation alignment is essentially a short text matching task. SVM generally performs slightly worse than LinKG_C and LinKG_L, possibly because SVM features cannot capture word order information. Conversely, LinKG_C and LinKG_L can capture the contextual dependence of word sequences. Overall, the top-performing models on both datasets utilize pre-training, indicating the promising potential of pre-trained language models for entity alignment. Among all methods that employ pre-trained models, DeBERTa-large delivers the best AUC on both tasks and is particularly strong on affiliation alignment, where it also attains the best F1, indicating that DeBERTa-large effectively encodes additional semantic knowledge beyond affiliations’ surface names. Note that the trends of AUC and F1 in the table are sometimes inconsistent. Given that F1 requires a threshold setting, AUC is a more reliable metric when discrepancies arise between the two.
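The remark about F1 depending on a threshold can be made concrete with a small sketch. The labels and scores below are made-up toy values, and the helper names `f1_at_threshold` and `auc` are illustrative only. AUC is computed as the fraction of (positive, negative) pairs ranked correctly, so it does not involve any threshold.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]            # toy ground truth
    scores = [0.9, 0.4, 0.35, 0.1]   # toy matching scores
    print(auc(labels, scores))                     # ranking quality, threshold-free
    print(f1_at_threshold(labels, scores, 0.5))    # one positive falls below 0.5
    print(f1_at_threshold(labels, scores, 0.38))   # both positives are kept
```

In this toy case the ranking is perfect, so AUC stays the same, while F1 changes with the chosen cut-off.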

The results of author alignment are shown in Table 9. We observe that methods using authors’ structure information (SVM and LinKG_G) are significantly better than methods not using structure information (LinKG_C and LinKG_L). In the future, more author pairs with ambiguous names will be added to increase the difficulty of this task.

A.2. Author Name Disambiguation

We present the detailed task description and experiments for author name disambiguation in this subsection. Two subtasks of author name disambiguation are defined as follows.

Problem A.1.

From-scratch Name Disambiguation (SND). Given a collection of papers associated with identically-named authors, the goal is to cluster these papers into distinct groups, where each group should represent papers by the same author, while different groups signify papers by different authors.

Problem A.2.

Real-time Name Disambiguation (RND). Given a collection of unassigned papers and a set of authors (each author includes attributes such as affiliations, research interests, published papers, etc.), the goal is to assign these papers to the correct author or return empty (meaning that no author can be matched).

Datasets. The WhoIsWho dataset includes $72609$ authors with $2459$ names, and $1102249$ associated papers. The authorship between papers and authors is manually annotated. For the two subtasks, the training set contains the mapping relationship among the author name — author ID — paper ID, and the metadata of the paper (such as title, author name, published venue, etc.). For the SND task, the validation set and test set contain the name to be disambiguated and the papers associated with the name, and the goal is to cluster the papers into different groups. For the RND task, the validation and test sets involve unassigned papers, and the goal is to assign papers to existing authors or return NIL.

Baselines. We select the winning solutions of the latest competitions (https://www.biendata.xyz/competition/whoiswho1/, https://www.biendata.xyz/competition/whoiswho2/) for comparison. Specifically, the compared SND methods include ECNU_AIDA, Complex808, and liub. The three methods follow a similar framework. They first encode the semantic features of papers via pre-trained models. Then, they construct a heterogeneous network according to the heterogeneous attributes of papers. Next, they use random walks based on meta-paths to generate the structural representation of papers. The semantic representation and structural representation can separately generate two paper similarity matrices. Finally, they use the DBSCAN clustering algorithm to cluster the papers to obtain the clustering result. The difference between the three methods lies in: (1) ECNU_AIDA and Complex808 use a pre-trained Word2Vec model (Mikolov et al., 2013) to obtain the semantic representation of papers, while liub uses OAG-BERT (Liu et al., 2022b) to obtain the semantic representation of papers; (2) ECNU_AIDA and liub use co-author and co-organization relations between two papers for random walks in heterogeneous networks, while Complex808 additionally introduces co-venue and co-keyword relations between two papers with a certain probability for random walks.
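The final step of this shared framework can be sketched as follows. This is a simplified illustration under stated assumptions, not the competitors' code: the two similarity matrices are merged by a weighted average (the weight `w` is hypothetical; the source does not say how the matrices are combined), and a small textbook DBSCAN is run on the resulting distances. The values of `eps` and `min_pts` are likewise arbitrary.

```python
def combine(sem, struct, w=0.5):
    """Turn two similarity matrices into one distance matrix (assumed weighted average)."""
    n = len(sem)
    return [[1.0 - (w * sem[i][j] + (1 - w) * struct[i][j]) for j in range(n)]
            for i in range(n)]


def dbscan(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; returns a label per paper (-1 = noise)."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


if __name__ == "__main__":
    # Toy similarities for four papers sharing one author name.
    sem = [[1, .9, .1, .1], [.9, 1, .1, .1], [.1, .1, 1, .8], [.1, .1, .8, 1]]
    struct = [[1, .7, 0, 0], [.7, 1, 0, 0], [0, 0, 1, .6], [0, 0, .6, 1]]
    print(dbscan(combine(sem, struct), eps=0.5, min_pts=2))
```

Each resulting label is read as one hypothesized author; papers labeled -1 would need separate post-processing, which this sketch does not cover.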

Compared RND methods include: (1) kingsundad: employs three kinds of similarity features: handcrafted, OAG-BERT, and RBF-kernel interaction matching (Xiong et al., 2017). These features are input into multiple classifiers such as XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Dorogush et al., 2018) for ensemble learning. (2) AlexNE: introduces various name encoding techniques, such as abbreviated encoding, to enhance recall. Unlike the kingsundad method, it uses both OAG-BERT and GloVe (Pennington et al., 2014) to generate semantic paper representations. It constructs a large graph connecting unassigned papers to existing ones via keywords and generates node representations using Node2Vec (Grover and Leskovec, 2016). (3) Data Magician: is a feature engineering-based method that uses features from keywords, affiliations, co-authors, years, and more. Notably, it defines a time-weighted paper similarity method to account for changing research interests.

Evaluation Metrics. For the SND task, we evaluate the widely adopted pairwise-F1 for each name.

$$\begin{aligned}
\text{PairwisePrecision} &= \frac{\#\text{PairsCorrectlyPredictedToSameAuthor}}{\#\text{TotalPairsPredictedToSameAuthor}} \\
\text{PairwiseRecall} &= \frac{\#\text{PairsCorrectlyPredictedToSameAuthor}}{\#\text{TotalPairsToSameAuthor}} \\
\text{PairwiseF1} &= \frac{2 \times \text{PairwisePrecision} \times \text{PairwiseRecall}}{\text{PairwisePrecision} + \text{PairwiseRecall}}
\end{aligned}$$

The overall pairwise-F1 is the mean of pairwise-F1 of all names.
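The pairwise metrics can be computed directly from the predicted and ground-truth clusterings. The sketch below is one straightforward reading of the definitions above; the data layout (a list of paper-ID lists per name) and the function names are assumptions for illustration.

```python
from itertools import combinations


def same_author_pairs(clusters):
    """All unordered paper pairs that a clustering places under the same author."""
    pairs = set()
    for papers in clusters:
        for a, b in combinations(sorted(papers), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, truth):
    """Pairwise-F1 for a single name, given predicted and true clusters of paper IDs."""
    pred_pairs = same_author_pairs(predicted)
    true_pairs = same_author_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_pairs)
    recall = correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


def overall_pairwise_f1(per_name):
    """Mean pairwise-F1 over names; per_name maps name -> (predicted, truth)."""
    scores = [pairwise_f1(pred, true) for pred, true in per_name.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = [["p1", "p2", "p3"], ["p4"]]
    predicted = [["p1", "p2"], ["p3", "p4"]]
    # Predicted pairs: (p1,p2), (p3,p4); true pairs: (p1,p2), (p1,p3), (p2,p3).
    print(pairwise_f1(predicted, truth))
```

In the toy case, one of two predicted pairs is correct (precision 1/2) and one of three true pairs is recovered (recall 1/3).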

For the RND task, we first calculate precision and recall for each author’s unassigned papers, and then take the weighted average of precision and recall of different authors according to the paper count of each author to compute the overall F1.
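A possible implementation of this weighted scheme is sketched below. It assumes that each author's weight is the number of ground-truth unassigned papers belonging to that author; the text does not spell out the exact weighting, so the official evaluation script may differ in such details. All names are illustrative.

```python
def rnd_weighted_f1(predicted, truth):
    """Weighted-average precision/recall over authors, combined into one F1.

    predicted, truth: dict mapping author ID -> set of unassigned paper IDs.
    Assumption: weights are proportional to each author's ground-truth paper count.
    """
    total = sum(len(papers) for papers in truth.values())
    if total == 0:
        return 0.0
    weighted_precision = 0.0
    weighted_recall = 0.0
    for author, true_papers in truth.items():
        pred_papers = predicted.get(author, set())
        correct = len(pred_papers & true_papers)
        precision = correct / len(pred_papers) if pred_papers else 0.0
        recall = correct / len(true_papers) if true_papers else 0.0
        weight = len(true_papers) / total
        weighted_precision += weight * precision
        weighted_recall += weight * recall
    if weighted_precision + weighted_recall == 0:
        return 0.0
    return (2 * weighted_precision * weighted_recall
            / (weighted_precision + weighted_recall))


if __name__ == "__main__":
    truth = {"a1": {"p1", "p2", "p3"}, "a2": {"p4"}}
    predicted = {"a1": {"p1", "p2"}, "a2": {"p3", "p4"}}  # p3 is misassigned to a2
    print(rnd_weighted_f1(predicted, truth))
```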

Table 10. Disambiguation results on WhoIsWho dataset (%).

Method	F1
ECNU_AIDA	89.140
Complex808	88.594
liub	88.580

(a) Results of SND methods.

Method	F1
kingsundad	93.492
AlexNE	93.136
Data Magician	92.850

(b) Results of RND methods.

Experimental Results. Table 10(a) and Table 10(b) report the performance of SND and RND methods, respectively. For SND methods, despite using OAG-BERT for semantic paper representation, the liub method underperforms, suggesting the need for further exploration of applying large language models. In addition, the three methods share the same framework: they build semantic and structural paper representations, compute paper similarity from them, and then cluster the papers with DBSCAN. How to move beyond this framework is also worthy of further study.

For RND methods, all three methods yield good results. AlexNE attempts to introduce a graph structure to help disambiguate unassigned papers, which is a less explored direction. Data Magician uses time-based paper similarity features for complex name disambiguation scenarios. Further research is needed for challenging situations like co-authors with identical names or affiliation shifts, and for designing a robust paper-author matching model, considering potential incorrect assignments in existing authors’ papers.

A.3. Scholar Profiling

This subsection presents the task description and experiments for search engine-based scholar profiling.

Problem A.3.

Search Engine-based Scholar Profiling. Given a scholar’s name, affiliation, and one’s search engine records (using “name + affiliation” to query and extracting up to $2$ search pages and up to $20$ snippets), the goal is to extract the portrait information of the scholar, including homepage, gender, and position.

Datasets. CCKS2021-En (https://www.biendata.xyz/competition/ccks_aminer_profiling/): This dataset is an English subset of the CCKS 2021 scholar profiling track from AMiner. It contains $9,221$ scholar portraits, randomly divided into $5,557$ for training, $1,833$ for validation, and $1,831$ for testing.

Baselines. Drawing from the winning solutions of the CCKS 2021 competition and recent named entity recognition (NER) methods, we select the following baselines for search engine-based scholar profiling: (1) SML: employs manual features and traditional classifiers. Specifically, for gender prediction, SML-esb extracts features such as the frequency of “his” and “her” and uses various classifiers for voting. For homepage extraction, Logistic Regression (LR) and XGBoost use features such as the appearance of signal words (e.g., “edu” and “academic”) for classification. (2) Rule: utilizes regular expressions and voting for position extraction. (3) BI-LSTM-CRF (Huang et al., 2015): uses a BI-LSTM layer and a CRF layer for sequence labeling in position extraction. (4) Pre-training methods: We design different inputs for pre-trained models for each attribute. For gender prediction and position extraction, the scholar’s name, affiliation, and webpage texts are concatenated as inputs. For homepage extraction, the scholar’s name and the candidate URL are concatenated as inputs. We fine-tune pre-trained models for classification, including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2020), DeBERTa (He et al., 2022), ALBERT (Lan et al., 2019), ChatGLM (https://github.com/THUDM/ChatGLM-6B), and LLaMA (Touvron et al., 2023).
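To make the Rule baseline more tangible, here is a minimal sketch of regex matching followed by majority voting over search snippets. The pattern list, the label names, and the one-vote-per-snippet policy are all hypothetical choices for this example; the actual rules used in the competition solutions are not described in this paper.

```python
import re
from collections import Counter

# Hypothetical title patterns, ordered from most to least specific so that
# "associate professor" is not counted as plain "professor".
POSITION_PATTERNS = [
    ("associate professor", re.compile(r"\bassociate\s+professor\b", re.IGNORECASE)),
    ("assistant professor", re.compile(r"\bassistant\s+professor\b", re.IGNORECASE)),
    ("professor", re.compile(r"\bprofessor\b", re.IGNORECASE)),
    ("lecturer", re.compile(r"\blecturer\b", re.IGNORECASE)),
]


def extract_position(snippets):
    """Return the position label that most snippets vote for, or None if nothing matches."""
    votes = Counter()
    for text in snippets:
        for label, pattern in POSITION_PATTERNS:
            if pattern.search(text):
                votes[label] += 1
                break  # one vote per snippet: the first (most specific) match
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    snippets = [
        "Jane Doe is an Associate Professor in the Department of Computer Science.",
        "Associate Professor Jane Doe received her PhD in 2010.",
        "Jane Doe, Professor of Computer Science (homepage).",
    ]
    print(extract_position(snippets))
```

Voting across snippets gives some robustness to individual noisy search results, at the cost of ignoring any snippet whose wording falls outside the pattern list.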

Evaluation Metrics. Following the competition, accuracy is used to measure exact matches between predictions and ground truths.

Table 11. Results of search engine-based scholar profiling (%).

Method	Gender	Homepage	Position
SML-esb	71.67	-	-
LR	-	19.37	-
XGBoost	-	20.8	-
Rule	-	-	71.6
BI-LSTM-CRF	-	-	85.10
BERT-base	96.12	18.91	83.23
RoBERTa-base	96.40	20.65	83.78
DeBERTa-base	96.50	21.11	85.14
DeBERTa-v3-large	96.56	18.21	79.34
ALBERT-base	95.85	16.38	84.33
ChatGLM-6B-LoRA	96.40	26.83	78.15
LLaMA-7B-LoRA	70.07	26.93	79.14

Experimental Results. Table 11 displays extraction results for search engine-based scholar profiling. Pre-trained models generally outperform traditional methods, showcasing the expressive capacity and effectiveness of pre-trained models without manual feature design. BI-LSTM-CRF and some pre-trained models exhibit similar performance in position extraction, showing the suitability of both sequence labeling and pre-trained models. Large generative models achieve the best results in homepage extraction, though accuracy remains modest.

A.4. Entity Tagging

We provide the experimental setup and results of the paper topic classification as follows.

Baselines. We adopt three GNN methods as baselines, including SGC (Wu et al., 2019), SIGN (Rossi et al., 2020), and GraphSAGE (Hamilton et al., 2017).

Evaluation Metrics. We measure multi-class classification accuracy for paper topic classification.

Table 12. Results of paper topic classification (%).

Method	Test Acc.	Valid. Acc.
SGC	34.08	31.44
SIGN	26.25	24.99
GraphSAGE	59.57	57.12

Experimental Results. Table 12 shows the results of paper topic classification. GraphSAGE outperforms SGC and SIGN, likely owing to its larger number of trainable parameters, greater expressive power, and neighbor sampling strategy. However, its longer training time poses a challenge in balancing efficiency and effectiveness on large-scale graph data. SGC performs better than SIGN, possibly because SIGN’s graph convolution filter is less suited for paper classification tasks, while SGC’s simpler convolution scheme effectively captures paper topic information. Current methods mainly leverage the paper citation structure for paper topic classification, yielding unsatisfactory results. More content information could be incorporated to enhance fine-grained paper tagging performance.

A.5. Concept Taxonomy Completion

In this subsection, we provide the experimental setup and results for concept taxonomy completion.

Baselines. Some of the latest concept taxonomy completion methods are selected for comparison. (1) BiLinear (Sutskever et al., 2009): uses a bilinear model to encode new and candidate concept representations, performing binary classification to ascertain if a candidate position owns the correct hypernym and hyponym of a new concept. (2) TaxoExpan (Shen et al., 2020): employs a position-augmented graph neural network to gauge the relationship between new concepts and candidate concept subgraphs, using contrastive learning to bolster model robustness. We use SciBERT (Beltagy et al., 2019) to encode concepts for a fair comparison with the next method. (3) TaxoEnrich (Jiang et al., 2022): initially transforms the existing hyponymy relationship into natural language, using SciBERT to represent concepts. It then employs LSTM to encode vertical concept relationships and an attention mechanism for sibling relationships. Finally, a matching model calculates the score between a new concept and a candidate position.

Evaluation Metrics. Each new concept is matched with nodes in the existing concept hierarchy tree, and the candidates are sorted by similarity. Evaluation metrics include Hit@10 and Mean Reciprocal Rank (MRR), which is the average of the reciprocal ranks of the actual hypernyms.
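Both metrics can be computed from the ranked candidate list of each new concept. The sketch below assumes a first-hit convention (a query counts as a hit if any true position appears in the top k, and its reciprocal rank uses the best-ranked true position); taxonomy completion papers differ on such details, so this should be read as one plausible variant rather than the benchmark's exact protocol.

```python
def hit_at_k(ranked, gold, k=10):
    """1.0 if any gold candidate appears among the top-k ranked candidates, else 0.0."""
    return 1.0 if any(c in gold for c in ranked[:k]) else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the best-ranked gold candidate, or 0.0 if none is retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def evaluate(queries, k=10):
    """Average Hit@k and MRR over queries given as (ranked candidates, gold set) pairs."""
    n = len(queries)
    hit = sum(hit_at_k(ranked, gold, k) for ranked, gold in queries) / n
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in queries) / n
    return hit, mrr


if __name__ == "__main__":
    queries = [
        (["machine learning", "deep learning", "statistics"], {"deep learning"}),
        (["databases", "networks", "security"], {"graphics"}),
    ]
    print(evaluate(queries, k=2))
```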

Table 13. Results of concept taxonomy completion.

Method		BiLinear	TaxoExpan	TaxoEnrich
MAG-full	Hit@10	0	0.216	0.003
MAG-full	MRR	0.002	0.221	0.008
MAG-CS	Hit@10	0.022	0.301	0.132
MAG-CS	MRR	0.059	0.376	0.204
OAG-AI	Hit@10	0.022	0.343	0.166
OAG-AI	MRR	0.059	0.422	0.238

Experimental Results. Table 13 reports the performance of concept taxonomy completion. TaxoExpan outperforms the other methods on all three datasets, indicating the effectiveness of the position-augmented graph neural network. TaxoEnrich surpasses TaxoExpan in its original paper, likely due to its more powerful pre-trained representations; here, TaxoExpan also uses SciBERT to encode concepts. The BiLinear model’s simplicity limits its expressive power, affecting prediction accuracy. The results on OAG-AI are obtained by running inference with the model pre-trained on MAG-CS and show trends similar to those on the other datasets. Hit@10 does not exceed 0.35 on any dataset, indicating the challenge of automatic taxonomy construction and the potential need for more information or more powerful model architectures.

A.6. Academic Influence Prediction

In this subsection, we provide the experimental setup and results for author influence prediction.

Baselines. We adopt the following baselines: (1) ARIMA: is a statistical model for time series forecasting; (2) Linear Regression: defines a series of features of each author, including the author’s annual citations in the past 20 years, the total number of citations, the total number of papers, the H-index (https://en.wikipedia.org/wiki/H-index), and the estimated citation number of the author by using author citations and paper-author relations. Then, we use the Linear Regression model to predict the number of citations of the author. (3) GBRT (Friedman, 2001): uses the same features as linear regression. (4) LSTM (Hochreiter and Schmidhuber, 1997): uses the features (#citations and #papers) of the author in the past 20 years as a time series and uses LSTM for regression prediction. (5) EvolveGCN (Pareja et al., 2020): models author influence prediction as a node regression problem on dynamic co-author graphs.
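Among the features listed for the regression baselines, the H-index has a compact standard definition: the largest h such that the author has at least h papers with at least h citations each. A minimal sketch follows; the function name and input format are chosen for illustration.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break  # counts only decrease from here, so h cannot grow further
    return h


if __name__ == "__main__":
    # Four papers have at least 4 citations, but there are not five with at least 5.
    print(h_index([10, 8, 5, 4, 3]))
```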

Evaluation Metrics. The root mean square error (RMSE) between the predicted cited number and the actual cited number is used as the evaluation metric.
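For reference, RMSE over a set of authors can be computed as in the short sketch below; the argument names are illustrative, and the numbers are toy values rather than benchmark data.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual citation counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Errors of 10 and 20 citations: sqrt((100 + 400) / 2) = sqrt(250).
    print(rmse([100, 200], [110, 180]))
```

Because errors are squared before averaging, a few authors with very large citation counts can dominate the score.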

Table 14. Performance of author influence prediction (RMSE).

Method	AuthPred-2016	AuthPred-2022
ARIMA	1225	23920
Linear Regression	562	22057
GBRT	553	21777
LSTM	1034	25409
EvolveGCN	969	22841

Experimental Results. Table 14 presents the results for author influence prediction. GBRT has the smallest prediction errors on both datasets, demonstrating its superior fitting ability over linear regression and the effectiveness of input features such as author citations. ARIMA’s poor performance suggests that time-series-based statistical methods struggle to predict academic influence. Beyond past achievements, the future influence of scholars also depends on dynamic and more complex factors. EvolveGCN outperforms LSTM, indicating that co-author network dynamics contain factors related to author influence. The larger prediction error on the AuthPred-2022 dataset could be due to differences in citation statistics between AMiner and Google Scholar, or to the increased difficulty of predicting author influence in 2022 given the surge in paper numbers.

Appendix B Ethical Statement

OAG-Bench involves author-centric attributes. We exclude sensitive attributes such as email and profile photo, so that the released attributes are ones that are publicly accessible elsewhere. For online publications, OAG-Bench provides publicly available metadata and very few parsed full-texts of open-access papers for research purposes. For data annotation, all annotators gave their informed consent for inclusion before they participated in this study.
