Improving Low-Resource Knowledge Tracing Tasks by Supervised Pre-training and Importance Mechanism Fine-tuning

Hengyuan Zhang, Zitao Liu, Shuyan Huang, Chenming Shang, Bojun Zhan, Yong Jiang

Abstract

Knowledge tracing (KT) aims to estimate students’ knowledge mastery based on their historical interactions. Recently, deep learning based KT (DLKT) approaches have achieved impressive performance on the KT task. These DLKT models rely heavily on a large number of available student interactions. However, due to various reasons such as budget constraints and privacy concerns, observed interactions are very limited in many real-world scenarios, i.e., low-resource KT datasets. Directly training a DLKT model on a low-resource KT dataset may lead to overfitting, and it is difficult to choose an appropriate deep neural architecture. Therefore, in this paper, we propose a low-resource KT framework called LoReKT to address the above challenges. Inspired by the prevalent “pre-training and fine-tuning” paradigm, we aim to learn transferable parameters and representations from rich-resource KT datasets during the pre-training stage and subsequently facilitate effective adaptation to low-resource KT datasets. Specifically, we simplify existing sophisticated DLKT model architectures to a pure stack of transformer decoders. We design an encoding mechanism to incorporate student interactions from multiple KT data sources and develop an importance mechanism to prioritize updating parameters with high importance while constraining less important ones during the fine-tuning stage. We evaluate LoReKT on six public KT datasets, and experimental results demonstrate the superiority of our approach in terms of AUC and Accuracy. To encourage reproducible research, we make our data and code publicly available at https://anonymous.4open.science/r/LoReKT-C619.

Keywords: Educational data mining, Knowledge tracing, Pre-training and fine-tuning, Importance mechanism

Affiliations:

[a] Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China

[b] Guangdong Institute of Smart Education, Jinan University, Guangzhou 510610, China

[c] TAL Education Group, Beijing 100080, China

1 Introduction

Knowledge tracing holds a pivotal role within the realm of Intelligent Tutoring Systems (ITS) [1, 2, 3, 4]. Its primary objective is to forecast students’ performance on questions by estimating their mastery of individual knowledge components (KCs; a KC is a generalization of everyday terms like concept, principle, or skill) through an analysis of their past interactions. A KC is a description of a mental structure or process that a learner uses, alone or in combination with other KCs, to accomplish steps in a task or a problem. Take Figure 1 as an example. The student has successively responded to three questions ( $Q_{1}$ to $Q_{3}$ ), achieving correct answers for $Q_{1}$ and $Q_{3}$ , while $Q_{2}$ is answered incorrectly. This pattern suggests that the student may have a proficient understanding of the “Addition”, “Subtraction”, and “Multiplication” KCs, but lacks familiarity with the “Modulo” and “Division” KCs. Leveraging the current knowledge mastery, the KT task aims to predict the student’s performance on the upcoming fourth question, $Q_{4}$ . After gaining insights into students’ knowledge mastery through KT, educators can promptly pinpoint weaknesses and provide targeted exercises for improvement. Additionally, this information can assist online learning platforms in providing a series of adaptive learning services such as learning resource recommendations, customized student learning paths, and personalized teaching plans [5, 6].

Recently, with the remarkable progress of deep learning techniques, many studies develop deep learning based KT (DLKT) models that are trained on massive students’ historical interactions to pursue high accuracy on students’ knowledge mastery estimations. Thus, many publicly available educational datasets have been released for training an effective DLKT model.

Figure 1: An illustration of the KT problem.

However, due to the privacy protection of users of educational applications and the varying learning energy and enthusiasm of students, it is extremely difficult to collect large-scale, high-quality student interaction sequences from real-world educational environments. Therefore, the educational datasets for KT model training frequently involve limited student learning records. For example, ASSISTments2009 is one of the classical KT datasets, and its observed interaction records are collected from only 4,217 students. However, most state-of-the-art DLKT models are designed with stacks of neural networks such as recurrent neural networks and memory networks [7, 8, 9, 10]. Directly training a DLKT model on such a low-resource KT dataset easily leads to overfitting. Furthermore, previous KT works leave it unclear which type of model architecture is most suitable for low-resource KT datasets.

To enhance the capability of deep learning-based models to learn from low-resource datasets, some studies adopt the “pre-training and fine-tuning” paradigm [11, 12, 13, 14]. This paradigm first pre-trains a model on rich-resource datasets and then transfers the learned parameters to the low-resource dataset. Motivated by these promising studies, we propose a simple yet effective framework called LoReKT. LoReKT aims to improve the performance on the low-resource KT dataset by transferring the knowledge tracing capability from the model pre-trained on multiple rich-resource KT datasets. More specifically, in the pre-training stage, we build a foundational pre-trained KT model using a stack of transformer decoders based on multiple rich-resource KT datasets. To enhance the model’s capacity for integrating information from both questions and concepts, we introduce data type embeddings. Furthermore, to enable the model to learn the distinct and shared tracing patterns from multiple KT datasets, we introduce dedicated dataset embeddings for each KT dataset. In the fine-tuning stage, we propose an importance vector-based fine-tuning strategy to allow the model to focus on updating crucial parameters for the specific target low-resource dataset while constraining unimportant parameters to prevent the learning and memorization of noisy information.

As a result, the implementation of LoReKT has the following merits:

1. The framework avoids direct training by initially pre-training the model on rich-resource KT datasets and subsequently fine-tuning it on specific low-resource KT datasets. This approach mitigates the risk of overfitting.
2. The framework leverages a stack of transformer decoders as its backbone, which has demonstrated excellent performance in various “pre-training and fine-tuning” scenarios. Moreover, this backbone requires no additional architecture design efforts, simplifying the process and reducing the reliance on specific model architectures for particular datasets.
3. To further mitigate the overfitting issue, the framework employs a fine-tuning strategy based on an importance mechanism, restricting the learning of less important parameters to prevent the memorization of noisy information.
4. To ensure that our approach can be fairly comparable with other recently developed DLKT models, we follow a publicly available standardized KT task evaluation protocol [15]. We conduct comprehensive and rigorous experiments on three public rich-resource datasets and three low-resource datasets. The results show that after pre-training, the pre-trained KT model comes close to the performance of previous approaches on rich-resource datasets. By fine-tuning the pre-trained KT model on the low-resource datasets, it achieves superior prediction performance in terms of AUC and Accuracy compared to 17 baselines.

2 Related Work

2.1 Deep Learning based Knowledge Tracing

Deep Knowledge Tracing (DKT) pioneered the application of deep learning in knowledge tracing tasks by employing a Long Short-Term Memory (LSTM) layer to encapsulate students’ knowledge states and predict students’ response performances [16]. Since then, many methods have used deep learning techniques to solve the KT problem [1, 2, 3, 4, 7, 17, 8, 18, 19, 20, 21, 22, 10, 23, 24]. For example, Yeung and Yeung [17] leveraged a prediction-consistent regularization mechanism to mitigate issues related to input reconstruction failure and prediction inconsistency in the context of DKT [16]. Zhang et al. [7] integrated a meticulously designed static key matrix for storing the interconnections among different knowledge components (KCs) and a dynamic value matrix to iteratively update the knowledge state of students. Motivated by the learning curve theory [25], Nagatani et al. [8] took students’ forgetting behavior into consideration to enhance DKT [16]. Lee and Yeung [18] used a student knowledge state encoder and a skill encoder to predict the student response performance via the dot product. Tiana et al. [21] performed multi-task learning based on the bidirectional encoder representations to construct mixed representations of questions. To mitigate the potential issue of limited generalization in DLKT, adversarial training techniques, such as adversarial perturbations, have been applied to the original student interaction sequence. Specifically, Guo et al. [10] improved the generalization capability of the DLKT model by incorporating adversarial perturbations at the embedding level of the student interaction sequence. The carefully designed perturbations contribute to the model’s effective generalization across diverse student interactions. Moreover, certain studies have concentrated on exploring the interactions between student responses and questions, as well as the associations between questions and KCs. For example, Nakagawa et al. [26] constructed a question-concept knowledge graph and utilized a graph neural network to aggregate the node features related to the corresponding concepts and subsequently update the student’s knowledge states. Additionally, Ma et al. [23] employed a self-supervised learning paradigm to identify the latent relationship between questions and KCs, thereby enhancing input representations. Another research direction focused on the interdependence among student interactions, aiming to capture finer details embedded within them. For instance, Pandey and Karypis [19] utilized a self-attention mechanism to grasp the relationships between exercises and students’ responses. Ghosh et al. [20] presented AKT, which utilizes two self-attention modules to extract the inner relevance of questions and interactions respectively, and explicitly models students’ forgetting behaviors via a monotonic attention mechanism.

In this paper, unlike the aforementioned DLKT methods that are committed to developing a series of sophisticated architectures, our LoReKT is based on a stack of simple transformer decoders. This unified backbone aims to break down disciplinary barriers and learn consistent representations across multiple KT datasets.

2.2 Pre-training for Low-resource Setting

In real-world scenarios, low-resource settings are commonplace, and the “pre-training and fine-tuning” paradigm has consistently proven its efficacy in addressing challenges within such contexts [27, 28, 29, 30, 31, 32, 33]. For example, Yosinski et al. [27] explored the transferability of AlexNet and observed that the initial three layers of AlexNet encapsulate general features conducive to transferability. By introducing fine-tuning to the neural network, this work successfully mitigated data variability and scarcity, consequently enhancing the overall network performance. Bansal et al. [29] introduced a straightforward methodology to enhance direct speech-to-text translation (ST) in scenarios where the source language is low-resource. The approach involves initial pre-training of the model on a high-resource automatic speech recognition (ASR) task, followed by a subsequent fine-tuning process to refine its parameters specifically for ST. Zhang et al. [30] proposed an adaptive data augmentation fine-tuning technique, designed to facilitate the efficient transfer of Named Entity Recognition (NER) knowledge from resource-rich domains to low-resource target domains. Liu et al. [31] utilized pre-training data in both the initial pre-training and subsequent fine-tuning stages, strategically enhancing the model’s performance across low-resource datasets. This dual-stage utilization of pre-training data contributes to a comprehensive and effective optimization, addressing the challenges posed by limited data availability in low-resource scenarios. Chi et al. [32] investigated the prospect of enhancing the performance of low-resource non-English languages by incorporating pre-trained language models that are primarily trained on English. This exploration seeks to leverage the knowledge embedded in English-dominant language models to boost the capabilities of models applied to non-English languages with limited resources. To generate high-quality definitions for low-resource languages, Zhang et al. [33] leveraged a multilingual pre-trained model as the backbone and employed a prompt contrastive fine-tuning approach to enhance the model’s capabilities in this specific linguistic context.

In this paper, we adhere to the “pre-training and fine-tuning” paradigm rather than directly training a DLKT model on low-resource KT datasets. This choice is made to mitigate potential overfitting issues and enhance the overall performance of the model in low-resource KT scenarios.

3 Problem Statement

The objective of the KT problem is to predict the probability that a student will answer an arbitrary question $q_{*}$ correctly based on the student’s historical interaction data. Specifically, suppose a student’s chronologically ordered collection of $T$ past interactions is denoted as $\mathbf{S}=\{\mathbf{s}_{j}\}_{j=1}^{T}$ . Each student interaction $\mathbf{s}_{j}$ is represented as an ordered 4-tuple, i.e., $\mathbf{s}_{j}=\langle q_{j},\{c|c\in\mathcal{N}_{q_{j}}\}_{j},r_{j},t_{j}\rangle$ , where $q_{j}$ , $\{c\}_{j}$ , $r_{j}$ and $t_{j}$ represent the specific question, the associated KC set, the student response (with $r_{j}\in\{0,1\}$ , where 1 means the student answered correctly and 0 otherwise) and the student’s response timestamp, respectively. $\mathcal{N}_{q_{j}}$ is the set of KCs that are associated with the question $q_{j}$ . We would like to estimate the probability $\hat{r}_{*}$ of the student’s future performance on an arbitrary question $q_{*}$ .
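As a minimal sketch of the data layout described above, one interaction can be held in a small record, and a chronologically ordered sequence can be turned into (history, target) pairs for next-response prediction. The names `Interaction`, `kcs`, `timestamp`, and `next_step_pairs` are illustrative assumptions and are not taken from the LoReKT code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Interaction:
    """One student interaction s_j = <q_j, {c}_j, r_j, t_j>."""
    question: int          # q_j, question ID
    kcs: FrozenSet[int]    # KC set associated with q_j
    response: int          # r_j: 1 if answered correctly, 0 otherwise
    timestamp: int         # t_j, response time (the unit is an assumption)

    def __post_init__(self) -> None:
        if self.response not in (0, 1):
            raise ValueError("response must be 0 or 1")


def next_step_pairs(
    seq: List[Interaction],
) -> List[Tuple[List[Interaction], Interaction]]:
    """Pair each history prefix s_1..s_t with the interaction at step t+1."""
    ordered = sorted(seq, key=lambda s: s.timestamp)  # chronological order
    return [(ordered[:t], ordered[t]) for t in range(1, len(ordered))]


history = [
    Interaction(question=3, kcs=frozenset({0}), response=1, timestamp=10),
    Interaction(question=7, kcs=frozenset({1, 2}), response=0, timestamp=20),
    Interaction(question=5, kcs=frozenset({0, 2}), response=1, timestamp=30),
]
pairs = next_step_pairs(history)
```

Each pair holds the interactions observed so far and the interaction whose response $\hat{r}_{*}$ is to be predicted.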

4 The Framework

In this section, we introduce the procedures of our proposed LoReKT framework in detail: (1) obtaining a foundational pre-trained KT model through learning from rich-resource KT datasets during the pre-training stage (Section 4.1); (2) efficiently adapting the pre-trained KT model to the low-resource dataset using an importance vector in the fine-tuning stage (Section 4.2).

4.1 Pre-training Stage

Our objective in the pre-training stage is to learn transferable parameters and representations from certain rich-resource datasets and build a pre-trained KT foundation model that is able to quickly adapt to low-resource KT datasets.

4.1.1 Interaction Encoding

Because the question bank is typically much larger than the set of KCs, previous research mainly used KCs for interaction encoding, treating questions with the same KCs as identical [16, 34, 35]. However, this approach led DLKT models to overlook the unique characteristics of same-KC questions, limiting interaction representation. To address this, we align with works that include both individual question features and KCs to encode interactions in a more granular manner [36, 20, 24].

Specifically, let $\mathbf{D}=\{D_{i}\}_{i=1}^{I}$ be the mixed students’ learning sequences, where $D_{i}=\{\mathbf{S}^{i}_{1},...,\mathbf{S}^{i}_{n}\}$ . $\mathbf{S}^{i}_{j}$ is the $j$ th learning sequence from rich-resource dataset $i$ . $n$ is the number of learning sequences in $D_{i}$ , and $I$ is the number of rich-resource KT datasets. Let $\mathcal{N}_{q_{t}}$ be the set of KCs associated with $q_{t}$ . We represent question $q_{t}$ and its corresponding KCs as follows:

	$\displaystyle\mathbf{q}_{t}=\mathbf{W}^{q}\cdot\mathbf{e}^{q}_{t}$
	$\displaystyle\bar{\mathbf{c}}_{t}=\frac{1}{\left\|C_{q_{t}}\right\|}\sum_{j=1}^{M}\mathbf{c}_{j}*\mathbb{I}(c_{j}\in\mathcal{N}_{q_{t}})$		(1)
	$\displaystyle\mathbf{c}_{j}=\mathbf{W}^{c}\cdot\mathbf{e}^{c}_{j}$

where $\mathbf{e}^{q}_{t}\in\mathbb{R}^{N\times 1}$ and $\mathbf{e}^{c}_{j}\in\mathbb{R}^{M\times 1}$ are the one-hot vectors indicating the question $q_{t}$ and the related KC in $\mathcal{N}_{q_{t}}$ . $\mathbf{c}_{j}\in\mathbb{R}^{d\times 1}$ is the latent representation of one KC related to question $q_{t}$ . $\mathbf{q}_{t}\in\mathbb{R}^{d\times 1}$ and $\bar{\mathbf{c}}_{t}\in\mathbb{R}^{d\times 1}$ are the latent embeddings of $q_{t}$ and its corresponding KCs, respectively. $\mathbf{W}^{q}\in\mathbb{R}^{d\times N}$ and $\mathbf{W}^{c}\in\mathbb{R}^{d\times M}$ are learnable linear transformation operations. $N$ and $M$ are the total numbers of distinct questions and KCs in our mixed dataset $\mathbf{D}$ , respectively. (We reassign ID numbers for all questions and KCs in the dataset $\mathbf{D}$ based on the values of $N$ and $M$ . In the fine-tuning stage, for a specific low-resource KT dataset, we adjust the ID numbers for its questions and KCs, starting from $N$ and $M$ . Additionally, we expand the size of $\mathbf{W}^{q}$ and $\mathbf{W}^{c}$ to obtain their corresponding question and KC representations, i.e., $\mathbf{q}_{t}$ and $\mathbf{c}_{j}$ .) $\cdot$ is the standard matrix/vector multiplication. $C_{q_{t}}$ is the size of $\mathcal{N}_{q_{t}}$ and $\mathbb{I}(\cdot)$ is the indicator function.
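To make Eq. (1) concrete, the following is a small sketch of the question lookup and the averaged KC embedding in plain Python. Since multiplying $\mathbf{W}^{q}$ or $\mathbf{W}^{c}$ by a one-hot vector selects a single column, the sketch stores the embeddings as lookup tables; the table contents, the dimension $d=2$, and the function name `mean_kc_embedding` are made up for illustration.

```python
from typing import Dict, List, Set

Vector = List[float]


def mean_kc_embedding(kc_table: Dict[int, Vector], kcs_of_question: Set[int]) -> Vector:
    """Average the embeddings c_j of the KCs in N_{q_t} (the c-bar term of Eq. (1))."""
    if not kcs_of_question:
        raise ValueError("a question must be linked to at least one KC")
    dim = len(next(iter(kc_table.values())))
    total = [0.0] * dim
    for kc_id in kcs_of_question:
        for k, value in enumerate(kc_table[kc_id]):
            total[k] += value
    return [value / len(kcs_of_question) for value in total]


# Toy embedding tables with d = 2 (hypothetical values).
question_table = {0: [0.5, -1.0], 1: [0.0, 2.0]}
kc_table = {0: [1.0, 0.0], 1: [3.0, 4.0], 2: [-1.0, 2.0]}

q_t = question_table[1]                          # q_t = W^q e^q_t
c_bar_t = mean_kc_embedding(kc_table, {0, 1})    # mean of c_0 and c_1
```

Iterating only over the KCs in $\mathcal{N}_{q_{t}}$ is equivalent to summing over all $M$ KCs with the indicator function as a mask.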

Drawing inspiration from the powerful pre-trained model BERT [37], which leverages token type embeddings to improve the integration of various token information, we introduce data type embeddings to the KT problem. In this problem, there are two distinct data types: questions and concepts. To improve the pre-trained KT model’s ability to incorporate information from both, we introduce question and concept data type embeddings, which are directly integrated into all question and concept embeddings:

	$\displaystyle\widetilde{\mathbf{q}}_{t}=\mathbf{q}_{t}\oplus\mathbf{t}_{q};\quad\widetilde{\mathbf{c}}_{t}=\bar{\mathbf{c}}_{t}\oplus\mathbf{t}_{c}$		(2)

where $\mathbf{t}_{q}\in\mathbb{R}^{d\times 1}$ and $\mathbf{t}_{c}\in\mathbb{R}^{d\times 1}$ are the question and concept data type embeddings, $\oplus$ is the element-wise addition operator. $\widetilde{\mathbf{q}}_{t}$ and $\widetilde{\mathbf{c}}_{t}$ are the question and KCs embeddings enriched with data type information.

Finally, we combine the embeddings of the question, its corresponding KCs, and the response to encode the interaction $\mathbf{e}_{t}\in\mathbb{R}^{d\times 1}$ , i.e.:

	$\displaystyle\mathbf{x}_{t}=\widetilde{\mathbf{q}}_{t}\oplus\widetilde{\mathbf{c}}_{t};\quad\mathbf{r}_{t}=\mathbf{W}^{a}\cdot\mathbf{a}^{q}_{t};\quad\mathbf{e}_{t}=\mathbf{x}_{t}\oplus\mathbf{r}_{t}$		(3)

where $\mathbf{x}_{t}$ is the question-concept (QC) embedding, $\mathbf{a}^{q}_{t}\in\mathbb{R}^{2\times 1}$ is the one-hot vector indicating whether the question $q_{t}$ is answered correctly and $\mathbf{W}^{a}\in\mathbb{R}^{d\times 2}$ is a learnable linear transformation operation. The illustration of the interaction encoding procedure is shown in Figure 2 (a).
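The encoding steps of Eqs. (2) and (3) reduce to a chain of element-wise additions, which can be sketched as follows. The helper names `add` and `encode_interaction` and all numeric values are assumptions chosen for illustration; the response embedding is again written as a table lookup in place of the product with a one-hot vector.

```python
from typing import List, Tuple

Vector = List[float]


def add(u: Vector, v: Vector) -> Vector:
    """Element-wise addition (the circled-plus operator in Eqs. (2) and (3))."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def encode_interaction(q: Vector, c_bar: Vector, t_q: Vector, t_c: Vector,
                       response_table: List[Vector],
                       response: int) -> Tuple[Vector, Vector]:
    """Return (x_t, e_t): the QC embedding and the interaction embedding."""
    q_tilde = add(q, t_q)            # Eq. (2): question + question type embedding
    c_tilde = add(c_bar, t_c)        # Eq. (2): KC mean + concept type embedding
    x_t = add(q_tilde, c_tilde)      # Eq. (3): question-concept (QC) embedding
    r_t = response_table[response]   # Eq. (3): column of W^a picked by the one-hot a_t
    e_t = add(x_t, r_t)              # Eq. (3): interaction embedding
    return x_t, e_t


x_t, e_t = encode_interaction(
    q=[0.0, 2.0], c_bar=[2.0, 2.0],
    t_q=[0.1, 0.1], t_c=[-0.1, 0.2],
    response_table=[[0.0, -1.0], [1.0, 0.0]],  # index 0: incorrect, index 1: correct
    response=1,
)
```

Note that $\mathbf{x}_{t}$ carries no response information, which is what allows it to describe a question whose answer is not yet known.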

4.1.2 Pre-training

Recently, generative pre-trained models that are based on the Transformer architecture have achieved promising results in various tasks compared to designing a sophisticated neural network for a specific task [38, 13, 14]. Drawing inspiration from these impressive findings, we opt to directly utilize a stack of transformer decoders. This choice enables us to dynamically capture student knowledge states without additional architecture design efforts:

	$\displaystyle\mathbf{h}^{(0)}=\mathbf{E}\oplus\mathbf{P}$
	$\displaystyle\mathbf{h}^{(l)}=\mathbf{Transformer\_block}(\mathbf{h}^{(l-1)})\quad\forall{l}\in[1,L]$		(4)

where $\mathbf{E}=(\mathbf{e}_{1},...,\mathbf{e}_{T})$ is the embedding matrix of the $T$ past interactions. $\mathbf{P}$ is the position embedding matrix. $L$ is the number of layers. $\mathbf{h}^{(l)}\in\mathbb{R}^{T\times d}$ is the knowledge state embedding matrix of a student over the $T$ past interactions. Please note that to estimate student knowledge states via their historical interactions, we use the QC embedding $\mathbf{x}_{t}$ for mapping both queries and keys, and the interaction embedding $\mathbf{e}_{t}$ for mapping values in the self-attention mechanism.
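The query/key/value assignment described above can be illustrated with a single-head causal attention step in plain Python. This is only a sketch under simplifying assumptions: the learned projection matrices, multiple heads, position embeddings, and the remaining parts of a transformer block are omitted, and letting position $t$ attend to itself is an assumption rather than a detail confirmed by the text.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_attention(x: Matrix, e: Matrix) -> Matrix:
    """Single-head causal attention: queries/keys from x, values from e.

    Learned projections are replaced by the identity, which is a
    simplification made for this sketch.
    """
    d = len(x[0])
    out = []
    for t in range(len(x)):
        # Scaled dot-product scores against positions 0..t only (causal mask).
        scores = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)                      # subtract the max for stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * e[s][k] for s, w in enumerate(weights)) / norm
                    for k in range(len(e[0]))])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # QC embeddings x_1..x_3
e = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # interaction embeddings e_1..e_3
h = causal_attention(x, e)
```

Because the attention weights depend only on $\mathbf{x}$, the similarity between questions decides which past interactions are read, while the responses enter only through the values.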

Furthermore, inspired by prompt learning techniques that effectively capture diverse and overlapping patterns in multi-task learning scenarios [39, 40, 41, 42], we introduce dedicated dataset embeddings for individual KT datasets. Given that each KT dataset has its unique prediction paradigm stemming from variations in question banks and KCs, this enhancement empowers the pre-trained KT model to effectively capture the specific and shared information across different KT datasets. Specifically, the knowledge state $\mathbf{h}^{(l)}$ is first concatenated with corresponding dataset embedding $\mathbf{d}_{i}$ and QC embedding $\mathbf{x}_{t+1}$ , then fed into a two-layer fully connected network with Sigmoid activation function $\sigma(\cdot)$ to predict the performance of a student on next question $q_{t+1}$ :

	$\displaystyle\mathbf{d}_{i}=\mathbf{W}^{d}\cdot\mathbf{e}^{d}$
	$\displaystyle\mathbf{y}_{t+1}=\mbox{ReLU}(\mathbf{W}_{1}\cdot[\mathbf{h}_{t+1}^{(l)};\mathbf{x}_{t+1};\mathbf{d}_{i}]+\mathbf{b}_{1})$		(5)
	$\displaystyle\hat{r}_{t+1}=\sigma(\mathbf{w}^{\top}\cdot\mbox{ReLU}\bigl(\mathbf{W}_{2}\cdot\mathbf{y}_{t+1}+\mathbf{b}_{2}\bigr)+b)$

where $\mathbf{W}^{d}\in\mathbb{R}^{d\times I}$ is a learnable linear transformation operation (in the fine-tuning stage, we expand the size of $\mathbf{W}^{d}$ to assign the dataset embedding for the specific low-resource KT dataset), and $I$ is the total number of rich-resource KT datasets in the pre-training stage. $\mathbf{e}^{d}\in\mathbb{R}^{I\times 1}$ is the one-hot vector indicating the dataset that the current interaction belongs to. $\mathbf{W}_{1}\in\mathbb{R}^{d\times 3d}$ (its input is the concatenation of three $d$-dimensional vectors), $\mathbf{W}_{2}\in\mathbb{R}^{d\times d}$ , $\mathbf{w}\in\mathbb{R}^{d\times 1}$ , $\mathbf{b}_{1}\in\mathbb{R}^{d\times 1}$ , $\mathbf{b}_{2}\in\mathbb{R}^{d\times 1}$ and $b$ are trainable parameters. All learnable parameters in LoReKT are trained in an end-to-end fashion by minimizing the binary cross entropy loss between the predicted probability $\hat{r}_{t}$ and the ground-truth label $r_{t}$ :

	$\displaystyle\mathcal{L}_{\text{KT}}=-\sum_{t=1}^{T}\bigl(r_{t}\log\hat{r}_{t}+(1-r_{t})\log(1-\hat{r}_{t})\bigr)$		(6)

The forward procedure is illustrated in Figure 2 (b).
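The prediction head of Eq. (5) and the loss of Eq. (6) can be sketched in a few lines of plain Python. The toy dimension $d=1$, the weight values, and the helper names (`predict`, `bce_loss`, `matvec`, `relu`) are assumptions for illustration; the `eps` clamp in the loss is a common numerical safeguard added here and is not mentioned in the text.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def predict(h: Vector, x_next: Vector, d_i: Vector, W1: Matrix, b1: Vector,
            W2: Matrix, b2: Vector, w: Vector, b: float) -> float:
    """Eq. (5): concatenate [h; x_{t+1}; d_i], apply two ReLU layers, then sigmoid."""
    z = h + x_next + d_i                                   # list concatenation
    y = relu([a + c for a, c in zip(matvec(W1, z), b1)])
    u = relu([a + c for a, c in zip(matvec(W2, y), b2)])
    return sigmoid(sum(a * c for a, c in zip(w, u)) + b)


def bce_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-12) -> float:
    """Summed binary cross-entropy of Eq. (6); eps guards against log(0)."""
    total = 0.0
    for r, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total


# Toy example with d = 1, so the concatenated input has three entries.
p = predict(h=[1.0], x_next=[2.0], d_i=[0.5],
            W1=[[1.0, 1.0, 1.0]], b1=[0.0],
            W2=[[1.0]], b2=[-3.5], w=[1.0], b=0.0)
loss = bce_loss([1, 0], [p, p])
```

In this toy setting the first weight matrix has one row and three columns, reflecting that its input is the concatenation of the knowledge state, the QC embedding of the next question, and the dataset embedding.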

4.2 Fine-tuning Stage

In low-resource scenarios, overfitting is a common problem, as some model parameters may learn and memorize noisy dataset information, thereby hindering the model’s ability to generalize. This problem is especially severe in the low-resource KT setting. To alleviate it, we introduce a novel importance vector-based fine-tuning strategy that encourages the model to focus on updating the important parameters while constraining less important ones.

4.2.1 Computing Importance Vector of Layer

The backbone of LoReKT is a stack of transformer decoders. The key components of a transformer decoder are the multi-head attention layer, the intermediate layer, and the output layer. (In this paper, we use “layer” or $l$ to indicate any of these three layers, because the procedure of computing the importance vector is similar for all three.) It has been found that not all units (neurons or attention heads) in a specific layer are important [43]. Therefore, before directly fine-tuning the model on each low-resource KT dataset, we adopt the approach described by Ke et al. [44] to compute the importance vector for each layer. This is achieved by employing a gradient-based importance detection method, which is specifically tailored for each low-resource KT dataset:

	$\displaystyle\hat{\mathbf{o}}_{l}=\mathbf{g}_{l}\odot\mathbf{o}_{l}$
	$\displaystyle\mathbf{I}_{l}=\frac{1}{N}\sum_{n=1}^{N}\left|\frac{\partial\mathcal{L}_{\text{KT}}}{\partial\mathbf{g}_{l}}\right|$		(7)

where $\mathbf{o}_{l}$ refers to the output of layer $l$ (which can be any of the three layers mentioned above), and $\odot$ refers to element-wise multiplication. $\mathbf{g}_{l}$ serves as a virtual parameter, sharing the same dimensions as $\mathbf{o}_{l}$ , with each of its elements initialized to 1. It remains unchanged during the computing process: we need only its average gradient $\nabla\mathbf{g}_{l}$ (the term within $\left|\right|$ in Eq. (7)) over all the samples in the low-resource KT dataset, and we never use that gradient to update $\mathbf{g}_{l}$ . A unit whose virtual parameter receives a gradient of larger magnitude is considered more important, as it has a more significant impact on the loss. Therefore, the gradient of each parameter $g_{l,j}$ in $\mathbf{g}_{l}$ can be regarded as the importance of unit $j$ in layer $l$ . $\mathbf{I}_{l}$ is the importance vector of layer $l$ , which is of the same size as ${\mathbf{g}_{l}}$ ; $N$ is the number of samples in the current low-resource KT dataset, and $\mathcal{L}_{\text{KT}}$ is the loss defined in Eq. (6). Note that each low-resource KT dataset has its own $\mathbf{I}_{l}$ for layer $l$ . The $\mathcal{L}_{\text{KT}}$ loss for each low-resource KT dataset is computed based on the zero-shot performance of the pre-trained KT model (as obtained in Section 4.1). The illustration of computing the importance vector of layer $l$ is shown in Figure 3 (c).
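The gate-gradient idea can be illustrated on a toy model in which a layer output $\mathbf{o}_{l}$ is multiplied by a gate $\mathbf{g}_{l}=\mathbf{1}$ and fed to a single sigmoid unit. The sketch below is an assumption-laden stand-in for the real network: the downstream weights `w`, the samples, and the function names are hypothetical, and the gradient is written analytically for this toy head rather than obtained by automatic differentiation. It averages per-sample absolute gradients, following Eq. (7) as printed; taking the absolute value after averaging is the other possible reading of the text.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], int]   # (layer output o_l for one sample, label r)


def loss_with_gate(o: Sequence[float], g: Sequence[float],
                   w: Sequence[float], r: int) -> float:
    """BCE loss of a toy head sigmoid(w . (g * o)) for one sample."""
    logit = sum(wj * gj * oj for wj, gj, oj in zip(w, g, o))
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(r * math.log(p) + (1 - r) * math.log(1.0 - p))


def gate_gradient(o: Sequence[float], w: Sequence[float], r: int) -> List[float]:
    """Analytic dL/dg evaluated at g = 1 (the gate itself is never updated)."""
    logit = sum(wj * oj for wj, oj in zip(w, o))
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - r) * wj * oj for wj, oj in zip(w, o)]


def importance_vector(samples: Sequence[Sample], w: Sequence[float]) -> List[float]:
    """Average absolute gate gradient per unit over all samples."""
    total = [0.0] * len(w)
    for o, r in samples:
        for j, grad in enumerate(gate_gradient(o, w, r)):
            total[j] += abs(grad)
    return [t / len(samples) for t in total]


w = [2.0, 0.0, -1.0]   # hypothetical downstream weights
samples = [([1.0, 5.0, 0.5], 1), ([0.5, -3.0, 1.0], 0)]
importance = importance_vector(samples, w)
```

The second unit has a zero downstream weight, so its gate gradient vanishes and it receives zero importance even though its activations are large.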

4.2.2 Fine-tune with Importance Vector

After obtaining the importance vector $\mathbf{I}_{l}$ for each layer in each low-resource KT dataset using the pre-trained model, we first compute the original gradient $\nabla_{l}$ by employing $\mathcal{L}_{\text{KT}}$ defined in Eq. (6). Subsequently, we apply the importance vector $\mathbf{I}_{l}$ to obtain the modified gradient $\hat{\nabla}_{l}$ for updating:

	$\displaystyle\hat{\nabla}_{l}=\mathbf{I}_{l}\odot\nabla_{l}$

Here, we expand (by copying) the $\mathbf{I}_{l}$ to match the dimensions of $\nabla_{l}$ to apply it to all associated parameters. The modified gradient $\hat{\nabla}_{l}$ is only employed in the backward pass. This encourages the model to prioritize updating the associated parameters with high importance instead of less important ones by regulating their gradient flow. The procedure of fine-tuning based on importance vector is shown in Figure 3 (d).
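A minimal sketch of the gradient modification is given below. It assumes that row $j$ of a layer's weight gradient holds the parameters that produce unit $j$, so that copying $\mathbf{I}_{l}[j]$ across that row realizes the "expand by copying" step. The plain gradient-descent update stands in for the actual optimizer, and the importance values are hypothetical and assumed to be already scaled to $[0, 1]$, a normalization that the text does not specify.

```python
from typing import List

Matrix = List[List[float]]


def mask_gradient(grad: Matrix, importance: List[float]) -> Matrix:
    """Scale row j of a layer's weight gradient by the importance of unit j."""
    if len(grad) != len(importance):
        raise ValueError("one importance value is needed per unit (row)")
    return [[imp * g for g in row] for row, imp in zip(grad, importance)]


def sgd_step(weights: Matrix, grad: Matrix, lr: float) -> Matrix:
    """Plain gradient descent update, standing in for the real optimizer."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad)]


weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grad = [[0.5, -0.5], [2.0, 2.0], [1.0, 0.0]]
importance = [1.0, 0.0, 0.5]   # hypothetical I_l, one value per unit

new_weights = sgd_step(weights, mask_gradient(grad, importance), lr=1.0)
```

Parameters of the unit with zero importance stay at their pre-trained values despite having the largest raw gradient, which is the constraining effect described above.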

5 Experiment

Dataset	AS2009	NIPS34	AL2005	BD2006	XES3G5M	EdNet
Resource level	Low	Low	Low	Rich	Rich	Rich
# of Ques.	26,688	948	210,710	207,856	7,652	12,235
# of KCs	123	57	112	493	865	188
# of Interactions	346,860	1,382,727	809,694	3,679,199	5,549,635	6,533,522
avg KCs	1.1969	1.0148	1.3634	1.0136	1.1640	2.2611
Subject	Math	Math	Math	Math	Math	Linguistics
Language	English	English	English	English	Chinese	English

Table 1: Dataset statistics of 6 datasets. “avg KCs” denotes the number of average KCs per question.

In this section, we present details of our experiment settings and the corresponding results. We conduct comprehensive analysis to illustrate the effectiveness of our LoReKT framework. Specifically, we aim to answer the following research questions: (RQ1) Can we build a solid pre-trained foundational model for KT? (RQ2) In low-resource scenarios, how does our proposed LoReKT framework perform compared to the state-of-the-art KT methods? (RQ3) Does pre-training truly enhance the performance of the model on low-resource KT datasets? (RQ4) In the fine-tuning stage, is it effective to focus on updating important parameters based on $\mathbf{I}_{l}$ ? (RQ5) How do the dataset and data type embeddings affect the pre-trained KT model?

5.1 Datasets

Since our LoReKT framework first learns transferable parameters and representations from rich-resource KT datasets and then quickly adapts to low-resource scenarios, we select three rich-resource datasets, namely BD2006 [47], XES3G5M [48], and EdNet [49], to establish a robust foundational pre-trained KT model. We further fine-tune the pre-trained KT model on three low-resource KT datasets, namely AS2009 [50], NIPS34 [51], and AL2005 [47], respectively. The data statistics for the six selected datasets can be found in Table 1. The detailed descriptions are as follows:

Rich-resource KT Datasets

1. Bridge2algebra2006 (BD2006; https://pslcdatashop.web.cmu.edu/KDDCup/): this dataset is provided by the KDD Cup 2010 EDM Challenge and contains algebra questions answered by 13- to 14-year-old students.
2. XES3G5M (https://github.com/ai4ed/XES3G5M): this large-scale dataset from a Chinese online mathematics learning platform includes rich information about student learning interactions.
3. EdNet (https://github.com/riiid/ednet): this dataset from South Korea’s Santa AI tutoring system is one of the largest KT datasets, with 130+ million student interactions in the TOEIC test.

Low-resource KT Datasets

1. ASSISTments2009 (AS2009; https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data/skill-builder-data-2009-2010): this dataset is one of the classical education datasets and collects students’ responses to mathematics questions from the free online tutoring ASSISTments platform during the school year 2009-2010.
2. NIPS34 (https://eedi.com/projects/neurips-education-challenge): this dataset was released in the NeurIPS 2020 Education Challenge. In our work, we use the data of Tasks 3 and 4, which are the students’ responses to multiple-choice diagnostic math questions on the Eedi platform.
3. Algebra2005 (AL2005; https://pslcdatashop.web.cmu.edu/KDDCup/): similar to BD2006, this dataset was also released in the KDD Cup 2010 EDM Challenge.

5.2 Experimental Setting

We remove student sequences shorter than 3 attempts and truncate student interaction sequences that are longer than 200. We use 80% of student sequences for training and validation and the remaining 20% for model evaluation. We adopt the Adam optimizer [52] to train all the models. The number of training epochs is set to 200. Following existing DLKT research [15, 24, 36], we use the Area Under the Curve (AUC) as the main evaluation metric and Accuracy as the secondary evaluation metric. We use an early stopping strategy that stops optimization when the AUC score on the validation set fails to improve in the latest 10 epochs. For reasons of training efficiency, it is difficult to tune many hyperparameters for models with up to about a billion parameters. Hence, we only tune the learning rate in {0.001, 0.0001} and the dropout rate in {0.1, 0.2} for fair comparison across model sizes.
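The preprocessing and stopping rules above can be sketched with three small helpers. Several details are assumptions made for this sketch and are not stated in the text: truncation keeps the first 200 interactions, the 80/20 split is a seeded random shuffle at the student-sequence level, and "fails to improve" means no strict increase of the validation AUC. The function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def filter_and_truncate(seqs: Sequence[Sequence[int]], min_len: int = 3,
                        max_len: int = 200) -> List[List[int]]:
    """Drop sequences shorter than min_len and cut the rest to max_len items."""
    return [list(s[:max_len]) for s in seqs if len(s) >= min_len]


def split_students(seqs: Sequence[Sequence[int]], test_ratio: float = 0.2,
                   seed: int = 0) -> Tuple[list, list]:
    """Shuffle student sequences and hold out test_ratio of them for evaluation."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def early_stop_epoch(val_aucs: Sequence[float], patience: int = 10) -> Optional[int]:
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best, best_epoch = float("-inf"), 0
    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best:
            best, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None


raw = [[1] * 2, [1] * 3, [1] * 250]
clean = filter_and_truncate(raw)
train_val, test = split_students([[i] * 5 for i in range(10)])
stop_at = early_stop_epoch([0.70, 0.80] + [0.75] * 10)
```

The training-and-validation portion returned by `split_students` would still need to be divided further, for example into folds, which this sketch leaves out.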

With the aim of finding a suitable model size for pre-training, we conduct an extensive exploration of various model sizes, encompassing a range from 89M to 1.01B. The corresponding architecture details are summarized in Table 2. $n_{params}$ is the total number of trainable parameters. $n_{layers}$ is the total number of layers, $d_{model}$ is the number of units in each bottleneck layer (the feed-forward layer is denoted as $d_{ff}$ ), and $n_{head}$ is the number of attention heads.

Model Name	$n_{params}$	$n_{layers}$	$d_{model}$	$n_{head}$	$d_{ff}$
LoReKT-Base-89M	89M	4	256	8	256
LoReKT-Base-221M	221M	24	512	16	1024
LoReKT-Base-478M	478M	24	1024	16	1024
LoReKT-Base-1.01B	1.01B	32	1536	24	2560

Table 2: The model sizes and associated architecture details of LoReKT-Base.
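For readers who want to reproduce the size sweep, the rows of Table 2 can be captured as plain configuration objects. The sketch below is illustrative: the class name `LoReKTConfig` is hypothetical, and the per-head dimension assumes that the attention heads split $d_{model}$ evenly, as in standard multi-head attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoReKTConfig:
    """One row of Table 2 (hypothetical container, not part of the released code)."""
    name: str
    n_layers: int
    d_model: int
    n_head: int
    d_ff: int

    @property
    def head_dim(self) -> int:
        # Assumption: heads split d_model evenly, as in standard multi-head attention.
        if self.d_model % self.n_head != 0:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head


CONFIGS = [
    LoReKTConfig("LoReKT-Base-89M", n_layers=4, d_model=256, n_head=8, d_ff=256),
    LoReKTConfig("LoReKT-Base-221M", n_layers=24, d_model=512, n_head=16, d_ff=1024),
    LoReKTConfig("LoReKT-Base-478M", n_layers=24, d_model=1024, n_head=16, d_ff=1024),
    LoReKTConfig("LoReKT-Base-1.01B", n_layers=32, d_model=1536, n_head=24, d_ff=2560),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.n_layers} layers, head_dim={cfg.head_dim}")
```

Under this assumption, every configuration in Table 2 yields an integer per-head dimension.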

5.3 Baselines

To conduct a comprehensive evaluation of LoReKT, we have compared it with 17 selected baseline methods. We have carefully categorized these baseline methods into four categories:

Deep Sequential KT Models

Deep sequential KT models use an auto-regressive framework to dynamically track students’ knowledge states. Representative deep sequential KT models include:

1.

DKT [16]: it uses an LSTM layer to model students’ learning processes.
2.

DKT+ [17]: it improves the original DKT model by addressing the reconstruction and inconsistency issues.
3.

DKT-F [8]: it enhances original DKT by considering students’ forgetting behaviors.
4.

KQN [18]: it is a recurrent neural network (RNN) based architecture that extracts the relation representations between students’ learning abilities and KCs to predict their performance.
5.

LPKT [53]: it designs a learning cell to model the students’ learning processes to estimate their knowledge states.
6.

AT-DKT [54]: it proposes two auxiliary learning tasks, a question tagging prediction task and an individualized prior knowledge prediction task, to improve the prediction performance of DKT.

Memory Augmented KT Models

Memory augmented KT models employ memory networks to capture potential relevances between KCs and student knowledge states. Representative memory augmented KT models include:

1.

DKVMN [7]: it incorporates a static matrix to store the relationships among KCs and a dynamic matrix to track the student’s knowledge state.
2.

SKVMN [55]: it is a combination of DKVMN and LSTM that uses a hop-LSTM layer to capture sequential dependencies of questions.
3.

DeepIRT [56]: it incorporates DKVMN and item response theory to enhance the interpretability of the prediction output of DKVMN.

Attention based KT Models

Attention based KT models capture dependencies between historical interactions and the next questions via the attention mechanism. Representative attention based KT models include:

1.

SAKT [19]: it uses self-attention to identify the relevance between historical interactions and KCs.
2.

SAINT [36]: it is a Transformer-based model for KT that encodes questions and responses in the encoder and decoder respectively.
3.

AKT [20]: it leverages three self-attention modules to estimate the relevance between questions and historical interactions and explicitly models student’s forgetting behavior via a monotonic attention mechanism.
4.

simpleKT [24]: it explores the ordinary dot-product attention based KT models by capturing the individual differences among questions covering the same set of KCs.
5.

sparseKT [57]: it incorporates a k-selection module to only pick items with the highest attention scores to improve the robustness and generalization of the attention based DLKT approaches.
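To make the idea of keeping only the highest attention scores concrete, the following is a minimal sketch of top-k attention for a single query. It illustrates the general technique only and is not necessarily how sparseKT [57] implements its k-selection module; the function name and the choice to renormalize over the kept positions are assumptions.

```python
import math
from typing import List, Sequence


def top_k_attention(scores: Sequence[float], k: int) -> List[float]:
    """Softmax over the k largest raw scores; all other positions get weight 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scores")
    # Indices of the k largest scores (ties resolved by position).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    # Subtract the maximum kept score for numerical stability.
    max_kept = max(scores[i] for i in keep)
    exps = [math.exp(s - max_kept) if i in keep else 0.0
            for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]


weights = top_k_attention([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)  # only positions 0 and 2 receive non-zero weight
```

Setting `k` to the sequence length recovers ordinary softmax attention, so the sparsity level is controlled by a single parameter.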

Other KT Models

Other KT models that do not belong to the above categories:

1.

HawkesKT [58]: it utilizes the Hawkes process to model temporal cross-effects in student historical interactions.
2.

ATKT [10]: it applies adversarial perturbations to the student interaction sequence to enhance the generalization ability of an attention-LSTM based KT model.
3.

GKT [26]: it casts the knowledge structure as a graph and reformulates the KT task as a time series node-level classification problem in GNN.

Method (Chronologically)	Model Type	AUC
Method (Chronologically)	Model Type	AS2009	NIPS34	AL2005	BD2006	XES3G5M	EdNet
DKT [16]	Sequential	0.7525	0.7688	0.8159	0.8018	0.7845	0.6405
DKVMN [7]	Memory	0.7472	0.7677	0.8052	0.7999	0.7796	0.6576
DKT+ [17]	Sequential	0.7543	0.7698	0.8141	0.8019	0.7858	0.6454
DKT-F [8]	Sequential	-	0.7728	0.8146	0.7997	0.7935	0.6548
KQN [18]	Sequential	0.7462	0.7685	0.8010	0.7953	0.7794	0.6415
SKVMN [55]	Memory	0.7332	0.7513	0.7463	0.7287	0.7512	0.6374
DeepIRT [56]	Memory	0.7465	0.7673	0.8040	0.7976	0.7789	0.6387
GKT [26]	Others	0.7442	0.7718	0.8112	0.8041	0.7731	0.6392
SAKT [19]	Attention	0.7221	0.7508	0.7850	0.7748	0.7685	0.6290
SAINT [36]	Attention	0.6990	0.7883	0.7764	0.7758	0.8070	0.6841
AKT [20]	Attention	0.7869	0.8038	0.8324	0.8213	0.8215	0.7054
ATKT [10]	Others	0.7472	0.7664	0.7987	0.7889	0.7791	0.6490
HawkesKT [58]	Others	0.7232	0.7763	0.8199	0.8077	0.7933	0.7304
LPKT [53]	Sequential	0.7812	0.8004	0.8268	0.8056	0.8163	0.7644
AT-DKT [54]	Sequential	0.7555	0.7816	0.8246	0.8104	0.7925	0.6536
simpleKT [24]	Attention	0.7744	0.8035	0.8254	0.8160	0.8161	0.6765
sparseKT [57]	Attention	0.7739	0.8033	0.8152	0.8120	0.8165	0.6804
LoReKT-Base-89M	Attention	0.6041	0.6401	0.5943	0.8049	0.8145	0.7647
LoReKT-Base-221M	Attention	0.6228	0.6452	0.6155	0.8183	0.8192	0.7672
LoReKT-Base-478M	Attention	0.5957	0.6103	0.5834	0.8061	0.8164	0.7659
LoReKT-Base-1.01B	Attention	0.5761	0.5980	0.5745	0.8003	0.8121	0.7633
LoReKT-Ft-impt-221M	Attention	0.7912	0.8002	0.8425	-	-	-
LoReKT-Ft-221M	Attention	0.7833	0.7969	0.8359	-	-	-

Table 3: The overall performance in terms of AUC. The result on each low-resource KT dataset corresponds to a separately fine-tuned model, which leads to different performance on the pre-training datasets. Therefore, we use “-” to denote the results on pre-training datasets. The best result is indicated in bold, while the second best result is underlined.

Method (Chronologically)	Model Type	Accuracy
Method (Chronologically)	Model Type	AS2009	NIPS34	AL2005	BD2006	XES3G5M	EdNet
DKT [16]	Sequential	0.7228	0.7031	0.8105	0.8554	0.8173	0.6665
DKVMN [7]	Memory	0.7196	0.7022	0.8026	0.8547	0.8155	0.6392
DKT+ [17]	Sequential	0.7243	0.7046	0.8085	0.8554	0.8179	0.6668
DKT-F [8]	Sequential	-	0.7076	0.8092	0.8540	0.8209	0.6666
KQN [18]	Sequential	0.7214	0.7028	0.8022	0.8540	0.8154	0.6665
SKVMN [55]	Memory	0.7156	0.6885	0.7837	0.8406	0.8071	0.6572
DeepIRT [56]	Memory	0.7196	0.7020	0.8029	0.8542	0.8152	0.6559
GKT [26]	Others	0.7179	0.7053	0.8078	0.8555	0.8139	0.6672
SAKT [19]	Attention	0.7031	0.6873	0.7948	0.8460	0.8121	0.6519
SAINT [36]	Attention	0.6977	0.7187	0.7770	0.8455	0.8177	0.6624
AKT [20]	Attention	0.7385	0.7320	0.8138	0.8594	0.8275	0.6888
ATKT [10]	Others	0.7206	0.7012	0.7989	0.8510	0.8143	0.6642
HawkesKT [58]	Others	0.7046	0.7110	0.8108	0.8563	0.8191	0.7076
LPKT [53]	Sequential	0.7355	0.7309	0.8154	0.8547	0.8264	0.7243
AT-DKT [54]	Sequential	0.7250	0.7146	0.8144	0.8560	0.8195	0.6684
simpleKT [24]	Attention	0.7320	0.7328	0.8083	0.8579	0.8240	0.6624
sparseKT [57]	Attention	0.7282	0.7322	0.8017	0.8569	0.8234	0.6643
LoReKT-Base-89M	Attention	0.6035	0.6003	0.6753	0.8543	0.8248	0.7243
LoReKT-Base-221M	Attention	0.6216	0.6008	0.6962	0.8596	0.8271	0.7250
LoReKT-Base-478M	Attention	0.5929	0.5821	0.6644	0.8538	0.8253	0.7212
LoReKT-Base-1.01B	Attention	0.5805	0.5745	0.6568	0.8501	0.8239	0.7207
LoReKT-Ft-impt-221M	Attention	0.7402	0.7323	0.8242	-	-	-
LoReKT-Ft-221M	Attention	0.7353	0.7275	0.8159	-	-	-

Table 4: The overall performance in terms of Accuracy. The result on each low-resource KT dataset corresponds to a separately fine-tuned model, which leads to different performance on the pre-training datasets. Therefore, we use “-” to denote the results on pre-training datasets. The best result is indicated in bold, while the second best result is underlined.

5.4 Results

We use different variants of LoReKT to represent its performance under different settings. LoReKT-Base denotes the model obtained after the pre-training stage, without any fine-tuning on a specific low-resource dataset. LoReKT-Ft-impt and LoReKT-Ft denote the models fine-tuned on a specific low-resource KT dataset with and without the importance vector, respectively.

5.4.1 Model Performance after Pre-training (RQ1)

We report the results of the main evaluation metric, i.e., AUC, in Table 3 and the results of the secondary evaluation metric, i.e., Accuracy, in Table 4. From Table 3, we have the following observations: (1) Among the model sizes of LoReKT-Base, LoReKT-Base-221M exhibits the best performance. Initially, as the model size increases, performance improves; however, it begins to decline when the model becomes excessively large. We argue that this indicates that the model first underfits and then overfits. (2) LoReKT-Base-221M demonstrates strong performance across all three pre-training datasets (BD2006, XES3G5M, and EdNet). For example, it achieves the highest AUC score on the EdNet dataset, showcasing a substantial improvement of 12.59% over sparseKT, 9.2% over LPKT and 8.6% over AKT. It ranks second in terms of AUC on BD2006 and XES3G5M and is on par with the best model, AKT, within a performance gap of 0.5%. It is noteworthy that LoReKT-Base-221M, a single model, consistently achieves strong performance across all three datasets. In contrast, the AKT model is trained separately for each dataset, and its performance varies considerably. For instance, while AKT performs well on BD2006 and XES3G5M, its performance on EdNet is notably lower, lagging behind LoReKT-Base-221M by 8.1%. Furthermore, the architecture of LoReKT-Base-221M is more concise than that of AKT, which designs two extra modules on top of the original Transformer architecture. (3) LoReKT-Base-221M also demonstrates impressive zero-shot capabilities on previously unseen datasets (AS2009, NIPS34, and AL2005). For example, it achieves AUC scores of 0.6452 and 0.6228 on NIPS34 and AS2009, well above the chance-level AUC of 0.5. Although the zero-shot performance is still far from usable, it verifies our conjecture that LoReKT-Base has good potential transferability between different disciplines via cross-source learning. Furthermore, its robust zero-shot capabilities enhance the accuracy of the importance vector computed in Section 4.2.1 for each low-resource KT dataset in the fine-tuning stage.
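As a reminder of what the two metrics measure, the sketch below computes AUC as the probability that a randomly chosen correct response is scored above a randomly chosen incorrect one, and Accuracy after thresholding the predicted probabilities. It is a minimal pure-Python illustration: the 0.5 decision threshold is an assumption, and the quadratic pairwise loop is meant for clarity, not for large datasets.

```python
from typing import Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of responses whose thresholded prediction matches the label."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7]
print(auc_score(labels, scores), accuracy(labels, scores))
```

A model that assigns the same score to every response obtains an AUC of 0.5 under this definition, which is the chance level that zero-shot results are compared against.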

5.4.2 Fine-tuning Performance of LoReKT (RQ2)

After obtaining the pre-trained KT model LoReKT-Base-221M, we further fine-tune it on each low-resource KT dataset. From Table 3, we observe that (1) compared with LoReKT-Base-221M, LoReKT-Ft-impt-221M significantly improves the AUC score on all three low-resource KT datasets (e.g., an improvement of 26.1% on AS2009, 24.3% on NIPS34 and 37.3% on AL2005); (2) LoReKT-Ft-impt-221M outperforms all the baseline methods on AS2009 and AL2005 in terms of AUC score. More specifically, compared with ATKT, sparseKT, and simpleKT, it improves the AUC score by 5.61%, 3.47%, and 2.16% on AL2005, respectively. It also outperforms the majority of baseline methods on NIPS34 and closely matches the top models there, such as AKT and simpleKT, with a performance gap of less than 0.5%. Notably, LoReKT-Ft-impt-221M consistently achieves strong performance across all three low-resource datasets without any additional architecture design, while AKT and simpleKT exhibit notable performance variations across these datasets. Furthermore, its standing relative to the baseline methods is slightly weaker on NIPS34 than on AS2009 and AL2005. We attribute this discrepancy to the larger size of NIPS34 relative to AS2009 and AL2005 (shown in Table 1), which suggests that the potential overfitting issue is not as prominent in NIPS34.

5.4.3 Impact of Pre-training (RQ3)

To analyze the impact of pre-training on the performance of low-resource KT datasets, we progressively reduce the number of rich-resource datasets used in the pre-training. From Figure 4, we have the following observations: (1) As the number of rich-resource KT datasets used in the pre-training stage decreases, the model’s performance on low-resource KT datasets correspondingly drops. (2) When pre-training is omitted, the model’s performance on low-resource KT datasets significantly deteriorates, particularly for the AS2009 and AL2005 datasets. For example, there is a decline in the AUC score by 3.6% in AS2009 and 2.1% in AL2005, whereas only a 1.1% decline is observed in NIPS34. This could be attributed to the fact that the data quantity in AS2009 and AL2005 is much smaller than in NIPS34, leading to a more pronounced overfitting issue.

5.4.4 Impact of Importance Vector (RQ4)

We conduct experiments to further investigate the effectiveness of our proposed fine-tuning strategy with the importance vector on low-resource KT datasets. Comparing the results of LoReKT-Ft-221M and LoReKT-Ft-impt-221M in Table 3, we observe that the proposed strategy enhances performance across all low-resource datasets. It leads to AUC improvements of 0.66% on AL2005 and 0.79% on AS2009, both larger than the 0.33% improvement observed on NIPS34. This observation suggests that the fine-tuning strategy with the importance vector is more effective when the dataset size is insufficient (as shown in Table 1, NIPS34 has a higher interaction count than AS2009 and AL2005). We believe that this effectiveness is attributable to the importance vector, which restricts the update of unimportant parameters and thereby mitigates the risk of overfitting.
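The intuition that restricting updates of unimportant parameters limits overfitting can be illustrated with a toy gradient step. The sketch below is an assumption-laden illustration, not the update rule of Section 4.2.1: it assumes one importance score in [0, 1] per parameter and simply scales each gradient by that score, and the function name is hypothetical.

```python
from typing import List, Sequence


def importance_scaled_step(params: Sequence[float], grads: Sequence[float],
                           importance: Sequence[float], lr: float = 1e-3) -> List[float]:
    """One SGD-style step in which each gradient is scaled by its importance score.

    Assumption: importance[i] lies in [0, 1]; a score of 1 gives a full update,
    a score of 0 freezes the parameter at its pre-trained value.
    """
    if not (len(params) == len(grads) == len(importance)):
        raise ValueError("params, grads and importance must have the same length")
    if any(not 0.0 <= imp <= 1.0 for imp in importance):
        raise ValueError("importance scores must lie in [0, 1]")
    return [p - lr * imp * g for p, g, imp in zip(params, grads, importance)]


# The first parameter is fully important, the second is treated as unimportant.
new_params = importance_scaled_step([1.0, 1.0], [0.5, 0.5], [1.0, 0.0], lr=0.1)
print(new_params)
```

In this toy setting the unimportant parameter keeps its pre-trained value, so only the important part of the model can adapt to the small fine-tuning dataset.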

Method	AUC			Accuracy
Method	BD2006	XES3G5M	EdNet	BD2006	XES3G5M	EdNet
LoReKT-Base-221M	0.8183	0.8192	0.7672	0.8596	0.8271	0.7250
w/o data type & dataset embedding	0.8145	0.8139	0.7594	0.8541	0.8239	0.7185
w/o data type embedding	0.8161	0.8172	0.7646	0.8575	0.8268	0.7243
w/o dataset embedding	0.8173	0.8168	0.7625	0.8569	0.8268	0.7246

Table 5: The impact of data type and dataset embedding in LoReKT-Base in terms of AUC and Accuracy performance. “w/o” means excluding such module from LoReKT-Base. The best result is indicated in bold, while the second best result is underlined.

5.4.5 Impact of Data Type and Dataset Embeddings (RQ5)

We also analyze the impact of the data type and dataset embeddings introduced in Section 4.1. We report the AUC and Accuracy results in Table 5. Comparing the results of “w/o data type & dataset embedding” and “w/o data type embedding” shows that the proposed dataset embedding enhances the model’s understanding of the corresponding dataset’s prediction paradigm. This effect is particularly pronounced for larger datasets: the AUC improvement of 0.52% on EdNet surpasses the improvements of 0.33% on XES3G5M and 0.16% on BD2006. We attribute this to the fact that the dataset embedding is better learned when the corresponding dataset is sufficiently large (as indicated in Table 1). Comparing “w/o data type & dataset embedding” and “w/o dataset embedding” reveals that, across all three datasets, the proposed data type embedding plays an important role in enabling the model to incorporate information from both questions and concepts (e.g., an AUC improvement of 0.28%, 0.29% and 0.31% on BD2006, XES3G5M, and EdNet, respectively).
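The improvements quoted in this ablation can be recomputed directly from the AUC columns of Table 5. The short sketch below does so, reading each reported percentage as an absolute AUC difference multiplied by 100; the dictionary and function names are hypothetical.

```python
# AUC values copied from Table 5.
TABLE5_AUC = {
    "w/o data type & dataset embedding": {"BD2006": 0.8145, "XES3G5M": 0.8139, "EdNet": 0.7594},
    "w/o data type embedding": {"BD2006": 0.8161, "XES3G5M": 0.8172, "EdNet": 0.7646},
    "w/o dataset embedding": {"BD2006": 0.8173, "XES3G5M": 0.8168, "EdNet": 0.7625},
}


def gain_in_points(with_module: dict, without_module: dict) -> dict:
    """Absolute AUC difference per dataset, expressed in points (difference x 100)."""
    return {name: round((with_module[name] - without_module[name]) * 100, 2)
            for name in with_module}


neither = TABLE5_AUC["w/o data type & dataset embedding"]

# "w/o data type embedding" still contains the dataset embedding.
dataset_gain = gain_in_points(TABLE5_AUC["w/o data type embedding"], neither)
# "w/o dataset embedding" still contains the data type embedding.
data_type_gain = gain_in_points(TABLE5_AUC["w/o dataset embedding"], neither)

print(dataset_gain)    # {'BD2006': 0.16, 'XES3G5M': 0.33, 'EdNet': 0.52}
print(data_type_gain)  # {'BD2006': 0.28, 'XES3G5M': 0.29, 'EdNet': 0.31}
```

The recomputed values match the figures reported in the text, which confirms that they are absolute differences in AUC rather than relative changes.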

6 Conclusion

In this paper, we focus on improving the performance of DLKT models on low-resource KT datasets. To address this problem, we propose a framework called LoReKT based on a stack of transformer decoders. LoReKT consists of two stages, pre-training and fine-tuning, and does not require a sophisticated architecture tailored to a specific dataset. In the pre-training stage, we establish a robust pre-trained KT model based on several rich-resource KT datasets. Subsequently, we leverage a fine-tuning strategy with an importance mechanism to adapt the pre-trained model effectively to a specific low-resource KT dataset. Extensive quantitative and qualitative experimental results on six real-world datasets demonstrate the superior performance of LoReKT against a wide range of recently proposed DLKT models in terms of AUC and Accuracy.

CRediT authorship contribution statement

Hengyuan Zhang: Methodology, Conceptualization, Investigation, Writing.

Zitao Liu: Writing - review & editing, Supervision.

Shuyan Huang: Writing - review & editing, Supervision.

Chenming Shang: Investigation, Methodology, Writing.

Bojun Zhan: Investigation, Writing.

Yong Jiang: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Zhang and Yao [2018] K. Zhang, Y. Yao, A three learning states bayesian knowledge tracing model, Knowledge-Based Systems 148 (2018) 189–201.
Su et al. [2021] Y. Su, Z. Cheng, P. Luo, J. Wu, L. Zhang, Q. Liu, S. Wang, Time-and-concept enhanced deep multidimensional item response theory for interpretable knowledge tracing, Knowledge-Based Systems 218 (2021) 106819.
Song et al. [2022] X. Song, J. Li, Q. Lei, W. Zhao, Y. Chen, A. Mian, Bi-clkt: Bi-graph contrastive learning based knowledge tracing, Knowledge-Based Systems 241 (2022) 108274.
Ke et al. [2024] F. Ke, W. Wang, W. Tan, L. Du, Y. Jin, Y. Huang, H. Yin, Hitskt: A hierarchical transformer model for session-aware knowledge tracing, Knowledge-Based Systems 284 (2024) 111300.
Liu et al. [2019] Q. Liu, Z. Huang, Y. Yin, E. Chen, H. Xiong, Y. Su, G. Hu, EKT: Exercise-aware knowledge tracing for student performance prediction, IEEE Transactions on Knowledge and Data Engineering 33 (2019) 100–115.
Wu et al. [2020] Z. Wu, M. Li, Y. Tang, Q. Liang, Exercise recommendation based on knowledge concept prediction, Knowledge-Based Systems 210 (2020) 106481.
Zhang et al. [2017] J. Zhang, X. Shi, I. King, D. Y. Yeung, Dynamic key-value memory networks for knowledge tracing, in: Proceedings of the 26th International Conference on World Wide Web, 2017, p. 765.
Nagatani et al. [2019] K. Nagatani, Q. Zhang, M. Sato, Y.-Y. Chen, F. Chen, T. Ohkuma, Augmenting knowledge tracing by considering forgetting behavior, in: The World Wide Web Conference, 2019, pp. 3101–3107.
Sonkar et al. [2020] S. Sonkar, A. E. Waters, A. S. Lan, P. J. Grimaldi, R. G. Baraniuk, qDKT: Question-centric deep knowledge tracing, in: Proceedings of The 13th International Conference on Educational Data Mining, 2020, pp. 677–681.
Guo et al. [2021] X. Guo, Z. Huang, J. Gao, M. Shang, M. Shu, J. Sun, Enhancing knowledge tracing via adversarial training, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 367–375.
Zoph et al. [2016] B. Zoph, D. Yuret, J. May, K. Knight, Transfer learning for low-resource neural machine translation, arXiv preprint arXiv:1604.02201 (2016).
Tu et al. [2019] T. Tu, Y.-J. Chen, C.-c. Yeh, H.-Y. Lee, End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning, arXiv preprint arXiv:1904.06508 (2019).
Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems (2020) 1877–1901.
Gira et al. [2022] M. Gira, R. Zhang, K. Lee, Debiasing pre-trained language models via efficient fine-tuning, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 59–69.
Liu et al. [2022] Z. Liu, Q. Liu, J. Chen, S. Huang, J. Tang, W. Luo, pyKT: A python library to benchmark deep learning based knowledge tracing models, in: Thirty-sixth Conference on Neural Information Processing Systems, 2022.
Piech et al. [2015] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, J. Sohl-Dickstein, Deep knowledge tracing, Advances in neural information processing systems 28 (2015).
Yeung and Yeung [2018] C.-K. Yeung, D.-Y. Yeung, Addressing two problems in deep knowledge tracing via prediction-consistent regularization, in: Proceedings of the Fifth Annual ACM Conference on Learning at Scale, 2018, pp. 1–10.
Lee and Yeung [2019] J. Lee, D.-Y. Yeung, Knowledge query network for knowledge tracing: How knowledge interacts with skills, in: Proceedings of the 9th International Conference on Learning Analytics & Knowledge, 2019, pp. 491–500.
Pandey and Karypis [2019] S. Pandey, G. Karypis, A self-attentive model for knowledge tracing, in: 12th International Conference on Educational Data Mining, International Educational Data Mining Society, 2019, pp. 384–389.
Ghosh et al. [2020] A. Ghosh, N. Heffernan, A. S. Lan, Context-aware attentive knowledge tracing, in: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020.
Tian et al. [2021] Z. Tian, G. Zheng, B. Flanagan, J. Mi, H. Ogata, BEKT: Deep knowledge tracing with bidirectional encoder representations from transformers, in: Proceedings of the 29th International Conference on Computers in Education, 2021.
Song et al. [2022] X. Song, J. Li, T. Cai, S. Yang, T. Yang, C. Liu, A survey on deep learning based knowledge tracing, Knowledge-Based Systems 258 (2022) 110036.
Ma et al. [2022] Y. Ma, P. Han, H. Qiao, C. Cui, Y. Yin, D. Yu, Spakt: A self-supervised pre-training method for knowledge tracing, IEEE Access 10 (2022) 72145–72154.
Liu et al. [2023] Z. Liu, Q. Liu, J. Chen, S. Huang, W. Luo, simpleKT: A simple but tough-to-beat baseline for knowledge tracing, in: International Conference on Learning Representations, 2023.
Murray et al. [2013] R. C. Murray, S. Ritter, T. Nixon, R. Schwiebert, R. G. Hausmann, B. Towle, S. E. Fancsali, A. Vuong, Revealing the learning in learning curves, in: International Conference on Artificial Intelligence in Education, Springer, 2013, pp. 473–482.
Nakagawa et al. [2019] H. Nakagawa, Y. Iwasawa, Y. Matsuo, Graph-based knowledge tracing: modeling student proficiency using graph neural network, in: 2019 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE, 2019, pp. 156–163.
Yosinski et al. [2014] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, Advances in neural information processing systems 27 (2014).
Shang et al. [2019] J. Shang, T. Ma, C. Xiao, J. Sun, Pre-training of graph augmented transformers for medication recommendation, arXiv preprint arXiv:1906.00346 (2019).
Bansal et al. [2019] S. Bansal, H. Kamper, K. Livescu, A. Lopez, S. Goldwater, Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, in: 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 58–68.
Zhang et al. [2021] T. Zhang, C. Xia, P. S. Yu, Z. Liu, S. Zhao, Pdaln: Progressive domain adaptation over a pre-trained model for low-resource cross-domain named entity recognition, in: EMNLP, 2021.
Liu et al. [2022] Z. Liu, Y. Xu, Y. Xu, Q. Qian, H. Li, X. Ji, A. Chan, R. Jin, Improved fine-tuning by better leveraging pre-training data, Advances in Neural Information Processing Systems 35 (2022) 32568–32581.
Chi et al. [2023] Z. Chi, H. Huang, L. Liu, Y. Bai, X. Gao, X.-L. Mao, Can pretrained english language models benefit non-english nlp systems in low-resource scenarios?, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023) 1–14. doi:10.1109/TASLP.2023.3267618.
Zhang et al. [2023] H. Zhang, D. Li, Y. Li, C. Shang, C. Shi, Y. Jiang, Assisting language learners: Automated trans-lingual definition generation via contrastive prompt learning, in: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023, pp. 260–274.
Yeung and Yeung [2018] C.-K. Yeung, D.-Y. Yeung, Addressing two problems in deep knowledge tracing via prediction-consistent regularization, in: Proceedings of the Fifth Annual ACM Conference on Learning at Scale, 2018, pp. 1–10.
Nagatani et al. [2019] K. Nagatani, Q. Zhang, M. Sato, Y.-Y. Chen, F. Chen, T. Ohkuma, Augmenting knowledge tracing by considering forgetting behavior, in: The World Wide Web Conference, 2019, pp. 3101–3107.
Choi et al. [2020] Y. Choi, Y. Lee, J. Cho, J. Baek, B. Kim, Y. Cha, D. Shin, C. Bae, J. Heo, Towards an appropriate query, key, and value computation for knowledge tracing, in: Proceedings of the Seventh ACM Conference on Learning@Scale, 2020, pp. 341–344.
Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186.
Radford et al. [2018] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
Raffel et al. [2020] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
Lester et al. [2021] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-efficient prompt tuning, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 3045–3059. URL: https://aclanthology.org/2021.emnlp-main.243. doi:10.18653/v1/2021.emnlp-main.243.
Sanh et al. [2022] V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, A. M. Rush, Multitask prompted training enables zero-shot task generalization, in: International Conference on Learning Representations, 2022.
Chung et al. [2022] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, CoRR (2022).
Michel et al. [2019] P. Michel, O. Levy, G. Neubig, Are sixteen heads really better than one?, Advances in neural information processing systems 32 (2019).
Ke et al. [2022] Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, B. Liu, Continual pre-training of language models, in: The Eleventh International Conference on Learning Representations, 2022.
Kong et al. [2022] C. Kong, Y. Chen, H. Zhang, L. Yang, E. Yang, Multitasking framework for unsupervised simple definition generation, arXiv preprint arXiv:2203.12926 (2022).
Li et al. [2023] D. Li, H. Zhang, Y. Li, S. Yang, Multi-level contrastive learning for script-based character understanding, arXiv preprint arXiv:2310.13231 (2023).
Stamper et al. [2010] J. Stamper, A. Niculescu-Mizil, S. Ritter, G. Gordon, K. Koedinger, Challenge data set from kdd cup 2010 educational data mining challenge (2010).
Liu et al. [2023] Z. Liu, Q. Liu, T. Guo, J. Chen, S. Huang, X. Zhao, J. Tang, W. Luo, J. Weng, Xes3g5m: A knowledge tracing benchmark dataset with auxiliary information, in: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
Choi et al. [2020] Y. Choi, Y. Lee, D. Shin, J. Cho, S. Park, S. Lee, J. Baek, C. Bae, B. Kim, J. Heo, Ednet: A large-scale hierarchical dataset in education, in: Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Proceedings, Part II 21, Springer, 2020, pp. 69–73.
Feng et al. [2009] M. Feng, N. Heffernan, K. Koedinger, Addressing the assessment challenge with an online system that tutors as it assesses, User modeling and user-adapted interaction 19 (2009) 243–266.
Wang et al. [2020] Z. Wang, A. Lamb, E. Saveliev, P. Cameron, Y. Zaykov, J. M. Hernández-Lobato, R. E. Turner, R. G. Baraniuk, C. Barton, S. P. Jones, et al., Instructions and guide for diagnostic questions: The neurips 2020 education challenge, arXiv preprint arXiv:2007.12061 (2020).
Kingma and Ba [2015] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2015.
Shen et al. [2021] S. Shen, Q. Liu, E. Chen, Z. Huang, W. Huang, Y. Yin, Y. Su, S. Wang, Learning process-consistent knowledge tracing, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 1452–1460.
Liu et al. [2023] Z. Liu, Q. Liu, J. Chen, S. Huang, B. Gao, W. Luo, J. Weng, Enhancing deep knowledge tracing with auxiliary tasks, in: Proceedings of the 2023 World Wide Web Conference, 2023.
Abdelrahman and Wang [2019] G. Abdelrahman, Q. Wang, Knowledge tracing with sequential key-value memory networks, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 175–184.
Yeung [2019] C.-K. Yeung, Deep-IRT: Make deep learning based knowledge tracing explainable using item response theory, arXiv preprint arXiv:1904.11738 (2019).
Huang et al. [2023] S. Huang, Z. Liu, X. Zhao, W. Luo, J. Weng, Towards robust knowledge tracing models via k-sparse attention, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2441–2445.
Wang et al. [2021] C. Wang, W. Ma, M. Zhang, C. Lv, F. Wan, H. Lin, T. Tang, Y. Liu, S. Ma, Temporal cross-effects in knowledge tracing, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 517–525.
Shang et al. [2024] C. Shang, H. Zhang, H. Wen, Y. Yang, Understanding multimodal deep neural networks: A concept selection view, arXiv preprint arXiv:2404.08964 (2024).
Zhang et al. [2022] H. Zhang, D. Li, S. Yang, Y. Li, Fine-grained contrastive learning for definition generation, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 2022, pp. 1001–1012.
Kong et al. [2022] C. Kong, Y. Wang, R. Chong, L. Yang, H. Zhang, E. Yang, Y. Huang, Blcu-icall at semeval-2022 task 1: Cross-attention multitasking framework for definition modeling, arXiv preprint arXiv:2204.07701 (2022).