Improving Low-Resource Knowledge Tracing Tasks by Supervised Pre-training and Importance Mechanism Fine-tuning

Hengyuan Zhang, Zitao Liu, Shuyan Huang, Chenming Shang, Bojun Zhan, Yong Jiang
Abstract

Knowledge tracing (KT) aims to estimate students’ knowledge mastery based on their historical interactions. Recently, deep learning based KT (DLKT) approaches have achieved impressive performance on the KT task. These DLKT models rely heavily on a large number of available student interactions. However, for reasons such as budget constraints and privacy concerns, observed interactions are very limited in many real-world scenarios, i.e., low-resource KT datasets. Directly training a DLKT model on a low-resource KT dataset may lead to overfitting, and it is difficult to choose an appropriate deep neural architecture. Therefore, in this paper, we propose a low-resource KT framework called LoReKT to address the above challenges. Inspired by the prevalent “pre-training and fine-tuning” paradigm, we aim to learn transferable parameters and representations from rich-resource KT datasets during the pre-training stage and subsequently facilitate effective adaptation to low-resource KT datasets. Specifically, we simplify existing sophisticated DLKT model architectures to a plain stack of transformer decoders. We design an encoding mechanism to incorporate student interactions from multiple KT data sources and develop an importance mechanism that prioritizes updating parameters with high importance while constraining less important ones during the fine-tuning stage. We evaluate LoReKT on six public KT datasets, and the experimental results demonstrate the superiority of our approach in terms of AUC and accuracy. To encourage reproducible research, we make our data and code publicly available at https://anonymous.4open.science/r/LoReKT-C619.

Keywords: Educational data mining, Knowledge tracing, Pre-training and fine-tuning, Importance mechanism

Affiliations:
[a] Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
[b] Guangdong Institute of Smart Education, Jinan University, Guangzhou 510610, China
[c] TAL Education Group, Beijing 100080, China

1 Introduction

Knowledge tracing plays a pivotal role in Intelligent Tutoring Systems (ITS) [1, 2, 3, 4]. Its primary objective is to forecast students’ performance on questions by estimating their mastery of individual knowledge components (KCs; a KC is a generalization of everyday terms like concept, principle, or skill) through an analysis of their past interactions. A KC is a description of a mental structure or process that a learner uses, alone or in combination with other KCs, to accomplish steps in a task or a problem. Take Figure 1 as an example. The student has successively responded to three questions ($Q_1$ to $Q_3$), answering $Q_1$ and $Q_3$ correctly and $Q_2$ incorrectly. This pattern suggests that the student may have a proficient understanding of the “Addition”, “Subtraction”, and “Multiplication” KCs, but lacks familiarity with the “Modulo” and “Division” KCs. Leveraging the current knowledge mastery, the KT task aims to predict the student’s performance on the upcoming fourth question, $Q_4$. After gaining insights into students’ knowledge mastery through KT, educators can promptly pinpoint weaknesses and provide targeted exercises for improvement. Additionally, this information can help online learning platforms provide adaptive learning services such as learning resource recommendations, customized student learning paths, and personalized teaching plans [5, 6].

Recently, with the remarkable progress of deep learning techniques, many studies have developed deep learning based KT (DLKT) models that are trained on large volumes of students’ historical interactions to pursue high accuracy in estimating students’ knowledge mastery. Accordingly, many publicly available educational datasets have been released for training effective DLKT models.

Figure 1: An illustration of the KT problem.

However, because educational applications must protect users’ privacy and students differ in learning energy and enthusiasm, it is extremely difficult to collect large-scale, high-quality student interaction sequences from real-world educational environments. Therefore, the educational datasets used for KT model training frequently contain limited student learning records. For example, ASSISTments2009 is one of the classical KT datasets, and its observed interaction records are collected from only 4,217 students. Meanwhile, most state-of-the-art DLKT models are built from stacks of neural networks such as recurrent neural networks and memory networks [7, 8, 9, 10]. Directly training a DLKT model on such a low-resource KT dataset can easily lead to overfitting. Furthermore, previous KT works leave it unclear which types of model architecture are most suitable for low-resource KT datasets.

To enhance the ability of deep learning-based models to learn from low-resource datasets, some studies adopt the “pre-training and fine-tuning” paradigm [11, 12, 13, 14]. This paradigm first pre-trains a model on rich-resource datasets and then transfers the learned parameters to the low-resource dataset. Motivated by these promising studies, we propose a simple yet effective framework called LoReKT. LoReKT aims to improve performance on a low-resource KT dataset by transferring the knowledge tracing capability of a model pre-trained on multiple rich-resource KT datasets. More specifically, in the pre-training stage, we build a foundational pre-trained KT model using a stack of transformer decoders trained on multiple rich-resource KT datasets. To enhance the model’s capacity for integrating information from both questions and concepts, we introduce data type embeddings. Furthermore, to enable the model to learn the distinct and shared tracing patterns across multiple KT datasets, we introduce a dedicated dataset embedding for each KT dataset. In the fine-tuning stage, we propose an importance vector-based fine-tuning strategy that allows the model to focus on updating the parameters that are crucial for the specific target low-resource dataset while constraining unimportant parameters to prevent the learning and memorization of noisy information.

As a result, the implementation of LoReKT has the following merits:

  1. The framework avoids direct training by initially pre-training the model on rich-resource KT datasets and subsequently fine-tuning it on specific low-resource KT datasets. This approach mitigates the risk of overfitting.

  2. The framework leverages a stack of transformer decoders as its backbone, which has demonstrated excellent performance in various “pre-training and fine-tuning” scenarios. Moreover, this backbone requires no additional architecture design efforts, simplifying the process and reducing the reliance on specific model architectures for particular datasets.

  3. To further mitigate the overfitting issue, the framework employs a fine-tuning strategy based on an importance mechanism, restricting the learning of less important parameters to prevent the memorization of noisy information.

  4. To ensure that our approach can be fairly compared with other recently developed DLKT models, we follow a publicly available standardized KT task evaluation protocol [15]. We conduct comprehensive and rigorous experiments on three public rich-resource datasets and three low-resource datasets. The results show that after pre-training, the pre-trained KT model comes close to the performance of previous approaches on rich-resource datasets. By fine-tuning the pre-trained KT model on the low-resource datasets, it achieves superior prediction performance in terms of AUC and Accuracy compared to 17 baselines.

2 Related Work

2.1 Deep Learning based Knowledge Tracing

Deep Knowledge Tracing (DKT) pioneered the application of deep learning to knowledge tracing by employing a Long Short-Term Memory (LSTM) layer to encapsulate students’ knowledge states and predict their response performance [16]. Since then, many methods have used deep learning techniques to solve the KT problem [1, 2, 3, 4, 7, 17, 8, 18, 19, 20, 21, 22, 10, 23, 24]. For example, Yeung and Yeung [17] leveraged a prediction-consistent regularization mechanism to mitigate input reconstruction failure and prediction inconsistency in DKT [16]. Zhang et al. [7] used a static key matrix to store the interconnections among different knowledge components (KCs) and a dynamic value matrix to iteratively update students’ knowledge states. Motivated by learning curve theory [25], Nagatani et al. [8] took students’ forgetting behavior into consideration to enhance DKT [16]. Lee and Yeung [18] used a student knowledge state encoder and a skill encoder to predict student response performance via the dot product. Tiana et al. [21] performed multi-task learning based on bidirectional encoder representations to construct mixed representations of questions. To mitigate the limited generalization of DLKT models, adversarial training techniques, such as adversarial perturbations, have been applied to the original student interaction sequence. Specifically, Guo et al. [10] improved the generalization capability of the DLKT model by adding carefully designed adversarial perturbations at the embedding level of the student interaction sequence, which helps the model generalize across diverse student interactions. Moreover, certain studies have concentrated on the interactions between student responses and questions, as well as the associations between questions and KCs. For example, Nakagawa et al. [26] constructed a question-concept knowledge graph and used a graph neural network to aggregate the node features related to the corresponding concepts and update the student’s knowledge states. Additionally, Ma et al. [23] employed a self-supervised learning paradigm to identify latent relationships between questions and KCs, thereby enhancing input representations. Another research direction focuses on the interdependence among student interactions, aiming to capture the finer details embedded within them. For instance, Pandey and Karypis [19] used a self-attention mechanism to capture the relationships between exercises and students’ responses. Ghosh et al. [20] presented AKT, which uses two self-attention modules to extract the inner relevance of questions and interactions respectively, and explicitly models students’ forgetting behavior via a monotonic attention mechanism.

In this paper, unlike the aforementioned DLKT methods that are committed to developing a series of sophisticated architectures, our LoReKT is based on a stack of simple transformer decoders. This unified backbone aims to break down disciplinary barriers and learn consistent representations across multiple KT datasets.

2.2 Pre-training for Low-resource Setting

In real-world scenarios, low-resource settings are commonplace, and the “pre-training and fine-tuning” paradigm has consistently proven effective in such contexts [27, 28, 29, 30, 31, 32, 33]. For example, Yosinski et al. [27] explored the transferability of AlexNet and observed that its first three layers capture general features that transfer well. Fine-tuning the network mitigated data variability and scarcity and consequently improved overall performance. Bansal et al. [29] introduced a straightforward method to improve direct speech-to-text translation (ST) when the source language is low-resource: the model is first pre-trained on a high-resource automatic speech recognition (ASR) task and then fine-tuned specifically for ST. Zhang et al. [30] proposed an adaptive data augmentation fine-tuning technique to transfer Named Entity Recognition (NER) knowledge efficiently from resource-rich domains to low-resource target domains. Liu et al. [31] used pre-training data in both the initial pre-training and the subsequent fine-tuning stages to improve the model’s performance on low-resource datasets. Chi et al. [32] investigated improving performance on low-resource non-English languages by incorporating pre-trained language models that are primarily trained on English, thereby leveraging the knowledge embedded in English-dominant models. To generate high-quality definitions for low-resource languages, Zhang et al. [33] used a multilingual pre-trained model as the backbone and employed a prompt contrastive fine-tuning approach to enhance the model’s capabilities in this linguistic context.

In this paper, we adhere to the “pre-training and fine-tuning” paradigm rather than directly training a DLKT model on low-resource KT datasets. This choice is made to mitigate potential overfitting and to improve the overall performance of the model on low-resource KT datasets.

3 Problem Statement

The objective of the KT problem is to predict the probability that a student will answer an arbitrary question $q_*$ correctly based on the student’s historical interaction data. Specifically, suppose a student’s chronologically ordered collection of $T$ past interactions is denoted as $\mathbf{S}=\{\mathbf{s}_j\}_{j=1}^{T}$. Each student interaction $\mathbf{s}_j$ is represented as an ordered 4-tuple, i.e., $\mathbf{s}_j = \langle q_j, \{c \mid c \in \mathcal{N}_{q_j}\}_j, r_j, t_j \rangle$, where $q_j$, $\{c\}_j$, $r_j$ and $t_j$ represent the specific question, the associated KC set, the student’s response (with $r_j \in \{0,1\}$, where 1 means the student answered correctly and 0 otherwise) and the student’s response timestamp, respectively. $\mathcal{N}_{q_j}$ is the set of KCs associated with the question $q_j$. We would like to estimate the probability $\hat{r}_*$ of the student’s future performance on an arbitrary question $q_*$.
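To make the notation concrete, the following is a minimal sketch of how one interaction $\mathbf{s}_j$ and the resulting prediction targets could be represented. The class name `Interaction`, the helper `make_prediction_pairs`, the integer timestamps, and the toy IDs are illustrative assumptions and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Interaction:
    """One student interaction s_j = <q_j, {c}_j, r_j, t_j>."""
    question_id: int            # q_j
    kc_ids: FrozenSet[int]      # N_{q_j}, the KCs attached to q_j
    response: int               # r_j: 1 = correct, 0 = incorrect
    timestamp: int              # t_j (unit is an assumption, e.g. seconds)


def make_prediction_pairs(sequence: List[Interaction]):
    """Yield (history, target) pairs: predict the target's response
    from all interactions that happened before it."""
    ordered = sorted(sequence, key=lambda s: s.timestamp)
    for t in range(1, len(ordered)):
        yield ordered[:t], ordered[t]


# Hypothetical toy sequence of three interactions.
history = [
    Interaction(1, frozenset({10}), 1, 100),
    Interaction(2, frozenset({10, 11}), 0, 160),
    Interaction(3, frozenset({12}), 1, 230),
]
pairs = list(make_prediction_pairs(history))
```

Each pair holds the chronologically ordered prefix of the sequence and the interaction whose response $r_j$ is to be predicted.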

4 The Framework

In this section, we introduce the procedures of our proposed LoReKT framework in detail: (1) obtaining a foundational pre-trained KT model through learning from rich-resource KT datasets during the pre-training stage (Section 4.1); (2) efficiently adapting the pre-trained KT model to the low-resource dataset using an importance vector in the fine-tuning stage (Section 4.2).

Figure 2: An illustration of the interaction encoding and forward procedure of our LoReKT framework.

4.1 Pre-training Stage

Our objective in the pre-training stage is to learn transferable parameters and representations from certain rich-resource datasets and build a pre-trained KT foundation model that is able to quickly adapt to low-resource KT datasets.

4.1.1 Interaction Encoding

Because the question bank is typically much larger than the set of KCs, previous research mainly used KCs for interaction encoding, treating questions with the same KCs as identical [16, 34, 35]. However, this approach leads DLKT models to overlook the unique characteristics of questions that share the same KCs, limiting interaction representation. To address this, we follow works that include both individual question features and KCs to encode interactions in a more granular manner [36, 20, 24].

Specifically, let $\mathbf{D}=\{D_i\}_{i=1}^{I}$ be the mixed students’ learning sequences, where $D_i=\{\mathbf{S}^i_1,\ldots,\mathbf{S}^i_n\}$. $\mathbf{S}^i_j$ is the $j$th learning sequence from rich-resource dataset $i$, $n$ is the number of learning sequences in $D_i$, and $I$ is the number of rich-resource KT datasets. Let $\mathcal{N}_{q_t}$ be the set of KCs associated with $q_t$. We represent question $q_t$ and its corresponding KCs as follows:

$$
\begin{aligned}
\mathbf{q}_t &= \mathbf{W}^q \cdot \mathbf{e}^q_t \\
\bar{\mathbf{c}}_t &= \frac{1}{\left|C_{q_t}\right|} \sum_{j=1}^{M} \mathbf{c}_j * \mathbb{I}(c_j \in \mathcal{N}_{q_t}) \\
\mathbf{c}_j &= \mathbf{W}^c \cdot \mathbf{e}^c_j
\end{aligned}
\tag{1}
$$

where $\mathbf{e}^q_t \in \mathbb{R}^{N\times 1}$ and $\mathbf{e}^c_j \in \mathbb{R}^{M\times 1}$ are the one-hot vectors indicating the question $q_t$ and a related KC in $\mathcal{N}_{q_t}$, respectively. $\mathbf{c}_j \in \mathbb{R}^{d\times 1}$ is the latent representation of one KC related to question $q_t$. $\mathbf{q}_t \in \mathbb{R}^{d\times 1}$ and $\bar{\mathbf{c}}_t \in \mathbb{R}^{d\times 1}$ are the latent embeddings of $q_t$ and its corresponding KCs, respectively. $\mathbf{W}^q \in \mathbb{R}^{d\times N}$ and $\mathbf{W}^c \in \mathbb{R}^{d\times M}$ are learnable linear transformations. $N$ and $M$ are the total numbers of distinct questions and KCs in our mixed dataset $\mathbf{D}$, respectively (see footnote 3). $\cdot$ is the standard matrix/vector multiplication, $\left|C_{q_t}\right|$ is the size of $\mathcal{N}_{q_t}$, and $\mathbb{I}(\cdot)$ is the indicator function.

Footnote 3: We reassign ID numbers for all questions and KCs in the dataset $\mathbf{D}$ based on the values of $N$ and $M$. In the fine-tuning stage, for a specific low-resource KT dataset, we adjust the ID numbers of its questions and KCs to start from $N$ and $M$. Additionally, we expand the size of $\mathbf{W}^q$ and $\mathbf{W}^c$ to obtain the corresponding question and KC representations, i.e., $\mathbf{q}_t$ and $\mathbf{c}_j$.
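As an illustration of Eq. (1), the sketch below treats the product of a weight matrix with a one-hot vector as a table lookup and averages the embeddings of the KCs attached to a question. It is a plain-Python sketch under stated assumptions: the sizes `d`, `N`, `M`, the random initialization, and the function names are hypothetical and are not the authors' implementation.

```python
import random

random.seed(0)
d = 4          # embedding size (illustrative value)
N, M = 5, 3    # numbers of distinct questions and KCs (illustrative values)

# Multiplying W by a one-hot vector selects one column of W, so each table
# stores one d-dimensional row per ID (the columns of W^q and W^c).
W_q = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
W_c = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]


def question_embedding(q_id):
    """q_t = W^q . e^q_t, implemented as a table lookup."""
    return list(W_q[q_id])


def mean_kc_embedding(kc_ids):
    """c_bar_t: average of the embeddings of the KCs attached to q_t."""
    if not kc_ids:
        raise ValueError("a question needs at least one KC")
    total = [0.0] * d
    for c in kc_ids:
        for k in range(d):
            total[k] += W_c[c][k]
    return [v / len(kc_ids) for v in total]


q_vec = question_embedding(2)
c_bar = mean_kc_embedding({0, 2})
```

Iterating only over the KCs in $\mathcal{N}_{q_t}$ is equivalent to summing over all $M$ KCs with the indicator function, since the indicator zeroes out every unrelated KC.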

Drawing inspiration from the powerful pre-trained model BERT [37], which leverages token type embeddings to improve the integration of various token information, we introduce data type embeddings to the KT problem. In this problem, there are two distinct data types: questions and concepts. To improve the pre-trained KT model’s ability to incorporate information from both, we introduce question and concept data type embeddings, which are directly integrated into all question and concept embeddings:

$$
\widetilde{\mathbf{q}}_t = \mathbf{q}_t \oplus \mathbf{t}_q; \quad \widetilde{\mathbf{c}}_t = \bar{\mathbf{c}}_t \oplus \mathbf{t}_c
\tag{2}
$$

where $\mathbf{t}_q \in \mathbb{R}^{d\times 1}$ and $\mathbf{t}_c \in \mathbb{R}^{d\times 1}$ are the question and concept data type embeddings, and $\oplus$ is the element-wise addition operator. $\widetilde{\mathbf{q}}_t$ and $\widetilde{\mathbf{c}}_t$ are the question and KC embeddings enriched with data type information.

Finally, we combine the embeddings of the question, its corresponding KCs, and the response to encode the interaction as $\mathbf{e}_t \in \mathbb{R}^{d\times 1}$, i.e.:

$$
\mathbf{x}_t = \widetilde{\mathbf{q}}_t \oplus \widetilde{\mathbf{c}}_t; \quad \mathbf{r}_t = \mathbf{W}^a \cdot \mathbf{a}^q_t; \quad \mathbf{e}_t = \mathbf{x}_t \oplus \mathbf{r}_t
\tag{3}
$$

where $\mathbf{x}_t$ is the question-concept (QC) embedding, $\mathbf{a}^q_t \in \mathbb{R}^{2\times 1}$ is the one-hot vector indicating whether the question $q_t$ is answered correctly, and $\mathbf{W}^a \in \mathbb{R}^{d\times 2}$ is a learnable linear transformation. The interaction encoding procedure is illustrated in Figure 2 (a).
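The composition in Eq. (2) and Eq. (3) can be sketched as a sequence of element-wise additions. The function names and the hand-picked toy vectors below are assumptions chosen for readability; in a trained model all of these vectors would be learned.

```python
def add(u, v):
    """Element-wise addition, the (+) operator in Eq. (2) and Eq. (3)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def encode_interaction(q_vec, c_bar, t_q, t_c, W_a, response):
    """Return (x_t, e_t) for one interaction.

    W_a holds two d-dimensional rows; row `response` equals W^a . a^q_t
    because a^q_t is one-hot over {incorrect, correct}.
    """
    q_tilde = add(q_vec, t_q)          # Eq. (2), question side
    c_tilde = add(c_bar, t_c)          # Eq. (2), concept side
    x_t = add(q_tilde, c_tilde)        # question-concept (QC) embedding
    r_t = W_a[response]                # response embedding
    e_t = add(x_t, r_t)                # interaction embedding
    return x_t, e_t


# Toy values with d = 2, chosen by hand (hypothetical).
x_t, e_t = encode_interaction(
    q_vec=[1.0, 0.0], c_bar=[0.0, 2.0],
    t_q=[0.5, 0.5], t_c=[-0.5, 0.5],
    W_a=[[0.0, -1.0], [0.0, 1.0]], response=1,
)
```

The QC embedding $\mathbf{x}_t$ carries no response information, whereas $\mathbf{e}_t$ differs from it only by the response embedding $\mathbf{r}_t$.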

4.1.2 Pre-training

Recently, generative pre-trained models based on the Transformer architecture have achieved promising results on various tasks compared to designing a sophisticated neural network for each specific task [38, 13, 14]. Drawing inspiration from these impressive findings, we opt to directly utilize a stack of transformer decoders. This choice enables us to dynamically capture student knowledge states without additional architecture design efforts:

$$
\begin{aligned}
\mathbf{h}^{(0)} &= \mathbf{E} \oplus \mathbf{P} \\
\mathbf{h}^{(l)} &= \mathbf{Transformer\_block}(\mathbf{h}^{(l-1)}) \quad \forall l \in [1, L]
\end{aligned}
\tag{4}
$$

where $\mathbf{E}=(\mathbf{e}_1,\ldots,\mathbf{e}_T)$ is the embedding matrix of the $T$ past interactions, $\mathbf{P}$ is the position embedding matrix, and $L$ is the number of layers. $\mathbf{h}^{(l)} \in \mathbb{R}^{T\times d}$ is the knowledge state embedding matrix of a student over the $T$ past interactions. Please note that to estimate student knowledge states from their historical interactions, we use the QC embedding $\mathbf{x}_t$ for mapping both queries and keys, and the interaction embedding $\mathbf{e}_t$ for mapping values in the self-attention mechanism.
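The query/key/value assignment described above can be sketched with a single attention head in plain Python. This is a simplified illustration under several assumptions: learned projection matrices, multiple heads, position embeddings, residual connections, layer normalization, and feed-forward sublayers are all omitted, and the strict mask (position $t$ attends only to earlier positions) is an assumption made here so that the response encoded in $\mathbf{e}_t$ cannot leak into the state used at position $t$. The function name and the toy inputs are hypothetical.

```python
import math


def strict_causal_attention(X, E):
    """Single-head attention: queries/keys from X, values from E.

    Position t attends only to positions j < t, so the output at t never
    sees the response encoded in E[t]. Position 0 has no history and
    returns a zero vector.
    """
    T, d = len(X), len(X[0])
    out = []
    for t in range(T):
        if t == 0:
            out.append([0.0] * d)
            continue
        # Scaled dot-product scores between the query x_t and keys x_j.
        scores = [
            sum(a * b for a, b in zip(X[t], X[j])) / math.sqrt(d)
            for j in range(t)
        ]
        m = max(scores)                       # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of the value vectors e_j.
        out.append([
            sum(weights[j] * E[j][k] for j in range(t))
            for k in range(d)
        ])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # QC embeddings x_1..x_3
E = [[1.0, 1.0], [0.0, 2.0], [5.0, 5.0]]   # interaction embeddings e_1..e_3
H = strict_causal_attention(X, E)
```

In this toy example the third position weights the first interaction more heavily than the second, because its QC embedding matches that of the first question.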

Furthermore, inspired by prompt learning techniques that effectively capture diverse and overlapping patterns in multi-task learning scenarios [39, 40, 41, 42], we introduce a dedicated dataset embedding for each KT dataset. Given that each KT dataset has its own prediction paradigm stemming from variations in question banks and KCs, this enhancement enables the pre-trained KT model to capture both the specific and the shared information across different KT datasets. Specifically, the knowledge state $\mathbf{h}^{(l)}$ is first concatenated with the corresponding dataset embedding $\mathbf{d}_i$ and the QC embedding $\mathbf{x}_{t+1}$, and then fed into a two-layer fully connected network followed by a Sigmoid activation function $\sigma(\cdot)$ to predict the performance of a student on the next question $q_{t+1}$:

$$
\begin{aligned}
\mathbf{d}_{i} &= \mathbf{W}^{d}\cdot\mathbf{e}^{d} \\
\mathbf{y}_{t+1} &= \mathrm{ReLU}\bigl(\mathbf{W}_{1}\cdot[\mathbf{h}_{t+1}^{(l)};\mathbf{x}_{t+1};\mathbf{d}_{i}]+\mathbf{b}_{1}\bigr) \\
\hat{r}_{t+1} &= \sigma\bigl(\mathbf{w}^{\top}\cdot\mathrm{ReLU}(\mathbf{W}_{2}\cdot\mathbf{y}_{t+1}+\mathbf{b}_{2})+b\bigr)
\end{aligned}
\tag{5}
$$

where $\mathbf{W}^{d}\in\mathbb{R}^{d\times I}$ is a learnable linear transformation (in the fine-tuning stage, we expand the size of $\mathbf{W}^{d}$ to assign a dataset embedding to the specific low-resource KT dataset), and $I$ is the total number of rich-resource KT datasets in the pre-training stage. $\mathbf{e}^{d}\in\mathbb{R}^{I\times 1}$ is the one-hot vector indicating the dataset that the current interaction belongs to. $\mathbf{W}_{1}\in\mathbb{R}^{d\times 2d}$, $\mathbf{W}_{2}\in\mathbb{R}^{d\times d}$, $\mathbf{w}\in\mathbb{R}^{d\times 1}$, $\mathbf{b}_{1}\in\mathbb{R}^{d\times 1}$, $\mathbf{b}_{2}\in\mathbb{R}^{d\times 1}$ and $b$ are trainable parameters.
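As an illustration of Eq. (5), the prediction head can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the released implementation: the helper names (`matvec`, `dataset_embedding`, `add_dataset`, `predict_next`), the toy sizes, and the random initialization are all hypothetical, and $\mathbf{W}_{1}$ is simply sized to the length of the concatenated input used here.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dataset_embedding(W_d, dataset_index):
    """d_i = W^d . e^d: a one-hot e^d selects column `dataset_index` of W^d."""
    return [row[dataset_index] for row in W_d]

def add_dataset(W_d, new_column):
    """Append one column to W^d for a new low-resource dataset.
    How the new column is initialized is an assumption of this sketch."""
    return [row + [value] for row, value in zip(W_d, new_column)]

def predict_next(h, x_next, d_i, W1, b1, W2, b2, w, b):
    """Two ReLU layers followed by a sigmoid output, following Eq. (5)."""
    z = h + x_next + d_i  # list concatenation [h; x; d_i]
    y = relu([a + c for a, c in zip(matvec(W1, z), b1)])
    u = relu([a + c for a, c in zip(matvec(W2, y), b2)])
    return sigmoid(sum(wi * ui for wi, ui in zip(w, u)) + b)

# Toy sizes (hypothetical): hidden size d = 4, I = 3 pre-training datasets.
rng = random.Random(0)
d, num_datasets = 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_d = rand_matrix(d, num_datasets)
W1 = rand_matrix(d, 3 * d)  # sized to the concatenated input in this sketch
W2 = rand_matrix(d, d)
b1, b2 = [0.0] * d, [0.0] * d
w, b = [rng.uniform(-0.5, 0.5) for _ in range(d)], 0.0

h = [rng.uniform(-1, 1) for _ in range(d)]
x_next = [rng.uniform(-1, 1) for _ in range(d)]
d_i = dataset_embedding(W_d, dataset_index=1)
r_hat = predict_next(h, x_next, d_i, W1, b1, W2, b2, w, b)

# Fine-tuning: one extra column for the low-resource dataset.
W_d_expanded = add_dataset(W_d, [0.0] * d)
```

Because $\mathbf{e}^{d}$ is one-hot, the matrix product reduces to a column lookup, which is why `dataset_embedding` never builds the one-hot vector explicitly.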
All learnable parameters in LoReKT are trained end-to-end by minimizing the binary cross-entropy loss between the predicted probability $\hat{r}_{t}$ and the ground-truth label $r_{t}$:

$$
\mathcal{L}_{\text{KT}}=-\sum_{t=1}^{T}\bigl(r_{t}\log\hat{r}_{t}+(1-r_{t})\log(1-\hat{r}_{t})\bigr)
\tag{6}
$$

The forward procedure is illustrated in Figure 2 (b).
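The loss in Eq. (6) can be written out directly for one student sequence. The following is a minimal sketch; the function name `kt_loss` and the clamping constant `eps` are assumptions added here for numerical safety and are not part of the equation.

```python
import math

def kt_loss(r, r_hat, eps=1e-12):
    """Binary cross-entropy summed over the T steps of one sequence (Eq. 6).

    r     : ground-truth responses (0 or 1)
    r_hat : predicted probabilities of a correct response
    eps   : clamp to avoid log(0); an assumption of this sketch
    """
    total = 0.0
    for r_t, p_t in zip(r, r_hat):
        p_t = min(max(p_t, eps), 1.0 - eps)
        total += r_t * math.log(p_t) + (1 - r_t) * math.log(1.0 - p_t)
    return -total

uninformed = kt_loss([1, 0], [0.5, 0.5])  # equals 2 * ln(2)
confident = kt_loss([1, 0], [0.9, 0.1])   # smaller, predictions match labels
```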

Figure 3: An illustration of the importance vector computing and applying procedure in our LoReKT framework.

4.2 Fine-tuning Stage

In low-resource scenarios, overfitting is a common problem, as some model parameters may learn and memorize noisy dataset information, thereby hindering the model’s ability to generalize. This problem is especially severe in the low-resource KT setting. To alleviate it, we introduce a novel importance vector-based fine-tuning strategy that encourages the model to focus on updating the important parameters while constraining the less important ones.

4.2.1 Computing Importance Vector of Layer

The backbone of LoReKT is a stack of transformer decoders. The key components of a transformer decoder are the multi-head attention layer, the intermediate layer, and the output layer (in this paper, we use “layer” or $l$ to indicate any of these three layers, because the procedure for computing their importance vectors is similar). It has been found that not all units (neurons or attention heads) in a specific layer are important [43]. Therefore, before directly fine-tuning the model on each low-resource KT dataset, we adopt the approach described by Ke et al. [44] to compute the importance vector for each layer. This is achieved with a gradient-based importance detection method, which is applied separately to each low-resource KT dataset:

$$
\begin{aligned}
\hat{\mathbf{o}}_{l} &= \mathbf{g}_{l}\odot\mathbf{o}_{l} \\
\mathbf{I}_{l} &= \frac{1}{N}\sum_{n=1}^{N}\left|\frac{\partial\mathcal{L}_{\text{KT}}}{\partial\mathbf{g}_{l}}\right|
\end{aligned}
\tag{7}
$$

where $\mathbf{o}_{l}$ refers to the output of layer $l$ (which can be any of the three layers mentioned above) and $\odot$ refers to element-wise multiplication. $\mathbf{g}_{l}$ serves as a virtual parameter that has the same dimensions as $\mathbf{o}_{l}$, with each of its elements initialized to 1. A unit whose virtual parameter receives a larger gradient magnitude is considered more important, as it has a larger impact on the loss. Therefore, the gradient of each parameter $g_{l,j}$ in $\mathbf{g}_{l}$ can be regarded as the importance of unit $j$ in layer $l$. $\mathbf{I}_{l}$ is the importance vector of layer $l$, which has the same size as $\mathbf{g}_{l}$, $N$ is the number of samples in the current low-resource KT dataset, and $\mathcal{L}_{\text{KT}}$ is the loss defined in Eq. (6). Note that each low-resource KT dataset has its own $\mathbf{I}_{l}$ for layer $l$. The $\mathcal{L}_{\text{KT}}$ loss for each low-resource KT dataset is computed based on the zero-shot performance of the pre-trained KT model (as obtained in Section 4.1).
$\mathbf{g}_{l}$ remains unchanged during the computation: we only need its gradient $\nabla\mathbf{g}_{l}$ (the term within $|\cdot|$ in Eq. (7)), averaged over all samples in the low-resource KT dataset, and we do not use this gradient to update $\mathbf{g}_{l}$. The computation of the importance vector of layer $l$ is illustrated in Figure 3 (c).
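To make Eq. (7) concrete, the following sketch computes an importance vector for a deliberately tiny stand-in model: a single gated layer output followed by a linear read-out and a sigmoid. The toy model, the names `gate_gradient` and `importance_vector`, and the sample values are assumptions made for illustration; in LoReKT the gradient would be obtained by backpropagation through the full pre-trained network. For this toy model the gradient has a closed form, $\partial\mathcal{L}/\partial g_{j}=(\hat{r}-r)\,w_{j}\,o_{j}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_gradient(o, r, w, b):
    """Gradient of the BCE loss w.r.t. the virtual gate g (all ones), one sample.

    Toy model: r_hat = sigmoid(w . (g * o) + b), so dL/dg_j = (r_hat - r) * w_j * o_j.
    """
    g = [1.0] * len(o)                         # virtual parameter, never updated
    o_hat = [gj * oj for gj, oj in zip(g, o)]  # gated layer output
    r_hat = sigmoid(sum(wj * oj for wj, oj in zip(w, o_hat)) + b)
    return [(r_hat - r) * wj * oj for wj, oj in zip(w, o)]

def importance_vector(samples, w, b):
    """Average of |dL/dg| over all samples, following Eq. (7) as written."""
    totals = [0.0] * len(w)
    for o, r in samples:
        for j, grad_j in enumerate(gate_gradient(o, r, w, b)):
            totals[j] += abs(grad_j)
    return [t / len(samples) for t in totals]

# Hypothetical read-out weights and (layer output, label) samples.
w, b = [2.0, 0.0, -1.0], 0.1
samples = [([0.5, 1.0, 0.2], 1), ([0.1, 0.9, 0.8], 0), ([0.7, 0.3, 0.4], 1)]
importance = importance_vector(samples, w, b)
```

In this example the second unit has a zero read-out weight, so it cannot affect the loss and receives zero importance, which matches the intuition behind the gating trick.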

4.2.2 Fine-tune with Importance Vector

After obtaining the importance vector $\mathbf{I}_{l}$ for each layer in each low-resource KT dataset using the pre-trained model, we first compute the original gradient $\nabla_{l}$ using $\mathcal{L}_{\text{KT}}$ defined in Eq. (6). Subsequently, we apply the importance vector $\mathbf{I}_{l}$ to obtain the modified gradient $\hat{\nabla}_{l}$ for updating:

$$
\hat{\nabla}_{l}=\mathbf{I}_{l}\odot\nabla_{l}
$$

Here, we expand (by copying) $\mathbf{I}_{l}$ to match the dimensions of $\nabla_{l}$ so that it applies to all associated parameters. The modified gradient $\hat{\nabla}_{l}$ is only employed in the backward pass. By regulating the gradient flow, this encourages the model to prioritize updating the parameters associated with high importance over the less important ones. The procedure of fine-tuning based on the importance vector is shown in Figure 3 (d).
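The gradient modification can be sketched as follows. This is an illustrative sketch only: it assumes a weight matrix with one row per unit, so that "expanding by copying" means reusing a unit's importance for every weight in its row, and it uses a plain gradient-descent step with a hypothetical learning rate `lr`, whereas the experiments in this paper use the Adam optimizer.

```python
def importance_scaled_step(W, grad_W, importance, lr):
    """Apply one update with the modified gradient I_l * grad_l.

    W, grad_W  : matrices as lists of rows, one row per unit of layer l
    importance : one value per unit, copied across that unit's row
    lr         : learning rate (hypothetical; plain SGD is an assumption)
    """
    updated = []
    for j, (row_w, row_g) in enumerate(zip(W, grad_W)):
        updated.append([w_jk - lr * importance[j] * g_jk
                        for w_jk, g_jk in zip(row_w, row_g)])
    return updated

W = [[1.0, 1.0], [1.0, 1.0]]
grad_W = [[0.5, 0.5], [0.5, 0.5]]
importance = [1.0, 0.0]  # unit 0 important, unit 1 not important
W_new = importance_scaled_step(W, grad_W, importance, lr=0.5)
```

With these toy values the first row moves to 0.75 while the second row stays at 1.0, showing how a unit with zero importance is frozen even though its raw gradient is non-zero.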

5 Experiment

| | AS2009 | NIPS34 | AL2005 | BD2006 | XES3G5M | EdNet |
|---|---|---|---|---|---|---|
| Resource level | Low-resource | Low-resource | Low-resource | Rich-resource | Rich-resource | Rich-resource |
| # of Ques. | 26,688 | 948 | 210,710 | 207,856 | 7,652 | 12,235 |
| # of KCs | 123 | 57 | 112 | 493 | 865 | 188 |
| # of Interactions | 346,860 | 1,382,727 | 809,694 | 3,679,199 | 5,549,635 | 6,533,522 |
| avg KCs | 1.1969 | 1.0148 | 1.3634 | 1.0136 | 1.1640 | 2.2611 |
| Subject | Math | Math | Math | Math | Math | Linguistics |
| Language | English | English | English | English | Chinese | English |

Table 1: Dataset statistics of the 6 datasets. “avg KCs” denotes the average number of KCs per question.

In this section, we present the details of our experimental settings and the corresponding results. We conduct a comprehensive analysis to illustrate the effectiveness of our LoReKT framework. Specifically, we aim to answer the following research questions: (RQ1) Can we build a solid pre-trained foundational model for KT? (RQ2) In low-resource scenarios, how does our proposed LoReKT framework perform compared to the state-of-the-art KT methods? (RQ3) Does pre-training truly enhance the performance of the model on low-resource KT datasets? (RQ4) In the fine-tuning stage, is it effective to focus on updating important parameters based on $\mathbf{I}_{l}$? (RQ5) How do the dataset and data type embeddings affect the pre-trained KT model?

5.1 Datasets

Since our LoReKT framework first learns transferable parameters and representations from rich-resource KT datasets and then quickly adapts to low-resource scenarios, we select three rich-resource datasets, BD2006 [47], XES3G5M [48], and EdNet [49], to establish a robust foundational pre-trained KT model. We then fine-tune the pre-trained KT model separately on each of three low-resource KT datasets: AS2009 [50], NIPS34 [51], and AL2005 [47]. The data statistics of the six selected datasets can be found in Table 1. The detailed descriptions are as follows:

Rich-resource KT Datasets
  1. Bridge2algebra2006 (BD2006; https://pslcdatashop.web.cmu.edu/KDDCup/): this dataset comes from the KDD Cup 2010 EDM Challenge and contains algebra questions answered by 13-14 year old students.

  2. XES3G5M (https://github.com/ai4ed/XES3G5M): this large-scale dataset from a Chinese online mathematics learning platform includes rich information about student learning interactions.

  3. EdNet (https://github.com/riiid/ednet): this dataset from South Korea’s Santa AI tutoring system is one of the largest KT datasets, with 130+ million student interactions in the TOEIC test.

Low-resource KT Datasets
  1. ASSISTments2009 (AS2009; https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data/skill-builder-data-2009-2010): this is one of the classical education datasets. It collects students’ responses to mathematics questions from the free online tutoring platform ASSISTments during the 2009-2010 school year.

  2. NIPS34 (https://eedi.com/projects/neurips-education-challenge): this dataset was released in the NeurIPS 2020 Education Challenge. In our work, we use Tasks 3 & 4, which contain students’ responses to multiple-choice diagnostic math questions on the Eedi platform.

  3. Algebra2005 (AL2005; https://pslcdatashop.web.cmu.edu/KDDCup/): similar to BD2006, this dataset was also released in the KDD Cup 2010 EDM Challenge.

5.2 Experimental Setting

We remove student sequences shorter than 3 attempts and truncate student interaction sequences that are longer than 200. We use 80% of student sequences for training and validation and the remaining 20% for model evaluation. We adopt the Adam optimizer [52] to train all the models. The number of training epochs is set to 200. Following existing DLKT research [15, 24, 36], we use the Area Under the Curve (AUC) as the main evaluation metric and Accuracy as the secondary evaluation metric. We use an early stopping strategy that stops optimization when the AUC score on the validation set has not improved in the latest 10 epochs. For reasons of training efficiency, it is difficult to tune many hyperparameters for models with up to about one billion parameters. Hence, we only tune the learning rate in {0.001, 0.0001} and the dropout rate in {0.1, 0.2} for a fair comparison across model sizes.
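The data preparation and stopping rule described above can be sketched as follows. Several details are assumptions of this sketch rather than statements about the released code: truncation is taken to mean keeping the first 200 interactions, the 80/20 split is done by shuffling students with a fixed seed, and all function names are hypothetical.

```python
import random

MIN_LEN = 3    # sequences with fewer attempts are removed
MAX_LEN = 200  # longer sequences are truncated

def preprocess(sequences):
    """Drop short sequences and truncate long ones (keeping the first MAX_LEN)."""
    return [seq[:MAX_LEN] for seq in sequences if len(seq) >= MIN_LEN]

def split_students(sequences, test_ratio=0.2, seed=0):
    """Hold out `test_ratio` of the student sequences for evaluation."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def early_stopping_epoch(val_aucs, patience=10):
    """Return the epoch index at which training stops.

    Training stops once the validation AUC has not improved in the latest
    `patience` epochs; otherwise it runs through all provided epochs.
    """
    best, best_epoch = float("-inf"), -1
    for epoch, auc in enumerate(val_aucs):
        if auc > best:
            best, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_aucs) - 1

raw = [[1, 0], [1, 0, 1], [1] * 250]
clean = preprocess(raw)
train_val, test = split_students([[1, 0, 1]] * 10)
stop_at = early_stopping_epoch([0.70] + [0.60] * 20)
```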

To find a suitable model size for pre-training, we explore a range of model sizes from 89M to 1.01B parameters. The corresponding architecture details are summarized in Table 2. $n_{params}$ is the total number of trainable parameters, $n_{layers}$ is the total number of layers, $d_{model}$ is the number of units in each bottleneck layer (the size of the feed-forward layer is denoted as $d_{ff}$), and $n_{head}$ is the number of attention heads.

| Model Name | $n_{params}$ | $n_{layers}$ | $d_{model}$ | $n_{head}$ | $d_{ff}$ |
|---|---|---|---|---|---|
| LoReKT-Base-89M | 89M | 4 | 256 | 8 | 256 |
| LoReKT-Base-221M | 221M | 24 | 512 | 16 | 1024 |
| LoReKT-Base-478M | 478M | 24 | 1024 | 16 | 1024 |
| LoReKT-Base-1.01B | 1.01B | 32 | 1536 | 24 | 2560 |

Table 2: The model sizes and associated architecture details of LoReKT-Base.

5.3 Baselines

To evaluate LoReKT comprehensively, we compare it with 17 baseline methods, which we group into four categories:

Deep Sequential KT Models

Deep sequential KT models use an auto-regressive framework to dynamically track students’ knowledge states. Representative deep sequential KT models include:

  1. DKT [16]: it uses an LSTM layer to model students’ learning processes.

  2. DKT+ [17]: it improves the original DKT model by addressing the reconstruction and inconsistency issues.

  3. DKT-F [8]: it enhances the original DKT by considering students’ forgetting behaviors.

  4. KQN [18]: it is a recurrent neural network (RNN) based architecture that extracts the relation representations between students’ learning abilities and KCs to predict their performance.

  5. LPKT [53]: it designs a learning cell to model the students’ learning processes to estimate their knowledge states.

  6. AT-DKT [54]: it proposes two auxiliary learning tasks, a question tagging prediction task and an individualized prior knowledge prediction task, to improve the prediction performance of DKT.

Memory Augmented KT Models

Memory augmented KT models employ memory networks to capture potential relevances between KCs and student knowledge states. Representative memory augmented KT models include:

  1. DKVMN [7]: it incorporates a static matrix to store the relationships among KCs and a dynamic matrix to track the student’s knowledge state.

  2. SKVMN [55]: it is a combination of DKVMN and LSTM that uses a hop-LSTM layer to capture sequential dependencies of questions.

  3. DeepIRT [56]: it incorporates DKVMN and item response theory to enhance the interpretability of the prediction output of DKVMN.

Attention based KT Models

Attention based KT models capture dependencies between historical interactions and the next questions via the attention mechanism. Representative attention based KT models include:

  1. SAKT [19]: it uses self-attention to identify the relevance between historical interactions and KCs.

  2. SAINT [36]: it is a Transformer-based model for KT that encodes questions and responses in the encoder and decoder respectively.

  3. AKT [20]: it leverages three self-attention modules to estimate the relevance between questions and historical interactions and explicitly models students’ forgetting behavior via a monotonic attention mechanism.

  4. simpleKT [24]: it explores ordinary dot-product attention based KT models by capturing the individual differences among questions covering the same set of KCs.

  5. sparseKT [57]: it incorporates a k-selection module that only picks items with the highest attention scores to improve the robustness and generalization of attention based DLKT approaches.

Other KT Models

Other KT models that do not belong to the above categories:

  1. HawkesKT [58]: it utilizes the Hawkes process to model temporal cross-effects in student historical interactions.

  2. ATKT [10]: it applies adversarial perturbations to the student interaction sequence to enhance the generalization ability of an attention-LSTM based KT model.

  3. GKT [26]: it casts the knowledge structure as a graph and reformulates the KT task as a time series node-level classification problem in GNN.

| Method (chronological order) | Model Type | AS2009 | NIPS34 | AL2005 | BD2006 | XES3G5M | EdNet |
|---|---|---|---|---|---|---|---|
| DKT [16] | Sequential | 0.7525 | 0.7688 | 0.8159 | 0.8018 | 0.7845 | 0.6405 |
| DKVMN [7] | Memory | 0.7472 | 0.7677 | 0.8052 | 0.7999 | 0.7796 | 0.6576 |
| DKT+ [17] | Sequential | 0.7543 | 0.7698 | 0.8141 | 0.8019 | 0.7858 | 0.6454 |
| DKT-F [8] | Sequential | - | 0.7728 | 0.8146 | 0.7997 | 0.7935 | 0.6548 |
| KQN [18] | Sequential | 0.7462 | 0.7685 | 0.8010 | 0.7953 | 0.7794 | 0.6415 |
| SKVMN [55] | Memory | 0.7332 | 0.7513 | 0.7463 | 0.7287 | 0.7512 | 0.6374 |
| DeepIRT [56] | Memory | 0.7465 | 0.7673 | 0.8040 | 0.7976 | 0.7789 | 0.6387 |
| GKT [26] | Others | 0.7442 | 0.7718 | 0.8112 | 0.8041 | 0.7731 | 0.6392 |
| SAKT [19] | Attention | 0.7221 | 0.7508 | 0.7850 | 0.7748 | 0.7685 | 0.6290 |
| SAINT [36] | Attention | 0.6990 | 0.7883 | 0.7764 | 0.7758 | 0.8070 | 0.6841 |
| AKT [20] | Attention | 0.7869 | 0.8038 | 0.8324 | 0.8213 | 0.8215 | 0.7054 |
| ATKT [10] | Others | 0.7472 | 0.7664 | 0.7987 | 0.7889 | 0.7791 | 0.6490 |
| HawkesKT [58] | Others | 0.7232 | 0.7763 | 0.8199 | 0.8077 | 0.7933 | 0.7304 |
| LPKT [53] | Sequential | 0.7812 | 0.8004 | 0.8268 | 0.8056 | 0.8163 | 0.7644 |
| AT-DKT [54] | Sequential | 0.7555 | 0.7816 | 0.8246 | 0.8104 | 0.7925 | 0.6536 |
| simpleKT [24] | Attention | 0.7744 | 0.8035 | 0.8254 | 0.8160 | 0.8161 | 0.6765 |
| sparseKT [57] | Attention | 0.7739 | 0.8033 | 0.8152 | 0.8120 | 0.8165 | 0.6804 |
| LoReKT-Base-89M | Attention | 0.6041 | 0.6401 | 0.5943 | 0.8049 | 0.8145 | 0.7647 |
| LoReKT-Base-221M | Attention | 0.6228 | 0.6452 | 0.6155 | 0.8183 | 0.8192 | 0.7672 |
| LoReKT-Base-478M | Attention | 0.5957 | 0.6103 | 0.5834 | 0.8061 | 0.8164 | 0.7659 |
| LoReKT-Base-1.01B | Attention | 0.5761 | 0.5980 | 0.5745 | 0.8003 | 0.8121 | 0.7633 |
| LoReKT-Ft-impt-221M | Attention | 0.7912 | 0.8002 | 0.8425 | - | - | - |
| LoReKT-Ft-221M | Attention | 0.7833 | 0.7969 | 0.8359 | - | - | - |

Table 3: The overall performance in terms of AUC. The result of each low-resource KT dataset corresponds to a separately fine-tuned model, leading to different performance on pre-training datasets. Therefore, we use “-” to denote the results on pre-training datasets. The best result is indicated in bold, while the second best result is denoted in underline.

| Method (chronological order) | Model Type | AS2009 | NIPS34 | AL2005 | BD2006 | XES3G5M | EdNet |
|---|---|---|---|---|---|---|---|
| DKT [16] | Sequential | 0.7228 | 0.7031 | 0.8105 | 0.8554 | 0.8173 | 0.6665 |
| DKVMN [7] | Memory | 0.7196 | 0.7022 | 0.8026 | 0.8547 | 0.8155 | 0.6392 |
| DKT+ [17] | Sequential | 0.7243 | 0.7046 | 0.8085 | 0.8554 | 0.8179 | 0.6668 |
| DKT-F [8] | Sequential | - | 0.7076 | 0.8092 | 0.8540 | 0.8209 | 0.6666 |
| KQN [18] | Sequential | 0.7214 | 0.7028 | 0.8022 | 0.8540 | 0.8154 | 0.6665 |
| SKVMN [55] | Memory | 0.7156 | 0.6885 | 0.7837 | 0.8406 | 0.8071 | 0.6572 |
| DeepIRT [56] | Memory | 0.7196 | 0.7020 | 0.8029 | 0.8542 | 0.8152 | 0.6559 |
| GKT [26] | Others | 0.7179 | 0.7053 | 0.8078 | 0.8555 | 0.8139 | 0.6672 |
| SAKT [19] | Attention | 0.7031 | 0.6873 | 0.7948 | 0.8460 | 0.8121 | 0.6519 |
| SAINT [36] | Attention | 0.6977 | 0.7187 | 0.7770 | 0.8455 | 0.8177 | 0.6624 |
| AKT [20] | Attention | 0.7385 | 0.7320 | 0.8138 | 0.8594 | 0.8275 | 0.6888 |
| ATKT [10] | Others | 0.7206 | 0.7012 | 0.7989 | 0.8510 | 0.8143 | 0.6642 |
| HawkesKT [58] | Others | 0.7046 | 0.7110 | 0.8108 | 0.8563 | 0.8191 | 0.7076 |
| LPKT [53] | Sequential | 0.7355 | 0.7309 | 0.8154 | 0.8547 | 0.8264 | 0.7243 |
| AT-DKT [54] | Sequential | 0.7250 | 0.7146 | 0.8144 | 0.8560 | 0.8195 | 0.6684 |
| simpleKT [24] | Attention | 0.7320 | 0.7328 | 0.8083 | 0.8579 | 0.8240 | 0.6624 |
| sparseKT [57] | Attention | 0.7282 | 0.7322 | 0.8017 | 0.8569 | 0.8234 | 0.6643 |
| LoReKT-Base-89M | Attention | 0.6035 | 0.6003 | 0.6753 | 0.8543 | 0.8248 | 0.7243 |
| LoReKT-Base-221M | Attention | 0.6216 | 0.6008 | 0.6962 | 0.8596 | 0.8271 | 0.7250 |
| LoReKT-Base-478M | Attention | 0.5929 | 0.5821 | 0.6644 | 0.8538 | 0.8253 | 0.7212 |
| LoReKT-Base-1.01B | Attention | 0.5805 | 0.5745 | 0.6568 | 0.8501 | 0.8239 | 0.7207 |
| LoReKT-Ft-impt-221M | Attention | 0.7402 | 0.7323 | 0.8242 | - | - | - |
| LoReKT-Ft-221M | Attention | 0.7353 | 0.7275 | 0.8159 | - | - | - |

Table 4: The overall performance in terms of Accuracy. The result of each low-resource KT dataset corresponds to a separately fine-tuned model, leading to different performance on pre-training datasets. Therefore, we use “-” to denote the results on pre-training datasets. The best result is indicated in bold, while the second best result is denoted in underline.

5.4 Results

We use different variants of LoReKT to report its performance under different settings. LoReKT-Base denotes the model obtained after the pre-training stage, without any fine-tuning on a specific low-resource dataset. LoReKT-Ft-impt and LoReKT-Ft denote the models fine-tuned on a specific low-resource KT dataset with and without the importance vector, respectively.

5.4.1 Model Performance after Pre-training (RQ1)

We report the results of the main evaluation metric, i.e., AUC, in Table 3 and the results of the secondary evaluation metric, i.e., Accuracy, in Table 4. From Table 3, we make the following observations: (1) among the different model sizes of LoReKT-Base, LoReKT-Base-221M exhibits the best performance. Initially, as the model size increases, the model’s performance improves; however, it begins to decline when the model size becomes excessively large. We argue that this pattern indicates that the model first underfits and then overfits; (2) LoReKT-Base-221M demonstrates strong performance across all three pre-training datasets (BD2006, XES3G5M, and EdNet). For example, it achieves the highest AUC score on the EdNet dataset, with a substantial improvement of 12.59% over sparseKT, 9.2% over LPKT and 8.6% over AKT. It ranks second in terms of AUC on the BD2006 and XES3G5M datasets and is on par with the best model, AKT, with a performance gap within 0.5%. It is noteworthy that LoReKT-Base-221M, a single model, consistently achieves strong performance across all three datasets. In contrast, the AKT model is trained separately for each dataset, and its performance varies significantly. For instance, while AKT performs well on BD2006 and XES3G5M, its performance on EdNet is notably lower, lagging behind LoReKT-Base-221M by 8.1%. Furthermore, the architecture of LoReKT-Base-221M is more concise than that of AKT, which adds two extra modules on top of the original Transformer architecture; and (3) LoReKT-Base-221M also demonstrates zero-shot capabilities on previously unseen datasets (AS2009, NIPS34, and AL2005). For example, it achieves AUC scores of 0.6452 and 0.6228 on the NIPS34 and AS2009 datasets, which is clearly above the AUC of 0.5 expected from random guessing.
Although the zero-shot performance is still far from usable, it supports our conjecture that LoReKT-Base has good potential for transfer between different disciplines via cross-source learning. Furthermore, its zero-shot capabilities improve the accuracy of the importance vector computed in Section 4.2.1 for each low-resource KT dataset in the fine-tuning stage.

5.4.2 Fine-tuning Performance of LoReKT (RQ2)

After obtaining the pre-trained KT model LoReKT-Base-221M, we further fine-tune it on each low-resource KT dataset. From Table 3, we observe that (1) comparing LoReKT-Base-221M and LoReKT-Ft-impt-221M, the performance on all three low-resource KT datasets is significantly improved in terms of AUC score (e.g., an improvement of 26.1% in AS2009, 24.3% in NIPS34 and 37.3% in AL2005); (2) LoReKT-Ft-impt-221M outperforms all the baseline methods on the AS2009 and AL2005 datasets in terms of AUC score. More specifically, compared with ATKT, sparseKT, and simpleKT, it improves the AUC score by 5.61%, 3.47%, and 2.16% on AL2005 respectively. On NIPS34, it outperforms the majority of baseline methods and closely matches the top models such as AKT and simpleKT, with a performance gap of less than 0.5%. Notably, LoReKT-Ft-impt-221M consistently achieves strong performance across all three low-resource datasets without the need for additional architecture design, while AKT and simpleKT exhibit notable performance variations across these datasets. Furthermore, relative to the baseline methods, LoReKT-Ft-impt-221M performs slightly worse on NIPS34 than on the AS2009 and AL2005 datasets. We attribute this discrepancy to the larger dataset size of NIPS34 relative to AS2009 and AL2005 (shown in Table 1), which suggests that the potential overfitting issue is not as prominent in NIPS34.

5.4.3 Impact of Pre-training (RQ3)

To analyze the impact of pre-training on the performance of low-resource KT datasets, we progressively reduce the number of rich-resource datasets used in the pre-training. From Figure 4, we have the following observations: (1) As the number of rich-resource KT datasets used in the pre-training stage decreases, the model’s performance on low-resource KT datasets correspondingly drops. (2) When pre-training is omitted, the model’s performance on low-resource KT datasets significantly deteriorates, particularly for the AS2009 and AL2005 datasets. For example, there is a decline in the AUC score by 3.6% in AS2009 and 2.1% in AL2005, whereas only a 1.1% decline is observed in NIPS34. This could be attributed to the fact that the data quantity in AS2009 and AL2005 is much smaller than in NIPS34, leading to a more pronounced overfitting issue.

Figure 4: The impact of the number of rich-resource KT datasets used in the pre-training stage on the performance of low-resource KT datasets.

5.4.4 Impact of Importance Vector (RQ4)

We conduct experiments to further investigate the effectiveness of our proposed fine-tuning strategy with importance vector in low-resource KT datasets. Comparing the results of LoReKT-Ft-221M and LoReKT-Ft-impt-221M in Table 3, we observe that the proposed fine-tuning strategy with importance vector enhances performance across all low-resource datasets. It leads to AUC improvements of 0.66% for the AL2005 dataset and 0.79% for the AS2009 dataset, which is larger than the 0.33% improvement observed for the NIPS34 dataset. This observation suggests that the fine-tuning strategy with importance vector is more effective when the dataset size is insufficient (as shown in Table 1, NIPS34 exhibits a higher interaction count compared to AS2009 and AL2005). We believe that this effectiveness is attributed to the importance vector, which restricts the update of unimportant parameters, thereby mitigating the risk of overfitting.
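The improvements quoted above are absolute AUC differences between the two fine-tuned variants in Table 3, expressed in percentage points. A short check of this arithmetic, with the dictionary names chosen here purely for illustration:

```python
# AUC values copied from Table 3 (low-resource datasets only).
auc_ft = {"AS2009": 0.7833, "NIPS34": 0.7969, "AL2005": 0.8359}
auc_ft_impt = {"AS2009": 0.7912, "NIPS34": 0.8002, "AL2005": 0.8425}

# Absolute gain of the importance-vector variant, in percentage points.
gain_points = {
    name: round((auc_ft_impt[name] - auc_ft[name]) * 100, 2)
    for name in auc_ft
}
print(gain_points)
```

The two smaller datasets (AS2009 and AL2005) show the larger gains, in line with the interaction counts in Table 1.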

| Method | AUC BD2006 | AUC XES3G5M | AUC EdNet | Accuracy BD2006 | Accuracy XES3G5M | Accuracy EdNet |
|---|---|---|---|---|---|---|
| LoReKT-Base-221M | 0.8183 | 0.8192 | 0.7672 | 0.8596 | 0.8271 | 0.7250 |
| w/o data type & dataset embedding | 0.8145 | 0.8139 | 0.7594 | 0.8541 | 0.8239 | 0.7185 |
| w/o data type embedding | 0.8161 | 0.8172 | 0.7646 | 0.8575 | 0.8268 | 0.7243 |
| w/o dataset embedding | 0.8173 | 0.8168 | 0.7625 | 0.8569 | 0.8268 | 0.7246 |

Table 5: The impact of data type and dataset embedding in LoReKT-Base in terms of AUC and Accuracy performance. “w/o” means excluding such module from LoReKT-Base. The best result is indicated in bold, while the second best result is denoted in underline.

5.4.5 Impact of Data Type and Dataset Embeddings (RQ5)

We also analyze the impact of the data type and dataset embeddings introduced in Section 4.1. We report the AUC and Accuracy results in Table 5. Comparing the results of “w/o data type & dataset embedding” and “w/o data type embedding” shows that the proposed dataset embedding enhances the model’s understanding of the corresponding dataset’s prediction paradigm. This effect is particularly pronounced for larger datasets: the improvement in AUC is 0.52% for EdNet, surpassing the improvements of 0.33% in XES3G5M and 0.16% in BD2006. We attribute this to the fact that the dataset embedding is well-represented when the corresponding dataset is sufficiently large (as indicated in Table 1). The results of “w/o data type & dataset embedding” and “w/o dataset embedding” reveal that, across all three datasets, the proposed data type embedding plays an important role in enabling the model to incorporate information from both questions and concepts (e.g., an improvement in AUC of 0.28%, 0.29% and 0.31% in BD2006, XES3G5M, and EdNet respectively).
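The ablation gains quoted above follow from pairwise differences between rows of Table 5. The following sketch reproduces them; the variable and function names are illustrative only, and the gains are absolute AUC differences in percentage points.

```python
# AUC values copied from Table 5.
auc = {
    "full": {"BD2006": 0.8183, "XES3G5M": 0.8192, "EdNet": 0.7672},
    "w/o both": {"BD2006": 0.8145, "XES3G5M": 0.8139, "EdNet": 0.7594},
    "w/o data type": {"BD2006": 0.8161, "XES3G5M": 0.8172, "EdNet": 0.7646},
    "w/o dataset": {"BD2006": 0.8173, "XES3G5M": 0.8168, "EdNet": 0.7625},
}

def gain(with_module, without_module):
    """Absolute AUC gain, in percentage points, between two ablation rows."""
    return {
        name: round((auc[with_module][name] - auc[without_module][name]) * 100, 2)
        for name in auc[with_module]
    }

# "w/o data type" still contains the dataset embedding, so comparing it with
# "w/o both" isolates the effect of the dataset embedding (and vice versa).
dataset_embedding_gain = gain("w/o data type", "w/o both")
data_type_embedding_gain = gain("w/o dataset", "w/o both")
```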

6 Conclusion

In this paper, we focus on improving the performance of DLKT models on low-resource KT datasets. To address this problem, we propose LoReKT, a framework based on a stack of transformer decoders. LoReKT comprises two stages, pre-training and fine-tuning, and does not require a sophisticated architecture tailored to a specific dataset. In the pre-training stage, we build a robust pre-trained KT model from several rich-resource KT datasets. In the fine-tuning stage, we apply a fine-tuning strategy based on an importance mechanism to adapt the pre-trained model effectively to a specific low-resource KT dataset. Extensive quantitative and qualitative experimental results on six real-world datasets demonstrate the superior performance of LoReKT over a wide range of recently proposed DLKT models in terms of AUC and Accuracy.

CRediT authorship contribution statement

Hengyuan Zhang: Methodology, Conceptualization, Investigation, Writing.

Zitao Liu: Writing - review & editing, Supervision.

Shuyan Huang: Writing - review & editing, Supervision.

Chenming Shang: Investigation, Methodology, Writing.

Bojun Zhan: Investigation, Writing.

Yong Jiang: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
