Understanding Forgetting in Continual Learning with Linear Regression:
Overparameterized and Underparameterized Regimes
Abstract
Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD) applicable to both under-parameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence—where tasks with larger eigenvalues in their population data covariance matrices are trained later—tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size will help mitigate forgetting in both under-parameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.
1 Introduction
Continual learning, also known as lifelong learning, is a subfield of machine learning that focuses on developing a model capable of learning continuously from a stream of data, which are sampled i.i.d. from different tasks and presented sequentially to the model. A primary challenge in continual learning is the catastrophic forgetting phenomenon (McCloskey & Cohen, 1989), wherein the model forgets previously acquired knowledge when exposed to new data.
Previous research addressing catastrophic forgetting in continual learning primarily focuses on empirical studies, which can be broadly classified into three categories: expansion-based methods, regularization-based methods, and memory-based methods. Expansion-based methods (Yoon et al., 2017, 2019; Yang et al., 2021) mitigate catastrophic forgetting by allocating distinct subsets of network parameters to individual tasks. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Serra et al., 2018; Liu & Liu, 2022) employ structural regularization in fixed-capacity models to counteract forgetting, penalizing significant changes in parameters that are crucial for previous tasks. Memory-based methods (Shin et al., 2017; Chaudhry et al., 2018; Riemer et al., 2018; Saha et al., 2021; Lin et al., 2022; Hao et al., 2023) alleviate forgetting by storing subsets of previous task data or by synthesizing pseudo-data without data replay.
Recently, there has been a growing body of work focused on understanding the behavior of catastrophic forgetting from a theoretical standpoint. For example, Bennani et al. 2020; Doan et al. 2021 analyze the generalization of continual learning for Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) in the Neural Tangent Kernel (NTK) (Jacot et al., 2018) regime. Lee et al. 2021; Asanuma et al. 2021 explore the impact of task similarity in a teacher-student setting. Evron et al. 2022; Lin et al. 2023 provide a detailed forgetting analysis of the minimum-norm interpolator for the overparameterized linear regression model. However, the existing analyses of forgetting often rely on relatively stringent assumptions that may not be applicable in many scenarios. For example, Bennani et al. 2020; Doan et al. 2021; Evron et al. 2022; Lin et al. 2023 necessitate an overparameterized regime for their analysis, which may be invalid when involving large datasets. Moreover, Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023; Swartworth et al. 2023 assume that data follows a Gaussian distribution that may not hold in real-world datasets exhibiting more complex distributions. Evron et al. 2022; Lin et al. 2023 focus on the minimum-norm interpolator, where each task requires achieving zero loss on its training samples and hence can find a closed-form solution.
In this paper, we investigate the behavior of forgetting under the linear regression model via the more practical Stochastic Gradient Descent (SGD) method and provide a general theoretical analysis that is applicable to both over-parameterized and under-parameterized regimes. Our main contributions can be summarized as follows:
Firstly, our work provides a theoretical analysis of multi-step SGD in both underparameterized and overparameterized regimes, under general fourth-moment conditions on the data rather than the Gaussian distribution assumed in existing studies. Specifically, we provide a novel upper bound on the model forgetting, as well as a matching lower bound that shows the tightness of our characterization. Both bounds are stated as functions of the spectrum of the population data covariance matrix of each task, the step size, the number of training samples, and the effective dimensions.
Secondly, our study provides some interesting insights into the impact of task sequence and algorithmic parameters on the degree of forgetting. Specifically, we show that when the data size is sufficiently large, forgetting tends to increase when tasks whose population data covariance matrices possess larger eigenvalues are trained later. Intuitively, when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance. In addition, our findings reveal that an appropriate choice of step size can help mitigate forgetting in both underparameterized and overparameterized settings. Note that these results cannot be derived from existing works due to their restrictive data distribution assumptions or closed-form updating rules. More detailed discussions can be found in Section 4.
Finally, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs) to validate our theoretical analysis. Our simulation results indicate that both linear regression models and DNNs exhibit increased forgetting when tasks with larger eigenvalues are encountered later. Additionally, we demonstrate that smaller step sizes in training can also mitigate forgetting across task sequences, especially in under-parameterized settings. Interestingly, we observe that in over-parameterized DNNs, higher dimensionality does not necessarily equate to more forgetting if the dataset size is fixed, in contrast to the linear regression case.
1.1 Related Work
In this section, we discuss related work on Covariate Shift, SGD analysis in linear regression, and theoretical studies for catastrophic forgetting.
Covariate Shift Covariate shift is a specific setup in machine learning (Pan & Yang, 2009; Sugiyama & Kawanabe, 2012), referring to a distribution mismatch between the training and test data. The concept is typically applied in transfer learning, which can be seen as a particular instance of continual learning, generally involving two tasks. For example, Mohri & Medina 2012; Cortes & Mohri 2014; Kpotufe & Martinet 2018; Cortes et al. 2019; Hanneke & Kpotufe 2020; Ma et al. 2023; Wu et al. 2022b examine the (regularized) empirical risk minimizer, which focuses on minimizing the empirical and generalization error across accessible datasets. Nevertheless, the standard covariate shift is defined over two distinct data distributions and cannot be directly applied to our case. Consequently, we propose an extended version in Definition 2.2 to better suit our context.
SGD Analysis Recently, several studies have investigated the behavior of Stochastic Gradient Descent (SGD) in linear regression models through the lens of bias-variance decomposition (Défossez & Bach, 2015; Dieuleveut et al., 2017; Jain et al., 2017, 2018) and the eigen-decomposition of the covariance matrix (Chen et al., 2020; Zou et al., 2021; Wu et al., 2022a, b). Our work closely relates to the studies in Zou et al. 2021; Wu et al. 2022b, which also characterize the SGD dynamics in linear regression with respect to the full eigenspectrum of the data covariance matrix. However, they focus on either the single-task setting or the pretraining-finetuning setting, while we study the more challenging continual learning problem that involves a sequence of tasks with different data distributions. Further discussion is provided in Section 4.
Theoretical Studies in Continual Learning Although significant progress has been made in empirical studies addressing the issue of forgetting in continual learning, theoretical insights into this area are still largely unexplored. In this context, Bennani et al. 2020 established a theoretical framework to study continual learning algorithms in the NTK regime, and provided the first generalization bound dependent on task similarity for SGD and OGD. Doan et al. 2021 introduced the NTK overlap matrix as a task similarity metric and proposed a data-structure-informed variant of OGD that utilizes Principal Component Analysis (PCA). Asanuma et al. 2021 utilized the teacher-student framework on a single neural network and demonstrated that catastrophic forgetting can be circumvented when the similarity among input distributions is small and the similarity among teacher networks is large. Lee et al. 2021 expanded an earlier analysis of two-layer networks within the teacher-student setup to the setting with multiple teachers and revealed that the highest level of forgetting occurs when tasks have intermediate similarity with each other. Evron et al. 2022; Swartworth et al. 2023 explained the behavior of forgetting in the linear regression model from the perspectives of alternating projections and the Kaczmarz method (Karczmarz, 1937). Lin et al. 2023 investigated the impact of overparameterization, task similarity, and task ordering on forgetting and generalization in the overparameterized linear regression model.
The works most relevant to our study include (Evron et al., 2022; Lin et al., 2023), both of which also studied the behavior of forgetting in the linear regression model. However, our work differs from their studies in several aspects.
Firstly, with regard to assumptions, Evron et al. 2022 assumed that all data are bounded in norm by 1 and that the model is noiseless, and Lin et al. 2023 assumed that all data are sampled from a Gaussian distribution. In contrast, our assumptions cover more data distributions and are much milder than theirs (see Remark 1 and Section 4 for more details). Secondly, in terms of methods, both Evron et al. 2022 and Lin et al. 2023 analyze the problem of forgetting using the minimum norm solution, which presupposes zero training error, a requirement not necessary in our approach with SGD (see Section 2 for further discussion). Thirdly, Evron et al. 2022; Lin et al. 2023 considered only the overparameterized case where the data dimension is larger than the data size, while our analysis holds for both the underparameterized and overparameterized settings.
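Notations: In this paper, we adhere to a consistent notation style for clarity. We use boldface lowercase letters such as $\mathbf{x}$ for vectors, and boldface capital letters (e.g. $\mathbf{A}$) for matrices. Let $\|\mathbf{A}\|_2$ denote the spectral norm of $\mathbf{A}$ and $\|\mathbf{v}\|_2$ denote the Euclidean norm of $\mathbf{v}$. For two vectors $\mathbf{u}$ and $\mathbf{v}$, their inner product is denoted by $\langle \mathbf{u}, \mathbf{v} \rangle$ or $\mathbf{u}^\top \mathbf{v}$. For two matrices $\mathbf{A}$ and $\mathbf{B}$ of appropriate dimension, their inner product is defined as $\langle \mathbf{A}, \mathbf{B} \rangle := \mathrm{tr}(\mathbf{A}^\top \mathbf{B})$. For a positive semi-definite (PSD) matrix $\mathbf{A}$ and a vector $\mathbf{v}$ of appropriate dimension, we write $\|\mathbf{v}\|_{\mathbf{A}}^2 := \mathbf{v}^\top \mathbf{A} \mathbf{v}$. The outer product is denoted by $\otimes$.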
2 Preliminaries
In our setup, we consider a sequence of tasks, denoted as . For each task in this sequence, we have a corresponding dataset , which consists of data points. Each of these data points, denoted as , is drawn independently and identically distributed (i.i.d.) from a specific distribution . Here, represents the feature vector, and is the response variable for each data point in the dataset . Assume that are i.i.d. sampled from a linear regression model, i.e., each pair is a realization of the linear regression model , where is some randomized noise and is the optimal model parameter.
Our goal is to output a model minimizing the degree of forgetting (Evron et al., 2022) for tasks, i.e.
(1) |
represents the final output after sequentially training on tasks, each updated via SGD over iterations for each task. Equation 1 quantifies an average excess population risk on the final output across all tasks. For each task , the loss evaluates how well performs on it, thus assessing the degree of the model’s forgetting on previous tasks in continual learning scenarios.
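As a reading aid, the quantity described above (an average excess population risk of the final iterate over all tasks) can be sketched as follows. The symbols $M$ (number of tasks), $N$ (SGD iterations per task), $\mathbf{w}_N^{(M)}$ (final iterate), $\mathbf{w}^*$ (optimal parameter), $\mathcal{D}_m$ (distribution of task $m$), and the factor $\tfrac{1}{2}$ in the squared loss are notational assumptions made for this sketch, not necessarily the paper's exact notation.

```latex
\[
  G\bigl(\mathbf{w}_N^{(M)}\bigr)
  \;=\; \frac{1}{M} \sum_{m=1}^{M}
  \Bigl( \mathcal{L}_m\bigl(\mathbf{w}_N^{(M)}\bigr) - \mathcal{L}_m(\mathbf{w}^*) \Bigr),
  \qquad
  \mathcal{L}_m(\mathbf{w})
  \;=\; \tfrac{1}{2}\,
  \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_m}
  \bigl( y - \mathbf{x}^\top \mathbf{w} \bigr)^2 .
\]
```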
Definition 2.1 (Data Covariance).
Assume that each entry and the trace of are finite. Define as the data covariance matrix.
Let denote the eigen decomposition of the data covariance for task , given by , where are eigenvalues in a nonincreasing order and are the corresponding eigenvectors. Define as and allow to imply that .
Definition 2.2 (Covariate Shift).
For each task , the covariates , , are i.i.d. drawn from .
Compared to the concept of covariate shift in transfer learning (Pathak et al., 2022), Definition 2.2 provides a more general scenario applicable to a series of tasks . For simplicity, in our analysis, we assume that each task in our model consists of data points, differentiating it from transfer learning approaches that typically consider the total dataset size as .
Assumption 2.3 (Fourth moment conditions).
Assume that for each task , the expected fourth moment of covariates, denoted as , and the expected covariance matrix are finite. Moreover:
(A) There exists a constant such that for any positive semi-definite (PSD) matrix , the following holds:
(B) There exists a constant , such that for every PSD matrix , the following holds:
Remark 1.
Assumption 2.3 is commonly employed in the analysis of linear regression with SGD methods (Zou et al., 2021; Wu et al., 2022a, b), and is much weaker than the assumptions in the aforementioned related work. Specifically, it can be verified that Assumption 2.3 holds with and for the Gaussian distribution discussed in (Asanuma et al., 2021; Lee et al., 2021; Lin et al., 2023). Additionally, Assumption 2.3(A) can be relaxed to with , where is assumed in Evron et al. 2022.
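The Gaussian case mentioned in the remark can be checked numerically. The sketch below assumes the two conditions take the form used in the cited SGD analyses, namely $\mathbb{E}[\mathbf{x}\mathbf{x}^\top \mathbf{A} \mathbf{x}\mathbf{x}^\top] \preceq \alpha\,\mathrm{tr}(\mathbf{H}\mathbf{A})\,\mathbf{H}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^\top \mathbf{A} \mathbf{x}\mathbf{x}^\top] - \mathbf{H}\mathbf{A}\mathbf{H} \succeq \beta\,\mathrm{tr}(\mathbf{H}\mathbf{A})\,\mathbf{H}$, with the symbols $\mathbf{H}$, $\mathbf{A}$, $\alpha$, $\beta$ and the candidate values $\alpha = 3$, $\beta = 1$ being assumptions of this illustration. It uses the closed-form Gaussian fourth moment (Isserlis' theorem) rather than sampling.

```python
import numpy as np

def gaussian_fourth_moment(H, A):
    """Closed form of E[x x^T A x x^T] for x ~ N(0, H) and symmetric A."""
    return 2.0 * H @ A @ H + np.trace(H @ A) * H

def min_eig(S):
    """Smallest eigenvalue of the symmetric part of S."""
    return np.linalg.eigvalsh((S + S.T) / 2).min()

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
H = B @ B.T / d                      # hypothetical covariance matrix (PSD)
C = rng.standard_normal((d, d))
A = C @ C.T                          # hypothetical PSD test matrix

F = gaussian_fourth_moment(H, A)
# Condition (A) with alpha = 3: this gap should be PSD.
upper_gap = 3.0 * np.trace(H @ A) * H - F
# Condition (B) with beta = 1: this gap should be PSD.
lower_gap = F - H @ A @ H - 1.0 * np.trace(H @ A) * H

print(min_eig(upper_gap), min_eig(lower_gap))
```

Both printed values should be non-negative up to floating-point error; a single random pair of matrices is of course an illustration, not a proof.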
Assumption 2.4 (Well-specified noise).
Assume that for each distribution of task , the response (conditional on input covariates) is given by , where and is independent of .
Similar to previous works, we assume that is some randomized noise that satisfies and for each task .
Continual Learning via SGD
Suppose we train the model parameter sequentially. Let represent the parameter state after the completion of training on task , which also serves as the initial condition for the training of task . Starting with and employing a constant step size , the model is updated by SGD for each task over iterations, with :
(2) | ||||
where represents the gradient of the loss function at task and iteration for a given data point .
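The sequential procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `continual_sgd`, the squared loss with a factor of one half, and the use of one fresh sample per iteration are all choices made for this sketch.

```python
import numpy as np

def continual_sgd(tasks, step_size, w_init=None):
    """Run single-sample SGD over each task in order with a constant step size.

    tasks: list of (X, y) pairs, X of shape (N, d) and y of shape (N,).
    The final iterate of one task is the starting point of the next.
    """
    d = tasks[0][0].shape[1]
    w = np.zeros(d) if w_init is None else np.array(w_init, dtype=float)
    for X, y in tasks:                     # tasks are visited sequentially
        for x_t, y_t in zip(X, y):         # one sample per SGD iteration
            grad = (x_t @ w - y_t) * x_t   # gradient of 0.5 * (x^T w - y)^2
            w = w - step_size * grad
    return w
```

Calling `continual_sgd([(X1, y1), (X2, y2)], step_size=0.01)` returns the parameter vector after the last task, which is the quantity whose forgetting is analyzed.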
Contrastingly, the minimum norm solution in linear regression, particularly relevant in overparameterized settings, aims to find a weight vector that not only achieves zero training error but also possesses the minimal possible norm. Here, represents the outcome post-training for task , and it also serves as the starting point for training task . The objective, beginning from an initial condition , is defined by the following optimization problem:
where and . The update rules for each iteration follow as:
(3) |
where highlights the computational intensity of inverting the matrix . This is particularly challenging for large datasets or overparameterized feature spaces. Unlike the minimum norm solution, SGD does not assume the existence of a unique, exact solution and is more adaptable to a variety of problems, including those with non-linear dynamics.
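For comparison, the minimum norm update can be sketched as a projection of the previous parameter onto the set of interpolating solutions. The closed form used here, $\mathbf{w}_{\text{prev}} + \mathbf{X}^\top (\mathbf{X}\mathbf{X}^\top)^{-1} (\mathbf{y} - \mathbf{X}\mathbf{w}_{\text{prev}})$, is the standard expression for this problem and is assumed to correspond to Equation 3; the function name is hypothetical.

```python
import numpy as np

def min_norm_update(w_prev, X, y):
    """Closest point to w_prev (Euclidean norm) that fits (X, y) exactly.

    Assumes X has full row rank, i.e. more features than samples.
    """
    residual = y - X @ w_prev
    # Solve (X X^T) a = residual rather than forming the inverse explicitly.
    a = np.linalg.solve(X @ X.T, residual)
    return w_prev + X.T @ a
```

The linear solve costs on the order of the cube of the number of samples, which is the computational burden referred to above; in the underparameterized case the matrix being solved against is singular and this update is not defined.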
3 Main Results
Before presenting our upper bound, we shall establish the following notations to facilitate comprehension of the results.
(4) |
where are eigenvalues of in a nonincreasing order and represents the cut-off index for . Here, and can be regarded as a projection accumulation from task to task , and basically capture the impact of the learning dynamic of previous tasks on the subsequent task. is defined with respect to the cut-off index for each task’s data covariance matrix that captures both the dominant eigenvalues and the tail of the spectrum, and denotes the sum of the -th eigenvalue across all tasks.
In the following, we first provide our upper bound for the behavior of forgetting via SGD in the linear regression model.
Theorem 3.1 (Upper Bound).
Consider a scenario where the model undergoes training via SGD for distinct tasks, following a sequence . With a constant step size of given that , each task is executed for iterations. Given that Assumptions 2.3(A) and 2.4 are satisfied, the following holds:
where the variance and bias errors are upper-bounded by
where the effective dimensions are given by
(5) | ||||
with and defined as in Equation 4 and denoting .
In Theorem 3.1, we establish an upper bound on the forgetting behavior of a model trained using SGD in continual learning under varying data distributions. It highlights that the model’s performance is influenced by both and , where stems from the noise inherent to the model and represents the bias associated with the initial value during the learning process. Notice that both are determined jointly by the spectrum of the covariance matrices and the step sizes used in continual learning.
To provide a more intuitive explanation, we explore a simplified scenario by setting . Specifically, this setting simplifies our analysis by reducing the error terms to only the first term in bias error, which appears to depend solely on the initial weight and the data. However, this simplification might misleadingly imply that a minimal would result in optimal learning outcomes. A crucial aspect overlooked in this interpretation is the role of the projection term , which becomes an identity matrix when . Thus, while setting eliminates other error terms, it also exacerbates the first term of bias error, potentially making it the most significant error contributor. Consequently, there exists a trade-off in choosing the step size.
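The trade-off can be illustrated on the projection term alone. The sketch below assumes the term has the form $(\mathbf{I} - \eta \mathbf{H})^N$, so that its eigenvalues are $(1 - \eta \lambda_i)^N$; the spectrum and step sizes are hypothetical values chosen for illustration.

```python
import numpy as np

def projection_spectrum(eigvals, step_size, n_iters):
    """Eigenvalues of (I - step_size * H)^n_iters given the eigenvalues of H."""
    return (1.0 - step_size * np.asarray(eigvals, dtype=float)) ** n_iters

eigvals = np.array([1.0, 0.5, 0.1, 0.01])   # hypothetical covariance spectrum
for eta in (0.0, 0.01, 0.5):
    print(eta, projection_spectrum(eigvals, eta, n_iters=100))
```

With a zero step size every entry equals one, so the initial bias is not contracted at all, whereas a larger step size shrinks the directions with large eigenvalues quickly and leaves the tail of the spectrum almost untouched.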
The subsequent theorem presents a nearly matching lower bound.
Theorem 3.2 (Lower Bound).
Consider a scenario where the model undergoes training via SGD for distinct tasks, following a sequence . With a constant step size of given that , each task is executed for iterations. Given that Assumptions 2.3(B) and 2.4 are satisfied, the following holds:
where the variance and bias errors are lower bounded by
where the effective dimensions and are the same as in Theorem 3.1, and .
Analogous to Theorem 3.1, our lower bound also consists of a bias term and a variance term. It is noteworthy that our lower bound matches the upper bound in the variance term, differing only by absolute constants. Additionally, our lower bound closely matches the upper bound in the bias term, with some differences arising from the following quantities
Specifically, here differs from in Theorem 3.1 only by constant factors (i.e. and defined in Assumption 2.3). The term has a different subscript of compared to that of the upper bound. Nevertheless, it can be regarded as a part of the projection accumulation that exists in the subscript of both results simultaneously.
More importantly, we show that the upper and lower bounds converge, ignoring constant factors, under the conditions
which can be satisfied when the signal-to-noise ratio is bounded and the step size is appropriately small.
4 Discussion
Building on Theorem 3.1 and Theorem 3.2, we aim to offer a more comprehensive understanding of our findings from three key perspectives: 1) Technical Understanding Under Simplified Cases; 2) Comparison with Existing Work; 3) The Impact of Task Ordering and Parameters on Forgetting.
4.1 Technical Understanding Under Simplified Cases
In this section, we demonstrate how to achieve a vanishing bound in the overparameterized regime.
Based on Theorem 3.1, we consider a scenario where and for each task , implying a rapid decay in the spectrum of . To obtain a vanishing bound in the overparameterized regime, the effective dimension should hold that
(6) | ||||
To meet the condition in Equation 6, for each task , let and . It necessarily holds that
(7) | ||||
To clarify Equation 7, note the crucial cut-off indices and , which divide the entire feature space into -dimensional and -dimensional subspaces. To achieve a diminishing bound in the overparameterized setting, it is necessary that the sum of eigenvalues for indices less than , denoted as , should be , and that the sum of the tail eigenvalues for indices greater than should be . These conditions are typically met when the dataset size is sufficiently large, or when a smaller step size is chosen dependent on . Additionally, we note that the condition in Equation 7 can be relaxed. In light of the definition of , the eigenvalues for task are truncated according to one of the following two scenarios. 1) : here, the cut-off for task occurs earlier, resulting in an additional dimensions of eigenvalues such that . To achieve a diminishing bound under this condition, it is necessary that . 2) : in this case, the cut-off for task occurs later, involving an additional dimensions of eigenvalues where , achieving the same results.
In the under-parameterized regime, we even account for the worst-case scenario where for all indices and tasks , leading to a bound of
4.2 Comparison with Existing Work
In this section, we will first explore the challenges and parallels between traditional/transfer learning and continual learning. Secondly, we examine how restrictive assumptions in previous studies might overshadow the impact of key factors, thereby affecting the overall understanding of forgetting in continual learning.
Our results reveal that compared to traditional learning (Zou et al., 2021), which typically involves a single task, and transfer learning (Wu et al., 2022b), which usually incorporates two data distributions, the effective dimension in continual learning scenarios is more complex. Specifically, in our analysis, the term arises from a distinct measurement perspective (i.e. forgetting), which requires us to consider how the final output aligns with all previously encountered tasks in continual learning (i.e. for all ). This is in contrast to both traditional training and transfer learning, where the evaluation metric is uniformly focused on performance against a single dataset (i.e. ). Moreover, the multi-task nature of continual learning introduces unique challenges in handling the bias iterates and variance iterates; we refer to the proofs in the Appendix for more details.
Given that our analysis, similar to theirs, characterizes bounds with the full eigenspectrum of the data covariance matrix, it follows that our derived results match their findings in several aspects: The cutoff index is uniquely determined for each task in continual learning, akin to the one in Zou et al. 2021; Wu et al. 2022b, where they identify corresponding indices and . The projection terms and also occur in transfer learning (Wu et al., 2022b), showing how previous iterations/past learning is projected onto the future updates.
Previous work (Evron et al., 2022) also explored the dynamics of forgetting through the perspective of projection. We first revisit the findings presented by Evron et al. 2022. Considering a scenario where the number of iterations , the update rule in their analysis can be reformulated as follows:
(8) |
where they incorporate the noiseless model assumption that . As a result, the forgetting in Evron et al. 2022 holds that
indicating that the forgetting dynamic can be determined by the projection of , where . However, their study differs from ours in several key respects. 1) The inherent model noise: Evron et al. 2022 consider a noiseless model, where results in the absence of an additional iterative term related to noise in Equation 8. This omission leads to a lack of accumulated variance error in the evaluation of forgetting performance (i.e. in our analysis). It is worth mentioning that in numerous learning problems, the variance error often plays a dominant role in the total error (Jain et al., 2018; Zou et al., 2021; Wu et al., 2022b). 2) The bounded norm : their bounded-norm assumption omits the interaction with projection effects, an interaction that is crucial in our analysis and appears as the factor in Theorem 3.1 and Theorem 3.2. 3) Last-iterate SGD results: Evron et al. 2022 show that, with a step size, their worst-case expected forgetting becomes a dimension-dependent bound of . This analysis, conducted under the overparameterized regime, suggests the occurrence of catastrophic forgetting. In contrast, our results, as discussed earlier, offer a different perspective, suggesting that a vanishing forgetting bound can be achieved in overparameterized settings when certain conditions are met.
We note that Lin et al. 2023 also investigate the relationship between catastrophic forgetting and factors such as task sequence (order) and dimensionality. However, their results tend to be vacuous in the under-parameterized setting since , the data matrix for task , is non-invertible when employing the minimum norm solution, as discussed earlier in Section 2. Due to space constraints, a more extensive discussion is provided in Appendix D.
4.3 The Impact of Task Ordering and Parameters on Forgetting
In the upcoming discussion, we will present theoretical insights derived from our results.
Notice that the bounds in Theorem 3.1 and Theorem 3.2 contain two crucial factors: the effective dimension and the covariance accumulation /. We first discuss the effective dimension. Each consists of a projection term and the eigenvalues , with serving as the constant. It can be observed that when the data size approaches infinity, the projection term converges to , implying that the eigenvalue will predominantly dictate the larger effective dimension with respect to . This observation highlights the substantial influence of eigenvalues on task sequence in continual learning. Specifically, it shows that when the data size is sufficiently large, task sequences in which tasks associated with larger eigenvalues in their population data covariance matrix are trained later exhibit more forgetting. Additionally, if the step size is appropriately small, the projection term stabilizes to a constant less than 1, leading to similar outcomes as in the first scenario. It is noteworthy that these insights cannot be derived from existing analyses due to their restrictive assumptions, such as the Gaussian data distribution in Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023 and the minimum norm solution in Evron et al. 2022; Lin et al. 2023; Swartworth et al. 2023.
The covariance accumulation term, , which includes the covariance matrices and the step size , plays a crucial role in demonstrating how previously acquired information is retained and influences the model’s adaptability to new tasks. Notably, there is an interesting contradiction in the optimal accumulation order within compared to the projection term in . Specifically, earlier occurrence of with larger expected eigenvalues tends to increase the degree of forgetting. Meanwhile, an important observation is that if the step size is sufficiently small, the impact of the covariance accumulation term becomes less significant. This interplay between the effective dimension and covariance accumulation elucidates the complexities inherent in continual learning scenarios.
5 Empirical Simulation
In this section, we conduct experiments using synthetic data to validate our theoretical results and shed light on the intricate interplay between eigenvalues, step size, and dimensionality.
Experimental Setup In our study, we designed three distinct tasks, denoted as Tasks 1, 2, and 3, each with a different feature space. During the initial simulations, the eigenvalues for the feature values of Tasks 1, 2, and 3 were set according to , and respectively. To mimic real-world data imperfections, Gaussian noise with a standard deviation of 0.1 was added to the labels. We assessed the impact of task sequence on the model’s tendency to forget by evaluating six different task orders: [1, 2, 3], [2, 1, 3], [1, 3, 2], [3, 1, 2], [2, 3, 1], and [3, 2, 1].
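A setup of this kind can be sketched as follows. The polynomial decay rates, the diagonal Gaussian features, the shared ground-truth parameter, and the helper name `make_task` are assumptions made for illustration; only the label noise level of 0.1 and the six task orders are taken from the description above.

```python
import itertools
import numpy as np

def make_task(eigvals, n_samples, w_star, noise_std=0.1, rng=None):
    """Draw Gaussian features with covariance diag(eigvals) and noisy labels."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_samples, len(eigvals))) * np.sqrt(eigvals)
    y = X @ w_star + noise_std * rng.standard_normal(n_samples)
    return X, y

d, n = 10, 500
rng = np.random.default_rng(0)
w_star = rng.standard_normal(d)                 # shared optimal parameter
idx = np.arange(1, d + 1, dtype=float)
# Hypothetical spectra: Task 3 has the slowest decay, hence the largest eigenvalues.
spectra = {1: idx ** -3.0, 2: idx ** -2.0, 3: idx ** -1.0}
tasks = {k: make_task(s, n, w_star, rng=rng) for k, s in spectra.items()}
orders = list(itertools.permutations([1, 2, 3]))   # the six task orders
```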
5.1 Linear Regression
Training and Evaluation For this experiment, a linear regression model was trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.01 or 0.001. The model was tested in both low-dimensional (10 input features) and high-dimensional (1000 input features) settings. Each task sequence underwent training with various data sizes, ranging from 100 to 950 in increments of 50, and each task was trained for five epochs. The performance of the model was evaluated on each task to calculate the average excess risk (Equation 1), quantifying the degree of forgetting the model experienced.
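When the population covariance of each task is known, as in a synthetic setting, the average excess risk can be computed in closed form instead of being estimated on held-out data. The sketch below assumes a well-specified linear model and a squared loss with a factor of one half, so that the excess risk on one task is half the covariance-weighted squared distance to the optimal parameter; the function name is hypothetical.

```python
import numpy as np

def average_excess_risk(w, w_star, covariances):
    """Mean over tasks of 0.5 * (w - w*)^T H_m (w - w*)."""
    diff = np.asarray(w, dtype=float) - np.asarray(w_star, dtype=float)
    return float(np.mean([0.5 * diff @ H @ diff for H in covariances]))
```

Evaluating this function on the final iterate for each of the six task orders gives one forgetting value per order, which is the quantity compared across sequences.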
Impact of Eigenvalue Sequencing The observations from Figure 1(a) and Figure 1(c) reveal the significant impact of eigenvalue sequencing on forgetting behavior in the underparameterized regime. Notably, task sequences arranged such that tasks with larger eigenvalues (i.e. Task 3 in our case, characterized by ) are trained later in the learning process tend to result in increased forgetting. This empirical finding aligns well with our theoretical analysis (the term discussed in Section 4.3). In an under-parameterized setting, or when the eigenvalues decay rapidly, the effective dimension, which is crucial in determining the model’s forgetting performance, is largely influenced by the eigenvalues. Such a pattern is intuitive: when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance.
Impact of Dimensionality Our results, depicted in Figure 1(c) and Figure 1(d), show that in under-parameterized scenarios, performance remains relatively unaffected by an increase in dimensionality. However, in over-parameterized settings, the model tends to exhibit increased forgetting as dimensionality rises, particularly when the data size is kept constant. This highlights the varying impact of dimensionality on model performance in different parameterization contexts. In higher-dimensional settings, the influence of the projection term , as shown in Theorem 3.1, diminishes in comparison to the impact of and . Consequently, as the number of features in the model increases, the sequence in which tasks are presented becomes less significant in determining the model’s forgetting behavior. This shift implies that, in high-dimensional scenarios, the inherent complexity and the distribution of eigenvalues of the feature space play a more critical role than the sequence of tasks, influencing the model’s learning and retention capabilities.
Impact of Step-size Our results, depicted in Figure 1(e) and Figure 1(f), reveal that a smaller step size effectively reduces forgetting in various task sequences and across different dimensionalities. This trend is especially noticeable in high-dimensional feature spaces, where a reduced step size markedly lowers the rate of forgetting. This observation is in line with the theoretical insights provided in Theorem 3.1 and Theorem 3.2, as smaller step sizes may lead to more refined updates during training, allowing the model to incrementally adjust to new tasks while preserving knowledge from previous ones.
5.2 Implication on DNNs
In this subsection, we adopt the same data generation and task setup as outlined in Section 5.1, but replace the linear model with a neural network. This model comprises an input layer, a hidden layer with ten neurons, and an output layer, and it undergoes a training process akin to that of linear regression.
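For concreteness, a network of this shape can be sketched as follows. This is an illustrative sketch only: the tanh activation, the Gaussian initialization, the absence of bias terms, and the names `init_mlp`, `mlp_loss_and_grads`, and `sgd_step` are assumptions, since the text specifies only the layer structure and the hidden width of ten.

```python
import numpy as np

HIDDEN = 10  # hidden width stated in the text

def init_mlp(rng, d):
    """Initialize input->hidden weights W1 and hidden->output weights w2."""
    W1 = rng.normal(size=(HIDDEN, d)) / np.sqrt(d)
    w2 = rng.normal(size=HIDDEN) / np.sqrt(HIDDEN)
    return W1, w2

def mlp_loss_and_grads(params, x, y):
    """Squared loss 0.5 * (f(x) - y)^2 and its gradients for one sample."""
    W1, w2 = params
    h = np.tanh(W1 @ x)        # hidden activations (activation is an assumption)
    err = w2 @ h - y           # prediction error
    grad_w2 = err * h
    grad_W1 = np.outer(err * w2 * (1.0 - h ** 2), x)
    return 0.5 * err ** 2, (grad_W1, grad_w2)

def sgd_step(params, x, y, step_size):
    """One-sample SGD update, mirroring the linear regression procedure."""
    _, (g1, g2) = mlp_loss_and_grads(params, x, y)
    return params[0] - step_size * g1, params[1] - step_size * g2
```

Training on a task sequence then amounts to calling `sgd_step` once per sample and carrying the parameters from one task to the next without reinitialization, as in the linear case.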
Impact of Eigenvalue Sequencing In our studies with Deep Neural Networks (DNNs), we still find that task sequences ending with tasks that have larger eigenvalues tend to exhibit increased forgetting, especially in under-parameterized settings, similar to linear regression models. This indicates that the tendency to overfit observed in linear models, particularly when tasks with larger eigenvalues are trained later in the sequence, may occur in DNNs as well.
Impact of Dimensionality Our results also reveal consistent behavior between DNNs and linear regression with respect to dimensionality. In under-parameterized scenarios (Figure 1(k)), forgetting remains stable despite increased dimensionality, while in over-parameterized settings (Figure 1(l)), higher dimensionality leads to more forgetting when the data size is fixed. However, the adverse effects of higher dimensions can be alleviated by expanding the dataset size, as demonstrated in Figure 1(j). This is a notable contrast with linear regression, and it suggests that the complex structures of DNNs are better suited to manage and learn from high-dimensional data in continual learning scenarios. The different behaviors observed between DNNs and linear regression models are a potentially interesting direction for future work.
Impact of Step Size Our results, depicted in Figure 1(e) and Figure 1(f), indicate that in under-parameterized settings, a smaller step size significantly lessens the influence of task sequences on forgetting, while in models with high-dimensional features, forgetting can be mitigated even without adjusting the step size.
6 Conclusion
In this work, we contribute to the understanding of catastrophic forgetting in continual learning via a multi-step SGD algorithm. Our theoretical analysis establishes bounds that illustrate the impact of various factors on forgetting, such as the spectrum of the data covariance matrix, step size, data size, and dimensionality, which were not fully captured in previous studies due to their restrictive assumptions. This theoretical understanding is corroborated by simulations conducted on linear regression models and Deep Neural Networks.
Impact Statements
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Acknowledgments
The research of Meng Ding and Jinhui Xu was supported in part by KAUST through grant CRG10-4663.2. Di Wang was supported in part by the baseline funding BAS/1/1689-01-01, funding from the CRG grant URF/1/4663-01-01, REI/1/5232-01-01, REI/1/5332-01-01, FCC/1/1976-49-01 from CBRC of King Abdullah University of Science and Technology (KAUST). Di Wang was also supported by the funding RGC/3/4816-09-01 of the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).
References
- Aljundi et al. (2018) Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pp. 139–154, 2018.
- Asanuma et al. (2021) Asanuma, H., Takagi, S., Nagano, Y., Yoshida, Y., Igarashi, Y., and Okada, M. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. Journal of the Physical Society of Japan, 90(10):104001, 2021.
- Bennani et al. (2020) Bennani, M. A., Doan, T., and Sugiyama, M. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv preprint arXiv:2006.11942, 2020.
- Chaudhry et al. (2018) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
- Chen et al. (2020) Chen, X., Liu, Q., and Tong, X. T. Dimension independent generalization error by stochastic gradient descent. arXiv preprint arXiv:2003.11196, 2020.
- Cortes & Mohri (2014) Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
- Cortes et al. (2019) Cortes, C., Mohri, M., and Medina, A. M. Adaptation based on generalized discrepancy. The Journal of Machine Learning Research, 20(1):1–30, 2019.
- Défossez & Bach (2015) Défossez, A. and Bach, F. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pp. 205–213. PMLR, 2015.
- Dieuleveut et al. (2017) Dieuleveut, A., Flammarion, N., and Bach, F. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
- Doan et al. (2021) Doan, T., Bennani, M. A., Mazoure, B., Rabusseau, G., and Alquier, P. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. In International Conference on Artificial Intelligence and Statistics, pp. 1072–1080. PMLR, 2021.
- Evron et al. (2022) Evron, I., Moroshko, E., Ward, R., Srebro, N., and Soudry, D. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pp. 4028–4079. PMLR, 2022.
- Farajtabar et al. (2020) Farajtabar, M., Azizan, N., Mott, A., and Li, A. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp. 3762–3773. PMLR, 2020.
- Hanneke & Kpotufe (2020) Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning, 2020.
- Hao et al. (2023) Hao, J., Ji, K., and Liu, M. Bilevel coreset selection in continual learning: A new formulation and algorithm. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Jain et al. (2017) Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., Pillutla, V. K., and Sidford, A. A markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv preprint arXiv:1710.09430, 2017.
- Jain et al. (2018) Jain, P., Kakade, S., Kidambi, R., Netrapalli, P., and Sidford, A. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of machine learning research, 18, 2018.
- Karczmarz (1937) Karczmarz, S. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Let., Cl. Sci. Math. Nat., pp. 355–357, 1937.
- Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
- Kpotufe & Martinet (2018) Kpotufe, S. and Martinet, G. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, pp. 1882–1886. PMLR, 2018.
- Lee et al. (2021) Lee, S., Goldt, S., and Saxe, A. Continual learning in the teacher-student setup: Impact of task similarity. In International Conference on Machine Learning, pp. 6109–6119. PMLR, 2021.
- Lin et al. (2022) Lin, S., Yang, L., Fan, D., and Zhang, J. Trgp: Trust region gradient projection for continual learning. arXiv preprint arXiv:2202.02931, 2022.
- Lin et al. (2023) Lin, S., Ju, P., Liang, Y., and Shroff, N. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023.
- Liu & Liu (2022) Liu, H. and Liu, H. Continual learning with recursive gradient optimization. arXiv preprint arXiv:2201.12522, 2022.
- Ma et al. (2023) Ma, C., Pathak, R., and Wainwright, M. J. Optimally tackling covariate shift in rkhs-based nonparametric regression, 2023.
- McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
- Mohri & Medina (2012) Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions, 2012.
- Pan & Yang (2009) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
- Pathak et al. (2022) Pathak, R., Ma, C., and Wainwright, M. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, pp. 17517–17530. PMLR, 2022.
- Riemer et al. (2018) Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
- Saha et al. (2021) Saha, G., Garg, I., and Roy, K. Gradient projection memory for continual learning. arXiv preprint arXiv:2103.09762, 2021.
- Serra et al. (2018) Serra, J., Suris, D., Miron, M., and Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pp. 4548–4557. PMLR, 2018.
- Shin et al. (2017) Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
- Sugiyama & Kawanabe (2012) Sugiyama, M. and Kawanabe, M. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press, 2012.
- Swartworth et al. (2023) Swartworth, W. J., Needell, D., Ward, R., Kong, M., and Jeong, H. Nearly optimal bounds for cyclic forgetting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X25L5AjHig.
- Wu et al. (2022a) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. In International Conference on Machine Learning, pp. 24280–24314. PMLR, 2022a.
- Wu et al. (2022b) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. The power and limitation of pretraining-finetuning for linear regression under covariate shift. Advances in Neural Information Processing Systems, 35:33041–33053, 2022b.
- Yang et al. (2021) Yang, L., Lin, S., Zhang, J., and Fan, D. Grown: Grow only when necessary for continual learning. arXiv preprint arXiv:2110.00908, 2021.
- Yoon et al. (2017) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
- Yoon et al. (2019) Yoon, J., Kim, S., Yang, E., and Hwang, S. J. Scalable and order-robust continual learning with additive parameter decomposition. arXiv preprint arXiv:1902.09432, 2019.
- Zou et al. (2021) Zou, D., Wu, J., Braverman, V., Gu, Q., and Kakade, S. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory, pp. 4633–4635. PMLR, 2021.
Appendix A Support Lemmas
Notations
For two matrices and , their inner product is defined as . For each task , we define the following linear operators:
We use the notation to denote the operator acting on a symmetric matrix . For example, with these definitions, we have that for a symmetric matrix ,
It can be readily understood that the following properties are satisfied:
Lemma A.1 (Zou et al., 2021).
An operator , when defined on symmetric matrices, is termed a Positive Semi-Definite (PSD) mapping if implies . Consequently, for each task we have:
1. and are both PSD mappings.
2. and are both PSD mappings.
3. and are both PSD mappings.
4. If , then exists, and is a PSD mapping.
5. If , then exists for PSD matrix , and is a PSD mapping.
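The PSD-mapping properties can be checked numerically on an empirical distribution. The sketch below is illustrative and rests on an assumption: because the operator definitions are not reproduced above, it uses the forms given in Zou et al. (2021), namely $\mathcal{M} \circ A = \mathbb{E}[(x^\top A x)\, x x^\top]$ and $(\mathcal{I} - \gamma \mathcal{T}) \circ A = \mathbb{E}[(I - \gamma x x^\top) A (I - \gamma x x^\top)]$, with the expectation replaced by an average over samples. The function names are hypothetical.

```python
import numpy as np

def apply_M(X, A):
    """Empirical M o A: mean over samples of (x^T A x) x x^T. X has shape (n, d)."""
    quad = np.einsum("ni,ij,nj->n", X, A, X)  # x^T A x for each sample
    return (X * quad[:, None]).T @ X / X.shape[0]

def apply_I_minus_gamma_T(X, A, gamma):
    """Empirical (I - gamma T) o A: mean of (I - gamma x x^T) A (I - gamma x x^T)."""
    d = X.shape[1]
    out = np.zeros((d, d))
    for x in X:
        P = np.eye(d) - gamma * np.outer(x, x)
        out += P @ A @ P
    return out / X.shape[0]

def min_eig(S):
    """Smallest eigenvalue of the symmetrized matrix."""
    return float(np.linalg.eigvalsh((S + S.T) / 2).min())
```

Both outputs are averages of matrices of the form $c\, x x^\top$ with $c \ge 0$ or $P A P$ with symmetric $P$, so for any PSD input `A` the value of `min_eig` on the result should be non-negative up to floating-point error.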
Then for the SGD iterates, we can consider their associated bias iterates and variance iterates:
(9)
(10)
where and .
Lemma A.2 (Bias-variance decomposition).
Suppose that Assumption 2.4 holds. Then we have:
Appendix B Variance Error
B.1 Upper Bound
The assumption presented below can be inferred from Assumption 2.3 by setting , given that .
Assumption B.1 (Relaxed version).
For each task , there exists a constant such that:
Proof.
This lemma is derived directly from the Lemmas in (Jain et al., 2018; Zou et al., 2021). To ensure completeness, we include a proof as follows.
We prove the lemma via induction. Initially, for , it is evident that . Now, assuming that , let us examine in light of Equation 9. When , for each task , it implies:
(11)
∎
Proof.
We first examine the recursion from to for each task :
where the penultimate inequality is derived from Lemma B.2.
Hence, after iterations, we could have the following results for task :
Now, we consider the first task together with Lemma B.5 in (Zou et al., 2021), which implies:
By combining the aforementioned results and denoting , we obtain:
∎
Based on Lemma A.2, the upper bound of the variance error can be expressed as follows:
(12)
Let us consider the variance terms separately.
variance term 1 (13)
where we use the facts that hold for all in the last inequality.
Before we turn our attention to the second term, we first consider the :
Substituting the above into variance term 2, we have:
variance term 2 (14)
Similarly, for the last term, we have:
(15)
B.2 Lower Bound
Now, we shift our focus to the lower bound of the variance. Similarly, the following lemma holds:
Proof.
In a similar fashion, let’s first examine the recursion of from to for each task .
where we utilize the fact that is a PSD mapping, as established by Lemma A.1.
Consequently, after iterations, the following results can be deduced for task :
Now, we consider the first task incorporating the Lemma C.2 in (Zou et al., 2021), which implies:
By combining the aforementioned results and denoting , we obtain:
which completes the proof. ∎
Drawing from Lemma A.2, the lower bound of the variance error is expressed as follows:
(16)
Analogous to the approach for the upper bound, we will examine the terms one by one.
(17)
To further lower bound the two terms, we note the following inequality:
Hence, for the first term, we have:
For the variance term , we notice that:
Substituting the above into variance term 2’, we have:
Similarly, for variance term 3’, it holds that:
Appendix C Bias Error
Before providing the proof of the bias bound, we first introduce the following lemmas for traditional SGD training from Zou et al. (2021).
Lemma C.1 (Summation of bias iterates (Zou et al., 2021)).
Suppose that Assumption 2.3 holds. Suppose that . Then for every and each task , it holds that:
Lemma C.2.
Under Assumption 2.3, let . If the stepsize satisfies , then for any and each task , it holds that:
where denoting .
Lemma C.3.
C.1 Upper Bound
We first examine the recursion from to for each task :
(18)
where the penultimate inequality is derived from Assumption 2.3.
Hence, after iterations, we could have the following results for task :
We now examine the second term for each :
where we know the following holds:
Moreover, we have . Therefore, it holds that:
It implies:
where we denote and define . Therefore, Section C.1 can be represented as follows:
We first consider the term 1 with Lemma C.2.
where is the index of the smallest eigenvalue of satisfying , and denotes .
Moreover, , Hence:
Now we are ready to examine the term 2.
Let us denote .
By combining the aforementioned results, we obtain:
where denoting .
Based on Lemma A.2, the upper bound of the bias error can be expressed as follows:
For each :
(19)
Hence,
C.2 Lower Bound
We first examine the recursion from to for each task :
(20)
Hence, after iterations, we could have the following results for task :
We now examine the second term for each :
(22)
Substituting the above into Equation 22 and denoting , we have:
From the Lemma, we have:
Then, for each task , we examine the term 1:
The first bias item is lower bounded by:
The second bias item is lower bounded by:
To further lower bound the two terms, we notice that:
Substituting to the previous results, we have:
bias term
and
bias term
Now we are ready to examine term 2.
Analogous to term 1, we have:
bias term
and
bias term
After iterations, it holds that:
where denoting .
Then, the bias error can be represented as follows:
It follows that:
where
and
Appendix D Extension work
Note that when the step size is set to , the update rule for the minimum norm solution can be considered equivalent to that of the last iterate SGD. Consequently, in this appendix, we focus on a particular case (akin to the setting in Lin et al. 2023) that involves this specific step size, allowing us to draw direct comparisons and insights under a defined set of conditions.
Consider a series of tasks . Given datasets, for each dataset , drawn i.i.d. from some fixed distribution . Assume that are i.i.d. sampled from a linear regression model, i.e., each is a realization of the linear regression model , where is some random noise satisfying the well-specified condition and is the optimal model parameter for task .
We adopt the same learning procedure with specific step size, aiming to output a model minimizing the performance (Lin et al., 2023), i.e.
(23)
Therefore, our results can be restated as follows
Theorem D.1.
Consider a scenario where the model undergoes training via SGD for distinct tasks, following a sequence . With a specific step size of , each task is executed for iterations. Given that Assumption 2.4 is satisfied, the following holds:
Remark 2.
In contrast to the approach in Theorem 3.1 and Theorem 3.2, here we do not rely on the decomposition of bias and variance error while considering that the projection is orthogonal to with a specific stepsize . This perspective allows us to derive a closed-form expression for the expected performance, which integrates the impact of initial parameter deviations, task-specific parameter variations, and random noise. Furthermore, Theorem D.1 in our study explores the performance behavior on general data distributions, expanding beyond the Gaussian distribution context discussed in Lin et al. 2023. In scenarios where there is only a single sample per training iteration, our results could cover their findings.
Proof.
For each iteration, according to the update rule of SGD, it holds that
which can be rewritten as:
We consider the expectation norm for both sides:
where equation (*) follows from the choice of step size, under which and are orthogonal projections; the resulting update equals the minimum norm solution with one sample.
Considering tasks, it holds that
In conclusion, we aggregate the performance metrics across tasks, ranging from to , to derive the final result. ∎
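The projection property invoked in step (*) can be verified numerically. The sketch below assumes that the elided step size is $1/\|x_t\|^2$ for the sample $x_t$ used at iteration $t$, the normalization under which a one-sample SGD step on the squared loss coincides with a Kaczmarz-type projection. The function names are hypothetical, and the code is an illustration rather than part of the proof.

```python
import numpy as np

def normalized_sgd_step(w, x, y):
    """One-sample SGD step on 0.5 * (x^T w - y)^2 with step size 1 / ||x||^2."""
    return w - (x @ w - y) / (x @ x) * x

def projection_pair(x):
    """Return (P, Q): P projects onto span{x}, Q = I - P onto its complement."""
    P = np.outer(x, x) / (x @ x)
    return P, np.eye(x.shape[0]) - P
```

After one step the new iterate fits the sample exactly, and in the noiseless case `w_new - w_star` equals `Q @ (w - w_star)`. Since `P` and `Q` are complementary orthogonal projections, the squared norm of the error splits across the two components, which is the structure the closed-form expression relies on.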