Understanding Forgetting in Continual Learning with Linear Regression:
Overparameterized and Underparameterized Regimes

Meng Ding Kaiyi Ji Di Wang Jinhui Xu

Abstract

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD) applicable to both under-parameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence—where tasks with larger eigenvalues in their population data covariance matrices are trained later—tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size will help mitigate forgetting in both under-parameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.

Machine Learning, ICML

1 Introduction

Continual learning, also known as lifelong learning, is a subfield of machine learning that focuses on developing a model capable of learning continuously from a stream of data, which are sampled i.i.d. from different tasks and presented sequentially to the model. A primary challenge in continual learning is the catastrophic forgetting phenomenon (McCloskey & Cohen, 1989), wherein the model forgets previously acquired knowledge when exposed to new data.

Previous research addressing catastrophic forgetting in continual learning primarily focuses on empirical studies, which can be broadly classified into three categories: expansion-based methods, regularization-based methods, and memory-based methods. Expansion-based methods (Yoon et al., 2017, 2019; Yang et al., 2021) mitigate catastrophic forgetting by allocating distinct subsets of network parameters to individual tasks. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Serra et al., 2018; Liu & Liu, 2022) employ structural regularization in fixed-capacity models to counteract forgetting, penalizing significant changes in parameters that are crucial for previous tasks. Memory-based methods (Shin et al., 2017; Chaudhry et al., 2018; Riemer et al., 2018; Saha et al., 2021; Lin et al., 2022; Hao et al., 2023) alleviate forgetting by storing subsets of previous task data or synthesizing pseudo-data without data-replay.

Recently, there has been a growing body of work focused on understanding the behavior of catastrophic forgetting from a theoretical standpoint. For example, Bennani et al. 2020; Doan et al. 2021 analyze the generalization of continual learning for Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) in the Neural Tangent Kernel (NTK) (Jacot et al., 2018) regime. Lee et al. 2021; Asanuma et al. 2021 explore the impact of task similarity in a teacher-student setting. Evron et al. 2022; Lin et al. 2023 provide a detailed forgetting analysis of the minimum-norm interpolator for the overparameterized linear regression model. However, the existing analyses of forgetting often rely on relatively stringent assumptions that may not be applicable in many scenarios. For example, Bennani et al. 2020; Doan et al. 2021; Evron et al. 2022; Lin et al. 2023 necessitate an overparameterized regime for their analysis, which may be invalid when involving large datasets. Moreover, Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023; Swartworth et al. 2023 assume that data follows a Gaussian distribution that may not hold in real-world datasets exhibiting more complex distributions. Evron et al. 2022; Lin et al. 2023 focus on the minimum-norm interpolator, where each task requires achieving zero loss on its training samples and hence can find a closed-form solution.

In this paper, we investigate the behavior of forgetting under the linear regression model via the more practical Stochastic Gradient Descent (SGD) method and provide a general theoretical analysis that is applicable to both over-parameterized and under-parameterized regimes. Our main contributions can be summarized as follows:

First, our work provides a theoretical analysis for multi-step SGD algorithms in both underparameterized and overparameterized regimes, with the data satisfying general fourth-moment conditions rather than following a Gaussian distribution as in existing studies. Specifically, we provide a novel upper bound on the model forgetting, as well as a matching lower bound that shows the tightness of our characterization. Our bounds are stated as a function of $\mathbf{1)}$ the spectrum of the population data covariance matrices for each task, $\mathbf{2)}$ the step size, $\mathbf{3)}$ the number of training samples and $\mathbf{4)}$ the effective dimensions.

Second, our study provides some interesting insights into the impact of task sequence and algorithmic parameters on the degree of forgetting. Specifically, we show that when the data size is sufficiently large, forgetting tends to escalate when we postpone the training of tasks whose population data covariance matrices possess larger eigenvalues. It is intuitive that when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance. In addition, our findings reveal that an appropriate choice of step size can help mitigate forgetting in both underparameterized and overparameterized settings. Note that these results cannot be derived from existing works due to their restrictive data distribution assumptions or closed-form updating rules. More detailed discussions can be found in Section 4.

Finally, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs) to validate our theoretical analysis. Our simulation results indicate that both linear regression models and DNNs exhibit increased forgetting when tasks with larger eigenvalues are encountered later. Additionally, we demonstrate that smaller step sizes in training can also mitigate forgetting across task sequences, especially in under-parameterized settings. Interestingly, we observe that in over-parameterized DNNs, higher dimensionality does not necessarily equate to more forgetting if the dataset size is fixed, in contrast to the linear regression case.

1.1 Related Work

In this section, we discuss related work on Covariate Shift, SGD analysis in linear regression, and theoretical studies for catastrophic forgetting.

Covariate Shift Covariate shift is a specific set-up in machine learning (Pan & Yang, 2009; Sugiyama & Kawanabe, 2012), referring to a distribution mismatch between the training and test data. The concept is typically applied in transfer learning, which can be seen as a particular instance of continual learning, generally involving two tasks. For example, Mohri & Medina 2012; Cortes & Mohri 2014; Kpotufe & Martinet 2018; Cortes et al. 2019; Hanneke & Kpotufe 2020; Ma et al. 2023; Wu et al. 2022b examine the (regularized) empirical risk minimizer, which focuses on minimizing the empirical and generalization error across accessible datasets. Nevertheless, the standard covariate shift is defined over two distinct data distributions, and therefore cannot be directly applied to our case. Consequently, we propose an extended version in Definition 2.2 to better suit our context.

SGD Analysis Recently, several studies have investigated the behavior of Stochastic Gradient Descent (SGD) in linear regression models through the lens of bias-variance decomposition (Défossez & Bach, 2015; Dieuleveut et al., 2017; Jain et al., 2017, 2018) and the eigen-decomposition of the covariance matrix (Chen et al., 2020; Zou et al., 2021; Wu et al., 2022a, b). Our work closely relates to the studies in Zou et al. 2021; Wu et al. 2022b, which also characterize the SGD dynamics in linear regression with respect to the full eigenspectrum of the data covariance matrix. However, they focus on either the single-task setting or the pretraining-finetuning setting, while we study the more challenging continual learning problem that involves a sequence of tasks with different data distributions. Section 4 provides further discussion.

Theoretical Studies in Continual Learning Although significant progress has been made in empirical studies addressing the issue of forgetting in continual learning, theoretical insights into this area are still largely unexplored. In this context, Bennani et al. 2020 established a theoretical framework to study continual learning algorithms in the NTK regime, and provided the first generalization bound dependent on task similarity for SGD and OGD. Doan et al. 2021 introduced the NTK overlap matrix as a task similarity metric and proposed a data-structure-informed variant of OGD that utilizes Principal Component Analysis (PCA). Asanuma et al. 2021 utilized the teacher-student framework on a single neural network and demonstrated that catastrophic forgetting can be circumvented when the similarity among input distributions is small and the similarity among teacher networks is large. Lee et al. 2021 expanded an earlier analysis of two-layer networks within the teacher-student setup to the setting with multiple teachers and revealed that the highest level of forgetting occurs when tasks have intermediate similarity with each other. Evron et al. 2022; Swartworth et al. 2023 explained the behavior of forgetting in the linear regression model from the perspectives of alternating projections and the Kaczmarz method (Kaczmarz, 1937). Lin et al. 2023 investigated the impact of overparameterization, task similarity, and task ordering on forgetting and generalization in the overparameterized linear regression model.

The works most relevant to our study include (Evron et al., 2022; Lin et al., 2023), both of which also studied the behavior of forgetting in the linear regression model. However, our work differs from their studies in several aspects.

First, with regard to assumptions, Evron et al. 2022 assumed that all data are bounded by 1 in norm and that the model is noiseless, and Lin et al. 2023 assumed that all data are sampled from a Gaussian distribution. In contrast, our assumptions cover more data distributions and are much milder than theirs (see Remark 2 and Section 4 for more details). Second, in terms of methods, both Evron et al. 2022 and Lin et al. 2023 analyze the problem of forgetting using the minimum norm solution, which presupposes zero training error, a requirement not necessary in our approach with SGD (see Section 2 for further discussions). Third, Evron et al. 2022; Lin et al. 2023 considered only the overparameterized case where the data dimension is larger than the data size, while our analysis holds for both the underparameterized and overparameterized settings.

Notations: In this paper, we adhere to a consistent notation style for clarity. We use boldface lowercase letters such as $\mathbf{x},\mathbf{w}$ for vectors, and boldface capital letters (e.g. $\mathbf{A},\mathbf{H}$ ) for matrices. Let $\|\mathbf{A}\|_{2}$ denote the spectral norm of $\mathbf{A}$ and $\|\mathbf{v}\|_{2}$ denote the Euclidean norm of $\mathbf{v}$ . For two vectors $\mathbf{u}$ and $\mathbf{v}$ , their inner product is denoted by $\langle\mathbf{u},\mathbf{v}\rangle$ or $\mathbf{u}^{\top}\mathbf{v}$ . For two matrices $\mathbf{A}$ and $\mathbf{B}$ of appropriate dimension, their inner product is defined as $\langle\mathbf{A},\mathbf{B}\rangle:=\operatorname{tr}(\mathbf{A}^{\top}\mathbf{B})$ . For a positive semi-definite (PSD) matrix $\mathbf{A}$ and a vector $\mathbf{v}$ of appropriate dimension, we write $\|\mathbf{v}\|_{\mathbf{A}}^{2}:=\mathbf{v}^{\top}\mathbf{Av}$ . The outer product is denoted by $\otimes$ .

2 Preliminaries

In our setup, we consider a sequence of tasks, denoted as $\mathbb{M}=\{1,2,\ldots,M\}$ . For each task $m$ in this sequence, we have a corresponding dataset $D_{m}$ , which consists of $N$ data points. Each of these data points, denoted as $(\mathbf{x}_{m,i},y_{m,i})$ , is drawn independently and identically (i.i.d.) from a specific distribution $\mathcal{D}_{m}=\mathcal{X}_{m}\times\mathcal{Y}_{m}\subset\mathbb{R}^{d}\times\mathbb{R}$ . Here, $\mathbf{x}_{m,i}$ represents the feature vector, and $y_{m,i}$ is the response variable for each data point in the dataset $D_{m}$ . Assume that $\{(\mathbf{x}_{m,i},y_{m,i})\}_{i=1}^{N}$ are sampled i.i.d. from a linear regression model, i.e., each pair $(\mathbf{x}_{m,i},y_{m,i})$ is a realization of the linear regression model $y_{m}=\mathbf{x}_{m}^{\top}\mathbf{w}^{*}+z_{m}$ , where $z_{m}$ is some randomized noise and $\mathbf{w}^{*}\in\mathbb{R}^{d}$ is the optimal model parameter.

Our goal is to output a model $\mathbf{w}_{MN}$ minimizing the degree of forgetting (Evron et al., 2022) for $M$ tasks, i.e.

$$G(M)=\frac{1}{M}\sum_{m=1}^{M}\mathcal{L}_{m}(\mathbf{w}_{MN}),\quad\text{where}\quad\mathcal{L}_{m}(\mathbf{w})=\frac{1}{2}\mathbb{E}_{(\mathbf{x}_{m},y_{m})\sim\mathcal{D}_{m}}\|\mathbf{x}_{m}^{\top}\mathbf{w}-y_{m}\|^{2},\quad m\in\mathbb{M}.\tag{1}$$

Here, $\mathbf{w}_{MN}$ represents the final output after sequentially training on $M$ tasks, each updated via SGD over $N$ iterations. Equation 1 quantifies an average excess population risk on the final output $\mathbf{w}_{MN}$ across all tasks. For each task $m$ , the loss $\mathcal{L}_{m}$ evaluates how well $\mathbf{w}_{MN}$ performs on it, thus assessing the degree of the model’s forgetting on previous tasks in continual learning scenarios.

Definition 2.1 (Data Covariance).

Assume that each entry and the trace of $\mathbb{E}[\mathbf{x}_{m}\mathbf{x}_{m}^{\top}]$ are finite. Define $\mathbf{H}_{m}:=\mathbb{E}[\mathbf{x}_{m}\mathbf{x}_{m}^{\top}]$ as the data covariance matrix.

Let the eigendecomposition of the data covariance for task $m$ be $\mathbf{H}_{m}=\sum_{i}\lambda_{m}^{i}\mathbf{v}_{m}^{i}{\mathbf{v}_{m}^{i}}^{\top}$ , where $(\lambda_{m}^{i})_{i\geq 1}$ are the eigenvalues in nonincreasing order and $(\mathbf{v}_{m}^{i})_{i\geq 1}$ are the corresponding eigenvectors. Define $\mathbf{H}_{m,k_{1}:k_{2}}:=\sum_{k_{1} < i\leq k_{2}}\lambda_{m}^{i}\mathbf{v}_{m}^{i}{\mathbf{v}_{m}^{i}}^{\top}$ , and allow $k_{2}=\infty$ , so that $\mathbf{H}_{m,k:\infty}=\sum_{i>k}\lambda_{m}^{i}\mathbf{v}_{m}^{i}{\mathbf{v}_{m}^{i}}^{\top}$ .
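As an illustration of the head/tail split $\mathbf{H}_{m,k_{1}:k_{2}}$ defined above, the following is a minimal NumPy sketch. The function name `eigen_split` and the random test matrix are hypothetical choices made for this example only; they are not part of the analysis.

```python
import numpy as np

def eigen_split(H, k1, k2=None):
    """Return H_{k1:k2} = sum_{k1 < i <= k2} lambda_i v_i v_i^T (i is 1-indexed).

    Passing k2=None plays the role of k2 = infinity (keep the whole tail).
    """
    lam, V = np.linalg.eigh(H)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]         # reorder to nonincreasing
    lam, V = lam[order], V[:, order]
    k2 = len(lam) if k2 is None else k2
    sel = slice(k1, k2)                   # 1-indexed (k1, k2] == 0-indexed [k1, k2)
    return (V[:, sel] * lam[sel]) @ V[:, sel].T

# Hypothetical covariance matrix used only to exercise the function.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T / 5
head = eigen_split(H, 0, 2)   # top-2 eigen-directions, H_{0:2}
tail = eigen_split(H, 2)      # remaining directions, H_{2:inf}
```

By construction the head and the tail add back up to the full covariance, which is the property the bounds in Section 3 rely on when treating the two parts separately.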

Definition 2.2 (Covariate Shift).

For each task $m$ , the covariates $\mathbf{x}_{m,1}$ , $\ldots$ , $\mathbf{x}_{m,N}$ are i.i.d. drawn from $\mathcal{D}_{m}$ .

Compared to the concept of covariate shift in transfer learning (Pathak et al., 2022), Definition 2.2 provides a more general scenario applicable to a series of tasks $M\geq 2$ . For simplicity, in our analysis, we assume that each task $m$ in our model consists of $N$ data points, differentiating it from transfer learning approaches that typically consider the total dataset size as $N$ .

Assumption 2.3 (Fourth moment conditions).

Assume that for each task $m$ , the expected fourth moment of covariates, denoted as $\mathcal{M}:=\mathbb{E}[\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}]$ , and the expected covariance matrix $\mathbf{H}_{m}$ are finite. Moreover:

(A) There exists a constant $\alpha_{m}>0$ such that for any Positive Semi-Definite (PSD) matrix $\mathbf{A}$ , the following holds:

$$\mathbb{E}[\mathbf{x}_{m}\mathbf{x}_{m}^{\top}\mathbf{A}\mathbf{x}_{m}\mathbf{x}_{m}^{\top}]\preceq\alpha_{m}\cdot\operatorname{tr}(\mathbf{H}_{m}\mathbf{A})\mathbf{H}_{m}.$$

(B) There exists a constant $\beta_{m}>0$ , such that for every PSD matrix $\mathbf{A}$ , the following holds:

$$\mathbb{E}[\mathbf{x}_{m}\mathbf{x}_{m}^{\top}\mathbf{A}\mathbf{x}_{m}\mathbf{x}_{m}^{\top}]-\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}\succeq\beta_{m}\cdot\operatorname{tr}(\mathbf{H}_{m}\mathbf{A})\mathbf{H}_{m}.$$

Remark 1.

Assumption 2.3 is commonly employed in the analysis of linear regression with SGD (Zou et al., 2021; Wu et al., 2022a, b), and it is much weaker than the assumptions in the aforementioned related work. Specifically, it can be verified that Assumption 2.3 holds with ${\alpha_{m}}=3$ and $\beta_{m}=1$ for the Gaussian distribution discussed in (Asanuma et al., 2021; Lee et al., 2021; Lin et al., 2023). Additionally, Assumption 2.3 (A) can be relaxed to $\mathbb{E}\|\mathbf{x}_{m}\|_{2}^{2}\leq\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})$ with $\mathbf{A}=\mathbf{I}$ , where $\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})=1$ is assumed in Evron et al. 2022.
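The Gaussian case in Remark 1 can be checked numerically. The sketch below assumes zero-mean Gaussian covariates $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{H})$ and a symmetric PSD matrix $\mathbf{A}$, for which Isserlis' theorem gives $\mathbb{E}[\mathbf{x}\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\mathbf{x}^{\top}]=2\mathbf{H}\mathbf{A}\mathbf{H}+\operatorname{tr}(\mathbf{H}\mathbf{A})\mathbf{H}$. The helper names and the random matrices are hypothetical and serve only as an illustration.

```python
import numpy as np

def gaussian_fourth_moment(H, A):
    """E[x x^T A x x^T] for x ~ N(0, H) and symmetric A (Isserlis' theorem)."""
    return 2.0 * H @ A @ H + np.trace(H @ A) * H

def min_eig(S):
    """Smallest eigenvalue of the symmetric part of S."""
    return np.linalg.eigvalsh((S + S.T) / 2).min()

# Hypothetical PSD matrices H (covariance) and A (test matrix).
rng = np.random.default_rng(0)
d = 6
B = rng.standard_normal((d, d))
H = B @ B.T / d
C = rng.standard_normal((d, d))
A = C @ C.T

M4 = gaussian_fourth_moment(H, A)
t = np.trace(H @ A)
gap_upper = min_eig(3.0 * t * H - M4)              # Assumption 2.3 (A) with alpha = 3
gap_lower = min_eig(M4 - H @ A @ H - 1.0 * t * H)  # Assumption 2.3 (B) with beta = 1
```

Both gaps being nonnegative corresponds to the two matrix inequalities holding for this particular pair $(\mathbf{H},\mathbf{A})$; the general statement follows from $\mathbf{H}\mathbf{A}\mathbf{H}\preceq\operatorname{tr}(\mathbf{H}\mathbf{A})\mathbf{H}$ for PSD $\mathbf{A}$.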

Assumption 2.4 (Well-specified noise).

Assume that for each distribution of task $m$ , the response (conditional on input covariates) is given by $y_{m}=\mathbf{x}_{m}^{\top}\mathbf{w}^{*}+z_{m}$ , where $z_{m}\sim\mathcal{N}(0,\sigma^{2})$ and $z_{m}$ is independent of $\mathbf{x}_{m}$ .

Similar to previous works, we assume that $z_{m}$ is some randomized noise that satisfies $\mathbb{E}[z_{m}|\mathbf{x}]=0$ and $\mathbb{E}[z_{m}^{2}]=\sigma^{2}$ for each task $m$ .

Continual Learning via SGD

Suppose we train the model parameter $\mathbf{w}$ sequentially. Let $\mathbf{w}_{(m-1)N+N}$ represent the parameter state after the completion of training on task $m$ , which also serves as the initial condition for the training of task $m+1$ . Starting with $\mathbf{w}_{0}$ and employing a constant step size $\eta$ , the model is updated by SGD for each task $m\in\mathbb{M}$ over $N$ iterations, with $t=1,\ldots,N$ :

$$\begin{aligned}
\mathbf{w}_{(m-1)N+t}&=\mathbf{w}_{(m-1)N+t-1}-\eta\cdot\mathbf{g}_{m,t},\quad\text{and}\\
\mathbf{g}_{m,t}&:=(\mathbf{x}_{m,t}^{\top}\mathbf{w}_{(m-1)N+t-1}-y_{m,t})\mathbf{x}_{m,t},
\end{aligned}\tag{2}$$

where $\mathbf{g}_{m,t}$ represents the gradient of the loss function at task $m$ and iteration $t$ for a given data point $(\mathbf{x}_{m,t},y_{m,t})$ .
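The update in Equation 2 and the forgetting measure in Equation 1 can be simulated together. The sketch below is only an illustration under stated assumptions: covariates are drawn from a Gaussian $\mathcal{N}(\mathbf{0},\mathbf{H}_{m})$ as a convenient stand-in for the general fourth-moment setting, the covariances are hypothetical diagonal matrices, and the population loss is evaluated in closed form as $\mathcal{L}_{m}(\mathbf{w})=\frac{1}{2}\|\mathbf{w}-\mathbf{w}^{*}\|_{\mathbf{H}_{m}}^{2}+\frac{1}{2}\sigma^{2}$, which holds under Assumption 2.4.

```python
import numpy as np

def continual_sgd(covs, w_star, w0, eta, N, sigma, rng):
    """Equation 2: N one-sample SGD steps per task, tasks visited in the given order."""
    w = w0.copy()
    for H in covs:
        L = np.linalg.cholesky(H)
        for _ in range(N):
            x = L @ rng.standard_normal(len(w))            # x ~ N(0, H), Gaussian stand-in
            y = x @ w_star + sigma * rng.standard_normal()  # well-specified noise
            w = w - eta * (x @ w - y) * x                   # g = (x^T w - y) x
    return w

def forgetting(w, covs, w_star, sigma):
    """G(M) from Equation 1, using L_m(w) = 0.5*||w - w*||_{H_m}^2 + 0.5*sigma^2."""
    delta = w - w_star
    return float(np.mean([0.5 * delta @ H @ delta + 0.5 * sigma**2 for H in covs]))

# Hypothetical three-task sequence with shrinking spectra.
rng = np.random.default_rng(0)
d, N, sigma, eta = 5, 500, 0.1, 0.1
base = np.diag(0.5 ** np.arange(1, d + 1))   # eigenvalues 1/2, 1/4, ...; trace below 1
covs = [1.0 * base, 0.5 * base, 0.25 * base]
w_star, w0 = np.ones(d), np.zeros(d)

g_init = forgetting(w0, covs, w_star, sigma)
w_final = continual_sgd(covs, w_star, w0, eta, N, sigma, rng)
g_final = forgetting(w_final, covs, w_star, sigma)
```

Reordering `covs` or changing `eta` in this sketch is one simple way to probe the task-order and step-size effects discussed in Section 4, although a single run is noisy and several seeds would be needed for any conclusion.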

In contrast, the minimum norm solution in linear regression, particularly relevant in overparameterized settings, aims to find a weight vector $\mathbf{w}$ that not only achieves zero training error but also possesses the minimal possible norm. Here, $\mathbf{w}_{m}$ represents the parameter obtained after training on task $m$ , and it also serves as the starting point for training task $m+1$ . The objective, beginning from an initial condition $\mathbf{w}_{0}=\mathbf{0}$ , is defined by the following optimization problem:

$$\min_{\mathbf{w}}\|\mathbf{w}-\mathbf{w}_{m-1}\|_{2},\quad\text{s.t. }(\mathbf{X}_{m})^{\top}\mathbf{w}=\mathbf{y}_{m},$$

where $\mathbf{X}_{m}:=[\mathbf{x}_{m,1},\ldots,\mathbf{x}_{m,N}]\in\mathbb{R}^{d\times N}$ and $\mathbf{y}_{m}=[y_{m,1},\ldots,y_{m,N}]^{\top}\in\mathbb{R}^{N}$ . The update rule for each task is as follows:

$$\mathbf{w}_{m}=\mathbf{w}_{m-1}+\mathbf{X}_{m}(\mathbf{X}_{m}^{\top}\mathbf{X}_{m})^{-1}(\mathbf{y}_{m}-\mathbf{X}_{m}^{\top}\mathbf{w}_{m-1}),\tag{3}$$

which highlights the computational cost of inverting the matrix $\mathbf{X}_{m}^{\top}\mathbf{X}_{m}$ . This is particularly challenging for large datasets or overparameterized feature spaces. Unlike the minimum norm solution, SGD does not assume the existence of a unique, exact solution and is more adaptable to a variety of problems, including those with non-linear dynamics.
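For comparison with SGD, the closed-form update in Equation 3 can be sketched as follows. The example assumes the overparameterized case $d\geq N$ with $\mathbf{X}_{m}$ of full column rank, so that $\mathbf{X}_{m}^{\top}\mathbf{X}_{m}$ is invertible; the function name and the random data are hypothetical.

```python
import numpy as np

def min_norm_update(w_prev, X, y):
    """Equation 3: the point closest to w_prev that interpolates the task (X^T w = y).

    X has shape (d, N); d >= N and full column rank are assumed.
    """
    residual = y - X.T @ w_prev
    # Solve the N x N system instead of forming an explicit inverse.
    return w_prev + X @ np.linalg.solve(X.T @ X, residual)

# Hypothetical overparameterized task: d = 20 features, N = 5 samples.
rng = np.random.default_rng(0)
d, N = 20, 5
X = rng.standard_normal((d, N))
y = rng.standard_normal(N)
w_prev = rng.standard_normal(d)
w_new = min_norm_update(w_prev, X, y)
```

The update moves $\mathbf{w}_{m-1}$ only within the column span of $\mathbf{X}_{m}$, which is why it is the minimum-distance interpolating solution; the linear solve is the step whose cost grows with $N$.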

3 Main Results

Before presenting our upper bound, we shall establish the following notations to facilitate comprehension of the results.

$$\left\{\begin{aligned}
\Gamma_{(p,q)}^{i}&:=\prod_{j=p}^{q}(1-\eta\lambda_{j}^{i})^{2N},\quad\bm{\Gamma}_{p}^{q}:=\prod_{j=p}^{q}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N},\\
\mathbf{U}_{k_{m}^{*}}&:=\mathbf{I}_{m,0:k_{m}^{*}}+N\eta\mathbf{H}_{m,k_{m}^{*}:\infty},\quad\Lambda^{i}:=\sum_{m=1}^{M}\lambda_{m}^{i},
\end{aligned}\right.\tag{4}$$

where $(\lambda_{m}^{i})_{i\geq 1}$ are eigenvalues of $\mathbf{H}_{m}$ in a nonincreasing order and $k_{m}^{*}=\max\{i:\lambda_{m}^{i}\geq\frac{1}{N\eta}\}$ represents the cut-off index for $\mathbf{H}_{m}$ . Here, $\Gamma_{(p,q)}^{i}$ and $\boldsymbol{\Gamma}_{p}^{q}$ can be regarded as a projection accumulation from task $p$ to task $q$ , and basically capture the impact of the learning dynamic of previous tasks on the subsequent task. $\mathbf{U}_{k_{m}^{*}}$ is defined with respect to the cut-off index $k_{m}^{*}$ for each task’s data covariance matrix $\mathbf{H}_{m}$ that captures both the dominant eigenvalues and the tail of the spectrum, and $\Lambda^{i}$ denotes the sum of the $i$ -th eigenvalue across all tasks.
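The scalar quantities in Equation 4 are straightforward to compute from the task spectra. The sketch below stores the eigenvalues in an array `lams` of shape $(M,d)$ with each row sorted in nonincreasing order; the array values, $N$ and $\eta$ are hypothetical, and the function names are introduced only for this example.

```python
import numpy as np

def cutoff_index(lam, N, eta):
    """k* = max{i : lam_i >= 1/(N*eta)} for nonincreasing lam (1-indexed, 0 if none)."""
    return int(np.sum(lam >= 1.0 / (N * eta)))

def gamma(lams, p, q, N, eta):
    """Gamma^i_{(p,q)} = prod_{j=p..q} (1 - eta*lam_j^i)^(2N), tasks 1-indexed.

    An empty range (p > q) gives the empty product, i.e. a vector of ones.
    """
    return np.prod((1.0 - eta * lams[p - 1:q]) ** (2 * N), axis=0)

# Hypothetical spectra for M = 2 tasks in d = 4 dimensions.
lams = np.array([[0.5, 0.2, 0.01, 0.001],
                 [0.4, 0.1, 0.05, 0.002]])
N, eta = 100, 0.5                      # threshold 1/(N*eta) = 0.02

k_star = [cutoff_index(l, N, eta) for l in lams]
Lambda = lams.sum(axis=0)              # Lambda^i: sum of i-th eigenvalues over tasks
G_all = gamma(lams, 1, 2, N, eta)      # Gamma^i_{(1,M)}
```

In this example the cut-off indices differ across tasks because the second task keeps one more eigenvalue above the threshold $1/(N\eta)$.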

In the following, we first provide our upper bound for the behavior of forgetting via SGD in the linear regression model.

Theorem 3.1 (Upper Bound).

Consider a scenario where the model $\mathbf{w}$ is trained via SGD on $M$ distinct tasks in the order $1,\ldots,M$ , with each task $m$ run for $N$ iterations. Let the constant step size satisfy $\eta\leq 1/R^{2}$ , where $R^{2}=\max\{\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})\}_{m=1}^{M}$ . If Assumptions 2.3 (A) and 2.4 are satisfied, the following holds:

$$G(M)\leq\text{err}_{\text{var}}+\text{err}_{\text{bias}},$$

where the variance and bias errors are upper-bounded by

$$\begin{aligned}
\text{err}_{\text{var}}&\leq\frac{\sum_{m=1}^{M}}{M}\cdot\frac{\eta\sigma^{2}}{(1-\eta R^{2})}\cdot D_{1}^{\text{eff}},\\
\text{err}_{\text{bias}}&\leq\frac{\sum_{k=1}^{M}}{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}}^{2}\\
&\quad+\frac{\sum_{m=1}^{M}}{M}\frac{2\alpha_{m}\eta^{2}\cdot(D_{2}^{\text{eff}}+\Phi_{1}^{m-1}D_{3}^{\text{eff}})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}\\
&\quad+\frac{\sum_{m=1}^{M}}{M}\alpha_{m}\eta\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}(\mathbf{H}_{m}+\Phi_{1}^{m-1}\mathbf{I})\cdot\mathbf{U}_{k_{m}^{*}}}^{2},
\end{aligned}$$

where the effective dimensions are given by

$$\begin{aligned}
D_{1}^{\text{eff}}&:=\sum_{i < k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i},\\
D_{2}^{\text{eff}}&:=\sum_{i < k_{m}^{*}}\Gamma^{i}_{(1,M)}(\lambda_{m}^{i})^{2}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma^{i}_{(1,M)}(\lambda_{m}^{i})^{3}\Lambda^{i},\\
D_{3}^{\text{eff}}&:=\sum_{i < k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})\Lambda^{i}+\eta N\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\Lambda^{i},
\end{aligned}\tag{5}$$

with $k_{m}^{*},\Gamma_{(p,q)}^{i}$ and $\bm{\Gamma}_{p}^{q}$ defined as in Equation 4, and $\Phi_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j}\cdot\langle\mathbf{H}_{k-1},\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$ .
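To make the effective dimension concrete, the following sketch evaluates $D_{1}^{\text{eff}}$ for a given task from the eigenvalue array alone. It follows the index ranges exactly as printed in the definition (strict inequalities $i < k_{m}^{*}$ and $i > k_{m}^{*}$, with $i$ 1-indexed); the function name and the spectra are hypothetical.

```python
import numpy as np

def d1_eff(lams, m, N, eta):
    """D_1^eff for task m (1-indexed), given eigenvalues lams of shape (M, d).

    Each row of lams is assumed sorted in nonincreasing order.
    """
    M, d = lams.shape
    lam_m = lams[m - 1]
    k = int(np.sum(lam_m >= 1.0 / (N * eta)))                 # cut-off index k*_m
    Gam = np.prod((1.0 - eta * lams[m:]) ** (2 * N), axis=0)  # Gamma^i_{(m+1,M)}
    Lam = lams.sum(axis=0)                                    # Lambda^i
    i = np.arange(1, d + 1)                                   # 1-indexed eigen-index
    head = np.sum(Gam[i < k] * Lam[i < k])
    tail = N * eta * np.sum(Gam[i > k] * lam_m[i > k] * Lam[i > k])
    return float(head + tail)

# Hypothetical spectra for M = 2 tasks in d = 4 dimensions.
lams = np.array([[0.5, 0.2, 0.01, 0.001],
                 [0.4, 0.1, 0.05, 0.002]])
value_last_task = d1_eff(lams, m=2, N=100, eta=0.5)
```

For the last task the product defining $\Gamma_{(m+1,M)}^{i}$ is empty and equals one, so the value reduces to a head sum of $\Lambda^{i}$ plus a tail sum weighted by $N\eta\lambda_{m}^{i}$.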

In Theorem 3.1, we establish an upper bound on the forgetting of a model trained using SGD in continual learning across tasks with different data distributions. It highlights that the performance of the model is influenced by both $\text{err}_{\text{var}}$ and $\text{err}_{\text{bias}}$ , where $\text{err}_{\text{var}}$ stems from the noise inherent in the model and $\text{err}_{\text{bias}}$ represents the bias associated with the initial value $\mathbf{w}_{0}$ during the learning process. Both are determined jointly by the spectrum of the covariance matrices and the step size used for continual learning.

To provide a more intuitive explanation, we explore a simplified scenario by setting $\eta=0$ . Specifically, this setting simplifies our analysis by reducing the error terms to only the first term in the bias error, which appears to depend solely on the initial weight $\mathbf{w}_{0}$ and the data. However, this simplification might misleadingly imply that a minimal $\eta$ would result in optimal learning outcomes. A crucial aspect overlooked in this interpretation is the role of the projection term $\bm{\Gamma}_{1}^{M}=\prod_{j=1}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}$ , which becomes the identity matrix $\mathbf{I}$ when $\eta=0$ . Thus, while setting $\eta=0$ eliminates the other error terms, it also exacerbates the first term of the bias error, potentially making it the most significant error contributor. Consequently, there exists a trade-off in choosing the step size.
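The behavior of the first bias term as $\eta$ varies can be illustrated numerically. The sketch below makes a simplifying assumption that is not required by the theorem: all $\mathbf{H}_{k}$ are diagonal in a shared basis, so that $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}}^{2}=\sum_{i}\Gamma_{(1,M)}^{i}\lambda_{k}^{i}\delta_{i}^{2}$ with $\delta=\mathbf{w}_{0}-\mathbf{w}^{*}$. The spectra and the function name are hypothetical.

```python
import numpy as np

def first_bias_term(lams, delta_sq, N, eta):
    """(1/M) * sum_k ||w0 - w*||^2_{Gamma_1^M H_k} for diagonal, commuting H_k.

    lams has shape (M, d); delta_sq[i] is the squared i-th coordinate of w0 - w*.
    """
    Gam = np.prod((1.0 - eta * lams) ** (2 * N), axis=0)   # Gamma^i_{(1,M)}
    return float(np.mean(lams @ (Gam * delta_sq)))

# Hypothetical spectra and initial error.
lams = np.array([[0.5, 0.2, 0.01, 0.001],
                 [0.4, 0.1, 0.05, 0.002]])
delta_sq = np.ones(4)
N = 100
values = [first_bias_term(lams, delta_sq, N, eta) for eta in (0.0, 0.1, 0.5)]
```

At $\eta=0$ the projection is the identity and the term equals the untrained average loss; it shrinks as $\eta$ grows within the admissible range, while the remaining terms of the bound, which scale with $\eta$, move in the opposite direction.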

The subsequent theorem presents a nearly matching lower bound.

Theorem 3.2 (Lower Bound).

$$G(M)\geq\text{err}_{\text{var}}+\text{err}_{\text{bias}},$$

where the variance and bias errors are lower bounded by

$$\begin{aligned}
\text{err}_{\text{var}}&\geq\frac{\sum_{m=1}^{M}}{M}\cdot\frac{9\eta^{2}\sigma^{2}}{20}\cdot D_{1}^{\text{eff}},\\
\text{err}_{\text{bias}}&\geq\frac{\sum_{k=1}^{M}}{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}}^{2}\\
&\quad+\frac{\sum_{m=1}^{M}}{M}\cdot\frac{\beta_{m}^{2}\eta^{2}}{25}\cdot(D_{2}^{\text{eff}}+\hat{\Phi}_{1}^{m-1}D_{3}^{\text{eff}})\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}\\
&\quad+\frac{\sum_{m=1}^{M}}{M}\frac{\beta_{m}\eta^{2}}{5}\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}(\mathbf{H}_{m}+\hat{\Phi}_{1}^{m-1}\mathbf{I})\cdot\mathbf{U}_{k_{m}^{*}}}^{2},
\end{aligned}$$

where the effective dimensions, $k_{m}^{*},\Gamma_{(p,q)}^{i}$ and $\bm{\Gamma}_{p}^{q}$ are the same as in Theorem 3.1, and $\hat{\Phi}_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}(\frac{\eta}{2})^{j}\cdot\langle\mathbf{H}_{k-1},(\mathbf{I}-(\mathbf{I}-\eta{\mathbf{H}_{m-1}})^{2N})\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$ .

Analogous to Theorem 3.1, our lower bound also consists of a bias term and a variance term. It is noteworthy that our lower bound matches the upper bound in the variance term, differing only by absolute constants. Additionally, our lower bound closely matches the upper bound in the bias term, with some differences arising from the following quantities

$$\hat{\Phi}_{1}^{m-1}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2},\quad\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}}.$$

Specifically, $\hat{\Phi}_{1}^{m-1}$ here differs from ${\Phi}_{1}^{m-1}$ in Theorem 3.1 only by constant factors (i.e. $\alpha_{k}$ and $\beta_{k}$ defined in Assumption 2.3). The term $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}}$ has an additional subscript factor $(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}$ compared to the corresponding term of the upper bound. Nevertheless, it can be regarded as a part of the projection accumulation $\bm{\Gamma}_{1}^{M}$ that appears in the subscript of both results.

More importantly, we show that the upper and lower bounds match, ignoring constant factors, under the conditions

$$\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}\lesssim\sigma^{2},\quad\hat{\Phi}_{1}^{m-1}\lesssim O(1),$$

which are satisfied when the signal-to-noise ratio $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}/\sigma^{2}$ is bounded and the step size is appropriately small.

4 Discussion

Building on Theorem 3.1 and Theorem 3.2, we aim to offer a more comprehensive understanding of our findings from three key perspectives: 1) Technical Understanding Under Simplified Cases; 2) Comparison with Existing Work; 3) The Impact of Task Ordering and Parameters on Forgetting.

4.1 Technical Understanding Under Simplified Cases

In this section, we demonstrate how to achieve a vanishing bound in the overparameterized regime.

Based on Theorem 3.1, we consider a scenario where $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{2}^{2},\sigma^{2}\lesssim 1$ and $\operatorname{tr}(\mathbf{H}_{m})\simeq 1$ for each task $m$ , implying a rapid decay in the spectrum of $\mathbf{H}_{m}$ . To obtain a vanishing bound in the overparameterized regime, the effective dimensions should satisfy

$$\begin{aligned}
D_{1}^{\text{eff}}\simeq D_{3}^{\text{eff}}&=o(\frac{MN}{e^{(M-m)}}),\\
D_{2}^{\text{eff}}&=o(\frac{MN}{e^{M}}).
\end{aligned}\tag{6}$$

To meet the condition in Equation 6, for each task $\widetilde{m}$ , let $k^{\dagger}=\min\{k_{m}^{*},k_{\widetilde{m}}^{*}\}$ and $k^{\star}=\max\{k_{m}^{*},k_{\widetilde{m}}^{*}\}$ . It necessarily holds that

$$\begin{aligned}
\sum_{i < k^{\star}}\lambda_{\widetilde{m}}^{i}&\simeq\sum_{i < k^{\star}}\lambda_{m}^{i}\lambda_{\widetilde{m}}^{i}\simeq\sum_{i < k^{\star}}(\lambda_{m}^{i})^{2}\lambda_{\widetilde{m}}^{i}=o({N}),\\
\sum_{i>k^{\dagger}}\lambda_{\widetilde{m}}^{i}\lambda_{m}^{i}&\simeq\sum_{i>k^{\dagger}}(\lambda_{m}^{i})^{2}\lambda_{\widetilde{m}}^{i}\simeq\sum_{i>k^{\dagger}}(\lambda_{m}^{i})^{3}\lambda_{\widetilde{m}}^{i}=o(\frac{1}{N}).
\end{aligned}\tag{7}$$

To clarify Equation 7, note the crucial cut-off indices $k^{\star}$ and $k^{\dagger}$ , which divide the entire feature space into $k^{\star}$ -dimensional and $k^{\dagger}$ -dimensional subspaces. To achieve a diminishing bound in the overparameterized setting, it is necessary that the sums of eigenvalues over indices less than $k^{\star}$ , denoted as $\sum_{i < k^{\star}}$ , should be $o(N)$ , and the sums of the tail eigenvalues over indices greater than $k^{\dagger}$ , denoted as $\sum_{i>k^{\dagger}}$ , should be $o(\frac{1}{N})$ . These conditions are typically met when the dataset size $N$ is sufficiently large, or when a smaller step size $\eta$ is chosen depending on $N$ . Additionally, we note that the condition in Equation 7 can be relaxed. In light of the definition of $k_{m}^{*}$ , the eigenvalues for task $\widetilde{m}$ are truncated according to the following two scenarios: $\mathbf{1})$ $k_{m}^{*} < k_{\widetilde{m}}^{*}$ : Here, the cut-off for task $\widetilde{m}$ occurs earlier, resulting in an additional $(k_{\widetilde{m}}^{*}-k_{m}^{*})$ dimensions of eigenvalues such that $\lambda_{\widetilde{m}}^{i}\geq 1/(N\eta)$ . To achieve a diminishing bound under this condition, it is necessary that $\sum_{k_{m}^{*}\leq i\leq k_{\widetilde{m}}^{*}}\lambda_{\widetilde{m}}^{i}=o(N)$ . $\mathbf{2})$ $k_{m}^{*}\geq k_{\widetilde{m}}^{*}$ : In this case, the cut-off for task $\widetilde{m}$ occurs later, involving an additional $(k_{m}^{*}-k_{\widetilde{m}}^{*})$ dimensions of eigenvalues where $\lambda_{\widetilde{m}}^{i}\leq 1/(N\eta)$ , achieving the same results.

In the under-parameterized regime, we account even for the worst-case scenario where $\lambda_{m}^{i}\geq\frac{1}{N\eta}$ for every index $i$ and task $m$ , leading to the bounds $D_{1}^{\text{eff}}\simeq D_{3}^{\text{eff}}=o(\frac{Md\lambda_{m}^{1}}{e^{(M-m)}})$ and $D_{2}^{\text{eff}}=o(\frac{Md\lambda_{m}^{1}}{e^{M}})$ .

4.2 Comparison with Existing Work

In this section, we will first explore the challenges and parallels between traditional/transfer learning and continual learning. Secondly, we examine how restrictive assumptions in previous studies might overshadow the impact of key factors, thereby affecting the overall understanding of forgetting in continual learning.

Our results reveal that, compared to traditional learning (Zou et al., 2021), which typically involves a single task, and transfer learning (Wu et al., 2022b), which usually incorporates two data distributions, the effective dimension in continual learning scenarios is more complex. Specifically, in our analysis, the term $\Lambda^{i}$ arises from a distinct measurement perspective (i.e. forgetting), which requires us to consider how the final output aligns with all previously encountered tasks in continual learning (i.e. $\mathbf{H}_{m}$ for all $m$ ). This is in contrast to both traditional training and transfer learning, where the evaluation metric is uniformly focused on performance against a single dataset (i.e. $\mathbf{H}_{M}$ ). Moreover, the multi-task nature of continual learning introduces unique challenges in analyzing the bias iterates and variance iterates; we refer to the proofs in the Appendix for more details.

Given that our analysis, similar to theirs, characterizes bounds with the full eigenspectrum of the data covariance matrix, it follows that our derived results match their findings in several aspects: $\mathbf{1)}$ The cutoff index $k_{m}^{*}$ is uniquely determined for each task $m$ in continual learning, akin to the one in Zou et al. 2021; Wu et al. 2022b, where they identify corresponding indices $k_{\text{training }}^{*}$ and $k_{\text{test }}^{*}$ . $\mathbf{2)}$ The projection terms ${\Gamma}_{(p,q)}^{i}$ and $\bm{\Gamma}_{p}^{q}$ also occur in transfer learning (Wu et al., 2022b), showing how previous iterations/past learning is projected onto the future updates.

Previous work (Evron et al., 2022) also explored the dynamics of forgetting through the perspective of projection. We first revisit the findings presented by Evron et al. 2022. Considering a scenario where the number of iterations $N=1$ , the update rule in their analysis can be reformulated as follows:

\mathbf{w}_{m}-\mathbf{w}^{*}=(\mathbf{I}-\eta_{m}\mathbf{x}_{m}\mathbf{x}_{m}^{\top})(\mathbf{w}_{m-1}-\mathbf{w}^{*}),

(8)

where they incorporate the noiseless model assumption that $y_{m}=\mathbf{x}_{m}^{\top}\mathbf{w}^{*}$ . As a result, the forgetting in Evron et al. 2022 holds that

	$\displaystyle G(M)=\frac{1}{M}\sum_{m=1}^{M}\|\mathbf{x}_{m}^{\top}(\mathbf{w}_{M}-\mathbf{w}^{*})\|_{2}^{2},\quad\text{given}\quad\|\mathbf{x}_{m}\|_{2}\leq 1$
	$\displaystyle\leq\frac{1}{M}\sum_{m=1}^{M}\|(\mathbf{I}-\eta_{M}\mathbf{x}_{M}\mathbf{x}_{M}^{\top})\ldots(\mathbf{I}-\eta_{1}\mathbf{x}_{1}\mathbf{x}_{1}^{\top})(\mathbf{w}_{0}-\mathbf{w}^{*})\|_{2}^{2},$

indicating that the forgetting dynamic is determined by the projections $(\mathbf{I}-\eta_{m}\mathbf{x}_{m}\mathbf{x}_{m}^{\top})$ , where $\eta_{m}=\|\mathbf{x}_{m}\|^{-2}$ . Their study differs from ours in several key respects. $\mathbf{1)}$ The inherent model noise: Evron et al. 2022 consider a noiseless model, which results in the absence of an additional iterative term $\mathbf{x}_{m}\cdot z_{m}$ related to noise in Equation 8. This omission removes the accumulated variance error from the evaluation of forgetting (i.e. $\text{err}_{\text{var}}$ in our analysis). It is worth noting that in numerous learning problems, the variance error often plays a dominant role in the total error (Jain et al., 2018; Zou et al., 2021; Wu et al., 2022b). $\mathbf{2)}$ The bounded norm $\|\mathbf{x}\|_{2}$ : the bounded-norm assumption omits the interaction with projection effects, which is crucial in our analysis and appears as the factor $\Lambda^{i}$ in Theorem 3.1 and Theorem 3.2. $\mathbf{3)}$ Last iterate SGD results: Evron et al. 2022 show that, with a $\|\mathbf{x}_{m}\|_{2}^{-2}$ step size, their worst-case expected forgetting admits a dimension-dependent bound of $O(d/M)$ . This analysis, conducted under the overparameterized regime, suggests the occurrence of catastrophic forgetting. In contrast, our results, as discussed earlier, offer a different perspective, suggesting that a vanishing forgetting bound can be achieved in overparameterized settings when certain conditions are met.
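To make the projection view of Equation 8 concrete, the following minimal sketch simulates the noiseless single-sample update with step size $\eta_{m}=\|\mathbf{x}_{m}\|^{-2}$. The dimension, number of tasks, random seed, and Gaussian sampling of $\mathbf{x}_{m}$ are illustrative assumptions and are not taken from Evron et al. 2022.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_tasks = 20, 5                      # illustrative sizes (assumed)
w_star = rng.normal(size=d)               # shared ground-truth parameter
w = np.zeros(d)                           # w_0 = 0
xs = []

for m in range(num_tasks):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))      # enforce ||x_m||_2 <= 1
    y = x @ w_star                        # noiseless label y_m = x_m^T w*
    eta = 1.0 / (x @ x)                   # eta_m = ||x_m||^{-2}
    # Equivalent to w_m - w* = (I - eta_m x_m x_m^T)(w_{m-1} - w*).
    w = w + eta * x * (y - x @ w)
    xs.append(x)

# The most recent task is fit exactly, since the update is a projection.
residual_last = xs[-1] @ (w - w_star)
# Forgetting G(M): average squared residual over all tasks seen so far.
forgetting = float(np.mean([(x @ (w - w_star)) ** 2 for x in xs]))
print(residual_last, forgetting)
```

Because each factor is an orthogonal projection and $\|\mathbf{x}_{m}\|_{2}\leq 1$, the forgetting in this sketch cannot exceed $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{2}^{2}$, which matches the inequality above.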

We note that Lin et al. 2023 also investigate the relationship between catastrophic forgetting and factors such as task sequence (order) and dimensionality. However, their results tend to be vacuous in the under-parameterized setting, since the inverse $(\mathbf{X}_{m}\mathbf{X}_{m}^{\top})^{-1}$ required by the minimum norm solution does not exist there ( $\mathbf{X}_{m}$ denotes the data matrix for task $m$ ), as we discussed earlier in Section 2. Due to space constraints, a more extensive discussion is provided in Appendix D.

4.3 The Impact of Task Ordering and Parameters on Forgetting

In the upcoming discussion, we will present theoretical insights derived from our results.

Notice that the bounds in Theorem 3.1 and Theorem 3.2 contain two crucial factors: the effective dimension $D^{\text{eff}}$ and the covariance accumulation $\hat{\Phi}_{1}^{m-1}$ / ${\Phi}_{1}^{m-1}$ . We first discuss the effective dimension. Each $D^{\text{eff}}$ consists of a projection term $\Gamma^{i}_{(m,M)}$ and the eigenvalues $\lambda_{m}^{i}$ , with $\Lambda^{i}$ serving as a constant. When the data size $N$ approaches infinity, the projection term converges to $\frac{1}{e^{M-m}}$ , so the eigenvalues predominantly determine the effective dimension through the ratio $\frac{\lambda_{m}^{i}}{e^{M-m}}$ . This observation highlights the substantial influence of eigenvalues on task sequence in continual learning. Specifically, when the data size is sufficiently large, task sequences in which tasks with larger eigenvalues in their population data covariance matrices are trained later exhibit more forgetting. Additionally, if the step size is appropriately small, the projection term stabilizes to a constant less than 1, leading to similar outcomes as in the first scenario. It is noteworthy that these insights cannot be derived from existing analyses due to their restrictive assumptions, such as Gaussian data distribution in Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023 and minimum norm solution in Evron et al. 2022; Lin et al. 2023; Swartworth et al. 2023.
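The limiting behaviour of the projection term can be checked numerically. The sketch below assumes that the projection term takes the product form $\prod_{j}(1-\eta\lambda_{j}^{i})^{N}$ over the later tasks, as it appears in the variance derivation of Appendix B, and it places every eigenvalue exactly at the threshold $1/(N\eta)$ purely for illustration; the helper `projection_term` is a hypothetical name.

```python
import math

def projection_term(eigs_later_tasks, step_size, n_samples):
    """Product of (1 - eta * lambda_j)^N over the tasks trained after task m (assumed form)."""
    prod = 1.0
    for lam in eigs_later_tasks:
        prod *= (1.0 - step_size * lam) ** n_samples
    return prod

eta = 0.01
later_tasks = 3                       # plays the role of M - m
for n in (10**2, 10**4, 10**6):
    lam = 1.0 / (n * eta)             # eigenvalue placed at the threshold 1/(N*eta)
    gamma = projection_term([lam] * later_tasks, eta, n)
    print(n, gamma, math.exp(-later_tasks))
```

Under these assumptions each later task contributes a factor $(1-1/N)^{N}$, which approaches $1/e$, so the product approaches $e^{-(M-m)}$ as $N$ grows.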

The covariance accumulation term, $\hat{\Phi}_{1}^{m-1}/\Phi_{1}^{m-1}$ , which includes the covariance matrices $\mathbf{H}_{j\leq m}$ and the step size $\eta$ , plays a crucial role in demonstrating how previously acquired information is retained and influences the model’s adaptability to new tasks. Notably, there is an interesting contradiction in the optimal accumulation order within $\hat{\Phi}_{1}^{m-1}/\Phi_{1}^{m-1}$ compared to the projection term in $\Gamma_{(m,M)}^{i}$ . Specifically, earlier occurrence of $\mathbf{H}_{j}$ with larger expected eigenvalues tends to increase the degree of forgetting. Meanwhile, an important observation is that if the step size is sufficiently small, the impact of the covariance accumulation term becomes less significant. This interplay between the effective dimension and covariance accumulation elucidates the complexities inherent in continual learning scenarios.

5 Empirical Simulation

In this section, we conduct experiments using synthetic data to validate our theoretical results and shed light on the intricate interplay between eigenvalues, step size, and dimensionality.

Experimental Setup In our study, we designed three distinct tasks, denoted as Tasks 1, 2, and 3, each with a different feature space. During the initial simulations, the eigenvalues of the feature covariance for Tasks 1, 2, and 3 were set according to $\lambda_{i}=i^{-3},\lambda_{i}=i^{-2}$ , and $\lambda_{i}=i^{-1}$ respectively. To mimic real-world data imperfections, Gaussian noise with a standard deviation of 0.1 was added to the labels. We assessed the impact of task sequence on the model’s tendency to forget by evaluating six different task orders: [1, 2, 3], [2, 1, 3], [1, 3, 2], [3, 1, 2], [2, 3, 1], and [3, 2, 1].
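A setup of this kind could be generated as in the following minimal sketch. It assumes Gaussian features with a diagonal covariance whose entries follow the stated power laws and a ground-truth vector shared across tasks; neither detail is specified above, and the names `DECAY`, `task_spectrum`, and `sample_task` are hypothetical.

```python
import itertools
import numpy as np

# Task id -> exponent a in lambda_i = i^{-a}, as described in the setup.
DECAY = {1: 3.0, 2: 2.0, 3: 1.0}

def task_spectrum(task_id, dim):
    """Eigenvalues lambda_i = i^{-a} for i = 1, ..., dim."""
    return np.arange(1, dim + 1, dtype=float) ** (-DECAY[task_id])

def sample_task(task_id, n_samples, w_star, rng, noise_std=0.1):
    """Draw features with diagonal covariance (assumed Gaussian) and noisy labels."""
    lam = task_spectrum(task_id, len(w_star))
    x = rng.normal(size=(n_samples, len(w_star))) * np.sqrt(lam)
    y = x @ w_star + noise_std * rng.normal(size=n_samples)
    return x, y

# All six orderings of the three tasks.
task_orders = [list(p) for p in itertools.permutations([1, 2, 3])]
print(task_orders)
```

Scaling standard normal columns by $\sqrt{\lambda_{i}}$ gives a population covariance of $\operatorname{diag}(\lambda_{1},\ldots,\lambda_{d})$, which keeps the eigenvalue ordering of each task explicit.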

5.1 Linear Regression

Training and Evaluation For this experiment, a linear regression model was trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.01 or 0.001. The model was tested in both low-dimensional (10 input features) and high-dimensional (1000 input features) settings. Each task sequence underwent training with various data sizes, ranging from 100 to 950 in increments of 50, and each task was trained for five epochs. The performance of the model was evaluated on each task to calculate the average excess risk (Equation 1), quantifying the degree of forgetting the model experienced.
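The training and evaluation loop described above can be sketched as follows. This is an illustrative reconstruction, not the code used for the reported figures: it assumes single-sample SGD on the squared loss, Gaussian features with diagonal covariance, and an excess risk of the form $\frac{1}{2}(\mathbf{w}-\mathbf{w}^{*})^{\top}\mathbf{H}_{k}(\mathbf{w}-\mathbf{w}^{*})$ averaged over tasks, which is consistent with Lemma A.2 but should be checked against Equation 1.

```python
import numpy as np

def sgd_on_task(w, x, y, step_size, epochs=5):
    """Single-sample SGD on the squared loss, several passes over one task."""
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            w = w - step_size * (xi @ w - yi) * xi
    return w

def average_excess_risk(w, w_star, spectra):
    """Mean over tasks of 0.5 * (w - w*)^T H_k (w - w*), with diagonal H_k (assumed form)."""
    diff = w - w_star
    return float(np.mean([0.5 * np.sum(lam * diff ** 2) for lam in spectra]))

rng = np.random.default_rng(0)
dim, n = 10, 200                                   # low-dimensional setting
w_star = rng.normal(size=dim)
# Spectra of Tasks 1, 2, 3 (lambda_i = i^{-3}, i^{-2}, i^{-1}); task order [1, 2, 3].
spectra = [np.arange(1, dim + 1, dtype=float) ** (-a) for a in (3.0, 2.0, 1.0)]

w = np.zeros(dim)
for lam in spectra:
    x = rng.normal(size=(n, dim)) * np.sqrt(lam)
    y = x @ w_star + 0.1 * rng.normal(size=n)      # label noise with std 0.1
    w = sgd_on_task(w, x, y, step_size=0.01)

risk = average_excess_risk(w, w_star, spectra)
print(risk)
```

Repeating the loop over all six task orders and over the data sizes listed above would give one forgetting curve per ordering.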

Impact of Eigenvalue Sequencing The observations from Figure 1(a) and Figure 1(c) reveal the significant impact of eigenvalue sequencing on forgetting behavior in the under-parameterized regime. Notably, task sequences that are arranged such that tasks with larger eigenvalues (i.e. Task 3 in our case, characterized by $\lambda_{i}=i^{-1}$ ) are trained later in the learning process tend to result in increased forgetting. This empirical finding aligns well with our theoretical analysis (the term $\frac{\lambda_{m}^{i}}{e^{M-m}}$ discussed in Section 4.3). In an under-parameterized setting, or when the eigenvalues decay rapidly, the effective dimension, which is crucial in determining the model’s forgetting performance, is largely influenced by the eigenvalues. Such a pattern is intuitive: when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance.

Impact of Dimensionality Our results, depicted in Figure 1(c) and Figure 1(d), show that in under-parameterized scenarios, performance remains relatively unaffected by an increase in dimensionality. However, in over-parameterized settings, the model tends to exhibit increased forgetting as dimensionality rises, particularly when the data size is kept constant. This highlights the varying impact of dimensionality on model performance in different parameterization contexts. In higher-dimensional settings, the influence of the projection term $\Gamma_{(p,q)}^{i}$ , as shown in Theorem 3.1, diminishes in comparison to the impact of $\Lambda^{i}$ and $\lambda_{i}$ . Consequently, as the number of features in the model increases, the sequence in which tasks are presented becomes less significant in determining the model’s forgetting behavior. This shift implies that, in high-dimensional scenarios, the inherent complexity and the distribution of eigenvalues of the feature space play a more critical role than the sequence of tasks, influencing the model’s learning and retention capabilities.

Impact of Step-size Our results, depicted in Figure 1(e) and Figure 1(f), reveal that a smaller step size effectively reduces forgetting in various task sequences and across different dimensionalities. This trend is especially noticeable in high-dimensional feature spaces, where a reduced step size markedly lowers the rate of forgetting. This observation is in line with the theoretical insights provided in Theorem 3.1 and Theorem 3.2, as smaller step sizes may lead to more refined updates during training, allowing the model to incrementally adjust to new tasks while preserving knowledge from previous ones.

5.2 Implication on DNNs

Next, we adopt the same data generation and task setup as outlined in Section 5.1, but shift our focus to a neural network model. This model comprises an input layer, a hidden layer with ten neurons, and an output layer, and it undergoes a training process akin to that of linear regression.

Impact of Eigenvalue Sequencing In our studies with Deep Neural Networks (DNNs), we still find that task sequences, ending with tasks having larger eigenvalues, tend to exhibit increased forgetting, especially in under-parameterized settings, similar to linear regression models. This indicates that the tendency of overfitting observed in linear models, particularly when tasks with larger eigenvalues are trained later in the sequence, may occur in DNNs as well.

Impact of Dimensionality Our results also reveal the consistent behaviors between DNNs and linear regression concerning dimensionality. In under-parameterized scenarios (Figure 1(k)), forgetting remains stable despite increased dimensionality, while in over-parameterized settings (Figure 1(l)), higher dimensionality leads to more forgetting when data size is fixed. However, the adverse effects of higher dimensions can be alleviated by expanding the dataset size, as demonstrated in Figure 1(j). It is a notable contrast to linear regression, which suggests that the complex structures of DNNs are better suited to manage and learn from high-dimensional data in continual learning scenarios. The different behaviors observed between DNNs and linear regression models will be a potentially interesting direction for future work.

Impact of Step Size Our results, depicted in Figure 1(e) and Figure 1(f), indicate that in under-parameterized settings, a smaller step size significantly lessens the influence of task sequences on forgetting, while in models with high-dimensional features, forgetting can be mitigated even without adjusting the step size.

6 Conclusion

In this work, we contribute to the understanding of catastrophic forgetting in continual learning via a multi-step SGD algorithm. Our theoretical analysis establishes bounds that illustrate the impact of various factors on forgetting, such as the data covariance matrix spectrum, step size, data size, and dimensionality, which cannot be fully captured in previous studies due to their restrictive assumptions. This theoretical understanding is further substantiated through simulations conducted in linear regression models and Deep Neural Networks, which corroborate our theoretical insights.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments

The research of Meng Ding and Jinhui Xu was supported in part by KAUST through grant CRG10-4663.2. Di Wang was supported in part by the baseline funding BAS/1/1689-01-01, funding from the CRG grant URF/1/4663-01-01, REI/1/5232-01-01, REI/1/5332-01-01, FCC/1/1976-49-01 from CBRC of King Abdullah University of Science and Technology (KAUST). Di Wang was also supported by the funding RGC/3/4816-09-01 of the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

References

Aljundi et al. (2018) Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pp. 139–154, 2018.
Asanuma et al. (2021) Asanuma, H., Takagi, S., Nagano, Y., Yoshida, Y., Igarashi, Y., and Okada, M. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. Journal of the Physical Society of Japan, 90(10):104001, 2021.
Bennani et al. (2020) Bennani, M. A., Doan, T., and Sugiyama, M. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv preprint arXiv:2006.11942, 2020.
Chaudhry et al. (2018) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
Chen et al. (2020) Chen, X., Liu, Q., and Tong, X. T. Dimension independent generalization error by stochastic gradient descent. arXiv preprint arXiv:2003.11196, 2020.
Cortes & Mohri (2014) Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
Cortes et al. (2019) Cortes, C., Mohri, M., and Medina, A. M. Adaptation based on generalized discrepancy. The Journal of Machine Learning Research, 20(1):1–30, 2019.
Défossez & Bach (2015) Défossez, A. and Bach, F. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pp. 205–213. PMLR, 2015.
Dieuleveut et al. (2017) Dieuleveut, A., Flammarion, N., and Bach, F. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
Doan et al. (2021) Doan, T., Bennani, M. A., Mazoure, B., Rabusseau, G., and Alquier, P. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. In International Conference on Artificial Intelligence and Statistics, pp. 1072–1080. PMLR, 2021.
Evron et al. (2022) Evron, I., Moroshko, E., Ward, R., Srebro, N., and Soudry, D. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pp. 4028–4079. PMLR, 2022.
Farajtabar et al. (2020) Farajtabar, M., Azizan, N., Mott, A., and Li, A. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp. 3762–3773. PMLR, 2020.
Hanneke & Kpotufe (2020) Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning, 2020.
Hao et al. (2023) Hao, J., Ji, K., and Liu, M. Bilevel coreset selection in continual learning: A new formulation and algorithm. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Jain et al. (2017) Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., Pillutla, V. K., and Sidford, A. A markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv preprint arXiv:1710.09430, 2017.
Jain et al. (2018) Jain, P., Kakade, S., Kidambi, R., Netrapalli, P., and Sidford, A. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of machine learning research, 18, 2018.
Karczmarz (1937) Karczmarz, S. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Let., Cl. Sci. Math. Nat., pp. 355–357, 1937.
Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Kpotufe & Martinet (2018) Kpotufe, S. and Martinet, G. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, pp. 1882–1886. PMLR, 2018.
Lee et al. (2021) Lee, S., Goldt, S., and Saxe, A. Continual learning in the teacher-student setup: Impact of task similarity. In International Conference on Machine Learning, pp. 6109–6119. PMLR, 2021.
Lin et al. (2022) Lin, S., Yang, L., Fan, D., and Zhang, J. Trgp: Trust region gradient projection for continual learning. arXiv preprint arXiv:2202.02931, 2022.
Lin et al. (2023) Lin, S., Ju, P., Liang, Y., and Shroff, N. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023.
Liu & Liu (2022) Liu, H. and Liu, H. Continual learning with recursive gradient optimization. arXiv preprint arXiv:2201.12522, 2022.
Ma et al. (2023) Ma, C., Pathak, R., and Wainwright, M. J. Optimally tackling covariate shift in rkhs-based nonparametric regression, 2023.
McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
Mohri & Medina (2012) Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions, 2012.
Pan & Yang (2009) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
Pathak et al. (2022) Pathak, R., Ma, C., and Wainwright, M. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, pp. 17517–17530. PMLR, 2022.
Riemer et al. (2018) Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
Saha et al. (2021) Saha, G., Garg, I., and Roy, K. Gradient projection memory for continual learning. arXiv preprint arXiv:2103.09762, 2021.
Serra et al. (2018) Serra, J., Suris, D., Miron, M., and Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pp. 4548–4557. PMLR, 2018.
Shin et al. (2017) Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
Sugiyama & Kawanabe (2012) Sugiyama, M. and Kawanabe, M. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press, 2012.
Swartworth et al. (2023) Swartworth, W. J., Needell, D., Ward, R., Kong, M., and Jeong, H. Nearly optimal bounds for cyclic forgetting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X25L5AjHig.
Wu et al. (2022a) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. In International Conference on Machine Learning, pp. 24280–24314. PMLR, 2022a.
Wu et al. (2022b) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. The power and limitation of pretraining-finetuning for linear regression under covariate shift. Advances in Neural Information Processing Systems, 35:33041–33053, 2022b.
Yang et al. (2021) Yang, L., Lin, S., Zhang, J., and Fan, D. Grown: Grow only when necessary for continual learning. arXiv preprint arXiv:2110.00908, 2021.
Yoon et al. (2017) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
Yoon et al. (2019) Yoon, J., Kim, S., Yang, E., and Hwang, S. J. Scalable and order-robust continual learning with additive parameter decomposition. arXiv preprint arXiv:1902.09432, 2019.
Zou et al. (2021) Zou, D., Wu, J., Braverman, V., Gu, Q., and Kakade, S. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory, pp. 4633–4635. PMLR, 2021.

Appendix A Support Lemmas

Notations

For two matrices $\mathbf{A}$ and $\mathbf{B}$ , their inner product is defined as $\langle\mathbf{A},\mathbf{B}\rangle:=\operatorname{tr}(\mathbf{A}^{\top}\mathbf{B})$ . For each task $m\in[M]$ , we define the following linear operators:

\begin{gathered}\mathcal{I}=\mathbf{I}\otimes\mathbf{I},\quad\mathcal{M}_{m}=\mathbb{E}[\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}],\quad\widetilde{\mathcal{M}}_{m}=\mathbf{H}_{m}\otimes\mathbf{H}_{m},\\ \mathcal{T}_{m}=\mathbf{H}_{m}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{H}_{m}-\eta\mathcal{M}_{m},\quad\widetilde{\mathcal{T}}_{m}=\mathbf{H}_{m}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{H}_{m}-\eta\mathbf{H}_{m}\otimes\mathbf{H}_{m}.\end{gathered}

We use the notation $\mathcal{O}\circ\mathbf{A}$ to denote the operator $\mathcal{O}$ acting on a symmetric matrix $\mathbf{A}$ . For example, with these definitions, we have that for a symmetric matrix $\mathbf{A}$ ,

\begin{gathered}\mathcal{I}\circ\mathbf{A}=\mathbf{A},\quad\mathcal{M}_{m}\circ\mathbf{A}=\mathbb{E}[({\mathbf{x}_{m}}^{\top}\mathbf{A}{\mathbf{x}_{m}}){\mathbf{x}_{m}}{\mathbf{x}_{m}}^{\top}],\quad\widetilde{\mathcal{M}}_{m}\circ\mathbf{A}=\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m},\\ (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m})\circ\mathbf{A}=(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{A}(\mathbf{I}-\eta\mathbf{H}_{m}),\\ (\mathcal{I}-\eta\mathcal{T}_{m})\circ\mathbf{A}=\mathbb{E}[(\mathbf{I}-\eta{\mathbf{x}_{m}}{\mathbf{x}_{m}}^{\top})\mathbf{A}(\mathbf{I}-\eta{\mathbf{x}_{m}}{\mathbf{x}_{m}}^{\top})].\end{gathered}
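The identity for $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m}$ follows by expanding $\widetilde{\mathcal{T}}_{m}\circ\mathbf{A}=\mathbf{H}_{m}\mathbf{A}+\mathbf{A}\mathbf{H}_{m}-\eta\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}$. The short numerical check below illustrates this on a random instance; the matrix sizes, the seed, and the helper names `m_tilde` and `t_tilde` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 4, 0.1
q = rng.normal(size=(d, d))
h = q @ q.T / d                       # a random PSD matrix standing in for H_m
a = rng.normal(size=(d, d))
a = a + a.T                           # a random symmetric matrix A

def m_tilde(mat):
    """(H ⊗ H) acting on a matrix: H A H."""
    return h @ mat @ h

def t_tilde(mat):
    """(H ⊗ I + I ⊗ H - eta H ⊗ H) acting on a matrix: H A + A H - eta H A H."""
    return h @ mat + mat @ h - eta * m_tilde(mat)

lhs = a - eta * t_tilde(a)                                   # (I - eta T~) ∘ A
rhs = (np.eye(d) - eta * h) @ a @ (np.eye(d) - eta * h)      # (I - eta H) A (I - eta H)
print(np.max(np.abs(lhs - rhs)))
```

The same pattern extends to $\mathcal{M}_{m}$ and $\mathcal{T}_{m}$ once a fourth-moment model for $\mathbf{x}_{m}$ is fixed.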

It can be readily understood that the following properties are satisfied:

Lemma A.1 (Zou et al., 2021).

An operator $\mathcal{O}$ , when defined on symmetric matrices, is termed a Positive Semi-Definite (PSD) mapping if $\mathbf{A}\succeq 0$ implies $\mathcal{O}\circ\mathbf{A}\succeq 0$ . Consequently, for each task $m\in[M]$ we have:

1. $\mathcal{M}_{m}$ and $\widetilde{\mathcal{M}}_{m}$ are both PSD mappings.

2. $\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m}$ and $\widetilde{\mathcal{T}}_{m}-\mathcal{T}_{m}$ are both PSD mappings.

3. $\mathcal{I}-\eta\mathcal{T}_{m}$ and $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m}$ are both PSD mappings.

4. If $0<\eta<1/\lambda_{m}^{1}$ , then $\widetilde{\mathcal{T}}_{m}^{-1}$ exists, and is a PSD mapping.

5. If $0<\eta<1/(\alpha_{m}\operatorname{tr}(\mathbf{H}_{m}))$ , then $\mathcal{T}_{m}^{-1}\circ\mathbf{A}$ exists for PSD matrix $\mathbf{A}$ , and $\mathcal{T}_{m}^{-1}$ is a PSD mapping.

Then for the SGD iterates, we can consider their associated bias iterates and variance iterates:

		$\displaystyle\begin{cases}\mathbf{B}_{0}=(\mathbf{w}_{0}-\mathbf{w}^{*})(\mathbf{w}_{0}-\mathbf{w}^{*})^{\top},\\ \mathbf{B}_{(m-1)N+t+1}=(\mathcal{I}-\eta\mathcal{T}_{m}(\eta))\circ\mathbf{B}_{(m-1)N+t};\end{cases}$		(9)
		$\displaystyle\begin{cases}\mathbf{C}_{0}=\mathbf{0},&\\ \mathbf{C}_{(m-1)N+t+1}=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}};\end{cases}$		(10)

where $t=0,\ldots,N-1$ and $m=1,\ldots,M$ .

Lemma A.2 (Bias-variance decomposition).

Suppose that Assumption 2.4 holds. Then we have:

\mathbb{E}[\operatorname{ExcessRisk}(\mathbf{w}_{MN})]=\frac{1}{2}\langle\mathbf{H},\mathbf{B}_{MN}\rangle+\frac{1}{2}\langle\mathbf{H},\mathbf{C}_{MN}\rangle.

Appendix B Variance Error

B.1 Upper Bound

The assumption presented below can be inferred from Assumption 2.3 by setting $\mathbf{A}=\mathbf{I}$ , given that $R^{2}=\max\{\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})\}_{m=1}^{M}$ .

Assumption B.1 (Relaxed version).

For each task $m$ , there exists a constant $R\geq 0$ such that:

\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{m}}[\mathbf{x}\mathbf{x}^{\top}\mathbf{x}\mathbf{x}^{\top}]\preceq R^{2}\mathbf{H}_{m}.

Lemma B.2.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$ , then it holds that:

\mathbf{C}_{t}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I},\quad\text{ for every }t=0,1,\ldots,MN.

Proof.

This lemma is derived directly from the Lemmas in (Jain et al., 2018; Zou et al., 2021). To ensure completeness, we include a proof as follows.

We prove the lemma via induction. Initially, for $t=0$ , it is evident that $\mathbf{C}_{0}=\mathbf{0}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}$ . Now, assuming that $\mathbf{C}_{t}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}$ , let us examine $\mathbf{C}_{t+1}$ in light of the variance recursion in Equation 10. For $0\leq t\leq N-1$ and each task $m$ , we have:

$\displaystyle\mathbf{C}_{(m-1)N+t+1}$	$\displaystyle=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}}$
	$\displaystyle\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{I}+\eta^{2}\sigma^{2}\mathbf{H}_{m}$	(11)
	$\displaystyle\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-2\eta\mathbf{H}_{m}+\eta^{2}R^{2}\mathbf{H}_{m})+\eta^{2}\sigma^{2}\mathbf{H}_{m}$
	$\displaystyle\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot\mathbf{I}.$

∎
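The bound of Lemma B.2 can be sanity-checked numerically in a special case. The sketch below assumes Gaussian features with diagonal covariance, for which $\mathbb{E}[(\mathbf{x}^{\top}\mathbf{C}\mathbf{x})\mathbf{x}\mathbf{x}^{\top}]=2\mathbf{H}\mathbf{C}\mathbf{H}+\operatorname{tr}(\mathbf{H}\mathbf{C})\mathbf{H}$ and one may take $R^{2}=2\lambda^{1}+\operatorname{tr}(\mathbf{H})$, together with a noise covariance of $\sigma^{2}\mathbf{H}_{m}$. These modelling choices, the spectra, and the function name `variance_step` are illustrative assumptions rather than part of the proof.

```python
import numpy as np

def variance_step(c, lam, eta, sigma2):
    """One step of the variance recursion (10) for diagonal C and Gaussian x ~ N(0, diag(lam))."""
    # Diagonal of E[(x^T C x) x x^T] = 2 H C H + tr(H C) H under the Gaussian assumption.
    fourth_moment = 2.0 * lam ** 2 * c + np.sum(lam * c) * lam
    return c - 2.0 * eta * lam * c + eta ** 2 * fourth_moment + eta ** 2 * sigma2 * lam

dim, n_steps, sigma2 = 10, 500, 0.01
spectra = [np.arange(1, dim + 1, dtype=float) ** (-a) for a in (3.0, 2.0, 1.0)]
r2 = max(2.0 * lam.max() + lam.sum() for lam in spectra)    # R^2 valid for every task
eta = 0.5 / r2                                              # satisfies eta <= 1 / R^2
bound = eta * sigma2 / (1.0 - eta * r2)

c = np.zeros(dim)                                           # C_0 = 0
max_seen = 0.0
for lam in spectra:                                         # tasks trained sequentially
    for _ in range(n_steps):
        c = variance_step(c, lam, eta, sigma2)
        max_seen = max(max_seen, float(c.max()))

print(max_seen, bound)
```

In this special case the largest diagonal entry of the variance iterate stays below $\frac{\eta\sigma^{2}}{1-\eta R^{2}}$ throughout all tasks, as the induction argument predicts.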

Lemma B.3.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$ , then it holds that:

\mathbf{C}_{MN}\preceq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}+\mathbf{P}_{M},

where $\mathbf{P}_{m}=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ and $\mathbf{C}_{N}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})$ .

Proof.

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$ :

	$\displaystyle\mathbf{C}_{(m-1)N+t+1}$	$\displaystyle=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}}$
		$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\mathcal{M}_{m}\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m}$
		$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}R^{2}\cdot\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot\mathbf{H}_{m}+\eta^{2}\sigma^{2}\mathbf{H}_{m}$
		$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\mathbf{H}_{m},$

where the penultimate inequality is derived from the Lemma B.2.

Hence, after $N$ iterations, we could have the following results for task $m$ :

\mathbf{C}_{(m-1)N+N}\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{(m-1)N}+\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}.

For the first task, Lemma B.5 in (Zou et al., 2021) implies:

\mathbf{C}_{N}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N}).

By combining the aforementioned results and denoting $\mathbf{P}_{m}=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ , we obtain:

\mathbf{C}_{MN}\preceq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}+\mathbf{P}_{M}.

∎

Based on Lemma A.2, the upper bound of the variance error can be expressed as follows:

	$\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{C}_{MN}\rangle$	$\displaystyle\leq\underbrace{\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})\rangle}_{\text{variance term 1}}$
		$\displaystyle+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}\rangle}_{\text{variance term 2}}+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}\rangle}_{\text{variance term 3}}.$		(12)

Let us consider the variance terms separately.

variance term 1	$\displaystyle=\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\rangle$	(13)
	$\displaystyle\leq\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\mathbf{H}_{k},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})\rangle$
	$\displaystyle=\frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\sum_{i}[\prod_{m=2}^{M}(1-\eta\lambda_{m}^{i})^{N}\lambda_{k}^{i}(1-(1-\eta\lambda_{1}^{i})^{N})]$
	$\displaystyle\leq\frac{\eta\sigma^{2}}{1-\eta R^{2}}(\sum_{i<k_{1}^{*}}\Gamma_{(2,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{1}^{*}}\Gamma_{(2,M)}^{i}\lambda_{1}^{i}\Lambda^{i}),$

where the last inequality uses the fact that $1-(1-\eta\lambda_{m}^{i})^{N}\leq\min\{1,\eta N\lambda_{m}^{i}\}$ holds for all $i\geq 1$ .
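For completeness, the elementary inequality behind this step can be justified as follows, writing $\lambda$ for $\lambda_{m}^{i}$ and assuming $0\leq\eta\lambda\leq 1$, which the step size condition $\eta\leq 1/R^{2}$ is intended to guarantee. The second implication is Bernoulli's inequality.

```latex
\begin{aligned}
(1-\eta\lambda)^{N}\geq 0 \quad&\Longrightarrow\quad 1-(1-\eta\lambda)^{N}\leq 1,\\
(1-\eta\lambda)^{N}\geq 1-N\eta\lambda \quad&\Longrightarrow\quad 1-(1-\eta\lambda)^{N}\leq N\eta\lambda .
\end{aligned}
```

Taking the smaller of the two right-hand sides gives the stated bound $\min\{1,\eta N\lambda\}$.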

Before we turn our attention to the second term, we first consider the $\mathbf{P}_{m}$ :

	$\displaystyle\mathbf{P}_{m}$	$\displaystyle=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$
		$\displaystyle=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\mathbf{H}_{m}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}$
		$\displaystyle\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}).$
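The last inequality in the bound on $\mathbf{P}_{m}$ can be verified eigenvalue by eigenvalue. A short worked step, writing $\lambda$ for an eigenvalue of $\mathbf{H}_{m}$ and assuming $0\leq\eta\lambda\leq 1$, is the following geometric-sum argument.

```latex
\eta^{2}\lambda\sum_{t=0}^{N-1}(1-\eta\lambda)^{2t}
\;\leq\;\eta^{2}\lambda\sum_{t=0}^{N-1}(1-\eta\lambda)^{t}
\;=\;\eta\left(1-(1-\eta\lambda)^{N}\right).
```

Multiplying both sides by $\frac{\sigma^{2}}{1-\eta R^{2}}$ recovers the stated bound in the eigenbasis of $\mathbf{H}_{m}$.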

Substituting this bound into variance term 2, we have:

variance term 2	$\displaystyle=\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1% }^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}% \mathbf{P}_{m}\rangle$
	$\displaystyle\leq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\langle% \mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{% \mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}% _{m})^{N})\rangle$
	$\displaystyle\leq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\langle% \mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathbf{I}-\eta\mathbf{H}_{j}% )^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N})\rangle$
	$\displaystyle\leq\sum_{m=1}^{M-1}\frac{\eta\sigma^{2}}{1-\eta R^{2}}(\sum_{i<k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i})$	(14)

Similarly, for the last term, we have:

$\displaystyle\text{variance term 3}=\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}\rangle\leq\frac{\eta\sigma^{2}}{1-\eta R^{2}}(\sum_{i<k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\lambda_{M}^{i}\Lambda^{i})$	(15)

B.2 Lower Bound

Now we turn to the lower bound of the variance. Similarly, the following lemma holds:

Lemma B.4.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$ , then it holds that:

\mathbf{C}_{MN}\succeq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}+\mathbf{P}_{M}^{\prime},

where $\mathbf{P}_{m}^{\prime}={\eta^{2}\sigma^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta% \widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ and $\mathbf{C}_{N}\succeq\frac{\eta\sigma^{2}}{2}\cdot(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{1})^{2N})$ .

Proof.

In a similar fashion, let’s first examine the recursion of $\mathbf{C}$ from $t=0$ to $t=N-1$ for each task $m$ .

	$\displaystyle\mathbf{C}_{(m-1)N+t+1}$	$\displaystyle=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf% {C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}}$
		$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% \circ\mathbf{C}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m% })\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m}$
		$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m},$

where we utilize the fact that $\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m}$ is a PSD mapping, as established by A.1.

Consequently, after $N$ iterations, the following results can be deduced for task $m$ :

\mathbf{C}_{(m-1)N+N}\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{% H}_{m}}(\eta))^{N}\circ\mathbf{C}_{(m-1)N}+{\eta^{2}\sigma^{2}}\sum_{t=0}^{N-1% }(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ% \mathbf{H}_{m}.
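The unrolling step can be illustrated numerically in the case where the one-step recursion holds with equality. This is a minimal sketch under assumptions: it takes $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}}(\eta))\circ\mathbf{A}=(\mathbf{I}-\eta\mathbf{H})\mathbf{A}(\mathbf{I}-\eta\mathbf{H})$, as used in the computation of $\mathbf{P}_{m}$ above, and uses hypothetical random matrices for $\mathbf{H}_{m}$ and the starting iterate.

```python
import numpy as np

def apply_tilde(h, eta, a):
    """Apply (I - eta * T~_H(eta)) to A, i.e. (I - eta H) A (I - eta H)."""
    step = np.eye(h.shape[0]) - eta * h
    return step @ a @ step

def iterate(h, eta, sigma2, c0, n):
    """Run C <- (I - eta T~) o C + eta^2 sigma^2 H for n steps."""
    c = c0.copy()
    for _ in range(n):
        c = apply_tilde(h, eta, c) + eta**2 * sigma2 * h
    return c

def unrolled(h, eta, sigma2, c0, n):
    """Closed form: (I - eta T~)^n o C0 + eta^2 sigma^2 sum_t (I - eta T~)^t o H."""
    head = c0.copy()
    for _ in range(n):
        head = apply_tilde(h, eta, head)
    tail = np.zeros_like(h)
    term = h.copy()
    for _ in range(n):
        tail += term
        term = apply_tilde(h, eta, term)
    return head + eta**2 * sigma2 * tail

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5))
h = a @ a.T / 5                      # stand-in for H_m
b = rng.standard_normal((5, 5))
c0 = b @ b.T / 5                     # stand-in for C_{(m-1)N}
eta = 0.5 / np.linalg.eigvalsh(h).max()
max_diff = float(np.max(np.abs(iterate(h, eta, 1.0, c0, 30)
                               - unrolled(h, eta, 1.0, c0, 30))))
print(f"max entrywise difference: {max_diff:.3e}")
```

The two computations agree to rounding error, since the map is linear and the noise term is the same at every step; the lemma's inequality then follows because each step of the true recursion dominates this one.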

For the first task, Lemma C.2 of Zou et al. (2021) implies:

\mathbf{C}_{N}\succeq\frac{\eta\sigma^{2}}{2}\cdot(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{1})^{2N}).

By combining the aforementioned results and denoting $\mathbf{P}_{m}^{\prime}={\eta^{2}\sigma^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta% \widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ , we obtain:

\mathbf{C}_{MN}\succeq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}+\mathbf{P}_{M}^{\prime},

which completes the proof. ∎

Drawing from Lemma A.2, the lower bound of the variance error is expressed as follows:

	$\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{C}_{MN}\rangle$	$\displaystyle\geq\underbrace{\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\langle% \mathbf{H}_{k},\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{m}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N})% \rangle}_{\text{variance term 1}^{\prime}}$
		$\displaystyle+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}% \prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(% \eta))^{N}\mathbf{P}_{m}^{\prime}\rangle}_{\text{variance term 2}^{\prime}}+% \underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}^{\prime}\rangle% }_{\text{variance term 3}^{\prime}}.$		(16)

Analogous to the approach for the upper bound, we will examine the terms one by one.

$\displaystyle\text{variance term 1}^{\prime}$	$\displaystyle=\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\langle\mathbf{H}_{k},% \prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{1})^{2N})(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\rangle$
	$\displaystyle=\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\langle\prod_{m=2}^{M}(% \mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k},(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{1})^{2N})\rangle$
	$\displaystyle=\frac{\eta\sigma^{2}}{2}\sum_{k=1}^{M}\sum_{i}[\prod_{m=2}^{M}(1% -\eta\lambda_{m}^{i})^{2N}\lambda_{k}^{i}(1-(1-\eta\lambda_{1}^{i})^{2N})]$
	$\displaystyle\geq\frac{\eta\sigma^{2}}{2}\sum_{i}[\prod_{m=2}^{M}(1-\eta% \lambda_{m}^{i})^{2N}(\sum_{k=1}^{M}\lambda_{k}^{i})(1-(1-\eta\lambda_{1}^{i})% ^{2N})]$	(17)

To further lower bound this term, we note the following inequality:

1-(1-\eta\lambda_{1}^{i})^{2N}\geq\begin{cases}1-(1-\frac{1}{N})^{2N}\geq 1-e^% {-2}\geq\frac{9}{10},&\lambda_{1}^{i}\geq\frac{1}{\eta N},\\ 2N\cdot\eta\lambda_{1}^{i}-\frac{2N(N-1)}{2}\cdot\eta^{2}{\lambda_{1}^{i}}^{2}% \geq\frac{9N}{10}\cdot\eta\lambda_{1}^{i},&\lambda_{1}^{i}<\frac{1}{\eta N}.% \end{cases}

Hence, for the first term, we have:

\text{variance term 1}^{\prime}\geq\frac{9\eta^{2}\sigma^{2}}{20}(\sum_{i<k_{1% }^{*}}\Gamma_{(2,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{1}^{*}}\Gamma_{(2,M)}^{i}% \lambda_{1}^{i}\Lambda^{i}).

For the variance term $2^{\prime}$ , we notice that:

	$\displaystyle\mathbf{P}_{m}^{\prime}$	$\displaystyle={\eta^{2}\sigma^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{% \mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}={\eta^{2}\sigma^{2% }}\sum_{t=0}^{N-1}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2t}\mathbf{H}_{m}$
		$\displaystyle\geq\frac{\eta^{2}\sigma^{2}}{2}(\mathbf{I}-(\mathbf{I}-\eta{% \mathbf{H}_{m}})^{2N})$

Substituting this bound into variance term $2^{\prime}$, we have:

	$\displaystyle\text{variance term 2}^{\prime}$	$\displaystyle=\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1% }^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}% \mathbf{P}_{m}^{\prime}\rangle$
		$\displaystyle\geq\frac{\eta^{2}\sigma^{2}}{2}\sum_{k=1}^{M}\langle\mathbf{H}_{% k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{j}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N})\rangle$
		$\displaystyle=\frac{\eta^{2}\sigma^{2}}{2}\sum_{k=1}^{M}\langle\mathbf{H}_{k},% \sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{j}})^{2N}(% \mathbf{I}-(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N})\rangle$
		$\displaystyle\geq\frac{9\eta^{2}\sigma^{2}}{20}\sum_{m=1}^{M-1}(\sum_{i<k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i}).$

Similarly, for variance term $3^{\prime}$, it holds that:

\text{variance term 3'}\geq\frac{9\eta^{2}\sigma^{2}}{20}(\sum_{i<k_{M}^{*}}% \Gamma_{(M+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{M}^{*}}\Gamma_{(M+1,M)}^{i}% \lambda_{M}^{i}\Lambda^{i}).

Appendix C Bias Error

Before proving the bias bounds, we first introduce the following lemmas for traditional SGD training from Zou et al. (2021).

Lemma C.1 (Summation of bias iterates (Zou et al., 2021)).

Suppose that Assumption 2.3 holds. Suppose that $\eta<1/(\alpha\operatorname{tr}(\mathbf{H}_{m}))$ . Then for every $N\geq 1$ and each task $m$ , it holds that:

\frac{1}{2\eta}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})\preceq% \sum_{t=0}^{N-1}(\mathcal{I}-\eta\cdot\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}% \circ\mathbf{H}_{m}\preceq\frac{1}{\eta}\cdot(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{m})^{2N})
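The two-sided bound can be checked numerically in a simplified setting. The sketch below is an illustration under a stated assumption: it evaluates the sum with the operator acting as $\mathbf{A}\mapsto(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{A}(\mathbf{I}-\eta\mathbf{H}_{m})$, so that the $t$-th summand is $(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}$ (the closed form used for $\mathbf{P}_{m}^{\prime}$ in Appendix B.2). The random matrix and the helper names are hypothetical.

```python
import numpy as np

def lemma_sum(h, eta, n):
    """Compute sum_{t=0}^{n-1} (I - eta H)^{2t} H."""
    d = h.shape[0]
    step_sq = np.linalg.matrix_power(np.eye(d) - eta * h, 2)
    power = np.eye(d)
    total = np.zeros_like(h)
    for _ in range(n):
        total += power @ h
        power = power @ step_sq
    return total

def envelope(h, eta, n):
    """Compute I - (I - eta H)^{2n}."""
    d = h.shape[0]
    return np.eye(d) - np.linalg.matrix_power(np.eye(d) - eta * h, 2 * n)

def min_eig(a):
    """Smallest eigenvalue of the symmetric part of a."""
    return float(np.linalg.eigvalsh((a + a.T) / 2).min())

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
h = a @ a.T / 6                          # stand-in for H_m
eta = 0.9 / np.linalg.eigvalsh(h).max()  # keeps eta * lambda_i within (0, 1)
n = 25
total = lemma_sum(h, eta, n)
lower_gap = min_eig(total - envelope(h, eta, n) / (2 * eta))
upper_gap = min_eig(envelope(h, eta, n) / eta - total)
print(f"lower gap: {lower_gap:.3e}, upper gap: {upper_gap:.3e}")
```

Per eigenvalue $\lambda$ the sum equals $\frac{1-(1-\eta\lambda)^{2N}}{\eta(2-\eta\lambda)}$, which lies between the two envelopes whenever $0<\eta\lambda\leq 1$; both gaps are therefore non-negative up to rounding.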

Lemma C.2.

Under Assumption 2.3, let $\mathbf{B}_{a,b}=\mathbf{B}_{a}-(\mathbf{I}-\eta\mathbf{H}_{m})^{b-a}\mathbf{B}_{a}(\mathbf{I}-\eta\mathbf{H}_{m})^{b-a}$. If the step size satisfies $\eta<1/(\alpha_{m}\operatorname{tr}(\mathbf{H}_{m}))$, then for any $t\leq N$ and each task $m$, it holds that:

\mathbf{S}_{t}\preceq\sum_{k=0}^{t-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{k}(\frac% {\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}% \operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0})(\mathbf{% I}-\eta\mathbf{H}_{m})^{k},

where $\mathbf{S}_{t}=\sum_{k=0}^{t-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{k}\circ\mathbf{B}_{0}$.

Lemma C.3.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$ , then it holds that:

\mathbf{S}_{t}\succeq\frac{\beta_{m}}{4}\operatorname{tr}\left(\left(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{t/2}\right)\mathbf{B}_{0}\right)\cdot\left(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{t/2}\right)+\sum_{k=0}^{t-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{k}\cdot\mathbf{B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{k},

where $\mathbf{S}_{t}=\sum_{k=0}^{t-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{k}\circ\mathbf{B}_{0}$.

C.1 Upper Bound

Lemma C.4.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$ , then it holds that:

\displaystyle\mathbf{B}_{MN}

\displaystyle\preceq\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}+\sum_{m=1}^{M}\prod_{j=m}^{M}(% \mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_% {m},

where $\mathbf{P}_{m}=\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m% })^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle$ and $\prod_{k_{1}}^{k_{2}}=1$ if $k_{1}>k_{2}$ .

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$ :

$\displaystyle\mathbf{B}_{(m-1)N+t+1}$	$\displaystyle=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf% {B}_{(m-1)N+t}$	(18)
	$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% \circ\mathbf{B}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m% })\circ\mathbf{B}_{(m-1)N+t}$
	$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))\circ\mathbf{B}_{(m-1)N+t}+\alpha_{m}\eta^{2}\cdot\mathbf{H}_{m}\cdot% \langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle.$

where the inequality follows from Assumption 2.3.

Hence, after $N$ iterations, we obtain the following result for task $m$:

	$\displaystyle\mathbf{B}_{(m-1)N+N}$	$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathcal% {I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{H}_{m}% \langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle$
		$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% ^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-% \eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N% +t}\rangle$
		$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% ^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-% \eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{% \mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle$
		$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}\mathbf{H% }_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))^{t}\mathbf{B}_{(m-1)N}\rangle$

We now examine the second term for each $m$ :

		$\displaystyle\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{% T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle$
	$\displaystyle=$	$\displaystyle\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{% T}}_{\mathbf{H}_{m}}(\eta))^{t}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1% }}(\eta))^{N}\ldots(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}% \mathbf{B}_{0}\rangle$
	$\displaystyle=$	$\displaystyle\sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}% _{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))% ^{N}\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}% \mathbf{B}_{0}\rangle,$

where we know the following holds:

	$\displaystyle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}$	$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+\eta^{2}(\mathcal{M}_{m-1}-\widetilde{\mathcal{M}}_{m-1})\circ\mathbf{H}_{m}$
		$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}% }(\eta))\circ\mathbf{H}_{m}+\alpha_{m-1}\eta^{2}\cdot\mathbf{H}_{m-1}\cdot% \langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle.$

Moreover, we have $\sum_{t=0}^{N-1}\eta\cdot\mathbf{H}_{m-1}\cdot(\mathcal{I}-\eta\widetilde{% \mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{t}\preceq\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{m-1})^{N}\preceq\mathbf{I}$ . Therefore, it holds that:

\displaystyle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ% \mathbf{H}_{m}\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1% }}(\eta))^{N}\circ\mathbf{H}_{m}+\alpha_{m-1}\eta\cdot\mathbf{I}\cdot\langle% \mathbf{H}_{m-1},\mathbf{H}_{m}\rangle.

It implies:

	$\displaystyle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(% \mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}$	$\displaystyle\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(% \eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(% \eta))^{N}\circ\mathbf{H}_{m}$
		$\displaystyle+\sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j}\cdot\langle% \mathbf{H}_{k-1},\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\rangle\cdot% \langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle\cdot\mathbf{I},$

where we denote $\mathbf{H}_{0}=\mathbf{I}$ and define $\Phi_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j}\cdot\langle\mathbf{H}_{k-1},\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$. Therefore, the sum above is bounded as follows:

		$\displaystyle\sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}+{\Phi_{1}^{m-1}}\cdot\mathbf{I% },(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle$
	$\displaystyle\leq$	$\displaystyle\underbrace{\sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta\widetilde{% \mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{% \mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m},(\mathbf{I}-\eta% \mathbf{H}_{m})^{t}(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1% -\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B% }_{0})(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\rangle}_{\text{term 1}}$
	$\displaystyle+$	$\displaystyle\underbrace{\sum_{t=0}^{N-1}\langle{\Phi_{1}^{m-1}}\cdot\mathbf{I% },(\mathbf{I}-\eta\mathbf{H}_{m})^{t}(\frac{\eta\alpha_{m}\operatorname{tr}(% \mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot% \mathbf{H}_{m}+\mathbf{B}_{0})(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\rangle}_{% \text{term 2}}.$

We first consider the term 1 with Lemma C.2.

	$\displaystyle\text{term 1}=$	$\displaystyle\sum_{t=0}^{N-1}\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H% }_{j})^{2N}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m},(\frac{\eta% \alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{% tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0})\rangle$
	$\displaystyle=$	$\displaystyle\sum_{t=0}^{N-1}\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_% {0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\langle\prod_{j=1}^{% m-1}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}% \mathbf{H}_{m},\mathbf{H}_{m}\rangle+\sum_{t=0}^{N-1}\langle(\mathbf{I}-\eta% \mathbf{H}_{m})^{2t}\mathbf{H}_{m},\mathbf{B}_{0}\rangle$
	$\displaystyle\leq$	$\displaystyle\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha% _{m}\operatorname{tr}(\mathbf{H}_{m})}\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta% \mathbf{H}_{j})^{2N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}),\mathbf{H% }_{m}\rangle+\frac{1}{\eta}\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_% {j})^{2N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}),\mathbf{B}_{0}\rangle$
	$\displaystyle=$	$\displaystyle\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha% _{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Gamma_{(1,m-1)}^{i}[1-(1-\eta% \lambda_{m}^{i})^{N}]\lambda_{m}^{i}+\frac{1}{\eta}\sum_{i}\Gamma_{(1,m-1)}^{i% }{\omega_{i}^{2}}[1-(1-\eta\lambda_{m}^{i})^{N}]$
	$\displaystyle\leq$	$\displaystyle\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha% _{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Gamma_{(1,m-1)}^{i}\min\{1,\eta N% \lambda_{m}^{i}\}+\frac{1}{\eta}\sum_{i}\Gamma_{(1,m-1)}^{i}{\omega_{i}^{2}}% \min\{1,\eta N\lambda_{m}^{i}\}$
	$\displaystyle\leq$	$\displaystyle\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i\leq k_{m}^{*}}\frac{\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2})+\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}$

where $k_{m}^{*}$ is the index of the smallest eigenvalue of $\mathbf{H}_{m}$ satisfying $\lambda_{m}^{k_{m}^{*}}\geq 1/(\eta N)$, and we denote $U_{m}=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i\leq k_{m}^{*}}\frac{\Gamma_{(1,m-1)}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}(\lambda_{m}^{i})^{2})+\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}$.

Moreover, $\operatorname{tr}(\mathbf{B}_{0,N})=\operatorname{tr}(\mathbf{B}_{0}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\mathbf{B}_{0}(\mathbf{I}-\eta\mathbf{H}_{m})^{N})=\sum_{i}(1-(1-\eta\Lambda^{i})^{2N})\cdot(\langle\mathbf{w}_{0}-\mathbf{w}^{*},\mathbf{v}_{i}\rangle)^{2}$. Hence:

\operatorname{tr}(\mathbf{B}_{0,N})\leq 2\sum_{i}\min\{1,N\eta\Lambda^{i}\}(% \langle\mathbf{w}_{0}-\mathbf{w}^{*},\mathbf{v}_{i}\rangle)^{2}\leq 2(\|% \mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,0:{k_{m}^{*}}}}^{2}+N\eta\|% \mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,{k_{m}^{*}}:\infty}}^{2}).
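The trace identity and the first inequality above can be verified on a small example. This sketch rests on assumptions that are not spelled out in this section: it takes $\mathbf{B}_{0}=(\mathbf{w}_{0}-\mathbf{w}^{*})(\mathbf{w}_{0}-\mathbf{w}^{*})^{\top}$, uses the eigenpairs $(\lambda_{i},\mathbf{v}_{i})$ of the same matrix that appears in the definition of $\mathbf{B}_{0,N}$, and draws all quantities at random. The names `delta`, `lam`, and `q` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 8, 40, 0.5

# Build H = Q diag(lam) Q^T with eigenvalues in [0, 1], so eta * lam <= 1.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.sort(rng.uniform(0.0, 1.0, d))[::-1]
h = q @ np.diag(lam) @ q.T

delta = rng.standard_normal(d)        # stands for w_0 - w^*
b0 = np.outer(delta, delta)           # assumed form of B_0

# Direct evaluation of tr(B_0 - (I - eta H)^N B_0 (I - eta H)^N).
step_n = np.linalg.matrix_power(np.eye(d) - eta * h, n)
trace_direct = float(np.trace(b0 - step_n @ b0 @ step_n))

# Spectral form: sum_i (1 - (1 - eta lam_i)^{2N}) <delta, v_i>^2.
omega_sq = (q.T @ delta) ** 2
trace_spectral = float(np.sum((1.0 - (1.0 - eta * lam) ** (2 * n)) * omega_sq))

# Upper bound: 2 * sum_i min{1, N eta lam_i} <delta, v_i>^2.
trace_bound = float(2.0 * np.sum(np.minimum(1.0, n * eta * lam) * omega_sq))

print(trace_direct, trace_spectral, trace_bound)
```

The direct and spectral values coincide up to rounding, and the bound follows from $1-(1-x)^{2N}\leq\min\{1,2Nx\}\leq 2\min\{1,Nx\}$ for $x\in[0,1]$.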

Now we are ready to examine the term 2.

	$\displaystyle\text{term 2}=\sum_{t=0}^{N-1}\langle{\Phi_{1}^{m-1}}\cdot\mathbf% {I}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},(\frac{\eta\alpha_{m}% \operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf% {H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0})\rangle$
	$\displaystyle\leq\sum_{t=0}^{N-1}\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf% {B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\langle{\Phi_{1}% ^{m-1}}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},\mathbf{H}_{m}\rangle+\sum_{t% =0}^{N-1}\langle{\Phi_{1}^{m-1}}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},% \mathbf{B}_{0}\rangle$
	$\displaystyle\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta% \alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\langle{\Phi_{1}^{m-1}}\mathbf{I},% (\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N})\rangle+\frac{1}{\eta}\langle% \sum_{j=1}^{m-1}\langle\mathbf{H}_{j},\alpha^{j}\bm{\Gamma}_{1}^{j-1}\cdot% \mathbf{H}_{m}\rangle\mathbf{H}_{m}^{-1}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}% _{m})^{N}),\mathbf{B}_{0}\rangle$
	$\displaystyle=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta% \alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}{\Phi_{1}^{m-1}}[1-(1-\eta% \lambda_{m}^{i})^{N}]+\frac{1}{\eta}\sum_{i}{\Phi_{1}^{m-1}}^{i}{\omega_{i}^{2% }}(\lambda_{m}^{i})^{-1}[1-(1-\eta\lambda_{m}^{i})^{N}]$
	$\displaystyle\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta% \alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}{\Phi_{1}^{m-1}}\min\{1,% \eta N\lambda_{m}^{i}\}+\frac{1}{\eta}\sum_{i}{\Phi_{1}^{m-1}}{\omega_{i}^{2}}% (\lambda_{m}^{i})^{-1}\min\{1,\eta N\lambda_{m}^{i}\}$
	$\displaystyle\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N}){\Phi_{1}^{m-1}}}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(k_{m}^{*}+N\eta\sum_{i>k_{m}^{*}}(\lambda_{m}^{i}))+\frac{{\Phi_{1}^{m-1}}}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}^{-1}_{m,0:k_{m}^{*}}}^{2}+N{\Phi_{1}^{m-1}}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,k_{m}^{*}:\infty}}^{2}.$

Let us denote $V_{m}=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N}){\Phi_{1}^{m-1}}}{1-% \eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(k_{m}^{*}+N\eta\sum_{i>k_{m}^% {*}}(\lambda_{m}^{i}))+\frac{{\Phi_{1}^{m-1}}}{\eta}\|\mathbf{w}_{0}-\mathbf{w% }^{*}\|_{\mathbf{H}^{-1}_{m,0:k_{m}^{*}}}^{2}+N{\Phi_{1}^{m-1}}\|\mathbf{w}_{0% }-\mathbf{w}^{*}\|_{\mathbf{I}_{m,k_{m}^{*}:\infty}}^{2}$ .

By combining the aforementioned results, we obtain:

\displaystyle\mathbf{B}_{MN}

\displaystyle\preceq\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{% \mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}+\sum_{m=1}^{M}\prod_{j=m}^{M}(% \mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_% {m},

where we denote $\mathbf{P}_{m}=\alpha_{m}\eta^{2}(U_{m}+V_{m})\cdot\mathbf{H}_{m}$.

Based on Lemma A.2, the upper bound of the bias error can be expressed as follows:

\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle\leq% \underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\prod_{m=1}^{M}(\mathcal{I}-% \eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}% \rangle}_{\text{bias term 1}}+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},% \sum_{m=1}^{M}\prod_{j=m}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf% {H}_{j}}(\eta))^{N}\mathbf{P}_{m}\rangle}_{\text{bias term 2}}.

For each $k$ :

	$\displaystyle\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle=\langle\prod_{m=1}^{% M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k},\mathbf{B}_{0}\rangle+% \langle\mathbf{H}_{k},\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathbf{I}-\eta{\mathbf{H}% _{j}})^{2N}\alpha_{m}\eta^{2}(U_{m}+V_{m})\cdot\mathbf{H}_{m}\rangle$
	$\displaystyle\leq\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k}}^{2}$
	$\displaystyle+\sum_{m=1}^{M}\alpha_{m}\eta^{2}(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i<k_{m}^{*}}\frac{\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\lambda_{k}^{i}))$
	$\displaystyle+\sum_{m=1}^{M}\alpha_{m}\eta^{2}(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}^{2}\mathbf{H}_{k})_{k_{m}^{*}:\infty}}^{2})$		(19)
	$\displaystyle+\sum_{m=1}^{M}\alpha_{m}\eta^{2}{\Phi_{1}^{m-1}}(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i<k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}))$
	$\displaystyle+\sum_{m=1}^{M}\alpha_{m}\eta^{2}{\Phi_{1}^{m-1}}(\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}^{-1}_{m})_{0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M})_{k_{m}^{*}:\infty}}^{2})$

Hence,

	$\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle$	$\displaystyle\leq\sum_{k=1}^{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k}}^{2}$
		$\displaystyle+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i<k_{m}^{*}}\frac{\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\lambda_{k}^{i})$
		$\displaystyle+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}\mathbf{H}_{m}^{2})}_{k_{m}^{*}:\infty}}^{2})$
		$\displaystyle+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}{\Phi_{1}^{m-1}}(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}(\sum_{i<k_{m}^{*}}\Gamma_{(m,M)}^{i}\lambda_{k}^{i}(\lambda_{m}^{i})+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}))$
		$\displaystyle+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta{\Phi_{1}^{m-1}}(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}_{k}\mathbf{H}_{m})_{k_{m}^{*}:\infty}}^{2})$

C.2 Lower Bound

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$ :

$\displaystyle\mathbf{B}_{(m-1)N+t+1}$	$\displaystyle=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf% {B}_{(m-1)N+t}$	(20)
	$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% \circ\mathbf{B}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m% })\circ\mathbf{B}_{(m-1)N+t}$
	$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))\circ\mathbf{B}_{(m-1)N+t}+\beta_{m}\eta^{2}\cdot\mathbf{H}_{m}\cdot% \langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle.$

Hence, after $N$ iterations, we obtain the following result for task $m$:

	$\displaystyle\mathbf{B}_{(m-1)N+N}$	$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathcal{% I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{H}_{m}% \langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle$
		$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% ^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta% \mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle$
		$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))% ^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta% \mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{% \mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle$

We now examine the second term for each $m$ :

		$\displaystyle\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle$
	$\displaystyle=$	$\displaystyle\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^% {2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{T}}_{% \mathbf{H}_{m}}(\eta))^{t}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(% \eta))^{N}\ldots(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}% \mathbf{B}_{0}\rangle$		(22)

According to Assumption 2.3 (B), we have:

	$\displaystyle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}$	$\displaystyle=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+\eta^{2}(\mathcal{M}_{m-1}-\widetilde{\mathcal{M}}_{m-1})\circ\mathbf{H}_{m}$
		$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}% }(\eta))\circ\mathbf{H}_{m}+\beta_{m-1}\eta^{2}\cdot\mathbf{H}_{m-1}\cdot% \langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle$
	$\displaystyle\rightarrow(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta% ))^{N}\circ\mathbf{H}_{m}$	$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}% }(\eta))^{N}\circ\mathbf{H}_{m}+\beta_{m-1}\eta^{2}\cdot\sum_{t=0}^{N-1}(% \mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{t}\circ% \mathbf{H}_{m-1}\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle$
		$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}% }(\eta))^{N}\circ\mathbf{H}_{m}+\frac{\beta_{m-1}\eta}{2}\cdot(\mathbf{I}-(% \mathbf{I}-\eta{\mathbf{H}_{m-1}})^{2N})\cdot\langle\mathbf{H}_{m-1},\mathbf{H% }_{m}\rangle.$

Iterating this bound over the preceding tasks, we obtain:

	$\displaystyle(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(% \mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}$	$\displaystyle\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(% \eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(% \eta))^{N}\circ\mathbf{H}_{m}$
		$\displaystyle+\sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}(\frac{\eta}{2})^{j}% \cdot\langle\mathbf{H}_{k-1},(\mathbf{I}-(\mathbf{I}-\eta{\mathbf{H}_{m-1}})^{% 2N})\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle\cdot\mathbf{I}.$

Substituting the above into Equation 22 and denoting $\hat{\Phi}_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}(\frac{\eta}{2})^{j}\cdot\langle\mathbf{H}_{k-1},(\mathbf{I}-(\mathbf{I}-\eta{\mathbf{H}_{m-1}})^{2N})\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$, we have:

	$\displaystyle\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta{\mathcal{% T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle$	$\displaystyle\succeq\sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta{\widetilde{% \mathcal{T}}}_{\mathbf{H}_{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta{\widetilde{% \mathcal{T}}}_{\mathbf{H}_{1}}(\eta))^{N}\mathbf{H}_{m},(\mathcal{I}-\eta{% \mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle$
		$\displaystyle+\sum_{t=0}^{N-1}\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},(\mathcal{% I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle$
		$\displaystyle=\underbrace{\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{% p})^{2N}\mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf% {H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle}_{\text{term 1}}$
		$\displaystyle+\underbrace{\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},\sum_{t=0}^{N-% 1}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}% \rangle}_{\text{term 2}}$

From Lemma C.3, we have:

	$\displaystyle\sum_{t=0}^{N-1}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(% \eta))^{t}\mathbf{B}_{0}$	$\displaystyle\succeq\frac{\beta_{m}}{4}\operatorname{tr}((\mathbf{I}-(\mathbf{% I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0})\cdot(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{m})^{N/2})$
		$\displaystyle+\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\cdot\mathbf{% B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{t}.$

Then, for each task $m$ , we examine the term 1:

	$\displaystyle\text{term 1}=$	$\displaystyle\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}% \mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathcal{I}-\eta{\mathcal{T}}_{\mathbf{H}_{m}}% (\eta))^{t}\mathbf{B}_{0}\rangle$
	$\displaystyle\geq$	$\displaystyle\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}% \mathbf{H}_{m},\frac{\beta_{m}}{4}\operatorname{tr}((\mathbf{I}-(\mathbf{I}-% \eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0})\cdot(\mathbf{I}-(\mathbf{I}-\eta% \mathbf{H}_{m})^{N/2})\rangle$
	$\displaystyle+$	$\displaystyle\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}% \mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\cdot\mathbf% {B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\rangle$
	$\displaystyle=$	$\displaystyle\underbrace{\frac{\beta_{m}}{4}\operatorname{tr}((\mathbf{I}-(% \mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0})\cdot\langle\prod_{p=1}^{m% -1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},(\mathbf{I}-(\mathbf{I}-% \eta\mathbf{H}_{m})^{N/2})\rangle}_{\text{bias term ${b_{1}^{m}}$}}$
	$\displaystyle+$	$\displaystyle\underbrace{\frac{1}{2\eta}\langle\prod_{p=1}^{m-1}(\mathbf{I}-% \eta\mathbf{H}_{p})^{2N}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})% ,\mathbf{B}_{0}\rangle}_{\text{bias term $b_{2}^{m}$}}$
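The passage from the sum over $t$ to bias term $b_{2}^{m}$ is a geometric series in each eigendirection: the exact value is $(1-(1-\eta\lambda)^{2N})/(\eta(2-\eta\lambda))$, which is at least the $\frac{1}{2\eta}$ expression because $2-\eta\lambda\leq 2$. The following minimal sketch checks this numerically for a single eigenvalue, assuming $0<\eta\lambda\leq 1$; the function name and the test values are illustrative and not taken from the paper.

```python
import numpy as np

def geometric_sum_vs_bound(lam, eta, N):
    """Return (exact, bound) for one eigenvalue lam of H_m.

    exact = lam * sum_{t=0}^{N-1} (1 - eta*lam)^(2t)
    bound = (1 - (1 - eta*lam)^(2N)) / (2*eta)
    """
    t = np.arange(N)
    exact = lam * np.sum((1.0 - eta * lam) ** (2 * t))
    bound = (1.0 - (1.0 - eta * lam) ** (2 * N)) / (2.0 * eta)
    return exact, bound

# Illustrative values (assumptions): eta = 0.1, N = 50, eigenvalues with eta*lam <= 1.
eta, N = 0.1, 50
for lam in np.linspace(0.01, 0.999 / eta, 25):
    exact, bound = geometric_sum_vs_bound(lam, eta, N)
    assert exact >= bound - 1e-12
```

The gap between the two quantities is the factor $2/(2-\eta\lambda)\in[1,2]$, so the lower bound loses at most a factor of two in each direction.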

The first bias term can be written as:

{\text{bias term ${b_{1}^{m}}$}}=\frac{\beta_{m}}{4}(\sum_{i}(1-(1-\eta\lambda_{m}^{i})^{N/2})\omega_{i}^{2})\cdot(\sum_{i}\prod_{p=1}^{m-1}(1-\eta\lambda_{p}^{i})^{2N}\lambda_{m}^{i}(1-(1-\eta\lambda_{m}^{i})^{N/2})).

The second bias term is lower bounded by:

{\text{bias term $b_{2}^{m}$}}\geq(\sum_{i}\prod_{p=1}^{m-1}(1-\eta\lambda_{p}^{i})^{2N}(1-(1-\eta\lambda_{m}^{i})^{2N})\omega_{i}^{2}).

To further lower bound the two terms, we notice that:

1-(1-\eta{\lambda_{m}^{i}})^{\frac{N}{2}}\geq\begin{cases}1-(1-\frac{1}{N})^{\frac{N}{2}}\geq 1-e^{-\frac{1}{2}}\geq\frac{1}{5},&{\lambda_{m}^{i}}\geq\frac{1}{\eta N}\\ \frac{N}{2}\cdot\eta{\lambda_{m}^{i}}-\frac{N(N-2)}{8}\cdot\eta^{2}{\lambda_{m}^{i}}^{2}\geq\frac{N}{5}\cdot\eta{\lambda_{m}^{i}},&{\lambda_{m}^{i}}<\frac{1}{\eta N}\end{cases}
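The two-case bound above can be sanity-checked on a grid of eigenvalues. The sketch below is only a numerical illustration under the assumption $0<\eta\lambda<1$; the helper names `lhs` and `case_lower_bound` and the chosen values of $\eta$ and $N$ are hypothetical.

```python
import numpy as np

def lhs(lam, eta, N):
    """Left-hand side 1 - (1 - eta*lam)^(N/2), elementwise over lam."""
    lam = np.asarray(lam, dtype=float)
    return 1.0 - (1.0 - eta * lam) ** (N / 2.0)

def case_lower_bound(lam, eta, N):
    """Right-hand side: 1/5 if lam >= 1/(eta*N), else (N/5)*eta*lam."""
    lam = np.asarray(lam, dtype=float)
    head = lam >= 1.0 / (eta * N)
    return np.where(head, 0.2, (N / 5.0) * eta * lam)

# Illustrative values (assumptions): eta = 0.05, N = 40.
eta, N = 0.05, 40
lam = np.linspace(1e-4, 0.999 / eta, 2000)   # keeps 0 < eta*lam < 1
assert np.all(lhs(lam, eta, N) >= case_lower_bound(lam, eta, N) - 1e-12)
```

The threshold $1/(\eta N)$ is what separates the directions already learned after $N$ steps from those that are still contracting roughly linearly in $N\eta\lambda$.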

Substituting to the previous results, we have:

	bias term ${b_{1}^{m}}$	$\displaystyle\geq\frac{\beta_{m}}{4}(\frac{1}{5}\cdot\sum_{i\leq{k_{m}^{*}}}\omega_{i}^{2}+\frac{\eta N}{5}\sum_{i>{k_{m}^{*}}}(\lambda_{m}^{i})\omega_{i}^{2})\cdot(\frac{1}{5}\cdot\sum_{i\leq{k_{m}^{*}}}\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}+\frac{\eta N}{5}\sum_{i>{k_{m}^{*}}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2})$
		$\displaystyle=\frac{\beta_{m}}{25}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,{0:{k_{m}^{*}}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,{{k_{m}^{*}}:\infty}}}^{2})\cdot(\sum_{i\leq{k_{m}^{*}}}{\Gamma_{(1,m-1)}^{i}}(\lambda_{m}^{i})+{\eta N}\sum_{i>{k_{m}^{*}}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2})$

and

	bias term $b_{2}^{m}$	$\displaystyle\geq(\frac{1}{5}\cdot\sum_{i\leq{k_{m}^{*}}}\Gamma_{(1,m-1)}^{i}\omega_{i}^{2}+\frac{\eta N}{5}\sum_{i>{k_{m}^{*}}}\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}\omega_{i}^{2})$
		$\displaystyle=\frac{1}{5}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N})_{0:{k_{m}^{*}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m})_{{k_{m}^{*}}:\infty}}^{2})$
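To illustrate the head/tail split used for bias term $b_{2}^{m}$, the following sketch compares the full eigenvalue sum with its split lower bound. It rests on several assumptions made only for illustration: all covariance matrices share one eigenbasis, $\Gamma_{(1,m-1)}^{i}$ is read as $\prod_{p=1}^{m-1}(1-\eta\lambda_{p}^{i})^{2N}$, the head indices are those with $\lambda_{m}^{i}\geq 1/(\eta N)$, and $\omega_{i}$ denotes the coordinates of the initial error in that basis. The function name and the synthetic spectra are hypothetical.

```python
import numpy as np

def b2_sum_and_split_bound(lams_prev, lam_m, omega, eta, N):
    """Return (full, bound) for the eigenvalue form of bias term b_2.

    lams_prev: array (m-1, d), eigenvalues of H_1..H_{m-1} (shared eigenbasis assumed)
    lam_m:     array (d,), eigenvalues of H_m
    omega:     array (d,), coordinates of the initial error in that eigenbasis
    """
    # Gamma_i = prod_p (1 - eta*lam_p_i)^(2N); an empty product gives ones.
    gamma = np.prod((1.0 - eta * lams_prev) ** (2 * N), axis=0)
    full = np.sum(gamma * (1.0 - (1.0 - eta * lam_m) ** (2 * N)) * omega**2)
    head = lam_m >= 1.0 / (eta * N)          # large-eigenvalue directions
    bound = (
        0.2 * np.sum(gamma[head] * omega[head] ** 2)
        + 0.2 * eta * N * np.sum(gamma[~head] * lam_m[~head] * omega[~head] ** 2)
    )
    return full, bound

# Synthetic spectra (assumptions): polynomially decaying H_m, random earlier tasks.
rng = np.random.default_rng(0)
d, eta, N = 20, 0.5, 30
lam_m = 1.0 / (1.0 + np.arange(d)) ** 2
lams_prev = rng.uniform(0.0, 1.0, size=(2, d))
omega = rng.standard_normal(d)
full, bound = b2_sum_and_split_bound(lams_prev, lam_m, omega, eta, N)
assert full >= bound
```

The bound holds term by term because $1-(1-\eta\lambda)^{2N}\geq 1-(1-\eta\lambda)^{N/2}$ whenever $0\leq\eta\lambda\leq 1$, so the case analysis for the exponent $N/2$ carries over.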

Now we are ready to examine term 2.

	$\displaystyle\text{term 2}=$	$\displaystyle\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},\sum_{t=0}^{N-1}(\mathcal{I% }-\eta{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle$
	$\displaystyle\geq$	$\displaystyle\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},\frac{\beta_{m}}{4}% \operatorname{tr}((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}% _{0})\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\rangle$
	$\displaystyle+$	$\displaystyle\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},\sum_{t=0}^{N-1}(\mathbf{I}% -\eta\mathbf{H}_{m})^{t}\cdot\mathbf{B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m}% )^{t}\rangle$
	$\displaystyle=$	$\displaystyle\underbrace{\frac{\beta_{m}}{4}\operatorname{tr}((\mathbf{I}-(% \mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0})\cdot\langle\hat{\Phi}_{1}% ^{m-1}\mathbf{I},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\rangle}_{% \text{bias term ${d_{1}^{m}}$}}$
	$\displaystyle+$	$\displaystyle\underbrace{\frac{1}{2\eta}\langle\hat{\Phi}_{1}^{m-1}\mathbf{H}_% {m}^{-1}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}),\mathbf{B}_{0}% \rangle}_{\text{bias term $d_{2}^{m}$}}$

Analogous to term 1, we have:

	bias term ${d_{1}^{m}}$	$\displaystyle=\frac{\beta_{m}}{4}(\sum_{i}(1-(1-\eta\lambda_{m}^{i})^{N/2})\omega_{i}^{2})\cdot(\sum_{i}\hat{\Phi}_{1}^{m-1}\lambda_{m}^{i}(1-(1-\eta\lambda_{m}^{i})^{N/2}))$
		$\displaystyle\geq\frac{\beta_{m}}{25}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,{0:{k_{m}^{*}}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,{{k_{m}^{*}}:\infty}}}^{2})\cdot{\hat{\Phi}_{1}^{m-1}}\cdot(\sum_{i\leq{k_{m}^{*}}}(\lambda_{m}^{i})+{\eta N}\sum_{i>{k_{m}^{*}}}(\lambda_{m}^{i})^{2})$

and

	bias term $d_{2}^{m}$	$\displaystyle\geq(\sum_{i}\hat{\Phi}_{1}^{m-1}(\lambda_{m}^{i})^{-1}(1-(1-\eta\lambda_{m}^{i})^{2N})\omega_{i}^{2})$
		$\displaystyle\geq\hat{\Phi}_{1}^{m-1}(\frac{1}{5}\cdot\sum_{i\leq{k_{m}^{*}}}(\lambda_{m}^{i})^{-1}\omega_{i}^{2}+\frac{\eta N}{5}\sum_{i>{k_{m}^{*}}}\omega_{i}^{2})$
		$\displaystyle=\frac{\hat{\Phi}_{1}^{m-1}}{5}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{H}_{m}^{-1})_{0:{k_{m}^{*}}}}^{2}+N\eta\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I})_{{k_{m}^{*}}:\infty}}^{2})$

After $MN$ iterations, it holds that:

\displaystyle\mathbf{B}_{MN}\succeq\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}+\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m},

where $\mathbf{P}_{m}:=\beta_{m}\eta^{2}(b_{1}^{m}+b_{2}^{m}+d_{1}^{m}+d_{2}^{m})\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{m}$.

Then, the bias error can be represented as follows:

	$\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle$	$\displaystyle\geq\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}\rangle}_{\text{bias term 1'}}+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}\rangle}_{\text{bias term 2'}}$
		$\displaystyle\geq\sum_{k=1}^{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k}}^{2}$
		$\displaystyle+\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}\beta_{m}\eta^{2}(b_{1}^{m}+b_{2}^{m}+d_{1}^{m}+d_{2}^{m})\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{m}\rangle.$

It follows that:

	$\displaystyle\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle$	$\displaystyle\geq\sum_{k=1}^{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k}}^{2}$
		$\displaystyle+\sum_{k=1}^{M}\sum_{i}\lambda_{k}^{i}\cdot\sum_{m=1}^{M}\prod_{j=m}^{M}(1-\eta\lambda_{j}^{i})^{2N}\beta_{m}\eta^{2}(b_{1}^{m}+b_{2}^{m}+d_{1}^{m}+d_{2}^{m})\cdot(1-\eta\lambda_{m}^{i})^{2N}\lambda_{m}^{i}$
		$\displaystyle\geq\sum_{k=1}^{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta{\mathbf{H}_{m}})^{2N}\mathbf{H}_{k}}^{2}$
		$\displaystyle+\sum_{k=1}^{M}\sum_{m=1}^{M}({b_{1}^{m}}^{\prime}+{b_{2}^{m}}^{\prime}+{d_{1}^{m}}^{\prime}+{d_{2}^{m}}^{\prime})$

where

	$\displaystyle{b_{1}^{m}}^{\prime}=$	$\displaystyle\frac{\beta_{m}^{2}\eta^{2}}{25}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,{0:{k_{m}^{*}}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,{{k_{m}^{*}}:\infty}}}^{2})$
		$\displaystyle\cdot(\sum_{i<k_{m}^{*}}{\Gamma_{(1,M)}^{i}\lambda_{k}^{i}(\lambda_{m}^{i})^{2}}+{\eta N}\sum_{i>k_{m}^{*}}\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\lambda_{k}^{i})$
	$\displaystyle{b_{2}^{m}}^{\prime}=$	$\displaystyle\frac{\beta_{m}\eta^{2}}{5}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(1,M)}\mathbf{H}_{m}\mathbf{H}_{k})_{0:{k_{m}^{*}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(1,M)}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{m}^{2}\mathbf{H}_{k})_{{k_{m}^{*}}:\infty}}^{2}),$

and

	$\displaystyle{d_{1}^{m}}^{\prime}=$	$\displaystyle\frac{\beta_{m}}{25}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,{0:{k_{m}^{*}}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,{{k_{m}^{*}}:\infty}}}^{2})$
		$\displaystyle\cdot\hat{\Phi}_{1}^{m-1}\cdot(\sum_{i<k_{m}^{*}}{\Gamma_{(m,M)}^{i}\lambda_{k}^{i}(\lambda_{m}^{i})}+{\eta N}\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i})$
	$\displaystyle{d_{2}^{m}}^{\prime}=$	$\displaystyle\frac{\beta_{m}\eta^{2}\hat{\Phi}_{1}^{m-1}}{5}\cdot(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(m,M)}\mathbf{H}_{k})_{0:{k_{m}^{*}}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(m,M)}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{m}\mathbf{H}_{k})_{{k_{m}^{*}}:\infty}}^{2}).$

Appendix D Extension work

Note that when the step size is set to $\|\mathbf{x}_{m}\|^{-2}$, the update rule for the minimum-norm solution is equivalent to that of last-iterate SGD. Consequently, in this appendix we focus on this particular case (akin to the setting in Lin et al. 2023), which allows a direct comparison under a well-defined set of conditions.

Consider a series of tasks $\mathbb{M}=\{1,2,\ldots,M\}$. For each task $m\in\mathbb{M}$, we are given a dataset $D_{m}=\{(\mathbf{x}_{m,i},y_{m,i})\}_{i=1}^{N}$ drawn i.i.d. from some fixed distribution $\mathcal{D}_{m}=\mathcal{X}_{m}\times\mathcal{Y}_{m}\subset\mathbb{R}^{d}\times\mathbb{R}$. Assume that each $(\mathbf{x}_{m,i},y_{m,i})$ is a realization of the linear regression model $y_{m}=(\mathbf{x}_{m}^{\top}\mathbf{w}_{m}^{*})+z_{m}$, where $z_{m}$ is random noise satisfying the well-specified condition and $\mathbf{w}_{m}^{*}\in\mathbb{R}^{d}$ is the optimal model parameter for task $m$.

We adopt the same learning procedure with this specific step size, aiming to output a model $\mathbf{w}_{M}^{N}$ that minimizes the following performance measure (Lin et al., 2023):

G(\mathbf{w}_{M}^{N})=\frac{1}{M}\sum_{i=1}^{M}\|\mathbf{w}_{M}^{N}-\mathbf{w}% _{i}^{*}\|^{2}.

(23)
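As a concrete illustration of this setup, the sketch below runs single-sample SGD over the tasks in order with step size $\|\mathbf{x}\|^{-2}$ and evaluates the performance measure $G$. It is a minimal simulation under stated assumptions rather than the procedure used in the paper's experiments: the Gaussian features, the noise level, and the function names `performance_G` and `continual_sgd` are illustrative choices.

```python
import numpy as np

def performance_G(w, w_stars):
    """G(w) = (1/M) * sum_i ||w - w_i^*||^2, with task optima stacked as rows (M, d)."""
    return np.mean(np.sum((w_stars - w) ** 2, axis=1))

def continual_sgd(w0, w_stars, N, noise_std, rng):
    """Train on tasks 1..M in order, N single-sample steps each, step size 1/||x||^2."""
    w = w0.copy()
    d = w.shape[0]
    for w_star in w_stars:                       # tasks m = 1, ..., M
        for _ in range(N):
            x = rng.standard_normal(d)           # Gaussian features (assumption)
            y = x @ w_star + noise_std * rng.standard_normal()
            eta = 1.0 / (x @ x)                  # step size ||x||^{-2}
            w = w - eta * x * (x @ w - y)        # SGD step on the squared loss
    return w

# Illustrative run (all values are assumptions): 3 tasks in dimension 10.
rng = np.random.default_rng(0)
w_stars = rng.standard_normal((3, 10))
w_final = continual_sgd(np.zeros(10), w_stars, N=50, noise_std=0.1, rng=rng)
print("G(w_M^N) =", performance_G(w_final, w_stars))
```

With this step size each update makes the current sample fit exactly in the noiseless case, which is why the iterate can be compared with the minimum-norm solution.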

Our results can then be restated as follows.

Theorem D.1.

Consider a scenario where the model $\mathbf{w}$ is trained via SGD on $M$ distinct tasks in the order $1,\ldots,M$, with each task trained for $N$ iterations using the step size $\eta_{m,t}=\|\mathbf{x}_{m,t}\|^{-2}$. If Assumption 2.4 is satisfied, then the following holds:

	$\displaystyle\mathbb{E}[G(\mathbf{w}_{M}^{N})]$	$\displaystyle=\frac{1}{M}\sum_{i=1}^{M}\|\mathbf{w}_{0}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{m=1}^{M}\prod_{t=1}^{N}\left(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,t}}\right)}^{2}$
		$\displaystyle+\frac{1}{M}\sum_{i=1}^{M}\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}\left(\mathbf{I}-{\mathbf{H}_{p}}{\eta_{p,j}}\right)\prod_{q=1}^{N-t}\left(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,q}}\right){\mathbf{H}_{m}}{\eta_{m,q}}}^{2}$
		$\displaystyle+\frac{1}{M}\sum_{i=1}^{M}\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\bm{z}_{m,t}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}\left(\mathbf{I}-{\mathbf{H}_{p}}{\eta_{p,j}}\right)\prod_{q=1}^{N-t}\left(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,q}}\right){\eta_{m,q}}}^{2}.$

Remark 2.

In contrast to the approach in Theorem 3.1 and Theorem 3.2, here we do not rely on the bias-variance decomposition of the error. Instead, we use the fact that, with the specific step size $\eta_{m,t}=\|\mathbf{x}_{m,t}\|_{2}^{-2}$, the projection $(\mathbf{I}-\eta_{m,t}\mathbf{x}_{m,t}\mathbf{x}_{m,t}^{\top})$ is orthogonal to $\eta_{m,t}\mathbf{x}_{m,t}\mathbf{x}_{m,t}^{\top}$. This allows us to derive a closed-form expression for the expected performance, which combines the effects of the initial parameter deviation, the differences between task-specific parameters, and random noise. Furthermore, Theorem D.1 characterizes the performance on general data distributions, going beyond the Gaussian setting discussed in Lin et al. 2023. When there is only a single sample per training iteration, our results can cover their findings.

Proof.

For each iteration, according to the update rule of SGD, it holds that

\mathbf{w}_{m}^{N}=\mathbf{w}_{m}^{N-1}-\eta(\bm{x}_{m,N}((\bm{x}_{m,N})^{\top}\mathbf{w}_{m}^{N-1}-y_{m,N})),

which, using $y_{m,N}=(\bm{x}_{m,N})^{\top}\mathbf{w}_{m}^{*}+z_{m,N}$, can be rewritten as:

\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*}=(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})+\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top}(\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*})+\eta z_{m,N}\bm{x}_{m,N}.
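The per-step error recursion can be checked numerically by expanding $y=\mathbf{x}^{\top}\mathbf{w}_{m}^{*}+z$ in the SGD update. The sketch below does this for one random step, including the cross-task term $\eta\mathbf{x}\mathbf{x}^{\top}(\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*})$, which vanishes when the evaluated task $i$ is the task $m$ being trained. The dimension, step size, noise scale, and the function name are illustrative assumptions.

```python
import numpy as np

def sgd_error_recursion_gap(seed=0, d=6, eta=0.05):
    """Max abs difference between one SGD step's error and its expanded form."""
    rng = np.random.default_rng(seed)
    w_prev = rng.standard_normal(d)       # current iterate
    w_m_star = rng.standard_normal(d)     # optimum of the task being trained
    w_i_star = rng.standard_normal(d)     # optimum of the task being evaluated
    x = rng.standard_normal(d)
    z = 0.1 * rng.standard_normal()       # additive label noise
    y = x @ w_m_star + z

    w_new = w_prev - eta * x * (x @ w_prev - y)      # SGD update
    xxT = np.outer(x, x)
    expanded = (
        (np.eye(d) - eta * xxT) @ (w_prev - w_i_star)
        + eta * xxT @ (w_m_star - w_i_star)
        + eta * z * x
    )
    return np.max(np.abs((w_new - w_i_star) - expanded))

assert sgd_error_recursion_gap() < 1e-12
```

The identity holds for any step size; the specific choice $\|\mathbf{x}\|^{-2}$ only matters once squared norms are taken.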

Taking the expected squared norm of both sides:

		$\displaystyle\mathbb{E}[\|\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*}\|^{2}]$
	$\displaystyle=$	$\displaystyle\mathbb{E}[(\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*})^{\top}(\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*})]$
	$\displaystyle=$	$\displaystyle\mathbb{E}[(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})^{\top}(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})^{\top}(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})+\eta^{2}(z_{m,N}\bm{x}_{m,N})^{\top}(z_{m,N}\bm{x}_{m,N})]$
	$\displaystyle(*)=$	$\displaystyle\mathbb{E}[\|(\mathbf{I}-\eta_{m,N}\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})\|^{2}+\|\eta_{m,N}\bm{x}_{m,N}(\bm{x}_{m,N})^{\top}(\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*})\|^{2}+\|\eta_{m,N}\bm{x}_{m,N}\bm{z}_{m,N}\|^{2}]$
	$\displaystyle=$	$\displaystyle\mathbb{E}[\|\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*}\|_{(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,N}})}^{2}]+{\eta_{m,N}}\sigma^{2}+\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{{\mathbf{H}_{m}}{\eta_{m,N}}}^{2}$
	$\displaystyle=$	$\displaystyle\|\mathbf{w}_{m}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{t=1}^{N}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,t}})}^{2}+\sum_{t=1}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{j=1}^{N-t}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,j}})^{j}{\mathbf{H}_{m}}{\eta_{m,N}}}^{2}+\sum_{t=1}^{N-1}\|\bm{z}_{m,t}\|_{\prod_{j=1}^{N-t}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,j}})^{j}{\eta_{m,N}}}^{2},$

where the (*) equality follows from the choice of step size, under which $(\mathbf{I}-\eta_{m,t}\bm{x}_{m,t}(\bm{x}_{m,t})^{\top})$ and $\eta_{m,t}\bm{x}_{m,t}(\bm{x}_{m,t})^{\top}$ are mutually orthogonal projections; the resulting update coincides with the minimum-norm solution for a single sample.
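The orthogonality used in the (*) step can be verified directly: with step size $\|\mathbf{x}\|^{-2}$, the matrix $\mathbf{x}\mathbf{x}^{\top}/\|\mathbf{x}\|^{2}$ and its complement are both idempotent, their product is zero, and squared norms split accordingly. The following sketch checks these three properties for a random vector; the helper name `projection_pair` and the random inputs are illustrative.

```python
import numpy as np

def projection_pair(x):
    """Return (P, Q) with Q = x x^T / ||x||^2 and P = I - Q."""
    Q = np.outer(x, x) / (x @ x)
    P = np.eye(x.shape[0]) - Q
    return P, Q

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
v = rng.standard_normal(5)
P, Q = projection_pair(x)

assert np.allclose(P @ P, P) and np.allclose(Q @ Q, Q)   # both are projections
assert np.allclose(P @ Q, 0.0)                           # mutually orthogonal
# Pythagorean split of the squared norm, which removes the cross terms in (*).
assert np.isclose(v @ v, (P @ v) @ (P @ v) + (Q @ v) @ (Q @ v))
```

For any other step size the product of the two matrices is nonzero, so the cross terms would not vanish and the closed form would no longer apply.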

Considering $M$ tasks, it holds that

	$\displaystyle\mathbb{E}\|\mathbf{w}_{M}^{N}-\mathbf{w}_{i}^{*}\|^{2}$	$\displaystyle=\|\mathbf{w}_{0}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{m=1}^{M}\prod_{t=1}^{N}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,t}})}^{2}$
		$\displaystyle+\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}(\mathbf{I}-{\mathbf{H}_{p}}{\eta_{p,j}})\prod_{q=1}^{N-t}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,q}}){\mathbf{H}_{m}}{\eta_{m,q}}}^{2}$
		$\displaystyle+\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\bm{z}_{m,t}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}(\mathbf{I}-{\mathbf{H}_{p}}{\eta_{p,j}})\prod_{q=1}^{N-t}(\mathbf{I}-{\mathbf{H}_{m}}{\eta_{m,q}}){\eta_{m,q}}}^{2}.$

Finally, averaging this expression over the tasks $i=1,\ldots,M$ yields the stated result. ∎

