Understanding Forgetting in Continual Learning with Linear Regression:
Overparameterized and Underparameterized Regimes

Meng Ding    Kaiyi Ji    Di Wang    Jinhui Xu
Abstract

Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently. Despite the tremendous progress made in the past, the theoretical understanding, especially factors contributing to catastrophic forgetting, remains relatively unexplored. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model via Stochastic Gradient Descent (SGD) applicable to both under-parameterized and overparameterized regimes. Our theoretical framework reveals some interesting insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence—where tasks with larger eigenvalues in their population data covariance matrices are trained later—tends to result in increased forgetting. Additionally, our findings highlight that an appropriate choice of step size will help mitigate forgetting in both under-parameterized and overparameterized settings. To validate our theoretical analysis, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs). Results from these simulations substantiate our theoretical findings.

Machine Learning, ICML

1 Introduction

Continual learning, also known as lifelong learning, is a subfield of machine learning that focuses on developing a model capable of learning continuously from a stream of data, which are sampled i.i.d. from different tasks and presented sequentially to the model. A primary challenge in continual learning is the catastrophic forgetting phenomenon (McCloskey & Cohen, 1989), wherein the model forgets previously acquired knowledge when exposed to new data.

Previous research addressing catastrophic forgetting in continual learning primarily focuses on empirical studies, which can be broadly classified into three categories: expansion-based methods, regularization-based methods, and memory-based methods. Expansion-based methods (Yoon et al., 2017, 2019; Yang et al., 2021) mitigate catastrophic forgetting by allocating distinct subsets of network parameters to individual tasks. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Serra et al., 2018; Liu & Liu, 2022) employ structural regularization in fixed-capacity models to counteract forgetting, penalizing significant changes in parameters that are crucial for previous tasks. Memory-based methods (Shin et al., 2017; Chaudhry et al., 2018; Riemer et al., 2018; Saha et al., 2021; Lin et al., 2022; Hao et al., 2023) alleviate forgetting by storing subsets of previous task data or synthesizing pseudo-data without data-replay.

Recently, there has been a growing body of work focused on understanding the behavior of catastrophic forgetting from a theoretical standpoint. For example, Bennani et al. 2020; Doan et al. 2021 analyze the generalization of continual learning for Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) in the Neural Tangent Kernel (NTK) (Jacot et al., 2018) regime. Lee et al. 2021; Asanuma et al. 2021 explore the impact of task similarity in a teacher-student setting. Evron et al. 2022; Lin et al. 2023 provide a detailed forgetting analysis of the minimum-norm interpolator for the overparameterized linear regression model. However, the existing analyses of forgetting often rely on relatively stringent assumptions that may not be applicable in many scenarios. For example, Bennani et al. 2020; Doan et al. 2021; Evron et al. 2022; Lin et al. 2023 necessitate an overparameterized regime for their analysis, which may be invalid when involving large datasets. Moreover, Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023; Swartworth et al. 2023 assume that data follows a Gaussian distribution that may not hold in real-world datasets exhibiting more complex distributions. Evron et al. 2022; Lin et al. 2023 focus on the minimum-norm interpolator, where each task requires achieving zero loss on its training samples and hence can find a closed-form solution.

In this paper, we investigate the behavior of forgetting under the linear regression model via the more practical Stochastic Gradient Descent (SGD) method and provide a general theoretical analysis that is applicable to both over-parameterized and under-parameterized regimes. Our main contributions can be summarized as follows:

First, our work provides a theoretical analysis for multi-step SGD algorithms in both underparameterized and overparameterized regimes, under general fourth-moment conditions on the data rather than the Gaussian assumption used in existing studies. Specifically, we provide a novel upper bound on the model forgetting, as well as a matching lower bound that shows the tightness of our characterization. Our bounds are stated as a function of 1) the spectrum of the population data covariance matrices for each task, 2) the step size, 3) the number of training samples, and 4) the effective dimensions.

Second, our study provides some interesting insights into the impact of task sequence and algorithmic parameters on the degree of forgetting. Specifically, we show that when the data size is sufficiently large, forgetting tends to escalate when we postpone the training of tasks whose population data covariance matrices possess larger eigenvalues. Intuitively, when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance. In addition, our findings reveal that an appropriate choice of step size can help mitigate forgetting in both underparameterized and overparameterized settings. Note that these results cannot be derived from existing works due to their restrictive data distribution assumptions or closed-form updating rules. More detailed discussions can be found in Section 4.

Finally, we conducted simulation experiments on both linear regression models and Deep Neural Networks (DNNs) to validate our theoretical analysis. Our simulation results indicate that both linear regression models and DNNs exhibit increased forgetting when tasks with larger eigenvalues are encountered later. Additionally, we demonstrate that smaller step sizes in training can also mitigate forgetting across task sequences, especially in under-parameterized settings. Interestingly, we observe that in over-parameterized DNNs, higher dimensionality does not necessarily equate to more forgetting if the dataset size is fixed, in contrast to the linear regression case.

1.1 Related Work

In this section, we discuss related work on Covariate Shift, SGD analysis in linear regression, and theoretical studies for catastrophic forgetting.

Covariate Shift Covariate shift is a specific set-up in machine learning (Pan & Yang, 2009; Sugiyama & Kawanabe, 2012), referring to a distribution mismatch between the training and test data. The concept is typically applied in transfer learning, which can be seen as a particular instance of continual learning, generally involving two tasks. For example, Mohri & Medina 2012; Cortes & Mohri 2014; Kpotufe & Martinet 2018; Cortes et al. 2019; Hanneke & Kpotufe 2020; Ma et al. 2023; Wu et al. 2022b examine the (regularized) empirical risk minimizer, which focuses on minimizing the empirical and generalization error across accessible datasets. Nevertheless, the standard covariate shift is defined over two distinct data distributions and cannot be directly applied to our case. Consequently, we propose an extended version in Definition 2.2 to better suit our context.

SGD Analysis Recently, several studies have investigated the behavior of Stochastic Gradient Descent (SGD) in linear regression models through the lens of bias-variance decomposition (Défossez & Bach, 2015; Dieuleveut et al., 2017; Jain et al., 2017, 2018) and the eigen-decomposition of the covariance matrix (Chen et al., 2020; Zou et al., 2021; Wu et al., 2022a, b). Our work closely relates to the studies in Zou et al. 2021; Wu et al. 2022b that also characterized the SGD dynamics in linear regression with respect to the full eigenspectrum of the data covariance matrix. However, they focused on either the single-task setting or the pretraining-finetuning setting, while we study the more challenging continual learning problem that involves a sequence of tasks with different data distributions. See Section 4 for more discussion.

Theoretical Studies in Continual Learning Although significant progress has been made in empirical studies addressing the issue of forgetting in continual learning, theoretical insights into this area are still largely unexplored. In this context, Bennani et al. 2020 established a theoretical framework to study continual learning algorithms in the NTK regime, and provided the first generalization bound dependent on task similarity for SGD and OGD. Doan et al. 2021 introduced the NTK overlap matrix as a task similarity metric and proposed a data-structure-informed variant of OGD that utilizes Principal Component Analysis (PCA). Asanuma et al. 2021 utilized the teacher-student framework on a single neural network and demonstrated that catastrophic forgetting can be circumvented when the similarity among input distributions is small and the similarity among teacher networks is large. Lee et al. 2021 expanded an earlier analysis of two-layer networks within the teacher-student setup to the setting with multiple teachers and revealed that the highest level of forgetting occurs when tasks have intermediate similarity with each other. Evron et al. 2022; Swartworth et al. 2023 explained the behavior of forgetting in the linear regression model from the perspectives of alternating projections and the Kaczmarz method (Karczmarz, 1937). Lin et al. 2023 investigated the impact of overparameterization, task similarity, and task ordering on forgetting and generalization in the overparameterized linear regression model.

The works most relevant to our study include (Evron et al., 2022; Lin et al., 2023), both of which also studied the behavior of forgetting in the linear regression model. However, our work differs from their studies in several aspects.

First, with regard to assumptions, Evron et al. 2022 assumed all data are bounded by 1 and the model is noiseless, and Lin et al. 2023 assumed all data are sampled from a Gaussian distribution. In contrast, our assumptions cover more data distributions and are much milder than theirs (see Remark 2 and Section 4 for more details). Second, in terms of methods, both Evron et al. 2022 and Lin et al. 2023 analyze the problem of forgetting using the minimum norm solution, which presupposes zero training error, a requirement not necessary in our approach with SGD (see Section 2 for further discussions). Third, Evron et al. 2022; Lin et al. 2023 considered only the overparameterized case where the data dimension is larger than the data size, while our analysis holds for both the underparameterized and overparameterized settings.

Notations: In this paper, we adhere to a consistent notation style for clarity. We use boldface lowercase letters such as $\mathbf{x}, \mathbf{w}$ for vectors, and boldface capital letters (e.g., $\mathbf{A}, \mathbf{H}$) for matrices. Let $\|\mathbf{A}\|_2$ denote the spectral norm of $\mathbf{A}$ and $\|\mathbf{v}\|_2$ denote the Euclidean norm of $\mathbf{v}$. For two vectors $\mathbf{u}$ and $\mathbf{v}$, their inner product is denoted by $\langle \mathbf{u}, \mathbf{v} \rangle$ or $\mathbf{u}^\top \mathbf{v}$. For two matrices $\mathbf{A}$ and $\mathbf{B}$ of appropriate dimension, their inner product is defined as $\langle \mathbf{A}, \mathbf{B} \rangle := \operatorname{tr}(\mathbf{A}^\top \mathbf{B})$. For a positive semi-definite (PSD) matrix $\mathbf{A}$ and a vector $\mathbf{v}$ of appropriate dimension, we write $\|\mathbf{v}\|_{\mathbf{A}}^2 := \mathbf{v}^\top \mathbf{A} \mathbf{v}$. The outer product is denoted by $\otimes$.

2 Preliminaries

In our setup, we consider a sequence of tasks, denoted as $\mathbb{M} = \{1, 2, \ldots, M\}$. For each task $m$ in this sequence, we have a corresponding dataset $D_m$, which consists of $N$ data points. Each of these data points, denoted as $(\mathbf{x}_{m,i}, y_{m,i})$, is drawn independently and identically distributed (i.i.d.) from a specific distribution $\mathcal{D}_m = \mathcal{X}_m \times \mathcal{Y}_m \subset \mathbb{R}^d \times \mathbb{R}$. Here, $\mathbf{x}_{m,i}$ represents the feature vector, and $y_{m,i}$ is the response variable for each data point in the dataset $D_m$. Assume that $\{(\mathbf{x}_{m,i}, y_{m,i})\}_{i=1}^{N}$ are i.i.d. sampled from a linear regression model, i.e., each pair $(\mathbf{x}_{m,i}, y_{m,i})$ is a realization of the linear regression model $y_m = (\mathbf{x}_m^\top \mathbf{w}_*) + z_m$, where $z_m$ is some randomized noise and $\mathbf{w}_* \in \mathbb{R}^d$ is the optimal model parameter.

Our goal is to output a model $\mathbf{w}_{MN}$ minimizing the degree of forgetting (Evron et al., 2022) for $M$ tasks, i.e.

$$G(M) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_m(\mathbf{w}_{MN}), \quad \text{where} \tag{1}$$

$$\mathcal{L}_m(\mathbf{w}) = \frac{1}{2} \mathbb{E}_{(\mathbf{x}_m, y_m) \sim \mathcal{D}_m} \|\mathbf{x}_m^\top \mathbf{w} - y_m\|^2, \quad m \in \mathbb{M}.$$

Here, $\mathbf{w}_{MN}$ represents the final output after sequentially training on $M$ tasks, each updated via SGD over $N$ iterations. Equation 1 quantifies an average excess population risk on the final output $\mathbf{w}_{MN}$ across all tasks. For each task $m$, the loss $\mathcal{L}_m$ evaluates how well $\mathbf{w}_{MN}$ performs on it, thus assessing the degree of the model's forgetting on previous tasks in continual learning scenarios.
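As a minimal sketch of how Equation 1 can be evaluated, assume the linear model above with zero-mean noise of variance $\sigma^2$ that is independent of the covariates, and write $\mathbf{H}_m = \mathbb{E}[\mathbf{x}_m \mathbf{x}_m^\top]$. Under these assumptions the population loss has the closed form $\mathcal{L}_m(\mathbf{w}) = \frac{1}{2}(\mathbf{w} - \mathbf{w}_*)^\top \mathbf{H}_m (\mathbf{w} - \mathbf{w}_*) + \frac{\sigma^2}{2}$. The function names and the two-task example below are illustrative only and do not come from the paper.

```python
import numpy as np

def population_loss(w, w_star, H, sigma2):
    """Closed-form L_m(w) = 0.5 * E[(x^T w - y)^2] for y = x^T w_star + z,
    assuming E[z] = 0, E[z^2] = sigma2, and z independent of x."""
    diff = w - w_star
    return 0.5 * (diff @ H @ diff + sigma2)

def forgetting(w_final, w_star, covariances, sigma2):
    """G(M): average population loss of the final iterate over all M tasks."""
    losses = [population_loss(w_final, w_star, H, sigma2) for H in covariances]
    return float(np.mean(losses))

# Hypothetical two-task example with diagonal covariances (not from the paper).
w_star = np.array([1.0, -1.0])
covs = [np.diag([2.0, 0.5]), np.diag([0.5, 2.0])]
w_final = np.array([0.0, -1.0])  # wrong only in the first coordinate
g = forgetting(w_final, w_star, covs, sigma2=0.0)
print(g)
```

In this toy example the error lies along the first coordinate, so the task whose covariance has the larger eigenvalue in that direction contributes most of the average loss.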

Definition 2.1 (Data Covariance).

Assume that each entry and the trace of $\mathbb{E}[\mathbf{x}_m \mathbf{x}_m^\top]$ are finite. Define $\mathbf{H}_m := \mathbb{E}[\mathbf{x}_m \mathbf{x}_m^\top]$ as the data covariance matrix.

Let the eigendecomposition of the data covariance for task $m$ be $\mathbf{H}_m = \sum_i \lambda_m^i \mathbf{v}_m^i {\mathbf{v}_m^i}^\top$, where $(\lambda_m^i)_{i \geq 1}$ are the eigenvalues in nonincreasing order and $(\mathbf{v}_m^i)_{i \geq 1}$ are the corresponding eigenvectors. Define $\mathbf{H}_{m,k_1:k_2} := \sum_{k_1 < i \leq k_2} \lambda_m^i \mathbf{v}_m^i {\mathbf{v}_m^i}^\top$, and allow $k_2 = \infty$, so that $\mathbf{H}_{m,k:\infty} = \sum_{i > k} \lambda_m^i \mathbf{v}_m^i {\mathbf{v}_m^i}^\top$.
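The spectral slices $\mathbf{H}_{m,k_1:k_2}$ can be computed numerically as in the following sketch. The helper names `eig_sorted` and `spectral_slice` are hypothetical, the task index $m$ is dropped, and `k2=None` stands in for $k_2 = \infty$.

```python
import numpy as np

def eig_sorted(H):
    """Eigenvalues of a symmetric matrix in nonincreasing order,
    with the matching eigenvectors as columns."""
    vals, vecs = np.linalg.eigh(H)      # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def spectral_slice(H, k1, k2=None):
    """H_{k1:k2} = sum over k1 < i <= k2 of lambda^i v^i (v^i)^T with 1-indexed i.
    k2=None plays the role of k2 = infinity."""
    vals, vecs = eig_sorted(H)
    V = vecs[:, k1:k2]                  # 0-indexed columns k1..k2-1 are i = k1+1..k2
    return (V * vals[k1:k2]) @ V.T      # scale each column by its eigenvalue

# Small example: eigenvalues of this matrix are 3 and 1.
H = np.array([[2.0, 1.0], [1.0, 2.0]])
head = spectral_slice(H, 0, 1)          # H_{0:1}, the top eigen-direction
tail = spectral_slice(H, 1)             # H_{1:inf}, the remaining directions
print(np.allclose(head + tail, H))
```

The head and tail slices always sum back to the full covariance, which is a convenient sanity check when splitting the spectrum at an index $k$.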

Definition 2.2 (Covariate Shift).

For each task $m$, the covariates $\mathbf{x}_{m,1}, \ldots, \mathbf{x}_{m,N}$ are i.i.d. drawn from $\mathcal{D}_m$.

Compared to the concept of covariate shift in transfer learning (Pathak et al., 2022), Definition 2.2 provides a more general scenario applicable to a series of tasks $M \geq 2$. For simplicity, in our analysis, we assume that each task $m$ in our model consists of $N$ data points, differentiating it from transfer learning approaches that typically consider the total dataset size as $N$.

Assumption 2.3 (Fourth moment conditions).

Assume that for each task $m$, the expected fourth moment of covariates, denoted as $\mathcal{M} := \mathbb{E}[\mathbf{x}_m \otimes \mathbf{x}_m \otimes \mathbf{x}_m \otimes \mathbf{x}_m]$, and the expected covariance matrix $\mathbf{H}_m$ are finite. Moreover:

  (A) There exists a constant $\alpha_m > 0$ such that for any Positive Semi-Definite (PSD) matrix $\mathbf{A}$, the following holds:

    $$\mathbb{E}[\mathbf{x}_m \mathbf{x}_m^\top \mathbf{A} \mathbf{x}_m \mathbf{x}_m^\top] \preceq \alpha_m \cdot \operatorname{tr}(\mathbf{H}_m \mathbf{A}) \mathbf{H}_m.$$

  (B) There exists a constant $\beta_m > 0$ such that for every PSD matrix $\mathbf{A}$, the following holds:

    $$\mathbb{E}[\mathbf{x}_m \mathbf{x}_m^\top \mathbf{A} \mathbf{x}_m \mathbf{x}_m^\top] - \mathbf{H}_m \mathbf{A} \mathbf{H}_m \succeq \beta_m \cdot \operatorname{tr}(\mathbf{H}_m \mathbf{A}) \mathbf{H}_m.$$

Remark 1.

Assumption 2.3 is a commonly employed assumption in the analysis of linear regression with SGD methods (Zou et al., 2021; Wu et al., 2022a, b), and it is much weaker than the assumptions in the aforementioned related work. Specifically, it can be verified that Assumption 2.3 holds with $\alpha_m = 3$ and $\beta_m = 1$ for the Gaussian distribution discussed in (Asanuma et al., 2021; Lee et al., 2021; Lin et al., 2023). Additionally, Assumption 2.3(A) can be relaxed to $\mathbb{E}\|\mathbf{x}_m\|_2^2 \leq \alpha_m \operatorname{tr}(\mathbf{H}_m)$ with $\mathbf{A} = \mathbf{I}$, where $\alpha_m \operatorname{tr}(\mathbf{H}_m) = 1$ is assumed in Evron et al. 2022.
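The Gaussian constants in this remark can be checked with a short calculation. The sketch below assumes zero-mean covariates $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{H})$ and a symmetric PSD matrix $\mathbf{A}$, with the task index $m$ dropped; the zero-mean condition is an assumption made here for the derivation and is not stated explicitly in the text.

```latex
% Assumptions: x ~ N(0, H) with zero mean, and A symmetric PSD.
\begin{align*}
\mathbb{E}[x_i x_j x_k x_l]
  &= H_{ij} H_{kl} + H_{ik} H_{jl} + H_{il} H_{jk}
  && \text{(Isserlis' theorem)} \\
\mathbb{E}[\mathbf{x}\mathbf{x}^\top \mathbf{A}\, \mathbf{x}\mathbf{x}^\top]
  &= 2\, \mathbf{H}\mathbf{A}\mathbf{H} + \operatorname{tr}(\mathbf{H}\mathbf{A})\, \mathbf{H}
  && \text{(contract the first line with } A_{jk}\text{)} \\
\mathbf{H}\mathbf{A}\mathbf{H}
  &= \mathbf{H}^{1/2} \big( \mathbf{H}^{1/2} \mathbf{A} \mathbf{H}^{1/2} \big) \mathbf{H}^{1/2}
  \preceq \operatorname{tr}(\mathbf{H}\mathbf{A})\, \mathbf{H}
  && \text{(a PSD matrix } \mathbf{B} \text{ satisfies } \mathbf{B} \preceq \operatorname{tr}(\mathbf{B})\, \mathbf{I}\text{)} \\
\mathbb{E}[\mathbf{x}\mathbf{x}^\top \mathbf{A}\, \mathbf{x}\mathbf{x}^\top]
  &\preceq 3 \operatorname{tr}(\mathbf{H}\mathbf{A})\, \mathbf{H}
  && \text{(condition (A) with } \alpha = 3\text{)} \\
\mathbb{E}[\mathbf{x}\mathbf{x}^\top \mathbf{A}\, \mathbf{x}\mathbf{x}^\top] - \mathbf{H}\mathbf{A}\mathbf{H}
  &= \mathbf{H}\mathbf{A}\mathbf{H} + \operatorname{tr}(\mathbf{H}\mathbf{A})\, \mathbf{H}
  \succeq \operatorname{tr}(\mathbf{H}\mathbf{A})\, \mathbf{H}
  && \text{(condition (B) with } \beta = 1\text{)}
\end{align*}
```

The last line uses only that $\mathbf{H}\mathbf{A}\mathbf{H}$ is PSD, so the lower bound in (B) holds with $\beta = 1$ without any further condition on the spectrum of $\mathbf{H}$.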

Assumption 2.4 (Well-specified noise).

Assume that for each distribution of task $m$, the response (conditional on input covariates) is given by $y_m = \mathbf{x}_m^\top \mathbf{w}^* + z_m$, where $z_m \sim \mathcal{N}(0, \sigma^2)$ and $z_m$ is independent of $\mathbf{x}_m$.

Similar to previous works, we assume that $z_m$ is some randomized noise that satisfies $\mathbb{E}[z_m \mid \mathbf{x}] = 0$ and $\mathbb{E}[z_m^2] = \sigma^2$ for each task $m$.

Continual Learning via SGD

Suppose we train the model parameter $\mathbf{w}$ sequentially. Let $\mathbf{w}_{(m-1)N+N}$ represent the parameter state after the completion of training on task $m$, which also serves as the initial condition for the training of task $m+1$. Starting with $\mathbf{w}_0$ and employing a constant step size $\eta$, the model is updated by SGD for each task $m \in \mathbb{M}$ over $N$ iterations, with $t = 1, \ldots, N$:

$$\begin{aligned}
\mathbf{w}_{(m-1)N+t} &= \mathbf{w}_{(m-1)N+t-1} - \eta \cdot \mathbf{g}_{m,t}, \quad \text{and} \\
\mathbf{g}_{m,t} &:= (\mathbf{x}_{m,t}^\top \mathbf{w}_{(m-1)N+t-1} - y_{m,t})\, \mathbf{x}_{m,t},
\end{aligned} \tag{2}$$

where $\mathbf{g}_{m,t}$ represents the gradient of the loss function at task $m$ and iteration $t$ for a given data point $(\mathbf{x}_{m,t}, y_{m,t})$.
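The update in Equation 2 can be sketched in a few lines of NumPy. This is an illustrative simulation and not the authors' experimental code: the function name `continual_sgd`, the isotropic Gaussian covariates, the noiseless responses, and all numeric settings are assumptions chosen for the example.

```python
import numpy as np

def continual_sgd(tasks, w0, eta):
    """One-pass SGD over a task sequence, following Equation 2.
    tasks: list of (X, y) with X of shape (N, d); sample t is used at iteration t.
    The iterate is carried over from one task to the next."""
    w = np.array(w0, dtype=float)
    for X, y in tasks:
        for x_t, y_t in zip(X, y):
            g = (x_t @ w - y_t) * x_t   # stochastic gradient g_{m,t}
            w = w - eta * g             # constant step size eta
    return w

# Hypothetical setup: two tasks sharing w_star, covariances 0.5*I and 2*I.
rng = np.random.default_rng(0)
d, N = 5, 200
w_star = rng.normal(size=d)
tasks = []
for scale in (np.sqrt(0.5), np.sqrt(2.0)):
    X = scale * rng.normal(size=(N, d))
    tasks.append((X, X @ w_star))       # noiseless responses (sigma = 0), an assumption
w_final = continual_sgd(tasks, np.zeros(d), eta=0.01)
print(np.linalg.norm(w_final - w_star))
```

Reordering the entries of `tasks` or changing `eta` gives a simple way to probe how task order and step size affect the final iterate in this toy setting.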

In contrast, the minimum norm solution in linear regression, particularly relevant in overparameterized settings, aims to find a weight vector $\mathbf{w}$ that not only achieves zero training error but also possesses the minimal possible norm. Here, $\mathbf{w}_m$ represents the outcome post-training for task $m$, and it also serves as the starting point for training task $m+1$. The objective, beginning from an initial condition $\mathbf{w}_0 = \mathbf{0}$, is defined by the following optimization problem:

$$\min_{\mathbf{w}} \|\mathbf{w} - \mathbf{w}_{m-1}\|_2, \quad \text{s.t. } (\mathbf{X}_m)^{\top}\mathbf{w} = \mathbf{y}_m,$$

where $\mathbf{X}_m := [\mathbf{x}_{m,1}, \ldots, \mathbf{x}_{m,N}] \in \mathbb{R}^{d \times N}$ and $\mathbf{y}_m = [y_{m,1}, \ldots, y_{m,N}] \in \mathbb{R}^{1 \times N}$. The resulting update rule is:

$$\mathbf{w}_m = \mathbf{w}_{m-1} + \mathbf{X}_m(\mathbf{X}_m^{\top}\mathbf{X}_m)^{-1}(\mathbf{y}_m - \mathbf{X}_m^{\top}\mathbf{w}_{m-1}), \tag{3}$$

which highlights the computational cost of forming the inverse $(\mathbf{X}_m^{\top}\mathbf{X}_m)^{-1}$. This is particularly challenging for large datasets or overparameterized feature spaces. Unlike the minimum norm solution, SGD does not assume the existence of a unique, exact solution and is more adaptable to a variety of problems, including those with non-linear dynamics.
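As a rough illustration of Equation 3, the sketch below applies one minimum-norm update. The name `min_norm_update` and the toy inputs are hypothetical, and the code assumes $\mathbf{X}_m^{\top}\mathbf{X}_m$ is invertible (which requires $d \geq N$ and linearly independent samples). It calls `np.linalg.solve` instead of forming the inverse explicitly; that is a common numerical choice, not something the text prescribes.

```python
import numpy as np

def min_norm_update(w_prev, X, y):
    """One step of Equation 3.

    X: shape (d, N), columns are samples (as in the definition of X_m).
    y: shape (N,).
    Assumes X.T @ X is invertible.
    """
    residual = y - X.T @ w_prev
    # Solve (X^T X) a = residual, then move w_prev by X a.
    return w_prev + X @ np.linalg.solve(X.T @ X, residual)

# Hypothetical overparameterized example: d = 5 features, N = 3 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(3)
w1 = min_norm_update(np.zeros(5), X, y)
```

After the update, `X.T @ w1` reproduces `y` up to floating-point error, so the current task is interpolated exactly; the linear solve on an $N \times N$ system is the cost the paragraph above refers to.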

3 Main Results

Before presenting our upper bound, we shall establish the following notations to facilitate comprehension of the results.

$$
\left\{
\begin{aligned}
\Gamma_{(p,q)}^{i} &:= \prod_{j=p}^{q}(1-\eta\lambda_{j}^{i})^{2N}, \quad \bm{\Gamma}_{p}^{q} := \prod_{j=p}^{q}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}, \\
\mathbf{U}_{k_{m}^{*}} &:= \mathbf{I}_{m,0:k_{m}^{*}} + N\eta\,\mathbf{H}_{m,k_{m}^{*}:\infty}, \quad \Lambda^{i} := \sum_{m=1}^{M}\lambda_{m}^{i},
\end{aligned}
\right.
\tag{4}
$$

where $(\lambda_m^i)_{i \geq 1}$ are the eigenvalues of $\mathbf{H}_m$ in nonincreasing order and $k_m^* = \max\{i : \lambda_m^i \geq \frac{1}{N\eta}\}$ is the cut-off index for $\mathbf{H}_m$. Here, $\Gamma_{(p,q)}^{i}$ and $\bm{\Gamma}_{p}^{q}$ can be regarded as a projection accumulation from task $p$ to task $q$; they capture the impact of the learning dynamics of previous tasks on the subsequent task. $\mathbf{U}_{k_m^*}$ is defined with respect to the cut-off index $k_m^*$ of each task's data covariance matrix $\mathbf{H}_m$ and captures both the dominant eigenvalues and the tail of the spectrum, and $\Lambda^i$ denotes the sum of the $i$-th eigenvalues across all tasks.
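To make the scalar quantities in Equation 4 concrete, the following sketch computes $k_m^*$, $\Gamma_{(p,q)}^{i}$, and $\Lambda^i$ from per-task eigenvalue lists. The function names and example spectra are hypothetical; indices are 1-based to mirror the text, and returning 0 when no eigenvalue reaches the threshold is a convention assumed here.

```python
def cutoff_index(eigs, N, eta):
    """k* = max{i : eigs[i-1] >= 1/(N*eta)} for eigenvalues sorted in
    nonincreasing order; returns 0 if no eigenvalue qualifies (assumed)."""
    threshold = 1.0 / (N * eta)
    # With a sorted spectrum, the count of qualifying eigenvalues is the max index.
    return sum(1 for lam in eigs if lam >= threshold)

def gamma_scalar(eigs_by_task, i, p, q, N, eta):
    """Gamma_{(p,q)}^i = prod_{j=p}^{q} (1 - eta * lambda_j^i)^(2N)."""
    out = 1.0
    for j in range(p, q + 1):  # empty product (p > q) gives 1
        out *= (1.0 - eta * eigs_by_task[j - 1][i - 1]) ** (2 * N)
    return out

def lambda_sum(eigs_by_task, i):
    """Lambda^i = sum over tasks of the i-th eigenvalue."""
    return sum(eigs[i - 1] for eigs in eigs_by_task)

# Hypothetical spectra for M = 2 tasks.
eigs_by_task = [[1.0, 0.1], [0.5, 0.05]]
N, eta = 10, 0.5
k_stars = [cutoff_index(e, N, eta) for e in eigs_by_task]
```

With these toy values the threshold $1/(N\eta)$ is 0.2, so only the leading eigenvalue of each task lies above the cut-off.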

In the following, we first provide our upper bound for the behavior of forgetting via SGD in the linear regression model.

Theorem 3.1 (Upper Bound).

Consider a scenario where the model $\mathbf{w}$ undergoes training via SGD for $M$ distinct tasks, following a sequence $1, \ldots, M$. With a constant step size of $\eta \leq 1/R^2$ given that $R^2 = \max\{\alpha_m \operatorname{tr}(\mathbf{H}_m)\}_{m=1}^{M}$, each task $m$ is executed for $N$ iterations. Given that Assumptions (A) and 2.4 are satisfied, the following will hold:

$$G(M) \leq \text{err}_{\text{var}} + \text{err}_{\text{bias}},$$

where the variance and bias errors are upper-bounded by

$$
\begin{aligned}
\text{err}_{\text{var}} &\leq \frac{1}{M}\sum_{m=1}^{M} \frac{\eta\sigma^{2}}{(1-\eta R^{2})} \cdot D_{1}^{\text{eff}}, \\
\text{err}_{\text{bias}} &\leq \frac{1}{M}\sum_{k=1}^{M} \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}}^{2} \\
&\quad + \frac{1}{M}\sum_{m=1}^{M} \frac{2\alpha_{m}\eta^{2} \cdot (D_{2}^{\text{eff}} + \Phi_{1}^{m-1} D_{3}^{\text{eff}})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})} \cdot \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2} \\
&\quad + \frac{1}{M}\sum_{m=1}^{M} \alpha_{m}\eta \cdot \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}(\mathbf{H}_{m}+\Phi_{1}^{m-1}\mathbf{I}) \cdot \mathbf{U}_{k_{m}^{*}}}^{2},
\end{aligned}
$$

where the effective dimensions are given by

$$
\begin{aligned}
D_{1}^{\text{eff}} &:= \sum_{i<k_{m}^{*}} \Gamma_{(m+1,M)}^{i}\Lambda^{i} + N\eta \sum_{i>k_{m}^{*}} \Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i}, \\
D_{2}^{\text{eff}} &:= \sum_{i<k_{m}^{*}} \Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{2}\Lambda^{i} + N\eta \sum_{i>k_{m}^{*}} \Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\Lambda^{i}, \\
D_{3}^{\text{eff}} &:= \sum_{i<k_{m}^{*}} \Gamma_{(m,M)}^{i}(\lambda_{m}^{i})\Lambda^{i} + \eta N \sum_{i>k_{m}^{*}} \Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\Lambda^{i},
\end{aligned}
\tag{5}
$$

with $k_{m}^{*}$, $\Gamma_{(p,q)}^{i}$, and $\bm{\Gamma}_{p}^{q}$ defined as in Equation 4, and denoting $\Phi_{1}^{m-1} := \sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j} \cdot \langle \mathbf{H}_{k-1}, \mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N} \rangle \cdot \langle \mathbf{H}_{j}, \mathbf{H}_{m} \rangle$.

In Theorem 3.1, we establish an upper bound on the forgetting of a model trained using SGD in continual learning under varying data distributions. It shows that the model's performance is governed by both $\text{err}_{\text{var}}$ and $\text{err}_{\text{bias}}$, where $\text{err}_{\text{var}}$ stems from the noise inherent in the model and $\text{err}_{\text{bias}}$ represents the bias associated with the initial value of the learning process. Both are determined jointly by the spectra of the covariance matrices and the step size used for continual learning.

To provide a more intuitive explanation, we explore a simplified scenario by setting $\eta = 0$. This setting reduces the error terms to only the first term of the bias error, which appears to depend solely on the initial weight $\mathbf{w}_0$ and the data. However, this simplification might misleadingly suggest that a minimal $\eta$ yields optimal learning outcomes. What this interpretation overlooks is the role of the projection term $\bm{\Gamma}_{1}^{M} = \prod_{j=1}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}$, which becomes the identity matrix $\mathbf{I}$ when $\eta = 0$. Thus, while setting $\eta = 0$ eliminates the other error terms, it also exacerbates the first term of the bias error, potentially making it the most significant error contributor. Consequently, there is a trade-off in choosing the step size.
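The trade-off can be seen numerically in a one-dimensional sketch. Assume, purely for illustration, that all $M$ tasks share a single eigenvalue `lam`, so that the projection term reduces to the scalar $(1-\eta\lambda)^{2NM}$; the function names and parameter values below are hypothetical. The first function tracks the projection factor in the leading bias term, and the second tracks the prefactor $\eta\sigma^{2}/(1-\eta R^{2})$ of the variance bound in Theorem 3.1.

```python
def projection_factor(eta, lam, N, M):
    """Scalar version of Gamma_1^M when every task has eigenvalue lam."""
    return (1.0 - eta * lam) ** (2 * N * M)

def variance_prefactor(eta, sigma2, R2):
    """eta * sigma^2 / (1 - eta * R^2); needs eta < 1/R^2 to stay finite."""
    return eta * sigma2 / (1.0 - eta * R2)

lam, N, M, sigma2, R2 = 0.5, 50, 3, 1.0, 1.0   # assumed toy values
for eta in (0.0, 0.01, 0.1, 0.5):
    print(eta, projection_factor(eta, lam, N, M), variance_prefactor(eta, sigma2, R2))
```

At `eta = 0` the projection factor equals 1, so the leading bias term is not damped at all, while the variance prefactor is 0. Increasing `eta` shrinks the former and grows the latter, which is the tension described above.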

The subsequent theorem presents a nearly matching lower bound.

Theorem 3.2 (Lower Bound).

Consider a scenario where the model $\mathbf{w}$ undergoes training via SGD for $M$ distinct tasks, following a sequence $1, \ldots, M$. With a constant step size of $\eta \leq 1/R^2$ given that $R^2 = \max\{\alpha_m \operatorname{tr}(\mathbf{H}_m)\}_{m=1}^{M}$, each task $m$ is executed for $N$ iterations. Given that Assumptions (B) and 2.4 are satisfied, the following will hold:

$$G(M) \geq \text{err}_{\text{var}} + \text{err}_{\text{bias}},$$

where the variance and bias errors are lower bounded by

$$
\begin{aligned}
\text{err}_{\text{var}} &\geq \frac{1}{M}\sum_{m=1}^{M} \frac{9\eta^{2}\sigma^{2}}{20} \cdot D_{1}^{\text{eff}}, \\
\text{err}_{\text{bias}} &\geq \frac{1}{M}\sum_{k=1}^{M} \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}}^{2} \\
&\quad + \frac{1}{M}\sum_{m=1}^{M} \frac{\beta_{m}^{2}\eta^{2}}{25} \cdot (D_{2}^{\text{eff}} + \hat{\Phi}_{1}^{m-1} D_{3}^{\text{eff}}) \cdot \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2} \\
&\quad + \frac{1}{M}\sum_{m=1}^{M} \frac{\beta_{m}\eta^{2}}{5} \cdot \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}(\mathbf{H}_{m}+\hat{\Phi}_{1}^{m-1}\mathbf{I}) \cdot \mathbf{U}_{k_{m}^{*}}}^{2},
\end{aligned}
$$

where the effective dimensions, $k_{m}^{*}$, $\Gamma_{(p,q)}^{i}$, and $\bm{\Gamma}_{p}^{q}$ are the same as in Theorem 3.1, and $\hat{\Phi}_{1}^{m-1} := \sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}\left(\frac{\eta}{2}\right)^{j} \cdot \langle \mathbf{H}_{k-1}, \mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{2N} \rangle \cdot \langle \mathbf{H}_{j}, \mathbf{H}_{m} \rangle$.

Analogous to Theorem 3.1, our lower bound also consists of a bias term and a variance term. It is noteworthy that the lower bound is tight with respect to the upper bound in the variance term, differing only by absolute constants. In addition, the lower bound closely matches the upper bound in the bias term, with some differences arising from the following quantities:

$$\hat{\Phi}_{1}^{m-1}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}, \quad \|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}}.$$

Specifically, $\hat{\Phi}_{1}^{m-1}$ here differs from $\Phi_{1}^{m-1}$ in Theorem 3.1 only by constant factors (i.e., $\alpha_k$ and $\beta_k$ defined in 2.3). The term $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}}$ carries an additional factor $(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}$ in its subscript compared with the upper bound. Nevertheless, this factor can be regarded as part of the projection accumulation $\bm{\Gamma}_{1}^{M}$, which appears in the subscript of both results.

More importantly, we show that the upper and lower bounds match, up to constant factors, under the conditions

$$\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}\lesssim\sigma^{2},\quad\hat{\Phi}_{1}^{m-1}\lesssim O(1),$$

which can be satisfied when the signal-to-noise ratio $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{U}_{k_{m}^{*}}}^{2}/\sigma^{2}$ is bounded and the step size is appropriately small.

4 Discussion

Building on Theorem 3.1 and Theorem 3.2, we aim to offer a more comprehensive understanding of our findings from three key perspectives: 1) Technical Understanding Under Simplified Cases; 2) Comparison with Existing Work; 3) The Impact of Task Ordering and Parameters on Forgetting.

4.1 Technical Understanding Under Simplified Cases

In this section, we demonstrate how to achieve a vanishing bound in the overparameterized regime.

Based on Theorem 3.1, we consider a scenario where $\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{2}^{2},\sigma^{2}\lesssim 1$ and $\operatorname{tr}(\mathbf{H}_{m})\simeq 1$ for each task $m$, implying a rapid decay in the spectrum of $\mathbf{H}_{m}$. To obtain a vanishing bound in the overparameterized regime, the effective dimensions should satisfy

$$
\begin{aligned}
D_{1}^{\text{eff}}\simeq D_{3}^{\text{eff}} &= o\left(\frac{MN}{e^{(M-m)}}\right), \\
D_{2}^{\text{eff}} &= o\left(\frac{MN}{e^{M}}\right).
\end{aligned}
\tag{6}
$$

To meet the condition in Equation 6, for each task $\widetilde{m}$, let $k^{\dagger}=\min\{k_{m}^{*},k_{\widetilde{m}}^{*}\}$ and $k^{\star}=\max\{k_{m}^{*},k_{\widetilde{m}}^{*}\}$. It necessarily holds that

$$
\begin{aligned}
\sum_{i<k^{\star}}\lambda_{\widetilde{m}}^{i} &\simeq \sum_{i<k^{\star}}\lambda_{m}^{i}\lambda_{\widetilde{m}}^{i}\simeq\sum_{i<k^{\star}}(\lambda_{m}^{i})^{2}\lambda_{\widetilde{m}}^{i}=o(N), \\
\sum_{i>k^{\dagger}}\lambda_{\widetilde{m}}^{i}\lambda_{m}^{i} &\simeq \sum_{i>k^{\dagger}}(\lambda_{m}^{i})^{2}\lambda_{\widetilde{m}}^{i}\simeq\sum_{i>k^{\dagger}}(\lambda_{m}^{i})^{3}\lambda_{\widetilde{m}}^{i}=o\left(\frac{1}{N}\right).
\end{aligned}
\tag{7}
$$

To clarify Equation 7, note the crucial cut-off indices $k^{\star}$ and $k^{\dagger}$, which divide the entire feature space into $k^{\star}$-dimensional and $k^{\dagger}$-dimensional subspaces. To achieve a diminishing bound in the overparameterized setting, it is necessary that the sum of eigenvalues for indices less than $k^{\star}$, denoted as $\sum_{i<k^{\star}}$, be $o(N)$, and that the sum of the tail eigenvalues for indices greater than $k^{\dagger}$, denoted as $\sum_{i>k^{\dagger}}$, be $o(\frac{1}{N})$. These conditions are typically met when the dataset size $N$ is sufficiently large, or when a smaller step size $\eta$ is chosen dependent on $N$. Additionally, we note that the condition in Equation 7 can be relaxed. In light of the definition of $k_{m}^{*}$, the eigenvalues for task $\widetilde{m}$ are truncated based on the following two scenarios: $\mathbf{1})$ $k_{m}^{*}<k_{\widetilde{m}}^{*}$: Here, the cut-off for task $\widetilde{m}$ occurs earlier, resulting in an additional $(k_{\widetilde{m}}^{*}-k_{m}^{*})$ dimensions of eigenvalues such that $\lambda_{\widetilde{m}}^{i}\geq 1/(N\eta)$. To achieve a diminishing bound under this condition, it is necessary that $\sum_{k_{m}^{*}\leq i\leq k_{\widetilde{m}}^{*}}\lambda_{\widetilde{m}}^{i}=o(N)$. $\mathbf{2})$ $k_{m}^{*}\geq k_{\widetilde{m}}^{*}$: In this case, the cut-off for task $\widetilde{m}$ occurs later, involving an additional $(k_{m}^{*}-k_{\widetilde{m}}^{*})$ dimensions of eigenvalues where $\lambda_{\widetilde{m}}^{i}\leq 1/(N\eta)$, achieving the same results.
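The role of the cut-off indices can be illustrated numerically. The following is a minimal sketch, assuming that $k_{m}^{*}$ counts the eigenvalues of task $m$ that are at least $1/(N\eta)$ (consistent with the threshold used above) and using power-law spectra as a hypothetical example. The function name `cutoff_index`, the sizes, and the decay rates are illustrative choices, not values taken from the analysis.

```python
import numpy as np

def cutoff_index(eigs, n_samples, step_size):
    """Count eigenvalues >= 1/(N * eta); an assumed definition of the cut-off k*."""
    return int(np.sum(eigs >= 1.0 / (n_samples * step_size)))

d, N, eta = 1000, 450, 0.01                  # hypothetical dimension, data size, step size
idx = np.arange(1, d + 1, dtype=float)
lam_m = idx ** -2.0                          # assumed spectrum of task m
lam_mt = idx ** -1.0                         # assumed spectrum of task m-tilde

k_m = cutoff_index(lam_m, N, eta)
k_mt = cutoff_index(lam_mt, N, eta)
k_dag, k_star = min(k_m, k_mt), max(k_m, k_mt)

# Head sum over the leading k_star eigenvalues of task m-tilde,
# and tail sum of eigenvalue products beyond the first k_dag indices.
head_sum = lam_mt[:k_star].sum()
tail_sum = (lam_mt[k_dag:] * lam_m[k_dag:]).sum()
print(k_m, k_mt, k_dag, k_star, head_sum, tail_sum)
```

In this example $k_{m}^{*}<k_{\widetilde{m}}^{*}$, so it falls under the first scenario: the head sum stays of constant order while the tail sum is small, which is the qualitative behavior the conditions in Equation 7 ask for.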

In the under-parameterized regime, we even account for the worst-case scenario where $\lambda_{m}^{i}\geq\frac{1}{N\eta}$ for all indices $i$ and tasks $m$, leading to a bound of

$$D_{1}^{\text{eff}}\simeq D_{3}^{\text{eff}}=o\left(\frac{Md\lambda_{m}^{1}}{e^{(M-m)}}\right),\quad D_{2}^{\text{eff}}=o\left(\frac{Md\lambda_{m}^{1}}{e^{M}}\right).$$

4.2 Comparison with Existing Work

In this section, we will first explore the challenges and parallels between traditional/transfer learning and continual learning. Secondly, we examine how restrictive assumptions in previous studies might overshadow the impact of key factors, thereby affecting the overall understanding of forgetting in continual learning.

Our results reveal that compared to traditional learning (Zou et al., 2021), which typically involves a single task, and transfer learning (Wu et al., 2022b), which usually incorporates two data distributions, the effective dimension in continual learning scenarios is more complex. Specifically, in our analysis, the term $\Lambda^{i}$ arises from a distinct measurement perspective (i.e., forgetting), which requires us to consider how the final output aligns with all previously encountered tasks in continual learning (i.e., $\mathbf{H}_{m}$ for all $m$). This is in contrast to both traditional training and transfer learning, where the evaluation metric focuses on performance against a single dataset (i.e., $\mathbf{H}_{M}$). Moreover, the multi-task nature of continual learning introduces unique challenges in handling the bias iterates and variance iterates; we refer to the proofs in the Appendix for more details.

Given that our analysis, similar to theirs, characterizes bounds with the full eigenspectrum of the data covariance matrix, our derived results match their findings in several aspects: $\mathbf{1)}$ The cutoff index $k_{m}^{*}$ is uniquely determined for each task $m$ in continual learning, akin to the one in Zou et al. 2021; Wu et al. 2022b, where they identify corresponding indices $k_{\text{training}}^{*}$ and $k_{\text{test}}^{*}$. $\mathbf{2)}$ The projection terms $\Gamma_{(p,q)}^{i}$ and $\bm{\Gamma}_{p}^{q}$ also occur in transfer learning (Wu et al., 2022b), showing how previous iterations/past learning are projected onto future updates.

Previous work (Evron et al., 2022) also explored the dynamics of forgetting through the perspective of projection. We first revisit the findings presented by Evron et al. 2022. Considering a scenario where the number of iterations $N=1$, the update rule in their analysis can be reformulated as follows:

$$\mathbf{w}_{m}-\mathbf{w}^{*}=(\mathbf{I}-\eta_{m}\mathbf{x}_{m}\mathbf{x}_{m}^{\top})(\mathbf{w}_{m-1}-\mathbf{w}^{*}), \tag{8}$$

where they incorporate the noiseless model assumption that $y_{m}=\mathbf{x}_{m}^{\top}\mathbf{w}^{*}$. As a result, the forgetting in Evron et al. 2022 satisfies

$$
\begin{aligned}
G(M) &= \frac{1}{M}\sum_{m=1}^{M}\|\mathbf{x}_{m}(\mathbf{w}_{M}-\mathbf{w}^{*})\|_{2}^{2},\quad\text{given}\quad\|\mathbf{x}_{m}\|_{2}\leq 1 \\
&\leq \frac{1}{M}\sum_{m=1}^{M}\|(\mathbf{I}-\eta_{m}\mathbf{x}_{M}\mathbf{x}_{M}^{\top})\ldots(\mathbf{I}-\eta_{m}\mathbf{x}_{1}\mathbf{x}_{1}^{\top})(\mathbf{w}_{0}-\mathbf{w}^{*})\|_{2}^{2},
\end{aligned}
$$

indicating that the forgetting dynamic can be determined by the projection of $(\mathbf{I}-\eta_{m}\mathbf{x}_{m}\mathbf{x}_{m}^{\top})$, where $\eta_{m}=\|\mathbf{x}_{m}\|^{-2}$. However, their study differs from ours in several key respects. $\mathbf{1)}$ The inherent model noise: Evron et al. 2022 consider a noiseless model, which results in the absence of an additional iterative term $\mathbf{x}_{m}\cdot z_{m}$ related to noise in Equation 8. This omission leads to a lack of accumulative variance error in the evaluation of forgetting performance (i.e., $\text{err}_{\text{var}}$ in our analysis). It is worth noting that in numerous learning problems, the variance error often plays a dominant role in the total error (Jain et al., 2018; Zou et al., 2021; Wu et al., 2022b). $\mathbf{2)}$ The bounded norm $\|\mathbf{x}\|_{2}$: the bounded-norm assumption omits the interaction with projection effects, which is crucial in our analysis and appears as the factor $\Lambda^{i}$ in Theorem 3.1 and Theorem 3.2. $\mathbf{3)}$ Last iterate SGD results: Evron et al. 2022 show that, with a $\|\mathbf{x}_{m}\|_{2}^{-2}$ step size, their worst-case expected forgetting becomes a dimension-dependent bound of $O(d/M)$. This analysis, conducted under the overparameterized regime, suggests the occurrence of catastrophic forgetting. In contrast, our results, as discussed earlier, offer a different perspective, suggesting the possibility of achieving a vanishing forgetting bound in overparameterized settings when certain conditions are met.
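The projection view of Equation 8 can be checked numerically. The following is a minimal sketch, assuming a noiseless model, one sample per task, and the step size $\eta_{m}=\|\mathbf{x}_{m}\|^{-2}$; the dimension, number of tasks, random seed, and variable names are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 50, 10                                   # hypothetical dimension and number of tasks
w_star = rng.normal(size=d)                     # shared ground-truth parameter
X = rng.normal(size=(M, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x_m||_2 <= 1

w = np.zeros(d)                                 # initial iterate w_0
for x in X:
    eta = 1.0 / (x @ x)                         # eta_m = ||x_m||^{-2}
    P = np.eye(d) - eta * np.outer(x, x)        # I - eta_m x_m x_m^T
    assert np.allclose(P @ P, P)                # idempotent, i.e. an orthogonal projection
    w = w_star + P @ (w - w_star)               # Equation 8 with noiseless labels

# Forgetting G(M): average squared residual of the final iterate over all tasks.
forgetting = np.mean((X @ (w - w_star)) ** 2)
print(forgetting)
```

With this step size each update projects the error onto the orthogonal complement of $\mathbf{x}_{m}$, so the most recent task is fit exactly while earlier tasks may be partially forgotten.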

We note that Lin et al. 2023 also investigate the relationship between catastrophic forgetting and factors such as task sequence (order) and dimensionality. However, their results tend to be vacuous in the under-parameterized setting, since the matrix $\mathbf{X}_{m}\mathbf{X}_{m}^{\top}$, where $\mathbf{X}_{m}$ is the data matrix for task $m$, is non-invertible there, so the inverse $(\mathbf{X}_{m}\mathbf{X}_{m}^{\top})^{-1}$ used by the minimum norm solution does not exist, as we discussed earlier in Section 2. Due to space constraints, a more extensive discussion is provided in Appendix D.
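The non-invertibility follows from a rank count. The sketch below assumes that $\mathbf{X}_{m}\in\mathbb{R}^{n\times d}$ stores the $n$ samples as rows (a convention adopted here for illustration); with $n>d$, the $n\times n$ matrix $\mathbf{X}_{m}\mathbf{X}_{m}^{\top}$ has rank at most $d$ and is therefore singular. The sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10                       # more samples than features (under-parameterized)
X = rng.normal(size=(n, d))          # rows are samples (assumed convention)

gram = X @ X.T                       # the n x n matrix X X^T
rank = np.linalg.matrix_rank(gram)
print(rank, gram.shape[0])           # the rank is at most d, strictly below n
```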

Figure 1: Impact of Task Sequence and Algorithmic Parameters on Forgetting Behavior with Linear Regression Model and Deep Neural Networks. This figure presents the effect of task sequence order and algorithmic parameters (data size, dimensionality, and step size) on the forgetting behavior observed in linear regression models (Figures (a)-(f)) and deep neural networks (Figures (g)-(l)). Figures (a), (b), (i), and (j) illustrate how varying data sizes impact forgetting behavior for different task sequences, while Figures (c), (d), (k), and (l) demonstrate the effect of changing dimensionality on forgetting. Lastly, Figures (e)-(h) demonstrate the influence of step size on the rate of forgetting across different model configurations.

4.3 The Impact of Task Ordering and Parameters on Forgetting

In the upcoming discussion, we will present theoretical insights derived from our results.

Notice that the bounds in Theorem 3.1 and Theorem 3.2 contain two crucial factors: the effective dimension $D^{\text{eff}}$ and the covariance accumulation $\hat{\Phi}_{1}^{m-1}$/$\Phi_{1}^{m-1}$. We first discuss the effective dimension. Each $D^{\text{eff}}$ consists of a projection term $\Gamma^{i}_{(m,M)}$ and the eigenvalues $\lambda_{m}^{i}$, with $\Lambda^{i}$ serving as the constant. It can be observed that when the data size $N$ approaches infinity, the projection term converges to $\frac{1}{e^{M-m}}$, implying that the eigenvalue will predominantly dictate the larger effective dimension with respect to $\frac{\lambda_{m}^{i}}{e^{M-m}}$. This observation highlights the substantial influence of eigenvalues on task sequence in continual learning. Specifically, it shows that when the data size is sufficiently large, task sequences in which tasks associated with larger eigenvalues in their population data covariance matrix are trained later exhibit more forgetting. Additionally, if the step size is appropriately small, the projection term stabilizes to a constant less than 1, leading to outcomes similar to those in the first scenario. It is noteworthy that these insights cannot be derived from existing analyses due to their restrictive assumptions, such as the Gaussian data distribution in Lee et al. 2021; Asanuma et al. 2021; Lin et al. 2023 and the minimum norm solution in Evron et al. 2022; Lin et al. 2023; Swartworth et al. 2023.
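The limit $\frac{1}{e^{M-m}}$ has the shape of the standard limit $(1-1/N)^{N(M-m)}\to e^{-(M-m)}$. As a quick numerical check, the sketch below uses a hypothetical proxy for the projection term, namely $M-m$ factors of the form $(1-1/N)^{N}$. This proxy is an assumption made for illustration only; the exact form of $\Gamma^{i}_{(m,M)}$ is the one given in the theorem statements.

```python
import math

def projection_proxy(n_samples, gap):
    """Hypothetical proxy: `gap` = M - m factors of (1 - 1/N)^N."""
    return (1.0 - 1.0 / n_samples) ** (n_samples * gap)

gap = 2                                      # hypothetical value of M - m
for n in (10, 100, 10_000):
    # The proxy approaches exp(-gap) as the data size grows.
    print(n, projection_proxy(n, gap), math.exp(-gap))
```

Under this proxy, the weight attached to task $m$ decays exponentially in the number of tasks trained after it, which is why the eigenvalues of later tasks carry more weight in the effective dimension.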

The covariance accumulation term, $\hat{\Phi}_{1}^{m-1}/\Phi_{1}^{m-1}$, which includes the covariance matrices $\mathbf{H}_{j\leq m}$ and the step size $\eta$, plays a crucial role in demonstrating how previously acquired information is retained and influences the model’s adaptability to new tasks. Notably, there is an interesting contradiction between the optimal accumulation order within $\hat{\Phi}_{1}^{m-1}/\Phi_{1}^{m-1}$ and that of the projection term $\Gamma_{(m,M)}^{i}$. Specifically, an earlier occurrence of $\mathbf{H}_{j}$ with larger expected eigenvalues tends to increase the degree of forgetting. Meanwhile, an important observation is that if the step size is sufficiently small, the impact of the covariance accumulation term becomes less significant. This interplay between the effective dimension and covariance accumulation elucidates the complexities inherent in continual learning scenarios.

5 Empirical Simulation

In this section, we conduct experiments using synthetic data to validate our theoretical results and shed light on the intricate interplay between eigenvalues, step size, and dimensionality.

Experimental Setup. In our study, we designed three distinct tasks, denoted as Tasks 1, 2, and 3, each with a different feature space. During the initial simulations, the eigenvalues of the feature covariance for Tasks 1, 2, and 3 were set according to $\lambda_{i}=i^{-3}$, $\lambda_{i}=i^{-2}$, and $\lambda_{i}=i^{-1}$, respectively. To mimic real-world data imperfections, Gaussian noise with a standard deviation of 0.1 was added to the labels. We assessed the impact of task sequence on the model’s tendency to forget by evaluating six different task orders: [1, 2, 3], [2, 1, 3], [1, 3, 2], [3, 1, 2], [2, 3, 1], and [3, 2, 1].
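A setup of this kind could be generated as follows. This is a minimal sketch under stated assumptions: Gaussian features with a diagonal covariance, a shared ground-truth vector `w_star`, and the sizes `dim` and `n_samples` are all hypothetical choices; only the eigenvalue decay rates, the label noise level, and the six task orders come from the description above.

```python
import itertools
import numpy as np

def make_task(decay, n_samples, dim, w_star, rng, noise_std=0.1):
    """Sample features with covariance eigenvalues i^{-decay} and noisy linear labels."""
    eigs = np.arange(1, dim + 1, dtype=float) ** (-decay)
    X = rng.normal(size=(n_samples, dim)) * np.sqrt(eigs)   # diagonal covariance (assumption)
    y = X @ w_star + noise_std * rng.normal(size=n_samples)
    return X, y, eigs

rng = np.random.default_rng(0)
dim, n_samples = 10, 200                       # hypothetical sizes
w_star = np.ones(dim) / np.sqrt(dim)           # hypothetical shared ground truth
decays = {1: 3.0, 2: 2.0, 3: 1.0}              # Task 1: i^-3, Task 2: i^-2, Task 3: i^-1
tasks = {t: make_task(a, n_samples, dim, w_star, rng) for t, a in decays.items()}
orders = list(itertools.permutations([1, 2, 3]))   # the six task orders
print(orders)
```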

5.1 Linear Regression

Training and Evaluation. For this experiment, a linear regression model was trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.01 or 0.001. The model was tested in both low-dimensional (10 input features) and high-dimensional (1000 input features) settings. Each task sequence underwent training with various data sizes, ranging from 100 to 950 in increments of 50, and each task was trained for five epochs. The performance of the model was evaluated on each task to calculate the average excess risk (Equation 1), quantifying the degree of forgetting the model experienced.
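The training and evaluation loop can be sketched as below. This is a hedged illustration, not the exact experimental code: the single-sample SGD update, the diagonal covariances, the shared ground truth `w_star`, and the form of the excess risk, taken here as $\frac{1}{2}(\mathbf{w}-\mathbf{w}^{*})^{\top}\mathbf{H}_{m}(\mathbf{w}-\mathbf{w}^{*})$ averaged over tasks, are assumptions. The learning rate, the number of epochs, and the noise level follow the description above.

```python
import numpy as np

def sgd_on_task(w, X, y, lr, epochs, rng):
    """Single-sample SGD on the squared loss (assumed update form)."""
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

def avg_excess_risk(w, w_star, spectra):
    """Mean over tasks of 0.5 * (w - w*)^T H_m (w - w*) for diagonal H_m (assumed form)."""
    return float(np.mean([0.5 * np.sum(eigs * (w - w_star) ** 2) for eigs in spectra]))

rng = np.random.default_rng(0)
dim, n, lr, epochs = 10, 200, 0.01, 5              # low-dimensional setting
w_star = np.ones(dim) / np.sqrt(dim)               # hypothetical shared ground truth
idx = np.arange(1, dim + 1, dtype=float)
spectra = {1: idx ** -3.0, 2: idx ** -2.0, 3: idx ** -1.0}

initial_risk = avg_excess_risk(np.zeros(dim), w_star, spectra.values())
w = np.zeros(dim)
for t in [1, 2, 3]:                                # one of the six task orders
    X = rng.normal(size=(n, dim)) * np.sqrt(spectra[t])
    y = X @ w_star + 0.1 * rng.normal(size=n)      # label noise with std 0.1
    w = sgd_on_task(w, X, y, lr, epochs, rng)
risk = avg_excess_risk(w, w_star, spectra.values())
print(initial_risk, risk)
```

Repeating the loop over all six task orders and over a grid of data sizes would give curves comparable in spirit to those reported for the linear regression model.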

Impact of Eigenvalue Sequencing. The observations from Figure 1(a) and Figure 1(c) reveal the significant impact of eigenvalue sequencing on forgetting behavior in the under-parameterized regime. Notably, task sequences arranged such that tasks with larger eigenvalues (i.e., Task 3 in our case, characterized by $\lambda_{i}=i^{-1}$) are trained later in the learning process tend to result in increased forgetting. This empirical finding aligns well with our theoretical analysis (the term $\frac{\lambda_{m}^{i}}{e^{M-m}}$ discussed in Section 4.3). In an under-parameterized setting, or when the eigenvalues decay rapidly, the effective dimension, which is crucial in determining the model’s forgetting performance, is largely influenced by the eigenvalues. Such a pattern is intuitive: when tasks with larger eigenvalues are trained later, the model might overfit these tasks due to their high variance.

Impact of Dimensionality Our results, depicted in Figure 1(c) and Figure 1(d), show that in under-parameterized scenarios, performance remains relatively unaffected by an increase in dimensionality. However, in over-parameterized settings, the model tends to exhibit increased forgetting as dimensionality rises, particularly when the data size is kept constant. This highlights the varying impact of dimensionality on model performance in different parameterization contexts. In higher-dimensional settings, the influence of the projection term $\Gamma_{(p,q)}^{i}$, as shown in Theorem 3.1, diminishes in comparison to the impact of $\Lambda^{i}$ and $\lambda_{i}$. Consequently, as the number of features in the model increases, the sequence in which tasks are presented becomes less significant in determining the model’s forgetting behavior. This shift implies that, in high-dimensional scenarios, the inherent complexity and the distribution of eigenvalues of the feature space play a more critical role than the sequence of tasks, influencing the model’s learning and retention capabilities.

Impact of Step-size Our results, depicted in Figure 1(e) and Figure 1(f), reveal that a smaller step size effectively reduces forgetting in various task sequences and across different dimensionalities. This trend is especially noticeable in high-dimensional feature spaces, where a reduced step size markedly lowers the rate of forgetting. This observation is in line with the theoretical insights provided in Theorem 3.1 and Theorem 3.2, as smaller step sizes may lead to more refined updates during training, allowing the model to incrementally adjust to new tasks while preserving knowledge from previous ones.

5.2 Implication on DNNs

We next adopt the same data generation and task setup as outlined in Section 5.1, but shift our focus to a neural network model. This model comprises an input layer, a hidden layer with ten neurons, and an output layer, and it undergoes a training process akin to that used for linear regression.

Impact of Eigenvalue Sequencing In our studies with Deep Neural Networks (DNNs), we still find that task sequences ending with tasks that have larger eigenvalues tend to exhibit increased forgetting, especially in under-parameterized settings, similar to linear regression models. This indicates that the tendency to overfit observed in linear models, particularly when tasks with larger eigenvalues are trained later in the sequence, may occur in DNNs as well.

Impact of Dimensionality Our results also reveal consistent behavior between DNNs and linear regression with respect to dimensionality. In under-parameterized scenarios (Figure 1(k)), forgetting remains stable despite increased dimensionality, while in over-parameterized settings (Figure 1(l)), higher dimensionality leads to more forgetting when data size is fixed. However, the adverse effects of higher dimensions can be alleviated by expanding the dataset size, as demonstrated in Figure 1(j). This is a notable contrast to linear regression, and it suggests that the complex structures of DNNs are better suited to manage and learn from high-dimensional data in continual learning scenarios. The different behaviors observed between DNNs and linear regression models are a potentially interesting direction for future work.

Impact of Step Size Our results, depicted in Figure 1(e) and Figure 1(f), indicate that in under-parameterized settings, a smaller step size significantly lessens the influence of task sequences on forgetting, while in models with high-dimensional features, forgetting can be mitigated even without adjusting the step size.

6 Conclusion

In this work, we contribute to the understanding of catastrophic forgetting in continual learning via a multi-step SGD algorithm. Our theoretical analysis establishes bounds that illustrate the impact of various factors on forgetting, such as the data covariance matrix spectrum, step size, data size, and dimensionality, which could not be fully captured in previous studies due to their restrictive assumptions. This theoretical understanding is further substantiated through simulations conducted on linear regression models and Deep Neural Networks, which corroborate our theoretical insights.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments

The research of Meng Ding and Jinhui Xu was supported in part by KAUST through grant CRG10-4663.2. Di Wang was supported in part by the baseline funding BAS/1/1689-01-01, funding from the CRG grant URF/1/4663-01-01, REI/1/5232-01-01, REI/1/5332-01-01, FCC/1/1976-49-01 from CBRC of King Abdullah University of Science and Technology (KAUST). Di Wang was also supported by the funding RGC/3/4816-09-01 of the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

References

  • Aljundi et al. (2018) Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pp.  139–154, 2018.
  • Asanuma et al. (2021) Asanuma, H., Takagi, S., Nagano, Y., Yoshida, Y., Igarashi, Y., and Okada, M. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. Journal of the Physical Society of Japan, 90(10):104001, 2021.
  • Bennani et al. (2020) Bennani, M. A., Doan, T., and Sugiyama, M. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv preprint arXiv:2006.11942, 2020.
  • Chaudhry et al. (2018) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
  • Chen et al. (2020) Chen, X., Liu, Q., and Tong, X. T. Dimension independent generalization error by stochastic gradient descent. arXiv preprint arXiv:2003.11196, 2020.
  • Cortes & Mohri (2014) Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2019) Cortes, C., Mohri, M., and Medina, A. M. Adaptation based on generalized discrepancy. The Journal of Machine Learning Research, 20(1):1–30, 2019.
  • Défossez & Bach (2015) Défossez, A. and Bach, F. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pp.  205–213. PMLR, 2015.
  • Dieuleveut et al. (2017) Dieuleveut, A., Flammarion, N., and Bach, F. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
  • Doan et al. (2021) Doan, T., Bennani, M. A., Mazoure, B., Rabusseau, G., and Alquier, P. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. In International Conference on Artificial Intelligence and Statistics, pp.  1072–1080. PMLR, 2021.
  • Evron et al. (2022) Evron, I., Moroshko, E., Ward, R., Srebro, N., and Soudry, D. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pp.  4028–4079. PMLR, 2022.
  • Farajtabar et al. (2020) Farajtabar, M., Azizan, N., Mott, A., and Li, A. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp.  3762–3773. PMLR, 2020.
  • Hanneke & Kpotufe (2020) Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning, 2020.
  • Hao et al. (2023) Hao, J., Ji, K., and Liu, M. Bilevel coreset selection in continual learning: A new formulation and algorithm. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Jain et al. (2017) Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., Pillutla, V. K., and Sidford, A. A markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv preprint arXiv:1710.09430, 2017.
  • Jain et al. (2018) Jain, P., Kakade, S., Kidambi, R., Netrapalli, P., and Sidford, A. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of machine learning research, 18, 2018.
  • Karczmarz (1937) Karczmarz, S. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat., pp. 355–357, 1937.
  • Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Kpotufe & Martinet (2018) Kpotufe, S. and Martinet, G. Marginal singularity, and the benefits of labels in covariate-shift. In Conference On Learning Theory, pp.  1882–1886. PMLR, 2018.
  • Lee et al. (2021) Lee, S., Goldt, S., and Saxe, A. Continual learning in the teacher-student setup: Impact of task similarity. In International Conference on Machine Learning, pp. 6109–6119. PMLR, 2021.
  • Lin et al. (2022) Lin, S., Yang, L., Fan, D., and Zhang, J. Trgp: Trust region gradient projection for continual learning. arXiv preprint arXiv:2202.02931, 2022.
  • Lin et al. (2023) Lin, S., Ju, P., Liang, Y., and Shroff, N. Theory on forgetting and generalization of continual learning. arXiv preprint arXiv:2302.05836, 2023.
  • Liu & Liu (2022) Liu, H. and Liu, H. Continual learning with recursive gradient optimization. arXiv preprint arXiv:2201.12522, 2022.
  • Ma et al. (2023) Ma, C., Pathak, R., and Wainwright, M. J. Optimally tackling covariate shift in rkhs-based nonparametric regression, 2023.
  • McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
  • Mohri & Medina (2012) Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions, 2012.
  • Pan & Yang (2009) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Pathak et al. (2022) Pathak, R., Ma, C., and Wainwright, M. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, pp. 17517–17530. PMLR, 2022.
  • Riemer et al. (2018) Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
  • Saha et al. (2021) Saha, G., Garg, I., and Roy, K. Gradient projection memory for continual learning. arXiv preprint arXiv:2103.09762, 2021.
  • Serra et al. (2018) Serra, J., Suris, D., Miron, M., and Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pp. 4548–4557. PMLR, 2018.
  • Shin et al. (2017) Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
  • Sugiyama & Kawanabe (2012) Sugiyama, M. and Kawanabe, M. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press, 2012.
  • Swartworth et al. (2023) Swartworth, W. J., Needell, D., Ward, R., Kong, M., and Jeong, H. Nearly optimal bounds for cyclic forgetting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X25L5AjHig.
  • Wu et al. (2022a) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. In International Conference on Machine Learning, pp. 24280–24314. PMLR, 2022a.
  • Wu et al. (2022b) Wu, J., Zou, D., Braverman, V., Gu, Q., and Kakade, S. The power and limitation of pretraining-finetuning for linear regression under covariate shift. Advances in Neural Information Processing Systems, 35:33041–33053, 2022b.
  • Yang et al. (2021) Yang, L., Lin, S., Zhang, J., and Fan, D. Grown: Grow only when necessary for continual learning. arXiv preprint arXiv:2110.00908, 2021.
  • Yoon et al. (2017) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
  • Yoon et al. (2019) Yoon, J., Kim, S., Yang, E., and Hwang, S. J. Scalable and order-robust continual learning with additive parameter decomposition. arXiv preprint arXiv:1902.09432, 2019.
  • Zou et al. (2021) Zou, D., Wu, J., Braverman, V., Gu, Q., and Kakade, S. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory, pp.  4633–4635. PMLR, 2021.

Appendix A Support Lemmas

Notations

For two matrices $\mathbf{A}$ and $\mathbf{B}$, their inner product is defined as $\langle\mathbf{A},\mathbf{B}\rangle:=\operatorname{tr}(\mathbf{A}^{\top}\mathbf{B})$. For each task $m\in[M]$, we define the following linear operators:

$$
\begin{gathered}
\mathcal{I}=\mathbf{I}\otimes\mathbf{I},\quad \mathcal{M}_{m}=\mathbb{E}[\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}\otimes\mathbf{x}_{m}],\quad \widetilde{\mathcal{M}}_{m}=\mathbf{H}_{m}\otimes\mathbf{H}_{m},\\
\mathcal{T}_{m}=\mathbf{H}_{m}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{H}_{m}-\eta\mathcal{M}_{m},\quad \widetilde{\mathcal{T}}_{m}=\mathbf{H}_{m}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{H}_{m}-\eta\mathbf{H}_{m}\otimes\mathbf{H}_{m}.
\end{gathered}
$$

We use the notation $\mathcal{O}\circ\mathbf{A}$ to denote the operator $\mathcal{O}$ acting on a symmetric matrix $\mathbf{A}$. For example, with these definitions, we have that for a symmetric matrix $\mathbf{A}$,

$$
\begin{gathered}
\mathcal{I}\circ\mathbf{A}=\mathbf{A},\quad \mathcal{M}_{m}\circ\mathbf{A}=\mathbb{E}[(\mathbf{x}_{m}^{\top}\mathbf{A}\mathbf{x}_{m})\mathbf{x}_{m}\mathbf{x}_{m}^{\top}],\quad \widetilde{\mathcal{M}}_{m}\circ\mathbf{A}=\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m},\\
(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m})\circ\mathbf{A}=(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{A}(\mathbf{I}-\eta\mathbf{H}_{m}),\\
(\mathcal{I}-\eta\mathcal{T}_{m})\circ\mathbf{A}=\mathbb{E}[(\mathbf{I}-\eta\mathbf{x}_{m}\mathbf{x}_{m}^{\top})\mathbf{A}(\mathbf{I}-\eta\mathbf{x}_{m}\mathbf{x}_{m}^{\top})].
\end{gathered}
$$
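The identity for $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m}$ can be checked numerically. The sketch below is illustrative only: it draws a random symmetric PSD matrix to stand in for $\mathbf{H}_{m}$, implements the operators as plain matrix functions (the names `M_tilde` and `T_tilde` are hypothetical), and compares both sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 4, 0.1

# Hypothetical task covariance H_m (symmetric PSD) and a symmetric PSD test matrix A.
G = rng.standard_normal((d, d))
H = G @ G.T / d
S = rng.standard_normal((d, d))
A = S @ S.T

def M_tilde(A):
    """(H ⊗ H) ∘ A = H A H."""
    return H @ A @ H

def T_tilde(A):
    """(H ⊗ I + I ⊗ H - eta * H ⊗ H) ∘ A = H A + A H - eta * H A H."""
    return H @ A + A @ H - eta * M_tilde(A)

I = np.eye(d)
lhs = A - eta * T_tilde(A)               # (I - eta * T_tilde) ∘ A
rhs = (I - eta * H) @ A @ (I - eta * H)  # claimed closed form

print(np.allclose(lhs, rhs))
# The result is a congruence of a PSD matrix, so it should stay PSD.
print(np.linalg.eigvalsh(lhs).min() >= -1e-10)
```

Expanding the right-hand side gives $\mathbf{A}-\eta\mathbf{H}_{m}\mathbf{A}-\eta\mathbf{A}\mathbf{H}_{m}+\eta^{2}\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}$, which is exactly what the operator form computes.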

It can be readily understood that the following properties are satisfied:

Lemma A.1 (Zou et al., 2021).

An operator $\mathcal{O}$, when defined on symmetric matrices, is termed a Positive Semi-Definite (PSD) mapping if $\mathbf{A}\succeq 0$ implies $\mathcal{O}\circ\mathbf{A}\succeq 0$. Consequently, for each task $m\in[M]$ we have:

  1. $\mathcal{M}_{m}$ and $\widetilde{\mathcal{M}}_{m}$ are both PSD mappings.

  2. $\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m}$ and $\widetilde{\mathcal{T}}_{m}-\mathcal{T}_{m}$ are both PSD mappings.

  3. $\mathcal{I}-\eta\mathcal{T}_{m}$ and $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{m}$ are both PSD mappings.

  4. If $0<\eta<1/\lambda_{m}^{1}$, then $\widetilde{\mathcal{T}}_{m}^{-1}$ exists, and is a PSD mapping.

  5. If $0<\eta<1/(\alpha_{m}\operatorname{tr}(\mathbf{H}_{m}))$, then $\mathcal{T}_{m}^{-1}\circ\mathbf{A}$ exists for PSD matrix $\mathbf{A}$, and $\mathcal{T}_{m}^{-1}$ is a PSD mapping.
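As a concrete illustration of the second property, consider the special case of Gaussian features, $\mathbf{x}_{m}\sim\mathcal{N}(\mathbf{0},\mathbf{H}_{m})$. This distributional choice is an assumption made only for this example and is not required by the lemma. For a symmetric matrix $\mathbf{A}$, Isserlis' theorem gives

$$
\mathcal{M}_{m}\circ\mathbf{A}=\mathbb{E}[(\mathbf{x}_{m}^{\top}\mathbf{A}\mathbf{x}_{m})\mathbf{x}_{m}\mathbf{x}_{m}^{\top}]=2\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}+\operatorname{tr}(\mathbf{H}_{m}\mathbf{A})\mathbf{H}_{m},
$$

so that, for $\mathbf{A}\succeq 0$,

$$
(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m})\circ\mathbf{A}=\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}+\operatorname{tr}(\mathbf{H}_{m}\mathbf{A})\mathbf{H}_{m}\succeq 0 .
$$

Since the two operators $\mathcal{T}_{m}$ and $\widetilde{\mathcal{T}}_{m}$ differ only in their last term, $\widetilde{\mathcal{T}}_{m}-\mathcal{T}_{m}=\eta(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m})$, and the same argument shows that this difference is a PSD mapping in the Gaussian case.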

Then for the SGD iterates, we can consider their associated bias iterates and variance iterates:

\begin{align}
&\begin{cases}
\mathbf{B}_{0}=(\mathbf{w}_{0}-\mathbf{w}^{*})(\mathbf{w}_{0}-\mathbf{w}^{*})^{\top},\\
\mathbf{B}_{(m-1)N+t+1}=(\mathcal{I}-\eta\mathcal{T}_{m}(\eta))\circ\mathbf{B}_{(m-1)N+t};
\end{cases} \tag{9}\\
&\begin{cases}
\mathbf{C}_{0}=\mathbf{0},\\
\mathbf{C}_{(m-1)N+t+1}=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}};
\end{cases} \tag{10}
\end{align}

where $t=0,\ldots,N-1$ and $m=1,\ldots,M$.
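To make the variance recursion in Equation 10 concrete, the following Python sketch iterates it numerically across three tasks. It rests on several assumptions that are not part of our analysis: Gaussian features $\mathbf{x}_{m}\sim\mathcal{N}(\mathbf{0},\mathbf{H}_{m})$ (so that $\mathcal{M}_{m}\circ\mathbf{A}=2\mathbf{H}_{m}\mathbf{A}\mathbf{H}_{m}+\operatorname{tr}(\mathbf{H}_{m}\mathbf{A})\mathbf{H}_{m}$), a noise covariance $\bm{\Sigma}_{\mathbf{H}_{m}}=\sigma^{2}\mathbf{H}_{m}$, and illustrative values for the dimension, step size, and spectra. The names `M_gauss` and `variance_step` are hypothetical. The sketch also compares the largest eigenvalue of the iterates with the bound $\frac{\eta\sigma^{2}}{1-\eta R^{2}}$ stated in Lemma B.2 below, using $R^{2}=3\max_{m}\operatorname{tr}(\mathbf{H}_{m})$, which is sufficient for Assumption B.1 in the Gaussian case.

```python
import numpy as np

d, N, eta, sigma = 5, 50, 0.1, 0.1
idx = np.arange(1, d + 1, dtype=float)
# Hypothetical task covariances H_m = diag(i^-p), reusing the spectra of Section 5.
H_list = [np.diag(idx ** (-p)) for p in (3, 2, 1)]

def M_gauss(H, A):
    """M_m ∘ A for x ~ N(0, H): E[(x^T A x) x x^T] = 2 H A H + tr(H A) H."""
    return 2 * H @ A @ H + np.trace(H @ A) * H

def variance_step(H, C):
    """One step of the variance recursion with noise covariance sigma^2 * H."""
    TC = H @ C + C @ H - eta * M_gauss(H, C)  # T_m ∘ C
    return C - eta * TC + eta**2 * sigma**2 * H

# For Gaussian x, E[x x^T x x^T] = 2 H^2 + tr(H) H, which is at most 3 tr(H) H.
R2 = max(3 * np.trace(H) for H in H_list)
bound = eta * sigma**2 / (1 - eta * R2)

C = np.zeros((d, d))
worst = 0.0
for H in H_list:            # tasks m = 1, ..., M
    for _ in range(N):      # steps t = 0, ..., N - 1
        C = variance_step(H, C)
        worst = max(worst, np.linalg.eigvalsh(C).max())

print(worst <= bound)
```

The step size is chosen so that $\eta R^{2}<1$; with a larger step size the bound is no longer defined and the recursion may diverge.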

Lemma A.2 (Bias-variance decomposition).

Suppose that Assumption 2.4 holds. Then we have:

$$
\mathbb{E}[\operatorname{ExcessRisk}(\mathbf{w}_{MN})]=\frac{1}{2}\langle\mathbf{H},\mathbf{B}_{MN}\rangle+\frac{1}{2}\langle\mathbf{H},\mathbf{C}_{MN}\rangle.
$$

Appendix B Variance Error

B.1 Upper Bound

The assumption presented below can be inferred from Assumption 2.3 by setting $\mathbf{A}=\mathbf{I}$, given that $R^{2}=\max\{\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})\}_{m=1}^{M}$.

Assumption B.1 (Relaxed version).

For each task $m$, there exists a constant $R\geq 0$ such that:

$$
\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{m}}[\mathbf{x}\mathbf{x}^{\top}\mathbf{x}\mathbf{x}^{\top}]\preceq R^{2}\mathbf{H}_{m}.
$$
Lemma B.2.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$, then it holds that:

$$
\mathbf{C}_{t}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I},\quad\text{for every } t=0,1,\ldots,MN.
$$
Proof.

This lemma is derived directly from the Lemmas in (Jain et al., 2018; Zou et al., 2021). To ensure completeness, we include a proof as follows.

We prove the lemma via induction. Initially, for $t=0$, it is evident that $\mathbf{C}_{0}=\mathbf{0}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}$. Now, assuming that $\mathbf{C}_{t}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}$, let us examine $\mathbf{C}_{t+1}$ in light of Equation 10. When $0\leq t\leq N-1$, for each task $m$, it implies:

\begin{align}
\mathbf{C}_{(m-1)N+t+1} &=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}} \nonumber\\
&\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}\cdot(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{I}+\eta^{2}\sigma^{2}\mathbf{H}_{m} \tag{11}\\
&\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-2\eta\mathbf{H}_{m}+\eta^{2}R^{2}\mathbf{H}_{m})+\eta^{2}\sigma^{2}\mathbf{H}_{m} \nonumber\\
&\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot\mathbf{I}. \nonumber
\end{align}
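One way to verify the last inequality, assuming $\eta R^{2}<1$ so that the prefactor is finite and positive, is to collect the $\mathbf{H}_{m}$ terms:

$$
\begin{aligned}
\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-2\eta\mathbf{H}_{m}+\eta^{2}R^{2}\mathbf{H}_{m})+\eta^{2}\sigma^{2}\mathbf{H}_{m}
&=\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}+\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\left(-2+\eta R^{2}+1-\eta R^{2}\right)\mathbf{H}_{m}\\
&=\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}-\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\mathbf{H}_{m}\\
&\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I},
\end{aligned}
$$

where the final step uses $\mathbf{H}_{m}\succeq 0$.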

Lemma B.3.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$, then it holds that:

$$
\mathbf{C}_{MN}\preceq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}+\mathbf{P}_{M},
$$

where $\mathbf{P}_{m}=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ and $\mathbf{C}_{N}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})$.

Proof.

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$:

$$
\begin{aligned}
\mathbf{C}_{(m-1)N+t+1} &= (\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}} \\
&\preceq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\mathcal{M}_{m}\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m} \\
&\preceq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}R^{2}\cdot\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot\mathbf{H}_{m}+\eta^{2}\sigma^{2}\mathbf{H}_{m} \\
&\preceq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\mathbf{H}_{m},
\end{aligned}
$$

where the penultimate inequality follows from Lemma B.2.

Hence, after $N$ iterations, we obtain the following bound for task $m$:

$$
\mathbf{C}_{(m-1)N+N}\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{(m-1)N}+\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}.
$$
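The unrolling step can be sanity-checked numerically. The following is a minimal sketch restricted to a single eigen-direction. It assumes that $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}$ acts as $(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{C}(\mathbf{I}-\eta\mathbf{H}_{m})$, which is the form used when $\mathbf{P}_{m}$ is expanded in this proof, so that along an eigenvalue $\lambda$ the contraction factor is $(1-\eta\lambda)^{2}$. It also treats the one-step bound as an equality. The function names and numerical values are hypothetical, chosen only for illustration.

```python
def iterate_recursion(c0, lam, eta, noise, n_steps):
    """Run c_{t+1} = (1 - eta*lam)^2 * c_t + noise for n_steps steps."""
    factor = (1.0 - eta * lam) ** 2
    c = c0
    for _ in range(n_steps):
        c = factor * c + noise
    return c


def unrolled_form(c0, lam, eta, noise, n_steps):
    """Closed form: factor^N * c0 + noise * sum_{t=0}^{N-1} factor^t."""
    factor = (1.0 - eta * lam) ** 2
    geometric_sum = sum(factor ** t for t in range(n_steps))
    return factor ** n_steps * c0 + noise * geometric_sum


# Hypothetical values (not from the paper); note eta * R2 < 1.
eta, lam, sigma2, R2, N = 0.1, 2.0, 1.0, 4.0, 50
# Scalar analogue of the additive term eta^2 sigma^2 / (1 - eta R^2) * H_m.
noise = eta ** 2 * sigma2 / (1.0 - eta * R2) * lam

direct = iterate_recursion(0.3, lam, eta, noise, N)
closed = unrolled_form(0.3, lam, eta, noise, N)
print(abs(direct - closed) < 1e-12)
```

In the matrix setting the same argument applies because the operator is linear and order-preserving on PSD matrices, so the one-step inequality can be applied $N$ times in succession.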

Now we consider the first task. Applying Lemma B.5 in (Zou et al., 2021) gives:

$$
\mathbf{C}_{N}\preceq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N}).
$$

By combining the aforementioned results and denoting $\mathbf{P}_{m}=\frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$, we obtain:

$$
\mathbf{C}_{MN} \preceq \prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}+\mathbf{P}_{M}.
$$

Based on Lemma A.2, the upper bound of the variance error can be expressed as follows:

$$
\begin{aligned}
\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{C}_{MN}\rangle
&\leq \underbrace{\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})\rangle}_{\text{variance term 1}} \\
&\quad+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}\rangle}_{\text{variance term 2}}+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}\rangle}_{\text{variance term 3}}.
\end{aligned}
\tag{12}
$$

Let us consider the variance terms separately.

$$
\begin{aligned}
\text{variance term 1}
&= \sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\rangle \\
&\leq \sum_{k=1}^{M}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\langle\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\mathbf{H}_{k},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{N})\rangle \\
&= \frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\sum_{i}\Bigl[\prod_{m=2}^{M}(1-\eta\lambda_{m}^{i})^{N}\lambda_{k}^{i}(1-(1-\eta\lambda_{1}^{i})^{N})\Bigr] \\
&\leq \frac{\eta\sigma^{2}}{1-\eta R^{2}}\Bigl(\sum_{i<k_{1}^{*}}\Gamma_{(2,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{1}^{*}}\Gamma_{(2,M)}^{i}\lambda_{1}^{i}\Lambda^{i}\Bigr),
\end{aligned}
\tag{13}
$$

where the last inequality uses the fact that $1-(1-\eta\lambda_{m}^{i})^{N}\leq\min\{1,\eta N\lambda_{m}^{i}\}$ holds for all $i\geq 1$.
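The elementary inequality used here can be checked on a grid. The sketch below writes $x=\eta\lambda_{m}^{i}$ and assumes $0\leq x\leq 1$, under which the bound by $Nx$ is Bernoulli's inequality $(1-x)^{N}\geq 1-Nx$ and the bound by $1$ is immediate. The grid and the values of $N$ are arbitrary illustrative choices.

```python
def gap(x, n):
    """Return min(1, n*x) - (1 - (1 - x)**n); expected nonnegative for 0 <= x <= 1."""
    return min(1.0, n * x) - (1.0 - (1.0 - x) ** n)


# Evaluate the gap on a uniform grid of x in [0, 1] for several n.
grid = [k / 200 for k in range(201)]
worst = min(gap(x, n) for x in grid for n in (1, 2, 5, 50, 500))

# A small tolerance absorbs floating-point rounding where the bound is tight.
print(worst >= -1e-12)
```

The bound is tight at $x=0$ and for $N=1$, which is why the check uses a tolerance rather than a strict comparison.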

Before turning to the second term, we first bound $\mathbf{P}_{m}$:

$$
\begin{aligned}
\mathbf{P}_{m} &= \frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m} \\
&= \frac{\eta^{2}\sigma^{2}}{1-\eta R^{2}}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\mathbf{H}_{m}(\mathbf{I}-\eta\mathbf{H}_{m})^{t} \\
&\preceq \frac{\eta\sigma^{2}}{1-\eta R^{2}}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}).
\end{aligned}
$$
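The last step above can be verified one eigen-direction at a time. The following worked computation is a sketch under the assumption that $x=\eta\lambda$ lies in $(0,1]$ for every eigenvalue $\lambda$ of $\mathbf{H}_{m}$ (the case $x=0$ is trivial, since both sides vanish). The common factor $\sigma^{2}/(1-\eta R^{2})$ is omitted.

```latex
% Per eigen-direction, with x = \eta\lambda assumed to lie in (0, 1]:
\begin{aligned}
\eta^{2}\lambda\sum_{t=0}^{N-1}(1-x)^{2t}
&= \eta^{2}\lambda\cdot\frac{1-(1-x)^{2N}}{1-(1-x)^{2}}
 = \eta\cdot\frac{\bigl(1-(1-x)^{N}\bigr)\bigl(1+(1-x)^{N}\bigr)}{2-x} \\
&\leq \eta\bigl(1-(1-x)^{N}\bigr),
\end{aligned}
% since (1-x)^{N} \leq 1-x for N \geq 1, hence 1+(1-x)^{N} \leq 2-x.
```

The second equality uses $1-(1-x)^{2}=x(2-x)$ together with $\eta^{2}\lambda/x=\eta$.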

Substituting the above into variance term 2, we have:

$$
\begin{aligned}
\text{variance term 2}
&= \sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}\rangle \\
&\leq \frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N})\rangle \\
&\leq \frac{\eta\sigma^{2}}{1-\eta R^{2}}\sum_{k=1}^{M}\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N})\rangle \\
&\leq \sum_{m=1}^{M-1}\frac{\eta\sigma^{2}}{1-\eta R^{2}}\Bigl(\sum_{i<k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i}\Bigr).
\end{aligned}
\tag{14}
$$

Similarly, for the last term, we have:

$$
\text{variance term 3}=\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}\rangle\leq\frac{\eta\sigma^{2}}{1-\eta R^{2}}\Bigl(\sum_{i<k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i}\Bigr).
\tag{15}
$$

B.2 Lower Bound

We now turn to the lower bound of the variance. As in the upper-bound case, the following lemma holds:

Lemma B.4.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$, then it holds that:

$$
\mathbf{C}_{MN} \succeq \prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}+\mathbf{P}_{M}^{\prime},
$$

where $\mathbf{P}_{m}^{\prime}=\eta^{2}\sigma^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$ and $\mathbf{C}_{N}\succeq\frac{\eta\sigma^{2}}{2}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N})$.

Proof.

In a similar fashion, we first examine the recursion of $\mathbf{C}$ from $t=0$ to $t=N-1$ for each task $m$:

$$
\begin{aligned}
\mathbf{C}_{(m-1)N+t+1} &= (\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\bm{\Sigma}_{\mathbf{H}_{m}} \\
&= (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m})\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m} \\
&\succeq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{C}_{(m-1)N+t}+\eta^{2}\sigma^{2}\mathbf{H}_{m},
\end{aligned}
$$

where we use the fact that $\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m}$ is a PSD mapping, as established by A.1.

Consequently, after $N$ iterations, the following bound holds for task $m$:

$$
\mathbf{C}_{(m-1)N+N}\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{(m-1)N}+\eta^{2}\sigma^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}.
$$
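The unrolling behind this bound can be spelled out as a short induction. The following is a sketch, assuming the identity $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{X}=(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{X}(\mathbf{I}-\eta\mathbf{H}_{m})$, which is the form used in the term-by-term computations of this appendix. The shorthand $\mathcal{A}_{m}$ is introduced here only for brevity and is not part of the original notation.

```latex
% Shorthand for this sketch only:
%   \mathcal{A}_m := \mathcal{I} - \eta\widetilde{\mathcal{T}}_{\mathbf{H}_m}(\eta),
%   \mathcal{A}_m \circ \mathbf{X} = (\mathbf{I}-\eta\mathbf{H}_m)\,\mathbf{X}\,(\mathbf{I}-\eta\mathbf{H}_m).
% Since \mathcal{A}_m is a congruence, it preserves the PSD order:
%   \mathbf{X} \succeq \mathbf{Y} \implies \mathcal{A}_m \circ \mathbf{X} \succeq \mathcal{A}_m \circ \mathbf{Y}.
\begin{align*}
\mathbf{C}_{(m-1)N+2}
  &\succeq \mathcal{A}_m \circ \mathbf{C}_{(m-1)N+1} + \eta^{2}\sigma^{2}\mathbf{H}_m \\
  &\succeq \mathcal{A}_m^{2} \circ \mathbf{C}_{(m-1)N}
     + \eta^{2}\sigma^{2}\left(\mathcal{A}_m \circ \mathbf{H}_m + \mathbf{H}_m\right), \\
  &\;\;\vdots \\
\mathbf{C}_{(m-1)N+N}
  &\succeq \mathcal{A}_m^{N} \circ \mathbf{C}_{(m-1)N}
     + \eta^{2}\sigma^{2}\sum_{t=0}^{N-1} \mathcal{A}_m^{t} \circ \mathbf{H}_m .
\end{align*}
```

Each step applies the one-step bound once and then uses order preservation to push the previous bound through $\mathcal{A}_{m}$.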

We now consider the first task. Applying Lemma C.2 of Zou et al. (2021) gives:

$$
\mathbf{C}_{N}\succeq\frac{\eta\sigma^{2}}{2}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N}).
$$

By combining the above results and denoting $\mathbf{P}_{m}^{\prime}=\eta^{2}\sigma^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}$, we obtain:

$$
\mathbf{C}_{MN}\succeq\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{C}_{N}+\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}+\mathbf{P}_{M}^{\prime},
$$

which completes the proof. ∎

Drawing from Lemma A.2, the lower bound of the variance error is expressed as follows:

$$
\begin{aligned}
\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{C}_{MN}\rangle
&\geq\underbrace{\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\Big\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N})\Big\rangle}_{\text{variance term 1}^{\prime}} \\
&\quad+\underbrace{\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}\Big\rangle}_{\text{variance term 2}^{\prime}}+\underbrace{\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{P}_{M}^{\prime}\rangle}_{\text{variance term 3}^{\prime}}.
\end{aligned}
\tag{16}
$$

Analogous to the approach for the upper bound, we will examine the terms one by one.

$$
\begin{aligned}
\text{variance term 1}^{\prime}
&=\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\Big\langle\mathbf{H}_{k},\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N})(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\Big\rangle \\
&=\sum_{k=1}^{M}\frac{\eta\sigma^{2}}{2}\Big\langle\prod_{m=2}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{1})^{2N})\Big\rangle \\
&=\frac{\eta\sigma^{2}}{2}\sum_{k=1}^{M}\sum_{i}\Big[\prod_{m=2}^{M}(1-\eta\lambda_{m}^{i})^{2N}\lambda_{k}^{i}(1-(1-\eta\lambda_{1}^{i})^{2N})\Big] \\
&\geq\frac{\eta\sigma^{2}}{2}\sum_{i}\Big[\prod_{m=2}^{M}(1-\eta\lambda_{m}^{i})^{2N}\Big(\sum_{k=1}^{M}\lambda_{k}^{i}\Big)(1-(1-\eta\lambda_{1}^{i})^{2N})\Big].
\end{aligned}
\tag{17}
$$
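The reduction in (17) from a trace inner product to a sum over eigenvalues can be checked numerically. The following is a minimal sketch, assuming the covariances $\mathbf{H}_{k}$ are diagonal, so that they commute and $\lambda_{k}^{i}$ is simply the $i$-th diagonal entry of $\mathbf{H}_{k}$. The function names and the sample values are illustrative and do not come from the paper.

```python
import numpy as np


def term1_matrix(H_list, eta, sigma2, N):
    """Variance term 1' evaluated with explicit matrix products.

    H_list holds the (assumed diagonal) covariances H_1, ..., H_M.
    """
    d = H_list[0].shape[0]
    I = np.eye(d)
    # Start from I - (I - eta*H_1)^{2N}, the lower bound on C_N.
    inner = I - np.linalg.matrix_power(I - eta * H_list[0], 2 * N)
    # Apply (I - eta*H_m)^N on both sides for tasks m = 2, ..., M.
    for H_m in H_list[1:]:
        A = np.linalg.matrix_power(I - eta * H_m, N)
        inner = A @ inner @ A
    # <H_k, X> is the trace inner product tr(H_k X) for symmetric matrices.
    return 0.5 * eta * sigma2 * sum(np.trace(H_k @ inner) for H_k in H_list)


def term1_eigen(lams, eta, sigma2, N):
    """The same quantity from the eigenvalue formula in (17).

    lams has shape (M, d); row k holds the eigenvalues of H_{k+1}.
    """
    lams = np.asarray(lams, dtype=float)
    # Product over tasks 2..M of (1 - eta*lambda_m^i)^{2N}, per coordinate i.
    decay = np.prod((1.0 - eta * lams[1:]) ** (2 * N), axis=0)
    # Sum over tasks of lambda_k^i, per coordinate i.
    lam_sum = lams.sum(axis=0)
    first_task = 1.0 - (1.0 - eta * lams[0]) ** (2 * N)
    return 0.5 * eta * sigma2 * np.sum(decay * lam_sum * first_task)
```

For diagonal inputs the two functions should agree up to floating-point error, which is the content of the chain of equalities in (17).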

To lower bound the right-hand side further, note the following inequality:

$$
1-(1-\eta\lambda_{1}^{i})^{2N}\geq
\begin{cases}
1-(1-\frac{1}{N})^{2N}\geq 1-e^{-2}\geq\frac{9}{10}, & \lambda_{1}^{i}\geq\frac{1}{\eta N},\\[4pt]
2N\cdot\eta\lambda_{1}^{i}-\frac{2N(N-1)}{2}\cdot\eta^{2}(\lambda_{1}^{i})^{2}\geq\frac{9N}{10}\cdot\eta\lambda_{1}^{i}, & \lambda_{1}^{i}<\frac{1}{\eta N}.
\end{cases}
$$

Hence, for the first term, we have:

$$
\text{variance term 1}^{\prime}\geq\frac{9\eta^{2}\sigma^{2}}{20}\Big(\sum_{i<k_{1}^{*}}\Gamma_{(2,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{1}^{*}}\Gamma_{(2,M)}^{i}\lambda_{1}^{i}\Lambda^{i}\Big).
$$

For variance term $2^{\prime}$, we notice that:

$$
\begin{aligned}
\mathbf{P}_{m}^{\prime}
&=\eta^{2}\sigma^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}=\eta^{2}\sigma^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m} \\
&\geq\frac{\eta^{2}\sigma^{2}}{2}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}).
\end{aligned}
$$
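In the eigenbasis of $\mathbf{H}_{m}$, the sum defining $\mathbf{P}_{m}^{\prime}$ is a geometric series for each eigenvalue. The following is a minimal sketch, assuming $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}=(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}$ as in the display above and $0\leq\eta\lambda<1$ for every eigenvalue $\lambda$. The function names and sample values are hypothetical.

```python
import numpy as np


def p_prime_eigs_sum(lam, eta, sigma2, N):
    """Eigenvalues of P'_m by explicit summation over t = 0, ..., N-1."""
    lam = np.asarray(lam, dtype=float)
    q2 = (1.0 - eta * lam) ** 2
    total = np.zeros_like(lam)
    for t in range(N):
        total += q2 ** t * lam
    return eta ** 2 * sigma2 * total


def p_prime_eigs_closed(lam, eta, sigma2, N):
    """Closed form of the same geometric series.

    Uses 1 - (1 - eta*lam)^2 = eta*lam*(2 - eta*lam), so that
    sum_t (1 - eta*lam)^{2t} * lam = (1 - (1 - eta*lam)^{2N}) / (eta*(2 - eta*lam)).
    """
    lam = np.asarray(lam, dtype=float)
    q2N = (1.0 - eta * lam) ** (2 * N)
    return eta * sigma2 * (1.0 - q2N) / (2.0 - eta * lam)
```

Since $2-\eta\lambda\leq 2$, the closed form is at least $\frac{\eta\sigma^{2}}{2}(1-(1-\eta\lambda)^{2N})$ per eigenvalue, which in turn is at least $\frac{\eta^{2}\sigma^{2}}{2}(1-(1-\eta\lambda)^{2N})$ whenever $\eta\leq 1$.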

Substituting the above into variance term $2^{\prime}$, we have:

$$
\begin{aligned}
\text{variance term 2}^{\prime}
&=\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}^{\prime}\Big\rangle \\
&\geq\frac{\eta^{2}\sigma^{2}}{2}\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})\Big\rangle \\
&=\frac{\eta^{2}\sigma^{2}}{2}\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M-1}\prod_{j=m+1}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})\Big\rangle \\
&\geq\frac{9\eta^{2}\sigma^{2}}{20}\sum_{m=1}^{M-1}\Big(\sum_{i<k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m+1,M)}^{i}\lambda_{m}^{i}\Lambda^{i}\Big).
\end{aligned}
$$

Similarly, for variance term $3^{\prime}$, it holds that:

$$
\text{variance term 3}^{\prime}\geq\frac{9\eta^{2}\sigma^{2}}{20}\Big(\sum_{i<k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\Lambda^{i}+N\eta\sum_{i>k_{M}^{*}}\Gamma_{(M+1,M)}^{i}\lambda_{M}^{i}\Lambda^{i}\Big).
$$

Appendix C Bias Error

Before proving the bias bound, we first introduce the following lemmas for traditional SGD training from Zou et al. (2021).

Lemma C.1 (Summation of bias iterates (Zou et al., 2021)).

Suppose that Assumption 2.3 holds and that $\eta<1/(\alpha\operatorname{tr}(\mathbf{H}_{m}))$. Then for every $N\geq 1$ and each task $m$, it holds that:

$$
\frac{1}{2\eta}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})\preceq\sum_{t=0}^{N-1}(\mathcal{I}-\eta\cdot\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}\preceq\frac{1}{\eta}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N})
$$

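The sandwich bound of Lemma C.1 can be illustrated numerically in a special case. The following is a minimal sketch under assumptions that are not stated in this appendix: the data are Gaussian, $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{H})$ with diagonal $\mathbf{H}$, so that $\mathbb{E}[\mathbf{x}\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\mathbf{x}^{\top}]=2\mathbf{H}\mathbf{A}\mathbf{H}+\operatorname{tr}(\mathbf{H}\mathbf{A})\mathbf{H}$ for symmetric $\mathbf{A}$, and Assumption 2.3 is taken to be the usual fourth-moment condition, which Gaussian data satisfy with $\alpha=3$. All function names are hypothetical.

```python
import numpy as np


def apply_I_minus_eta_T(a, lam, eta):
    """Apply (I - eta*T_H(eta)) to a diagonal matrix diag(a), with H = diag(lam).

    Assumes Gaussian data, for which the fourth-moment operator maps
    diag(a) to 2*H*A*H + tr(H*A)*H, which is again diagonal.
    """
    fourth_moment = 2.0 * lam ** 2 * a + np.dot(lam, a) * lam
    # (I - eta*T) o A = A - eta*(H A + A H) + eta^2 * E[x x^T A x x^T]
    return a - 2.0 * eta * lam * a + eta ** 2 * fourth_moment


def bias_sum(lam, eta, N):
    """Diagonal of sum_{t=0}^{N-1} (I - eta*T_H(eta))^t o H."""
    lam = np.asarray(lam, dtype=float)
    a = lam.copy()  # t = 0 term is H itself
    total = np.zeros_like(lam)
    for _ in range(N):
        total += a
        a = apply_I_minus_eta_T(a, lam, eta)
    return total


def sandwich(lam, eta, N):
    """Diagonals of the lower and upper bounds in Lemma C.1."""
    lam = np.asarray(lam, dtype=float)
    core = 1.0 - (1.0 - eta * lam) ** (2 * N)
    return core / (2.0 * eta), core / eta
```

Because every matrix involved is diagonal here, the PSD order reduces to an entrywise comparison, so the lemma can be checked by comparing `bias_sum` against the two outputs of `sandwich` for a step size below $1/(3\operatorname{tr}(\mathbf{H}))$.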
Lemma C.2.

Under Assumption 2.3, let $\mathbf{B}_{a,b}=\mathbf{B}_{a}-(\mathbf{I}-\eta\mathbf{H}_{m})^{b-a}\mathbf{B}_{a}(\mathbf{I}-\eta\mathbf{H}_{m})^{b-a}$. If the step size satisfies $\eta<1/(\alpha_{m}\operatorname{tr}(\mathbf{H}_{m}))$, then for any $t\leq N$ and each task $m$, it holds that:

$$
\mathbf{S}_{t}\preceq\sum_{k=0}^{t-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{k}\Big(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0}\Big)(\mathbf{I}-\eta\mathbf{H}_{m})^{k},
$$

where $\mathbf{S}_{t}=\sum_{k=0}^{t-1}(\mathcal{I}-\eta\mathcal{T}(\eta))^{k}\circ\mathbf{B}_{0}$.

Lemma C.3.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$. Then it holds that:

\[
\mathbf{S}_{t}\succeq\frac{\beta_{m}}{4}\operatorname{tr}\left(\left(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{t/2}\right)\mathbf{B}_{0}\right)\cdot\left(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{t/2}\right)+\sum_{k=0}^{t-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{k}\cdot\mathbf{B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{k},
\]

where $\mathbf{S}_{t}=\sum_{k=0}^{t-1}(\mathcal{I}-\eta\mathcal{T}(\eta))^{k}\circ\mathbf{B}_{0}$.

C.1 Upper Bound

Lemma C.4.

Suppose Assumptions 2.3 and 2.4 hold with step size $\eta\leq 1/R^{2}$. Then it holds that:

\[
\mathbf{B}_{MN}\preceq\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}+\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m},
\]

where $\mathbf{P}_{m}=\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle$ and $\prod_{k_{1}}^{k_{2}}=1$ if $k_{1}>k_{2}$.

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$:

\begin{align}
\mathbf{B}_{(m-1)N+t+1}&=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t}\tag{18}\\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m})\circ\mathbf{B}_{(m-1)N+t}\notag\\
&\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t}+\alpha_{m}\eta^{2}\cdot\mathbf{H}_{m}\cdot\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle,\notag
\end{align}

where the last inequality follows from Assumption 2.3.
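As a numerical sanity check of this one-step bound, the following sketch works with $2\times 2$ matrices. It assumes Gaussian features, for which $\mathcal{M}_{m}\circ\mathbf{B}=2\mathbf{H}_{m}\mathbf{B}\mathbf{H}_{m}+\operatorname{tr}(\mathbf{H}_{m}\mathbf{B})\mathbf{H}_{m}$ for symmetric $\mathbf{B}$, and it assumes $\widetilde{\mathcal{M}}_{m}\circ\mathbf{B}=\mathbf{H}_{m}\mathbf{B}\mathbf{H}_{m}$, so that $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}=(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{B}(\mathbf{I}-\eta\mathbf{H}_{m})$. The matrices `H` and `B` and the values `eta` and `alpha` are hypothetical choices made for illustration, not quantities taken from the analysis.

```python
# Sketch only: 2x2 matrices as nested lists. The Gaussian fourth-moment
# formula and all numeric values below are assumptions for illustration.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B, s=1.0):
    """Return A + s * B."""
    return [[A[i][j] + s * B[i][j] for j in range(2)] for i in range(2)]

def scale(A, s):
    return [[s * A[i][j] for j in range(2)] for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

def is_psd(A, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return A[0][0] >= -tol and A[1][1] >= -tol and det >= -tol

I2 = [[1.0, 0.0], [0.0, 1.0]]
H = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical covariance H_m
B = [[1.0, -0.3], [-0.3, 0.5]]    # hypothetical PSD iterate B_{(m-1)N+t}
eta, alpha = 0.1, 3.0             # alpha = 3 suffices in this Gaussian example

G = add(I2, H, -eta)                       # I - eta * H
contracted = matmul(matmul(G, B), G)       # (I - eta * T~) o B
HBH = matmul(matmul(H, B), H)
inner = trace(matmul(H, B))                # <H, B> = tr(H B)

# Gaussian case: (M - M~) o B = H B H + tr(H B) H.
exact_next = add(contracted, add(HBH, scale(H, inner)), eta ** 2)
upper = add(contracted, scale(H, inner), alpha * eta ** 2)

gap = add(upper, exact_next, -1.0)         # should be positive semidefinite
print(is_psd(gap))
```

In this example the gap equals $\eta^{2}\left(2\operatorname{tr}(\mathbf{H}\mathbf{B})\mathbf{H}-\mathbf{H}\mathbf{B}\mathbf{H}\right)$, which is positive semidefinite because $\mathbf{H}\mathbf{B}\mathbf{H}\preceq\operatorname{tr}(\mathbf{H}\mathbf{B})\mathbf{H}$ for positive semidefinite $\mathbf{B}$.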

Hence, after $N$ iterations, we obtain the following bound for task $m$:

\begin{align*}
\mathbf{B}_{(m-1)N+N}&\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle\\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle\\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle\\
&\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\alpha_{m}\eta^{2}\sum_{t=0}^{N-1}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle.
\end{align*}
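Two steps of the chain above can be checked numerically: the identity $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}=(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}$, and the final relaxation that drops the factor $(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}$. The sketch below assumes that $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta)$ acts as $\mathbf{A}\mapsto(\mathbf{I}-\eta\mathbf{H}_{m})\mathbf{A}(\mathbf{I}-\eta\mathbf{H}_{m})$, which is what the second equality suggests. The matrix `H` and the values `eta` and `t` are hypothetical.

```python
# Sketch only: 2x2 matrices as nested lists; all numeric values are hypothetical.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

def matpow(A, n):
    out = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        out = matmul(out, A)
    return out

def is_psd(A, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return A[0][0] >= -tol and A[1][1] >= -tol and det >= -tol

H = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical covariance H_m with eta * H <= I
eta, t = 0.1, 5
G = sub([[1.0, 0.0], [0.0, 1.0]], [[eta * h for h in row] for row in H])

# Apply A -> G A G a total of t times, starting from A = H.
A = H
for _ in range(t):
    A = matmul(matmul(G, A), G)

# Closed form: G and H commute, so G^t H G^t = G^(2t) H.
closed_form = matmul(matpow(G, 2 * t), H)
max_diff = max(abs(A[i][j] - closed_form[i][j])
               for i in range(2) for j in range(2))

# Dropping the factor only enlarges the matrix: H - G^(2t) H is PSD.
dropped_factor_gap = sub(H, closed_form)

print(max_diff < 1e-12, is_psd(dropped_factor_gap))
```

Both checks rely on $\mathbf{0}\preceq\eta\mathbf{H}_{m}\preceq\mathbf{I}$, so that every eigenvalue of $(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}$ lies in $[0,1]$.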

We now examine the second term for each $m$:

\begin{align*}
&\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle\\
={}&\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{1}}(\eta))^{N}\mathbf{B}_{0}\rangle\\
={}&\sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{1}}(\eta))^{N}\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle,
\end{align*}

where we know the following holds:

\begin{align*}
(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+\eta^{2}(\mathcal{M}_{m-1}-\widetilde{\mathcal{M}}_{m-1})\circ\mathbf{H}_{m}\\
&\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+\alpha_{m-1}\eta^{2}\cdot\mathbf{H}_{m-1}\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle.
\end{align*}
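The inner-product manipulation earlier in this argument moves operators from one argument of $\langle\cdot,\cdot\rangle$ to the other. A property commonly invoked for such a step is that each map $\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}}(\eta)$ is self-adjoint with respect to the trace inner product on symmetric matrices. The sketch below checks this for a single operator only, assuming Gaussian features so that the map acts as $\mathbf{A}\mapsto(\mathbf{I}-\eta\mathbf{H})\mathbf{A}(\mathbf{I}-\eta\mathbf{H})+\eta^{2}\left(\mathbf{H}\mathbf{A}\mathbf{H}+\operatorname{tr}(\mathbf{H}\mathbf{A})\mathbf{H}\right)$. The matrices `H`, `B`, `C` and the value `eta` are hypothetical.

```python
# Sketch only: 2x2 symmetric matrices as nested lists; the Gaussian form of
# the operator and all numeric values are assumptions for illustration.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

def inner(A, B):
    """Trace inner product <A, B> = tr(A^T B)."""
    return sum(A[i][j] * B[i][j] for i in range(2) for j in range(2))

H = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical covariance
eta = 0.1
G = [[(1.0 if i == j else 0.0) - eta * H[i][j] for j in range(2)]
     for i in range(2)]           # I - eta * H

def step(A):
    """(I - eta * T) o A for symmetric A, under the Gaussian assumption."""
    GAG = matmul(matmul(G, A), G)
    HAH = matmul(matmul(H, A), H)
    c = trace(matmul(H, A))
    return [[GAG[i][j] + eta ** 2 * (HAH[i][j] + c * H[i][j])
             for j in range(2)] for i in range(2)]

C = [[1.0, 0.2], [0.2, 3.0]]      # hypothetical symmetric test matrices
B = [[1.0, -0.3], [-0.3, 0.5]]

lhs = inner(C, step(B))           # <C, (I - eta T) o B>
rhs = inner(step(C), B)           # <(I - eta T) o C, B>
print(abs(lhs - rhs) < 1e-12)
```

Each of the three pieces of the map is symmetric under the trace: $\operatorname{tr}(\mathbf{C}\mathbf{G}\mathbf{B}\mathbf{G})=\operatorname{tr}(\mathbf{G}\mathbf{C}\mathbf{G}\mathbf{B})$ by cyclicity, and likewise for $\mathbf{H}\mathbf{B}\mathbf{H}$, while $\operatorname{tr}(\mathbf{H}\mathbf{B})\operatorname{tr}(\mathbf{C}\mathbf{H})$ is symmetric in $\mathbf{B}$ and $\mathbf{C}$.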

Moreover, we have $\sum_{t=0}^{N-1}\eta\cdot\mathbf{H}_{m-1}\cdot(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{t}\preceq\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\preceq\mathbf{I}$. Therefore, it holds that:

\[
(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}+\alpha_{m-1}\eta\cdot\mathbf{I}\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle.
\]
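The geometric-sum bound used just before this inequality involves only polynomials in $\mathbf{H}_{m-1}$, so it can be read one eigenvalue at a time. The sketch below assumes that $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{t}$ contributes the factor $(1-\eta\lambda)^{2t}$ on an eigenvalue $\lambda$ of $\mathbf{H}_{m-1}$, and that $0\leq\eta\lambda\leq 1$. The function name `geometric_terms` and the sample values are hypothetical.

```python
def geometric_terms(lam, eta, N):
    """Return the two-sided sum, the one-sided sum, and 1 - (1 - eta*lam)^N."""
    x = eta * lam                 # eigenvalue of eta * H_{m-1}, assumed in [0, 1]
    two_sided = sum(x * (1 - x) ** (2 * t) for t in range(N))
    one_sided = sum(x * (1 - x) ** t for t in range(N))
    closed_form = 1 - (1 - x) ** N
    return two_sided, one_sided, closed_form

for lam in (0.01, 0.5, 2.0, 9.9):
    two_sided, one_sided, closed_form = geometric_terms(lam, eta=0.1, N=50)
    # Termwise (1 - x)^(2t) <= (1 - x)^t, and the one-sided sum telescopes.
    print(lam,
          two_sided <= closed_form <= 1.0,
          abs(one_sided - closed_form) < 1e-12)
```

The one-sided sum equals $1-(1-\eta\lambda)^{N}$ exactly, which is the scalar form of $\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}$. Multiplying the $\eta^{2}$ term of the one-step bound by this sum is what leaves a single factor of $\eta$ in the displayed inequality.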

This implies:

\begin{align*}
(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}&\preceq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}\\
&\quad+\sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j}\cdot\langle\mathbf{H}_{k-1},\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle\cdot\mathbf{I},
\end{align*}

where we denote $\mathbf{H}_{0}=\mathbf{I}$ and define $\Phi_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\alpha_{k}\eta^{j}\cdot\langle\mathbf{H}_{k-1},\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{N}\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$. Therefore, the second term examined above can be bounded as follows:

\begin{align*}
&\sum_{t=0}^{N-1}\left\langle(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}+\Phi_{1}^{m-1}\cdot\mathbf{I},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\right\rangle\\
\leq{}&\underbrace{\sum_{t=0}^{N-1}\left\langle(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m},(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\left(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0}\right)(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\right\rangle}_{\text{term 1}}\\
&+\underbrace{\sum_{t=0}^{N-1}\left\langle\Phi_{1}^{m-1}\cdot\mathbf{I},(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\left(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0}\right)(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\right\rangle}_{\text{term 2}}.
\end{align*}

We first consider the term 1 with Lemma C.2.

$$\begin{aligned}
\text{term 1} ={}& \sum_{t=0}^{N-1}\Big\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m},\ \Big(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0}\Big)\Big\rangle \\
={}& \sum_{t=0}^{N-1}\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m},\ \mathbf{H}_{m}\Big\rangle+\sum_{t=0}^{N-1}\big\langle(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m},\ \mathbf{B}_{0}\big\rangle \\
\leq{}& \frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}\big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\big),\ \mathbf{H}_{m}\Big\rangle+\frac{1}{\eta}\Big\langle\prod_{j=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}\big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\big),\ \mathbf{B}_{0}\Big\rangle \\
={}& \frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Gamma_{(1,m-1)}^{i}\big[1-(1-\eta\lambda_{m}^{i})^{N}\big]\lambda_{m}^{i}+\frac{1}{\eta}\sum_{i}\Gamma_{(1,m-1)}^{i}\,\omega_{i}^{2}\big[1-(1-\eta\lambda_{m}^{i})^{N}\big] \\
\leq{}& \frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Gamma_{(1,m-1)}^{i}\min\{1,\eta N\lambda_{m}^{i}\}+\frac{1}{\eta}\sum_{i}\Gamma_{(1,m-1)}^{i}\,\omega_{i}^{2}\min\{1,\eta N\lambda_{m}^{i}\} \\
\leq{}& \frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i\leq k_{m}^{*}}\frac{\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2}\Big)+\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2},
\end{aligned}$$

where $k_{m}^{*}$ is the index of the smallest eigenvalue of $\mathbf{H}_{m}$ satisfying $\lambda_{k_{m}^{*}}^{i}\geq 1/(\eta N)$, and we denote

$$U_{m}=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i\leq k_{m}^{*}}\frac{\Gamma_{(1,m-1)}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}(\lambda_{m}^{i})^{2}\Big)+\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\bm{\Gamma}_{1}^{m-1}\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}.$$
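The two inequality steps in the bound on term 1 reduce, eigenvalue by eigenvalue, to scalar facts about $q = 1-\eta\lambda$. The following is a minimal numerical sketch of that scalar chain, assuming $0 < \eta\lambda < 1$; the function names and the parameter grid are hypothetical and chosen only for illustration.

```python
def geometric_sum(eta, lam, n_steps):
    """Return eta * lam * sum_{t=0}^{N-1} (1 - eta*lam)^(2t), summed term by term."""
    q = 1.0 - eta * lam
    return eta * lam * sum(q ** (2 * t) for t in range(n_steps))


def contraction_gap(eta, lam, n_steps):
    """Return 1 - (1 - eta*lam)^N, the per-eigenvalue factor in the bound."""
    return 1.0 - (1.0 - eta * lam) ** n_steps


def chain_holds(eta, lam, n_steps, tol=1e-12):
    """Check geometric_sum <= contraction_gap <= min(1, eta*N*lam) up to tol."""
    s = geometric_sum(eta, lam, n_steps)
    g = contraction_gap(eta, lam, n_steps)
    cap = min(1.0, eta * n_steps * lam)
    return s <= g + tol and g <= cap + tol


# Hypothetical grid; every (eta, lam) pair satisfies 0 < eta * lam < 1.
grid = [(eta, lam, n)
        for eta in (0.01, 0.1, 0.5)
        for lam in (1e-4, 0.01, 0.3, 1.0, 1.9)
        for n in (1, 5, 100)]
all_ok = all(chain_holds(eta, lam, n) for eta, lam, n in grid)
print(all_ok)
```

The first comparison corresponds to replacing the sum over $t$ by $\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}$, and the second to the $\min\{1,\eta N\lambda_{m}^{i}\}$ step. A grid check of this kind is only a sanity test, not a substitute for the proof.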

Moreover, $\operatorname{tr}(\mathbf{B}_{0,N})=\operatorname{tr}\big(\mathbf{B}_{0}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\mathbf{B}_{0}(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\big)=\sum_{i}\big(1-(1-\eta\Lambda^{i})^{2N}\big)\cdot(\langle\mathbf{w}_{0}-\mathbf{w}^{*},\mathbf{v}_{i}\rangle)^{2}$. Hence:

$$\operatorname{tr}(\mathbf{B}_{0,N})\leq 2\sum_{i}\min\{1,N\eta\Lambda^{i}\}(\langle\mathbf{w}_{0}-\mathbf{w}^{*},\mathbf{v}_{i}\rangle)^{2}\leq 2\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}\Big).$$
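As a sanity check on the trace bound, the sketch below evaluates all three quantities in the eigenbasis of $\mathbf{H}_{m}$, where they become sums over eigenvalues. It assumes a diagonal $\mathbf{H}_{m}$ with $0<\eta\lambda<1$; the eigenvalues, the coefficients standing in for $\langle\mathbf{w}_{0}-\mathbf{w}^{*},\mathbf{v}_{i}\rangle$, and the function names are all hypothetical.

```python
def trace_b0n(eigs, coeffs, eta, n_steps):
    """tr(B_{0,N}) = sum_i (1 - (1 - eta*lam_i)^(2N)) * c_i^2 in the eigenbasis."""
    return sum((1.0 - (1.0 - eta * lam) ** (2 * n_steps)) * c * c
               for lam, c in zip(eigs, coeffs))


def min_bound(eigs, coeffs, eta, n_steps):
    """Middle expression: 2 * sum_i min(1, N*eta*lam_i) * c_i^2."""
    return 2.0 * sum(min(1.0, eta * n_steps * lam) * c * c
                     for lam, c in zip(eigs, coeffs))


def head_tail_bound(eigs, coeffs, eta, n_steps):
    """Right-hand side: identity weight on the head, N*eta*H weight on the tail."""
    head = sum(c * c for lam, c in zip(eigs, coeffs)
               if eta * n_steps * lam >= 1.0)
    tail = sum(lam * c * c for lam, c in zip(eigs, coeffs)
               if eta * n_steps * lam < 1.0)
    return 2.0 * (head + n_steps * eta * tail)


# Hypothetical spectrum and coefficients, for illustration only.
eigs = [1.0, 0.5, 0.05, 0.001]
coeffs = [0.3, -1.2, 0.7, 2.0]
eta, n_steps = 0.1, 50

lhs = trace_b0n(eigs, coeffs, eta, n_steps)
mid = min_bound(eigs, coeffs, eta, n_steps)
rhs = head_tail_bound(eigs, coeffs, eta, n_steps)
print(lhs, mid, rhs)
```

In this diagonal setting the middle and right-hand expressions coincide, because splitting the sum at the threshold $1/(\eta N)$ evaluates the minimum exactly; only the first comparison is a genuine inequality.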

We now turn to term 2.

$$\begin{aligned}
\text{term 2}&=\sum_{t=0}^{N-1}\Big\langle\Phi_{1}^{m-1}\cdot\mathbf{I}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},\ \Big(\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\cdot\mathbf{H}_{m}+\mathbf{B}_{0}\Big)\Big\rangle \\
&\leq\sum_{t=0}^{N-1}\frac{\eta\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\big\langle\Phi_{1}^{m-1}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},\ \mathbf{H}_{m}\big\rangle+\sum_{t=0}^{N-1}\big\langle\Phi_{1}^{m-1}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{2t},\ \mathbf{B}_{0}\big\rangle \\
&\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\big\langle\Phi_{1}^{m-1}\mathbf{I},\ \big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\big)\big\rangle+\frac{1}{\eta}\Big\langle\sum_{j=1}^{m-1}\big\langle\mathbf{H}_{j},\ \alpha^{j}\bm{\Gamma}_{1}^{j-1}\cdot\mathbf{H}_{m}\big\rangle\,\mathbf{H}_{m}^{-1}\big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N}\big),\ \mathbf{B}_{0}\Big\rangle \\
&=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Phi_{1}^{m-1}\big[1-(1-\eta\lambda_{m}^{i})^{N}\big]+\frac{1}{\eta}\sum_{i}{\Phi_{1}^{m-1}}^{i}\,\omega_{i}^{2}(\lambda_{m}^{i})^{-1}\big[1-(1-\eta\lambda_{m}^{i})^{N}\big] \\
&\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\sum_{i}\Phi_{1}^{m-1}\min\{1,\eta N\lambda_{m}^{i}\}+\frac{1}{\eta}\sum_{i}\Phi_{1}^{m-1}\,\omega_{i}^{2}(\lambda_{m}^{i})^{-1}\min\{1,\eta N\lambda_{m}^{i}\} \\
&\leq\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})\,\Phi_{1}^{m-1}}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(k_{m}^{*}+N\eta\sum_{i>k_{m}^{*}}\lambda_{m}^{i}\Big)+\frac{\Phi_{1}^{m-1}}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}^{-1}_{m,0:k_{m}^{*}}}^{2}+N\Phi_{1}^{m-1}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,k_{m}^{*}:\infty}}^{2}.
\end{aligned}$$

Let us denote

$$V_{m}=\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})\,\Phi_{1}^{m-1}}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(k_{m}^{*}+N\eta\sum_{i>k_{m}^{*}}\lambda_{m}^{i}\Big)+\frac{\Phi_{1}^{m-1}}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}^{-1}_{m,0:k_{m}^{*}}}^{2}+N\Phi_{1}^{m-1}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,k_{m}^{*}:\infty}}^{2}.$$
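The last step in the bound on term 2 splits each sum at the index $k_{m}^{*}$. The sketch below illustrates that split with the scalar prefactors ($\Phi_{1}^{m-1}$ and the $\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})$ ratio) dropped. It assumes the eigenvalues are sorted in non-increasing order and that $\omega_{i}$ is the coefficient of $\mathbf{w}_{0}-\mathbf{w}^{*}$ along the $i$-th eigenvector; all names and numbers are hypothetical.

```python
def split_index(eigs, eta, n_steps):
    """Count eigenvalues with eta*N*lam >= 1 (eigs sorted in non-increasing order)."""
    return sum(1 for lam in eigs if eta * n_steps * lam >= 1.0)


def term2_min_form(eigs, omegas_sq, eta, n_steps):
    """Return the two sums written with min{1, eta*N*lam_i}, before the split."""
    noise = sum(min(1.0, eta * n_steps * lam) for lam in eigs)
    bias = sum(w2 / lam * min(1.0, eta * n_steps * lam)
               for lam, w2 in zip(eigs, omegas_sq)) / eta
    return noise, bias


def term2_split_form(eigs, omegas_sq, eta, n_steps):
    """Return the same two sums after splitting at k*: head versus tail."""
    k = split_index(eigs, eta, n_steps)
    # Head contributes 1 per eigenvalue; tail contributes N*eta*lam_i.
    noise = k + n_steps * eta * sum(eigs[k:])
    # Head is weighted by 1/lam_i (the H^{-1} norm); tail by N (the identity norm).
    head_bias = sum(w2 / lam for lam, w2 in zip(eigs[:k], omegas_sq[:k])) / eta
    tail_bias = n_steps * sum(omegas_sq[k:])
    return noise, head_bias + tail_bias


# Hypothetical spectrum (non-increasing) and squared coefficients.
eigs = [2.0, 0.8, 0.1, 0.02, 0.004]
omegas_sq = [0.5, 0.2, 1.0, 0.3, 0.7]
eta, n_steps = 0.05, 40

k_star = split_index(eigs, eta, n_steps)
min_noise, min_bias = term2_min_form(eigs, omegas_sq, eta, n_steps)
split_noise, split_bias = term2_split_form(eigs, omegas_sq, eta, n_steps)
print(k_star, min_noise, split_noise, min_bias, split_bias)
```

With the head defined by $\eta N\lambda_{m}^{i}\geq 1$, the split reproduces the $\min$ form exactly. This is why the head of the bias part carries the $\mathbf{H}_{m}^{-1}$ weighting with a $1/\eta$ factor, while the tail carries the identity weighting with a factor $N$.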

By combining the aforementioned results, we obtain:

$$\mathbf{B}_{MN}\preceq\prod_{m=1}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta)\big)^{N}\circ\mathbf{B}_{0}+\sum_{m=1}^{M}\prod_{j=m}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta)\big)^{N}\mathbf{P}_{m},$$

where we denote $\mathbf{P}_{m}=\alpha_{m}\eta^{2}(U_{m}+V_{m})\cdot\mathbf{H}_{m}$.

Based on Lemma A.2, the upper bound of the bias error can be expressed as follows:

$$
\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle\leq\underbrace{\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\prod_{m=1}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{0}\Big\rangle}_{\text{bias term 1}}+\underbrace{\sum_{k=1}^{M}\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{j}}(\eta))^{N}\mathbf{P}_{m}\Big\rangle}_{\text{bias term 2}}.
$$

For each $k$:

$$
\begin{aligned}
\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle
&=\Big\langle\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k},\mathbf{B}_{0}\Big\rangle+\Big\langle\mathbf{H}_{k},\sum_{m=1}^{M}\prod_{j=m}^{M}(\mathbf{I}-\eta\mathbf{H}_{j})^{2N}\alpha_{m}\eta^{2}(U_{m}+V_{m})\cdot\mathbf{H}_{m}\Big\rangle \\
&\leq\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k}}^{2} \\
&\quad+\sum_{m=1}^{M}\alpha_{m}\eta^{2}\bigg(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i<k_{m}^{*}}\frac{\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\lambda_{k}^{i}\Big)\bigg) \\
&\quad+\sum_{m=1}^{M}\alpha_{m}\eta^{2}\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}^{2}\mathbf{H}_{k})_{k_{m}^{*}:\infty}}^{2}\Big) \\
&\quad+\sum_{m=1}^{M}\alpha_{m}\eta^{2}\Phi_{1}^{m-1}\bigg(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i<k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\Big)\bigg) \\
&\quad+\sum_{m=1}^{M}\alpha_{m}\eta^{2}\Phi_{1}^{m-1}\Big(\frac{1}{\eta}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}_{m}^{-1})_{0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M})_{k_{m}^{*}:\infty}}^{2}\Big).
\end{aligned}
\tag{19}
$$

Hence,

$$
\begin{aligned}
\sum_{k=1}^{M}\langle\mathbf{H}_{k},\mathbf{B}_{MN}\rangle
&\leq\sum_{k=1}^{M}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k}}^{2} \\
&\quad+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i<k_{m}^{*}}\frac{\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}}{N\eta}+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(1,M)}^{i}(\lambda_{m}^{i})^{3}\lambda_{k}^{i}\Big) \\
&\quad+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{m}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{1}^{M}\mathbf{H}_{k}\mathbf{H}_{m}^{2})_{k_{m}^{*}:\infty}}^{2}\Big) \\
&\quad+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta^{2}\Phi_{1}^{m-1}\bigg(\frac{\alpha_{m}\operatorname{tr}(\mathbf{B}_{0,N})}{1-\eta\alpha_{m}\operatorname{tr}(\mathbf{H}_{m})}\Big(\sum_{i<k_{m}^{*}}\Gamma_{(m,M)}^{i}\lambda_{k}^{i}(\lambda_{m}^{i})+N\eta\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}\Big)\bigg) \\
&\quad+\sum_{k=1}^{M}\sum_{m=1}^{M}\alpha_{m}\eta\,\Phi_{1}^{m-1}\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{m}^{M}\mathbf{H}_{k}\mathbf{H}_{m})_{k_{m}^{*}:\infty}}^{2}\Big).
\end{aligned}
$$
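As a sanity check on the first term of the bound above, the following minimal sketch evaluates $\sum_{k}\|\mathbf{w}_{0}-\mathbf{w}^{*}\|^{2}_{\prod_{m}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{k}}$ in two ways. It assumes that all $\mathbf{H}_{m}$ are diagonal in a common basis (as the per-coordinate eigenvalues $\lambda_{m}^{i}$ suggest) and that $\mathbf{B}_{0}=(\mathbf{w}_{0}-\mathbf{w}^{*})(\mathbf{w}_{0}-\mathbf{w}^{*})^{\top}$. The spectra, the initial error, $\eta$ and $N$ are hypothetical values chosen only for illustration.

```python
# Sketch under stated assumptions: every H_m is diagonal in one shared basis,
# so task m is described by its eigenvalue list lam[m][i].

def bias_term_1(lam, delta0, eta, N):
    """Closed form: sum_k sum_i prod_m (1 - eta*lam[m][i])^(2N) * lam[k][i] * delta0[i]^2."""
    total = 0.0
    for i in range(len(delta0)):
        shrink = 1.0
        for lam_m in lam:
            shrink *= (1.0 - eta * lam_m[i]) ** (2 * N)
        total += shrink * sum(lam_k[i] for lam_k in lam) * delta0[i] ** 2
    return total

def bias_term_1_by_iteration(lam, delta0, eta, N):
    """Same quantity from iterating delta <- (I - eta*H_m) delta, N steps per task."""
    delta = list(delta0)
    for lam_m in lam:
        for _ in range(N):
            delta = [(1.0 - eta * l) * x for l, x in zip(lam_m, delta)]
    return sum(lam_k[i] * delta[i] ** 2
               for lam_k in lam for i in range(len(delta)))

lam = [[1.0, 0.5, 0.1], [0.8, 0.2, 0.05]]  # hypothetical spectra, M = 2 tasks
delta0 = [1.0, -2.0, 0.5]                  # hypothetical w_0 - w^*
eta, N = 0.1, 20

closed = bias_term_1(lam, delta0, eta, N)
iterated = bias_term_1_by_iteration(lam, delta0, eta, N)
reordered = bias_term_1(lam[::-1], delta0, eta, N)
print(closed, iterated, reordered)
```

Under the shared-eigenbasis assumption the contraction factors commute, so in this sketch the first term does not depend on the task order; any order dependence in the overall bound enters through the remaining terms.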

C.2 Lower Bound

We first examine the recursion from $t=0$ to $t=N-1$ for each task $m$:

$$
\begin{aligned}
\mathbf{B}_{(m-1)N+t+1}
&=(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t} \\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t}+\eta^{2}(\mathcal{M}_{m}-\widetilde{\mathcal{M}}_{m})\circ\mathbf{B}_{(m-1)N+t} \\
&\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))\circ\mathbf{B}_{(m-1)N+t}+\beta_{m}\eta^{2}\cdot\mathbf{H}_{m}\cdot\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle.
\end{aligned}
\tag{20}
$$
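The last inequality in (20) can be illustrated in the Gaussian case. The sketch below assumes the operator definitions $\mathcal{M}\circ\mathbf{A}=\mathbb{E}[\mathbf{x}\mathbf{x}^{\top}\mathbf{A}\mathbf{x}\mathbf{x}^{\top}]$ and $\widetilde{\mathcal{M}}\circ\mathbf{A}=\mathbf{H}\mathbf{A}\mathbf{H}$, which are not restated in this section, and takes $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{H})$ so that fourth moments follow from Isserlis' theorem. The matrices `H` and `A` and the choice `beta = 1.0` are hypothetical; other data distributions would give a different $\beta_{m}$.

```python
# Sketch: for Gaussian x ~ N(0, H), check that
# (M - M~) o A - beta * <H, A> * H is PSD with beta = 1 (it equals H A H).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inner(A, B):
    """Frobenius inner product <A, B> = tr(A^T B)."""
    return sum(A[i][j] * B[i][j] for i in range(len(A)) for j in range(len(A[0])))

def gaussian_M(H, A):
    """(M o A)_{il} = sum_{j,k} E[x_i x_j x_k x_l] A_{jk}, moments via Isserlis."""
    d = len(H)
    out = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for l in range(d):
            s = 0.0
            for j in range(d):
                for k in range(d):
                    m4 = H[i][j] * H[k][l] + H[i][k] * H[j][l] + H[i][l] * H[j][k]
                    s += m4 * A[j][k]
            out[i][l] = s
    return out

H = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical covariance (positive definite)
A = [[1.0, 0.3], [0.3, 0.5]]   # hypothetical PSD argument, standing in for B
beta = 1.0                     # assumed value for the Gaussian case

MA = gaussian_M(H, A)
HAH = matmul(matmul(H, A), H)
hA = inner(H, A)
residual = [[MA[i][j] - HAH[i][j] - beta * hA * H[i][j] for j in range(2)]
            for i in range(2)]
gap = max(abs(residual[i][j] - HAH[i][j]) for i in range(2) for j in range(2))
det = residual[0][0] * residual[1][1] - residual[0][1] * residual[1][0]
print(gap, residual[0][0], det)
```

In this example the residual coincides with $\mathbf{H}\mathbf{A}\mathbf{H}$, which is positive semidefinite whenever $\mathbf{A}$ is, so the inequality holds with room to spare.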

Hence, after $N$ iterations, we obtain the following bound for task $m$:

$$
\begin{aligned}
\mathbf{B}_{(m-1)N+N}
&\succeq(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle \\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},\mathbf{B}_{(m-1)N+t}\rangle \\
&=(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{N}\circ\mathbf{B}_{(m-1)N}+\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle.
\end{aligned}
$$
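The step from the first to the second line above uses $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m}}(\eta))^{t}\circ\mathbf{H}_{m}=(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}$. A minimal numerical sketch of this identity follows, assuming the usual definition $\widetilde{\mathcal{T}}_{\mathbf{H}}(\eta)\circ\mathbf{A}=\mathbf{H}\mathbf{A}+\mathbf{A}\mathbf{H}-\eta\mathbf{H}\mathbf{A}\mathbf{H}$, so that one application of $\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}}(\eta)$ maps $\mathbf{A}$ to $(\mathbf{I}-\eta\mathbf{H})\mathbf{A}(\mathbf{I}-\eta\mathbf{H})$. The matrix `H`, the step size and `t` are hypothetical.

```python
# Sketch: apply (I - eta*T~_H) to H repeatedly and compare with (I - eta*H)^(2t) H.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def shrink_matrix(H, eta):
    """Return I - eta*H."""
    d = len(H)
    return [[(1.0 if i == j else 0.0) - eta * H[i][j] for j in range(d)]
            for i in range(d)]

def apply_step(H, A, eta):
    """One application of (I - eta*T~_H), assumed to equal (I - eta*H) A (I - eta*H)."""
    S = shrink_matrix(H, eta)
    return matmul(matmul(S, A), S)

def mat_power(S, p):
    """S^p by repeated multiplication (p >= 0)."""
    d = len(S)
    out = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(p):
        out = matmul(out, S)
    return out

H = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical symmetric covariance
eta, t = 0.1, 5               # hypothetical step size and iteration count

lhs = H
for _ in range(t):
    lhs = apply_step(H, lhs, eta)
rhs = matmul(mat_power(shrink_matrix(H, eta), 2 * t), H)
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```

The identity relies on $\mathbf{I}-\eta\mathbf{H}_{m}$ commuting with $\mathbf{H}_{m}$; when the operator is applied to a different matrix, the two-sided product has to be kept as is.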

We now examine the second term for each $m$:

$$
\begin{aligned}
&\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle \\
={}&\beta_{m}\eta^{2}\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{2t}\mathbf{H}_{m}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{1}}(\eta))^{N}\mathbf{B}_{0}\rangle.
\end{aligned}
\tag{22}
$$

According to 2.3 (B), we have:

$$
\begin{aligned}
(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}
&= (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+(\mathcal{M}-\widetilde{\mathcal{M}})\circ\mathbf{H}_{m} \\
&\succeq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))\circ\mathbf{H}_{m}+\beta_{m-1}\eta^{2}\cdot\mathbf{H}_{m-1}\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle \\
\rightarrow\ (\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}
&\succeq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}+\beta_{m-1}\eta^{2}\cdot\sum_{t=0}^{N-1}(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{t}\circ\mathbf{H}_{m-1}\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle \\
&\succeq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}+\frac{\beta_{m-1}\eta}{2}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{2N})\cdot\langle\mathbf{H}_{m-1},\mathbf{H}_{m}\rangle.
\end{aligned}
$$
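The last inequality above rests on a geometric sum. As a sketch for a single eigenvalue $\lambda$ of $\mathbf{H}_{m-1}$, under two assumptions that are not stated in this passage, namely $0<\eta\lambda\leq 1$ and the usual identity $(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}}(\eta))\circ\mathbf{A}=(\mathbf{I}-\eta\mathbf{H})\mathbf{A}(\mathbf{I}-\eta\mathbf{H})$, the scalar version of the step reads:

```latex
\begin{align*}
\eta^{2}\sum_{t=0}^{N-1}(1-\eta\lambda)^{2t}\lambda
&= \eta^{2}\lambda\cdot\frac{1-(1-\eta\lambda)^{2N}}{1-(1-\eta\lambda)^{2}}
 = \eta^{2}\lambda\cdot\frac{1-(1-\eta\lambda)^{2N}}{\eta\lambda\,(2-\eta\lambda)} \\
&= \frac{\eta\,\bigl(1-(1-\eta\lambda)^{2N}\bigr)}{2-\eta\lambda}
 \;\geq\; \frac{\eta}{2}\,\bigl(1-(1-\eta\lambda)^{2N}\bigr).
\end{align*}
```

The final inequality only uses $0<2-\eta\lambda\leq 2$. Applying it along each eigendirection of $\mathbf{H}_{m-1}$ gives the factor $\frac{\beta_{m-1}\eta}{2}$ in front of $\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{2N}$.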

Therefore, iterating this bound over the preceding tasks, we have:

$$
\begin{aligned}
(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m}
&\succeq (\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\circ\mathbf{H}_{m} \\
&\quad+\sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}\left(\frac{\eta}{2}\right)^{j}\cdot\langle\mathbf{H}_{k-1},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{2N})\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle\cdot\mathbf{I}.
\end{aligned}
$$

Substituting the above into Equation 22 and denoting $\hat{\Phi}_{1}^{m-1}:=\sum_{j=1}^{m-1}\prod_{k=1}^{j}\beta_{k}\left(\frac{\eta}{2}\right)^{j}\cdot\langle\mathbf{H}_{k-1},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m-1})^{2N})\rangle\cdot\langle\mathbf{H}_{j},\mathbf{H}_{m}\rangle$, we have:

$$
\begin{aligned}
\sum_{t=0}^{N-1}\langle\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{(m-1)N}\rangle
&\succeq \sum_{t=0}^{N-1}\langle(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{m-1}}(\eta))^{N}\ldots(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_{1}}(\eta))^{N}\mathbf{H}_{m},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle \\
&\quad+\sum_{t=0}^{N-1}\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\rangle \\
&= \underbrace{\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\Bigr\rangle}_{\text{term 1}} \\
&\quad+\underbrace{\Bigl\langle\hat{\Phi}_{1}^{m-1}\mathbf{I},\sum_{t=0}^{N-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\Bigr\rangle}_{\text{term 2}}
\end{aligned}
$$

From the Lemma, we have:

$$
\begin{aligned}
\sum_{t=0}^{N-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}
&\succeq \frac{\beta_{m}}{4}\operatorname{tr}\bigl((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0}\bigr)\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2}) \\
&\quad+\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\cdot\mathbf{B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{t}.
\end{aligned}
$$

Then, for each task $m$, we examine term 1:

$$
\begin{aligned}
\text{term 1}={}&\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_{m}}(\eta))^{t}\mathbf{B}_{0}\Bigr\rangle \\
\geq{}&\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},\frac{\beta_{m}}{4}\operatorname{tr}\bigl((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0}\bigr)\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\Bigr\rangle \\
&+\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},\sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\cdot\mathbf{B}_{0}\cdot(\mathbf{I}-\eta\mathbf{H}_{m})^{t}\Bigr\rangle \\
={}&\underbrace{\frac{\beta_{m}}{4}\operatorname{tr}\bigl((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\mathbf{B}_{0}\bigr)\cdot\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\mathbf{H}_{m},(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{N/2})\Bigr\rangle}_{\text{bias term } b_{1}^{m}} \\
&+\underbrace{\frac{1}{2\eta}\Bigl\langle\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_{p})^{2N}\cdot(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}),\mathbf{B}_{0}\Bigr\rangle}_{\text{bias term } b_{2}^{m}}
\end{aligned}
$$

The first bias term can be written in the eigenbasis as:

$$
\text{bias term } b_{1}^{m}=\frac{\beta_{m}}{4}\Bigl(\sum_{i}\bigl(1-(1-\eta\lambda_{m}^{i})^{N/2}\bigr)\omega_{i}^{2}\Bigr)\cdot\Bigl(\sum_{i}\prod_{p=1}^{m-1}(1-\eta\lambda_{p}^{i})^{2N}\lambda_{m}^{i}\bigl(1-(1-\eta\lambda_{m}^{i})^{N/2}\bigr)\Bigr),
$$

The second bias term is lower bounded by:

$$
\text{bias term } b_{2}^{m}\geq\Bigl(\sum_{i}\prod_{p=1}^{m-1}(1-\eta\lambda_{p}^{i})^{2N}\bigl(1-(1-\eta\lambda_{m}^{i})^{2N}\bigr)\omega_{i}^{2}\Bigr)
$$

To further lower bound the two terms, we notice that:

$$
1-(1-\eta\lambda_{m}^{i})^{\frac{N}{2}}\geq
\begin{cases}
1-\bigl(1-\frac{1}{N}\bigr)^{\frac{N}{2}}\geq 1-e^{-\frac{1}{2}}\geq\frac{1}{5}, & \lambda_{m}^{i}\geq\frac{1}{\eta N} \\[4pt]
\frac{N}{2}\cdot\eta\lambda_{m}^{i}-\frac{N(N-2)}{8}\cdot\eta^{2}(\lambda_{m}^{i})^{2}\geq\frac{N}{5}\cdot\eta\lambda_{m}^{i}, & \lambda_{m}^{i}<\frac{1}{\eta N}
\end{cases}
$$
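The two-case bound above can be checked numerically. The following is a minimal sketch rather than part of the proof: it writes `x` for the product $\eta\lambda_{m}^{i}$, assumes $0<x\leq 1$ and even $N$, and uses the hypothetical helper names `exact`, `lower_bound`, and `min_gap`.

```python
def exact(x, N):
    """Left-hand side 1 - (1 - x)^(N/2), with x standing for eta * lambda."""
    return 1.0 - (1.0 - x) ** (N / 2.0)


def lower_bound(x, N):
    """Piecewise right-hand side: 1/5 on the head, (N/5) * x on the tail."""
    return 0.2 if x >= 1.0 / N else (N / 5.0) * x


def min_gap(N, grid=2000):
    """Smallest value of exact - lower_bound over a uniform grid of x in (0, 1]."""
    return min(
        exact(j / grid, N) - lower_bound(j / grid, N)
        for j in range(1, grid + 1)
    )


# Assumed toy values of N (all even); a positive gap means the bound holds on the grid.
gaps = {N: min_gap(N) for N in (2, 4, 10, 100, 1000)}
for N, gap in gaps.items():
    print(f"N={N}: min(exact - bound) = {gap:.6f}")
```

A grid check of this kind only illustrates the inequality; the analytic argument is the one given in the display above.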

Substituting into the previous results, we have:

$$
\begin{aligned}
\text{bias term } b_{1}^{m}
&\geq\frac{\beta_{m}}{4}\Bigl(\frac{1}{5}\cdot\sum_{i\leq k_{m}^{*}}\omega_{i}^{2}+\frac{\eta N}{5}\sum_{i>k_{m}^{*}}\lambda_{m}^{i}\omega_{i}^{2}\Bigr)\cdot\Bigl(\frac{1}{5}\cdot\sum_{i\leq k_{m}^{*}}\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}+\frac{\eta N}{5}\sum_{i>k_{m}^{*}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2}\Bigr) \\
&=\frac{\beta_{m}}{25}\cdot\Bigl(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}\Bigr)\cdot\Bigl(\sum_{i\leq k_{m}^{*}}\Gamma_{(1,m-1)}^{i}\lambda_{m}^{i}+\eta N\sum_{i>k_{m}^{*}}\Gamma_{(1,m-1)}^{i}(\lambda_{m}^{i})^{2}\Bigr)
\end{aligned}
$$

and

$$
\begin{aligned}
\text{bias term } b_2^m &\geq \Big(\frac{1}{5}\cdot\sum_{i\leq k_m^*}\Gamma_{(1,m-1)}^{i}\,\omega_i^2 + \frac{\eta N}{5}\sum_{i>k_m^*}\Gamma_{(1,m-1)}^{i}\,\lambda_m^i\,\omega_i^2\Big)\\
&= \frac{1}{5}\cdot\Big(\|\mathbf{w}_0-\mathbf{w}^*\|_{(\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_p)^{2N})_{0:k_m^*}}^2 + N\eta\,\|\mathbf{w}_0-\mathbf{w}^*\|_{(\prod_{p=1}^{m-1}(\mathbf{I}-\eta\mathbf{H}_p)^{2N}\mathbf{H}_m)_{k_m^*:\infty}}^2\Big)
\end{aligned}
$$

Now we are ready to examine term 2.

$$
\begin{aligned}
\text{term 2} ={}& \Big\langle \hat{\Phi}_1^{m-1}\mathbf{I},\ \sum_{t=0}^{N-1}\big(\mathcal{I}-\eta\mathcal{T}_{\mathbf{H}_m}(\eta)\big)^{t}\,\mathbf{B}_0\Big\rangle\\
\geq{}& \Big\langle \hat{\Phi}_1^{m-1}\mathbf{I},\ \frac{\beta_m}{4}\operatorname{tr}\big((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{N/2})\mathbf{B}_0\big)\cdot\big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{N/2}\big)\Big\rangle\\
&+ \Big\langle \hat{\Phi}_1^{m-1}\mathbf{I},\ \sum_{t=0}^{N-1}(\mathbf{I}-\eta\mathbf{H}_m)^{t}\cdot\mathbf{B}_0\cdot(\mathbf{I}-\eta\mathbf{H}_m)^{t}\Big\rangle\\
={}& \underbrace{\frac{\beta_m}{4}\operatorname{tr}\big((\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{N/2})\mathbf{B}_0\big)\cdot\Big\langle \hat{\Phi}_1^{m-1}\mathbf{I},\ \mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{N/2}\Big\rangle}_{\text{bias term } d_1^m}\\
&+ \underbrace{\frac{1}{2\eta}\Big\langle \hat{\Phi}_1^{m-1}\mathbf{H}_m^{-1}\cdot\big(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\big),\ \mathbf{B}_0\Big\rangle}_{\text{bias term } d_2^m}
\end{aligned}
$$
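The factor $\frac{1}{2\eta}\mathbf{H}_m^{-1}(\mathbf{I}-(\mathbf{I}-\eta\mathbf{H}_m)^{2N})$ in $d_2^m$ can be read as a geometric series summed along each eigen-direction. The following is only a numerical sketch of that scalar step, under the assumption (suggested by the eigenvalue sums used throughout this proof, but not restated here) that the matrices are evaluated in a common eigenbasis and that $0 < \eta\lambda < 1$. The function names are hypothetical. In the scalar case the sum $\sum_{t=0}^{N-1}(1-\eta\lambda)^{2t}$ equals $(1-(1-\eta\lambda)^{2N})/(\eta\lambda(2-\eta\lambda))$, which is at least the $1/(2\eta\lambda)$ form appearing in $d_2^m$.

```python
def geometric_sum(eta, lam, n_steps):
    """Directly evaluate sum_{t=0}^{N-1} (1 - eta*lam)^(2t)."""
    ratio = (1.0 - eta * lam) ** 2
    return sum(ratio ** t for t in range(n_steps))


def closed_form(eta, lam, n_steps):
    """Exact closed form of the geometric sum: (1 - (1-x)^(2N)) / (x * (2 - x))."""
    x = eta * lam
    return (1.0 - (1.0 - x) ** (2 * n_steps)) / (x * (2.0 - x))


def d2_scalar_factor(eta, lam, n_steps):
    """Per-eigenvalue factor used in d_2^m: (1 - (1-x)^(2N)) / (2x), with x = eta*lam."""
    x = eta * lam
    return (1.0 - (1.0 - x) ** (2 * n_steps)) / (2.0 * x)


if __name__ == "__main__":
    # Illustrative values only; eta*lam must lie in (0, 1).
    for lam in (0.01, 0.5, 1.0, 5.0):
        for n_steps in (1, 10, 200):
            s = geometric_sum(0.1, lam, n_steps)
            print(lam, n_steps, s, d2_scalar_factor(0.1, lam, n_steps))
```

Since $2-\eta\lambda \leq 2$, the $1/(2\eta\lambda)$ factor never exceeds the exact sum, so using it is consistent with a lower bound on term 2.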

Analogous to term 1, we have:

$$
\begin{aligned}
\text{bias term } d_1^m &= \frac{\beta_m}{4}\Big(\sum_i \big(1-(1-\eta\lambda_m^i)^{N/2}\big)\,\omega_i^2\Big)\cdot\Big(\sum_i \hat{\Phi}_1^{m-1}\lambda_m^i\,\big(1-(1-\eta\lambda_m^i)^{N/2}\big)\Big)\\
&\geq \frac{\beta_m}{25}\cdot\Big(\|\mathbf{w}_0-\mathbf{w}^*\|_{\mathbf{I}_{m,0:k_m^*}}^2 + N\eta\,\|\mathbf{w}_0-\mathbf{w}^*\|_{\mathbf{H}_{m,k_m^*:\infty}}^2\Big)\cdot\Phi_1^{m-1}\cdot\Big(\sum_{i<k_m^*}\lambda_m^i + \eta N\sum_{i>k_m^*}(\lambda_m^i)^2\Big)
\end{aligned}
$$

and

$$
\begin{aligned}
\text{bias term } d_2^m &\geq \Big(\sum_i \hat{\Phi}_1^{m-1}(\lambda_m^i)^{-1}\big(1-(1-\eta\lambda_m^i)^{2N}\big)\,\omega_i^2\Big)\\
&\geq \hat{\Phi}_1^{m-1}\Big(\frac{1}{5}\cdot\sum_{i\leq k_m^*}(\lambda_m^i)^{-1}\omega_i^2 + \frac{\eta N}{5}\sum_{i>k_m^*}\omega_i^2\Big)\\
&= \frac{\hat{\Phi}_1^{m-1}}{5}\cdot\Big(\|\mathbf{w}_0-\mathbf{w}^*\|_{(\mathbf{H}_m^{-1})_{0:k_m^*}}^2 + N\eta\cdot\|\mathbf{w}_0-\mathbf{w}^*\|_{\mathbf{I}_{k_m^*:\infty}}^2\Big)
\end{aligned}
$$
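The constants $\frac{1}{5}$ and $\frac{\eta N}{5}$ in the head and tail sums come from two elementary bounds on $1-(1-\eta\lambda)^{2N}$. The sketch below checks them numerically. It assumes the usual cutoff $k_m^* = \max\{k : \lambda_m^k \geq 1/(\eta N)\}$, which is not restated in this part of the proof, and it assumes $0 < \eta\lambda < 1$. The function names are hypothetical. With $x = \eta\lambda$, head directions ($x \geq 1/N$) should satisfy $1-(1-x)^{2N} \geq \frac{1}{5}$, and tail directions ($x < 1/N$) should satisfy $1-(1-x)^{2N} \geq \frac{Nx}{5}$.

```python
def decay_gap(x, n_steps):
    """Return 1 - (1 - x)^(2N), where x = eta * lambda lies in (0, 1)."""
    return 1.0 - (1.0 - x) ** (2 * n_steps)


def check_head_tail(n_steps, grid=10000):
    """Check both bounds on a uniform grid of x values in (0, 1).

    Head (x >= 1/N): decay_gap >= 1/5.
    Tail (x <  1/N): decay_gap >= N * x / 5.
    The 1/N threshold is an assumption about how k_m^* is defined.
    """
    threshold = 1.0 / n_steps
    for j in range(1, grid):
        x = j / grid
        gap = decay_gap(x, n_steps)
        if x >= threshold:
            if gap < 0.2:
                return False
        elif gap < n_steps * x / 5.0:
            return False
    return True


if __name__ == "__main__":
    for n_steps in (1, 10, 100):
        print(n_steps, check_head_tail(n_steps))
```

Dividing the head bound by $\lambda_m^i$ and the tail bound by $\lambda_m^i$ (so that $\frac{N\eta\lambda_m^i}{5\lambda_m^i} = \frac{\eta N}{5}$) gives the two sums in the second line of the bound on $d_2^m$.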

After $MN$ iterations, it holds that:

$$
\mathbf{B}_{MN} \succeq \prod_{m=1}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_m}(\eta)\big)^{N}\circ\mathbf{B}_0 + \sum_{m=1}^{M}\prod_{j=m}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_j}(\eta)\big)^{N}\,\mathbf{P}_m,
$$

where we denote $\mathbf{P}_m = \beta_m\eta^2\,(b_1^m+b_2^m+d_1^m+d_2^m)\cdot(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_m$.

Then, the bias error can be represented as follows:

$$
\begin{aligned}
\sum_{k=1}^{M}\langle\mathbf{H}_k,\mathbf{B}_{MN}\rangle
&\geq \underbrace{\sum_{k=1}^{M}\Big\langle\mathbf{H}_k,\ \prod_{m=1}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_m}(\eta)\big)^{N}\circ\mathbf{B}_0\Big\rangle}_{\text{bias term 1'}}
+ \underbrace{\sum_{k=1}^{M}\Big\langle\mathbf{H}_k,\ \sum_{m=1}^{M}\prod_{j=m}^{M}\big(\mathcal{I}-\eta\widetilde{\mathcal{T}}_{\mathbf{H}_j}(\eta)\big)^{N}\,\mathbf{P}_m\Big\rangle}_{\text{bias term 2'}}\\
&\geq \sum_{k=1}^{M}\|\mathbf{w}_0-\mathbf{w}^*\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_k}^2\\
&\quad+ \sum_{k=1}^{M}\Big\langle\mathbf{H}_k,\ \sum_{m=1}^{M}\prod_{j=m}^{M}(\mathbf{I}-\eta\mathbf{H}_j)^{2N}\,\beta_m\eta^2\,(b_1^m+b_2^m+d_1^m+d_2^m)\cdot(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_m\Big\rangle.
\end{aligned}
$$

It follows that:

$$
\begin{aligned}
\sum_{k=1}^{M}\langle\mathbf{H}_k,\mathbf{B}_{MN}\rangle
&\geq \sum_{k=1}^{M}\|\mathbf{w}_0-\mathbf{w}^*\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_k}^2\\
&\quad+ \sum_{k=1}^{M}\sum_i \lambda_k^i\cdot\sum_{m=1}^{M}\prod_{j=m}^{M}(1-\eta\lambda_j^i)^{2N}\,\beta_m\eta^2\,(b_1^m+b_2^m+d_1^m+d_2^m)\cdot(1-\eta\lambda_m^i)^{2N}\lambda_m^i\\
&\geq \sum_{k=1}^{M}\|\mathbf{w}_0-\mathbf{w}^*\|_{\prod_{m=1}^{M}(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_k}^2\\
&\quad+ \sum_{k=1}^{M}\sum_{m=1}^{M}\big({b_1^m}'+{b_2^m}'+{d_1^m}'+{d_2^m}'\big)
\end{aligned}
$$

where

$$
\begin{aligned}
{b_1^m}' ={}& \frac{\beta_m^2\eta^2}{25}\cdot\Big(\|\mathbf{w}_0-\mathbf{w}^*\|_{\mathbf{I}_{m,0:k_m^*}}^2 + N\eta\,\|\mathbf{w}_0-\mathbf{w}^*\|_{\mathbf{H}_{m,k_m^*:\infty}}^2\Big)\\
&\cdot\Big(\sum_{i<k_m^*}\Gamma_{(1,M)}^{i}\,\lambda_k^i\,(\lambda_m^i)^2 + \eta N\sum_{i>k_m^*}\Gamma_{(1,M)}^{i}\,(\lambda_m^i)^3\,\lambda_k^i\Big)\\
{b_2^m}' ={}& \frac{\beta_m\eta^2}{5}\cdot\Big(\|\mathbf{w}_0-\mathbf{w}^*\|_{(\bm{\Gamma}_{(1,M)}\mathbf{H}_m\mathbf{H}_k)_{0:k_m^*}}^2 + N\eta\,\|\mathbf{w}_0-\mathbf{w}^*\|_{(\bm{\Gamma}_{(1,M)}(\mathbf{I}-\eta\mathbf{H}_m)^{2N}\mathbf{H}_m^2\mathbf{H}_k)_{k_m^*:\infty}}^2\Big),
\end{aligned}
$$

and

$$
\begin{aligned}
{d_{1}^{m}}' ={}& \frac{\beta_{m}}{25}\cdot\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{I}_{m,0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{\mathbf{H}_{m,k_{m}^{*}:\infty}}^{2}\Big)\\
&\cdot\hat{\Phi}_{1}^{m-1}\cdot\Big(\sum_{i<k_{m}^{*}}\Gamma_{(m,M)}^{i}\lambda_{k}^{i}(\lambda_{m}^{i})+\eta N\sum_{i>k_{m}^{*}}\Gamma_{(m,M)}^{i}(\lambda_{m}^{i})^{2}\lambda_{k}^{i}\Big),\\
{d_{2}^{m}}' ={}& \frac{\beta_{m}\eta^{2}\hat{\Phi}_{1}^{m-1}}{5}\cdot\Big(\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(m,M)}\mathbf{H}_{k})_{0:k_{m}^{*}}}^{2}+N\eta\|\mathbf{w}_{0}-\mathbf{w}^{*}\|_{(\bm{\Gamma}_{(m,M)}(\mathbf{I}-\eta\mathbf{H}_{m})^{2N}\mathbf{H}_{m}\mathbf{H}_{k})_{k_{m}^{*}:\infty}}^{2}\Big).
\end{aligned}
$$

Appendix D Extension work

Note that when the step size is set to $\|\mathbf{x}_{m}\|^{-2}$, the update rule for the minimum-norm solution is equivalent to that of last-iterate SGD. Consequently, in this appendix we focus on this particular case (akin to the setting in Lin et al., 2023), which allows a direct comparison under a fixed set of conditions.

Consider a series of tasks $\mathbb{M}=\{1,2,\ldots,M\}$. We are given $M$ datasets; for each $m\in\mathbb{M}$, the dataset $D_{m}=\{(\mathbf{x}_{m,i},y_{m,i})\}_{i=1}^{N}$ is drawn i.i.d. from some fixed distribution $\mathcal{D}_{m}=\mathcal{X}_{m}\times\mathcal{Y}_{m}\subset\mathbb{R}^{d}\times\mathbb{R}$. Assume that the pairs $\{(\mathbf{x}_{m,i},y_{m,i})\}_{i=1}^{N}$ are sampled i.i.d. from a linear regression model, i.e., each $(\mathbf{x}_{m,i},y_{m,i})$ is a realization of the linear regression model $y_{m}=(\mathbf{x}_{m}^{\top}\mathbf{w}_{m}^{*})+z_{m}$, where $z_{m}$ is random noise satisfying the well-specified condition and $\mathbf{w}_{m}^{*}\in\mathbb{R}^{d}$ is the optimal model parameter for task $m$.
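The data model above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the features are taken to be isotropic Gaussian and the noise zero-mean Gaussian, neither of which the analysis requires, and the name `sample_task` and its parameters are hypothetical.

```python
import random


def sample_task(w_star, n, noise_std=0.1, seed=0):
    """Draw n pairs (x, y) with y = <x, w_star> + z.

    Assumptions (not from the paper): x has i.i.d. standard normal
    entries and z ~ N(0, noise_std^2).
    """
    rng = random.Random(seed)
    d = len(w_star)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = rng.gauss(0.0, noise_std)
        y = sum(xi * wi for xi, wi in zip(x, w_star)) + z
        data.append((x, y))
    return data


if __name__ == "__main__":
    # One dataset D_m per task, each with its own optimum w_m^*.
    optima = [[1.0, -2.0], [0.5, 0.5]]
    datasets = [sample_task(w, n=5, seed=m) for m, w in enumerate(optima)]
    print(len(datasets), len(datasets[0]))
```

Each call plays the role of one dataset $D_m$; using a different seed per task keeps the tasks independent of one another.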

We adopt the same learning procedure with this specific step size, aiming to output a model $\mathbf{w}_{M}^{N}$ that minimizes the performance measure of Lin et al. (2023), i.e.,

$$
G(\mathbf{w}_{M}^{N})=\frac{1}{M}\sum_{i=1}^{M}\|\mathbf{w}_{M}^{N}-\mathbf{w}_{i}^{*}\|^{2}. \tag{23}
$$

Therefore, our results can be restated as follows.

Theorem D.1.

Consider a scenario where the model $\mathbf{w}$ is trained via SGD on $M$ distinct tasks in the order $1,\ldots,M$. With the specific step size $\eta_{m,t}=\|\mathbf{x}_{m,t}\|^{-2}$, each task is run for $N$ iterations. Provided that Assumption 2.4 is satisfied, the following holds:

$$
\begin{aligned}
\mathbb{E}[G(\mathbf{w}_{M}^{N})] ={}& \frac{1}{M}\sum_{i=1}^{M}\|\mathbf{w}_{0}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{m=1}^{M}\prod_{t=1}^{N}\left(\mathbf{I}-\mathbf{H}_{m}\eta_{m,t}\right)}^{2}\\
&+\frac{1}{M}\sum_{i=1}^{M}\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}\left(\mathbf{I}-\mathbf{H}_{p}\eta_{p,j}\right)\prod_{j=q}^{N-t}\left(\mathbf{I}-\mathbf{H}_{m}\eta_{m,q}\right)\mathbf{H}_{m}\eta_{m,q}}^{2}\\
&+\frac{1}{M}\sum_{i=1}^{M}\sum_{m=1}^{M}\sum_{t=0}^{N-1}\|\bm{z}_{m,t}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}\left(\mathbf{I}-\mathbf{H}_{p}\eta_{p,j}\right)\prod_{q=1}^{N-t}\left(\mathbf{I}-\mathbf{H}_{m}\eta_{m,q}\right)\eta_{m,q}}^{2}.
\end{aligned}
$$
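The quantity bounded in Theorem D.1 can also be estimated numerically. The following is a minimal simulation sketch, not the authors' experimental code: it assumes Gaussian features, Gaussian noise, and a zero initialization, and the names `sequential_sgd` and `performance` are hypothetical. It runs SGD over the tasks in order with step size $\|\mathbf{x}_{m,t}\|^{-2}$ and evaluates $G$ on the final iterate.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sequential_sgd(task_optima, n_steps, noise_std=0.0, seed=0):
    """Train on tasks 1..M in order, n_steps iterations per task,
    with step size eta_{m,t} = ||x_{m,t}||^{-2}. Returns the last iterate."""
    rng = random.Random(seed)
    d = len(task_optima[0])
    w = [0.0] * d  # assumed initialization w_0^0 = 0
    for w_star in task_optima:
        for _ in range(n_steps):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # assumed Gaussian features
            y = dot(x, w_star) + rng.gauss(0.0, noise_std)
            eta = 1.0 / dot(x, x)
            residual = dot(x, w) - y
            w = [wi - eta * residual * xi for wi, xi in zip(w, x)]
    return w


def performance(w, task_optima):
    """G(w) = (1/M) * sum_i ||w - w_i^*||^2."""
    total = sum(
        sum((wi - si) ** 2 for wi, si in zip(w, w_star))
        for w_star in task_optima
    )
    return total / len(task_optima)


if __name__ == "__main__":
    optima = [[1.0, -1.0, 0.5], [0.8, -1.2, 0.4]]
    w_final = sequential_sgd(optima, n_steps=200, noise_std=0.05)
    print(performance(w_final, optima))
```

Averaging `performance` over many seeds would give a Monte Carlo estimate of $\mathbb{E}[G(\mathbf{w}_{M}^{N})]$ for a chosen task order.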
Remark 2.

In contrast to the approach in Theorem 3.1 and Theorem 3.2, here we do not rely on the bias-variance decomposition of the error. Instead, we use the fact that, with the specific step size $\eta_{m,t}=\|\mathbf{x}_{m,t}\|_{2}^{-2}$, the projection $(\mathbf{I}-\eta_{m,t}\mathbf{x}_{m,t}\mathbf{x}_{m,t}^{\top})$ is orthogonal to $\eta_{m,t}\mathbf{x}_{m,t}\mathbf{x}_{m,t}^{\top}$. This perspective allows us to derive a closed-form expression for the expected performance, which combines the effects of the initial parameter deviation, the variation among task-specific parameters, and random noise. Furthermore, Theorem D.1 characterizes the performance on general data distributions, going beyond the Gaussian setting discussed in Lin et al. (2023). When there is only a single sample per training iteration, our results can cover their findings.
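The orthogonality used in this remark can be checked numerically. The sketch below is illustrative only, with hypothetical helper names: for a fixed vector $\mathbf{x}$ it splits an arbitrary $\mathbf{v}$ into $\mathbf{P}\mathbf{v}$ and $(\mathbf{I}-\mathbf{P})\mathbf{v}$, where $\mathbf{P}=\mathbf{x}\mathbf{x}^{\top}/\|\mathbf{x}\|^{2}$, and confirms that the two parts are orthogonal and that their squared norms add up to $\|\mathbf{v}\|^{2}$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def split_by_projection(x, v):
    """Return ((I - P) v, P v) for P = x x^T / ||x||^2."""
    coeff = dot(x, v) / dot(x, x)
    along = [coeff * xi for xi in x]               # P v, the part along x
    rest = [vi - ai for vi, ai in zip(v, along)]   # (I - P) v
    return rest, along


if __name__ == "__main__":
    x = [3.0, 4.0]
    v = [1.0, 2.0]
    rest, along = split_by_projection(x, v)
    # Inner product of the two parts: zero up to rounding error.
    print(dot(rest, along))
    # Gap in the Pythagorean identity: also zero up to rounding error.
    print(dot(v, v) - dot(rest, rest) - dot(along, along))
```

This splitting is what lets the cross terms between the two components vanish when the squared norm is expanded.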

Proof.

For each iteration, according to the update rule of SGD, it holds that

$$
\mathbf{w}_{m}^{N}=\mathbf{w}_{m}^{N-1}-\eta\left(\bm{x}_{m,N}\left((\bm{x}_{m,N})^{\top}\mathbf{w}_{m}^{N-1}-y_{m,N}\right)\right),
$$

which can be rewritten as:

$$
\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*}=(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})+\eta z_{m,N}\bm{x}_{m,N}.
$$

Taking the expected squared norm of both sides gives:

$$
\begin{aligned}
&\mathbb{E}[\|\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*}\|^{2}]\\
={}&\mathbb{E}[(\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*})^{\top}(\mathbf{w}_{m}^{N}-\mathbf{w}_{i}^{*})]\\
={}&\mathbb{E}[(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})^{\top}(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})^{\top}(\mathbf{I}-\eta\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})+\eta^{2}(z_{m,N}\bm{x}_{m,N})^{\top}(z_{m,N}\bm{x}_{m,N})]\\
\overset{(*)}{=}{}&\mathbb{E}[\|(\mathbf{I}-\eta_{m,N}\bm{x}_{m,N}(\bm{x}_{m,N})^{\top})(\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*})\|^{2}+\|\eta_{m,N}\bm{x}_{m,N}(\bm{x}_{m,N})^{\top}(\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*})\|^{2}+\|\eta_{m,N}\bm{x}_{m,N}\bm{z}_{m,N}\|^{2}]\\
={}&\mathbb{E}[\|\mathbf{w}_{m}^{N-1}-\mathbf{w}_{i}^{*}\|_{(\mathbf{I}-\mathbf{H}_{m}\eta_{m,N})}^{2}]+\eta_{m,N}\sigma^{2}+\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\mathbf{H}_{m}\eta_{m,N}}^{2}\\
={}&\|\mathbf{w}_{m}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{t=1}^{N}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,t})}^{2}+\sum_{t=1}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{j=1}^{N-t}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,j})^{j}\mathbf{H}_{m}\eta_{m,N}}^{2}+\sum_{t=1}^{N-1}\|\bm{z}_{m,t}\|_{\prod_{j=1}^{N-t}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,j})^{j}\eta_{m,N}}^{2},
\end{aligned}
$$

where the equality marked (*) follows from the choice of step size, under which $(\mathbf{I}-\eta_{m,t}\bm{x}_{m,t}(\bm{x}_{m,t})^{\top})$ and $\eta_{m,t}\bm{x}_{m,t}(\bm{x}_{m,t})^{\top}$ are orthogonal projections; the resulting update equals the minimum-norm solution with one sample.
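The link to the minimum-norm solution can be seen in a small sketch (hypothetical function name, illustrative values). With $\eta=\|\bm{x}\|^{-2}$, a single SGD step moves $\mathbf{w}$ along $\bm{x}$ by exactly the amount needed to fit the current sample, so the new iterate satisfies $\bm{x}^{\top}\mathbf{w}=y$ and is the closest such point to the previous iterate.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sgd_step_normalized(w, x, y):
    """One SGD step on the squared loss with step size 1 / ||x||^2."""
    eta = 1.0 / dot(x, x)
    residual = dot(x, w) - y
    return [wi - eta * residual * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = [0.5, -1.0, 2.0]
    x = [1.0, 2.0, 2.0]
    y = 3.0
    w_new = sgd_step_normalized(w, x, y)
    # The updated iterate fits the sample: x^T w_new equals y up to rounding.
    print(dot(x, w_new) - y)
    # The displacement w_new - w is a multiple of x, i.e. the shortest move
    # onto the hyperplane {w : x^T w = y}.
    print([wn - wo for wn, wo in zip(w_new, w)])
```

Because the displacement is parallel to $\bm{x}$, any component of $\mathbf{w}$ orthogonal to $\bm{x}$ is left unchanged by the step.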

Considering $M$ tasks, it holds that

$$\begin{aligned}
\mathbb{E}\|\mathbf{w}_{m,}-\mathbf{w}_{i}^{*}\|^{2} &= \|\mathbf{w}_{0}^{0}-\mathbf{w}_{i}^{*}\|_{\prod_{m=1}^{M}\prod_{t=1}^{N}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,t})}^{2} \\
&\quad+\sum_{m=1}^{M-1}\sum_{t=0}^{N-1}\|\mathbf{w}_{m}^{*}-\mathbf{w}_{i}^{*}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}(\mathbf{I}-\mathbf{H}_{p}\eta_{p,j})\prod_{q=1}^{N-t}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,q})\mathbf{H}_{m}\eta_{m,q}}^{2} \\
&\quad+\sum_{m=1}^{M-1}\sum_{t=0}^{N-1}\|\bm{z}_{m}\|_{\prod_{p=1}^{M-m}\prod_{j=1}^{N}(\mathbf{I}-\mathbf{H}_{p}\eta_{p,j})\prod_{q=1}^{N-t}(\mathbf{I}-\mathbf{H}_{m}\eta_{m,q})\eta_{m,q}}^{2}.
\end{aligned}$$

In conclusion, we aggregate the performance metrics across tasks, ranging from $i=1$ to $i=M$, to derive the final result. ∎