Forget but Recall: Incremental Latent Rectification
in Continual Learning

Nghia D. Nguyen¹, Hieu Trung Nguyen²,
Ang Li³, Hoang Pham¹,
Viet Anh Nguyen², Khoa D. Doan¹
¹College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
²The Chinese University of Hong Kong, Hong Kong
³Simular, San Mateo, CA, USA
[email protected], [email protected],
[email protected], [email protected]
[email protected], [email protected]

Abstract

Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.

1 Introduction

Humans exhibit the innate capability to incrementally learn novel concepts while consolidating acquired knowledge into long-term memories [32]. More general Artificial Intelligence systems in real-world applications would require similar imitation to capture the dynamic of the changing data stream. These systems need to acquire knowledge incrementally without retraining, which is computationally expensive and exhibits a large memory footprint [34]. Nonetheless, existing learning approaches are yet to match human learning in this so-called Continual Learning (CL) problem due to catastrophic forgetting [28]. These systems encounter difficulty balancing the capability of incorporating new task knowledge while maintaining performance on learned tasks, or the plasticity-stability dilemma.

Representative CL approaches in the literature usually involve the use of memory buffer for rehearsal [33, 9, 7, 8, 5, 3], auxiliary loss term for learning regularization [22, 12, 49, 38], or structural changes such as pruning or model growing [37, 26, 14, 46]. These methods share the common objective of discouraging the deviation of learned knowledge representation. Rehearsal-based methods allow the model to revisit past exemplars to reinforce previously learned representations. Alternatively, regularization-based methods prevent changes in parameter spaces by formulating additional loss terms. However, both approaches present shortcomings, including keeping a rehearsal buffer of all past tasks during the model lifetime or infusing ad-hoc inductive bias into the regularization process. Meanwhile, structure-based methods utilize the over-parameterization property of the model by pruning, masking, or adding parameters to reduce new task interferences.

This paper studies a novel approach for CL named Incremental Latent Rectification (ILR), where we allow the model to “forget” knowledge of old tasks but then “recall” or rectify such “catastrophic forgetting” during inference using a sequence of lightweight knowledge mapping networks. These lightweight knowledge mapping networks, called rectifiers, help significantly reduce information loss on learned tasks by incrementally correcting the changes in the representation space. Specifically, for each new task, we add a small, simple, and computationally inexpensive auxiliary unit that will rectify the representation from the current task to the previous task. Our method differs from many network expansion methods, where additional parameters are allocated to minimize changes to the old parameters. Instead, we iteratively recover past task representations by backwardly propagating current representations through a series of mapping networks. Through this mechanism, ILR allows the optimal adaptation of a new task (plasticity) while separately mitigating catastrophic forgetting. In addition, different from previous CL approaches that modify the sequential training process (e.g., by changing the loss functions or using an additional buffer in fine-tuning), ILR does not change the new task’s learning, hence, ILR can be easily integrated into the existing CL pipelines.

Contributions.

We propose a new direction for CL by sequentially correcting the representation of the current task into the past task’s representation using a chain of lightweight rectifier units:

•

We propose a novel loss function for aligning the latent representation to guide the training procedure. The loss function is designed as a weighted sum of an L2-norm reconstruction error and a cosine distance metric.
•

To train the rectifier unit, we rely on either data samples from task $t-1$ or the current task $t$ ; when such data is unavailable (e.g., due to memory constraint or privacy concerns), a generative model that synthesizes task $t-1$ ’s data can also be utilized. At inference time, for the task-incremental setting, we construct a chain of rectifiers based on the provided task identity and forward the latent representation and inputs to correct the representation. For the class incremental setting, ILR forms the final prediction from an ensemble of predictions based on the reconstructed representations.
•

We empirically evaluate our approach on three widely-used continual learning benchmarks (CIFAR10, CIFAR100, and Tiny ImageNet) to demonstrate that our approach achieves comparable performance with the existing representative CL directions.

This paper unfolds as follows. Section 2 discusses the literature on the continual learning problems, and Section 3 describes our Incremental Latent Rectification method. Finally, Section 4 provides the empirical evidence for the effectiveness of our proposed solution.

2 Related Work

Catastrophic forgetting is a critical concern in artificial intelligence and is arguably one of the most prominent questions to address for DNNs. This phenomenon presents significant challenges when deploying models in different applications. Continual learning addresses this issue by enabling agents to learn throughout their lifespan. This aspect has gained significant attention recently [40, 16, 21, 4]. Considering a model well-trained on past tasks, we risk overwriting its past knowledge by adapting it for new tasks. The problem of knowledge loss can be addressed using different methods, as explored in the literature[47, 13, 22, 24, 9, 5, 37, 46] . These methods aim to mitigate knowledge loss and improve task performance through three main approaches: (1) Rehearsal-based methods, which involve reminding the model of past knowledge by using selective exemplars; (2) Regularization-based methods, which penalize changes in past task knowledge through regularization techniques; (3) Parameter-isolation and Dynamic Architecture methods, which allocate sub-networks or expand new sub-networks, respectively, for each task, minimizing task interference and enabling the model to specialize for different tasks.

Rehearsal-based. Experience replay methods build and store a memory of the knowledge learned so far [34, 25, 39, 35, 36, 50]. As an example, Averaged Gradient Episodic Memory (A-GEM) [9] builds an episodic memory of parameter gradients, while ER-Reservoir [11] uses a reservoir sampling method to maintain the episodic memory. These methods have shown strong performance in recent studies. However, they require a significant amount of memory for storing the examples.

Regularization-based. A popular early work using regularization is the elastic weight consolidation (EWC) method [22]. Other methods [49, 2, 42, 29, 1] propose different criteria to measure the “importance” of parameters. A later study showed that many regularization-based methods are variations of Hessian optimization [47]. These methods typically assume that there are multiple optima in the updated loss landscape in the new data distribution. One can find a good optimum for both the new and old data distributions by constraining the deviation from the original model weights.

Parameter Isolation. Parameter isolation methods allocate different subsets of the parameters to each task [37, 17, 31, 23]. From the stability-plasticity perspective, these methods implement gating mechanisms that improve stability and control plasticity by activating different gates for each task. Masse et al. [27] proposes a bio-inspired approach for a context-dependent gating that activates a non-overlapping subset of parameters for any specific task. Supermask in Superposition [44] is another parameter isolation method that starts with a randomly initialized, fixed base network and, for each task, finds a sub-network (supermask) such that the model achieves good performance.

Dynamic Architecture. Different from Parameter Isolation, which allocates subnets for tasks in a fixed main network, this approach dynamically expands the structure of the network. Yoon et al. [48] proposes a method that leverages the network structure trained on previous tasks to effectively learn new tasks, while dynamically expanding its capacity by adding or duplicating neurons as needed. Other methods [45, 30] reformulate CL problems into reinforcement learning (RL) problems, and leverage RL methods to determine when to expand the architecture during learning of new tasks. Yan et al. [46] introduces a two-stage learning method that first expands the previous frozen task feature representations by a new feature extractor, then re-trains the classifier with current and buffered data.

3 Proposed Framework

We consider the task-incremental and class-incremental learning scenarios, where we sequentially observe a set of tasks $t\in\{1,\ldots,N\}$ . The neural network comprises a single task-agnostic feature extractor $f$ and a classifier $w$ with task-specific heads $w^{(t)}|_{t=1}^{N}$ . The architecture of $f$ is fixed; however, its parameters are gradually updated as new tasks arrive. At task $t$ , the system receives the training dataset $\mathcal{D}_{t}^{\mathrm{train}}$ sampled from the data distribution $\mathcal{D}_{t}$ and learns the updated parameters of the feature extractor $f$ and $w$ . For easier discussion, the feature extractor and classifier obtained after learning at task $t$ are denoted as $f_{t}$ and $w_{t}$ , respectively. Thus, after learning on task $t$ , we obtain the evolved feature extractor $f_{t}$ and classifier $w_{t}$ . We call the latent space created by the feature extractor trained with $\mathcal{D}_{t}^{\mathrm{train}}$ as the $t$ -domain. Catastrophic forgetting occurs as the feature extractor $f_{t^{\prime}}$ is updated into $f_{t}$ , $t^{\prime}<t$ , which causes the $t^{\prime}$ -domain to be overwritten by the $t$ -domain. This domain shift degrades the model’s performance over time.

To overcome catastrophic forgetting, we propose a new CL paradigm: learning a latent rectification mechanism. This mechanism relies on a lightweight rectifier unit $r_{t}$ that learns to align the representations from the $t$ -domain to the $(t-1)$ -domain. Intuitively, this module “corrects” the representation change of a sample from the old task $t-1$ due to the evolution of the feature extractor $f$ when learning the newer task $t$ . These rectifier units will establish a chain of corrections for the representation of any task’s input, allowing the model to predict the rectified representation better. Figure 1 provides a visualization of the inference process on a task- $t$ sample, after learning $N$ tasks.

Learning the latent rectification mechanism is central to our proposed framework. In general, each rectifier unit should be small compared to the size of the final model or the feature extractor $f$ , and its learning process should be resource-efficient. In the following sections, we present and describe our solution for learning this mechanism.

Refer to caption — Figure 1: At task $t$ , the feature extractor $f_{t}$ and classifier head $w_{t}$ are optimized on the dataset $D_{t}^{\mathrm{train}}$ . During inference for a test sample from task $t$ , we forward the input data $x\in D_{t}^{\mathrm{test}}$ through the feature extractor and classifier head to obtain the logits. After learning all $N$ tasks, the DNN loses performance on task $t$ due to catastrophic forgetting. Therefore, the latent representation $f_{N}(x)$ is propagated through a series of rectifiers $r_{N},\ldots,r_{t+1}$ to perform incremental latent rectification and obtained approximated representations $\hat{f}_{N-1},\ldots,\hat{f}_{t}$ . The logits can be obtained by passing the recovered representation to the respective classifier head.

3.1 Learning the Rectifier Unit

As the training dataset $\mathcal{D}_{t}^{\mathrm{train}}$ of task $t$ arrives, we first update the feature extractor $f_{t}$ and the classifier head $w_{t}$ . The primary goal herein is to find $(f_{t},w_{t})$ that has high classification performance for task $t$ , and the secondary goal is to choose $f_{t}$ that can reduce the catastrophic forgetting on previous tasks. To combat catastrophic forgetting, we will first discuss the objective function for learning the lightweight rectifier unit $r_{t}$ and the potential alignment training data (or alignment set) $\mathcal{S}_{t}$ .

3.1.1 Alignment Loss

The goal of $r_{t}$ is to reduce the discrepancy between task $t$ ’s representation $f_{t}(x_{i})$ and the previous data representation $f_{t-1}(x_{i})$ , for $x_{i}\sim\mathcal{D}_{t-1}$ ; i.e. $r_{t}(f_{t}(x_{i}),x_{i})\approx f_{t-1}(x_{i})$ . One choice is the weighted linear combination of the $l_{2}$ error and the cosine error between $f_{t}(x_{i})$ with $r_{t}(f_{t}(x_{i}),x_{i})$ . This combination promotes alignment in both the magnitude and the direction between two representation vectors for improved representational similarity.

Let $s$ be a function, with parameters $\theta_{s}$ , that encodes inputs $x_{i}$ into its respective past representation in domain $t-1$ , and $\tau>0$ be the weight hyper-parameter; we define the alignment loss as:

\mathcal{L}_{\mathrm{align}}(\theta_{s};s,\tau,\mathcal{S}_{t},f_{t-1})=% \mathbb{E}_{x_{i}\sim\mathcal{S}_{t}}\left[\|s(x_{i})-f_{t-1}(x_{i})\|_{2}^{2}% +\tau(1-\cos(s(x_{i}),f_{t-1}(x_{i})))\right].

(1)

In practice, we could either store the value of $f_{t-1}(x_{i})$ together with $x_{i}$ in memory or $f_{t-1}$ directly.

3.1.2 Alignment Set

The alignment set $\mathcal{S}_{t}$ is used as the training data for the rectifier unit $r_{t}$ enables the rectifier unit to efficiently learn the mapping from the $t$ -domain back to the $t-1$ -domain. The design of ILR enables several options for selecting the alignment set, including $\mathcal{D}_{t-1}^{\mathrm{train}}$ , $\mathcal{D}_{t}^{\mathrm{train}}$ , or a generative method.

Task $t-1$ data. The simplest choice for the alignment set $\mathcal{S}_{t}$ is the $\mathcal{D}_{t-1}^{\text{train}}$ (i.e., the training data from the previous task $t-1$ ), which is sampled directly from the task $t-1$ ’s distribution. With this option, each element in $\mathcal{S}_{t}$ is a pair $(x_{i},\hat{z}_{i})$ , where $x_{i}\in\mathcal{D}_{t-1}^{\text{train}}$ is chosen randomly and $\hat{z}_{i}=f_{t-1}(x_{i})$ is the associated latent representation of $x_{i}$ under the feature extractor $f_{t-1}$ . Note that this option does not keep data samples from all past tasks $t\in\{1,\ldots,N\}$ like the rehearsal-based methods [43].

Task $t$ data. Another potential option for $\mathcal{S}_{t}$ is task- $t$ ’s data. If we expect the tasks’ data to not be completely unrelated, using data from $\mathcal{D}_{t}^{\text{train}}$ to train $r_{t}$ is reasonable. As we show in Section 4, we could achieve comparable performance to some rehearsal-based methods while remaining data-free when setting $\mathcal{S}_{t}=\mathcal{D}_{t}^{\text{train}}$ . Additionally, for this option, since we do not have access to $t-1$ -domain data, we need to keep a copy of $f_{t-1}$ to approximate $\hat{z_{i}}=f_{t-1}(x_{i})$ with $x_{i}\in\mathcal{D}_{t}^{\mathrm{train}}$ .

Generated task $t-1$ data. Generative methods provide a potential option for creating training data for the rectifier unit $r_{t}$ . Instead of keeping the alignment set $\mathcal{S}_{t}\subseteq\mathcal{D}_{t-1}^{\mathrm{train}}$ , we could train a generative neural network $G_{t-1}$ that learns the task $t-1$ distribution. Unlike generative continual learning methods, $G_{t-1}$ only needs to remember the task $t-1$ distribution instead of all past tasks. Thus, LRB can easily integrate with existing generative methods.

In addition, we could fill $\mathcal{S}_{t}$ with randomly initialized samples. Nonetheless, our experiments indicate that this approach is ineffective. Therefore, we will focus our discussion on the first three options and leave the exploration for other choices of $\mathcal{S}_{t}$ for future works.

Distiction from buffer-based methods. Rehearsal-based methods retains the data from all past tasks $t\in\{1,\ldots,N\}$ during the lifetime of the DNN. Meanwhile, depending on the choice of alignment set $\mathcal{S}_{t}$ , ILR could be considered strictly data-free if $\mathcal{S}_{t}=\mathcal{D}_{t}$ or using the generative method. While for $\mathcal{S}_{t}\subseteq\mathcal{D}_{t-1}$ , ILR can still arguably be a data-free method since task $t-1$ data is only retained until the end of task $t$ .

3.2 Incremental Latent Alignment

The latent alignment mechanism relies on a chain of task-specific rectifier units $(r_{t})_{t=2}^{N}$ that aims to correct the distortion of the representation space as the extractor $f$ learns a new task.

3.2.1 Latent Alignment

For an input $x$ at task $t-1$ , its feature representation under the feature extractor $f_{t-1}$ is $f_{t-1}(x)$ . One can heuristically define the $(t-1)$ -domain as the representation of the input under the feature extractor $f_{t-1}$ . Unfortunately, the $(t-1)$ -domain is brittle under extractor update: as the subsequent task $t$ arrives, the feature extractor is updated to $f_{t}$ , and the corresponding feature representation of the same input $x$ will be shifted to $f_{t}(x)$ . Likely, the $t$ -domain and the $(t-1)$ -domain do not coincide, and $f_{t}(x)\neq f_{t-1}(x)$ .

The feature rectifier unit $r_{t}$ aims to offset this representation shift. To do this, $r_{t}$ takes $x$ , and its $t$ -domain representation $f_{t}(x)$ as input, and it outputs the rectified representation that satisfies

r_{t}\circ(f_{t}\times I)(x)=r_{t}(f_{t}(x),x)\approx f_{t-1}(x),

(2)

with identity function $I$ .

With this formulation, we can effectively minimize the difference between the rectified representation $r_{t}\circ(f_{t}\times I)(x)$ and the original representation $f_{t-1}(x)$ . In practice, we only want to train the rectifier unit $r_{t}$ and retain the learned feature extractor $f_{t}$ ; therefore, let $s=r_{t}\circ(f_{t}\times I)$ , we can minimize the difference by using $L_{\mathrm{align}}(\theta_{r_{t}};s,\tau,\mathcal{S}_{t},f_{t-1})$ as in Equation 1.

3.2.2 Rectifier Architecture

The proposed rectifier is composed of three trainable components: a weak feature extractor $h_{t}$ , a compress layer $a_{t}$ , and a combine layer $b_{t}$ . The size of the rectifier units increases linearly with respect to the number of tasks, similar to the classification heads. However, since the rectifier unit is lightweight, this is trivial compared to the size of the full model. Figure 2 visualizes the feature rectifier unit.

Weak feature extractor $h_{t}$ . The weak feature extractor $h_{t}$ processes the input data $x$ to generate a simplified representation $h_{t}(x)$ . $h_{t}$ is distilled from $f_{t-1}$ to compress the knowledge of $f_{t-1}$ into a more compact, lower-dimensional representation while remaining parameter-efficient. For our experiment, we choose the simplest and most naive design of a weak feature extractor composed of only two 3x3 convolution layers and two max pooling layers. Instead of processing the full-size image, we use max-pooling to downsample the input to 16x16 images before feeding into $h_{t}$ . The weak feature extractor is a small network compared to the main model ( $h_{t}$ ’s architecture is provided in Table 6).

Compress layer $a_{t}$ . The compress layer $a_{t}$ receives the current latent value $f_{t}(x)$ and produces a compact representation $a_{t}(f_{t}(x))$ of reduced dimensionality. This layer essentially forms a bottleneck that only allows relevant $t-1$ -domain knowledge to pass through. We design the compress layer as a simple linear layer.

Combine layer $b_{t}$ . The combine layer $b_{t}$ recevies the concatenated representaton of the compressed representation $a_{t}\circ f_{t}(x)$ and the weakly extracted features $h_{t}(x)$ to form the rectified representation $r_{t}(f_{t}(x),x)$ . We design the combine layer as a simple linear layer.

Distiction from network-expansion approach. It could be argued that one can, instead, separately train a weak feature extractor $h_{t}$ for each task, making it a network-expansion CL approach. However, because $h_{t}$ is a small network, this approach is ineffective; specifically, our experiments demonstrate that the task-incremental average accuracy across all tasks of this approach on CIFAR100 falls below $53\%$ . Furthermore, for network-expansion approaches, the dedicated parameters are allocated for new task learning, which is fundamentally different from ILR’s objective to correct representation changes. The new task’s knowledge is acquired by $f_{t}$ and $w_{t}$ .

3.3 Training Procedure

Network training. Similar to conventional DNN training, the performance of the feature extractor $f_{t}$ and the classifier head $w_{t}$ is measured by the standard multi-class cross-entropy loss:

\mathcal{L}_{\mathrm{train}}(\theta_{f_{t}},\theta_{w_{t}})=\mathcal{L}_{% \mathrm{CE}}(\theta_{f_{t}},\theta_{w_{t}};f_{t},w_{t},\mathcal{D}_{t}^{% \mathrm{train}})=\mathbb{E}_{(x_{i},y_{i})\sim\mathcal{D}_{t}^{\mathrm{train}}% }\left[-\sum_{c=1}^{M_{t}}y_{i}\log(\hat{y}_{i})\right],

(3)

where $M_{t}$ is the number of classes of task $t$ , $\hat{y}_{i}$ is the probability-valued network output for the input $x_{i}$ that depends on the feature extractor $f_{t}$ and the classifier $w_{t}$ as $\hat{y}_{i}=w_{t}\circ f_{t}(x_{i})$ .

Furthermore, if the alignment set $\mathcal{S}_{t}$ uses either task $t-1$ data or generative network $G_{t-1}$ , we could also utilize $\mathcal{S}_{t}$ to further enforce task $t-1$ representation consistency, reduce forgetting, and enable more effective rectification by training and regularizing $f_{t}$ on $\mathcal{D}_{t}^{\mathrm{train}}$ and $\mathcal{S}_{t}$ , respectively. Let $s=f_{t}$ , then we can similarly use $\mathcal{L}_{\mathrm{align}}$ in Equation 1 with hyperparameter $\alpha$ :

\mathcal{L}_{\mathrm{train}}(\theta_{f_{t}},\theta_{w_{t}})=\mathcal{L}_{% \mathrm{CE}}(\theta_{f_{t}},\theta_{w_{t}};f_{t},w_{t},\mathcal{D}_{t}^{% \mathrm{train}})+\alpha\mathcal{L}_{\mathrm{align}}(\theta_{f_{t}};s,\tau,% \mathcal{S}_{t},f_{t-1}).

(4)

This is different from the rehearsal method since $f$ only visits $\mathcal{D}_{t-1}$ at task $t-1$ and task $t$ . After task $t$ , $f$ never seen $\mathcal{D}_{t-1}$ again, while for rehearsal method, $f$ observe samples from $\mathcal{D}_{t-1}$ throughout its lifetime, risk overfitting on stored exemplars.

Input : Training dataset

\mathcal{D}^{\mathrm{train}}_{t}

, weight parameter for loss functions

\alpha,\tau

Output : Feature extractor

f_{t}

, rectifier unit

r_{t}

1 Train

f_{t}

and

w_{t}

jointly by minimizing

\mathcal{L}_{\mathrm{\mathrm{train}}}(\theta_{f_{t}},\theta_{w_{t}})

[Equation 3, Equation 4];

3Distill

h_{t+1}

;

4 if t>1 then

5 Freeze

f_{t}

;

6 Train

r_{t}

\mathcal{S}_{t}

using

\mathcal{L}_{\mathrm{align}}(\theta_{r_{t}};s,\tau,\mathcal{S}_{t},f_{t-1})

with

s=r_{t}\circ(f_{t}\times I)

[Equation 1]

7 end if

Algorithm 1 Full training framework at task

t\in\{1,2,...,n\}

Rectifier training. Training the rectifier follows two main steps: train the weak feature extractor and then the compress/combine layers. The weak feature extractor $h_{t}$ is distilled from $f_{t-1}$ as task $t-1$ training is completed. Let $g_{t}$ be a temporary linear layer mapping $h_{t}$ ’s smaller dimension to $f_{t-1}$ ’s higher dimension, $s=g_{t}\circ h_{t}$ and $\mathcal{S}_{t}=\mathcal{D}_{t-1}$ , we train $h_{t}$ using the $\mathcal{L}_{\mathrm{align}}(\theta_{s};s,\tau,\mathcal{S}_{t},f_{t-1})$ in Equation 1. Similarly, as detailed in Section 3.2.1, we train the remaining components of $r_{t}$ , i.e., compress/combine layers, at the end of task $t$ . Details of ILR’s training algorithm are provided in Algorithm 1

3.4 Inference Procedure

We now describe how to stack multiple rectifier units $r_{t}$ into a chain for inference. As a new task arrives, our model dynamically extends an additional rectifier unit, forming a sequence of rectifiers.

Task-Incremental. We consider a task-incremental learning setting where a test sample $x_{i}$ is coupled with a task identifier $t_{i}\in\{1,\ldots,N\}$ . To classify $x_{i}$ , we can recover $\hat{f}_{t_{i}}(x)$ by forwarding the current latent variable $f_{N}(x)$ through a chain of $N-t_{i}$ rectifiers. We then pass this recovered latent variable through classifier head $w_{t_{i}}$ to make a prediction. The output $\hat{y}_{i}$ is computed as

\displaystyle\hat{y}_{i}=w_{t_{i}}(\hat{f}_{t_{i}}(x_{i}))\quad\text{where}% \quad\hat{f}_{t_{i}}(x_{i})=r_{t_{i}+1}\circ(\hat{f}_{t_{i}+1}\times I)(x)% \quad\text{with}\quad t_{i}<N,\hat{f}_{N}=f_{N}

Class-Incremental. ILR relies on the task identity to reconstruct the appropriate sequence of rectifier units for propagating the latent representation to the original space. However, no identity is provided for the CL method in the class-incremental learning setting. We provided a simple method for inference without task identity, which demonstrates the method’s extension to class-incremental learning; however, more robust task-identity inference methods could also be incorporated.

We obtain the class-incremental probabilities by forming an ensemble that averages the class probabilities over all domains. From the current task $t$ ’s domain, we iteratively rectified the latent back to task $t-1$ , task $t-2$ , …, task $1$ ’s domain. At each domain, we obtain the rectified representation corresponding with the domain, which we forward through the respective classifier. We then average the softmax probabilities of each domain, essentially forming an ensemble of $w_{i}(f_{i})|_{i=1}^{t}$ .

4 Experiments

Our implementation is based partially on the Mammoth [6, 7] repository, TAMiL [5] repository, and CLS-ER [3] repository.

4.1 Evaluation Protocol

Datasets. We select three standard continual learning benchmarks for our experiments: Sequential CIFAR10 (S-CIFAR10), Sequential CIFAR100 (S-CIFAR100), and Sequential Tiny ImageNet (S-TinyImg). Specifically, we divide S-CIFAR10 into 5 binary classification tasks, S-CIFAR100 into 5 tasks with 20 classes each, and S-TinyImg into 20 tasks with 20 classes each.

Baselines. We evaluate ILR against representative continual learning methods, including EWC (online) [38], and LwF (multi-class) [24], ER [10], AGEM [9], DER++ [7], ER-ACE [8], CLS-ER [3], TAMiL [5]. We further provide an upper and lower bound for all methods by joint training on all tasks’ data and fine-tuning without any catastrophic forgetting mitigation. We employ ResNet18 [15] as the unified feature extractor for all benchmarks. The classifier comprises a fixed number of separate linear heads for each task. More datasets and implementation details are provided in the Appendix.

Table 1: Task-Incremental Average Accuracy across all tasks after CL training. Joint: the upper bound accuracy when jointly training on all tasks (i.e., multi-task learning). Finetuning: the lower bound accuracy when learning without any CL techniques.

|\mathcal{B}|

is the buffer of all past tasks data, while

|\mathcal{S}_{t}|

is the alignment training data set, which only contains data from task

t-1

Method	$\|\mathcal{B}\|$	$\|\mathcal{S}_{t}\|$	S-CIFAR10		S-CIFAR100		S-TinyImg
TIL			NP	AA	NP	AA	NP	AA
Joint	-	-	11.17M	98.46 $\pm$ 0.07	11.22M	86.37 $\pm$ 0.17	11.27M	81.86 $\pm$ 0.57
Finetuning	-	-	11.17M	64.16 $\pm$ 2.40	11.22M	24.01 $\pm$ 2.14	11.27M	13.79 $\pm$ 0.23
o-EWC	-	-	11.17M	69.60 $\pm$ 5.22	11.22M	36.61 $\pm$ 3.82	11.27M	15.67 $\pm$ 0.67
LwF.mc	-	-	11.17M	60.96 $\pm$ 1.48	11.22M	41.00 $\pm$ 1.01	11.27M	23.24 $\pm$ 0.71
AGEM	500	-	11.17M	90.37 $\pm$ 1.05	11.22M	63.35 $\pm$ 1.47	11.27M	37.14 $\pm$ 0.32
ER			11.17M	94.24 $\pm$ 0.24	11.22M	67.41 $\pm$ 0.70	11.27M	46.07 $\pm$ 0.16
DER++			11.17M	92.49 $\pm$ 0.55	11.22M	68.52 $\pm$ 0.91	11.27M	50.84 $\pm$ 0.12
ER-ACE			11.17M	94.52 $\pm$ 0.13	11.22M	67.26 $\pm$ 0.50	11.27M	47.72 $\pm$ 0.42
TAMiL			22.68M	94.89 $\pm$ 0.16	22.77M	76.39 $\pm$ 0.29	23.20M	64.24 $\pm$ 0.69
CLS-ER			33.52M	95.35 $\pm$ 0.34	33.66M	77.03 $\pm$ 0.81	33.81M	54.69 $\pm$ 0.37
ILR	-	500	13.31M	86.28 $\pm$ 0.69	13.36M	74.59 $\pm$ 0.52	16.08M	59.78 $\pm$ 0.39
AGEM	1000	-	11.17M	91.68 $\pm$ 1.48	11.22M	67.43 $\pm$ 1.37	11.27M	46.94 $\pm$ 0.91
ER			11.17M	95.25 $\pm$ 0.07	11.22M	69.69 $\pm$ 1.49	11.27M	54.54 $\pm$ 0.40
DER++			11.17M	93.76 $\pm$ 0.23	11.22M	72.27 $\pm$ 1.13	11.27M	58.67 $\pm$ 0.28
ER-ACE			11.17M	94.69 $\pm$ 0.25	11.22M	72.46 $\pm$ 0.58	11.27M	57.37 $\pm$ 0.49
TAMiL			22.68M	95.22 $\pm$ 0.42	22.77M	78.72 $\pm$ 0.31	23.20M	70.89 $\pm$ 0.04
CLS-ER			33.52M	96.05 $\pm$ 0.11	33.66M	79.36 $\pm$ 0.20	33.81M	65.00 $\pm$ 0.02
ILR	-	1000	13.31M	91.02 $\pm$ 1.76	13.36M	78.53 $\pm$ 0.25	16.08M	66.79 $\pm$ 0.64
ILR	-	5000	13.31M	94.84 $\pm$ 0.31	13.36M	82.05 $\pm$ 0.29	16.08M	72.50 $\pm$ 0.92

4.2 Results

Table 1 shows the performance of ILR and other CL methods, including rehearsal-based and regularization-based methods, on multiple sequential datasets, including S-CIFAR10, S-CIFAR100, and S-TinyImg. For ILR, we create an alignment set from 500, 100, and 5000 samples of $\mathcal{D}_{t-1}^{\mathrm{train}}$ . As can be observed from the table, ILR achieves comparable results on S-CIFAR10, compared to the baselines. On S-CIFAR100 and S-TinyImg, ILR outperforms all the baselines given a sufficient alignment set, indicating its ability to rectify representation changes incrementally.

Table 2: Class-Incremental Average Accuracy across all tasks after CL training. The settings are similar to Table 1.

Method	$\|\mathcal{B}\|$	$\|\mathcal{S}_{t}\|$	S-CIFAR100
CIL			NP	AA
Joint	-	-	11.22M	71.07 $\pm$ 0.27
Finetuning	-	-	11.22M	17.50 $\pm$ 0.09
DER++	1000	-	11.22M	46.96 $\pm$ 0.17
ER-ACE			11.22M	47.09 $\pm$ 1.16
TAMiL			22.77M	51.83 $\pm$ 0.41
CLS-ER			33.66M	51.13 $\pm$ 0.12
ILR	-	1000	13.56M	42.53 $\pm$ 0.43
ILR	-	5000	13.56M	48.90 $\pm$ 0.28

Table 2 demonstrates the extension of ILR to class-incremental settings. As the class-incremental probabilities are simply obtained through averaging, we can still achieve comparable performance to other rehearsal-based methods given a sufficient alignment set.

4.3 Result with different alignment sets

We further evaluate choices of training data used for alignment set $\mathcal{S}_{t}$ , as discussed in Section 3. The choices include using samples from the previous task’s training data $\mathcal{D}_{t-1}^{\text{train}}$ , the current task’s training data $\mathcal{D}_{t-1}^{\text{train}}$ , and generative network $G_{t-1}$ . Details for $G_{t-1}$ training are included in the Appendix.

Table 3 shows the results of these experiments. As can be observed, training with data from $\mathcal{D}_{t-1}^{\text{train}}$ expectedly achieves better performance since the data is sampled directly from the data distribution $\mathcal{D}_{t}$ of the previous task; increasing the number of samples from $\mathcal{D}_{t-1}^{\text{train}}$ yields better performance results. The generative network also achieves comparable results due to its ability to synthesize data from $\mathcal{D}_{t}$ . Nevertheless, training with $\mathcal{D}_{t}^{\text{train}}$ is also an attractive choice for its reasonable performance and the fact that we do not need to keep a copy of the previous task’s data.

Table 3: Average Accuracy across 5 tasks for S-CIFAR100 dataset with different options of alignment training data.

Variation	Keep $t-1$ data	Keep $f_{t-1}$	Keep $G_{t-1}$	Avg. Accuracy
ILR with $\mathcal{S}_{t}=\mathcal{D}_{t}^{\mathrm{train}}$	-	✓	-	69.22 $\pm$ 0.40
ILR with $\mathcal{S}_{t}\subset\mathcal{D}_{t-1}^{\mathrm{train}},\|\mathcal{S}_{t}\|=5000$	✓	-	-	82.05 $\pm$ 0.29
ILR-GAN $(\mathcal{S}_{t}\sim\mathcal{D}_{t-1}$ )	-	✓	✓	79.51 $\pm$ 0.48

4.4 Parameter Growth Comparison

Table 4: Number of parameters

\downarrow

(in millions) of different methods after

N

tasks. Results for baselines are taken from [5] and [3], measured on the S-TinyImg. The ResNet-18 network with no classifier head is 11.17 million parameters

Methods	5 tasks	10 tasks	20 tasks
ResNet-18	11.27M	11.27M	11.27M
TAMiL [5]	22.87M	23.20M	23.85M
CLS-ER [3]	33.81M	33.81M	33.81M
LRB	13.94M	16.08M	21.96M

This section studies the network-size footprint of our framework. The base ResNet-18 has 11.17 million parameters. We report the network sizes after 5, 10, and 20 tasks for ILR and the two baselines, CSL-ER, and TAMIL in Table 4. As we can observe, ILR exhibits a linear memory growth and has the smallest memory footprint among the three baselines. Further analysis reveals that the compress layer (512x384 linear layer) and the combine layer (512x512 linear layer) contribute the most to memory usage, requiring approximately 0.20 million and 0.26 million parameters per task, respectively. Meanwhile, the weak feature extractor contribution to the total number of parameters is negligible at 0.07 million parameters per task.

4.5 Rectifier Quality Experiment

In this section, we utilize Principal Component Analysis (PCA) to visualize our learned latent space against the target latent space of the previous task and verify the behaviors of the rectifier in recovering past representation. Figure 3 shows the PCA plots with the first two components. As can be observed, the new representations of data from the previous task (red) after learning the current task change significantly from their original representations (green), which explains catastrophic forgetting. With ILR’s mechanism, the rectified data representations (blue) can now accurately align with the ‘true’ data representations (green), supporting the empirical effectiveness of our framework.

4.6 Ablation study

In this section, we investigate the impact of our alignment loss. We isolate the effect of the alignment loss by setting $\tau=0$ in Equation 1, effectively replacing it with a simple $l$ - $2$ norm. To analyze the contribution of the representation regularization on rectification effectiveness, we set $\alpha=0$ in Equation 4, eliminating it from the main feature extractor’s training process.

Table 5: Ablating

\tau

and

\alpha

for

|\mathcal{S}_{t}|=5000

Method	Hyperparameter	S-CIFAR10	S-CIFAR100	S-TinyImg
ILR	$\alpha=0.0,\tau\neq 0.0$	89.68 $\pm$ 0.75	72.45 $\pm$ 0.42	58.73 $\pm$ 0.81
ILR	$\alpha\neq 0.0,\tau=0.0$	90.82 $\pm$ 1.17	81.67 $\pm$ 0.22	72.07 $\pm$ 0.37
ILR	$\alpha\neq 0,\tau\neq 0$	94.84 $\pm$ 0.31	82.05 $\pm$ 0.29	72.50 $\pm$ 0.92

5 Limitations

We have shown the potential and high utility of ILR’s CL learning mechanism in this paper. Nevertheless, ILR also has some limitations. One limitation is that ILR still maintains an additional DNN, i.e., the rectifier, which incurs an additional overhead as the number of tasks increases. Inference cost for long chain would be costly, which can be further explored with modified chaining methods such as skipping (i.e., building a rectifier every two tasks). Additionally, the best performance is achieved when having access to task $t-1$ ’s data. Ideally, we would want to remove this requirement; thus, future research should focus on the creation of the alignment training data. We have attempted to demonstrate that generative methods are a viable option. Furthermore, since ILR relies on the task identity to reconstruct the rectifier sequence, application to class-incremental learning settings requires either inferring task identity or forming an ensemble of predictions. The proposed ensemble solution might suffer from over-confident or under-confident classifiers. Class-incremental learning is still an open research, where more effective adaptations of our framework can be discovered.

6 Conclusion

This work proposes a new CL paradigm, ILR, for task incremental learning. ILR tackles catastrophic forgetting through its novel backward-recall mechanism that learns to align the newly learned presentation of past data to their correct representations. Unlike existing CL methods, it requires neither a replay buffer nor intricate training modifications. Our experiments validate that the proposed ILR achieves comparable results to the performance of existing CL baselines for task-incremental and class-incremental learning.

References

Ahn et al. [2019] H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems, pages 4394–4404, 2019.
Aljundi et al. [2018] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
Arani et al. [2022] E. Arani, F. Sarfraz, and B. Zonooz. Learning fast, learning slow: A general continual learning method based on complementary learning system. In International Conference on Learning Representations, 2022.
Balaji et al. [2020] Y. Balaji, M. Farajtabar, D. Yin, A. Mott, and A. Li. The effectiveness of memory replay in large scale continual learning. arXiv preprint arXiv:2010.02418, 2020.
Bhat et al. [2023] P. S. Bhat, B. Zonooz, and E. Arani. Task-aware information routing from common representation space in lifelong learning. In The Eleventh International Conference on Learning Representations, 2023.
Boschini et al. [2022] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Buzzega et al. [2020] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15920–15930. Curran Associates, Inc., 2020.
Caccia et al. [2022] L. Caccia, R. Aljundi, N. Asadi, T. Tuytelaars, J. Pineau, and E. Belilovsky. New insights on reducing abrupt representation change in online continual learning. In International Conference on Learning Representations, 2022.
Chaudhry et al. [2019a] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019a.
Chaudhry et al. [2019b] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. Dokania, P. Torr, and M. Ranzato. Continual learning with tiny episodic memories. In Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019b.
Chaudhry et al. [2019] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. S. Torr, and M. Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
Ebrahimi et al. [2020] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklUCCVKDB.
Farajtabar et al. [2020] M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR, 2020.
Fernando et al. [2017] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Hu et al. [2021] H. Hu, A. Li, D. Calandriello, and D. Gorur. One pass imagenet. In NeurIPS 2021 Workshop on ImageNet: Past, Present, and Future, 2021. URL https://openreview.net/forum?id=mEgL92HSW6S.
Jerfel et al. [2019] G. Jerfel, E. Grant, T. L. Griffiths, and K. A. Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. In NeurIPS, 2019.
Kang and Park [2020] M. Kang and J. Park. ContraGAN: Contrastive Learning for Conditional Image Generation. 2020.
Kang et al. [2021] M. Kang, W. Shim, M. Cho, and J. Park. Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training. 2021.
Kang et al. [2023] M. Kang, J. Shin, and J. Park. StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
Kirichenko et al. [2021] P. Kirichenko, M. Farajtabar, D. Rao, B. Lakshminarayanan, N. Levine, A. Li, H. Hu, A. G. Wilson, and R. Pascanu. Task-agnostic continual learning with hybrid probabilistic models. 2021. URL https://openreview.net/forum?id=ZbSeZKdqNkm.
Kirkpatrick et al. [2017] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
Li et al. [2019] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
Li and Hoiem [2017] Z. Li and D. Hoiem. Learning without forgetting. arXiv preprint arXiv:1606.09282, 2017.
Lopez-Paz and Ranzato [2017] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
Mallya and Lazebnik [2018] A. Mallya and S. Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
Masse et al. [2018] N. Y. Masse, G. D. Grant, and D. J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44):10467–10475, 2018.
McCloskey and Cohen [1989] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
Nguyen et al. [2018] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
Qin et al. [2021] Q. Qin, W. Hu, H. Peng, D. Zhao, and B. Liu. Bns: Building network structures dynamically for continual learning. Advances in Neural Information Processing Systems, 34:20608–20620, 2021.
Rao et al. [2019] D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
Rasch and Born [2007] B. Rasch and J. Born. Maintaining memories by reactivation. Current Opinion in Neurobiology, 17(6):698–703, 2007.
Ratcliff [1990] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychology Review, 97(2):285–308, Apr. 1990.
Rebuffi et al. [2016] S.-A. Rebuffi, A. I. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542, 2016.
Riemer et al. [2018] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
Rios and Itti [2018] A. Rios and L. Itti. Closed-loop GAN for continual learning. arXiv preprint arXiv:1811.01146, 2018.
Rusu et al. [2016] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Schwarz et al. [2018] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4528–4537. PMLR, 2018.
Shin et al. [2017] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
Sun et al. [2022] S. Sun, D. Calandriello, H. Hu, A. Li, and M. Titsias. Information-theoretic online memory selection for continual learning. In International Conference on Learning Representations (ICLR), 2022.
Tseng et al. [2021] H.-Y. Tseng, L. Jiang, C. Liu, M.-H. Yang, and W. Yang. Regularing generative adversarial networks under limited data. In CVPR, 2021.
Van et al. [2022] L. N. Van, N. L. Hai, H. Pham, and K. Than. Auxiliary local variables for improving regularization/prior approach in continual learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 16–28. Springer, 2022.
Verwimp et al. [2021] E. Verwimp, M. De Lange, and T. Tuytelaars. Rehearsal revealed: The limits and merits of revisiting samples in continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9385–9394, 2021.
Wortsman et al. [2020] M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, and A. Farhadi. Supermasks in superposition. arXiv preprint arXiv:2006.14769, 2020.
Xu and Zhu [2018] J. Xu and Z. Zhu. Reinforced continual learning. Advances in Neural Information Processing Systems, 31, 2018.
Yan et al. [2021] S. Yan, J. Xie, and X. He. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3014–3023, 2021.
Yin et al. [2020] D. Yin, M. Farajtabar, and A. Li. SOLA: Continual learning with second-order loss approximation. arXiv preprint arXiv:2006.10974, 2020.
Yoon et al. [2018] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In Sixth International Conference on Learning Representations. ICLR, 2018.
Zenke et al. [2017] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017.
Zhang et al. [2019] M. Zhang, T. Wang, J. H. Lim, and J. Feng. Prototype reminding for continual learning. arXiv preprint arXiv:1905.09447, 2019.

Appendix

Appendix A Detailed Experimental Setup

Computing resource. We run the experiments on a machine with 8 NVIDIA RTX A5000s.

A.1 Baselines

As detailed in Section 4.1, we evaluate ILR against EWC (online version), LwF (multi-class) version, ER, AGEM, DER++, ER-ACE, CLS-ER, and TAMiL.

For extensive comparison, we provide rehearsal-based methods with a buffer with a max capacity of 500 and 1,000 samples, respectively. Since our method does not rely on a buffer of all task data but only an alignment set of task $t-1$ data, the forgetting can be more significant, which is not a fair comparison of ILR against other rehearsal-based methods. Therefore, we provide ILR with an alignment set of 500, 1,000, and 5,000 samples.

We replicate training settings as follows: For ER, DER++, ER-ACE, TAMiL, and CLS-ER, we employ the reservoir sampling strategy to remove the reliance on task boundaries as in the original implementation. On the other hand, ILR, AGEM, and TAMiL rely on the task boundary to learn the rectifier, modify the buffer, and add a new task-attention module, respectively. For TAMiL, we use the best-reported task-attention architecture. For CLS-ER, we perform inference using the stable model per the original formulation.

A.2 Datasets

To demonstrate the effectiveness of our method, we perform empirical evaluations on three standard continual learning benchmarks: Sequential CIFAR10 (S-CIFAR10), Sequential CIFAR100 (S-CIFAR100), and Sequential Tiny ImageNet (S-TinyImg). The datasets are split into 5, 5, and 10 tasks containing 2, 20, and 20 classes, respectively. The dataset of S-CIFAR10 and S-CIFAR100 each includes 60000 $32\times 32$ images splitter into 50000 training images and 10000 test images, with each task occupying 10000 training images and 2000 testing images. The dataset S-TinyImg contains 1100000 $64\times 64$ images with 100000 training images and 10000 test images divided into 10 tasks with 10000 training images and 1000 test images each. We perform simple augmentation of random horizontal flips and random image cropping for each training and buffered image.

A.3 Training

Settings. The training set of each task is divided into 90%-10% for training and validation. All methods are optimized by the Adam optimizer available in PyTorch with a learning rate of $5\times 10^{-4}$ . As the validation loss plateau for 3 epochs, we reduce the learning rate by 0.1. Each task is trained for 40 epochs. For ILR, we train $h_{t}$ and $r_{t}$ using the same formulation with Adam optimizer at a learning rate of $5\times 10^{-4}$ for 50 epochs.

Weak feature extractor. We provide the architecture of the weak feature extractor $h_{t}$ in Table 6. We choose a simple design of two 3x3 convolution layers coupled with two max pooling layers.

Table 6: Architecture of the weak feature extractor

h_{t}

. We use ReLU activation after each convolution layer. For each task, a weak feature extractor

h_{t}

is distilled from the current feature extractor

f_{t}

. The output dimension of

h

is 128, while the output dimension of the main feature extractor is 512.

Layer	Channel	Kernel	Stride	Padding	Output size
Input	3				$16\times 16$
Conv 1	64	$3\times 3$	2	1	$8\times 8$
MaxPool			2		$4\times 4$
Conv 2	128	$3\times 3$	2	1	$2\times 2$
MaxPool			2		$1\times 1$

GAN training. We use the StudioGan repository’s default implementation [20, 19, 18] of the BigGAN LeCam [41] to train the network on each task of S-CIFAR100. The obtained FID score for each task is between 17 and 23. The BigGAN network has nearly 95 million parameters. During ILR training, we sampled directly from the BigGAN network.

A.4 Hyperparameter search

For all methods, experiments, and datasets, we perform a grid search over the following hyper-parameters using a validation set. Some of the following hyperparameters are obtained directly from their original implementation to narrow down the search range.

•

Joint, Finetuning, LwF.mc, ER, AGEM, ER-ACE: No hyperparameters
•
o-EWC:
- -
  
  $\lambda\in\{10,20,50,100\}$
- -
  
  $\gamma\in\{0.9,1\}$
•
DER++:
- -
  
  $\alpha\in\{0.1,0.2,0.5,1\}$
- -
  
  $\beta\in\{0.1,0.2,0.5,1\}$
•
CLS-ER:
- -
  
  $r_{p}\in\{0.5,0.9\}$
- -
  
  $r_{s}\in\{0.1,0.5\}$
- -
  
  $\alpha_{p}\in\{0.999\}$
- -
  
  $\alpha_{s}\in\{0.999\}$
•
TAMiL:
- -
  
  $\alpha\in\{0.2,0.5,1\}$
- -
  
  $\beta\in\{0.1,0.2,1\}$
- -
  
  $\theta\in\{0.1\}$
•
ILR:
- -
  
  $\alpha\in\{1,2,3\}$
- -
  
  $\tau\in\{0.5,1,2\}$

Table 7: Hyperparameters for method in Table 1

Method	$\|\mathcal{B}\|$	$\|\mathcal{S}_{t}\|$	S-CIFAR10	S-CIFAR100	S-TinyImg
o-EWC	-	-	$\lambda=100,\gamma=0.9$	$\lambda=50,\gamma=0.1$	$\lambda=20,\gamma=0.9$
DER++	-	-	$\alpha=0.5,\beta=0.1$	$\alpha=0.2,\beta=0.1$	$\alpha=0.5,\beta=0.1$
TAMiL			$\alpha=1.0,\beta=1.0$	$\alpha=1.0,\beta=1.0$	$\alpha=1.0,\beta=0.5$
CLS-ER			$r_{p}=0.5,r_{s}=0.1$	$r_{p}=0.9,r_{s}=0.1$	$r_{p}=0.5,r_{s}=0.1$
ILR	-	500	$\alpha=3,\tau=2$	$\alpha=1,\tau=2$	$\alpha=1,\tau=2$
DER++			$\alpha=1.0,\beta=0.1$	$\alpha=0.2,\beta=0.1$	$\alpha=1.0,\beta=0.1$
TAMiL			$\alpha=1.0,\beta=1.0$	$\alpha=1.0,\beta=1.0$	$\alpha=1.0,\beta=0.5$
CLS-ER			$r_{p}=0.5,r_{s}=0.1$	$r_{p}=0.5,r_{s}=0.1$	$r_{p}=0.9,r_{s}=0.1$
ILR	-	1000	$\alpha=3,\tau=2$	$\alpha=1,\tau=2$	$\alpha=2,\tau=2$
ILR	-	5000	$\alpha=3,\tau=2$	$\alpha=3,\tau=2$	$\alpha=2,\tau=2$

Appendix B Versatility of ILR Framework

In ILR, as the tasks arrive, conventional fine-tuning or training on the new task happens without any CL’s intervention. ILR only augments or adds to this process with a separate training of the backward-recall mechanism. The attractiveness of this framework is twofold. First, ILR allows the best adaptation on the new task to possibly achieve maximum plasticity while the backward-recall mechanism mitigates catastrophic forgetting. Second, different from previous CL approaches that modify the sequential training process (e.g., by changing the loss functions, using an additional buffer, or dynamically adjusting the network’s architecture in fine-tuning), ILR does not change the fine-tuning process, allowing the users to more flexibly incorporate this framework into their existing machine learning pipelines.

Relationship to Memory Linking. ILR’s process of mapping newly learned knowledge representation resembles the popular humans’ mnemonic memory-linking technique, which establishes associations of fragments of information to enhance memory retention or recall. ¹¹1https://en.wikipedia.org/wiki/Mnemonic_link_system As the model learns a new task, the feature rectifier unit establishes a mnemonic link from the new representation of the sample from the past task to its past task’s correct representation.

Appendix C Societal Impacts

Our work has the potential to improve the capability of ML systems toward better adaptation to the changing world, which is usually the case for domains such as healthcare, education, and finance. This results in more reliable and robust learning systems. On the other hand, our framework will also have similar potential negative impacts that are often found in classification/predictive tasks, including bias, privacy, and misclassification.

Forget but Recall: Incremental Latent Rectification in Continual Learning