Forget but Recall: Incremental Latent Rectification
in Continual Learning

Nghia D. Nguyen1, Hieu Trung Nguyen2,
Ang Li3, Hoang Pham1,
Viet Anh Nguyen2, Khoa D. Doan1
1College of Engineering & Computer Science, VinUniversity, Hanoi, Vietnam
2The Chinese University of Hong Kong, Hong Kong
3Simular, San Mateo, CA, USA
Abstract

Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.

1 Introduction

Humans exhibit the innate capability to incrementally learn novel concepts while consolidating acquired knowledge into long-term memories [32]. More general Artificial Intelligence systems in real-world applications would require a similar capability to capture the dynamics of a changing data stream. These systems need to acquire knowledge incrementally without retraining, which is computationally expensive and incurs a large memory footprint [34]. Nonetheless, existing learning approaches have yet to match human learning in this so-called Continual Learning (CL) problem due to catastrophic forgetting [28]. These systems have difficulty balancing the incorporation of new task knowledge with maintaining performance on learned tasks, a trade-off known as the plasticity-stability dilemma.

Representative CL approaches in the literature usually involve the use of a memory buffer for rehearsal [33, 9, 7, 8, 5, 3], an auxiliary loss term for learning regularization [22, 12, 49, 38], or structural changes such as pruning or model growing [37, 26, 14, 46]. These methods share the common objective of discouraging the deviation of learned knowledge representation. Rehearsal-based methods allow the model to revisit past exemplars to reinforce previously learned representations. Alternatively, regularization-based methods prevent changes in parameter spaces by formulating additional loss terms. However, both approaches present shortcomings, including keeping a rehearsal buffer of all past tasks during the model lifetime or infusing ad-hoc inductive bias into the regularization process. Meanwhile, structure-based methods utilize the over-parameterization property of the model by pruning, masking, or adding parameters to reduce new task interference.

This paper studies a novel approach for CL named Incremental Latent Rectification (ILR), where we allow the model to “forget” knowledge of old tasks but then “recall” or rectify such “catastrophic forgetting” during inference using a sequence of lightweight knowledge mapping networks. These lightweight knowledge mapping networks, called rectifiers, help significantly reduce information loss on learned tasks by incrementally correcting the changes in the representation space. Specifically, for each new task, we add a small, simple, and computationally inexpensive auxiliary unit that will rectify the representation from the current task to the previous task. Our method differs from many network expansion methods, where additional parameters are allocated to minimize changes to the old parameters. Instead, we iteratively recover past task representations by backwardly propagating current representations through a series of mapping networks. Through this mechanism, ILR allows the optimal adaptation of a new task (plasticity) while separately mitigating catastrophic forgetting. In addition, different from previous CL approaches that modify the sequential training process (e.g., by changing the loss functions or using an additional buffer in fine-tuning), ILR does not change the new task’s learning, hence, ILR can be easily integrated into the existing CL pipelines.

Contributions.

We propose a new direction for CL by sequentially correcting the representation of the current task into the past task’s representation using a chain of lightweight rectifier units:

  • We propose a novel loss function for aligning the latent representation to guide the training procedure. The loss function is designed as a weighted sum of an L2-norm reconstruction error and a cosine distance metric.

  • To train the rectifier unit, we rely on either data samples from task $t-1$ or the current task $t$; when such data is unavailable (e.g., due to memory constraints or privacy concerns), a generative model that synthesizes task $t-1$’s data can also be utilized. At inference time, for the task-incremental setting, we construct a chain of rectifiers based on the provided task identity and forward the latent representation and inputs to correct the representation. For the class-incremental setting, ILR forms the final prediction from an ensemble of predictions based on the reconstructed representations.

  • We empirically evaluate our approach on three widely-used continual learning benchmarks (CIFAR10, CIFAR100, and Tiny ImageNet) to demonstrate that our approach achieves comparable performance with the existing representative CL directions.

This paper unfolds as follows. Section 2 discusses the literature on the continual learning problems, and Section 3 describes our Incremental Latent Rectification method. Finally, Section 4 provides the empirical evidence for the effectiveness of our proposed solution.

2 Related Work

Catastrophic forgetting is a critical concern in artificial intelligence and is arguably one of the most prominent questions to address for DNNs. This phenomenon presents significant challenges when deploying models in different applications. Continual learning addresses this issue by enabling agents to learn throughout their lifespan. This aspect has gained significant attention recently [40, 16, 21, 4]. Considering a model well-trained on past tasks, we risk overwriting its past knowledge by adapting it for new tasks. The problem of knowledge loss can be addressed using different methods, as explored in the literature [47, 13, 22, 24, 9, 5, 37, 46]. These methods aim to mitigate knowledge loss and improve task performance through three main approaches: (1) Rehearsal-based methods, which involve reminding the model of past knowledge by using selective exemplars; (2) Regularization-based methods, which penalize changes in past task knowledge through regularization techniques; (3) Parameter-isolation and Dynamic Architecture methods, which allocate sub-networks or expand new sub-networks, respectively, for each task, minimizing task interference and enabling the model to specialize for different tasks.

Rehearsal-based. Experience replay methods build and store a memory of the knowledge learned so far [34, 25, 39, 35, 36, 50]. As an example, Averaged Gradient Episodic Memory (A-GEM) [9] builds an episodic memory of parameter gradients, while ER-Reservoir [11] uses a reservoir sampling method to maintain the episodic memory. These methods have shown strong performance in recent studies. However, they require a significant amount of memory for storing the examples.

Regularization-based. A popular early work using regularization is the elastic weight consolidation (EWC) method [22]. Other methods [49, 2, 42, 29, 1] propose different criteria to measure the “importance” of parameters. A later study showed that many regularization-based methods are variations of Hessian optimization [47]. These methods typically assume that there are multiple optima in the updated loss landscape in the new data distribution. One can find a good optimum for both the new and old data distributions by constraining the deviation from the original model weights.

Parameter Isolation. Parameter isolation methods allocate different subsets of the parameters to each task [37, 17, 31, 23]. From the stability-plasticity perspective, these methods implement gating mechanisms that improve stability and control plasticity by activating different gates for each task. Masse et al. [27] propose a bio-inspired approach for context-dependent gating that activates a non-overlapping subset of parameters for any specific task. Supermask in Superposition [44] is another parameter isolation method that starts with a randomly initialized, fixed base network and, for each task, finds a sub-network (supermask) such that the model achieves good performance.

Dynamic Architecture. Different from Parameter Isolation, which allocates subnets for tasks in a fixed main network, this approach dynamically expands the structure of the network. Yoon et al. [48] propose a method that leverages the network structure trained on previous tasks to effectively learn new tasks, while dynamically expanding its capacity by adding or duplicating neurons as needed. Other methods [45, 30] reformulate CL problems into reinforcement learning (RL) problems, and leverage RL methods to determine when to expand the architecture during learning of new tasks. Yan et al. [46] introduce a two-stage learning method that first expands the previous frozen task feature representations by a new feature extractor, then re-trains the classifier with current and buffered data.

3 Proposed Framework

We consider the task-incremental and class-incremental learning scenarios, where we sequentially observe a set of tasks $t \in \{1, \ldots, N\}$. The neural network comprises a single task-agnostic feature extractor $f$ and a classifier $w$ with task-specific heads $w^{(t)}|_{t=1}^{N}$. The architecture of $f$ is fixed; however, its parameters are gradually updated as new tasks arrive. At task $t$, the system receives the training dataset $\mathcal{D}_t^{\mathrm{train}}$ sampled from the data distribution $\mathcal{D}_t$ and learns the updated parameters of the feature extractor $f$ and the classifier $w$. For easier discussion, the feature extractor and classifier obtained after learning task $t$ are denoted as $f_t$ and $w_t$, respectively. We call the latent space created by the feature extractor trained with $\mathcal{D}_t^{\mathrm{train}}$ the $t$-domain.
Catastrophic forgetting occurs as the feature extractor $f_{t'}$ is updated into $f_t$, $t' < t$, which causes the $t'$-domain to be overwritten by the $t$-domain. This domain shift degrades the model’s performance over time.

To overcome catastrophic forgetting, we propose a new CL paradigm: learning a latent rectification mechanism. This mechanism relies on a lightweight rectifier unit $r_t$ that learns to align the representations from the $t$-domain to the $(t-1)$-domain. Intuitively, this module “corrects” the representation change of a sample from the old task $t-1$ due to the evolution of the feature extractor $f$ when learning the newer task $t$. These rectifier units establish a chain of corrections for the representation of any task’s input, allowing the model to make better predictions from the rectified representation. Figure 1 visualizes the inference process on a task-$t$ sample after learning $N$ tasks.

Learning the latent rectification mechanism is central to our proposed framework. In general, each rectifier unit should be small compared to the size of the final model or the feature extractor $f$, and its learning process should be resource-efficient. In the following sections, we present and describe our solution for learning this mechanism.

Figure 1: At task $t$, the feature extractor $f_t$ and classifier head $w_t$ are optimized on the dataset $D_t^{\mathrm{train}}$. During inference for a test sample from task $t$, we forward the input data $x \in D_t^{\mathrm{test}}$ through the feature extractor and classifier head to obtain the logits. After learning all $N$ tasks, the DNN loses performance on task $t$ due to catastrophic forgetting. Therefore, the latent representation $f_N(x)$ is propagated through a series of rectifiers $r_N, \ldots, r_{t+1}$ to perform incremental latent rectification and obtain the approximated representations $\hat{f}_{N-1}, \ldots, \hat{f}_t$. The logits can then be obtained by passing the recovered representation to the respective classifier head.

3.1 Learning the Rectifier Unit

As the training dataset $\mathcal{D}_t^{\mathrm{train}}$ of task $t$ arrives, we first update the feature extractor $f_t$ and the classifier head $w_t$. The primary goal herein is to find $(f_t, w_t)$ that has high classification performance for task $t$, and the secondary goal is to choose $f_t$ that can reduce the catastrophic forgetting on previous tasks. To combat catastrophic forgetting, we will first discuss the objective function for learning the lightweight rectifier unit $r_t$ and the potential alignment training data (or alignment set) $\mathcal{S}_t$.

3.1.1 Alignment Loss

The goal of $r_t$ is to reduce the discrepancy between task $t$’s representation $f_t(x_i)$ and the previous data representation $f_{t-1}(x_i)$, for $x_i \sim \mathcal{D}_{t-1}$; i.e., $r_t(f_t(x_i), x_i) \approx f_{t-1}(x_i)$. One choice is the weighted linear combination of the $l_2$ error and the cosine error between $f_{t-1}(x_i)$ and $r_t(f_t(x_i), x_i)$. This combination promotes alignment in both the magnitude and the direction between the two representation vectors for improved representational similarity.

Let $s$ be a function, with parameters $\theta_s$, that encodes an input $x_i$ into its respective past representation in the $(t-1)$-domain, and let $\tau > 0$ be the weight hyper-parameter; we define the alignment loss as:

$$\mathcal{L}_{\mathrm{align}}(\theta_s; s, \tau, \mathcal{S}_t, f_{t-1}) = \mathbb{E}_{x_i \sim \mathcal{S}_t}\left[ \| s(x_i) - f_{t-1}(x_i) \|_2^2 + \tau \left( 1 - \cos\left( s(x_i), f_{t-1}(x_i) \right) \right) \right]. \tag{1}$$

In practice, we could either store the value of $f_{t-1}(x_i)$ together with $x_i$ in memory or keep $f_{t-1}$ directly.
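To make Equation 1 concrete, the following is a minimal sketch of the alignment loss over a small batch, written in plain Python rather than a deep learning framework. The function name `alignment_loss`, the list-of-lists batch layout, and the `eps` guard against zero-norm vectors are illustrative assumptions and are not specified in the paper.

```python
import math


def alignment_loss(rectified, targets, tau=1.0, eps=1e-12):
    """Batch average of ||s - z||_2^2 + tau * (1 - cos(s, z)).

    rectified: list of vectors s(x_i), i.e. the rectified representations.
    targets:   list of vectors f_{t-1}(x_i), i.e. the stored past representations.
    tau:       weight of the cosine term (tau > 0 in Equation 1).
    eps:       assumed guard against division by zero; not part of Equation 1.
    """
    if not rectified or len(rectified) != len(targets):
        raise ValueError("expected two non-empty batches of equal size")
    total = 0.0
    for s, z in zip(rectified, targets):
        # Squared L2 reconstruction error (magnitude alignment).
        squared_l2 = sum((a - b) ** 2 for a, b in zip(s, z))
        # Cosine similarity (direction alignment).
        dot = sum(a * b for a, b in zip(s, z))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_z = math.sqrt(sum(b * b for b in z))
        cosine = dot / max(norm_s * norm_z, eps)
        total += squared_l2 + tau * (1.0 - cosine)
    # The expectation over the alignment set is estimated by the batch mean.
    return total / len(rectified)
```

In this sketch, two vectors that point in the same direction but differ in length are penalized only by the squared $l_2$ term, while orthogonal vectors also incur the full cosine penalty $\tau$.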

3.1.2 Alignment Set

The alignment set $\mathcal{S}_t$, used as the training data for the rectifier unit $r_t$, enables the rectifier unit to efficiently learn the mapping from the $t$-domain back to the $(t-1)$-domain. The design of ILR enables several options for selecting the alignment set, including $\mathcal{D}_{t-1}^{\mathrm{train}}$, $\mathcal{D}_t^{\mathrm{train}}$, or a generative method.

Task $t-1$ data. The simplest choice for the alignment set $\mathcal{S}_t$ is $\mathcal{D}_{t-1}^{\mathrm{train}}$ (i.e., the training data from the previous task $t-1$), which is sampled directly from task $t-1$’s distribution. With this option, each element in $\mathcal{S}_t$ is a pair $(x_i, \hat{z}_i)$, where $x_i \in \mathcal{D}_{t-1}^{\mathrm{train}}$ is chosen randomly and $\hat{z}_i = f_{t-1}(x_i)$ is the associated latent representation of $x_i$ under the feature extractor $f_{t-1}$. Note that, unlike rehearsal-based methods [43], this option does not keep data samples from all past tasks $t \in \{1, \ldots, N\}$.

Task $t$ data. Another potential option for $\mathcal{S}_t$ is task $t$’s data. If we expect the tasks’ data to not be completely unrelated, using data from $\mathcal{D}_t^{\mathrm{train}}$ to train $r_t$ is reasonable. As we show in Section 4, we could achieve comparable performance to some rehearsal-based methods while remaining data-free when setting $\mathcal{S}_t = \mathcal{D}_t^{\mathrm{train}}$. Additionally, for this option, since we do not have access to $(t-1)$-domain data, we need to keep a copy of $f_{t-1}$ to approximate $\hat{z}_i = f_{t-1}(x_i)$ with $x_i \in \mathcal{D}_t^{\mathrm{train}}$.

Generated task $t-1$ data. Generative methods provide a potential option for creating training data for the rectifier unit $r_t$. Instead of keeping the alignment set $\mathcal{S}_t \subseteq \mathcal{D}_{t-1}^{\mathrm{train}}$, we could train a generative neural network $G_{t-1}$ that learns the task $t-1$ distribution. Unlike generative continual learning methods, $G_{t-1}$ only needs to remember the task $t-1$ distribution instead of all past tasks. Thus, ILR can easily integrate with existing generative methods.

In addition, we could fill $\mathcal{S}_t$ with randomly initialized samples. Nonetheless, our experiments indicate that this approach is ineffective. Therefore, we will focus our discussion on the first three options and leave the exploration of other choices of $\mathcal{S}_t$ for future work.
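As an illustration of the first option (task $t-1$ data), the sketch below builds an alignment set of pairs $(x_i, \hat{z}_i)$ by randomly sampling inputs from the previous task and caching their representations under $f_{t-1}$. The helper name `build_alignment_set`, the `size` and `seed` parameters, and the use of a plain callable for $f_{t-1}$ are assumptions made for this example only.

```python
import random


def build_alignment_set(prev_train_inputs, f_prev, size, seed=0):
    """Return a list of (x_i, z_hat_i) pairs with z_hat_i = f_prev(x_i).

    prev_train_inputs: inputs from the previous task's training data.
    f_prev:            callable standing in for the extractor f_{t-1}.
    size:              requested number of pairs (capped at the data size).
    seed:              hypothetical seed so the random choice is repeatable.
    """
    rng = random.Random(seed)
    inputs = list(prev_train_inputs)
    count = min(size, len(inputs))
    # Sample without replacement, mirroring "x_i is chosen randomly".
    chosen = rng.sample(inputs, count)
    # Cache the (t-1)-domain representation before f is updated on task t.
    return [(x, f_prev(x)) for x in chosen]
```

Caching $\hat{z}_i$ at this point corresponds to storing $f_{t-1}(x_i)$ together with $x_i$; under the task $t$ data option one would instead keep a copy of $f_{t-1}$ and evaluate it on $\mathcal{D}_t^{\mathrm{train}}$.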

Distinction from buffer-based methods. Rehearsal-based methods retain the data from all past tasks $t \in \{1, \ldots, N\}$ during the lifetime of the DNN. Meanwhile, depending on the choice of alignment set $\mathcal{S}_t$, ILR could be considered strictly data-free if $\mathcal{S}_t = \mathcal{D}_t$ or when using the generative method. For $\mathcal{S}_t \subseteq \mathcal{D}_{t-1}$, ILR can still arguably be considered a data-free method since task $t-1$ data is only retained until the end of task $t$.

3.2 Incremental Latent Alignment

The latent alignment mechanism relies on a chain of task-specific rectifier units $(r_t)_{t=2}^{N}$ that aim to correct the distortion of the representation space as the extractor $f$ learns a new task.

3.2.1 Latent Alignment

For an input $x$ at task $t-1$, its feature representation under the feature extractor $f_{t-1}$ is $f_{t-1}(x)$. One can heuristically define the $(t-1)$-domain as the representation of the input under the feature extractor $f_{t-1}$. Unfortunately, the $(t-1)$-domain is brittle under extractor update: as the subsequent task $t$ arrives, the feature extractor is updated to $f_t$, and the corresponding feature representation of the same input $x$ will be shifted to $f_t(x)$. Likely, the $t$-domain and the $(t-1)$-domain do not coincide, and $f_t(x) \neq f_{t-1}(x)$.

The feature rectifier unit $r_t$ aims to offset this representation shift. To do this, $r_t$ takes $x$ and its $t$-domain representation $f_t(x)$ as input, and it outputs the rectified representation that satisfies

$$r_t \circ (f_t \times I)(x) = r_t(f_t(x), x) \approx f_{t-1}(x), \tag{2}$$

with identity function $I$.

With this formulation, we can effectively minimize the difference between the rectified representation $r_t \circ (f_t \times I)(x)$ and the original representation $f_{t-1}(x)$. In practice, we only want to train the rectifier unit $r_t$ and retain the learned feature extractor $f_t$; therefore, letting $s = r_t \circ (f_t \times I)$, we can minimize the difference by using $\mathcal{L}_{\mathrm{align}}(\theta_{r_t}; s, \tau, \mathcal{S}_t, f_{t-1})$ as in Equation 1.
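The chain of corrections described in Figure 1 can be sketched as repeated application of Equation 2: starting from $f_N(x)$, the rectifiers $r_N, \ldots, r_{t+1}$ are applied in turn to approximate $f_t(x)$, which is then passed to the head for task $t$. The code below is a minimal task-incremental sketch. The function names, the dictionary layout keyed by task index, and the use of plain callables in place of trained networks are assumptions for illustration.

```python
def rectify_to_task(x, task_id, n_tasks, f_latest, rectifiers):
    """Approximate f_{task_id}(x) from f_N(x) through r_N, ..., r_{task_id+1}.

    f_latest:   callable standing in for the current extractor f_N.
    rectifiers: dict mapping t (2..N) to a callable r_t(latent, x).
    """
    if not 1 <= task_id <= n_tasks:
        raise ValueError("task_id must lie in 1..n_tasks")
    latent = f_latest(x)
    # Walk backward one domain at a time: N -> N-1 -> ... -> task_id.
    for t in range(n_tasks, task_id, -1):
        latent = rectifiers[t](latent, x)
    return latent


def predict_task_incremental(x, task_id, n_tasks, f_latest, rectifiers, heads):
    """Task-incremental prediction: rectify, then apply the task's own head."""
    latent = rectify_to_task(x, task_id, n_tasks, f_latest, rectifiers)
    return heads[task_id](latent)
```

For a sample from the most recent task ($t = N$), the loop is empty and the prediction reduces to $w_N(f_N(x))$; older tasks pass through progressively longer chains.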

Figure 2: The rectifier unit includes a weak feature extractor $h_t$, a linear compress layer $a_t$, and a linear combine layer $b_t$. The compress layer forms a bottleneck to select the remaining $(t-1)$-domain knowledge in $f_t$, while $h_t$ extracts information that compensates for the information lost in $f_t$. The combine layer aggregates and transforms the information from both $h_t$ and $f_t$ to form the rectified representation.

3.2.2 Rectifier Architecture

The proposed rectifier is composed of three trainable components: a weak feature extractor $h_t$, a compress layer $a_t$, and a combine layer $b_t$. The total size of the rectifier units increases linearly with the number of tasks, similar to the classification heads. However, since each rectifier unit is lightweight, this growth is small compared to the size of the full model. Figure 2 visualizes the feature rectifier unit.

Weak feature extractor $h_t$. The weak feature extractor $h_t$ processes the input data $x$ to generate a simplified representation $h_t(x)$. $h_t$ is distilled from $f_{t-1}$ to compress the knowledge of $f_{t-1}$ into a more compact, lower-dimensional representation while remaining parameter-efficient. For our experiments, we choose the simplest design of a weak feature extractor, composed of only two 3x3 convolution layers and two max-pooling layers. Instead of processing the full-size image, we use max-pooling to downsample the input to 16x16 images before feeding it into $h_t$. The weak feature extractor is a small network compared to the main model (the architecture of $h_t$ is provided in Table 6).

Compress layer $a_t$. The compress layer $a_t$ receives the current latent value $f_t(x)$ and produces a compact representation $a_t(f_t(x))$ of reduced dimensionality. This layer essentially forms a bottleneck that only allows relevant $(t-1)$-domain knowledge to pass through. We design the compress layer as a simple linear layer.

Combine layer $b_t$. The combine layer $b_t$ receives the concatenation of the compressed representation $a_t \circ f_t(x)$ and the weakly extracted features $h_t(x)$ to form the rectified representation $r_t(f_t(x), x)$. We design the combine layer as a simple linear layer.
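To make the data flow through one rectifier unit concrete, the following is a minimal NumPy sketch rather than the authors' implementation. Several parts are assumptions made for illustration: the feature size of 512 and the compressed size of 384 are read off the layer shapes reported in Section 4.4, the 128-dimensional output of $h_t$ is inferred so that the concatenation has 512 entries, biases are omitted, and the convolutional weak extractor is replaced by a single hypothetical linear map applied to the max-pooled input.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 512      # size of f_t(x); assumed from the 512x512 combine layer
COMPRESS_DIM = 384  # output size of the compress layer a_t (assumed)
WEAK_DIM = 128      # output size of h_t, inferred so that 384 + 128 = 512


def max_pool_2x2(img):
    """Downsample a (C, H, W) image by 2 using non-overlapping max pooling."""
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


class RectifierUnit:
    """Computes r_t(f_t(x), x) = b_t(concat(a_t(f_t(x)), h_t(x)))."""

    def __init__(self, in_pixels):
        # Hypothetical stand-in for the small convolutional extractor h_t.
        self.W_h = rng.normal(0.0, 0.01, (WEAK_DIM, in_pixels))
        # Compress layer a_t: linear bottleneck on the current features.
        self.W_a = rng.normal(0.0, 0.01, (COMPRESS_DIM, FEAT_DIM))
        # Combine layer b_t: linear map on the concatenated vector.
        self.W_b = rng.normal(0.0, 0.01, (FEAT_DIM, COMPRESS_DIM + WEAK_DIM))

    def weak_features(self, x):
        small = max_pool_2x2(x)  # e.g. 3x32x32 -> 3x16x16
        return self.W_h @ small.ravel()

    def __call__(self, feat, x):
        compressed = self.W_a @ feat
        joint = np.concatenate([compressed, self.weak_features(x)])
        return self.W_b @ joint


x = rng.normal(size=(3, 32, 32))      # a dummy CIFAR-sized input
feat_t = rng.normal(size=FEAT_DIM)    # stands in for f_t(x)
unit = RectifierUnit(in_pixels=3 * 16 * 16)
rectified = unit(feat_t, x)
print(rectified.shape)
```

The output has the same dimensionality as the input features, which is what allows several such units to be applied one after another.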

Distinction from network-expansion approaches. It could be argued that one can, instead, separately train a weak feature extractor $h_t$ for each task, making it a network-expansion CL approach. However, because $h_t$ is a small network, this approach is ineffective; specifically, our experiments show that the task-incremental average accuracy across all tasks of this approach on CIFAR100 falls below 53%. Furthermore, network-expansion approaches allocate dedicated parameters for learning each new task, which is fundamentally different from ILR’s objective of correcting representation changes. In ILR, the new task’s knowledge is acquired by $f_t$ and $w_t$.

3.3 Training Procedure

Network training. Similar to conventional DNN training, the performance of the feature extractor $f_t$ and the classifier head $w_t$ is measured by the standard multi-class cross-entropy loss:

$$\mathcal{L}_{\mathrm{train}}(\theta_{f_t}, \theta_{w_t}) = \mathcal{L}_{\mathrm{CE}}(\theta_{f_t}, \theta_{w_t}; f_t, w_t, \mathcal{D}_t^{\mathrm{train}}) = \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_t^{\mathrm{train}}}\left[-\sum_{c=1}^{M_t} y_i \log(\hat{y}_i)\right], \tag{3}$$

where $M_t$ is the number of classes of task $t$, and $\hat{y}_i$ is the probability-valued network output for the input $x_i$, which depends on the feature extractor $f_t$ and the classifier $w_t$ as $\hat{y}_i = w_t \circ f_t(x_i)$.

Furthermore, if the alignment set $\mathcal{S}_t$ uses either task $t-1$ data or a generative network $G_{t-1}$, we could also utilize $\mathcal{S}_t$ to further enforce task $t-1$ representation consistency, reduce forgetting, and enable more effective rectification by training and regularizing $f_t$ on $\mathcal{D}_t^{\mathrm{train}}$ and $\mathcal{S}_t$, respectively. Let $s = f_t$; then we can similarly use $\mathcal{L}_{\mathrm{align}}$ in Equation 1 with hyperparameter $\alpha$:

$$\mathcal{L}_{\mathrm{train}}(\theta_{f_t}, \theta_{w_t}) = \mathcal{L}_{\mathrm{CE}}(\theta_{f_t}, \theta_{w_t}; f_t, w_t, \mathcal{D}_t^{\mathrm{train}}) + \alpha \mathcal{L}_{\mathrm{align}}(\theta_{f_t}; s, \tau, \mathcal{S}_t, f_{t-1}). \tag{4}$$
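As a rough illustration of how the two terms of the regularized training loss combine, the sketch below uses plain Python on precomputed values. It is not the paper's loss implementation: the alignment term here is a squared L2 distance, which is only assumed to correspond to the $\tau = 0$ special case of Equation 1, and the function names and toy inputs are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class of one sample."""
    return -math.log(probs[label])


def l2_alignment(current, target):
    """Squared L2 distance between f_t(x) and f_{t-1}(x) (assumed tau = 0 case)."""
    return sum((c - t) ** 2 for c, t in zip(current, target))


def train_loss(task_batch, align_batch, alpha):
    """Cross-entropy on task-t data plus alpha times the alignment penalty.

    task_batch:  list of (predicted probabilities, true label) from D_t^train.
    align_batch: list of (f_t(x), f_{t-1}(x)) feature pairs computed on S_t.
    """
    ce = sum(cross_entropy(p, y) for p, y in task_batch) / len(task_batch)
    if align_batch:
        align = sum(l2_alignment(c, t) for c, t in align_batch) / len(align_batch)
    else:
        align = 0.0
    return ce + alpha * align


task_batch = [([0.7, 0.2, 0.1], 0), ([0.1, 0.8, 0.1], 1)]
align_batch = [([1.0, 2.0], [1.0, 1.5])]
print(train_loss(task_batch, align_batch, alpha=0.0))  # cross-entropy only
print(train_loss(task_batch, align_batch, alpha=2.0))  # with alignment penalty
```

Setting `alpha=0.0` recovers the plain cross-entropy objective of Equation 3, which matches the ablation setting that removes the representation regularization.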

This differs from rehearsal methods because $f$ only visits $\mathcal{D}_{t-1}$ at task $t-1$ and task $t$. After task $t$, $f$ never sees $\mathcal{D}_{t-1}$ again, whereas in rehearsal methods $f$ observes samples from $\mathcal{D}_{t-1}$ throughout its lifetime and risks overfitting to the stored exemplars.

Input: Training dataset $\mathcal{D}_t^{\mathrm{train}}$, weight parameters for loss functions $\alpha, \tau$
Output: Feature extractor $f_t$, rectifier unit $r_t$
Train $f_t$ and $w_t$ jointly by minimizing $\mathcal{L}_{\mathrm{train}}(\theta_{f_t}, \theta_{w_t})$ [Equation 3, Equation 4];
Distill $h_{t+1}$;
if $t > 1$ then
    Freeze $f_t$;
    Train $r_t$ on $\mathcal{S}_t$ using $\mathcal{L}_{\mathrm{align}}(\theta_{r_t}; s, \tau, \mathcal{S}_t, f_{t-1})$ with $s = r_t \circ (f_t \times I)$ [Equation 1]
end if
Algorithm 1: Full training framework at task $t \in \{1, 2, \ldots, n\}$

Rectifier training. Training the rectifier follows two main steps: first train the weak feature extractor, then train the compress and combine layers. The weak feature extractor $h_t$ is distilled from $f_{t-1}$ once task $t-1$ training is completed. Let $g_t$ be a temporary linear layer mapping the smaller dimension of $h_t$ to the higher dimension of $f_{t-1}$. With $s = g_t \circ h_t$ and $\mathcal{S}_t = \mathcal{D}_{t-1}$, we train $h_t$ using $\mathcal{L}_{\mathrm{align}}(\theta_s; s, \tau, \mathcal{S}_t, f_{t-1})$ in Equation 1. Similarly, as detailed in Section 3.2.1, we train the remaining components of $r_t$, i.e., the compress and combine layers, at the end of task $t$. Details of ILR’s training algorithm are provided in Algorithm 1.
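The ordering of the training steps across tasks can be easy to lose track of, so the following small sketch simply enumerates them as Algorithm 1 states them. It performs no actual training; the function name and the step descriptions are illustrative, and the only logic taken from the text is that $h_{t+1}$ is distilled once $f_t$ has been trained and that $r_t$ is trained only for $t > 1$.

```python
def ilr_schedule(num_tasks):
    """List, in order, the training steps described for each task."""
    steps = []
    for t in range(1, num_tasks + 1):
        # Main network: feature extractor and classifier head for task t.
        steps.append(f"task {t}: train f_{t} and w_{t}")
        # Weak extractor for the NEXT rectifier, distilled from the fresh f_t.
        steps.append(f"task {t}: distill h_{t + 1} from f_{t}")
        if t > 1:
            # The rectifier r_t maps task-t features back to the task-(t-1) space.
            steps.append(f"task {t}: freeze f_{t}")
            steps.append(f"task {t}: train r_{t} to map f_{t} back to f_{t - 1}")
    return steps


for step in ilr_schedule(3):
    print(step)
```

Note that the weak extractor used by $r_t$ is already available when task $t$ starts, because it was distilled at the end of task $t-1$.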

3.4 Inference Procedure

We now describe how to stack multiple rectifier units $r_t$ into a chain for inference. As a new task arrives, our model dynamically adds one more rectifier unit, forming a sequence of rectifiers.

Task-Incremental. We consider a task-incremental learning setting where a test sample $x_i$ is coupled with a task identifier $t_i \in \{1, \ldots, N\}$. To classify $x_i$, we can recover $\hat{f}_{t_i}(x_i)$ by forwarding the current latent variable $f_N(x_i)$ through a chain of $N - t_i$ rectifiers. We then pass this recovered latent variable through the classifier head $w_{t_i}$ to make a prediction. The output $\hat{y}_i$ is computed as

$$\hat{y}_i = w_{t_i}(\hat{f}_{t_i}(x_i)) \quad \text{where} \quad \hat{f}_{t_i}(x_i) = r_{t_i+1} \circ (\hat{f}_{t_i+1} \times I)(x_i) \quad \text{with} \quad t_i < N, \ \hat{f}_N = f_N$$
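The recursion above unrolls into a simple loop that applies $r_N, r_{N-1}, \ldots, r_{t_i+1}$ in turn. The sketch below shows this with plain Python callables; the function names, the dictionary layout of the rectifiers and heads, and the scalar toy "features" are all assumptions chosen for illustration, not the paper's code.

```python
def rectify_to_task(x, task_id, num_tasks, feature_extractor, rectifiers):
    """Estimate f_{task_id}(x) starting from the latest extractor f_N.

    rectifiers[t] is a callable r_t(feat, x) mapping task-t features back to
    the task-(t-1) representation space, for t = 2, ..., num_tasks.
    """
    feat = feature_extractor(x)  # base case: the estimate for task N is f_N(x)
    for t in range(num_tasks, task_id, -1):  # applies N - task_id rectifiers
        feat = rectifiers[t](feat, x)
    return feat


def predict(x, task_id, num_tasks, feature_extractor, rectifiers, heads):
    """Classify x with the head of its task after rectifying the features."""
    feat = rectify_to_task(x, task_id, num_tasks, feature_extractor, rectifiers)
    return heads[task_id](feat)


# Toy setup (hypothetical): scalar features that drift by +1 with every task.
N = 3
f_N = lambda x: x + (N - 1)
rectifiers = {t: (lambda feat, x: feat - 1.0) for t in range(2, N + 1)}
heads = {t: (lambda feat, t=t: (t, feat)) for t in range(1, N + 1)}

print(predict(5.0, 1, N, f_N, rectifiers, heads))  # (1, 5.0): drift undone twice
print(predict(5.0, 3, N, f_N, rectifiers, heads))  # (3, 7.0): no rectifier needed
```

For the most recent task the loop body never runs, so the cost of inference grows only with the distance between the queried task and the current one.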

Class-Incremental. ILR relies on the task identity to reconstruct the appropriate sequence of rectifier units for propagating the latent representation to the original space. However, in the class-incremental learning setting, no task identity is provided to the CL method. We provide a simple method for inference without task identity, which demonstrates how the method extends to class-incremental learning; more robust task-identity inference methods could also be incorporated.

We obtain the class-incremental probabilities by forming an ensemble that averages the class probabilities over all domains. From the current task $t$’s domain, we iteratively rectify the latent representation back to the domains of task $t-1$, task $t-2$, …, task $1$. At each domain, we obtain the rectified representation corresponding to that domain and forward it through the respective classifier. We then average the softmax probabilities of each domain, essentially forming an ensemble of $w_i(f_i)|_{i=1}^{t}$.
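The averaging procedure described above can be sketched as follows. This is a hedged illustration only: it assumes that every head returns logits over the same global list of classes so that the probability vectors can be averaged entry by entry, which the text does not spell out, and all names and toy values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probabilities(x, num_tasks, feature_extractor, rectifiers, heads):
    """Average the softmax outputs obtained in every task domain.

    Assumes heads[t](feat) returns logits over one shared class list.
    """
    feat = feature_extractor(x)
    per_domain = [softmax(heads[num_tasks](feat))]
    for t in range(num_tasks, 1, -1):
        feat = rectifiers[t](feat, x)  # move from domain t to domain t-1
        per_domain.append(softmax(heads[t - 1](feat)))
    n_classes = len(per_domain[0])
    return [sum(p[c] for p in per_domain) / len(per_domain)
            for c in range(n_classes)]


# Toy example with two tasks and two classes.
heads = {1: lambda feat: [0.0, 0.0], 2: lambda feat: [math.log(3.0), 0.0]}
rectifiers = {2: lambda feat, x: feat}
probs = ensemble_probabilities(1.0, 2, lambda x: x, rectifiers, heads)
print(probs)  # average of [0.75, 0.25] and [0.5, 0.5]
```

Because every head contributes with equal weight, an over-confident head can dominate the average, which is one reason a dedicated task-identity predictor might be preferable.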

4 Experiments

Our implementation is based partially on the Mammoth [6, 7], TAMiL [5], and CLS-ER [3] repositories.

4.1 Evaluation Protocol

Datasets. We select three standard continual learning benchmarks for our experiments: Sequential CIFAR10 (S-CIFAR10), Sequential CIFAR100 (S-CIFAR100), and Sequential Tiny ImageNet (S-TinyImg). Specifically, we divide S-CIFAR10 into 5 binary classification tasks, S-CIFAR100 into 5 tasks with 20 classes each, and S-TinyImg into 10 tasks with 20 classes each.
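A sequential benchmark of this kind is obtained by partitioning the class labels into disjoint groups, one group per task. The helper below is a minimal sketch of such a split; assigning classes to tasks in label order is an assumption made here for simplicity, and the function name is hypothetical.

```python
def split_classes(num_classes, classes_per_task):
    """Partition class labels 0..num_classes-1 into consecutive disjoint tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be divisible by classes_per_task")
    return [list(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]


cifar10_tasks = split_classes(10, 2)     # 5 binary tasks
cifar100_tasks = split_classes(100, 20)  # 5 tasks with 20 classes each
print(len(cifar10_tasks), cifar10_tasks[0])
print(len(cifar100_tasks), cifar100_tasks[-1][:3])
```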

Baselines. We evaluate ILR against representative continual learning methods, including EWC (online) [38], LwF (multi-class) [24], ER [10], AGEM [9], DER++ [7], ER-ACE [8], CLS-ER [3], and TAMiL [5]. We further provide an upper and a lower bound for all methods by jointly training on all tasks’ data and by fine-tuning without any catastrophic forgetting mitigation, respectively. We employ ResNet18 [15] as the unified feature extractor for all benchmarks. The classifier comprises separate linear heads, one for each task. More datasets and implementation details are provided in the Appendix.

Table 1: Task-Incremental Average Accuracy (AA) across all tasks after CL training, together with the number of parameters (NP). Joint: the upper bound accuracy when jointly training on all tasks (i.e., multi-task learning). Finetuning: the lower bound accuracy when learning without any CL techniques. $|\mathcal{B}|$ is the size of the buffer holding data from all past tasks, while $|\mathcal{S}_t|$ is the size of the alignment training set, which only contains data from task $t-1$.
| Method (TIL) | $\lvert\mathcal{B}\rvert$ | $\lvert\mathcal{S}_t\rvert$ | S-CIFAR10 NP | S-CIFAR10 AA | S-CIFAR100 NP | S-CIFAR100 AA | S-TinyImg NP | S-TinyImg AA |
|---|---|---|---|---|---|---|---|---|
| Joint | - | - | 11.17M | 98.46±0.07 | 11.22M | 86.37±0.17 | 11.27M | 81.86±0.57 |
| Finetuning | - | - | 11.17M | 64.16±2.40 | 11.22M | 24.01±2.14 | 11.27M | 13.79±0.23 |
| o-EWC | - | - | 11.17M | 69.60±5.22 | 11.22M | 36.61±3.82 | 11.27M | 15.67±0.67 |
| LwF.mc | - | - | 11.17M | 60.96±1.48 | 11.22M | 41.00±1.01 | 11.27M | 23.24±0.71 |
| AGEM | 500 | - | 11.17M | 90.37±1.05 | 11.22M | 63.35±1.47 | 11.27M | 37.14±0.32 |
| ER | 500 | - | 11.17M | 94.24±0.24 | 11.22M | 67.41±0.70 | 11.27M | 46.07±0.16 |
| DER++ | 500 | - | 11.17M | 92.49±0.55 | 11.22M | 68.52±0.91 | 11.27M | 50.84±0.12 |
| ER-ACE | 500 | - | 11.17M | 94.52±0.13 | 11.22M | 67.26±0.50 | 11.27M | 47.72±0.42 |
| TAMiL | 500 | - | 22.68M | 94.89±0.16 | 22.77M | 76.39±0.29 | 23.20M | 64.24±0.69 |
| CLS-ER | 500 | - | 33.52M | 95.35±0.34 | 33.66M | 77.03±0.81 | 33.81M | 54.69±0.37 |
| ILR | - | 500 | 13.31M | 86.28±0.69 | 13.36M | 74.59±0.52 | 16.08M | 59.78±0.39 |
| AGEM | 1000 | - | 11.17M | 91.68±1.48 | 11.22M | 67.43±1.37 | 11.27M | 46.94±0.91 |
| ER | 1000 | - | 11.17M | 95.25±0.07 | 11.22M | 69.69±1.49 | 11.27M | 54.54±0.40 |
| DER++ | 1000 | - | 11.17M | 93.76±0.23 | 11.22M | 72.27±1.13 | 11.27M | 58.67±0.28 |
| ER-ACE | 1000 | - | 11.17M | 94.69±0.25 | 11.22M | 72.46±0.58 | 11.27M | 57.37±0.49 |
| TAMiL | 1000 | - | 22.68M | 95.22±0.42 | 22.77M | 78.72±0.31 | 23.20M | 70.89±0.04 |
| CLS-ER | 1000 | - | 33.52M | 96.05±0.11 | 33.66M | 79.36±0.20 | 33.81M | 65.00±0.02 |
| ILR | - | 1000 | 13.31M | 91.02±1.76 | 13.36M | 78.53±0.25 | 16.08M | 66.79±0.64 |
| ILR | - | 5000 | 13.31M | 94.84±0.31 | 13.36M | 82.05±0.29 | 16.08M | 72.50±0.92 |

4.2 Results

Table 1 shows the performance of ILR and other CL methods, including rehearsal-based and regularization-based methods, on multiple sequential datasets, including S-CIFAR10, S-CIFAR100, and S-TinyImg. For ILR, we create an alignment set from 500, 1000, and 5000 samples of $\mathcal{D}_{t-1}^{\mathrm{train}}$. As can be observed from the table, ILR achieves results comparable to the baselines on S-CIFAR10. On S-CIFAR100 and S-TinyImg, ILR outperforms all the baselines given a sufficient alignment set, indicating its ability to rectify representation changes incrementally.

Table 2: Class-Incremental Average Accuracy across all tasks after CL training. The settings are similar to Table 1.
| Method (CIL) | $\lvert\mathcal{B}\rvert$ | $\lvert\mathcal{S}_t\rvert$ | S-CIFAR100 NP | S-CIFAR100 AA |
|---|---|---|---|---|
| Joint | - | - | 11.22M | 71.07±0.27 |
| Finetuning | - | - | 11.22M | 17.50±0.09 |
| DER++ | 1000 | - | 11.22M | 46.96±0.17 |
| ER-ACE | 1000 | - | 11.22M | 47.09±1.16 |
| TAMiL | 1000 | - | 22.77M | 51.83±0.41 |
| CLS-ER | 1000 | - | 33.66M | 51.13±0.12 |
| ILR | - | 1000 | 13.56M | 42.53±0.43 |
| ILR | - | 5000 | 13.56M | 48.90±0.28 |

Table 2 demonstrates the extension of ILR to class-incremental settings. As the class-incremental probabilities are simply obtained through averaging, we can still achieve comparable performance to other rehearsal-based methods given a sufficient alignment set.

4.3 Result with different alignment sets

We further evaluate choices of training data used for the alignment set $\mathcal{S}_t$, as discussed in Section 3. The choices include using samples from the previous task’s training data $\mathcal{D}_{t-1}^{\mathrm{train}}$, the current task’s training data $\mathcal{D}_t^{\mathrm{train}}$, and a generative network $G_{t-1}$. Details of the training of $G_{t-1}$ are included in the Appendix.

Table 3 shows the results of these experiments. As can be observed, training with data from $\mathcal{D}_{t-1}^{\mathrm{train}}$ expectedly achieves better performance since the data is sampled directly from the data distribution $\mathcal{D}_{t-1}$ of the previous task; increasing the number of samples from $\mathcal{D}_{t-1}^{\mathrm{train}}$ yields better performance. The generative network also achieves comparable results due to its ability to synthesize data from $\mathcal{D}_{t-1}$. Nevertheless, training with $\mathcal{D}_t^{\mathrm{train}}$ is also an attractive choice for its reasonable performance and the fact that we do not need to keep a copy of the previous task’s data.

Table 3: Average Accuracy across 5 tasks for S-CIFAR100 dataset with different options of alignment training data.
Variation  Keep $t-1$ data  Keep $f_{t-1}$  Keep $G_{t-1}$  Avg. Accuracy
ILR with $\mathcal{S}_t = \mathcal{D}_t^{\mathrm{train}}$  -  -  69.22±0.40
ILR with $\mathcal{S}_t \subset \mathcal{D}_{t-1}^{\mathrm{train}}, |\mathcal{S}_t| = 5000$  -  -  82.05±0.29
ILR-GAN ($\mathcal{S}_t \sim \mathcal{D}_{t-1}$)  -  79.51±0.48

4.4 Parameter Growth Comparison

Table 4: Number of parameters ($\downarrow$, in millions) of different methods after $N$ tasks. Results for baselines are taken from [5] and [3], measured on S-TinyImg. The ResNet-18 network with no classifier head has 11.17 million parameters.
| Methods | 5 tasks | 10 tasks | 20 tasks |
|---|---|---|---|
| ResNet-18 | 11.27M | 11.27M | 11.27M |
| TAMiL [5] | 22.87M | 23.20M | 23.85M |
| CLS-ER [3] | 33.81M | 33.81M | 33.81M |
| ILR | 13.94M | 16.08M | 21.96M |

This section studies the network-size footprint of our framework. The base ResNet-18 has 11.17 million parameters. We report the network sizes after 5, 10, and 20 tasks for ILR and the two baselines, CLS-ER and TAMiL, in Table 4. As we can observe, ILR exhibits linear memory growth and has the smallest memory footprint among the three methods. Further analysis reveals that the compress layer (512x384 linear layer) and the combine layer (512x512 linear layer) contribute the most to memory usage, requiring approximately 0.20 million and 0.26 million parameters per task, respectively. Meanwhile, the weak feature extractor’s contribution to the total number of parameters is negligible at 0.07 million parameters per task.
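The two per-layer figures quoted above follow directly from the layer shapes, as the short calculation below illustrates. Whether the reported counts include bias terms is not stated, so the helper (a hypothetical name) exposes that as an option; both choices round to the same values.

```python
def linear_params(in_features, out_features, bias=False):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


compress = linear_params(512, 384)  # compress layer a_t
combine = linear_params(512, 512)   # combine layer b_t

print(f"compress: {compress:,} parameters = {compress / 1e6:.2f}M")
print(f"combine:  {combine:,} parameters = {combine / 1e6:.2f}M")
```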

4.5 Rectifier Quality Experiment

In this section, we utilize Principal Component Analysis (PCA) to visualize our learned latent space against the target latent space of the previous task and verify the behaviors of the rectifier in recovering past representation. Figure 3 shows the PCA plots with the first two components. As can be observed, the new representations of data from the previous task (red) after learning the current task change significantly from their original representations (green), which explains catastrophic forgetting. With ILR’s mechanism, the rectified data representations (blue) can now accurately align with the ‘true’ data representations (green), supporting the empirical effectiveness of our framework.

Figure 3: We employ principal component analysis (PCA) to visualize our rectified latent space after training on task $t$ and predicting task $t'$ ($t' < t$). By visualizing the original latent representation ($f_t$), the rectified latent representation ($\hat{f}_{t'}$), and the target latent representation ($f_{t'}$), we assess our training method’s effectiveness. The closer the rectified latent representation is to the target latent representation, the better our training method performs. The experiment is conducted by training on S-CIFAR10 with $\mathcal{S}_t \subset \mathcal{D}_{t-1}$ and $\alpha = 0$.
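A comparison of this kind can be reproduced in spirit with a few lines of NumPy. The sketch below is not the procedure used to produce Figure 3: the data are synthetic stand-ins for the three sets of representations, and fitting the principal components on the target representations is an assumption, since the text does not say which set the PCA was fitted on.

```python
import numpy as np


def pca_project(reference, *others, n_components=2):
    """Fit PCA on `reference` via SVD and project every array onto it."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    basis = vt[:n_components].T  # shape: (feature_dim, n_components)
    return [(arr - mean) @ basis for arr in (reference, *others)]


def mean_distance(a, b):
    """Average Euclidean distance between corresponding rows of a and b."""
    return float(np.linalg.norm(a - b, axis=1).mean())


rng = np.random.default_rng(0)
target = rng.normal(size=(200, 16))                         # stands in for f_{t'}(x)
drifted = 0.5 * target + 3.0                                # stands in for f_t(x)
rectified = target + 0.05 * rng.normal(size=target.shape)   # stands in for the rectified features

target_2d, drifted_2d, rectified_2d = pca_project(target, drifted, rectified)
print("drifted vs target:  ", mean_distance(target_2d, drifted_2d))
print("rectified vs target:", mean_distance(target_2d, rectified_2d))
```

A smaller distance for the rectified points than for the drifted points is the numerical counterpart of the visual overlap one would look for in the scatter plot.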

4.6 Ablation study

In this section, we investigate the impact of our alignment loss and of the representation regularization. We isolate the effect of the alignment loss by setting $\tau = 0$ in Equation 1, effectively replacing it with a simple $\ell_2$ norm. To analyze the contribution of the representation regularization to rectification effectiveness, we set $\alpha = 0$ in Equation 4, eliminating it from the main feature extractor’s training process.

Table 5: Ablating $\tau$ and $\alpha$ for $|\mathcal{S}_t| = 5000$

| Method | Hyperparameter | S-CIFAR10 | S-CIFAR100 | S-TinyImg |
|---|---|---|---|---|
| ILR | $\alpha = 0.0, \tau \neq 0.0$ | 89.68±0.75 | 72.45±0.42 | 58.73±0.81 |
| ILR | $\alpha \neq 0.0, \tau = 0.0$ | 90.82±1.17 | 81.67±0.22 | 72.07±0.37 |
| ILR | $\alpha \neq 0, \tau \neq 0$ | 94.84±0.31 | 82.05±0.29 | 72.50±0.92 |

5 Limitations

We have shown the potential and high utility of ILR’s CL learning mechanism in this paper. Nevertheless, ILR has some limitations. One limitation is that ILR still maintains an additional DNN, i.e., the rectifier, which incurs additional overhead as the number of tasks increases. Inference over a long rectifier chain would be costly; this could be mitigated with modified chaining methods such as skipping (i.e., building a rectifier every two tasks). Additionally, the best performance is achieved when task $t-1$’s data is accessible. Ideally, we would want to remove this requirement; thus, future research should focus on how the alignment training data is created. We have attempted to demonstrate that generative methods are a viable option. Furthermore, since ILR relies on the task identity to reconstruct the rectifier sequence, applying it to class-incremental learning settings requires either inferring the task identity or forming an ensemble of predictions. The proposed ensemble solution might suffer from over-confident or under-confident classifiers. Class-incremental learning is still an open research problem, for which more effective adaptations of our framework may be discovered.
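The linear growth of inference cost along the rectifier chain can be illustrated with a toy sketch. It assumes, purely for illustration, that the rectifier indexed by $t$ maps task-$t$ representations back to task-$(t-1)$ representations; the function name `rectify`, the scalar "features", and the subtractive toy rectifiers are hypothetical stand-ins and not the networks used in the paper.

```python
def rectify(feature, rectifiers, current_task, target_task):
    """Map a feature from the current task's space back to an older task's space.

    Assumption: rectifiers[t] maps task-t representations to task-(t-1)
    representations, so the chain applies r_t, r_{t-1}, ..., r_{t'+1}.
    Returns the rectified feature and the number of rectifier calls.
    """
    hops = 0
    for t in range(current_task, target_task, -1):
        feature = rectifiers[t](feature)
        hops += 1
    return feature, hops


# Toy stand-ins: training task t shifts a scalar feature by drift[t],
# and the matching "rectifier" simply undoes that shift.
drift = {2: 0.5, 3: -0.2, 4: 1.0, 5: 0.3}
rectifiers = {t: (lambda z, d=d: z - d) for t, d in drift.items()}

original = 1.0                             # feature under the task-1 extractor
drifted = original + sum(drift.values())   # same input after training task 5
recovered, hops = rectify(drifted, rectifiers, current_task=5, target_task=1)
print(recovered, hops)                     # hops grows with current_task - target_task
```

In this sketch, predicting on task $t'$ after training task $t$ costs $t - t'$ rectifier calls, which is the overhead that a skipping scheme would aim to reduce.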

6 Conclusion

This work proposes a new CL paradigm, ILR, for task-incremental learning. ILR tackles catastrophic forgetting through its novel backward-recall mechanism, which learns to align the newly learned representations of past data to their correct representations. Unlike existing CL methods, it requires neither a replay buffer nor intricate training modifications. Our experiments validate that the proposed ILR achieves results comparable to existing CL baselines for task-incremental and class-incremental learning.

References

  • Ahn et al. [2019] H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems, pages 4394–4404, 2019.
  • Aljundi et al. [2018] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • Arani et al. [2022] E. Arani, F. Sarfraz, and B. Zonooz. Learning fast, learning slow: A general continual learning method based on complementary learning system. In International Conference on Learning Representations, 2022.
  • Balaji et al. [2020] Y. Balaji, M. Farajtabar, D. Yin, A. Mott, and A. Li. The effectiveness of memory replay in large scale continual learning. arXiv preprint arXiv:2010.02418, 2020.
  • Bhat et al. [2023] P. S. Bhat, B. Zonooz, and E. Arani. Task-aware information routing from common representation space in lifelong learning. In The Eleventh International Conference on Learning Representations, 2023.
  • Boschini et al. [2022] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Buzzega et al. [2020] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15920–15930. Curran Associates, Inc., 2020.
  • Caccia et al. [2022] L. Caccia, R. Aljundi, N. Asadi, T. Tuytelaars, J. Pineau, and E. Belilovsky. New insights on reducing abrupt representation change in online continual learning. In International Conference on Learning Representations, 2022.
  • Chaudhry et al. [2019a] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019a.
  • Chaudhry et al. [2019b] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. Dokania, P. Torr, and M. Ranzato. Continual learning with tiny episodic memories. In Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019b.
  • Chaudhry et al. [2019] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. S. Torr, and M. Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
  • Ebrahimi et al. [2020] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklUCCVKDB.
  • Farajtabar et al. [2020] M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR, 2020.
  • Fernando et al. [2017] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Hu et al. [2021] H. Hu, A. Li, D. Calandriello, and D. Gorur. One pass imagenet. In NeurIPS 2021 Workshop on ImageNet: Past, Present, and Future, 2021. URL https://openreview.net/forum?id=mEgL92HSW6S.
  • Jerfel et al. [2019] G. Jerfel, E. Grant, T. L. Griffiths, and K. A. Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. In NeurIPS, 2019.
  • Kang and Park [2020] M. Kang and J. Park. ContraGAN: Contrastive Learning for Conditional Image Generation. 2020.
  • Kang et al. [2021] M. Kang, W. Shim, M. Cho, and J. Park. Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training. 2021.
  • Kang et al. [2023] M. Kang, J. Shin, and J. Park. StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
  • Kirichenko et al. [2021] P. Kirichenko, M. Farajtabar, D. Rao, B. Lakshminarayanan, N. Levine, A. Li, H. Hu, A. G. Wilson, and R. Pascanu. Task-agnostic continual learning with hybrid probabilistic models. 2021. URL https://openreview.net/forum?id=ZbSeZKdqNkm.
  • Kirkpatrick et al. [2017] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Li et al. [2019] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
  • Li and Hoiem [2017] Z. Li and D. Hoiem. Learning without forgetting. arXiv preprint arXiv:1606.09282, 2017.
  • Lopez-Paz and Ranzato [2017] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
  • Mallya and Lazebnik [2018] A. Mallya and S. Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • Masse et al. [2018] N. Y. Masse, G. D. Grant, and D. J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44):10467–10475, 2018.
  • McCloskey and Cohen [1989] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
  • Nguyen et al. [2018] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
  • Qin et al. [2021] Q. Qin, W. Hu, H. Peng, D. Zhao, and B. Liu. Bns: Building network structures dynamically for continual learning. Advances in Neural Information Processing Systems, 34:20608–20620, 2021.
  • Rao et al. [2019] D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
  • Rasch and Born [2007] B. Rasch and J. Born. Maintaining memories by reactivation. Current Opinion in Neurobiology, 17(6):698–703, 2007.
  • Ratcliff [1990] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, Apr. 1990.
  • Rebuffi et al. [2016] S.-A. Rebuffi, A. I. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542, 2016.
  • Riemer et al. [2018] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
  • Rios and Itti [2018] A. Rios and L. Itti. Closed-loop GAN for continual learning. arXiv preprint arXiv:1811.01146, 2018.
  • Rusu et al. [2016] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Schwarz et al. [2018] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4528–4537. PMLR, 2018.
  • Shin et al. [2017] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
  • Sun et al. [2022] S. Sun, D. Calandriello, H. Hu, A. Li, and M. Titsias. Information-theoretic online memory selection for continual learning. In International Conference on Learning Representations (ICLR), 2022.
  • Tseng et al. [2021] H.-Y. Tseng, L. Jiang, C. Liu, M.-H. Yang, and W. Yang. Regularizing generative adversarial networks under limited data. In CVPR, 2021.
  • Van et al. [2022] L. N. Van, N. L. Hai, H. Pham, and K. Than. Auxiliary local variables for improving regularization/prior approach in continual learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 16–28. Springer, 2022.
  • Verwimp et al. [2021] E. Verwimp, M. De Lange, and T. Tuytelaars. Rehearsal revealed: The limits and merits of revisiting samples in continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9385–9394, 2021.
  • Wortsman et al. [2020] M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, and A. Farhadi. Supermasks in superposition. arXiv preprint arXiv:2006.14769, 2020.
  • Xu and Zhu [2018] J. Xu and Z. Zhu. Reinforced continual learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Yan et al. [2021] S. Yan, J. Xie, and X. He. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3014–3023, 2021.
  • Yin et al. [2020] D. Yin, M. Farajtabar, and A. Li. SOLA: Continual learning with second-order loss approximation. arXiv preprint arXiv:2006.10974, 2020.
  • Yoon et al. [2018] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In Sixth International Conference on Learning Representations. ICLR, 2018.
  • Zenke et al. [2017] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017.
  • Zhang et al. [2019] M. Zhang, T. Wang, J. H. Lim, and J. Feng. Prototype reminding for continual learning. arXiv preprint arXiv:1905.09447, 2019.

Appendix

Appendix A Detailed Experimental Setup

Computing resource. We run the experiments on a machine with 8 NVIDIA RTX A5000s.

A.1 Baselines

As detailed in Section 4.1, we evaluate ILR against EWC (online version), LwF (multi-class version), ER, AGEM, DER++, ER-ACE, CLS-ER, and TAMiL.

For an extensive comparison, we provide rehearsal-based methods with a buffer with a maximum capacity of 500 or 1,000 samples. Since our method relies not on a buffer of data from all tasks but only on an alignment set of task $t-1$ data, forgetting can be more significant, which makes the comparison of ILR against rehearsal-based methods unfair. Therefore, we provide ILR with alignment sets of 500, 1,000, and 5,000 samples.

We replicate training settings as follows: For ER, DER++, ER-ACE, TAMiL, and CLS-ER, we employ the reservoir sampling strategy to remove the reliance on task boundaries as in the original implementation. On the other hand, ILR, AGEM, and TAMiL rely on the task boundary to learn the rectifier, modify the buffer, and add a new task-attention module, respectively. For TAMiL, we use the best-reported task-attention architecture. For CLS-ER, we perform inference using the stable model per the original formulation.
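As a reminder of why reservoir sampling removes the need for task boundaries, the following minimal sketch implements the classic single-pass update (often called Algorithm R): every stream item ends up in the buffer with equal probability, regardless of which task it came from. The function name `reservoir_update` and the integer stream are illustrative assumptions, not code from the baselines' repositories.

```python
import random


def reservoir_update(buffer, capacity, item, num_seen, rng=random):
    """Offer one stream item to a fixed-capacity buffer.

    num_seen is the number of items observed before this one.
    Returns the updated count of observed items.
    """
    if len(buffer) < capacity:
        buffer.append(item)                # buffer not full yet: always keep
    else:
        j = rng.randrange(num_seen + 1)    # uniform integer in [0, num_seen]
        if j < capacity:
            buffer[j] = item               # kept with probability capacity / (num_seen + 1)
    return num_seen + 1


rng = random.Random(0)
buffer, seen = [], 0
for x in range(10_000):                    # stand-in for a stream of training samples
    seen = reservoir_update(buffer, 500, x, seen, rng)
print(len(buffer), seen)
```

Because the update only needs the running count of observed samples, the buffer can be maintained without knowing when one task ends and the next begins.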

A.2 Datasets

To demonstrate the effectiveness of our method, we perform empirical evaluations on three standard continual learning benchmarks: Sequential CIFAR10 (S-CIFAR10), Sequential CIFAR100 (S-CIFAR100), and Sequential Tiny ImageNet (S-TinyImg). The datasets are split into 5, 5, and 10 tasks containing 2, 20, and 20 classes each, respectively. S-CIFAR10 and S-CIFAR100 each include 60,000 $32 \times 32$ images split into 50,000 training images and 10,000 test images, with each task comprising 10,000 training images and 2,000 test images. S-TinyImg contains 110,000 $64 \times 64$ images, with 100,000 training images and 10,000 test images divided into 10 tasks of 10,000 training images and 1,000 test images each. We apply simple augmentation, namely random horizontal flips and random image cropping, to each training and buffered image.
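The split arithmetic above can be checked with a short sketch. It assumes, for illustration only, that classes are assigned to tasks in contiguous index order; the helper `split_classes` is a hypothetical name, and the actual class ordering used in the experiments is not stated here.

```python
def split_classes(num_classes, num_tasks):
    """Partition class indices 0..num_classes-1 into equal, contiguous tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("classes must divide evenly across tasks")
    per_task = num_classes // num_tasks
    return [list(range(i * per_task, (i + 1) * per_task)) for i in range(num_tasks)]


# (number of classes, number of tasks, training images in the full dataset)
benchmarks = {
    "S-CIFAR10": (10, 5, 50_000),
    "S-CIFAR100": (100, 5, 50_000),
    "S-TinyImg": (200, 10, 100_000),
}

for name, (n_classes, n_tasks, n_train) in benchmarks.items():
    tasks = split_classes(n_classes, n_tasks)
    print(name, len(tasks), "tasks,", len(tasks[0]), "classes per task,",
          n_train // n_tasks, "training images per task")
```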

A.3 Training

Settings. The training set of each task is divided 90%-10% into training and validation sets. All methods are optimized with the Adam optimizer available in PyTorch with a learning rate of $5 \times 10^{-4}$. When the validation loss plateaus for 3 epochs, we reduce the learning rate by a factor of 0.1. Each task is trained for 40 epochs. For ILR, we train $h_t$ and $r_t$ using the same formulation, with the Adam optimizer at a learning rate of $5 \times 10^{-4}$ for 50 epochs.
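The plateau rule can be sketched in plain Python as follows. This is one possible reading of the schedule, assuming the rate is multiplied by 0.1 once the validation loss has failed to improve on its best value for 3 consecutive epochs; the function `plateau_schedule` and the example losses are hypothetical. PyTorch provides a comparable rule as `torch.optim.lr_scheduler.ReduceLROnPlateau`, although its patience counting may differ slightly from this sketch.

```python
def plateau_schedule(val_losses, lr=5e-4, factor=0.1, patience=3):
    """Return the learning rate in effect during each epoch.

    The rate is multiplied by `factor` after `patience` consecutive epochs
    in which the validation loss did not improve on its best value so far.
    """
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in val_losses:
        lrs.append(lr)                 # rate used while training this epoch
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr *= factor           # decay and restart the patience counter
                bad_epochs = 0
    return lrs


# Made-up validation losses, only to exercise the rule.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.65, 0.65]
lrs = plateau_schedule(losses)
print(lrs)
```

With these made-up losses, the first six epochs run at the initial rate and the decay takes effect from the seventh epoch onward.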

Weak feature extractor. We provide the architecture of the weak feature extractor htsubscript𝑡h_{t}italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in Table 6. We choose a simple design of two 3x3 convolution layers coupled with two max pooling layers.

Table 6: Architecture of the weak feature extractor $h_t$. We use ReLU activation after each convolution layer. For each task, a weak feature extractor $h_t$ is distilled from the current feature extractor $f_t$. The output dimension of $h$ is 128, while the output dimension of the main feature extractor is 512.
Layer Channel Kernel Stride Padding Output size
Input 3 $16 \times 16$
Conv 1 64 $3 \times 3$ 2 1 $8 \times 8$
MaxPool 2 $4 \times 4$
Conv 2 128 $3 \times 3$ 2 1 $2 \times 2$
MaxPool 2 $1 \times 1$
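The output sizes in Table 6 follow from the standard formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for a layer with input size $n$, kernel $k$, stride $s$, and padding $p$. The sketch below re-derives them, assuming that the MaxPool entry "2" denotes a window of 2 applied with stride 2 and no padding (the table does not say which); `out_size` is an illustrative helper name.

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial size after a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# (layer name, kernel, stride, padding, output channels; None keeps channels)
layers = [
    ("Conv 1", 3, 2, 1, 64),
    ("MaxPool", 2, 2, 0, None),   # assumed: window 2, stride 2
    ("Conv 2", 3, 2, 1, 128),
    ("MaxPool", 2, 2, 0, None),   # assumed: window 2, stride 2
]

size, channels = 16, 3            # input: 3 channels, 16 x 16
trace = []
for name, k, s, p, c_out in layers:
    size = out_size(size, k, s, p)
    channels = c_out if c_out is not None else channels
    trace.append((name, channels, size))

feature_dim = channels * size * size   # flattened output of the last layer
print(trace, feature_dim)
```

Under these assumptions the final feature map is $1 \times 1$ with 128 channels, which agrees with the stated 128-dimensional output of $h$.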

GAN training. We use the StudioGAN repository’s default implementation [20, 19, 18] of BigGAN-LeCam [41] to train the network on each task of S-CIFAR100. The obtained FID score for each task is between 17 and 23. The BigGAN network has nearly 95 million parameters. During ILR training, we sample directly from the BigGAN network.

A.4 Hyperparameter search

For all methods, experiments, and datasets, we perform a grid search over the following hyper-parameters using a validation set. Some of the following hyperparameters are obtained directly from their original implementation to narrow down the search range.

  • Joint, Finetuning, LwF.mc, ER, AGEM, ER-ACE: No hyperparameters

  • o-EWC:

    • $\lambda \in \{10, 20, 50, 100\}$

    • $\gamma \in \{0.9, 1\}$

  • DER++:

    • $\alpha \in \{0.1, 0.2, 0.5, 1\}$

    • $\beta \in \{0.1, 0.2, 0.5, 1\}$

  • CLS-ER:

    • $r_p \in \{0.5, 0.9\}$

    • $r_s \in \{0.1, 0.5\}$

    • $\alpha_p \in \{0.999\}$

    • $\alpha_s \in \{0.999\}$

  • TAMiL:

    • $\alpha \in \{0.2, 0.5, 1\}$

    • $\beta \in \{0.1, 0.2, 1\}$

    • $\theta \in \{0.1\}$

  • ILR:

    • $\alpha \in \{1, 2, 3\}$

    • $\tau \in \{0.5, 1, 2\}$
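To make the size of each search concrete, the grids listed above can be enumerated with a short sketch. The dictionary keys and the helpers `configurations` and `grid_search` are illustrative names, and the scoring function in the usage line is a placeholder rather than a real validation metric.

```python
from itertools import product

# Search spaces copied from the list above (methods with tunable values only).
grids = {
    "o-EWC": {"lambda": [10, 20, 50, 100], "gamma": [0.9, 1]},
    "DER++": {"alpha": [0.1, 0.2, 0.5, 1], "beta": [0.1, 0.2, 0.5, 1]},
    "CLS-ER": {"r_p": [0.5, 0.9], "r_s": [0.1, 0.5],
               "alpha_p": [0.999], "alpha_s": [0.999]},
    "TAMiL": {"alpha": [0.2, 0.5, 1], "beta": [0.1, 0.2, 1], "theta": [0.1]},
    "ILR": {"alpha": [1, 2, 3], "tau": [0.5, 1, 2]},
}


def configurations(grid):
    """Yield every combination in a grid as a dict of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, validation_score):
    """Return the configuration with the highest validation score."""
    return max(configurations(grid), key=validation_score)


counts = {method: sum(1 for _ in configurations(g)) for method, g in grids.items()}
print(counts)

# Placeholder score (not a real metric): prefers larger alpha and tau.
best = grid_search(grids["ILR"], lambda cfg: cfg["alpha"] + cfg["tau"])
print(best)
```

In practice `validation_score` would train the method with the given configuration and return its accuracy on the held-out validation split.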

Table 7: Hyperparameters for the methods in Table 1

| Method | $\lvert\mathcal{B}\rvert$ | $\lvert\mathcal{S}_t\rvert$ | S-CIFAR10 | S-CIFAR100 | S-TinyImg |
|---|---|---|---|---|---|
| o-EWC | - | - | $\lambda = 100, \gamma = 0.9$ | $\lambda = 50, \gamma = 0.1$ | $\lambda = 20, \gamma = 0.9$ |
| DER++ | | | $\alpha = 0.5, \beta = 0.1$ | $\alpha = 0.2, \beta = 0.1$ | $\alpha = 0.5, \beta = 0.1$ |
| TAMiL | | | $\alpha = 1.0, \beta = 1.0$ | $\alpha = 1.0, \beta = 1.0$ | $\alpha = 1.0, \beta = 0.5$ |
| CLS-ER | | | $r_p = 0.5, r_s = 0.1$ | $r_p = 0.9, r_s = 0.1$ | $r_p = 0.5, r_s = 0.1$ |
| ILR | - | 500 | $\alpha = 3, \tau = 2$ | $\alpha = 1, \tau = 2$ | $\alpha = 1, \tau = 2$ |
| DER++ | | | $\alpha = 1.0, \beta = 0.1$ | $\alpha = 0.2, \beta = 0.1$ | $\alpha = 1.0, \beta = 0.1$ |
| TAMiL | | | $\alpha = 1.0, \beta = 1.0$ | $\alpha = 1.0, \beta = 1.0$ | $\alpha = 1.0, \beta = 0.5$ |
| CLS-ER | | | $r_p = 0.5, r_s = 0.1$ | $r_p = 0.5, r_s = 0.1$ | $r_p = 0.9, r_s = 0.1$ |
| ILR | - | 1000 | $\alpha = 3, \tau = 2$ | $\alpha = 1, \tau = 2$ | $\alpha = 2, \tau = 2$ |
| ILR | - | 5000 | $\alpha = 3, \tau = 2$ | $\alpha = 3, \tau = 2$ | $\alpha = 2, \tau = 2$ |

Appendix B Versatility of ILR Framework

In ILR, as the tasks arrive, conventional fine-tuning or training on the new task happens without any CL intervention. ILR only augments this process with a separate training of the backward-recall mechanism. The attractiveness of this framework is twofold. First, ILR allows the best adaptation to the new task, possibly achieving maximum plasticity, while the backward-recall mechanism mitigates catastrophic forgetting. Second, unlike previous CL approaches that modify the sequential training process (e.g., by changing the loss functions, using an additional buffer, or dynamically adjusting the network’s architecture during fine-tuning), ILR does not change the fine-tuning process, allowing users to incorporate this framework more flexibly into their existing machine learning pipelines.

Relationship to Memory Linking. ILR’s process of mapping newly learned knowledge representations resembles the popular human mnemonic memory-linking technique, which establishes associations between fragments of information to enhance memory retention or recall (see https://en.wikipedia.org/wiki/Mnemonic_link_system). As the model learns a new task, the feature rectifier unit establishes a mnemonic link from the new representation of a sample from a past task to its correct representation under that past task.

Appendix C Societal Impacts

Our work has the potential to improve the capability of ML systems to adapt to a changing world, as is often required in domains such as healthcare, education, and finance. This results in more reliable and robust learning systems. On the other hand, our framework shares the potential negative impacts often found in classification and predictive tasks, including bias, privacy risks, and misclassification.