Learning Rate Curriculum

Florinel-Alin Croitoru, Nicolae-Cătălin Ristea, Radu Tudor Ionescu [1], Nicu Sebe

[1] Department of Computer Science, University of Bucharest, 14 Academiei, Bucharest, 010014, Romania

[2] Faculty of Electronics, Telecommunications, and Information Technology, National University of Science and Technology Politehnica Bucharest, 313 Splaiul Independentei, Bucharest, 060042, Romania

[3] Department of Information Engineering and Computer Science, University of Trento, 9 via Sommarive, Povo-Trento, 38123, Italy

Abstract

Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages the use of a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet-200, Food-101, UTKFace, PASCAL VOC), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121, YOLOv5), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures. We compare our approach with the conventional training regime, as well as with Curriculum by Smoothing (CBS), a state-of-the-art data-agnostic curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time (there is no additional cost over the standard training regime for LeRaC). Our code is freely available at: https://github.com/CroitoruAlin/LeRaC.

1 Introduction

Curriculum learning [1] refers to efficiently training effective neural networks by mimicking how humans learn, from easy to hard. As originally introduced by Bengio et al. [1], curriculum learning is a training procedure that first organizes the examples in their increasing order of difficulty, then starts the training of the neural network on the easiest examples, gradually adding increasingly more difficult examples along the way, until all training examples are fed into the network. The success of the approach lies in avoiding imposing the learning of very difficult examples right from the beginning, instead guiding the model on the right path through the imposed curriculum. This type of curriculum is later referred to as data-level curriculum learning [2]. Indeed, Soviany et al. [2] identified several types of curriculum learning approaches in the literature, dividing them into four categories based on the components involved in the definition of machine learning given by Mitchell [3]. The four categories are: data-level curriculum (examples are presented from easy to hard), model-level curriculum (the modeling capacity of the network is gradually increased), task-level curriculum (the complexity of the learning task is increased during training), and objective-level curriculum (the model optimizes towards an increasingly more complex objective). While data-level curriculum is the most natural and direct way to employ curriculum learning, its main disadvantage is that it requires a way to determine the difficulty of data samples. Despite having many successful applications [2, 4], there is no universal way to determine the difficulty of the data samples, making the data-level curriculum less applicable to scenarios where the difficulty is hard to estimate, e.g. classification of radar signals. The task-level and objective-level curriculum learning strategies suffer from similar issues, e.g. it is hard to create a curriculum when the model has to learn an easy task (binary classification) or the objective function is already convex.

Figure 1: Training based on Learning Rate Curriculum.

Considering the above observations, we recognize the potential of model-level curriculum learning strategies to be applicable across a wider range of domains and tasks. To date, there are only a few works [5, 6, 7] in the category of pure model-level curriculum learning methods. However, these methods have some drawbacks caused by their domain-dependent or architecture-specific design. To benefit from the full potential of the model-level curriculum learning category, we propose LeRaC (Learning Rate Curriculum), a novel and simple curriculum learning approach which leverages the use of a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. This reduces the propagation of noise caused by the multiplication operations inside the network, a phenomenon that is more prevalent when the weights are randomly initialized. The learning rates increase at various paces during the first training iterations, until they all reach the same value, as illustrated in Figure 1. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that is applicable to any domain and compatible with any neural network, generating higher performance levels regardless of the architecture, without adding any extra training time. To the best of our knowledge, we are the first to employ a different learning rate per layer to achieve the same effect as conventional (data-level) curriculum learning.

Figure 2: Convolving an image of a car with random noise filters progressively increases the level of noise in the features. A theoretical proof of this observation is given in Appendix A.

As hinted above, the underlying hypothesis that justifies the use of LeRaC is that the level of noise grows from one neural layer to the next, especially when the input is multiplied with randomly initialized weights having low signal-to-noise ratios. We briefly illustrate this phenomenon through an example. Suppose an image $x$ is successively convolved with a set of random filters $c_1$, $c_2$, …, $c_n$. Since the filters are uncorrelated, each filter distorts the image in a different way, degrading the information in $x$ with each convolution. The information in $x$ is gradually replaced by noise (see Fig. 2), i.e. the signal-to-noise ratio decreases with each layer. Optimizing the filter $c_n$ to learn a pattern from the image convolved with $c_1$, $c_2$, …, $c_{n-1}$ is suboptimal, because the filter $c_n$ will adapt to the noisy (biased) activation map induced by filters $c_1$, $c_2$, …, $c_{n-1}$. This suggests that earlier filters need to be optimized sooner to reduce the level of noise of the activation map passed to layer $n$. In general, this phenomenon becomes more obvious as the layers get deeper, since the number of multiplication operations grows along the way. Hence, in the initial training stages, it makes sense to use gradually lower learning rates, as the layers get farther away from the input. Our hypothesis is theoretically supported by Theorem 1, and empirically validated in Appendix B.

We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10 [8], CIFAR-100 [8], Tiny ImageNet [9], ImageNet-200 [9], Food-101 [10], UTKFace [11], PASCAL VOC [12]), language (BoolQ [13], QNLI [14], RTE [14]) and audio (ESC-50 [15], CREMA-D [16]) domains, considering various convolutional (ResNet-18 [17], Wide-ResNet-50 [18], DenseNet-121 [19], YOLOv5 [20]), recurrent (LSTM [21]) and transformer (CvT [22], BERT [23], SepTr [24]) architectures. We compare our approach with the conventional training regime and Curriculum by Smoothing (CBS) [7], our closest competitor. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time, since there is no additional cost over the conventional training regime for LeRaC, whereas CBS adds Gaussian smoothing layers. We also compare with several data-level and task-level curriculum learning methods [25, 26, 27, 28, 29], and show that our method scores best in most of the experiments.

In summary, our contribution is threefold:

  • We propose a novel and simple model-level curriculum learning strategy that creates a curriculum by updating the weights of each neural layer with a different learning rate, considering higher learning rates for the low-level feature layers and lower learning rates for the high-level feature layers.

  • We empirically demonstrate the applicability to multiple domains (image, audio and text), the compatibility with several neural network architectures (convolutional neural networks, recurrent neural networks and transformers), and the time efficiency (no extra training time added) of LeRaC through a comprehensive set of experiments.

  • We demonstrate our underlying hypothesis stating that the level of noise increases from one neural layer to another, both theoretically and empirically.

2 Related Work

2.1 Curriculum Learning

Curriculum learning was initially introduced by Bengio et al. [1] as a training strategy that helps machine learning models to generalize better when the training examples are presented in the ascending order of their difficulty. Extensive surveys on curriculum learning methods, including the most recent advancements on the topic, were conducted by Soviany et al. [2] and Wang et al. [4]. In the former survey, Soviany et al. [2] emphasized that curriculum learning is not only applied at the data level, but also with respect to the other components involved in a machine learning approach, namely at the model level, the task level and the objective (loss) level. Regardless of the component on which curriculum learning is applied, the technique has demonstrated its effectiveness on a broad range of machine learning tasks, from computer vision [1, 30, 31, 32, 33, 34, 7, 27, 28, 29] to natural language processing [35, 36, 37, 38, 1] and audio processing [39, 40].

The main challenge for the methods that build the curriculum at the data level is measuring the difficulty of the data samples, which is required to order the samples from easy to hard. Most studies have addressed the problem with human input [41, 42, 43] or metrics based on domain-specific heuristics. For instance, the text length [36, 44, 45, 46] and the word frequency [1, 38] have been employed in natural language processing. In computer vision, the samples containing fewer and larger objects have been considered to be easier in some works [33, 32]. Other solutions employed difficulty estimators [47] or even the confidence level of the predictions made by the neural network [48, 49] to approximate the complexity of the data samples. Other studies [27, 28, 29] used the error of a previously trained model to estimate the difficulty of each sample. Such solutions have shown their utility in specific application domains. Nonetheless, measuring the difficulty remains problematic when implementing standard (data-level) curriculum learning strategies, at least in some application domains. Therefore, several alternatives have emerged over time, handling the drawback and improving the conventional curriculum learning approach. In Kumar et al. [50], the authors introduced self-paced learning to evaluate the learning progress when selecting training samples. The method was successfully employed in multiple settings [50, 51, 52, 53, 54, 55, 56]. Furthermore, some studies combined self-paced learning with the traditional pre-computed difficulty metrics [55, 57]. An additional advancement related to self-paced learning is the approach called self-paced learning with diversity [58]. The authors demonstrated that enforcing a certain level of variety among the selected examples can improve the final performance. Another set of methods that bypass the need for predefined difficulty metrics is known as teacher-student curriculum learning [59, 60]. 
In this setting, a teacher network learns a curriculum to supervise a student neural network.

Closer to our work, a few methods [6, 7, 5] proposed to apply curriculum learning at the model level, by gradually increasing the learning capacity (complexity) of the neural architecture. Such curriculum learning strategies do not need to know the difficulty of the data samples, thus having a great potential to be useful in a broad range of tasks. For example, Karras et al. [6] proposed to gradually add layers to generative adversarial networks during training, while increasing the resolution of the input images at the same time. They are thus able to generate realistic high-resolution images. However, their approach is not applicable to every domain, since there is no notion of resolution for some input data types, e.g. text. Sinha et al. [7] presented a strategy that blurs the activation maps of the convolutional layers using Gaussian kernel layers, reducing the noisy information caused by the network initialization. The blur level is progressively reduced to zero by decreasing the standard deviation of the Gaussian kernels. With this mechanism, they obtain a training procedure that allows the neural network to see simple information at the start of the process and more intricate details towards the end. Curriculum by Smoothing (CBS) [7] was only shown to be useful for convolutional architectures applied in the image domain. Although we found that CBS is applicable to transformers by blurring the tokens, it is not necessarily applicable to any neural architecture, e.g. standard feed-forward neural networks. As an alternative to CBS, Burduja and Ionescu [5] proposed to apply the same smoothing process on the input image instead of the activation maps. The method was applied with success in medical image alignment. However, this approach is not applicable to natural language input, as it is not clear how to apply the blurring operation on the input text.

Different from Burduja and Ionescu [5] and Karras et al. [6], our approach is applicable to various domains, including but not limited to natural language processing, as demonstrated throughout our experiments. To the best of our knowledge, the only competing model-level curriculum method which is applicable to various domains is CBS [7]. Unlike CBS, LeRaC does not introduce new operations, such as smoothing with Gaussian kernels, during training. As such, our approach does not increase the training time with respect to the conventional training regime, as later shown in the experiments included in Section 4.

To classify our approach as a curriculum learning framework, we consider the extreme case when the learning rate is set to zero for later layers, which is equivalent to freezing those layers. This clearly reduces the learning capacity of the model. If layers are unfrozen one by one, the capacity of the model grows. LeRaC can be seen as a soft version of the model-level curriculum method described above. We thus classify LeRaC as a model-level curriculum method. However, our method can also be seen as a curriculum learning strategy that simplifies the optimization [41, 42, 43, 36, 44, 45, 46, 1, 38] in the early training stages by restricting the model updates (in a soft manner) to certain directions (corresponding to the weights of the earlier layers). Due to the imposed soft restrictions (lower learning rates for deeper layers), the optimization is easier at the beginning. As the training progresses, all directions become equally important, and the network is permitted to optimize the loss function in any direction. As the number of directions grows, the optimization task becomes more complex (it is harder to find the optimum). Hence, a relationship to curriculum learning can be discovered by noting that the complexity of the optimization increases over time, just as in curriculum learning.

In summary, we consider that the simplicity of our approach comes with many important advantages: applicability to any domain and task, compatibility with any neural network architecture, and time efficiency (adds no extra training time). We support all these claims through the comprehensive experiments presented in Section 4.

2.2 Learning Rate Schedulers

There are some contributions [61, 62] showing that using adaptive learning rates can lead to improved results. We explain how our method is different below. In [61], the main goal is increasing the learning rate of certain layers as necessary, to escape saddle points. Different from Singh et al. [61], our strategy reduces the learning rates of deeper layers, introducing soft optimization restrictions in the initial training epochs. You et al. [62] proposed to train models with very large batches using a learning rate for each layer, by scaling the learning rate with respect to the norms of the gradients. The goal of You et al. [62] is to specifically learn models with large batch sizes, e.g. formed of 8K samples. Unlike You et al. [62], we propose a more generic approach that can be applied to multiple architectures (convolutional, recurrent, transformer) under unrestricted training settings.

Gotmare et al. [63] point out that learning rate with warm-up and restarts is an effective strategy to improve stability of training neural models using large batches. Different from LeRaC, this approach does not employ a different learning rate for each layer. Moreover, the strategy restarts the learning rate at different moments during the entire training process, while LeRaC is applied only during the first few training epochs.

2.3 Optimizers

We consider Adam [64] and related optimizers as orthogonal approaches that perform the optimization rather than setting the learning rate. Our approach, LeRaC, only aims to guide the optimization during the initial training iterations by reducing the relevance of optimizing deeper network layers. Most of the baseline architectures used in our experiments are already based on Adam or some of its variations, e.g. AdaMax, AdamW [65]. LeRaC is applied in conjunction with these optimizers, showing improved performance over various architectures and application domains. This supports our claim that LeRaC is an orthogonal contribution to the family of Adam optimizers.

3 Method

Deep neural networks are commonly trained on a set of labeled data samples denoted as:

\begin{equation}
S = \{(x_i, y_i) \mid x_i \in X, y_i \in Y, \forall i \in \{1, 2, \ldots, m\}\}, \tag{1}
\end{equation}

where $m$ is the number of examples, $x_i$ is a data sample and $y_i$ is the associated label. The training process of a neural network $f$ with parameters $\theta$ consists of minimizing some objective (loss) function $\mathcal{L}$ that quantifies the differences between the ground-truth labels and the predictions of the model $f$:

\begin{equation}
\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(y_i, f(x_i, \theta)\right). \tag{2}
\end{equation}

The optimization is generally performed by some variant of Stochastic Gradient Descent (SGD), where the gradients are back-propagated from the neural layers closer to the output towards the neural layers closer to the input through the chain rule. Let $f_1$, $f_2$, …, $f_n$ and $\theta_1$, $\theta_2$, …, $\theta_n$ denote the neural layers and the corresponding weights of the model $f$, such that the weights $\theta_j$ belong to the layer $f_j$, $\forall j \in \{1, 2, \ldots, n\}$. The output of the neural network for some training data sample $x_i \in X$ is formally computed as follows:

\begin{equation}
\hat{y}_i = f(x_i, \theta) = f_n\left(\ldots f_2\left(f_1\left(x_i, \theta_1\right), \theta_2\right) \ldots, \theta_n\right). \tag{3}
\end{equation}

To optimize the model via SGD, the weights are updated as follows:

\begin{equation}
\theta_j^{(t+1)} = \theta_j^{(t)} - \eta^{(t)} \cdot \frac{\partial \mathcal{L}}{\partial \theta_j^{(t)}}, \forall j \in \{1, 2, \ldots, n\}, \tag{4}
\end{equation}

where $t$ is the index of the current training iteration, $\eta^{(t)} > 0$ is the learning rate at iteration $t$, and the gradient of $\mathcal{L}$ with respect to $\theta_j^{(t)}$ is computed via the chain rule. Before starting the training process, the weights $\theta_j^{(0)}$ are commonly initialized with random values, e.g. using Glorot initialization [66].

Sinha et al. [7] suggested that the random initialization of the weights produces a large amount of noise in the information propagated through the neural model during the early training iterations, which can negatively impact the learning process. Due to the feed-forward processing that involves several multiplication operations, we argue that the noise level grows with each neural layer, from $f_j$ to $f_{j+1}$. This statement is confirmed by the following theorem:

Theorem 1.

Let $s_1 = u_1 + z_1$ and $s_2 = u_2 + z_2$ be two signals, where $u_1$ and $u_2$ are the clean components, and $z_1$ and $z_2$ are the noise components. The signal-to-noise ratio of the product between the two signals is lower than the signal-to-noise ratios of the two signals, i.e.:

\begin{equation}
\operatorname{SNR}(s_1 \cdot s_2) \leq \operatorname{SNR}(s_i), \forall i \in \{1, 2\}. \tag{5}
\end{equation}

Proof.

The proof is given in Appendix A. ∎
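The inequality in Eq. (5) can be checked numerically on a toy case. The sketch below is only an illustration, not the proof from Appendix A. It makes three assumptions that are not stated in this section: the signal-to-noise ratio is taken as the ratio between the power of the clean component and the power of the noise component, the clean components $u_1$ and $u_2$ are constants, and the noise components $z_1$ and $z_2$ are independent zero-mean Gaussians. The function names `power` and `empirical_snrs` are hypothetical.

```python
import random


def power(values):
    """Mean squared value of a list of samples."""
    return sum(v * v for v in values) / len(values)


def empirical_snrs(u1, u2, sigma1, sigma2, n=20000, seed=0):
    """Estimate SNR(s1), SNR(s2) and SNR(s1 * s2) by sampling.

    Assumption (not from the paper): SNR = clean power / noise power,
    with constant clean parts and independent Gaussian noise.
    """
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, sigma1) for _ in range(n)]
    z2 = [rng.gauss(0.0, sigma2) for _ in range(n)]
    # In the product s1 * s2, the clean part is u1 * u2 and everything
    # else (u1*z2 + u2*z1 + z1*z2) is treated as noise.
    prod_noise = [(u1 + a) * (u2 + b) - u1 * u2 for a, b in zip(z1, z2)]
    snr1 = u1 ** 2 / power(z1)
    snr2 = u2 ** 2 / power(z2)
    snr_prod = (u1 * u2) ** 2 / power(prod_noise)
    return snr1, snr2, snr_prod


if __name__ == "__main__":
    snr1, snr2, snr_prod = empirical_snrs(2.0, 3.0, 0.5, 0.5)
    print(f"SNR(s1)={snr1:.2f}  SNR(s2)={snr2:.2f}  SNR(s1*s2)={snr_prod:.2f}")
```

Under these assumptions, the noise of the product collects contributions from both inputs, so its estimated signal-to-noise ratio falls below that of either factor, which matches the direction of Eq. (5).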

The same issue can occur if the weights are pre-trained on a distinct task, where the misalignment of the weights with a new task is likely higher for the high-level (specialized) feature layers. To alleviate this problem, we propose to introduce a curriculum learning strategy that assigns a different learning rate $\eta_j$ to each layer $f_j$, as follows:

\begin{equation}
\theta_j^{(t+1)} = \theta_j^{(t)} - \eta_j^{(t)} \cdot \frac{\partial \mathcal{L}}{\partial \theta_j^{(t)}}, \forall j \in \{1, 2, \ldots, n\}, \tag{6}
\end{equation}

such that:

\begin{equation}
\eta^{(0)} \geq \eta_1^{(0)} \geq \eta_2^{(0)} \geq \ldots \geq \eta_n^{(0)}, \tag{7}
\end{equation}
\begin{equation}
\eta^{(k)} = \eta_1^{(k)} = \eta_2^{(k)} = \ldots = \eta_n^{(k)}, \tag{8}
\end{equation}

where $\eta_j^{(0)}$ are the initial learning rates and $\eta_j^{(k)}$ are the updated learning rates at iteration $k$. The condition formulated in Eq. (7) indicates that the initial learning rate $\eta_j^{(0)}$ of a neural layer $f_j$ gets lower as the level of the respective neural layer becomes higher (farther away from the input). With each training iteration $t \leq k$, the learning rates are gradually increased, until they become equal, according to Eq. (8). Thus, our curriculum learning strategy is only applied during the early training iterations, where the noise caused by the misfit (randomly initialized or pre-trained) weights is most prevalent. Hence, $k$ is a hyperparameter of LeRaC that is usually adjusted such that $k \ll T$, where $T$ is the total number of training iterations.
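The per-layer update of Eq. (6) and the ordering constraint of Eq. (7) can be sketched in a few lines of plain Python. This is a minimal illustration and not the released implementation. In particular, the geometric rule used to pick the initial rates (each layer receives `decay` times the rate of the previous one) is an assumption made here for concreteness, since this section only requires the rates to be non-increasing with depth. The names `initial_layer_rates` and `layerwise_sgd_step` are hypothetical.

```python
def initial_layer_rates(eta0, n_layers, decay=0.1):
    """Initial rates that satisfy Eq. (7): non-increasing with depth.

    Assumption (not from the paper): layer j gets eta0 * decay**(j - 1).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must lie in (0, 1]")
    return [eta0 * decay ** j for j in range(n_layers)]


def layerwise_sgd_step(weights, grads, rates):
    """One SGD step following Eq. (6): layer j is updated with its own rate.

    weights and grads are lists of per-layer lists of floats;
    rates holds one learning rate per layer.
    """
    return [
        [w - eta_j * g for w, g in zip(layer_w, layer_g)]
        for layer_w, layer_g, eta_j in zip(weights, grads, rates)
    ]


if __name__ == "__main__":
    rates = initial_layer_rates(1e-3, n_layers=3)
    weights = [[1.0, 2.0], [3.0], [4.0]]
    grads = [[1.0, 1.0], [1.0], [1.0]]
    print(rates)
    print(layerwise_sgd_step(weights, grads, rates))
```

Setting a layer's rate to zero leaves its weights unchanged, which corresponds to the frozen-layer extreme case discussed in Section 2.1.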

At this point, various schedulers can be used to increase each learning rate $\eta_j$ from iteration $0$ to iteration $k$. We empirically observed that an exponential scheduler is a better option than linear or logarithmic schedulers. We thus propose to employ the exponential scheduler, which is based on the following rule:

\begin{equation}
\eta_j^{(l)} = \eta_j^{(0)} \cdot c^{\frac{l}{k} \cdot \left(\log_c \eta_j^{(k)} - \log_c \eta_j^{(0)}\right)}, \forall l \in \{0, 1, \ldots, k\}. \tag{9}
\end{equation}

We set $c = 10$ in Eq. (9) across all our experiments. This is because learning rates are usually expressed as a power of $c = 10$, e.g. $10^{-4}$. If we start with a learning rate of $\eta_j^{(0)} = 10^{-8}$ for some layer $j$ and we want to increase it to $\eta_j^{(k)} = 10^{-4}$ during the first 5 epochs ($k = 4$), the intermediate learning rates generated via Eq. (9) are $\eta_j^{(1)} = 10^{-7}$, $\eta_j^{(2)} = 10^{-6}$, $\eta_j^{(3)} = 10^{-5}$ and $\eta_j^{(4)} = 10^{-4}$. We thus believe it is more intuitive to understand what happens when setting $c = 10$ in Eq. (9), as opposed to using some tuned value for $c$. For this reason, we refrain from tuning $c$ and fix it to $c = 10$.
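
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of Eq. (9) for a single layer, not the released implementation; the function name `lerac_lr` and its argument names are hypothetical.

```python
import math


def lerac_lr(eta_0, eta_k, l, k, c=10.0):
    """Learning rate at step l in {0, ..., k}, written as in Eq. (9)."""
    if k < 1 or not 0 <= l <= k:
        raise ValueError("expected k >= 1 and 0 <= l <= k")
    # Exponent of Eq. (9): (l / k) * (log_c eta_k - log_c eta_0).
    exponent = (l / k) * (math.log(eta_k, c) - math.log(eta_0, c))
    return eta_0 * c ** exponent


# Example from the text: from 1e-8 to 1e-4 with k = 4.
schedule = [lerac_lr(1e-8, 1e-4, l, k=4) for l in range(5)]
for l, lr in enumerate(schedule):
    print(l, lr)
```

Up to floating-point rounding, the printed values follow the sequence $10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}$ listed above.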

In practice, we obtain optimal results by initializing the lowest learning rate $\eta_n^{(0)}$ with a value that is around five or six orders of magnitude lower than $\eta^{(0)}$, while the highest learning rate $\eta_1^{(0)}$ is always equal to $\eta^{(0)}$. Apart from such general practical notes, the exact LeRaC configuration for each neural architecture is established by tuning its two hyperparameters ($k$, $\eta_n^{(0)}$) on the available validation sets.

We underline that the output feature maps of a layer $j$ are affected $(i)$ by the misfit weights $\theta_j^{(0)}$ of the respective layer, and $(ii)$ by the input feature maps, which are in turn affected by the misfit weights of the previous layers $\theta_1^{(0)}, ..., \theta_{j-1}^{(0)}$. Hence, the noise affecting the feature maps increases with each layer processing the feature maps, being multiplied with the weights from each layer along the way. Our curriculum learning strategy imposes the training of the earlier layers at a faster pace, transforming the noisy weights into discriminative patterns. As noise from the earlier layer weights is eliminated, we train the later layers at faster and faster paces, until all learning rates become equal at epoch $k$.

From a technical point of view, we note that our approach can also be regarded as a way to guide the optimization, which we see as an alternative to loss function smoothing. The link between curriculum learning and loss smoothing is discussed by Soviany et al. [2], who suggest that curriculum learning strategies induce a smoothing of the loss function, where the smoothing is higher during the early training iterations (simplifying the optimization) and lower to non-existent during the late training iterations (restoring the complexity of the loss function). LeRaC is aimed at producing a similar effect, but in a softer manner by dampening the importance of optimizing the weights of high-level layers in the early training iterations. Additionally, we empirically observe (see results in Appendix B) that LeRaC tends to balance the training pace of low-level and high-level features, while the conventional regime seems to update the high-level layers at a faster pace. This could provide an additional intuitive explanation of why our method works better.

4 Experiments

4.1 Data Sets

We perform experiments on 12 benchmarks: CIFAR-10 [8], CIFAR-100 [8], Tiny ImageNet [9], ImageNet-200 [9], Food-101 [10], UTKFace [11], PASCAL VOC 2007+2012 [12], BoolQ [13], QNLI [14], RTE [14], CREMA-D [16], and ESC-50 [15]. We adopt the official data splits for the 12 benchmarks considered in our experiments. When a validation set is not available, we keep $10\%$ of the training data for validation.

CIFAR-10. CIFAR-10 [8] is a popular data set for object recognition in images. It consists of 60,000 color images with a resolution of $32 \times 32$ pixels. An image depicts one of 10 object classes, each class having 6,000 examples. We use the official data split with a training set of 50,000 images and a test set of 10,000 images.

CIFAR-100. The CIFAR-100 [8] data set is similar to CIFAR-10, except that it has 100 classes with 600 images per class. There are 50,000 training images and 10,000 test images.

Tiny ImageNet. Tiny ImageNet is a subset of ImageNet-1K [9] which provides 100,000 training images, 25,000 validation images and 25,000 test images representing objects from 200 different classes. The size of each image is $64 \times 64$ pixels.

ImageNet. ImageNet-1K [9] is the most popular benchmark in computer vision, comprising about 1.2 million images from 1,000 object categories. We set the resolution of all images to $224 \times 224$ pixels.

Food-101. Food-101 [10] is a data set that contains images from 101 food categories. For each category, there are 750 training images and 250 test images. Thus, the total number of images is 101,000. We resize all images to $224 \times 224$ pixels. The test set is manually cleaned, while the training set is purposely left uncurated, being affected by labeling noise. This makes Food-101 suitable for testing the robustness of models to labeling noise.

UTKFace. The UTKFace data set [11] contains face images representing various gender, age and ethnic groups. It consists of 23,709 images of $200 \times 200$ pixels. The data set is divided into 16,597 training images, 3,556 validation images, and 3,556 test images. Each image is annotated with the corresponding age and gender label, which makes UTKFace suitable for evaluating models in a multi-task learning setup.

PASCAL VOC 2007+2012. One of the most popular benchmarks for object detection is PASCAL VOC [12]. The data set consists of 21,503 images which are annotated with bounding boxes for 20 object categories. The official split has 16,551 training images and 4,952 test images.

BoolQ. BoolQ [13] is a question answering data set for yes/no questions containing 15,942 examples. The questions are naturally occurring, being generated in unprompted and unconstrained settings. Each example is a triplet of the form: {question, passage, answer}. We use the data split provided in the SuperGLUE benchmark [67], containing 9,427 examples for training, 3,270 for validation and 3,245 for testing.

Table 1: Optimal hyperparameter settings for the various neural architectures used in our experiments. Notice that $\eta_1^{(0)}$ is always equal to $\eta^{(0)}$, being set without tuning. This means that LeRaC has only two tunable hyperparameters, $k$ and $\eta_n^{(0)}$, while CBS [7] has three.
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
\hline
Model & Optimizer & Mini-batch & \#Epochs & $\eta^{(0)}$ & \multicolumn{3}{c|}{CBS} & \multicolumn{2}{c|}{LeRaC} \\
\cline{6-10}
 & & & & & $\sigma$ & $d$ & $u$ & $k$ & $\eta_1^{(0)}$ - $\eta_n^{(0)}$ \\
\hline
ResNet-18 & SGD & 64 & 100-200 & $10^{-1}$ & 1 & 0.9 & 2-5 & 5-7 & $10^{-1}$ - $10^{-8}$ \\
Wide-ResNet-50 & SGD & 64 & 100-200 & $10^{-1}$ & 1 & 0.9 & 2-5 & 5-7 & $10^{-1}$ - $10^{-8}$ \\
CvT-13 & AdaMax & 64-128 & 150-200 & $2 \cdot 10^{-3}$ & 1 & 0.9 & 2-5 & 2-5 & $2 \cdot 10^{-3}$ - $2 \cdot 10^{-8}$ \\
CvT-13$_{\mbox{\scriptsize{pre-trained}}}$ & AdaMax & 64-128 & 25 & $5 \cdot 10^{-4}$ & 1 & 0.9 & 2-5 & 3-6 & $5 \cdot 10^{-4}$ - $5 \cdot 10^{-10}$ \\
YOLOv5$_{\mbox{\scriptsize{pre-trained}}}$ & SGD & 16 & 100 & $10^{-2}$ & 1 & 0.9 & 2 & 3 & $10^{-2}$ - $10^{-5}$ \\
BERT$_{\mbox{\scriptsize{large-uncased}}}$ & AdaMax & 10 & 7-25 & $5 \cdot 10^{-5}$ & 1 & 0.9 & 1 & 3 & $5 \cdot 10^{-5}$ - $5 \cdot 10^{-8}$ \\
LSTM & AdamW & 256-512 & 25-70 & $10^{-3}$ & 1 & 0.9 & 2 & 3-4 & $10^{-3}$ - $10^{-7}$ \\
SepTr & Adam & 2 & 50 & $10^{-4}$ & 0.8 & 0.9 & 1-3 & 2-5 & $10^{-4}$ - $10^{-8}$ \\
DenseNet-121 & Adam & 64 & 50 & $10^{-4}$ & 0.8 & 0.9 & 1-3 & 2-5 & $10^{-4}$ - $5 \cdot 10^{-8}$ \\
\hline
\end{tabular}

QNLI. The QNLI (Question-answering Natural Language Inference) data set [14] is a natural language inference benchmark automatically derived from SQuAD [68]. The data set contains {question, sentence} pairs and the task is to determine whether the context sentence contains the answer to the question. The data set is constructed on top of Wikipedia documents, each document being accompanied, on average, by 4 questions. We consider the data split provided in the GLUE benchmark [14], which comprises 104,743 examples for training, 5,463 for validation and 5,463 for testing.

RTE. Recognizing Textual Entailment (RTE) [14] is a natural language inference data set containing pairs of sentences with the target label indicating if the meaning of one sentence can be inferred from the other. The training subset includes 2,490 samples, the validation set 277 samples, and the test set 3,000 samples.

CREMA-D. The CREMA-D multi-modal database [16] is formed of 7,442 videos of 91 actors (48 male and 43 female) of different ethnic groups. The actors perform various emotions while uttering 12 particular sentences that evoke one of the 6 emotion categories: anger, disgust, fear, happy, neutral, and sad. Following previous work [56], we conduct experiments only on the audio modality, dividing the set of audio samples into $70\%$ for training, $15\%$ for validation and $15\%$ for testing.

ESC-50. The ESC-50 [15] data set is a collection of 2,000 samples of 5 seconds each, comprising 50 classes of various common sound events. Samples are recorded at a 44.1 kHz sampling frequency, with a single channel. In our evaluation, we employ the 5-fold cross-validation procedure, as described in related works [15, 24].

4.2 Experimental Setup

Architectures. To demonstrate the compatibility of LeRaC with multiple neural architectures, we select several convolutional, recurrent and transformer models. As representative convolutional neural networks (CNNs), we opt for ResNet-18 [17], Wide-ResNet-50 [18] and DenseNet-121 [19]. For the object detection experiments on PASCAL VOC, we use the YOLOv5 [20] model based on the CSPDarknet53 [69] backbone, which is pre-trained on the MS COCO data set [70]. As representative transformers, we consider CvT-13 [22], BERT$_{\mbox{\scriptsize{uncased-large}}}$ [23] and SepTr [24]. For CvT, we consider both pre-trained and randomly initialized versions. We use an uncased large pre-trained version of BERT. Following Ristea et al. [24], we train SepTr from scratch. In addition, we employ a long short-term memory (LSTM) network [21] to represent recurrent neural networks (RNNs). The recurrent neural network contains two LSTM layers, each having a hidden dimension of 256 components. These layers are preceded by one embedding layer with the embedding size set to 128 elements. The output of the last recurrent layer is passed to a classifier composed of two fully connected layers. The LSTM is activated by rectified linear units (ReLU). We apply the aforementioned models on distinct input data types, considering the intended application domain of each model. Hence, ResNet-18, Wide-ResNet-50, CvT and YOLOv5 are applied on images, BERT and LSTM are applied on text, and SepTr and DenseNet-121 are applied on audio.

Multi-task architectures. To determine the impact of LeRaC on multi-task learning models, we conduct experiments on the UTKFace data set, where the face images are annotated with gender and age labels. We consider two models for the multi-task learning setup, namely ResNet-18 and CvT-13. Each model is jointly trained on the two tasks (gender prediction and age estimation). To each model, we attach two heads, one for gender classification and one for age estimation, respectively. The classification head is trained using the cross-entropy loss with respect to the gender label, while the regression head uses the mean squared error with respect to the age label. The models are trained using a joint objective defined as follows:

$$\mathcal{L}_{\mbox{\tiny{MTL}}} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_{\mbox{\tiny{CE}}}\left(y^g_i, \hat{y}^g_i\right) + \lambda \cdot \mathcal{L}_{\mbox{\tiny{MSE}}}\left(y^a_i, \hat{y}^a_i\right), \tag{10}$$

where $y^g_i$ and $y^a_i$ are the ground-truth gender and age labels, $\hat{y}^g_i$ and $\hat{y}^a_i$ are the predicted gender and age labels, $\lambda \in \mathbb{R}^{+}$ is a weight factor, and $\mathcal{L}_{\mbox{\tiny{CE}}}$ is the cross-entropy loss for the gender prediction task, defined as:

$$\mathcal{L}_{\mbox{\tiny{CE}}}\left(y^g_i, \hat{y}^g_i\right) = -\left(y^g_i \log(\hat{y}^g_i) + (1 - y^g_i) \log(1 - \hat{y}^g_i)\right), \tag{11}$$

and $\mathcal{L}_{\mbox{\tiny{MSE}}}$ is the mean squared error for the age estimation task, defined as:

$$\mathcal{L}_{\mbox{\tiny{MSE}}}\left(y^a_i, \hat{y}^a_i\right) = (y^a_i - \hat{y}^a_i)^2. \tag{12}$$

The factor $\lambda$ ensures the two tasks are equally important by weighting $\mathcal{L}_{\mbox{\tiny{MSE}}}$ to have approximately the same range of values as $\mathcal{L}_{\mbox{\tiny{CE}}}$. As such, we set $\lambda = 10$.
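
The joint objective in Eqs. (10)-(12) can be sketched in plain Python as follows. This is an illustrative sketch under two assumptions that are not stated in the text: the sum in Eq. (10) is taken over both terms, since both are indexed by $i$, and predicted probabilities are clamped by a small `eps` to avoid $\log(0)$. All function names are hypothetical.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy of Eq. (11) for one example."""
    # Clamp the probability to avoid log(0); eps is an assumption of this sketch.
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


def squared_error(y_true, y_pred):
    """Squared error of Eq. (12) for one example."""
    return (y_true - y_pred) ** 2


def multi_task_loss(gender_true, gender_prob, age_true, age_pred, lam=10.0):
    """Joint objective of Eq. (10), averaged over a mini-batch of m examples."""
    m = len(gender_true)
    if m == 0:
        raise ValueError("empty mini-batch")
    total = 0.0
    for g, g_hat, a, a_hat in zip(gender_true, gender_prob, age_true, age_pred):
        total += cross_entropy(g, g_hat) + lam * squared_error(a, a_hat)
    return total / m


# Toy mini-batch with m = 2; the age values use a made-up normalized scale.
print(multi_task_loss([1, 0], [0.5, 0.5], [0.3, 0.6], [0.3, 0.5]))
```

The default `lam=10.0` mirrors the value of $\lambda$ given above; the scale of the age targets is left to the caller.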

Baselines. We compare LeRaC with two baselines: the conventional training regime (which uses early stopping, reduces the learning rate on plateau, and employs linear warm-up and cosine annealing when required) and the state-of-the-art Curriculum by Smoothing [7]. For CBS, we use the official code released by Sinha et al. [7] at https://github.com/pairlab/CBS, to ensure the reproducibility of their method in our experimental settings, which include a more diverse selection of input data types and neural architectures. In addition, we compare with several data-level and task-level curriculum learning methods [25, 26, 27, 28, 29] on CIFAR-10 and CIFAR-100.

To apply CBS to non-convolutional architectures, we use 1D convolutional layers based on Gaussian filters with a receptive field of 3. For transformers, we integrate a 1D Gaussian layer before each transformer block, so the smoothing is applied on the sequence of tokens. Similarly, for recurrent neural networks, before each LSTM layer, we process the sequence of tokens with 1D convolutional layers based on Gaussian filters. For both transformers and RNNs, we anneal, during training, the standard deviation of the Gaussian filters to enhance the information propagated through the network. This approach mirrors the implementation of CBS for convolutional neural networks.
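
To make the 1D smoothing concrete, the sketch below builds a normalized Gaussian filter with a receptive field of 3, applies it to a sequence of scalar values, and anneals the standard deviation. It is not the official CBS code: the edge handling (replicating border values), the use of scalars instead of token embeddings, and the step-wise decay rule $\sigma \cdot d^{\lfloor \text{epoch} / u \rfloor}$ are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def gaussian_kernel_1d(sigma, size=3):
    """Normalized 1D Gaussian filter with the given receptive field."""
    half = size // 2
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_sequence(values, kernel):
    """Filter a sequence of scalars, replicating the border values (assumption)."""
    half = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for offset, w in enumerate(kernel):
            idx = min(max(i + offset - half, 0), n - 1)
            acc += w * values[idx]
        out.append(acc)
    return out


def annealed_sigma(sigma_0, decay, epoch, step):
    """Multiply sigma by `decay` once every `step` epochs (assumed rule)."""
    return sigma_0 * decay ** (epoch // step)


tokens = [0.0, 1.0, 0.0, 0.0, 2.0]
for epoch in (0, 5, 10):
    sigma = annealed_sigma(1.0, 0.9, epoch, step=5)
    print(epoch, sigma, smooth_sequence(tokens, gaussian_kernel_1d(sigma)))
```

As $\sigma$ shrinks, the filter weights concentrate on the central position, so less smoothing is applied and more of the original signal is propagated.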

Hyperparameter tuning. We tune all hyperparameters on the validation set of each benchmark. In Table 1, we present the optimal hyperparameters chosen for each architecture. In addition to the standard parameters of the training process, we report the parameters that are specific for the CBS [7] and LeRaC strategies. In the case of CBS, $\sigma$ denotes the standard deviation of the Gaussian kernel, $d$ is the decay rate for $\sigma$, and $u$ is the decay step. Regarding the parameters of LeRaC, $k$ represents the number of iterations used in Eq. (9), and $\eta_1^{(0)}$ and $\eta_n^{(0)}$ are the initial learning rates for the first and last layers of the architecture, respectively. We set $\eta_1^{(0)} = \eta^{(0)}$ and $c = 10$ in all experiments, without tuning. In addition, the intermediate learning rates $\eta_j^{(0)}$, $\forall j \in \{2, 3, ..., n-1\}$, are automatically set to be equally distanced between $\eta_1^{(0)}$ and $\eta_n^{(0)}$. Moreover, $\eta_j^{(k)} = \eta^{(0)}$, i.e. the initial learning rates of LeRaC converge to the original learning rate set for the conventional training regime. All models are trained with early stopping and the learning rate is reduced by a factor of $10$ when the loss reaches a plateau. We use linear warm-up with cosine annealing, whenever it is found useful for models based on conventional or CBS training. The learning rate warm-up is switched off for LeRaC to avoid unwanted interactions with our training strategy. Except for the pre-trained models, the weights of all models are initialized with Glorot initialization [66].
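
The reduce-on-plateau rule mentioned above can be sketched as a small stateful helper. Only the reduction factor of 10 comes from the text; the `patience` and `min_delta` parameters, their default values, and the class name `PlateauDecay` are hypothetical choices made for this illustration.

```python
class PlateauDecay:
    """Divide the learning rate by `factor` when the loss stops improving."""

    def __init__(self, lr, factor=10.0, patience=3, min_delta=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # hypothetical: epochs to wait before reducing
        self.min_delta = min_delta    # hypothetical: minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record the loss of one epoch and return the (possibly reduced) rate."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauDecay(lr=0.1, patience=2)
for loss in [1.0, 0.9, 0.9, 0.9]:
    print(loss, scheduler.step(loss))
```

In this toy run, the loss fails to improve for two consecutive epochs after reaching 0.9, which triggers a single reduction of the learning rate.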

We underline that some parameters are the same across all data sets, while others need to be established per data set. For example, the parameter $u$ of CBS and the parameter $k$ of LeRaC are validated on each data set. As such, for the ResNet-18 model, the parameter $u$ of CBS takes one value on each data set (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet, Food-101, UTKFace), and these values range between 2 and 5 across the data sets. Similarly, the parameter $k$ of LeRaC takes one value per data set, with the range of values being 5-7. In Table 1, we aggregate the optimal parameters of each model for all data sets. This explains why some hyperparameters are specified in terms of ranges.

Setting the initial learning rates. We should emphasize that the different learning rates $\eta_j^{(0)}$, $\forall j \in \{1, 2, ..., n\}$, are neither optimized nor tuned during training. Instead, we set the initial learning rates $\eta_j^{(0)}$ through validation, such that $\eta_n^{(0)}$ is around five or six orders of magnitude lower than $\eta^{(0)}$, and $\eta_1^{(0)} = \eta^{(0)}$. After initialization, we apply our exponential scheduler, until all learning rates become equal at iteration $k$. In addition, we would like to underline that the difference $\delta$ between the initial learning rates of consecutive layers is automatically set based on the range given by $\eta_1^{(0)}$ and $\eta_n^{(0)}$. For example, let us consider a network with 5 layers. If we choose $\eta_1^{(0)} = 10^{-1}$ and $\eta_5^{(0)} = 10^{-2}$, then the intermediate initial learning rates are automatically set to $\eta_2^{(0)} = 10^{-1.25}$, $\eta_3^{(0)} = 10^{-1.5}$, $\eta_4^{(0)} = 10^{-1.75}$, i.e. $\delta$ is used in the exponent and is equal to $-0.25$ in this case. To obtain the intermediate learning rates according to this example, we actually apply the exponential scheduler defined in Eq. (9). This reduces the number of tunable hyperparameters from $n$ (the number of layers) to two, namely $\eta_1^{(0)}$ and $\eta_n^{(0)}$. We go even further, setting $\eta_1^{(0)} = \eta^{(0)}$ without tuning, in all our experiments. Hence, tuning is only performed for the initial learning rate of the last layer, namely $\eta_n^{(0)}$. Although tuning all $\eta_j^{(0)}$, $\forall j \in \{1, 2, ..., n\}$, might lead to better results, we refrain from meticulously tuning every possible value to avoid overfitting in hyperparameter space.
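
The 5-layer example can be checked numerically. The sketch below spaces the initial rates equally in the exponent and then raises every layer towards the common rate $\eta^{(0)}$ over $k$ steps, using the form $\eta_j^{(0)} \cdot (\eta_j^{(k)} / \eta_j^{(0)})^{l/k}$, which is algebraically equal to Eq. (9). It is an illustration rather than the released implementation, and the function names are hypothetical.

```python
import math


def initial_layer_rates(eta_first, eta_last, n_layers):
    """Initial rates, equally spaced in the exponent from layer 1 to layer n."""
    if n_layers < 2:
        raise ValueError("need at least two layers")
    log_first = math.log10(eta_first)
    # delta is the step in the exponent between consecutive layers.
    delta = (math.log10(eta_last) - log_first) / (n_layers - 1)
    return [10.0 ** (log_first + j * delta) for j in range(n_layers)]


def lerac_schedule(eta, eta_last, n_layers, k):
    """Row l holds the per-layer rates at step l; row k equals eta everywhere."""
    if k < 1:
        raise ValueError("k must be at least 1")
    starts = initial_layer_rates(eta, eta_last, n_layers)
    return [[s * (eta / s) ** (l / k) for s in starts] for l in range(k + 1)]


# Example from the text: 5 layers, from 10^-1 (first) to 10^-2 (last).
for j, lr in enumerate(initial_layer_rates(1e-1, 1e-2, 5), start=1):
    print(j, math.log10(lr))

# Full schedule for a 4-layer network with k = 5 steps.
for l, row in enumerate(lerac_schedule(1e-1, 1e-8, 4, 5)):
    print(l, row)
```

Up to rounding, the first loop prints the exponents $-1$, $-1.25$, $-1.5$, $-1.75$ and $-2$, i.e. $\delta = -0.25$. In the second loop, earlier layers have higher rates in every row except the last, where all layers reach $\eta^{(0)}$.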

Number of hyperparameters. We further emphasize that LeRaC adds only two additional tunable hyperparameters with respect to the conventional training regime. These are the lowest learning rate $\eta_n^{(0)}$ and the number of iterations $k$ to employ LeRaC. We reduce the number of hyperparameters that require tuning by using a fixed rule to adjust the intermediate learning rates, e.g. by employing an exponential scheduler, or by fixing some hyperparameters, e.g. $c = 10$. In contrast, CBS [7] has three additional hyperparameters, thus having one extra hyperparameter compared with LeRaC. Furthermore, we note that data-level curriculum methods also introduce additional hyperparameters. Even a simple method that splits the examples into easy-to-hard batches that are gradually added to the training set requires at least two hyperparameters: the number of batches, and the number of iterations before introducing a new training batch. We thus believe that, in terms of the number of additional hyperparameters, LeRaC is comparable to CBS and other curriculum learning strategies. We emphasize that the same happens if we look at new optimizers, e.g. Adam [64] adds three additional hyperparameters compared with SGD.

Table 2: Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for various neural models based on different training regimes: learning rate decay, linear warm-up, cosine annealing, constant learning rate, and LeRaC. The accuracy of the best training regime in each experiment is highlighted in bold.
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet
ResNet-18 | learning rate decay | $89.20 \pm 0.43$ | $71.70 \pm 0.06$ | $57.41 \pm 0.05$
ResNet-18 | constant learning rate | $72.30 \pm 1.08$ | $62.06 \pm 0.41$ | $49.42 \pm 0.37$
ResNet-18 | LeRaC (ours) | $\mathbf{89.56} \pm 0.16$ | $\mathbf{72.72} \pm 0.12$ | $\mathbf{57.86} \pm 0.20$
Wide-ResNet-50 | learning rate decay | $91.22 \pm 0.24$ | $68.14 \pm 0.16$ | $55.97 \pm 0.30$
Wide-ResNet-50 | constant learning rate | $86.62 \pm 0.27$ | $61.67 \pm 0.12$ | $41.87 \pm 0.61$
Wide-ResNet-50 | LeRaC (ours) | $\mathbf{91.58} \pm 0.16$ | $\mathbf{69.38} \pm 0.26$ | $\mathbf{56.48} \pm 0.60$
CvT-13 | linear warm-up + cosine annealing | $71.84 \pm 0.37$ | $41.87 \pm 0.16$ | $33.38 \pm 0.27$
CvT-13 | constant learning rate | $71.75 \pm 0.07$ | $41.62 \pm 0.20$ | $30.68 \pm 0.10$
CvT-13 | LeRaC (ours) | $\mathbf{72.90} \pm 0.28$ | $\mathbf{43.46} \pm 0.18$ | $\mathbf{33.95} \pm 0.28$
CvT-13$_{\text{pre-trained}}$ | cosine annealing | $93.06 \pm 0.06$ | $77.76 \pm 0.38$ | $70.91 \pm 0.24$
CvT-13$_{\text{pre-trained}}$ | constant learning rate | $93.56 \pm 0.05$ | $77.80 \pm 0.16$ | $70.71 \pm 0.35$
CvT-13$_{\text{pre-trained}}$ | LeRaC (ours) | $\mathbf{94.15} \pm 0.03$ | $\mathbf{78.93} \pm 0.05$ | $\mathbf{71.34} \pm 0.08$

Avoiding too large learning rates. In principle, a larger learning rate implies a larger update. However, if the learning rate is too high, the model can actually diverge. This is because the gradient describes the loss function in the vicinity of the current location, providing no guarantee for the value of the loss outside this vicinity. Our implementation takes this aspect into account. Instead of increasing the learning rate for earlier layers, we reduce the learning rate for the deeper layers to avoid divergence. More precisely, we set the learning rate for the first layer $\eta_{1}^{(0)}$ to the original learning rate $\eta^{(0)}$, and the other initial learning rates are gradually reduced with each layer. During training, the lower learning rates are gradually increased, until epoch $k$. Hence, LeRaC actually slows down the learning for deeper layers, until the earlier layers have learned representative features.
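To make the mechanism above concrete, the following is a minimal sketch and not the released implementation. It assumes that the initial per-layer learning rates are spaced evenly in log-space between $\eta_{1}^{(0)}=\eta^{(0)}$ and $\eta_{n}^{(0)}$, and that each rate grows geometrically until it reaches $\eta^{(0)}$ after $k$ steps. The function names, the spacing rule and all numeric values are hypothetical choices made for illustration.

```python
def initial_layer_rates(eta_first, eta_last, n_layers):
    """Initial rates for layers 1..n, evenly spaced in log-space (assumed rule)."""
    if n_layers == 1:
        return [eta_first]
    ratio = (eta_last / eta_first) ** (1.0 / (n_layers - 1))
    return [eta_first * ratio ** j for j in range(n_layers)]


def rate_at_step(eta_init, eta_target, step, k):
    """Geometric growth from eta_init (step 0) to eta_target (step k - 1 onward)."""
    if step >= k - 1:
        return eta_target
    return eta_init * (eta_target / eta_init) ** (step / (k - 1))


# Hypothetical values, chosen only for illustration.
eta_0 = 1e-1        # conventional learning rate, kept unchanged for the first layer
eta_n = 1e-8        # initial learning rate of the last layer (the tuned value)
n_layers, k = 5, 6  # network depth and curriculum length

init = initial_layer_rates(eta_0, eta_n, n_layers)
schedule = [[rate_at_step(e, eta_0, step, k) for e in init] for step in range(k)]

# One row per step, one column per layer (first layer on the left).
for step, rates in enumerate(schedule):
    print(step, ["%.1e" % r for r in rates])
```

In this sketch, the first column stays at $\eta^{(0)}$ throughout, while deeper layers start lower and catch up by step $k-1$, after which all layers share the same learning rate. In a deep learning framework, such per-layer rates would typically be applied through separate optimizer parameter groups.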

Evaluation. For the classification tasks, we evaluate all models in terms of the accuracy rate. For the regression task (age estimation), we use the mean absolute error. For the object detection task, we employ the mean Average Precision (mAP) at an intersection over union (IoU) threshold of 0.5. We repeat the training process of each model 5 times and report the average performance and the standard deviation.
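The metrics named above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names are hypothetical, boxes are assumed to be given as `(x1, y1, x2, y2)` corners, and the sample standard deviation is an assumption, since the text does not state which estimator is used. The full mAP computation is omitted; only the IoU test used for matching detections at the 0.5 threshold is shown.

```python
import statistics


def accuracy(y_true, y_pred):
    """Accuracy rate in percent."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, e.g. for age estimation."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def summarize(run_scores):
    """Mean and sample standard deviation over repeated runs (assumed estimator)."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


# Hypothetical scores of 5 runs, for illustration only.
mean, std = summarize([71.6, 71.7, 71.8, 71.7, 71.7])
print("%.2f +/- %.2f" % (mean, std))

# A detection counts as a match when its IoU with a ground-truth box is >= 0.5.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5)
```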

Table 3: Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101 for various neural models based on different training regimes: conventional, CBS [7] and LeRaC. The accuracy of the best training regime in each experiment is highlighted in bold.
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet | ImageNet | Food-101
ResNet-18 | conventional | $89.20 \pm 0.43$ | $71.70 \pm 0.06$ | $57.41 \pm 0.05$ | $68.44 \pm 0.65$ | $68.31 \pm 0.09$
ResNet-18 | CBS | $89.53 \pm 0.22$ | $\mathbf{72.80} \pm 0.18$ | $55.49 \pm 0.20$ | $71.02 \pm 0.80$ | $65.09 \pm 0.47$
ResNet-18 | LeRaC (ours) | $\mathbf{89.56} \pm 0.16$ | $72.72 \pm 0.12$ | $\mathbf{57.86} \pm 0.20$ | $\mathbf{71.96} \pm 0.72$ | $\mathbf{69.57} \pm 0.07$
Wide-ResNet-50 | conventional | $91.22 \pm 0.24$ | $68.14 \pm 0.16$ | $55.97 \pm 0.30$ | $70.25 \pm 0.82$ | $67.54 \pm 0.66$
Wide-ResNet-50 | CBS | $89.05 \pm 1.00$ | $65.73 \pm 0.36$ | $48.30 \pm 1.53$ | $72.10 \pm 0.71$ | $58.95 \pm 1.80$
Wide-ResNet-50 | LeRaC (ours) | $\mathbf{91.58} \pm 0.16$ | $\mathbf{69.38} \pm 0.26$ | $\mathbf{56.48} \pm 0.60$ | $\mathbf{72.49} \pm 0.64$ | $\mathbf{67.96} \pm 0.35$
CvT-13 | conventional | $71.84 \pm 0.37$ | $41.87 \pm 0.16$ | $33.38 \pm 0.27$ | $81.33 \pm 0.75$ | $39.17 \pm 1.26$
CvT-13 | CBS | $72.64 \pm 0.29$ | $\mathbf{44.48} \pm 0.40$ | $33.56 \pm 0.36$ | $80.42 \pm 0.58$ | $38.63 \pm 0.49$
CvT-13 | LeRaC (ours) | $\mathbf{72.90} \pm 0.28$ | $43.46 \pm 0.18$ | $\mathbf{33.95} \pm 0.28$ | $\mathbf{82.19} \pm 0.68$ | $\mathbf{41.42} \pm 0.72$
CvT-13$_{\text{pre-trained}}$ | conventional | $93.56 \pm 0.05$ | $77.80 \pm 0.16$ | $70.71 \pm 0.35$ | - | $85.22 \pm 0.11$
CvT-13$_{\text{pre-trained}}$ | CBS | $85.85 \pm 0.15$ | $62.35 \pm 0.48$ | $68.41 \pm 0.13$ | - | $81.41 \pm 0.42$
CvT-13$_{\text{pre-trained}}$ | LeRaC (ours) | $\mathbf{94.15} \pm 0.03$ | $\mathbf{78.93} \pm 0.05$ | $\mathbf{71.34} \pm 0.08$ | - | $\mathbf{86.05} \pm 0.08$

4.3 Domain-Specific Preprocessing

Image preprocessing. For the image classification experiments, we apply the same data preprocessing approach as Sinha et al. [7]. Hence, we normalize the images and maintain their original resolution, $32\times 32$ pixels for CIFAR-10 and CIFAR-100, $64\times 64$ pixels for Tiny ImageNet, $224\times 224$ pixels for ImageNet and Food-101, and $200\times 200$ pixels for UTKFace. Similar to Sinha et al. [7], we do not employ data augmentation.

Text preprocessing. For the text classification experiments with BERT, we lowercase all words and add the classification token ([CLS]) at the start of the input sequence. We add the separator token ([SEP]) to delimit sentences. For the LSTM network, we lowercase all words and replace them with indexes from vocabularies constructed from the training set. The input sequence length is limited to $512$ tokens for BERT and $200$ tokens for LSTM.
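The text preprocessing steps can be illustrated with a simplified sketch. It assumes whitespace tokenization, whereas BERT relies on its own subword tokenizer, and the `<pad>` and `<unk>` entries, as well as all function names, are hypothetical additions for illustration.

```python
def bert_style_input(sentences, max_len=512):
    """Lowercase, prepend [CLS], close each sentence with [SEP], then truncate."""
    tokens = ["[CLS]"]
    for sentence in sentences:
        tokens += sentence.lower().split() + ["[SEP]"]
    return tokens[:max_len]


def build_vocab(train_texts):
    """Word-to-index vocabulary built from the training set only."""
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special entries
    for text in train_texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_for_lstm(text, vocab, max_len=200):
    """Lowercase, map words to indexes (unknown words to <unk>), then truncate."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()][:max_len]


vocab = build_vocab(["The cat sat", "the dog"])
print(bert_style_input(["Is it raining?", "It rained all day."]))
print(encode_for_lstm("The bird sat", vocab))
```

Words that do not appear in the training set, such as "bird" in the example, are mapped to the unknown index, since the vocabulary is constructed from the training set alone.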

Speech preprocessing. The speech preprocessing steps are carried out following Ristea et al. [24]. We thus transform each audio sample into a time-frequency matrix by computing the discrete Short Time Fourier Transform (STFT) with $N_{x}$ FFT points, using a Hamming window of length $L$ and a hop size $R$. For CREMA-D, we first standardize all audio clips to a fixed dimension of $4$ seconds by padding or clipping the samples. Then, we apply the STFT with $N_{x}=1024$, $R=64$ and a window size of $L=512$. For ESC-50, we keep the same values for $N_{x}$ and $L$, but we increase the hop size to $R=128$. Next, for each STFT, we compute the square root of the magnitude and map the values to $128$ Mel bins. The result is converted to a logarithmic scale and normalized to the interval $[0,1]$. Furthermore, in all our speech classification experiments, we use the following data augmentation methods: noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment [71].
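The following sketch illustrates only the bookkeeping around the STFT, namely the padding or clipping to a fixed duration, the resulting number of time frames, and the final normalization to $[0,1]$. It makes several assumptions that are not stated in the text: a sampling rate of 16 kHz, frames computed without padding at the signal edges, and min-max scaling as the normalization. The STFT and the Mel mapping themselves are omitted, and all function names are hypothetical.

```python
def pad_or_clip(samples, sample_rate, seconds=4):
    """Zero-pad or clip a waveform (list of floats) to a fixed duration."""
    target = sample_rate * seconds
    return samples[:target] + [0.0] * max(0, target - len(samples))


def num_stft_frames(n_samples, window, hop):
    """Number of full windows of length `window` with stride `hop` (no edge padding)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


def min_max_normalize(values):
    """Scale values to the interval [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sample_rate = 16000                                    # assumed, not stated in the text
clip = pad_or_clip([0.0] * (3 * sample_rate), sample_rate)  # 3-second clip padded to 4 s
print(len(clip))
print(num_stft_frames(len(clip), window=512, hop=64))   # hop size R = 64
print(num_stft_frames(len(clip), window=512, hop=128))  # hop size R = 128, same clip length
print(min_max_normalize([-80.0, -40.0, 0.0]))
```

Under these assumptions, doubling the hop size roughly halves the number of time frames, which is the practical effect of the larger $R$ on the size of the time-frequency matrix.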

Table 4: Multi-task learning results for ResNet-18 and CvT-13 (pre-trained) on UTKFace, using three different training regimes: conventional, CBS [7] and LeRaC. We report the accuracy (in %) for gender prediction and the mean absolute error (MAE) for age estimation. The $\downarrow$ and $\uparrow$ symbols indicate when lower or higher values are better, respectively. The best scores are highlighted in bold.
Model | Training Regime | Gender Accuracy $\uparrow$ | Age MAE $\downarrow$
ResNet-18 | conventional | $88.63 \pm 0.12$ | $6.75 \pm 0.22$
ResNet-18 | CBS | $89.23 \pm 0.11$ | $6.24 \pm 0.22$
ResNet-18 | LeRaC (ours) | $\mathbf{90.07} \pm 0.12$ | $\mathbf{5.97} \pm 0.20$
CvT-13$_{\text{pre-trained}}$ | conventional | $92.57 \pm 0.15$ | $4.78 \pm 0.18$
CvT-13$_{\text{pre-trained}}$ | CBS | $92.61 \pm 0.14$ | $4.61 \pm 0.17$
CvT-13$_{\text{pre-trained}}$ | LeRaC (ours) | $\mathbf{93.19} \pm 0.14$ | $\mathbf{4.06} \pm 0.15$

Table 5: Object detection results of YOLOv5 on PASCAL VOC, using three different training regimes: conventional, CBS [7] and LeRaC. The best mAP is highlighted in bold.
Training Regime | conventional | CBS | LeRaC (ours)
mAP | $0.832 \pm 0.006$ | $0.829 \pm 0.003$ | $\mathbf{0.846} \pm 0.004$

4.4 Preliminary Results

We present preliminary experiments to show the effect of various learning rate schedulers for different architectures. For each architecture, we compare the constant learning rate scheduler with an adaptive learning rate scheduler. The aim is to find the best scheduler for the conventional training regime, which is used as baseline in the subsequent experiments. Table 2 showcases the preliminary results on CIFAR-10, CIFAR-100 and Tiny ImageNet. We compare the outcomes of the adaptive and constant learning rate schedulers with those of LeRaC. In most cases, the adaptive scheduler yields better results than the constant learning rate. Using a constant learning rate seems to work only for the pre-trained CvT-13. Notably, the analysis also reveals that LeRaC consistently outperforms the other baseline schedulers, achieving the highest accuracy rates across all data sets.

We emphasize that, for the subsequent experiments, the conventional regime is always represented by the best scheduler among the following options: learning rate decay, learning rate warm-up, cosine annealing, or combinations of the aforementioned options.

Table 6: Left side: average accuracy rates (in %) over 5 runs on BoolQ, RTE and QNLI for BERT and LSTM. Right side: average accuracy rates (in %) over 5 runs on CREMA-D and ESC-50 for SepTr and DenseNet-121. In both domains (text and audio), the comparison is between different training regimes: conventional, CBS [7] and LeRaC. The accuracy of the best training regime in each experiment is highlighted in bold.
Training Regime | Text Model | BoolQ | RTE | QNLI | Audio Model | CREMA-D | ESC-50
conventional | BERT$_{\text{large}}$ | $74.12 \pm 0.32$ | $74.48 \pm 1.36$ | $92.13 \pm 0.08$ | SepTr | $70.47 \pm 0.67$ | $91.13 \pm 0.33$
CBS | BERT$_{\text{large}}$ | $74.37 \pm 1.11$ | $74.97 \pm 1.96$ | $91.47 \pm 0.22$ | SepTr | $69.98 \pm 0.71$ | $91.15 \pm 0.41$
LeRaC (ours) | BERT$_{\text{large}}$ | $\mathbf{75.55} \pm 0.66$ | $\mathbf{75.81} \pm 0.29$ | $\mathbf{92.45} \pm 0.13$ | SepTr | $\mathbf{70.95} \pm 0.56$ | $\mathbf{91.58} \pm 0.28$
conventional | LSTM | $64.40 \pm 1.37$ | $54.12 \pm 1.60$ | $59.42 \pm 0.36$ | DenseNet-121 | $67.21 \pm 0.12$ | $88.91 \pm 0.11$
CBS | LSTM | $64.75 \pm 1.54$ | $54.03 \pm 0.45$ | $59.89 \pm 0.38$ | DenseNet-121 | $68.16 \pm 0.19$ | $88.76 \pm 0.17$
LeRaC (ours) | LSTM | $\mathbf{65.80} \pm 0.33$ | $\mathbf{55.71} \pm 1.04$ | $\mathbf{59.98} \pm 0.34$ | DenseNet-121 | $\mathbf{68.99} \pm 0.08$ | $\mathbf{90.02} \pm 0.10$

4.5 Main Results

Image classification. In Table 3, we present the image classification results on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101. Since CvT-13 is pre-trained on ImageNet, it does not make sense to fine-tune it on ImageNet. Thus, the respective results are not reported. On the one hand, there are two scenarios (ResNet-18 on CIFAR-100, and CvT-13 on CIFAR-100) in which CBS provides the largest improvements over the conventional regime, surpassing LeRaC in the respective cases. On the other hand, there are more than 10 scenarios where CBS degrades the accuracy with respect to the standard training regime. This shows that the improvements attained by CBS are inconsistent across models and data sets. Unlike CBS, our strategy surpasses the baseline regime in all 19 cases, thus being more consistent. In 8 of these cases, the accuracy gains of LeRaC are higher than $1\%$. Moreover, LeRaC outperforms CBS in 17 out of 19 cases. We thus consider that LeRaC can be regarded as a better choice than CBS, bringing consistent performance gains.

Multi-task learning. In Table 4, we include the multi-task learning results on the UTKFace data set [11]. We evaluate the multi-task ResNet-18 and CvT-13$_{\text{pre-trained}}$ models under various training regimes, reporting the accuracy rates for gender prediction, and the mean absolute errors for age estimation, respectively. LeRaC achieves the best scores in each and every case, surpassing the other training regimes in the multi-task learning setup. Moreover, its results are statistically better with respect to both competing regimes. In contrast, the CBS regime remains in the statistical margin of the conventional regime for the pre-trained CvT-13 network.

Object detection. In Table 5, we include the object detection results of YOLOv5 [20] based on different training regimes on PASCAL VOC 2007+2012 [12]. LeRaC exhibits a superior mAP score, significantly surpassing the other training regimes. In contrast, CBS leads to suboptimal performance, hinting towards the inconsistency of CBS across different evaluation scenarios.

Text classification. In Table 6 (left side), we report the text classification results on BoolQ, RTE and QNLI. Here, there are two cases (BERT on QNLI and LSTM on RTE) where CBS leads to performance drops compared with the conventional training regime. In all other cases, the improvements of CBS are below $0.6\%$. Just as in the image classification experiments, LeRaC brings accuracy gains for each and every model and data set. In four out of six scenarios, the accuracy gains yielded by LeRaC are higher than $1.3\%$. Once again, LeRaC proves to be the most consistent regime, generally surpassing CBS by significant margins.

Figure 3: Validation accuracy (on the y-axis) versus training time (on the x-axis) for four distinct architectures: (a) ResNet-18 on Tiny ImageNet, (b) Wide-ResNet-50 on Tiny ImageNet, (c) BERT on BoolQ, and (d) SepTr on CREMA-D. The number of training epochs is the same for both LeRaC and CBS, the observable time difference being caused by the overhead of CBS due to the Gaussian kernel layers.

Speech classification. In Table 6 (right side), we present the results obtained on the audio data sets, namely CREMA-D and ESC-50. We observe that the CBS strategy obtains lower results compared with the baseline in two cases (SepTr on CREMA-D and DenseNet-121 on ESC-50), while our method provides superior results for each and every case. By applying LeRaC on SepTr, we set a new state-of-the-art accuracy level ($70.95\%$) on the CREMA-D audio modality, surpassing the previous state-of-the-art value attained by Ristea et al. [24] with SepTr alone. When applied on DenseNet-121, LeRaC brings performance improvements higher than $1\%$, the highest improvement ($1.78\%$) over the baseline being attained on CREMA-D.

Significance testing. To determine if the reported accuracy gains observed for LeRaC with respect to the baseline are significant, we apply McNemar / Cochran Q significance testing [72] to the results reported in Table 3, Table 4, Table 5 and Table 6 on all 12 data sets. In 27 of 34 cases, we found that our results are significantly better than the corresponding baseline, at a p-value of $0.001$. This confirms that our gains are statistically significant in the majority of cases.
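For readers unfamiliar with this kind of paired test, the sketch below shows McNemar's test on per-sample correctness indicators of two models evaluated on the same test set. It is an illustration under stated assumptions and not the exact procedure used for the reported results: it uses the chi-square approximation with continuity correction, it does not cover the Cochran Q variant, and the counts in the example are hypothetical.

```python
import math


def mcnemar(baseline_correct, model_correct):
    """McNemar's test with continuity correction on paired correctness flags.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    pairs = list(zip(baseline_correct, model_correct))
    b = sum(1 for base, model in pairs if base and not model)  # only baseline correct
    c = sum(1 for base, model in pairs if model and not base)  # only new model correct
    if b + c == 0:
        return 0.0, 1.0
    stat = max(0, abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical outcome on 100 test samples: 55 solved by both models,
# 5 solved only by the baseline, 40 solved only by the new model.
baseline = [True] * 55 + [True] * 5 + [False] * 40
new_model = [True] * 55 + [False] * 5 + [True] * 40

stat, p_value = mcnemar(baseline, new_model)
print(stat, p_value, p_value < 0.001)
```

Only the discordant pairs enter the statistic, so two models with similar accuracy can still differ significantly if one of them fixes many more errors than it introduces.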

Table 7: Average accuracy rates (in %) over 5 runs for ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained) on CIFAR-10 and CIFAR-100 using different training regimes: conventional, CBS [7], LSCL [25], EfficientTrain [26], Self-taught [27], CLIP [28], LCDnet-CL [29] and LeRaC (ours). The accuracy of the best training regime on each data set is highlighted in bold.
Model | Training Regime | CIFAR-10 | CIFAR-100
ResNet-18 | conventional | $89.20 \pm 0.43$ | $71.70 \pm 0.06$
ResNet-18 | CBS [7] | $89.53 \pm 0.22$ | $72.80 \pm 0.18$
ResNet-18 | LSCL [25] | $88.28 \pm 0.14$ | $68.42 \pm 0.25$
ResNet-18 | EfficientTrain [26] | $89.51 \pm 0.13$ | $\mathbf{72.83} \pm 0.12$
ResNet-18 | Self-taught [27] | $89.48 \pm 0.17$ | $72.10 \pm 0.32$
ResNet-18 | LCDnet-CL [29] | $89.36 \pm 0.38$ | $71.06 \pm 0.27$
ResNet-18 | CLIP [28] | $89.11 \pm 0.02$ | $70.03 \pm 0.27$
ResNet-18 | LeRaC (ours) | $\mathbf{89.56} \pm 0.16$ | $72.72 \pm 0.12$
Wide-ResNet-50 | conventional | $91.22 \pm 0.24$ | $68.14 \pm 0.16$
Wide-ResNet-50 | CBS [7] | $89.05 \pm 1.00$ | $65.73 \pm 0.36$
Wide-ResNet-50 | LSCL [25] | $88.28 \pm 0.14$ | $72.59 \pm 0.25$
Wide-ResNet-50 | EfficientTrain [26] | $91.03 \pm 0.28$ | $69.14 \pm 0.20$
Wide-ResNet-50 | Self-taught [27] | $91.00 \pm 0.24$ | $68.48 \pm 0.26$
Wide-ResNet-50 | LCDnet-CL [29] | $91.38 \pm 0.18$ | $68.85 \pm 0.13$
Wide-ResNet-50 | CLIP [28] | $91.18 \pm 0.11$ | $68.13 \pm 0.39$
Wide-ResNet-50 | LeRaC (ours) | $\mathbf{91.58} \pm 0.16$ | $\mathbf{69.38} \pm 0.26$
CvT-13$_{\text{pre-trained}}$ | conventional | $93.56 \pm 0.05$ | $77.80 \pm 0.16$
CvT-13$_{\text{pre-trained}}$ | CBS [7] | $85.85 \pm 0.15$ | $62.35 \pm 0.48$
CvT-13$_{\text{pre-trained}}$ | LSCL [25] | $93.91 \pm 0.20$ | $78.63 \pm 0.12$
CvT-13$_{\text{pre-trained}}$ | EfficientTrain [26] | $\mathbf{94.50} \pm 0.17$ | $78.20 \pm 0.34$
CvT-13$_{\text{pre-trained}}$ | Self-taught [27] | $92.25 \pm 0.22$ | $77.95 \pm 0.32$
CvT-13$_{\text{pre-trained}}$ | LCDnet-CL [29] | $92.72 \pm 0.16$ | $78.57 \pm 0.16$
CvT-13$_{\text{pre-trained}}$ | CLIP [28] | $92.61 \pm 0.36$ | $76.18 \pm 1.45$
CvT-13$_{\text{pre-trained}}$ | LeRaC (ours) | $94.15 \pm 0.03$ | $\mathbf{78.93} \pm 0.05$

Training time comparison. For a particular model and data set, all training regimes are executed for the same number of epochs, for a fair comparison. However, the CBS strategy adds the smoothing operation at multiple levels inside the architecture, which increases the training time. To quantify this overhead, we compare the training time (in hours) versus the validation accuracy of CBS and LeRaC. For this experiment, we select four neural models and illustrate the evolution of the validation accuracy over time in Figure 3. We underline that LeRaC converges faster, being around 7-12% faster than CBS. We also note that LeRaC requires the same time as the conventional regime.

4.6 More Comparative Results

Comparing with domain-specific curriculum learning strategies. Although we consider CBS [7] as our closest competitor in terms of applicability across architectures and domains, there are domain-specific curriculum learning methods reporting promising results. To this end, we perform additional experiments on CIFAR-10 and CIFAR-100 with ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained), considering two recent curriculum learning strategies applied in the image domain, namely Label-Similarity Curriculum Learning (LSCL) [25] and EfficientTrain [26].

Table 8: Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for CvT-13 based on different training regimes: conventional, LeRaC with logarithmic update, LeRaC with linear update, and LeRaC with exponential update (proposed). The accuracy rates surpassing the baseline training regime are highlighted in bold.
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet
CvT-13 | conventional | $71.84 \pm 0.37$ | $41.87 \pm 0.16$ | $33.38 \pm 0.27$
CvT-13 | LeRaC (logarithmic update) | $\mathbf{72.14} \pm 0.13$ | $\mathbf{43.37} \pm 0.20$ | $\mathbf{33.82} \pm 0.15$
CvT-13 | LeRaC (linear update) | $\mathbf{72.49} \pm 0.27$ | $\mathbf{43.39} \pm 0.14$ | $\mathbf{33.86} \pm 0.07$
CvT-13 | LeRaC (exponential update) | $\mathbf{72.90} \pm 0.28$ | $\mathbf{43.46} \pm 0.18$ | $\mathbf{33.95} \pm 0.28$

Dogan et al. [25] proposed LSCL, a strategy that relies on hierarchically clustering the classes (labels) based on inter-label similarities determined with the help of document embeddings representing the Wikipedia pages of the respective classes. The corresponding results shown in Table 7 indicate that label-similarity curriculum is useful for CIFAR-100, but not for CIFAR-10. This suggests that the method needs a sufficiently large number of classes to benefit from the constructed hierarchy of classes. In contrast, LeRaC does not rely on external components, such as the similarity measure used by Dogan et al. [25] in their strategy. Another important limitation of LSCL [25] is its restricted use, e.g. LSCL is not applicable to regression tasks, where there are no classes. Therefore, we consider LeRaC as a more versatile alternative.

EfficientTrain is an alternative to CBS, which introduces a cropping operation in the Fourier spectrum of the inputs instead of blurring the activation maps. The method is not suitable for text data, so the comparison between EfficientTrain and LeRaC can only be performed in the image domain. Consequently, we compare with EfficientTrain [26] on CIFAR-10 and CIFAR-100, and show the corresponding results in Table 7. Notably, our method surpasses EfficientTrain [26] in 4 out of 6 evaluation scenarios. These results confirm the competitiveness of LeRaC in comparison to very recent methods, such as EfficientTrain [26].

Aside from outperforming EfficientTrain and LSCL in the image domain, our method has another important advantage: it is generally applicable to any domain.

Comparing with data-level curriculum learning methods. In Table 7, we also compare LeRaC with three data-level curriculum learning methods [27, 28, 29]. These methods share a common framework, where a scoring function ranks samples based on their difficulty, and a pacing function determines the timing for introducing new batches during training. Khan et al. [27] examine various pacing functions and classify scoring functions into two categories: self-taught and transfer-scoring functions. Self-taught functions involve training a model on a subset of data batches and then using this model to assess the difficulty of examples. In contrast, transfer-scoring functions utilize a pre-trained model for this purpose. For the results reported in Table 7 for Khan et al. [27], we use the self-taught scoring function and a linear pacing function. To compare with Khan et al. [29], we use a transfer-scoring function and a ResNet-50 model pre-trained on ImageNet. For Khan et al. [28], aside from using the pre-trained model for assessing the difficulty of the samples, we also remove the least significant samples during training.

The results reported in Table 7 indicate that LeRaC outperforms the data-level curriculum learning methods. We note that these methods were exclusively tested on crowd density estimation tasks, which could explain why their effectiveness might not generalize to different types of tasks. For instance, the method described by Khan et al. [28] is suboptimal even when compared with conventional training, suggesting that the strategy of removing easy examples is not always effective for image classification tasks.

Figure 4: Average SNR of the feature maps at each layer of the randomly initialized LeNet architecture. The SNR at each layer is averaged for 100 randomly picked images from the CIFAR-100 data set. For the later layers, the SNR is negative because the signal is dominated by noise.

4.7 Ablation Studies

Comparing different schedulers. We first aim to establish if the exponential learning rate scheduler proposed in Eq. (9) is a good choice. To test this out, we select the CvT-13 model and change the LeRaC regime to use linear or logarithmic updates of the learning rates. The corresponding results are shown in Table 8. We observe that both alternative schedulers obtain performance gains, but our exponential learning rate scheduler brings higher gains on all three data sets. We thus conclude that the update rule defined in Eq. (9) is a sound option.
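To give an intuition for how the three update types differ, the sketch below interpolates a single learning rate from its initial value to the target value over $k$ steps. The functional forms are hypothetical stand-ins chosen for illustration, since the exact linear and logarithmic rules are not spelled out in this section, and the exponential form is written as a plain geometric interpolation rather than as a transcription of Eq. (9). All names and numeric values are assumptions.

```python
import math


def linear_update(eta_init, eta_target, step, k):
    """Equal additive increments at every step (assumed form)."""
    return eta_init + (eta_target - eta_init) * step / (k - 1)


def exponential_update(eta_init, eta_target, step, k):
    """Equal multiplicative increments at every step (geometric interpolation)."""
    return eta_init * (eta_target / eta_init) ** (step / (k - 1))


def logarithmic_update(eta_init, eta_target, step, k):
    """Fast early growth that flattens out later (assumed form)."""
    return eta_init + (eta_target - eta_init) * math.log(1 + step) / math.log(k)


# Hypothetical values for a deep layer: from 1e-8 up to 1e-1 in k = 6 steps.
eta_init, eta_target, k = 1e-8, 1e-1, 6
for step in range(k):
    print(step,
          "%.2e" % exponential_update(eta_init, eta_target, step, k),
          "%.2e" % linear_update(eta_init, eta_target, step, k),
          "%.2e" % logarithmic_update(eta_init, eta_target, step, k))
```

With these forms, all three updates share the same endpoints, but the exponential update keeps the learning rates of the deeper layers low for the longest time, whereas the logarithmic update raises them the fastest.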

Our previous ablation study shows that the exponential scheduler leads to higher gains than the linear or the logarithmic schedulers. In general, a suitable scheduler is one that adjusts the learning rate at each layer proportionally to the estimated signal-to-noise drop from one layer to the next. To understand how the average SNR drops from one neural layer to the next, we plot the average SNR of the feature maps at each layer of the randomly initialized LeNet architecture, computed over 100 images from CIFAR-100, in Figure 4. As anticipated, the average SNR decreases along with the layer index. Notably, we observe that the drop in SNR follows an exponential trend. This can explain why the exponential scheduler is a more suitable choice.

Figure 5: Test accuracy (on the y-axis) versus training time (on the x-axis) for ResNet-18 on CIFAR-100 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color.

Figure 6: Test accuracy (on the y-axis) versus training time (on the x-axis) for the pre-trained CvT-13 on CIFAR-10 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color.

To further justify our preference towards the exponential scheduler, we analyze the training progress of the ResNet-18 and the pre-trained CvT-13 models using various schedulers (logarithmic, linear, exponential) for LeRaC. Figure 5 shows the results for ResNet-18, while Figure 6 illustrates the results for CvT-13. In both cases, the exponential scheduler leads to better training progress than the conventional regime, whereas the linear and logarithmic schedulers are less effective. These results further support the exponential scheduler as the most suitable of the three schedulers tested.

Table 9: Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 based on different ranges for the initial learning rates. The accuracy rates surpassing the baseline training regime are highlighted in bold.

| Training Regime | $\eta_1^{(0)}$-$\eta_n^{(0)}$ | ResNet-18 | Wide-ResNet-50 |
|---|---|---|---|
| conventional | $10^{-1}$-$10^{-1}$ | $71.70 \pm 0.06$ | $68.14 \pm 0.16$ |
| LeRaC (ours) | $10^{-1}$-$10^{-6}$ | $\mathbf{72.48} \pm 0.10$ | $\mathbf{68.64} \pm 0.52$ |
| LeRaC (ours) | $10^{-1}$-$10^{-7}$ | $\mathbf{72.52} \pm 0.17$ | $\mathbf{69.25} \pm 0.37$ |
| LeRaC (ours) | $10^{-1}$-$10^{-8}$ | $\mathbf{72.72} \pm 0.12$ | $\mathbf{69.38} \pm 0.26$ |
| LeRaC (ours) | $10^{-1}$-$10^{-9}$ | $\mathbf{72.29} \pm 0.38$ | $\mathbf{69.26} \pm 0.27$ |
| LeRaC (ours) | $10^{-1}$-$10^{-10}$ | $\mathbf{72.45} \pm 0.25$ | $\mathbf{69.66} \pm 0.34$ |
| LeRaC (ours) | $10^{-2}$-$10^{-8}$ | $\mathbf{72.41} \pm 0.08$ | $\mathbf{68.51} \pm 0.52$ |
| LeRaC (ours) | $10^{-3}$-$10^{-8}$ | $\mathbf{72.08} \pm 0.19$ | $\mathbf{68.71} \pm 0.47$ |

Varying value ranges for initial learning rates. All our hyperparameters are either fixed without tuning or tuned on the validation data. In this ablation experiment, we present results with LeRaC using multiple ranges for $\eta_1^{(0)}$ and $\eta_n^{(0)}$ to demonstrate that LeRaC is sufficiently stable with respect to suboptimal hyperparameter choices. We carry out experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100. We report the corresponding results in Table 9. We observe that all hyperparameter configurations lead to surpassing the baseline regime. This indicates that LeRaC can bring performance gains even outside the optimal learning rate bounds, demonstrating low sensitivity to suboptimal hyperparameter tuning.
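As an illustration of what a range such as $10^{-1}$-$10^{-8}$ means for the individual layers, the sketch below spreads the initial learning rates between the two bounds. How the intermediate layers are assigned their rates is not specified in this section, so the log-uniform spacing, the function name `initial_layer_rates` and the choice of 8 layers are assumptions made purely for illustration.

```python
def initial_layer_rates(eta_first, eta_last, n_layers):
    """Return one initial learning rate per layer, from input to output.

    Assumption: rates are spaced uniformly in log space between
    eta_first (layer 1) and eta_last (layer n).
    """
    if n_layers == 1:
        return [eta_first]
    ratio = (eta_last / eta_first) ** (1.0 / (n_layers - 1))
    return [eta_first * ratio ** j for j in range(n_layers)]


# Hypothetical 8-layer model with the range 1e-1 to 1e-8.
rates = initial_layer_rates(1e-1, 1e-8, n_layers=8)
for layer_index, rate in enumerate(rates, start=1):
    print(f"layer {layer_index}: {rate:.1e}")
```

With this spacing, widening the range (for example, lowering the last bound to 1e-10) only changes the common ratio between consecutive layers, while the first layer keeps its rate.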

Table 10: Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 using the LeRaC regime until iteration $k$, while varying $k$. The accuracy rates surpassing the baseline training regime are highlighted in bold.

| Training Regime | $k$ | ResNet-18 | Wide-ResNet-50 |
|---|---|---|---|
| conventional | - | $71.70 \pm 0.06$ | $68.14 \pm 0.16$ |
| LeRaC (ours) | 5 | $\mathbf{73.04} \pm 0.09$ | $\mathbf{68.86} \pm 0.76$ |
| LeRaC (ours) | 6 | $\mathbf{72.87} \pm 0.07$ | $\mathbf{69.78} \pm 0.16$ |
| LeRaC (ours) | 7 | $\mathbf{72.72} \pm 0.12$ | $\mathbf{69.38} \pm 0.26$ |
| LeRaC (ours) | 8 | $\mathbf{73.50} \pm 0.16$ | $\mathbf{69.30} \pm 0.18$ |
| LeRaC (ours) | 9 | $\mathbf{73.29} \pm 0.28$ | $\mathbf{68.94} \pm 0.30$ |

Table 11: Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D, based on different training regimes: conventional, anti-LeRaC and LeRaC. The accuracy of the best training regime in each experiment is highlighted in bold.

| Data Set | Model | Training Regime | Accuracy |
|---|---|---|---|
| CIFAR-100 | ResNet-18 | conventional | $71.70 \pm 0.06$ |
| CIFAR-100 | ResNet-18 | anti-LeRaC | $71.24 \pm 0.11$ |
| CIFAR-100 | ResNet-18 | LeRaC (ours) | $\mathbf{72.72} \pm 0.12$ |
| CIFAR-100 | Wide-ResNet-50 | conventional | $68.14 \pm 0.16$ |
| CIFAR-100 | Wide-ResNet-50 | anti-LeRaC | $67.47 \pm 0.15$ |
| CIFAR-100 | Wide-ResNet-50 | LeRaC (ours) | $\mathbf{69.38} \pm 0.26$ |
| CREMA-D | SepTr | conventional | $70.47 \pm 0.67$ |
| CREMA-D | SepTr | anti-LeRaC | $68.33 \pm 0.61$ |
| CREMA-D | SepTr | LeRaC (ours) | $\mathbf{70.95} \pm 0.56$ |

Varying $k$. In Table 10, we present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100, considering various values for $k$ (the last iteration for our training regime). We observe that all configurations surpass the baselines on CIFAR-100. Moreover, we observe that the value of $k$ selected on the validation set ($k=7$ for both ResNet-18 and Wide-ResNet-50) is not the value producing the best results on the test set. This indicates that we did not overfit the hyperparameters of LeRaC to the test data.

Anti-curriculum. Since our goal is to perform curriculum learning (from easy to hard), we restrict the settings for $\eta_j$, $\forall j \in \{1, 2, \ldots, n\}$, such that deeper layers start with lower learning rates. However, another strategy is to consider the opposite setting, where we use higher learning rates for deeper layers. If we train later layers at a faster pace (anti-curriculum), we conjecture that the later layers get adapted to the noise from the early layers, which could likely lead to local optima or difficult training (due to the need to readapt to the earlier layers, once these layers start learning useful features). We tested this approach (anti-LeRaC), which belongs to the category of anti-curriculum learning strategies [2], in a set of new experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D. We report the corresponding results with LeRaC and anti-LeRaC in Table 11. Although anti-curriculum, e.g. hard negative sample mining, was shown to be useful in other tasks [2], our results indicate that learning rate anti-curriculum attains inferior performance compared with our approach. Furthermore, anti-LeRaC also falls below the conventional regime, which supports our conjecture regarding this strategy.

Summary. Notably, our ablation results show that the majority of hyperparameter configurations tested for LeRaC lead to outperforming the conventional regime, demonstrating the stability of LeRaC. We present additional experiments in Appendix C.

5 Discussion

Interaction with optimization algorithms. Throughout our experiments, we use the same optimizer for a given neural model across all training regimes (conventional, CBS, LeRaC). The best optimizer for each neural model is established for the conventional training regime. We underline that our initial learning rates and scheduler are used independently of the optimizers. Although our learning rate scheduler updates the learning rates at the beginning of every iteration, we did not observe any stability or interaction issues with any of the optimizers (SGD, Adam, AdaMax, AdamW).

Interaction with other curriculum learning strategies. Our simple and generic curriculum learning scheme can be integrated into any model for any task, not relying on domain or task dependent information, e.g. the data samples. In Table 16 from Appendix C, we show that combining LeRaC and CBS can boost performance. In a similar fashion, LeRaC can be combined with data-level curriculum strategies for improved performance. We leave this exploration for future work.

Interaction with other learning rate schedulers. Whenever a learning rate scheduler is used for training a model in our experiments, we simply replace the scheduler with LeRaC until epoch $k$. For example, all the baseline CvT results are based on linear warm-up with cosine annealing, this being the recommended scheduler for CvT [22]. When we introduce LeRaC, we simply deactivate alternative schedulers between epochs $0$ and $k$. In general, we recommend deactivating other schedulers while using LeRaC, to keep the setup simple and to avoid stability issues.
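The hand-over between the two phases can be sketched as follows. This is an illustrative assumption rather than the procedure used in the experiments: the text does not state whether the regular scheduler restarts at epoch $k$ or resumes from its epoch-$k$ position, so the sketch restarts a plain cosine annealing schedule at epoch $k$ and omits warm-up. The function names, the geometric ramp and all numeric values are hypothetical.

```python
import math


def cosine_rate(eta_max, step, total_steps, eta_min=0.0):
    """Plain cosine annealing from eta_max (step 0) to eta_min (last step)."""
    t = min(step, total_steps) / total_steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t))


def combined_rates(initial_rates, eta_target, epoch, k, total_epochs):
    """Per-layer learning rates for a given epoch.

    Before epoch k: each layer ramps geometrically from its own initial
    rate to eta_target (assumed curriculum phase, other schedulers off).
    From epoch k on: all layers share one cosine-annealed rate.
    """
    if epoch < k:
        t = epoch / k
        return [eta0 * (eta_target / eta0) ** t for eta0 in initial_rates]
    shared = cosine_rate(eta_target, epoch - k, total_epochs - k)
    return [shared for _ in initial_rates]


# Hypothetical 3-layer model, curriculum phase of k = 5 epochs out of 100.
for epoch in (0, 2, 5, 50, 100):
    rates = combined_rates([1e-1, 1e-4, 1e-8], 1e-1, epoch, k=5, total_epochs=100)
    print(epoch, [f"{r:.2e}" for r in rates])
```

Restarting the shared schedule at epoch $k$ keeps the learning rate continuous at the hand-over point, since every layer has just reached the target rate.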

Limitations of our work. One limitation is the need to disable other learning rate schedulers while using LeRaC. We already tested this scenario with linear warm-up with cosine annealing, which is removed when using LeRaC, observing consistent performance gains (see Table 3). However, disabling alternative learning rate schedulers might bring performance drops in other cases. Hence, this has to be decided on a case-by-case basis. Another limitation is the possibility of encountering longer training times or poor convergence when the hyperparameters are not properly configured. We recommend hyperparameter tuning on the validation set to avoid this outcome.

6 Conclusion

In this paper, we introduced a novel model-level curriculum learning approach that is based on starting the training process with increasingly lower learning rates per layer, as the layers get closer to the output. We conducted comprehensive experiments on 12 data sets from three domains (image, text and audio), considering multiple neural architectures (CNNs, RNNs and transformers), to compare our novel training regime (LeRaC) with a state-of-the-art regime (CBS [7]), as well as the conventional training regime (based on early stopping and reduce on plateau). The empirical results demonstrate that LeRaC is significantly more consistent than CBS, perhaps being one of the most versatile curriculum learning strategies to date, due to its compatibility with multiple neural models and its usefulness across different domains. Remarkably, all these benefits come for free, i.e. LeRaC does not add any extra time over the conventional approach.

Declarations

Funding. This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0195, within PNCDI III.

Conflict of interest. The authors have no conflicts of interest to declare that are relevant to the content of this article.

Availability of data and materials. The data sets are publicly available online.

Code availability. The code has been made publicly available for non-commercial use at https://github.com/CroitoruAlin/LeRaC.

Appendix A Theoretical Proof

The motivation behind using LeRaC stems from our conjecture stating that the level of noise inside features gradually increases with every layer of a neural network. Regardless of the type of layer (convolutional, transformer or fully connected), the operation performed inside a neural layer boils down to matrix or vector multiplications. To this end, we set out to demonstrate that the signal resulting from the multiplication of two signals has a lower signal-to-noise ratio (SNR) than the multiplied factors. We start with the definition of the variance of a signal, which is given below:

Definition 1.

The variance of a signal $s$ is given by:

$$\mathrm{Var}(s) = E[s^2] - E[s]^2. \tag{13}$$

From Definition 1, it results that the expected value of $s^2$, which represents the power of signal $s$, is equal to:

$$E[s^2] = E[s]^2 + \mathrm{Var}(s) = \mu^2_s + \sigma^2_s, \tag{14}$$

where $\mu_s$ is the mean of $s$, and $\sigma^2_s$ is the variance of $s$. We use Eq. (14) to define the SNR of a signal as follows:

Definition 2.

The signal-to-noise ratio (SNR) of a signal $s = u + z$, where $u$ is the clean signal and $z$ is the noise component, is the ratio between the power of $u$ and the power of $z$, which is given by:

$$\operatorname{SNR}(s) = \frac{E[u^2]}{E[z^2]} = \frac{\mu^2_u + \sigma^2_u}{\mu^2_z + \sigma^2_z}, \tag{15}$$

where $\mu_u$ and $\mu_z$ are the means of $u$ and $z$, and $\sigma^2_u$ and $\sigma^2_z$ are the variances of $u$ and $z$, respectively.

The noise contained by data samples given as input to neural networks is usually uncorrelated, e.g. the noise in images is assumed to come from a random normal distribution of zero mean. Moreover, the weights of a neural network are usually initialized by sampling them from a random normal distribution of zero mean [66]. Hence, without loss of generality, we can naturally assume that the noise component has zero mean. This means that we can simplify Eq. (15) to:

$$\operatorname{SNR}(s) = \frac{\mu^2_u + \sigma^2_u}{\sigma^2_z}. \tag{16}$$

If the power of the signal $u$ is higher than the power of the noise $z$, then $\operatorname{SNR}(s)$ is higher than $1$. If the signal is dominated by noise, then $\operatorname{SNR}(s)$ is between $0$ and $1$. Note that the SNR does not take negative values. To avoid discussing edge cases, we assume that the SNR of any signal is always defined, i.e. the power of the noise is never $0$.
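Definition 2 can be illustrated on sampled signals with a few lines of code. The sketch below is only an illustration: the helper names `power`, `snr` and `snr_db`, as well as the sample values, are hypothetical and are not part of the original work.

```python
import math


def power(signal):
    """Average power E[s^2] of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)


def snr(clean, noise):
    """SNR as the power of the clean part over the power of the noise."""
    noise_power = power(noise)
    if noise_power == 0.0:
        # Edge case excluded in the text: the SNR is undefined here.
        raise ValueError("SNR is undefined when the noise power is zero")
    return power(clean) / noise_power


def snr_db(ratio):
    """Convert a (non-negative) SNR ratio to decibels."""
    return 10.0 * math.log10(ratio)


clean = [1.0, -2.0, 3.0, -2.0]   # hypothetical clean component u
noise = [0.5, -0.5, 0.5, -0.5]   # hypothetical zero-mean noise z
ratio = snr(clean, noise)
print(ratio, snr_db(ratio))
```

Because both powers are averages of squares, the ratio itself cannot be negative. On a decibel scale, a ratio below $1$ maps to a negative value, which would be consistent with the negative values reported in Figure 4 if that figure uses a logarithmic scale (an assumption, since the scale is not stated in this section).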

Theorem 1.

Let $s_1 = u_1 + z_1$ and $s_2 = u_2 + z_2$ be two signals, where $u_1$ and $u_2$ are the clean components, and $z_1$ and $z_2$ are the noise components. The signal-to-noise ratio of the product between the two signals is lower than the signal-to-noise ratios of the two signals, i.e.:

$$\operatorname{SNR}(s_1 \cdot s_2) \leq \operatorname{SNR}(s_i), \forall i \in \{1, 2\}. \tag{17}$$
Proof.

To demonstrate our theorem, we rely on the formula of variance for the sum of two uncorrelated signals with zero mean:

$$\mathrm{Var}(s_1 + s_2) = \mathrm{Var}(s_1) + \mathrm{Var}(s_2). \tag{18}$$

We also rely on the formula of variance for the product of two independent signals:

$$\mathrm{Var}(s_1 \cdot s_2) = \mathrm{Var}(s_1) \cdot \mathrm{Var}(s_2) + \mathrm{Var}(s_1) \cdot E[s_2]^2 + \mathrm{Var}(s_2) \cdot E[s_1]^2. \tag{19}$$
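Eq. (19) can be verified exactly on small discrete distributions. The sketch below is an illustration with hypothetical values: two variables are made independent by construction, by enumerating every pair of values with equal weight, and the population variance of their product is compared against the right-hand side of Eq. (19).

```python
from itertools import product


def mean(values):
    """Population mean of a list of equally likely values."""
    return sum(values) / len(values)


def variance(values):
    """Population variance of a list of equally likely values."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


# Hypothetical supports; every (x, y) pair is equally likely, so the two
# variables are independent by construction.
xs = [1.0, 2.0, 4.0]
ys = [-1.0, 0.0, 3.0, 6.0]

products = [x * y for x, y in product(xs, ys)]
lhs = variance(products)
rhs = (variance(xs) * variance(ys)
       + variance(xs) * mean(ys) ** 2
       + variance(ys) * mean(xs) ** 2)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding. The identity relies on independence, so it would not hold in general if the pairs were weighted unevenly.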

Let $s$ denote the product of the two signals, i.e. $s = s_1 \cdot s_2$. Expanding the signals $s_1$ and $s_2$ leads to the following formulation of $s$:

$$\begin{split} s &= s_1 \cdot s_2 = (u_1 + z_1) \cdot (u_2 + z_2) \\ &= u_1 \cdot u_2 + u_1 \cdot z_2 + u_2 \cdot z_1 + z_1 \cdot z_2, \end{split} \tag{20}$$

where the clean component is $u = u_1 \cdot u_2$, and the noise component is $z = u_1 \cdot z_2 + u_2 \cdot z_1 + z_1 \cdot z_2$. Hence, $s = u + z$.
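The decomposition in Eq. (20) makes it possible to check the claim of Theorem 1 on a toy example. The sketch below is not part of the proof: it uses hypothetical two-valued distributions for $u_1$, $z_1$, $u_2$ and $z_2$ (with zero-mean noise), treats all four as independent by enumerating every combination with equal weight, and compares the SNR of the product with the SNRs of the factors.

```python
from itertools import product


def mean(values):
    """Population mean of a list of equally likely values."""
    return sum(values) / len(values)


# Hypothetical supports; the noise components have zero mean.
u1_vals, z1_vals = [1.0, 3.0], [-1.0, 1.0]
u2_vals, z2_vals = [2.0, 4.0], [-0.5, 0.5]

# SNR of each factor: power of the clean part over power of the noise.
snr_1 = mean([u * u for u in u1_vals]) / mean([z * z for z in z1_vals])
snr_2 = mean([u * u for u in u2_vals]) / mean([z * z for z in z2_vals])

# Enumerate all equally likely combinations to model independence.
clean_sq, noise_sq = [], []
for u1, z1, u2, z2 in product(u1_vals, z1_vals, u2_vals, z2_vals):
    u = u1 * u2                         # clean component of the product
    z = u1 * z2 + u2 * z1 + z1 * z2     # noise component of the product
    clean_sq.append(u * u)
    noise_sq.append(z * z)

snr_s = mean(clean_sq) / mean(noise_sq)
print(snr_1, snr_2, snr_s)
```

In this example, the SNR of the product is below the SNR of either factor, in line with Eq. (17).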

An example given as input to a neural network and the initial weights of the respective neural network are not correlated under any practical circumstances. Hence, without loss of generality, we can assume that the signals $s_1$ and $s_2$ are independent of each other, i.e. their covariance is equal to $0$. This assumption allows us to simplify the signal power of $u$ to:

$$\begin{split} E[u^2] &= E[u_1^2 \cdot u_2^2] = E[u_1^2] \cdot E[u_2^2] \\ &= \left(\mu^2_{u_1} + \sigma^2_{u_1}\right) \cdot \left(\mu^2_{u_2} + \sigma^2_{u_2}\right). \end{split} \tag{21}$$

The signal power of $z$ is given by:

$$E[z^2] = E[z]^2 + \mathrm{Var}(z) = \mathrm{Var}(z), \tag{22}$$

since the noise is of zero mean, i.e. $E[z] = 0$. By employing Eq. (18), we can compute the power of $z$ as follows:

$$\begin{split} E[z^2] &= \mathrm{Var}(z) = \mathrm{Var}(u_1 \cdot z_2 + u_2 \cdot z_1 + z_1 \cdot z_2) \\ &= \mathrm{Var}(u_1 \cdot z_2) + \mathrm{Var}(u_2 \cdot z_1) + \mathrm{Var}(z_1 \cdot z_2). \end{split} \tag{23}$$

By applying Eq. (19) in Eq. (23), and considering that $z_1$ and $z_2$ have zero mean, we obtain:

$$\begin{split} \mathrm{Var}(u_1 \cdot z_2) &= \left(\mu^2_{u_1} + \sigma^2_{u_1}\right) \cdot \sigma^2_{z_2}, \\ \mathrm{Var}(u_2 \cdot z_1) &= \left(\mu^2_{u_2} + \sigma^2_{u_2}\right) \cdot \sigma^2_{z_1}, \\ \mathrm{Var}(z_1 \cdot z_2) &= \sigma^2_{z_1} \cdot \sigma^2_{z_2}. \end{split} \tag{24}$$

Replacing Eq. (21) and Eq. (24) inside Definition 2 leads to the following expression of the signal-to-noise ratio of signal $s$:

$$\begin{split} \operatorname{SNR}(s) &= \frac{E[u^2]}{E[z^2]} \\ &= \frac{\left(\mu^2_{u_1} + \sigma^2_{u_1}\right) \cdot \left(\mu^2_{u_2} + \sigma^2_{u_2}\right)}{\left(\mu^2_{u_1} + \sigma^2_{u_1}\right) \cdot \sigma^2_{z_2} + \left(\mu^2_{u_2} + \sigma^2_{u_2}\right) \cdot \sigma^2_{z_1} + \sigma^2_{z_1} \cdot \sigma^2_{z_2}} \\ &= \frac{\left(\mu^2_{u_1} + \sigma^2_{u_1}\right) \cdot \left(\mu^2_{u_2} + \sigma^2_{u_2}\right)}{\sigma^2_{z_1} \cdot \sigma^2_{z_2} \cdot \left(\frac{\mu^2_{u_1} + \sigma^2_{u_1}}{\sigma^2_{z_1}} + \frac{\mu^2_{u_2} + \sigma^2_{u_2}}{\sigma^2_{z_2}} + 1\right)} \\ &= \frac{\frac{\mu^2_{u_1} + \sigma^2_{u_1}}{\sigma^2_{z_1}} \cdot \frac{\mu^2_{u_2} + \sigma^2_{u_2}}{\sigma^2_{z_2}}}{\frac{\mu^2_{u_1} + \sigma^2_{u_1}}{\sigma^2_{z_1}} + \frac{\mu^2_{u_2} + \sigma^2_{u_2}}{\sigma^2_{z_2}} + 1} \\ &= \frac{\operatorname{SNR}(s_1) \cdot \operatorname{SNR}(s_2)}{\operatorname{SNR}(s_1) + \operatorname{SNR}(s_2) + 1}. \end{split}$$
start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG + 1 ) end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG divide start_ARG italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG ⋅ divide start_ARG italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG end_ARG start_ARG divide start_ARG italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 
italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG + 1 end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG roman_SNR ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⋅ roman_SNR ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG start_ARG roman_SNR ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + roman_SNR ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) + 1 end_ARG . end_CELL end_ROW (25)

To simplify our notations in the remainder of this proof, we define $a=\operatorname{SNR}(s_{1})$ and $b=\operatorname{SNR}(s_{2})$. By introducing these notations in Eq. (25), we obtain the following:

\begin{equation}
\operatorname{SNR}(s)=\frac{a\cdot b}{a+b+1}.
\tag{26}
\end{equation}

Now, it remains to prove that:

\begin{equation}
\frac{a\cdot b}{a+b+1}\leq a,\;\;\frac{a\cdot b}{a+b+1}\leq b.
\tag{27}
\end{equation}

However, since $a$ and $b$ are interchangeable in Eq. (26), it is sufficient to prove only one of the inequalities. We provide the complete proof for the first inequality in Eq. (27), as the proof for the other is analogous. We consider two separate cases, $a=0$ and $a>0$.

• Case $(i)$: When $a=0$, we obtain the following inequality:

\begin{equation}
\frac{0}{b+1}\leq 0,
\tag{28}
\end{equation}

which clearly holds for any $b\geq 0$.

Table 12: Distances between feature maps at epoch $k=0$ and feature maps after the final epoch for ResNet-18 on CIFAR-10, while using the conventional training regime. Distances are independently computed for the first and last convolutional layers.

| Training Regime | Distance (First Conv Layer) | Distance (Last Conv Layer) |
| --- | --- | --- |
| conventional | 38.36 | 709.93 |

• Case $(ii)$: When $a>0$, we can divide both sides of the inequality by $a$ and arrive at:

\begin{equation}
\frac{b}{a+b+1}\leq 1.
\tag{29}
\end{equation}

Next, we multiply both sides by $a+b+1$, which is positive because $a>0$ and $b\geq 0$, obtaining:

\begin{equation}
b\leq a+b+1.
\tag{30}
\end{equation}

We can subtract $b$ from both sides and obtain the following:

\begin{equation}
0\leq a+1.
\tag{31}
\end{equation}

Since $a>0$, Eq. (31) is true. Moreover, the inequality is strict when $a>0$. This concludes our proof. ∎
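As a quick numerical sanity check of the result above, the following minimal sketch evaluates Eq. (26) on a small grid of SNR values and compares the moment-based form from Eq. (25) with the closed form. The function names (`product_snr`, `product_snr_from_moments`) and the chosen numbers are illustrative assumptions, not part of the original proof, and a finite grid check does not replace the proof itself.

```python
import itertools
import math


def product_snr(a: float, b: float) -> float:
    """SNR of the product of two signals with SNRs a and b, as in Eq. (26)."""
    if a < 0 or b < 0:
        raise ValueError("SNR values must be non-negative")
    return (a * b) / (a + b + 1.0)


def product_snr_from_moments(mu_u1, var_u1, var_z1, mu_u2, var_u2, var_z2):
    """Second line of Eq. (25), written with the raw moments."""
    p1 = mu_u1 ** 2 + var_u1  # E[u_1^2]
    p2 = mu_u2 ** 2 + var_u2  # E[u_2^2]
    return (p1 * p2) / (p1 * var_z2 + p2 * var_z1 + var_z1 * var_z2)


# Hypothetical grid of SNR values, including the boundary case a = 0.
grid = [0.0, 0.01, 0.5, 1.0, 3.0, 10.0, 1e3]
bound_holds = all(
    product_snr(a, b) <= min(a, b)
    for a, b in itertools.product(grid, repeat=2)
)

# Hypothetical moments: SNR(s_1) = (2^2 + 1) / 0.5 = 10, SNR(s_2) = (1^2 + 3) / 2 = 2.
moments_match = math.isclose(
    product_snr_from_moments(2.0, 1.0, 0.5, 1.0, 3.0, 2.0),
    product_snr(10.0, 2.0),
)

print("bound holds on grid:", bound_holds)
print("moment form matches closed form:", moments_match)
```

The grid deliberately includes $a=0$ so that both cases of the proof are exercised.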

Corollary 1.

Let $\{s_{1},s_{2},\dots,s_{n}\}$ be a set of $n$ signals, where each signal $s_{i}=u_{i}+z_{i}$ is formed of a clean component $u_{i}$ and a noise component $z_{i}$. The following inequality holds:

\begin{equation}
\operatorname{SNR}\left(\prod_{i=1}^{p}s_{i}\right)\leq\operatorname{SNR}\left(\prod_{j=1}^{p-1}s_{j}\right),\;\forall p\in\{2,\dots,n\}.
\tag{32}
\end{equation}

Proof.

The proof follows immediately by induction from Theorem 1. Note that the inequality is strict when $\operatorname{SNR}(s_{i})>0,\forall i\in\{1,2,\dots,p\}$. ∎
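To illustrate the corollary, the sketch below repeatedly applies the two-signal formula from Eq. (26) to a running product and records the SNR after each multiplication. It assumes that every partial product can itself be treated as a signal satisfying the hypotheses of Theorem 1; the function name `cascade_snr` and the example SNR values are hypothetical.

```python
def cascade_snr(snrs):
    """Return the SNR of the product of the first p signals, for p = 1..n.

    Each step applies Eq. (26) to the running product and the next signal.
    """
    running = snrs[0]
    history = [running]
    for snr in snrs[1:]:
        running = (running * snr) / (running + snr + 1.0)
        history.append(running)
    return history


# Hypothetical setting: an informative input (SNR 100) multiplied by four
# noisy factors (SNR 0.5 each), loosely mimicking randomly initialized layers.
history = cascade_snr([100.0, 0.5, 0.5, 0.5, 0.5])
for p, value in enumerate(history, start=1):
    print(f"p = {p}: SNR = {value:.4f}")
```

Under these assumptions, the printed sequence never increases from one value of $p$ to the next, which is the behavior stated in Eq. (32).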

Table 13: Entropy after $k=6$ epochs for ResNet-18 on CIFAR-10, while alternating between the conventional and LeRaC training regimes.

| Training Regime | Entropy (First Conv Layer) | Entropy (Last Conv Layer) |
| --- | --- | --- |
| conventional | 0.9965 | 0.9905 |
| LeRaC (ours) | 0.9970 | 0.9968 |

Figure 7: Activation maps with low and high entropy from the first and last conv layers of ResNet-18 trained on CIFAR-10 for $k=6$ epochs with the conventional (baseline) and LeRaC (ours) regimes. The input images are taken from ImageNet. Best viewed in color.

We employ Corollary 1 in the context of neural networks, where the input signal, which is expected to bear meaningful information and thus have a high SNR, is initially multiplied with random weights, which are expected to have low SNR values just after initialization. According to Corollary 1, the SNR of the resulting signal (features) gradually decreases, layer by layer. In this context, we conjecture that optimizing the weights $\theta_{i}$ of layer $i$ to learn patterns from the signal (features) given as input to layer $i$ is suboptimal for layers that are sufficiently far away from the input. This happens because the respective features (passed to layer $i$) can contain a large amount of noise, which can derail the network towards adapting the weights $\theta_{i}$ to the noise instead of the clean signal. This phenomenon becomes more and more prevalent as the layer $i$ is placed farther away from the input. To regulate this phenomenon during the initial stages of the learning process, we propose to employ LeRaC and gradually decrease the learning rate as layers get deeper, allowing the network to optimize the earlier weights sooner. We underline that training the earlier layers also reduces the amount of noise in later layers, since the amount of noise in later layers is bounded by the amount of noise in earlier layers (according to Corollary 1). As the amount of noise in later layers is progressively diminished, we can gradually increase the learning rates of later layers, allowing them to optimize their weights to cleaner signals (meaningful patterns).
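The idea described above can be sketched as a per-layer learning-rate schedule. The snippet below is only an illustration under stated assumptions: the geometric spacing of the initial rates, the exponential interpolation towards the common target rate, and all names and values (`initial_learning_rates`, `learning_rate_at`, `target_lr`, `min_lr`, `warmup_steps`) are hypothetical choices and may differ from the exact update rule defined in the main text.

```python
def initial_learning_rates(target_lr, min_lr, num_layers):
    """Geometrically spaced initial rates: target_lr for the first layer,
    decreasing with depth down to min_lr for the last layer (assumption)."""
    if num_layers == 1:
        return [target_lr]
    ratio = (min_lr / target_lr) ** (1.0 / (num_layers - 1))
    return [target_lr * ratio ** j for j in range(num_layers)]


def learning_rate_at(initial_lr, target_lr, step, warmup_steps):
    """Move exponentially from initial_lr to target_lr over warmup_steps,
    then keep the common target rate (conventional training)."""
    if step >= warmup_steps:
        return target_lr
    progress = step / warmup_steps
    return initial_lr * (target_lr / initial_lr) ** progress


# Hypothetical configuration.
target_lr, min_lr, num_layers, warmup_steps = 1e-1, 1e-8, 5, 10
init = initial_learning_rates(target_lr, min_lr, num_layers)

# schedule[step][layer] holds the learning rate of each layer at each step.
schedule = [
    [learning_rate_at(lr0, target_lr, step, warmup_steps) for lr0 in init]
    for step in range(warmup_steps + 1)
]

for step in (0, warmup_steps // 2, warmup_steps):
    print(step, ["%.2e" % lr for lr in schedule[step]])
```

In a framework that supports per-parameter-group learning rates (for instance, the parameter groups accepted by PyTorch optimizers), each row of `schedule` could be written into the groups before the corresponding training step. Deeper layers start with smaller rates and catch up faster, so all layers share the same rate once the warm-up ends.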

Appendix B Empirical Proof

Noise quantification of early and later layers. The application of LeRaC is justified by the fact that the level of noise gradually grows with each layer during a forward pass through a neural network with randomly initialized weights. To empirically confirm this statement, we compute the distances between the activation maps at iteration $0$ (based on random weights) and at the last iteration (based on weights optimized until convergence), for both the low-level (first conv) and high-level (last conv) layers of ResNet-18 on CIFAR-10, while using the conventional training regime. The computed distances shown in Table 12 confirm our conjecture, namely that shallow layers contain less noise than deep layers when applying the conventional training regime.

Table 14: Distances between feature maps at epoch $k=6$ and feature maps after the final epoch for ResNet-18 on CIFAR-10, while alternating between the conventional and LeRaC training regimes. Distances are independently computed for the first and last convolutional layers.

| Training Regime | Distance (First Conv Layer) | Distance (Last Conv Layer) |
| --- | --- | --- |
| conventional | 0.60 | 0.37 |
| LeRaC (ours) | 0.61 | 0.66 |

Entropy of low-level versus high-level features. We show a few examples of training dynamics in Figure 3. All four graphs exhibit a higher gap between CBS and LeRaC in the first half of the training process, suggesting that LeRaC has an important role towards faster convergence. To assess the comparative quality of low-level versus high-level feature maps obtained either with conventional or LeRaC training, we compute the entropy of the first and last conv layers of ResNet-18 on CIFAR-10, after $k=6$ epochs. We report the computed entropy levels in Table 13. Conventional training seems to update deeper layers faster: the difference between the entropy levels of low-level and high-level features is higher with conventional training than with LeRaC. This shows that LeRaC balances the training pace of low-level and high-level features. We conjecture that updating the deeper layers too soon could lead to overfitting to the noise still present in the early layers. This statement is supported by our empirical results on 12 data sets, showing that giving a chance to the early layers to converge before introducing large updates to the later layers leads to superior performance.
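The text does not specify how the entropy of a layer is computed. Purely as an illustration, the sketch below computes a normalized Shannon entropy over a histogram of activation values, which yields scores in $[0,1]$. The histogram-based estimator, the number of bins and the function name `normalized_entropy` are assumptions and may not match the procedure behind Table 13.

```python
import math


def normalized_entropy(values, num_bins=16):
    """Shannon entropy of a histogram of activation values, divided by
    log(num_bins) so that the result lies in [0, 1] (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant activation map carries no variability
    counts = [0] * num_bins
    for v in values:
        # Map each value to a bin; the maximum falls into the last bin.
        idx = min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(num_bins)


# Hypothetical flattened activation maps.
spread_out = [float(v) for v in range(16)]  # values cover every bin evenly
constant = [2.0] * 10                       # no variability at all
print(normalized_entropy(spread_out), normalized_entropy(constant))
```

With such a score, a layer whose activations spread evenly over their range is rated close to 1, while a layer with collapsed activations is rated close to 0.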

Aside from computing the global entropy over all training samples, in Figure 7, we illustrate some activation maps with the highest and lowest entropy from the first and last conv layers for three randomly chosen examples from ImageNet. The activation maps are extracted at epoch $k=6$ from the ResNet-18 model trained on CIFAR-10 either with the conventional regime, the CBS regime or the LeRaC regime. In general, we observe that the low-level activation maps corresponding to LeRaC and CBS exhibit a higher degree of variability (being more distinct from each other), regardless of the entropy level (low or high). In the case of LeRaC, we believe the higher degree of variability comes from the fact that, having lower learning rates for the deeper layers, the model based on LeRaC is likely focused on finding a higher variety of patterns within the first layers to minimize the loss. Similarly, in the case of CBS, blurring the intermediary feature maps reduces the information propagated within the network. This compels the lower layers to identify and learn more distinctive patterns to minimize the loss. However, in general, the patterns found by LeRaC are more diverse. For instance, in the case of CBS, the low-level activation maps of the first image show greater similarity to each other, in contrast to those generated by LeRaC. For the third example (the image of an airplane), we observe that the activation maps with the highest entropy from the last conv layer produced by LeRaC have a higher entropy than the activation maps with the highest entropy produced by the conventional regime. This observation is in line with the results reported in Table 13, confirming that LeRaC is able to better balance the entropy of low-level and high-level features by preventing the faster convergence of the deeper layers.

Distances at epoch $\mathbf{k}$ versus final epoch. As discussed above, in Table 13, we report the entropy of the low-level and high-level layers after $k=6$ epochs, when training ResNet-18 on CIFAR-10 with the conventional and LeRaC regimes. However, we consider that using the distance to the final feature maps provides additional useful insights about how LeRaC works. To this end, we compute the Euclidean distances of both low-level and high-level features between epoch $k$ and the final epoch, for the conventional and LeRaC regimes. We report the distances in Table 14. The computed distances confirm our previous observations, namely that LeRaC is capable of balancing the training pace of low-level and high-level layers.
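A minimal sketch of the distance computation described above is given next. It assumes that each feature map is flattened into a list of floats and that the per-sample Euclidean distances are averaged over the evaluated samples; the averaging step, any normalization of the features and the function names are assumptions, since the text only states that Euclidean distances are used.

```python
import math


def euclidean_distance(features_a, features_b):
    """Euclidean (L2) distance between two equally sized, flattened feature maps."""
    if len(features_a) != len(features_b):
        raise ValueError("feature maps must have the same number of elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(features_a, features_b)))


def mean_distance(maps_epoch_k, maps_final):
    """Average per-sample distance between epoch-k and final-epoch feature maps."""
    distances = [euclidean_distance(a, b) for a, b in zip(maps_epoch_k, maps_final)]
    return sum(distances) / len(distances)


# Hypothetical flattened feature maps for two samples.
maps_epoch_k = [[0.0, 0.0], [1.0, 1.0]]
maps_final = [[3.0, 4.0], [1.0, 1.0]]
example_distance = mean_distance(maps_epoch_k, maps_final)
print(example_distance)
```

Computing this quantity separately for the first and last conv layers gives one value per layer and regime, which is the layout of Table 14.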

Appendix C Additional Experiments

Table 15: Average accuracy rates (in %) over 5 runs for Wide-ResNet-50 on CIFAR-100 using different optimizers and training regimes (conventional versus LeRaC). The accuracy of the best training regime is highlighted in bold.

| Model | Optimizer | Training Regime | Accuracy |
| --- | --- | --- | --- |
| Wide-ResNet-50 | Adam | conventional | 66.48 ± 0.50 |
| Wide-ResNet-50 | SGD | conventional | 68.14 ± 0.16 |
| Wide-ResNet-50 | SGD | LeRaC (ours) | **69.38** ± 0.26 |

Figure 8: Test accuracy (on the y-axis) versus training time (on the x-axis) for ResNet-18 on CIFAR-10 with various configurations for the initial learning rates. Dashed lines correspond to the conventional regime, while continuous lines correspond to LeRaC. The different colors correspond to different initial learning rates. Best viewed in color.

Figure 9: Test accuracy (on the y-axis) versus training time (on the x-axis) for ResNet-18 on CIFAR-100 with various configurations for the initial learning rates. Dashed lines correspond to the conventional regime, while continuous lines correspond to LeRaC. The different colors correspond to different initial learning rates. Best viewed in color.

Figure 10: Test accuracy (on the y-axis) versus training time (on the x-axis) for the pre-trained CvT-13 on CIFAR-10 with various configurations for the initial learning rates. Dashed lines correspond to the conventional regime, while continuous lines correspond to LeRaC. The different colors correspond to different initial learning rates. Best viewed in color.

Training progress for various initial learning rates. We compare the training progress of the conventional and LeRaC training regimes. We first consider the progress of ResNet-18 on CIFAR-10, shown in Figure 8, and on CIFAR-100, shown in Figure 9. LeRaC is consistently better than the conventional regime for all initial learning rate configurations, on both data sets. We next compare the progress on CIFAR-10 for ResNet-18, illustrated in Figure 8, and for the pre-trained CvT-13, illustrated in Figure 10. The test accuracy curve of LeRaC is consistently above that of the conventional regime, for both ResNet-18 and CvT-13. In summary, the results showcase the benefits of LeRaC for the training progress across distinct models and data sets.

Table 16: Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for CvT-13 based on different training regimes: conventional, CBS [7], LeRaC with linear update, LeRaC with exponential update (proposed), and a combination of CBS and LeRaC.

| Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet |
| --- | --- | --- | --- | --- |
| CvT-13 | conventional | 71.84 ± 0.37 | 41.87 ± 0.16 | 33.38 ± 0.27 |
| CvT-13 | CBS | 72.64 ± 0.29 | 44.48 ± 0.40 | 33.56 ± 0.36 |
| CvT-13 | LeRaC | 72.90 ± 0.28 | 43.46 ± 0.18 | 33.95 ± 0.28 |
| CvT-13 | CBS + LeRaC | 73.25 ± 0.19 | 44.90 ± 0.41 | 34.20 ± 0.61 |

Table 17: Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 using data augmentation and different training regimes (conventional versus LeRaC). The accuracy of the best training regime in each experiment is highlighted in bold.

| Model | Training Regime | Accuracy |
| --- | --- | --- |
| ResNet-18 | conventional | 72.25 ± 0.04 |
| ResNet-18 | LeRaC (ours) | **73.51** ± 0.22 |
| Wide-ResNet-50 | conventional | 65.42 ± 0.66 |
| Wide-ResNet-50 | LeRaC (ours) | **67.00** ± 0.55 |

SGD+LeRaC versus Adam. In Table 15, we present results showing that SGD and SGD+LeRaC obtain better accuracy rates than Adam [64] for the Wide-ResNet-50 model on CIFAR-100. This indicates that a simple optimizer combined with LeRaC can obtain better results than a state-of-the-art optimizer such as Adam. This justifies our decision to use a different optimizer for each neural model (see Table 1).

Combining CBS and LeRaC. Another interesting question is whether combining the CBS and LeRaC regimes brings further performance gains. We study the effect of combining CBS and LeRaC for CvT-13, since both CBS and LeRaC improve this model. In Table 16, we present the results with CvT-13 on CIFAR-10, CIFAR-100 and Tiny ImageNet. The reported results show that the combination brings accuracy gains across all three data sets. We thus conclude that combining curriculum learning regimes is worth trying whenever the two independent regimes each boost performance.

Data augmentation on vision data sets. Following Sinha et al. [7], we did not use data augmentation for the vision data sets. We consider training data augmentation as an orthogonal method for improving results, expecting improvements for both baseline and LeRaC models. Nevertheless, since we extended the experimental settings considered in Sinha et al. [7] to other domains, we took the liberty to use data augmentation in the audio domain (see the results in Table 6). The same augmentations (noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment) are used for all audio models, ensuring a fair comparison. Moreover, we next present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100 using the following augmentations: horizontal flip, rotation, solarization, blur, sharpening and auto-contrast. The results reported in Table 17 confirm that the performance gaps in the vision domain are in the same range after introducing data augmentation. In addition, we note that data augmentation seems to be rather harmful for the Wide-ResNet-50 model, which attains better results without data augmentation.

Table 18: Average accuracy rates (in %) over 5 runs for ResNet-18 on CIFAR-100 using limited training data (only 5% of the full training set) and different training regimes: conventional, CBS [7] and LeRaC. The accuracy of the best training regime is highlighted in bold.

| Training Set Size | Training Regime | Accuracy |
| --- | --- | --- |
| 5% | conventional | 23.86 ± 0.32 |
| 5% | CBS | 24.79 ± 0.17 |
| 5% | LeRaC (ours) | **25.04** ± 0.22 |

Limited data regime. In all our experiments carried out so far, the evaluated models were trained on the complete training sets. However, it is interesting to find out how our strategy behaves in a limited data regime. To this end, we conduct another experiment to compare LeRaC with the conventional and CBS regimes in a limited data scenario, considering only 5% of the training data. We present the results for ResNet-18 on CIFAR-100 in Table 18. The results indicate that LeRaC keeps its performance edge in the limited data regime. We therefore conclude that LeRaC can also be useful when limited training data is available.

References

  • Bengio et al. [2009] Bengio Y, Louradour J, Collobert R, Weston J. Curriculum Learning. In: Proceedings of ICML; 2009. p. 41–48.
  • Soviany et al. [2022] Soviany P, Ionescu RT, Rota P, Sebe N. Curriculum Learning: A Survey. International Journal of Computer Vision. 2022;130(6):1526–1565.
  • Mitchell [1997] Mitchell TM. Machine Learning. New York: McGraw-Hill; 1997.
  • Wang et al. [2022] Wang X, Chen Y, Zhu W. A Survey on Curriculum Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(9):4555–4576.
  • Burduja and Ionescu [2021] Burduja M, Ionescu RT. Unsupervised Medical Image Alignment with Curriculum Learning. In: Proceedings of ICIP; 2021. p. 3787–3791.
  • Karras et al. [2018] Karras T, Aila T, Laine S, Lehtinen J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In: Proceedings of ICLR; 2018.
  • Sinha et al. [2020] Sinha S, Garg A, Larochelle H. Curriculum by Smoothing. In: Proceedings of NeurIPS; 2020. p. 21653–21664.
  • Krizhevsky [2009] Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto; 2009.
  • Russakovsky et al. [2015] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. 2015;115(3):211–252.
  • Bossard et al. [2014] Bossard L, Guillaumin M, Van Gool L. Food-101 – Mining Discriminative Components with Random Forests. In: Proceedings of ECCV; 2014. p. 446–461.
  • Zhang et al. [2017] Zhang Z, Song Y, Qi H. Age progression/regression by conditional adversarial autoencoder. In: Proceedings of CVPR; 2017. p. 5810–5818.
  • Everingham et al. [2010] Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision. 2010;88(2):303–338.
  • Clark et al. [2019] Clark C, Lee K, Chang MW, Kwiatkowski T, Collins M, Toutanova K. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In: Proceedings of NAACL; 2019. p. 2924–2936.
  • Wang et al. [2019] Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In: Proceedings of ICLR; 2019.
  • Piczak [2015] Piczak KJ. ESC: Dataset for Environmental Sound Classification. In: Proceedings of ACMMM; 2015. p. 1015–1018.
  • Cao et al. [2014] Cao H, Cooper DG, Keutmann MK, Gur RC, Nenkova A, Verma R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing. 2014;5(4):377–390.
  • He et al. [2016] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: Proceedings of CVPR; 2016. p. 770–778.
  • Zagoruyko and Komodakis [2016] Zagoruyko S, Komodakis N. Wide Residual Networks. arXiv preprint arXiv:1605.07146. 2016.
  • Huang et al. [2017] Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: Proceedings of CVPR; 2017. p. 2261–2269.
  • Jocher et al. [2022] Jocher G, Chaurasia A, Stoken A, Borovec J, NanoCode012, Kwon Y, et al. ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo; 2022.
  • Hochreiter and Schmidhuber [1997] Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997;9(8):1735–1780.
  • Wu et al. [2021] Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, et al. CvT: Introducing Convolutions to Vision Transformers. In: Proceedings of ICCV; 2021. p. 22–31.
  • Devlin et al. [2019] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL; 2019. p. 4171–4186.
  • Ristea et al. [2022] Ristea NC, Ionescu RT, Khan FS. SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH; 2022. p. 4103–4107.
  • Dogan et al. [2020] Dogan Ü, Deshmukh AA, Machura MB, Igel C. Label-similarity curriculum learning. In: Proceedings of ECCV; 2020. p. 174–190.
  • Wang et al. [2023] Wang Y, Yue Y, Lu R, Liu T, Zhong Z, Song S, et al. EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones. In: Proceedings of ICCV; 2023. p. 5852–5864.
  • Khan et al. [2024] Khan MA, Menouar H, Hamila R. Curriculum for Crowd Counting – Is it Worthy? In: Proceedings of VISAPP; 2024. p. 583–590.
  • Khan et al. [2023] Khan M, Hamila R, Menouar H. CLIP: Train Faster with Less Data. In: Proceedings of BigComp; 2023. p. 34–39.
  • Khan et al. [2023] Khan MA, Menouar H, Hamila R. LCDnet: A Lightweight Crowd Density Estimation Model for Real-time Video Surveillance. Journal of Real-Time Image Processing. 2023;20(2):29.
  • Gui et al. [2017] Gui L, Baltrušaitis T, Morency LP. Curriculum Learning for Facial Expression Recognition. In: Proceedings of FG; 2017. p. 505–511.
  • Jiang et al. [2018] Jiang L, Zhou Z, Leung T, Li LJ, Fei-Fei L. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In: Proceedings of ICML; 2018. p. 2304–2313.
  • Shi and Ferrari [2016] Shi M, Ferrari V. Weakly Supervised Object Localization Using Size Estimates. In: Proceedings of ECCV; 2016. p. 105–121.
  • Soviany et al. [2021] Soviany P, Ionescu RT, Rota P, Sebe N. Curriculum self-paced learning for cross-domain object detection. Computer Vision and Image Understanding. 2021;204:103166.
  • Chen and Gupta [2015] Chen X, Gupta A. Webly Supervised Learning of Convolutional Networks. In: Proceedings of ICCV; 2015. p. 1431–1439.
  • Platanios et al. [2019] Platanios EA, Stretcu O, Neubig G, Poczos B, Mitchell T. Competence-based Curriculum Learning for Neural Machine Translation. In: Proceedings of NAACL; 2019. p. 1162–1172.
  • Kocmi and Bojar [2017] Kocmi T, Bojar O. Curriculum Learning and Minibatch Bucketing in Neural Machine Translation. In: Proceedings of RANLP; 2017. p. 379–386.
  • Spitkovsky et al. [2009] Spitkovsky VI, Alshawi H, Jurafsky D. Baby Steps: How “Less is More” in unsupervised dependency parsing. In: Proceedings of NIPS; 2009.
  • Liu et al. [2018] Liu C, He S, Liu K, Zhao J. Curriculum Learning for Natural Answer Generation. In: Proceedings of IJCAI; 2018. p. 4223–4229.
  • Ranjan and Hansen [2018] Ranjan S, Hansen JHL. Curriculum Learning Based Approaches for Noise Robust Speaker Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018;26:197–210.
  • Amodei et al. [2016] Amodei D, Ananthanarayanan S, Anubhai R, Bai J, Battenberg E, Case C, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In: Proceedings of ICML; 2016. p. 173–182.
  • Pentina et al. [2015] Pentina A, Sharmanska V, Lampert CH. Curriculum Learning of Multiple Tasks. In: Proceedings of CVPR; 2015. p. 5492–5500.
  • Jiménez-Sánchez et al. [2019] Jiménez-Sánchez A, Mateus D, Kirchhoff S, Kirchhoff C, Biberthaler P, Navab N, et al. Medical-based Deep Curriculum Learning for Improved Fracture Classification. In: Proceedings of MICCAI; 2019. p. 694–702.
  • Wei et al. [2021] Wei J, Suriawinata A, Ren B, Liu X, Lisovsky M, Vaickus L, et al. Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification. In: Proceedings of WACV; 2021. p. 2472–2482.
  • Cirik et al. [2016] Cirik V, Hovy E, Morency LP. Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks. arXiv preprint arXiv:1611.06204. 2016.
  • Tay et al. [2019] Tay Y, Wang S, Luu AT, Fu J, Phan MC, Yuan X, et al. Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives. In: Proceedings of ACL; 2019. p. 4922–4931.
  • Zhang et al. [2021] Zhang W, Wei W, Wang W, Jin L, Cao Z. Reducing BERT Computation by Padding Removal and Curriculum Learning. In: Proceedings of ISPASS; 2021. p. 90–92.
  • Ionescu et al. [2016] Ionescu RT, Alexe B, Leordeanu M, Popescu M, Papadopoulos DP, Ferrari V. How Hard Can It Be? Estimating the Difficulty of Visual Search in an Image. In: Proceedings of CVPR; 2016. p. 2157–2166.
  • Gong et al. [2016] Gong C, Tao D, Maybank SJ, Liu W, Kang G, Yang J. Multi-Modal Curriculum Learning for Semi-Supervised Image Classification. IEEE Transactions on Image Processing. 2016;25(7):3249–3260.
  • Hacohen and Weinshall [2019] Hacohen G, Weinshall D. On The Power of Curriculum Learning in Training Deep Networks. In: Proceedings of ICML; 2019. p. 2535–2544.
  • Kumar et al. [2010] Kumar M, Packer B, Koller D. Self-Paced Learning for Latent Variable Models. In: Proceedings of NIPS. vol. 23; 2010. p. 1189–1197.
  • Gong et al. [2019] Gong M, Li H, Meng D, Miao Q, Liu J. Decomposition-Based Evolutionary Multiobjective Optimization to Self-Paced Learning. IEEE Transactions on Evolutionary Computation. 2019;23(2):288–302.
  • Fan et al. [2017] Fan Y, He R, Liang J, Hu BG. Self-Paced Learning: An Implicit Regularization Perspective. In: Proceedings of AAAI; 2017. p. 1877–1883.
  • Li et al. [2016] Li H, Gong M, Meng D, Miao Q. Multi-Objective Self-Paced Learning. In: Proceedings of AAAI; 2016. p. 1802–1808.
  • Zhou et al. [2018] Zhou S, Wang J, Meng D, Xin X, Li Y, Gong Y, et al. Deep self-paced learning for person re-identification. Pattern Recognition. 2018;76:739–751.
  • Jiang et al. [2015] Jiang L, Meng D, Zhao Q, Shan S, Hauptmann AG. Self-Paced Curriculum Learning. In: Proceedings of AAAI; 2015. p. 2694–2700.
  • Ristea and Ionescu [2021] Ristea NC, Ionescu RT. Self-paced ensemble learning for speech and audio classification. In: Proceedings of INTERSPEECH; 2021. p. 2836–2840.
  • Ma et al. [2017] Ma F, Meng D, Xie Q, Li Z, Dong X. Self-Paced Co-training. In: Proceedings of ICML. vol. 70; 2017. p. 2275–2284.
  • Jiang et al. [2014] Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann AG. Self-Paced Learning with Diversity. In: Proceedings of NIPS; 2014. p. 2078–2086.
  • Zhang et al. [2019] Zhang M, Yu Z, Wang H, Qin H, Zhao W, Liu Y. Automatic Digital Modulation Classification Based on Curriculum Learning. Applied Sciences. 2019;9(10).
  • Wu et al. [2018] Wu L, Tian F, Xia Y, Fan Y, Qin T, Lai JH, et al. Learning to Teach with Dynamic Loss Functions. In: Proceedings of NeurIPS. vol. 31; 2018. p. 6467–6478.
  • Singh et al. [2015] Singh B, De S, Zhang Y, Goldstein T, Taylor G. Layer-specific adaptive learning rates for deep networks. In: Proceedings of ICMLA; 2015. p. 364–368.
  • You et al. [2017] You Y, Gitman I, Ginsburg B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. 2017.
  • Gotmare et al. [2019] Gotmare A, Keskar NS, Xiong C, Socher R. A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation. In: Proceedings of ICLR; 2019.
  • Kingma and Ba [2015] Kingma DP, Ba JL. Adam: A Method for Stochastic Optimization. In: Proceedings of ICLR; 2015.
  • Loshchilov and Hutter [2019] Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. In: Proceedings of ICLR; 2019.
  • Glorot and Bengio [2010] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS; 2010. p. 249–256.
  • Wang et al. [2019] Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In: Proceedings of NeurIPS. vol. 32; 2019. p. 3266–3280.
  • Rajpurkar et al. [2016] Rajpurkar P, Zhang J, Lopyrev K, Liang P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In: Proceedings of EMNLP; 2016. p. 2383–2392.
  • Wang et al. [2020] Wang CY, Liao HYM, Wu YH, Chen PY, Hsieh JW, Yeh IH. CSPNet: A new backbone that can enhance learning capability of CNN. In: Proceedings of CVPRW; 2020. p. 390–391.
  • Lin et al. [2014] Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common Objects in Context. In: Proceedings of ECCV; 2014. p. 740–755.
  • Park et al. [2019] Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In: Proceedings of INTERSPEECH; 2019. p. 2613–2617.
  • Dietterich [1998] Dietterich TG. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. 1998;10(7):1895–1923.