Radu Tudor Ionescu [1]
[1] Department of Computer Science, University of Bucharest, 14 Academiei, Bucharest, 010014, Romania
[2] Faculty of Electronics, Telecommunications, and Information Technology, National University of Science and Technology Politehnica Bucharest, 313 Splaiul Independentei, Bucharest, 060042, Romania
[3] Department of Information Engineering and Computer Science, University of Trento, 9 via Sommarive, Povo-Trento, 38123, Italy
Learning Rate Curriculum
Abstract
Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages the use of a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet-200, Food-101, UTKFace, PASCAL VOC), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121, YOLOv5), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures. We compare our approach with the conventional training regime, as well as with Curriculum by Smoothing (CBS), a state-of-the-art data-agnostic curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time (there is no additional cost over the standard training regime for LeRaC). Our code is freely available at: https://github.com/CroitoruAlin/LeRaC.
1 Introduction
Curriculum learning [1] refers to efficiently training effective neural networks by mimicking how humans learn, from easy to hard. As originally introduced by Bengio et al. [1], curriculum learning is a training procedure that first organizes the examples in their increasing order of difficulty, then starts the training of the neural network on the easiest examples, gradually adding increasingly more difficult examples along the way, until all training examples are fed into the network. The success of the approach relies on avoiding imposing the learning of very difficult examples right from the beginning, instead guiding the model on the right path through the imposed curriculum. This type of curriculum is later referred to as data-level curriculum learning [2]. Indeed, Soviany et al. [2] identified several types of curriculum learning approaches in the literature, dividing them into four categories based on the components involved in the definition of machine learning given by Mitchell [3]. The four categories are: data-level curriculum (examples are presented from easy to hard), model-level curriculum (the modeling capacity of the network is gradually increased), task-level curriculum (the complexity of the learning task is increased during training), and objective-level curriculum (the model optimizes towards an increasingly more complex objective). While data-level curriculum is the most natural and direct way to employ curriculum learning, its main disadvantage is that it requires a way to determine the difficulty of data samples. Despite having many successful applications [2, 4], there is no universal way to determine the difficulty of the data samples, making the data-level curriculum less applicable to scenarios where the difficulty is hard to estimate, e.g. classification of radar signals. The task-level and objective-level curriculum learning strategies suffer from similar issues, e.g. it is hard to create a curriculum when the model has to learn an easy task (binary classification) or the objective function is already convex.
Considering the above observations, we recognize the potential of model-level curriculum learning strategies to be applicable across a wider range of domains and tasks. To date, there are only a few works [5, 6, 7] in the category of pure model-level curriculum learning methods. However, these methods have some drawbacks caused by their domain-dependent or architecture-specific design. To benefit from the full potential of the model-level curriculum learning category, we propose LeRaC (Learning Rate Curriculum), a novel and simple curriculum learning approach which leverages the use of a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. This reduces the propagation of noise caused by the multiplication operations inside the network, a phenomenon that is more prevalent when the weights are randomly initialized. The learning rates increase at various paces during the first training iterations, until they all reach the same value, as illustrated in Figure 1. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that is applicable to any domain and compatible with any neural network, generating higher performance levels regardless of the architecture, without adding any extra training time. To the best of our knowledge, we are the first to employ a different learning rate per layer to achieve the same effect as conventional (data-level) curriculum learning.
![Refer to caption](x2.png)
As hinted above, the underlying hypothesis that justifies the use of LeRaC is that the level of noise grows from one neural layer to the next, especially when the input is multiplied with randomly initialized weights having low signal-to-noise ratios. We briefly illustrate this phenomenon through an example. Suppose an image $x$ is successively convolved with a set of random filters $c_1$, $c_2$, …, $c_n$. Since the filters are uncorrelated, each filter distorts the image in a different way, degrading the information in $x$ with each convolution. The information in $x$ is gradually replaced by noise (see Fig. 2), i.e. the signal-to-noise ratio decreases with each layer. Optimizing the filter $c_n$ to learn a pattern from the image convolved with $c_1$, $c_2$, …, $c_{n-1}$ is suboptimal, because the filter $c_n$ will adapt to the noisy (biased) activation map induced by filters $c_1$, $c_2$, …, $c_{n-1}$. This suggests that earlier filters need to be optimized sooner to reduce the level of noise of the activation map passed to layer $n$. In general, this phenomenon becomes more obvious as the layers get deeper, since the number of multiplication operations grows along the way. Hence, in the initial training stages, it makes sense to use gradually lower learning rates, as the layers get farther away from the input. Our hypothesis is theoretically supported by Theorem 1, and empirically validated in Appendix B.
We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10 [8], CIFAR-100 [8], Tiny ImageNet [9], ImageNet-200 [9], Food-101 [10], UTKFace [11], PASCAL VOC [12]), language (BoolQ [13], QNLI [14], RTE [14]) and audio (ESC-50 [15], CREMA-D [16]) domains, considering various convolutional (ResNet-18 [17], Wide-ResNet-50 [18], DenseNet-121 [19], YOLOv5 [20]), recurrent (LSTM [21]) and transformer (CvT [22], BERT [23], SepTr [24]) architectures. We compare our approach with the conventional training regime and Curriculum by Smoothing (CBS) [7], our closest competitor. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time, since there is no additional cost over the conventional training regime for LeRaC, whereas CBS adds Gaussian smoothing layers. We also compare with several data-level and task-level curriculum learning methods [25, 26, 27, 28, 29], and show that our method scores best in most of the experiments.
In summary, our contribution is threefold:
- We propose a novel and simple model-level curriculum learning strategy that creates a curriculum by updating the weights of each neural layer with a different learning rate, considering higher learning rates for the low-level feature layers and lower learning rates for the high-level feature layers.
- We empirically demonstrate the applicability to multiple domains (image, audio and text), the compatibility with several neural network architectures (convolutional neural networks, recurrent neural networks and transformers), and the time efficiency (no extra training time added) of LeRaC through a comprehensive set of experiments.
- We support our underlying hypothesis, which states that the level of noise increases from one neural layer to another, both theoretically and empirically.
2 Related Work
2.1 Curriculum Learning
Curriculum learning was initially introduced by Bengio et al. [1] as a training strategy that helps machine learning models to generalize better when the training examples are presented in the ascending order of their difficulty. Extensive surveys on curriculum learning methods, including the most recent advancements on the topic, were conducted by Soviany et al. [2] and Wang et al. [4]. In the former survey, Soviany et al. [2] emphasized that curriculum learning is not only applied at the data level, but also with respect to the other components involved in a machine learning approach, namely at the model level, the task level and the objective (loss) level. Regardless of the component on which curriculum learning is applied, the technique has demonstrated its effectiveness on a broad range of machine learning tasks, from computer vision [1, 30, 31, 32, 33, 34, 7, 27, 28, 29] to natural language processing [35, 36, 37, 38, 1] and audio processing [39, 40].
The main challenge for the methods that build the curriculum at the data level is measuring the difficulty of the data samples, which is required to order the samples from easy to hard. Most studies have addressed the problem with human input [41, 42, 43] or metrics based on domain-specific heuristics. For instance, the text length [36, 44, 45, 46] and the word frequency [1, 38] have been employed in natural language processing. In computer vision, the samples containing fewer and larger objects have been considered to be easier in some works [33, 32]. Other solutions employed difficulty estimators [47] or even the confidence level of the predictions made by the neural network [48, 49] to approximate the complexity of the data samples. Other studies [27, 28, 29] used the error of a previously trained model to estimate the difficulty of each sample. Such solutions have shown their utility in specific application domains. Nonetheless, measuring the difficulty remains problematic when implementing standard (data-level) curriculum learning strategies, at least in some application domains. Therefore, several alternatives have emerged over time, handling the drawback and improving the conventional curriculum learning approach. In Kumar et al. [50], the authors introduced self-paced learning to evaluate the learning progress when selecting training samples. The method was successfully employed in multiple settings [50, 51, 52, 53, 54, 55, 56]. Furthermore, some studies combined self-paced learning with the traditional pre-computed difficulty metrics [55, 57]. An additional advancement related to self-paced learning is the approach called self-paced learning with diversity [58]. The authors demonstrated that enforcing a certain level of variety among the selected examples can improve the final performance. Another set of methods that bypass the need for predefined difficulty metrics is known as teacher-student curriculum learning [59, 60]. 
In this setting, a teacher network learns a curriculum to supervise a student neural network.
Closer to our work, a few methods [6, 7, 5] proposed to apply curriculum learning at the model level, by gradually increasing the learning capacity (complexity) of the neural architecture. Such curriculum learning strategies do not need to know the difficulty of the data samples, thus having a great potential to be useful in a broad range of tasks. For example, Karras et al. [6] proposed to gradually add layers to generative adversarial networks during training, while increasing the resolution of the input images at the same time. They are thus able to generate realistic high-resolution images. However, their approach is not applicable to every domain, since there is no notion of resolution for some input data types, e.g. text. Sinha et al. [7] presented a strategy that blurs the activation maps of the convolutional layers using Gaussian kernel layers, reducing the noisy information caused by the network initialization. The blur level is progressively reduced to zero by decreasing the standard deviation of the Gaussian kernels. With this mechanism, they obtain a training procedure that allows the neural network to see simple information at the start of the process and more intricate details towards the end. Curriculum by Smoothing (CBS) [7] was only shown to be useful for convolutional architectures applied in the image domain. Although we found that CBS is applicable to transformers by blurring the tokens, it is not necessarily applicable to any neural architecture, e.g. standard feed-forward neural networks. As an alternative to CBS, Burduja and Ionescu [5] proposed to apply the same smoothing process on the input image instead of the activation maps. The method was applied with success in medical image alignment. However, this approach is not applicable to natural language input, as it is not clear how to apply the blurring operation on the input text.
Different from Burduja and Ionescu [5] and Karras et al. [6], our approach is applicable to various domains, including but not limited to natural language processing, as demonstrated throughout our experiments. To the best of our knowledge, the only competing model-level curriculum method which is applicable to various domains is CBS [7]. Unlike CBS, LeRaC does not introduce new operations, such as smoothing with Gaussian kernels, during training. As such, our approach does not increase the training time with respect to the conventional training regime, as later shown in the experiments included in Section 4.
To classify our approach as a curriculum learning framework, we consider the extreme case when the learning rate is set to zero for later layers, which is equivalent to freezing those layers. This clearly reduces the learning capacity of the model. If layers are unfrozen one by one, the capacity of the model grows. LeRaC can be seen as a soft version of the model-level curriculum method described above. We thus classify LeRaC as a model-level curriculum method. However, our method can also be seen as a curriculum learning strategy that simplifies the optimization [41, 42, 43, 36, 44, 45, 46, 1, 38] in the early training stages by restricting the model updates (in a soft manner) to certain directions (corresponding to the weights of the earlier layers). Due to the imposed soft restrictions (lower learning rates for deeper layers), the optimization is easier at the beginning. As the training progresses, all directions become equally important, and the network is permitted to optimize the loss function in any direction. As the number of directions grows, the optimization task becomes more complex (it is harder to find the optimum). Hence, a relationship to curriculum learning can be discovered by noting that the complexity of the optimization increases over time, just as in curriculum learning.
In summary, we consider that the simplicity of our approach comes with many important advantages: applicability to any domain and task, compatibility with any neural network architecture, and time efficiency (adds no extra training time). We support all these claims through the comprehensive experiments presented in Section 4.
2.2 Learning Rate Schedulers
There are some contributions [61, 62] showing that using adaptive learning rates can lead to improved results. We explain how our method is different below. In [61], the main goal is increasing the learning rate of certain layers as necessary, to escape saddle points. Different from Singh et al. [61], our strategy reduces the learning rates of deeper layers, introducing soft optimization restrictions in the initial training epochs. You et al. [62] proposed to train models with very large batches using a learning rate for each layer, by scaling the learning rate with respect to the norms of the gradients. The goal of You et al. [62] is to specifically learn models with large batch sizes, e.g. formed of 8K samples. Unlike You et al. [62], we propose a more generic approach that can be applied to multiple architectures (convolutional, recurrent, transformer) under unrestricted training settings.
Gotmare et al. [63] point out that learning rate with warm-up and restarts is an effective strategy to improve stability of training neural models using large batches. Different from LeRaC, this approach does not employ a different learning rate for each layer. Moreover, the strategy restarts the learning rate at different moments during the entire training process, while LeRaC is applied only during the first few training epochs.
2.3 Optimizers
We consider Adam [64] and related optimizers as orthogonal approaches that perform the optimization rather than setting the learning rate. Our approach, LeRaC, only aims to guide the optimization during the initial training iterations by reducing the relevance of optimizing deeper network layers. Most of the baseline architectures used in our experiments are already based on Adam or some of its variations, e.g. AdaMax, AdamW [65]. LeRaC is applied in conjunction with these optimizers, showing improved performance over various architectures and application domains. This supports our claim that LeRaC is an orthogonal contribution to the family of Adam optimizers.
3 Method
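Deep neural networks are commonly trained on a set of labeled data samples denoted as:
$$S = \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y,\ \forall i \in \{1, 2, \dots, m\}\}, \tag{1}$$
where $m$ is the number of examples, $x_i$ is a data sample and $y_i$ is the associated label. The training process of a neural network $f$ with parameters $\theta$ consists of minimizing some objective (loss) function $\mathcal{L}$ that quantifies the differences between the ground-truth labels and the predictions of the model $f$:
$$\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(y_i, f(x_i, \theta)\right). \tag{2}$$
The optimization is generally performed by some variant of Stochastic Gradient Descent (SGD), where the gradients are back-propagated from the neural layers closer to the output towards the neural layers closer to the input through the chain rule. Let $f_1$, $f_2$, …, $f_n$ and $\theta_1$, $\theta_2$, …, $\theta_n$ denote the neural layers and the corresponding weights of the model $f$, such that the weights $\theta_j$ belong to the layer $f_j$, $\forall j \in \{1, 2, \dots, n\}$. The output of the neural network for some training data sample $x_i \in X$ is formally computed as follows:
$$\hat{y}_i = f(x_i, \theta) = f_n\left(\dots f_2\left(f_1(x_i, \theta_1), \theta_2\right) \dots, \theta_n\right). \tag{3}$$
To optimize the model via SGD, the weights are updated as follows:
$$\theta_j^{(t+1)} = \theta_j^{(t)} - \eta^{(t)} \cdot \frac{\partial \mathcal{L}}{\partial \theta_j^{(t)}}, \quad \forall j \in \{1, 2, \dots, n\}, \tag{4}$$
where $t$ is the index of the current training iteration, $\eta^{(t)} > 0$ is the learning rate at iteration $t$, and the gradient of $\mathcal{L}$ with respect to $\theta_j^{(t)}$ is computed via the chain rule. Before starting the training process, the weights $\theta_j^{(0)}$ are commonly initialized with random values, e.g. using Glorot initialization [66].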
Sinha et al. [7] suggested that the random initialization of the weights produces a large amount of noise in the information propagated through the neural model during the early training iterations, which can negatively impact the learning process. Due to the feed-forward processing that involves several multiplication operations, we argue that the noise level grows with each neural layer, from $f_1$ to $f_n$. This statement is confirmed by the following theorem:
Theorem 1.
Let $s_1 = u_1 + z_1$ and $s_2 = u_2 + z_2$ be two signals, where $u_1$ and $u_2$ are the clean components, and $z_1$ and $z_2$ are the noise components. The signal-to-noise ratio of the product between the two signals is lower than the signal-to-noise ratios of the two signals, i.e.:
$$\mathrm{SNR}(s_1 \cdot s_2) \leq \mathrm{SNR}(s_i), \quad \forall i \in \{1, 2\}. \tag{5}$$
Proof.
The proof is given in Appendix A. ∎
The same issue can occur if the weights are pre-trained on a distinct task, where the misalignment of the weights with a new task is likely higher for the high-level (specialized) feature layers. To alleviate this problem, we propose to introduce a curriculum learning strategy that assigns a different learning rate $\eta_j$ to each layer $f_j$, as follows:
$$\theta_j^{(t+1)} = \theta_j^{(t)} - \eta_j^{(t)} \cdot \frac{\partial \mathcal{L}}{\partial \theta_j^{(t)}}, \quad \forall j \in \{1, 2, \dots, n\}, \tag{6}$$
such that:
$$\eta^{(0)} \geq \eta_1^{(0)} \geq \eta_2^{(0)} \geq \dots \geq \eta_n^{(0)}, \tag{7}$$
$$\eta^{(k)} = \eta_1^{(k)} = \eta_2^{(k)} = \dots = \eta_n^{(k)}, \tag{8}$$
where $\eta_j^{(0)}$ are the initial learning rates and $\eta_j^{(k)}$ are the updated learning rates at iteration $k$. The condition formulated in Eq. (7) indicates that the initial learning rate $\eta_j^{(0)}$ of a neural layer $f_j$ gets lower as the level of the respective neural layer becomes higher (farther away from the input). With each training iteration $t \leq k$, the learning rates are gradually increased, until they become equal, according to Eq. (8). Thus, our curriculum learning strategy is only applied during the early training iterations, where the noise caused by the misfit (randomly initialized or pre-trained) weights is most prevalent. Hence, $k$ is a hyperparameter of LeRaC that is usually adjusted such that $k \ll T$, where $T$ is the total number of training iterations.
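The per-layer update of Eq. (6) and the ordering constraint of Eq. (7) can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the helper names `is_valid_lerac_init` and `layerwise_sgd_step`, the three-layer setup, and all numeric values are hypothetical.

```python
from typing import List


def is_valid_lerac_init(base_lr: float, layer_lrs: List[float]) -> bool:
    """Check the ordering of Eq. (7): base rate >= layer 1 >= ... >= layer n."""
    chain = [base_lr] + list(layer_lrs)
    return all(a >= b for a, b in zip(chain, chain[1:]))


def layerwise_sgd_step(weights, grads, layer_lrs):
    """Apply Eq. (6): every layer j is updated with its own learning rate."""
    return [
        [w - lr * g for w, g in zip(layer_w, layer_g)]
        for layer_w, layer_g, lr in zip(weights, grads, layer_lrs)
    ]


# Toy setup (hypothetical values): three layers with two weights each.
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[0.5, -0.5], [0.5, -0.5], [0.5, -0.5]]
layer_lrs = [1e-1, 1e-3, 1e-5]  # higher rates for layers closer to the input

assert is_valid_lerac_init(1e-1, layer_lrs)
updated = layerwise_sgd_step(weights, grads, layer_lrs)
```

With identical gradients, the first layer moves the most and the last layer barely changes, which is the soft restriction that the curriculum imposes in the early iterations.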
At this point, various schedulers can be used to increase each learning rate $\eta_j$ from iteration $0$ to iteration $k$. We empirically observed that an exponential scheduler is a better option than linear or logarithmic schedulers. We thus propose to employ the exponential scheduler, which is based on the following rule:
$$\eta_j^{(l)} = \eta_j^{(0)} \cdot c^{\frac{l}{k} \cdot \left(\log_c \eta_j^{(k)} - \log_c \eta_j^{(0)}\right)}, \quad \forall l \in \{0, 1, \dots, k\}. \tag{9}$$
We set $c = 10$ in Eq. (9) across all our experiments. This is because learning rates are usually expressed as a power of $10$. If we start with a learning rate of $\eta_j^{(0)}$ for some layer $j$ and we want to increase it to $\eta_j^{(k)}$ during the first 5 epochs ($k = 4$), the intermediate learning rates generated via Eq. (9) are $\eta_j^{(1)}$, $\eta_j^{(2)}$ and $\eta_j^{(3)}$, whose base-10 exponents are equally spaced between those of $\eta_j^{(0)}$ and $\eta_j^{(k)}$. We thus believe it is more intuitive to understand what happens when setting $c = 10$ in Eq. (9), as opposed to using some tuned value for $c$. To this end, we refrain from tuning $c$ and fix it to $10$.
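As a worked illustration of the exponential scheduler, the sketch below interpolates a learning rate between a start and an end value in the log domain with base 10, which is how we read Eq. (9). The function name `lerac_lr` and the endpoint values (1e-8 and 1e-4, reached after four steps) are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def lerac_lr(lr_init: float, lr_final: float, step: int, k: int, c: float = 10.0) -> float:
    """Learning rate at iteration `step` (0 <= step <= k) under the exponential rule."""
    if not 0 <= step <= k:
        raise ValueError("step must lie in [0, k]")
    # Fraction of the log-domain distance covered after `step` iterations.
    exponent = (step / k) * (math.log(lr_final, c) - math.log(lr_init, c))
    return lr_init * c ** exponent


# Hypothetical example: grow from 1e-8 to 1e-4 over five values (k = 4).
schedule = [lerac_lr(1e-8, 1e-4, step, k=4) for step in range(5)]
```

Under these assumed endpoints, each step raises the exponent by one, so the three intermediate rates are approximately 1e-7, 1e-6 and 1e-5. This is the sense in which a base of 10 keeps the schedule easy to read.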
In practice, we obtain optimal results by initializing the lowest learning rate $\eta_n^{(0)}$ with a value that is around five or six orders of magnitude lower than $\eta^{(0)}$, while the highest learning rate $\eta_1^{(0)}$ is always equal to $\eta^{(0)}$. Apart from such general practical notes, the exact LeRaC configuration for each neural architecture is established by tuning its two hyperparameters ($k$, $\eta_n^{(0)}$) on the available validation sets.
We underline that the output feature maps of a layer $j$ are affected by the misfit weights of the respective layer, and by the input feature maps, which are in turn affected by the misfit weights of the previous layers $1, \dots, j-1$. Hence, the noise affecting the feature maps increases with each layer processing the feature maps, being multiplied with the weights from each layer along the way. Our curriculum learning strategy imposes the training of the earlier layers at a faster pace, transforming the noisy weights into discriminative patterns. As noise from the earlier layer weights is eliminated, we train the later layers at faster and faster paces, until all learning rates become equal at epoch $k$.
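To make the practical notes above concrete, the following sketch builds a complete learning-rate table for a small network: one row per iteration from 0 to k, one column per layer. It is a hedged illustration rather than the reference implementation. The helper names are hypothetical, the values (base rate 1e-1, lowest rate 1e-7, four layers, k = 5) are made up, and spacing the initial rates evenly in the log domain is our assumption, since the text does not state the scale on which the intermediate rates are spread.

```python
import math


def initial_layer_lrs(base_lr, lowest_lr, n_layers):
    """Initial rates for layers 1..n, evenly spaced in log10 (assumed spacing)."""
    if n_layers == 1:
        return [base_lr]
    hi, lo = math.log10(base_lr), math.log10(lowest_lr)
    return [10 ** (hi + (lo - hi) * j / (n_layers - 1)) for j in range(n_layers)]


def lerac_schedule(base_lr, lowest_lr, n_layers, k):
    """Rows are iterations 0..k, columns are layers; the last row equals base_lr."""
    init = initial_layer_lrs(base_lr, lowest_lr, n_layers)
    target = math.log10(base_lr)
    return [
        [lr0 * 10 ** ((step / k) * (target - math.log10(lr0))) for lr0 in init]
        for step in range(k + 1)
    ]


# Hypothetical configuration: the last layer starts six orders of magnitude lower.
table = lerac_schedule(base_lr=1e-1, lowest_lr=1e-7, n_layers=4, k=5)
for step, row in enumerate(table):
    print(step, ["%.1e" % lr for lr in row])
```

The first layer keeps the base rate throughout, while deeper layers catch up progressively, so all columns coincide in the final row and training continues as usual from that point on.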
From a technical point of view, we note that our approach can also be regarded as a way to guide the optimization, which we see as an alternative to loss function smoothing. The link between curriculum learning and loss smoothing is discussed by Soviany et al. [2], who suggest that curriculum learning strategies induce a smoothing of the loss function, where the smoothing is higher during the early training iterations (simplifying the optimization) and lower to non-existent during the late training iterations (restoring the complexity of the loss function). LeRaC is aimed at producing a similar effect, but in a softer manner by dampening the importance of optimizing the weights of high-level layers in the early training iterations. Additionally, we empirically observe (see results in Appendix B) that LeRaC tends to balance the training pace of low-level and high-level features, while the conventional regime seems to update the high-level layers at a faster pace. This could provide an additional intuitive explanation of why our method works better.
4 Experiments
4.1 Data Sets
We perform experiments on 12 benchmarks: CIFAR-10 [8], CIFAR-100 [8], Tiny ImageNet [9], ImageNet-200 [9], Food-101 [10], UTKFace [11], PASCAL VOC 2007+2012 [12], BoolQ [13], QNLI [14], RTE [14], CREMA-D [16], and ESC-50 [15]. We adopt the official data splits for the 12 benchmarks considered in our experiments. When a validation set is not available, we hold out a portion of the training data for validation.
CIFAR-10. CIFAR-10 [8] is a popular data set for object recognition in images. It consists of 60,000 color images with a resolution of $32 \times 32$ pixels. An image depicts one of 10 object classes, each class having 6,000 examples. We use the official data split with a training set of 50,000 images and a test set of 10,000 images.
CIFAR-100. The CIFAR-100 [8] data set is similar to CIFAR-10, except that it has 100 classes with 600 images per class. There are 50,000 training images and 10,000 test images.
Tiny ImageNet. Tiny ImageNet is a subset of ImageNet-1K [9] which provides 100,000 training images, 25,000 validation images and 25,000 test images representing objects from 200 different classes. The size of each image is $64 \times 64$ pixels.
ImageNet. ImageNet-1K [9] is the most popular benchmark in computer vision, comprising about 1.2 million images from 1,000 object categories. We resize all images to a common resolution.
Food-101. Food-101 [10] is a data set that contains images from 101 food categories. For each category, there are 750 training images and 250 test images. Thus, the total number of images is 101,000. We resize all images to a common resolution. The test set is manually cleaned, while the training set is purposely left uncurated, being affected by labeling noise. This makes Food-101 suitable for testing the robustness of models to labeling noise.
UTKFace. The UTKFace data set [11] contains face images representing various gender, age and ethnic groups. It consists of 23,709 images. The data set is divided into 16,597 training images, 3,556 validation images, and 3,556 test images. Each image is annotated with the corresponding age and gender label, which makes UTKFace suitable for evaluating models in a multi-task learning setup.
PASCAL VOC 2007+2012. One of the most popular benchmarks for object detection is PASCAL VOC [12]. The data set consists of 21,503 images which are annotated with bounding boxes for 20 object categories. The official split has 16,551 training images and 4,952 test images.
BoolQ. BoolQ [13] is a question answering data set for yes/no questions containing 15,942 examples. The questions are naturally occurring, being generated in unprompted and unconstrained settings. Each example is a triplet of the form: {question, passage, answer}. We use the data split provided in the SuperGLUE benchmark [67], containing 9,427 examples for training, 3,270 for validation and 3,245 for testing.
Table 1: Optimal hyperparameters chosen for each architecture.

Model | Optimizer | Mini-batch | #Epochs | $\eta^{(0)}$ | CBS $\sigma$ | CBS $d$ | CBS $u$ | LeRaC $k$ | LeRaC $\eta_1^{(0)}$-$\eta_n^{(0)}$ |
---|---|---|---|---|---|---|---|---|---|
ResNet-18 | SGD | 64 | 100-200 | | 1 | 0.9 | 2-5 | 5-7 | |
Wide-ResNet-50 | SGD | 64 | 100-200 | | 1 | 0.9 | 2-5 | 5-7 | |
CvT-13 | AdaMax | 64-128 | 150-200 | | 1 | 0.9 | 2-5 | 2-5 | |
CvT-13 | AdaMax | 64-128 | 25 | | 1 | 0.9 | 2-5 | 3-6 | |
YOLOv5 | SGD | 16 | 100 | | 1 | 0.9 | 2 | 3 | |
BERT | AdaMax | 10 | 7-25 | | 1 | 0.9 | 1 | 3 | |
LSTM | AdamW | 256-512 | 25-70 | | 1 | 0.9 | 2 | 3-4 | |
SepTr | Adam | 2 | 50 | | 0.8 | 0.9 | 1-3 | 2-5 | |
DenseNet-121 | Adam | 64 | 50 | | 0.8 | 0.9 | 1-3 | 2-5 | |
QNLI. The QNLI (Question-answering Natural Language Inference) data set [14] is a natural language inference benchmark automatically derived from SQuAD [68]. The data set contains {question, sentence} pairs and the task is to determine whether the context sentence contains the answer to the question. The data set is constructed on top of Wikipedia documents, each document being accompanied, on average, by 4 questions. We consider the data split provided in the GLUE benchmark [14], which comprises 104,743 examples for training, 5,463 for validation and 5,463 for testing.
RTE. Recognizing Textual Entailment (RTE) [14] is a natural language inference data set containing pairs of sentences with the target label indicating if the meaning of one sentence can be inferred from the other. The training subset includes 2,490 samples, the validation set 277 samples, and the test set 3,000 samples.
CREMA-D. The CREMA-D multi-modal database [16] is formed of 7,442 videos of 91 actors (48 male and 43 female) of different ethnic groups. The actors perform various emotions while uttering 12 particular sentences that evoke one of the 6 emotion categories: anger, disgust, fear, happy, neutral, and sad. Following previous work [56], we conduct experiments only on the audio modality, dividing the set of audio samples into training, validation and test subsets.
ESC-50. The ESC-50 [15] data set is a collection of 2,000 samples of 5 seconds each, comprising 50 classes of various common sound events. Samples are recorded at a 44.1 kHz sampling frequency, with a single channel. In our evaluation, we employ the 5-fold cross-validation procedure, as described in related works [15, 24].
4.2 Experimental Setup
Architectures. To demonstrate the compatibility of LeRaC with multiple neural architectures, we select several convolutional, recurrent and transformer models. As representative convolutional neural networks (CNNs), we opt for ResNet-18 [17], Wide-ResNet-50 [18] and DenseNet-121 [19]. For the object detection experiments on PASCAL VOC, we use the YOLOv5 [20] model based on the CSPDarknet53 [69] backbone, which is pre-trained on the MS COCO data set [70]. As representative transformers, we consider CvT-13 [22], BERT [23] and SepTr [24]. For CvT, we consider both pre-trained and randomly initialized versions. We use an uncased large pre-trained version of BERT. Following Ristea et al. [24], we train SepTr from scratch. In addition, we employ a long short-term memory (LSTM) network [21] to represent recurrent neural networks (RNNs). The recurrent neural network contains two LSTM layers, each having a hidden dimension of 256 components. These layers are preceded by one embedding layer with the embedding size set to 128 elements. The output of the last recurrent layer is passed to a classifier composed of two fully connected layers. The LSTM is activated by rectified linear units (ReLU). We apply the aforementioned models to distinct input data types, considering the intended application domain of each model. Hence, ResNet-18, Wide-ResNet-50, CvT and YOLOv5 are applied to images, BERT and LSTM are applied to text, and SepTr and DenseNet-121 are applied to audio.
Multi-task architectures. To determine the impact of LeRaC on multi-task learning models, we conduct experiments on the UTKFace data set, where the face images are annotated with gender and age labels. We consider two models for the multi-task learning setup, namely ResNet-18 and CvT-13. Each model is jointly trained on the two tasks (gender prediction and age estimation). To each model, we attach two heads, one for gender classification and one for age estimation, respectively. The classification head is trained using the cross-entropy loss with respect to the gender label, while the regression head uses the mean squared error with respect to the age label. The models are trained using a joint objective defined as follows:
(10) |
where and are the ground-truth gender and age labels, and are the predicted gender and age labels, is a weight factor, and is the cross-entropy loss for the gender prediction task, defined as:
(11) |
and is the mean squared error for the age estimation task, defined as:
(12) |
The factor ensures the two tasks are equally important by weighting to have approximately the same range of values as . As such, we set .
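The joint objective described above can be illustrated with a minimal pure-Python sketch. It assumes that the weight factor multiplies the mean squared error term, that gender is encoded as a binary label with a predicted probability for class 1, and that both losses are averaged over the batch. The function names and the value of `lam` used below are hypothetical and are not taken from the paper.

```python
import math


def binary_cross_entropy(y_true: int, p_pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy for a binary label, given the predicted probability of class 1."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def squared_error(age_true: float, age_pred: float) -> float:
    """Squared difference between the ground-truth and predicted age."""
    return (age_true - age_pred) ** 2


def joint_loss(genders, gender_probs, ages, age_preds, lam):
    """Return (total, ce, mse), where total = ce + lam * mse (assumed weighting)."""
    n = len(genders)
    ce = sum(binary_cross_entropy(y, p) for y, p in zip(genders, gender_probs)) / n
    mse = sum(squared_error(a, b) for a, b in zip(ages, age_preds)) / n
    return ce + lam * mse, ce, mse


if __name__ == "__main__":
    # Toy batch with made-up values; lam is a hypothetical weight.
    total, ce, mse = joint_loss([1, 0], [0.9, 0.2], [30.0, 40.0], [32.0, 37.0], lam=0.1)
    print(f"CE={ce:.4f}  MSE={mse:.4f}  total={total:.4f}")
```

Because the squared age error typically spans a much larger range than the cross-entropy, a small weight on the regression term keeps the two tasks at a comparable scale, which is the role the text assigns to the weight factor.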
Baselines. We compare LeRaC with two baselines: the conventional training regime (which uses early stopping, reduces the learning rate on plateau, and employs linear warm-up and cosine annealing when required) and the state-of-the-art Curriculum by Smoothing [7]. For CBS, we use the official code released by Sinha et al. [7] at https://github.com/pairlab/CBS, to ensure the reproducibility of their method in our experimental settings, which include a more diverse selection of input data types and neural architectures. In addition, we compare with several data-level and task-level curriculum learning methods [25, 26, 27, 28, 29] on CIFAR-10 and CIFAR-100.
To apply CBS to non-convolutional architectures, we use 1D convolutional layers based on Gaussian filters with a receptive field of 3. For transformers, we integrate a 1D Gaussian layer before each transformer block, so the smoothing is applied on the sequence of tokens. Similarly, for recurrent neural networks, before each LSTM layer, we process the sequence of tokens with 1D convolutional layers based on Gaussian filters. For both transformers and RNNs, we anneal, during training, the standard deviation of the Gaussian filters to enhance the information propagated through the network. This approach mirrors the implementation of CBS for convolutional neural networks.
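A minimal sketch of this adaptation is given below, in pure Python. It builds a normalized 1D Gaussian kernel with a receptive field of 3 and applies it along the token axis, independently for each feature. The edge-replication padding and the multiplicative decay rule in `annealed_sigma` are assumptions made for illustration; the official CBS code may differ in both respects.

```python
import math


def gaussian_kernel_1d(sigma: float, size: int = 3):
    """Normalized 1D Gaussian kernel with `size` taps (receptive field)."""
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_sequence(tokens, kernel):
    """Smooth a sequence of token vectors along the sequence axis.

    Each feature dimension is filtered independently; the sequence edges are
    handled by replicating the first and last tokens (an assumption).
    """
    half = len(kernel) // 2
    n, dim = len(tokens), len(tokens[0])
    out = []
    for t in range(n):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            src = min(max(t + k - half, 0), n - 1)
            for d in range(dim):
                row[d] += w * tokens[src][d]
        out.append(row)
    return out


def annealed_sigma(sigma0: float, decay: float, epoch: int, step: int) -> float:
    """Assumed annealing rule: multiply sigma by `decay` every `step` epochs."""
    return sigma0 * (decay ** (epoch // step))


if __name__ == "__main__":
    tokens = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    for epoch in (0, 4, 8):
        sigma = annealed_sigma(1.0, 0.5, epoch, step=4)
        print(epoch, sigma, smooth_sequence(tokens, gaussian_kernel_1d(sigma))[1])
```

As the standard deviation shrinks, the kernel approaches the identity filter, so more of the original token information is propagated through the network.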
Hyperparameter tuning. We tune all hyperparameters on the validation set of each benchmark. In Table 1, we present the optimal hyperparameters chosen for each architecture. In addition to the standard parameters of the training process, we report the parameters that are specific for the CBS [7] and LeRaC strategies. In the case of CBS, denotes the standard deviation of the Gaussian kernel, is the decay rate for , and is the decay step. Regarding the parameters of LeRaC, represents the number of iterations used in Eq. (9), and and are the initial learning rates for the first and last layers of the architecture, respectively. We set and in all experiments, without tuning. In addition, the intermediate learning rates , , are automatically set to be equally distanced between and . Moreover, , i.e. the initial learning rates of LeRaC converge to the original learning rate set for the conventional training regime. All models are trained with early stopping and the learning rate is reduced by a factor of when the loss reaches a plateau. We use linear warm-up with cosine annealing, whenever it is found useful for models based on conventional or CBS training. The learning rate warm-up is switched off for LeRaC to avoid unwanted interactions with our training strategy. Except for the pre-trained models, the weights of all models are initialized with Glorot initialization [66].
We underline that some parameters are the same across all data sets, while others need to be established per data set. For example, the parameter of CBS and the parameter of LeRaC are validated on each data set. As such, for the ResNet-18 model, the parameter of CBS takes one value on each data set (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet, Food-101, UTKFace), but its values across these data sets range between 2 and 5. Similarly, the parameter of LeRaC takes one value per data set, with the range of values being 5-7. In Table 1, we aggregate the optimal parameters of each model for all data sets. This explains why some hyperparameters are specified in terms of ranges.
Setting the initial learning rates. We should emphasize that the different learning rates , , are neither optimized nor tuned during training. Instead, we set the initial learning rates through validation, such that is around five or six orders of magnitude lower than , and . After initialization, we apply our exponential scheduler, until all learning rates become equal at iteration . In addition, we would like to underline that the difference between the initial learning rates of consecutive layers is automatically set based on the range given by and . For example, let us consider a network with 5 layers. If we choose and , then the intermediate initial learning rates are automatically set to , , , i.e. is used in the exponent and is equal to in this case. To obtain the intermediate learning rates according to this example, we actually apply the exponential scheduler defined in Eq. (9). This reduces the number of tunable hyperparameters from (the number of layers) to two, namely and . We go even further, setting without tuning, in all our experiments. Hence, tuning is only performed for the initial learning rate of the last layer, namely . Although tuning all , , might lead to better results, we refrain from meticulously tuning every possible value to avoid overfitting in hyperparameter space.
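The procedure can be sketched in a few lines of Python. The sketch assumes that the intermediate initial learning rates are equally spaced in the exponent (geometric interpolation between the first and last layer) and that each rate is raised exponentially until it reaches the common target at the last LeRaC iteration. The function names and the numeric values are illustrative and are not taken from the paper; the exact update rule is the one given in Eq. (9).

```python
def initial_learning_rates(eta_first: float, eta_last: float, num_layers: int):
    """Initial per-layer rates, equally spaced in the exponent (log-spaced)."""
    if num_layers == 1:
        return [eta_first]
    ratio = (eta_last / eta_first) ** (1.0 / (num_layers - 1))
    return [eta_first * ratio ** j for j in range(num_layers)]


def lerac_rate(eta_init: float, eta_target: float, iteration: int, last_iteration: int) -> float:
    """Exponential interpolation from eta_init (iteration 0) to eta_target.

    From `last_iteration` onward, every layer uses the common target rate.
    """
    if iteration >= last_iteration:
        return eta_target
    return eta_init * (eta_target / eta_init) ** (iteration / last_iteration)


if __name__ == "__main__":
    # Hypothetical setting: 5 layers, first layer at 1e-2, last layer at 1e-6.
    rates = initial_learning_rates(1e-2, 1e-6, num_layers=5)
    for it in range(0, 6):
        print(it, [f"{lerac_rate(r, 1e-2, it, last_iteration=5):.1e}" for r in rates])
```

With this design only the initial rate of the last layer and the number of LeRaC iterations remain to be tuned, since the first layer starts directly at the target learning rate.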
Number of hyperparameters. We further emphasize that LeRaC adds only two additional tunable hyperparameters with respect to the conventional training regime. These are the lowest learning rate and the number of iterations to employ LeRaC. We reduce the number of hyperparameters that require tuning by using a fixed rule to adjust the intermediate learning rates, e.g. by employing an exponential scheduler, or by fixing some hyperparameters, e.g. . In contrast, CBS [7] has three additional hyperparameters, thus having one extra hyperparameter compared with LeRaC. Furthermore, we note that data-level curriculum methods also introduce additional hyperparameters. Even a simple method that splits the examples into easy-to-hard batches that are gradually added to the training set requires at least two hyperparameters: the number of batches, and the number of iterations before introducing a new training batch. We thus believe that, in terms of the number of additional hyperparameters, LeRaC is comparable to CBS and other curriculum learning strategies. We emphasize that the same happens if we look at new optimizers, e.g. Adam [64] adds three additional hyperparameters compared with SGD.
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet |
---|---|---|---|---|
ResNet-18 | learning rate decay | |||
constant learning rate | ||||
LeRaC (ours) | ||||
Wide-ResNet-50 | learning rate decay | |||
constant learning rate | ||||
LeRaC (ours) | ||||
CvT-13 | linear warm-up + cosine annealing | |||
constant learning rate | ||||
LeRaC (ours) | ||||
CvT-13 | cosine annealing | |||
constant learning rate | ||||
LeRaC (ours) |
Avoiding too large learning rates. In principle, a larger learning rate implies a larger update. However, if the learning rate is too high, the model can actually diverge. This is because the gradient describes the loss function in the vicinity of the current location, providing no guarantee for the value of the loss outside this vicinity. Our implementation takes this aspect into account. Instead of increasing the learning rate for earlier layers, we reduce the learning rate for the deeper layers to avoid divergence. More precisely, we set the learning rate for the first layer to the original learning rate and the other initial learning rates are gradually reduced with each layer. During training, the lower learning rates are gradually increased, until epoch . Hence, LeRaC actually slows down the learning for deeper layers, until the earlier layers have learned representative features.
Evaluation. For the classification tasks, we evaluate all models in terms of the accuracy rate. For the regression task (age estimation), we use the mean absolute error. For the object detection task, we employ the mean Average Precision (mAP) at an intersection over union (IoU) threshold of 0.5. We repeat the training process of each model five times and report the average performance and the standard deviation.
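Two ingredients of this protocol are easy to illustrate: the IoU criterion that decides whether a detection counts as correct, and the aggregation of repeated runs. The sketch below is a simplified illustration and not a full mAP implementation. The box format `(x_min, y_min, x_max, y_max)` and the use of the sample standard deviation are assumptions.

```python
import statistics


def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold: float = 0.5) -> bool:
    """A detection matches a ground-truth box if their IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


def summarize(run_scores):
    """Mean and sample standard deviation over repeated training runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
    print(is_true_positive((0, 0, 2, 2), (0, 0, 2, 3)))
```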
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet | ImageNet | Food-101 |
---|---|---|---|---|---|---|
ResNet-18 | conventional | | | | | |
ResNet-18 | CBS | | | | | |
ResNet-18 | LeRaC (ours) | | | | | |
Wide-ResNet-50 | conventional | | | | | |
Wide-ResNet-50 | CBS | | | | | |
Wide-ResNet-50 | LeRaC (ours) | | | | | |
CvT-13 | conventional | | | | | |
CvT-13 | CBS | | | | | |
CvT-13 | LeRaC (ours) | | | | | |
CvT-13 (pre-trained) | conventional | | | | - | |
CvT-13 (pre-trained) | CBS | | | | - | |
CvT-13 (pre-trained) | LeRaC (ours) | | | | - | |
4.3 Domain-Specific Preprocessing
Image preprocessing. For the image classification experiments, we apply the same data preprocessing approach as Sinha et al. [7]. Hence, we normalize the images and maintain their original resolution, pixels for CIFAR-10 and CIFAR-100, pixels for Tiny ImageNet, pixels for ImageNet and Food-101, and pixels for UTKFace. Similar to Sinha et al. [7], we do not employ data augmentation.
Text preprocessing. For the text classification experiments with BERT, we lowercase all words and add the classification token ([CLS]) at the start of the input sequence. We add the separator token ([SEP]) to delimit sentences. For the LSTM network, we lowercase all words and replace them with indexes from vocabularies constructed from the training set. The input sequence length is limited to tokens for BERT and tokens for LSTM.
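The steps above can be sketched as follows. The sketch uses plain whitespace tokenization for brevity, whereas BERT relies on its own subword tokenizer, so the BERT part only illustrates where the special tokens are placed. The helper names, the `<pad>` and `<unk>` entries and the `max_len` argument are hypothetical.

```python
def build_vocab(train_sentences, unk_token="<unk>", pad_token="<pad>"):
    """Map each lowercased training word to an index (built from the training set only)."""
    vocab = {pad_token: 0, unk_token: 1}
    for sentence in train_sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_for_lstm(sentence, vocab, max_len, unk_token="<unk>"):
    """Lowercase, replace words with vocabulary indexes and truncate to max_len."""
    ids = [vocab.get(word, vocab[unk_token]) for word in sentence.lower().split()]
    return ids[:max_len]


def format_for_bert(sentence_a, sentence_b=None):
    """Lowercase and add [CLS] at the start and [SEP] after each sentence."""
    tokens = ["[CLS]"] + sentence_a.lower().split() + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b.lower().split() + ["[SEP]"]
    return tokens


if __name__ == "__main__":
    vocab = build_vocab(["The cat sat", "the dog"])
    print(encode_for_lstm("The bird sat down", vocab, max_len=3))
    print(format_for_bert("Is it raining", "Yes it is"))
```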
Speech preprocessing. The speech preprocessing steps are carried out following Ristea et al. [24]. We thus transform each audio sample into a time-frequency matrix by computing the discrete Short Time Fourier Transform (STFT) with FFT points, using a Hamming window of length and a hop size . For CREMA-D, we first standardize all audio clips to a fixed dimension of seconds by padding or clipping the samples. Then, we apply the STFT with , and a window size of . For ESC-50, we keep the same values for and , but we increase the hop size to . Next, for each STFT, we compute the square root of the magnitude and map the values to Mel bins. The result is converted to a logarithmic scale and normalized to the interval . Furthermore, in all our speech classification experiments, we use the following data augmentation methods: noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment [71].
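The first stages of this pipeline (fixing the clip length and cutting it into Hamming-windowed frames) can be sketched in pure Python. The sketch stops before the Fourier transform and the Mel mapping, and it assumes zero padding for short clips. The function names and the window and hop values in the example are hypothetical, since the actual settings are the ones listed above.

```python
import math


def pad_or_clip(samples, target_len):
    """Standardize a clip to target_len samples by clipping or zero padding."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def hamming_window(length):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def frame_signal(samples, win_len, hop):
    """Split a signal into overlapping frames and apply the Hamming window to each."""
    window = hamming_window(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(win_len)])
    return frames


if __name__ == "__main__":
    clip = pad_or_clip([0.5] * 7, target_len=10)
    frames = frame_signal(clip, win_len=4, hop=2)
    print(len(frames), frames[0])
```

Each windowed frame would then be passed to an FFT to obtain one column of the time-frequency matrix; a larger hop size yields fewer frames for the same clip.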
Model | Training Regime | Gender Accuracy | Age MAE |
---|---|---|---|
ResNet-18 | conventional | ||
CBS | |||
LeRaC (ours) | |||
CvT-13 | conventional | ||
CBS | |||
LeRaC (ours) |
Training Regime | conventional | CBS | LeRaC (ours) |
---|---|---|---|
mAP |
4.4 Preliminary Results
We present preliminary experiments to show the effect of various learning rate schedulers for different architectures. For each architecture, we compare the constant learning rate scheduler with an adaptive learning rate scheduler. The aim is to find the best scheduler for the conventional training regime, which is used as baseline in the subsequent experiments. Table 2 showcases the preliminary results on CIFAR-10, CIFAR-100 and Tiny ImageNet. We compare the outcomes of the adaptive and constant learning rate schedulers with those of LeRaC. In most cases, the adaptive scheduler yields better results than the constant learning rate. Using a constant learning rate seems to work only for the pre-trained CvT-13. Notably, the analysis also reveals that LeRaC consistently outperforms the other baseline schedulers, achieving the highest accuracy rates across all data sets.
We emphasize that, for the subsequent experiments, the conventional regime is always represented by the best scheduler among the following options: learning rate decay, learning rate warm-up, cosine annealing, or combinations of the aforementioned options.
Training Regime | Text Model | BoolQ | RTE | QNLI | Audio Model | CREMA-D | ESC-50 |
---|---|---|---|---|---|---|---|
conventional | BERT | | | | SepTr | | |
CBS | BERT | | | | SepTr | | |
LeRaC (ours) | BERT | | | | SepTr | | |
conventional | LSTM | | | | DenseNet-121 | | |
CBS | LSTM | | | | DenseNet-121 | | |
LeRaC (ours) | LSTM | | | | DenseNet-121 | | |
4.5 Main Results
Image classification. In Table 3, we present the image classification results on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101. Since CvT-13 is pre-trained on ImageNet, it does not make sense to fine-tune it on ImageNet. Thus, the respective results are not reported. On the one hand, there are two scenarios (ResNet-18 on CIFAR-100, and CvT-13 on CIFAR-100) in which CBS provides the largest improvements over the conventional regime, surpassing LeRaC in the respective cases. On the other hand, there are more than 10 scenarios where CBS degrades the accuracy with respect to the standard training regime. This shows that the improvements attained by CBS are inconsistent across models and data sets. Unlike CBS, our strategy surpasses the baseline regime in all 19 cases, thus being more consistent. In 8 of these cases, the accuracy gains of LeRaC are higher than . Moreover, LeRaC outperforms CBS in 17 out of 19 cases. We thus consider that LeRaC can be regarded as a better choice than CBS, bringing consistent performance gains.
Multi-task learning. In Table 4, we include the multi-task learning results on the UTKFace data set [11]. We evaluate the multi-task ResNet-18 and CvT-13 models under various training regimes, reporting the accuracy rates for gender prediction, and the mean absolute errors for age estimation, respectively. LeRaC achieves the best scores in each and every case, surpassing the other training regimes in the multi-task learning setup. Moreover, its results are statistically better with respect to both competing regimes. In contrast, the CBS regime remains in the statistical margin of the conventional regime for the pre-trained CvT-13 network.
Object detection. In Table 5, we include the object detection results of YOLOv5 [20] based on different training regimes on PASCAL VOC 2007+2012 [12]. LeRaC exhibits a superior mAP score, significantly surpassing the other training regimes. In contrast, CBS leads to suboptimal performance, hinting towards the inconsistency of CBS across different evaluation scenarios.
Text classification. In Table 6 (left side), we report the text classification results on BoolQ, RTE and QNLI. Here, there are two cases (BERT on QNLI and LSTM on RTE) where CBS leads to performance drops compared with the conventional training regime. In all other cases, the improvements of CBS are below . Just as in the image classification experiments, LeRaC brings accuracy gains for each and every model and data set. In four out of six scenarios, the accuracy gains yielded by LeRaC are higher than . Once again, LeRaC proves to be the most consistent regime, generally surpassing CBS by significant margins.
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
Speech classification. In Table 6 (right side), we present the results obtained on the audio data sets, namely CREMA-D and ESC-50. We observe that the CBS strategy obtains lower results compared with the baseline in two cases (SepTr on CREMA-D and DenseNet-121 on ESC-50), while our method provides superior results for each and every case. By applying LeRaC on SepTr, we set a new state-of-the-art accuracy level () on the CREMA-D audio modality, surpassing the previous state-of-the-art value attained by Ristea et al. [24] with SepTr alone. When applied on DenseNet-121, LeRaC brings performance improvements higher than , the highest improvement () over the baseline being attained on CREMA-D.
Significance testing. To determine if the reported accuracy gains observed for LeRaC with respect to the baseline are significant, we apply McNemar / Cochran Q significance testing [72] to the results reported in Table 3, Table 4, Table 5 and Table 6 on all 12 data sets. In 27 of 34 cases, we found that our results are significantly better than the corresponding baseline, at a p-value of . This confirms that our gains are statistically significant in the majority of cases.
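For a pair of models evaluated on the same test set, McNemar's test only needs the two discordant counts. The sketch below is a generic illustration of the test, not the exact procedure used for the reported results: whether a continuity correction was applied is not stated, so it is exposed as an option, and the counts in the example are made up. Cochran's Q test, which generalizes the comparison to more than two models, is not covered here.

```python
import math


def mcnemar_test(b: int, c: int, continuity_correction: bool = True):
    """McNemar's chi-squared test on discordant pairs.

    b: samples classified correctly only by the first model.
    c: samples classified correctly only by the second model.
    Returns (statistic, p_value), using a chi-squared distribution with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity_correction:
        diff = max(diff - 1.0, 0.0)
    statistic = diff * diff / (b + c)
    # Survival function of chi-squared with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    stat, p = mcnemar_test(b=30, c=10)  # made-up counts
    print(f"statistic={stat:.3f}, p-value={p:.4f}")
```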
Model | Training Regime | CIFAR-10 | CIFAR-100 |
---|---|---|---|
ResNet-18 | conventional | ||
CBS [7] | |||
LSCL [25] | |||
EfficientTrain [26] | |||
Self-taught [27] | |||
LCDnet-CL [29] | |||
CLIP [28] | |||
LeRaC (ours) | |||
Wide-ResNet-50 | conventional | ||
CBS [7] | |||
LSCL [25] | |||
EfficientTrain [26] | |||
Self-taught [27] | |||
LCDnet-CL [29] | |||
CLIP [28] | |||
LeRaC (ours) | |||
CvT-13 | conventional | ||
CBS [7] | |||
LSCL [25] | |||
EfficientTrain [26] | |||
Self-taught [27] | |||
LCDnet-CL [29] | |||
CLIP [28] | |||
LeRaC (ours) |
Training time comparison. For a fair comparison, all training regimes are executed for the same number of epochs for a given model and data set. However, the CBS strategy adds the smoothing operation at multiple levels inside the architecture, which increases the training time. We therefore compare CBS and LeRaC in terms of validation accuracy versus training time (in hours). For this experiment, we select four neural models and illustrate the evolution of the validation accuracy over time in Figure 3. We underline that LeRaC converges faster, being around 7-12% faster than CBS. Note that LeRaC requires the same time as the conventional regime.
4.6 More Comparative Results
Comparing with domain-specific curriculum learning strategies. Although we consider CBS [7] as our closest competitor in terms of applicability across architectures and domains, there are domain-specific curriculum learning methods reporting promising results. To this end, we perform additional experiments on CIFAR-10 and CIFAR-100 with ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained), considering two recent curriculum learning strategies applied in the image domain, namely Label-Similarity Curriculum Learning (LSCL) [25] and EfficientTrain [26].
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet |
---|---|---|---|---|
CvT-13 | conventional | | | |
CvT-13 | LeRaC (logarithmic update) | | | |
CvT-13 | LeRaC (linear update) | | | |
CvT-13 | LeRaC (exponential update) | | | |
Dogan et al. [25] proposed LSCL, a strategy that relies on hierarchically clustering the classes (labels) based on inter-label similarities determined with the help of document embeddings representing the Wikipedia pages of the respective classes. The corresponding results shown in Table 7 indicate that label-similarity curriculum is useful for CIFAR-100, but not for CIFAR-10. This suggests that the method needs a sufficiently large number of classes to benefit from the constructed hierarchy of classes. In contrast, LeRaC does not rely on external components, such as the similarity measure used by Dogan et al. [25] in their strategy. Another important limitation of LSCL [25] is its restricted use, e.g. LSCL is not applicable to regression tasks, where there are no classes. Therefore, we consider LeRaC as a more versatile alternative.
EfficientTrain is an alternative to CBS, which introduces a cropping operation in the Fourier spectrum of the inputs instead of blurring the activation maps. The method is not suitable for text data, so the comparison between EfficientTrain and LeRaC can only be performed in the image domain. Consequently, we compare with EfficientTrain [26] on CIFAR-10 and CIFAR-100, and show the corresponding results in Table 7. Notably, our method surpasses EfficientTrain [26] in 4 out of 6 evaluation scenarios. These results confirm the competitiveness of LeRaC in comparison to very recent methods, such as EfficientTrain [26].
Aside from outperforming EfficientTrain and LSCL in the image domain, our method has another important advantage: it is generally applicable to any domain.
Comparing with data-level curriculum learning methods. In Table 7, we also compare LeRaC with three data-level curriculum learning methods [27, 28, 29]. These methods share a common framework, where a scoring function ranks samples based on their difficulty, and a pacing function determines the timing for introducing new batches during training. Khan et al. [27] examine various pacing functions and classify scoring functions into two categories: self-taught and transfer-scoring functions. Self-taught functions involve training a model on a subset of data batches and then using this model to assess the difficulty of examples. In contrast, transfer-scoring functions utilize a pre-trained model for this purpose. For the results reported in Table 7 for Khan et al. [27], we use the self-taught scoring function and a linear pacing function. To compare with Khan et al. [29], we use a transfer-scoring function and a ResNet-50 model pre-trained on ImageNet. For Khan et al. [28], aside from using the pre-trained model for assessing the difficulty of the samples, we also remove the least significant samples during training.
The results reported in Table 7 indicate that LeRaC outperforms the data-level curriculum learning methods. We note that these methods were exclusively tested on crowd density estimation tasks, which could explain why their effectiveness might not generalize to different types of tasks. For instance, the method described by Khan et al. [28] is suboptimal even when compared with conventional training, suggesting that the strategy of removing easy examples is not always effective for image classification tasks.
4.7 Ablation Studies
Comparing different schedulers. We first aim to establish if the exponential learning rate scheduler proposed in Eq. (9) is a good choice. To test this out, we select the CvT-13 model and change the LeRaC regime to use linear or logarithmic updates of the learning rates. The corresponding results are shown in Table 8. We observe that both alternative schedulers obtain performance gains, but our exponential learning rate scheduler brings higher gains on all three data sets. We thus conclude that the update rule defined in Eq. (9) is a sound option.
Our previous ablation study shows that the exponential scheduler leads to higher gains than the linear or the logarithmic schedulers. In general, a suitable scheduler is one that adjusts the learning rate at each layer proportionally to the estimated signal-to-noise ratio (SNR) drop from one layer to the next. To understand how the average SNR drops from one neural layer to the next, we plot the average SNR of the feature maps at each layer of the randomly initialized LeNet architecture, computed over 100 images from CIFAR-100, in Figure 4. As anticipated, the average SNR decreases along with the layer index. Notably, we observe that the drop in SNR follows an exponential trend. This can explain why the exponential scheduler is a more suitable choice.
To further justify our preference towards the exponential scheduler, we analyze the training progress of the ResNet-18 and the pre-trained CvT-13 models using various schedulers (logarithmic, linear, exponential) for LeRaC. Figure 5 shows the results for ResNet-18, while Figure 6 illustrates the results for CvT-13. In both cases, the exponential scheduler leads to better training progress than the conventional regime, whereas the linear and logarithmic schedulers are not as good. These results further support the exponential scheduler as the most suitable choice among the tested schedulers.
Training Regime | - | ResNet-18 | Wide-ResNet-50 |
---|---|---|---|
conventional | - | ||
LeRaC (ours) | - | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- |
Varying value ranges for initial learning rates. All our hyperparameters are either fixed without tuning or tuned on the validation data. In this ablation experiment, we present results with LeRaC using multiple ranges for and to demonstrate that LeRaC is sufficiently stable with respect to suboptimal hyperparameter choices. We carry out experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100. We report the corresponding results in Table 9. We observe that all hyperparameter configurations lead to surpassing the baseline regime. This indicates that LeRaC can bring performance gains even outside the optimal learning rate bounds, demonstrating low sensitivity to suboptimal hyperparameter tuning.
Training Regime | ResNet-18 | Wide-ResNet-50 | |
---|---|---|---|
conventional | - | ||
5 | |||
6 | |||
LeRaC (ours) | 7 | ||
8 | |||
9 |
Data Set | Model | Training Regime | Accuracy |
---|---|---|---|
CIFAR-100 | ResNet-18 | conventional | |
CIFAR-100 | ResNet-18 | anti-LeRaC | |
CIFAR-100 | ResNet-18 | LeRaC (ours) | |
CIFAR-100 | Wide-ResNet-50 | conventional | |
CIFAR-100 | Wide-ResNet-50 | anti-LeRaC | |
CIFAR-100 | Wide-ResNet-50 | LeRaC (ours) | |
CREMA-D | SepTr | conventional | |
CREMA-D | SepTr | anti-LeRaC | |
CREMA-D | SepTr | LeRaC (ours) | |
Varying . In Table 10, we present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100, considering various values for (the last iteration for our training regime). We observe that all configurations surpass the baselines on CIFAR-100. Moreover, we observe that the optimal values for ( for ResNet-18 and for Wide-ResNet-50) obtained on the validation set are not the values producing the best results on the test set. This confirms that we did not overfit the hyperparameters of LeRaC.
Anti-curriculum. Since our goal is to perform curriculum learning (from easy to hard), we restrict the settings for , , such that deeper layers start with lower learning rates. However, another strategy is to consider the opposite setting, where we use higher learning rates for deeper layers. If we train later layers at a faster pace (anti-curriculum), we conjecture that the later layers get adapted to the noise from the early layers, which could likely lead to local optima or difficult training (due to the need of readapting to the earlier layers, once these layers start learning useful features). We tested this approach (anti-LeRaC), which belongs to the category of anti-curriculum learning strategies [2], in a set of new experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D. We report the corresponding results with LeRaC and anti-LeRaC in Table 11. Although anti-curriculum, e.g. hard negative sample mining, was shown to be useful in other tasks [2], our results indicate that learning rate anti-curriculum attains inferior performance compared with our approach. Furthermore, anti-LeRaC is also below the conventional regime, confirming our conjecture regarding this strategy.
Summary. Notably, our ablation results show that the majority of hyperparameter configurations tested for LeRaC lead to outperforming the conventional regime, demonstrating the stability of LeRaC. We present additional experiments in Appendix C.
5 Discussion
Interaction with optimization algorithms. Throughout our experiments, we use the same optimizer for a given neural model across all training regimes (conventional, CBS, LeRaC). The best optimizer for each neural model is established for the conventional training regime. We underline that our initial learning rates and scheduler are used independently of the optimizers. Although our learning rate scheduler updates the learning rates at the beginning of every iteration, we did not observe any stability or interaction issues with any of the optimizers (SGD, Adam, AdaMax, AdamW).
Interaction with other curriculum learning strategies. Our simple and generic curriculum learning scheme can be integrated into any model for any task, not relying on domain or task dependent information, e.g. the data samples. In Table 16 from Appendix C, we show that combining LeRaC and CBS can boost performance. In a similar fashion, LeRaC can be combined with data-level curriculum strategies for improved performance. We leave this exploration for future work.
Interaction with other learning rate schedulers. Whenever a learning rate scheduler is used for training a model in our experiments, we simply replace the scheduler with LeRaC until epoch . For example, all the baseline CvT results are based on linear warm-up with cosine annealing, this being the recommended scheduler for CvT [22]. When we introduce LeRaC, we simply deactivate alternative schedulers between epochs and . In general, we recommend deactivating other schedulers while using LeRaC, both for simplicity and to avoid potential stability issues.
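The hand-off between LeRaC and a regular scheduler can be sketched as a single function that returns the learning rate of one layer at a given epoch. The sketch assumes an exponential LeRaC phase followed by cosine annealing, and it further assumes that the cosine schedule starts counting once the LeRaC phase ends. Both choices, as well as the names `lerac_epochs`, `eta_init` and `eta_base`, are illustrative assumptions rather than the settings used in the experiments.

```python
import math


def cosine_annealing(eta_max, epoch, total_epochs, eta_min=0.0):
    """Standard cosine annealing from eta_max (epoch 0) to eta_min (epoch total_epochs)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))


def layer_rate(epoch, lerac_epochs, eta_init, eta_base, total_epochs):
    """Learning rate of one layer: LeRaC first, the regular scheduler afterwards.

    During the first `lerac_epochs` epochs the other scheduler is deactivated and
    the rate grows exponentially from eta_init to eta_base. Afterwards, cosine
    annealing takes over (assumed to restart its clock at the hand-off).
    """
    if epoch < lerac_epochs:
        return eta_init * (eta_base / eta_init) ** (epoch / lerac_epochs)
    return cosine_annealing(eta_base, epoch - lerac_epochs, total_epochs - lerac_epochs)


if __name__ == "__main__":
    for epoch in (0, 2, 5, 25, 50):
        print(epoch, f"{layer_rate(epoch, 5, 1e-6, 1e-2, 50):.2e}")
```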
Limitations of our work. One limitation is the need to disable other learning rate schedulers while using LeRaC. We already tested this scenario with linear warm-up with cosine annealing, which is removed when using LeRaC, observing consistent performance gains (see Table 3). However, disabling alternative learning rate schedulers might bring performance drops in other cases. Hence, this has to be decided on a case by case basis. Another limitation is the possibility of encountering longer training times or poor convergence when the hyperparameters are not properly configured. We recommend hyperparameter tuning on the validation set to avoid this outcome.
6 Conclusion
In this paper, we introduced a novel model-level curriculum learning approach that is based on starting the training process with increasingly lower learning rates per layer, as the layers get closer to the output. We conducted comprehensive experiments on 12 data sets from three domains (image, text and audio), considering multiple neural architectures (CNNs, RNNs and transformers), to compare our novel training regime (LeRaC) with a state-of-the-art regime (CBS [7]), as well as the conventional training regime (based on early stopping and reducing the learning rate on plateau). The empirical results demonstrate that LeRaC is significantly more consistent than CBS, perhaps being one of the most versatile curriculum learning strategies to date, due to its compatibility with multiple neural models and its usefulness across different domains. Remarkably, all these benefits come for free, i.e. LeRaC does not add any extra time over the conventional approach.
Declarations
Funding. This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0195, within PNCDI III.
Conflict of interest. The authors have no conflicts of interest to declare that are relevant to the content of this article.
Availability of data and materials. The data sets are publicly available online.
Code availability. The code has been made publicly available for non-commercial use at https://github.com/CroitoruAlin/LeRaC.
Appendix A Theoretical Proof
The motivation behind using LeRaC stems from our conjecture stating that the level of noise inside features gradually increases with every layer of a neural network. Regardless of the type of layer (convolutional, transformer or fully connected), the operation performed inside a neural layer boils down to matrix or vector multiplications. To this end, we set out to demonstrate that the signal resulting from the multiplication of two signals has a lower signal-to-noise ratio (SNR) than the multiplied factors. We start with the definition of the variance of a signal, which is given below:
Definition 1.
The variance of a signal $s$ is given by:
$$\sigma_s^2 = \mathbb{E}\left[(s - \mathbb{E}[s])^2\right] = \mathbb{E}[s^2] - \mathbb{E}[s]^2. \tag{13}$$
From Definition 1, it results that the expected value of $s^2$, which represents the power of signal $s$, is equal to:
$$\mathbb{E}[s^2] = \mu_s^2 + \sigma_s^2, \tag{14}$$
where $\mu_s$ is the mean of $s$, and $\sigma_s^2$ is the variance of $s$. We use Eq. (14) to define the SNR of a signal as follows:
Definition 2.
The signal-to-noise ratio (SNR) of a signal $s = u + z$, where $u$ is the clean signal and $z$ is the noise component, is the ratio between the power of $u$ and the power of $z$, which is given by:
$$\mathrm{SNR}(s) = \frac{\mathbb{E}[u^2]}{\mathbb{E}[z^2]} = \frac{\mu_u^2 + \sigma_u^2}{\mu_z^2 + \sigma_z^2}, \tag{15}$$
where $\mu_u$ and $\mu_z$ are the means of $u$ and $z$, and $\sigma_u^2$ and $\sigma_z^2$ are the variances of $u$ and $z$, respectively.
The noise contained by data samples given as input to neural networks is usually uncorrelated, e.g. the noise in images is assumed to come from a random normal distribution of zero mean. Moreover, the weights of a neural network are usually initialized by sampling them from a random normal distribution of zero mean [66]. Hence, without loss of generality, we can naturally assume that the noise component $z$ has zero mean. This means that we can simplify Eq. (15) to:
$$\mathrm{SNR}(s) = \frac{\mu_u^2 + \sigma_u^2}{\sigma_z^2}. \tag{16}$$
If the power of the signal $u$ is higher than the power of the noise $z$, then $\mathrm{SNR}(s)$ is higher than $1$. If the signal is dominated by noise, then $\mathrm{SNR}(s)$ is between $0$ and $1$. Note that the SNR does not take negative values. To avoid discussing edge cases, we assume that the SNR of any signal is always defined, i.e. the power of the noise is never $0$.
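Theorem 1.
Let $s_1 = u_1 + z_1$ and $s_2 = u_2 + z_2$ be two signals, where $u_1$ and $u_2$ are the clean components, and $z_1$ and $z_2$ are the noise components. The signal-to-noise ratio of the product between the two signals is lower than or equal to the signal-to-noise ratios of the two signals, i.e.:
$$\mathrm{SNR}(s_1 \cdot s_2) \leq \min\left\{\mathrm{SNR}(s_1), \mathrm{SNR}(s_2)\right\}. \tag{17}$$
Proof.
To demonstrate our theorem, we rely on the formula of variance for the sum of two uncorrelated signals $x$ and $y$ with zero mean:
$$\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2. \tag{18}$$
We also rely on the formula of variance for the product of two independent signals $x$ and $y$:
$$\sigma_{x \cdot y}^2 = \sigma_x^2 \sigma_y^2 + \sigma_x^2 \mu_y^2 + \sigma_y^2 \mu_x^2. \tag{19}$$
Let $s$ denote the product of the two signals, i.e. $s = s_1 \cdot s_2$. Expanding the signals $s_1$ and $s_2$ leads to the following formulation of $s$:
$$s = (u_1 + z_1)(u_2 + z_2) = u_1 u_2 + u_1 z_2 + u_2 z_1 + z_1 z_2, \tag{20}$$
where the clean component is $u = u_1 u_2$, and the noise component is $z = u_1 z_2 + u_2 z_1 + z_1 z_2$. Hence, $s = u + z$.
An example given as input to a neural network and the initial weights of the respective neural network are not correlated under any practical circumstances. Hence, we can assume that the signals $u_1$ and $u_2$ are independent of each other, i.e. their covariance is equal to $0$. This assumption allows us to simplify the signal power of $u$ to:
$$\mathbb{E}\left[u^2\right] = \mathbb{E}\left[u_1^2\right] \mathbb{E}\left[u_2^2\right] = \left(\mu_{u_1}^2 + \sigma_{u_1}^2\right)\left(\mu_{u_2}^2 + \sigma_{u_2}^2\right). \tag{21}$$
The signal power of $z$ is given by:
$$\mathbb{E}\left[z^2\right] = \mu_z^2 + \sigma_z^2 = \sigma_z^2, \tag{22}$$
since the noise is of zero mean, i.e. $\mu_z = 0$. By employing Eq. (18), and assuming that the three terms of $z$ are pairwise uncorrelated, we can compute the power of $z$ as follows:
$$\sigma_z^2 = \sigma_{u_1 z_2}^2 + \sigma_{u_2 z_1}^2 + \sigma_{z_1 z_2}^2. \tag{23}$$
By applying Eq. (19) in Eq. (23), and considering that $z_1$ and $z_2$ have zero mean, we obtain:
$$\sigma_z^2 = \sigma_{z_2}^2\left(\mu_{u_1}^2 + \sigma_{u_1}^2\right) + \sigma_{z_1}^2\left(\mu_{u_2}^2 + \sigma_{u_2}^2\right) + \sigma_{z_1}^2 \sigma_{z_2}^2. \tag{24}$$
Replacing Eq. (21) and Eq. (24) inside Definition 2 leads to the following expression of the signal-to-noise ratio of signal $s$:
$$\mathrm{SNR}(s) = \frac{\left(\mu_{u_1}^2 + \sigma_{u_1}^2\right)\left(\mu_{u_2}^2 + \sigma_{u_2}^2\right)}{\sigma_{z_2}^2\left(\mu_{u_1}^2 + \sigma_{u_1}^2\right) + \sigma_{z_1}^2\left(\mu_{u_2}^2 + \sigma_{u_2}^2\right) + \sigma_{z_1}^2 \sigma_{z_2}^2}. \tag{25}$$
To simplify our notations in the remainder of this proof, we define $a_1 = \mathrm{SNR}(s_1) = \left(\mu_{u_1}^2 + \sigma_{u_1}^2\right) / \sigma_{z_1}^2$ and $a_2 = \mathrm{SNR}(s_2) = \left(\mu_{u_2}^2 + \sigma_{u_2}^2\right) / \sigma_{z_2}^2$. By dividing the numerator and the denominator of Eq. (25) by $\sigma_{z_1}^2 \sigma_{z_2}^2$ and introducing these notations, we obtain the following:
$$\mathrm{SNR}(s) = \frac{a_1 a_2}{a_1 + a_2 + 1}. \tag{26}$$
Now, it remains to prove that:
$$\frac{a_1 a_2}{a_1 + a_2 + 1} \leq a_1 \quad \text{and} \quad \frac{a_1 a_2}{a_1 + a_2 + 1} \leq a_2. \tag{27}$$
However, since Eq. (26) is symmetric in $a_1$ and $a_2$, it is sufficient to prove only one of the inequalities. We choose to provide the complete proof for the first inequality in Eq. (27) (as the proof for the other is analogous). We consider two separate cases, $a_1 = 0$ and $a_1 > 0$.
• Case $a_1 = 0$: When $a_1 = 0$, we obtain the following inequality:
$$\frac{0 \cdot a_2}{0 + a_2 + 1} \leq 0, \tag{28}$$
which clearly holds for any $a_2 \geq 0$.
• Case $a_1 > 0$: When $a_1 > 0$, we can divide both terms of the inequality by $a_1$ and arrive at:
$$\frac{a_2}{a_1 + a_2 + 1} \leq 1. \tag{29}$$
Next, we multiply both terms by $a_1 + a_2 + 1$, which is positive, obtaining that:
$$a_2 \leq a_1 + a_2 + 1. \tag{30}$$
We can subtract $a_2$ from both terms and obtain the following:
$$0 \leq a_1 + 1. \tag{31}$$
Since $a_1 > 0$, it results that Eq. (31) is true. Moreover, the inequality is strict when $a_1 > 0$, because $a_1 + 1 > 0$. This concludes our proof. ∎
Training Regime | Distance (First Conv Layer) | Distance (Last Conv Layer) |
---|---|---|
conventional | | |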
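Theorem 1 can also be checked numerically. The sketch below is a minimal Monte Carlo experiment under the assumptions of the proof (zero-mean noise, mutually independent components), additionally assuming Gaussian components. Writing $a_1$ and $a_2$ for the SNRs of the two factors, it compares the measured SNR of the product with the closed form $a_1 a_2 / (a_1 + a_2 + 1)$ that the derivation yields. All function names and parameter values are hypothetical.

```python
import random


def power(samples):
    """Empirical power E[x^2] of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def closed_form_product_snr(a1, a2):
    """SNR of the product, given the SNRs a1 and a2 of the two factors."""
    return a1 * a2 / (a1 + a2 + 1.0)


def empirical_product_snr(mu1, sd1, nz1, mu2, sd2, nz2, n=100_000, seed=0):
    """Monte Carlo estimate of the SNR of s1 * s2 with Gaussian components."""
    rng = random.Random(seed)
    clean, noise = [], []
    for _ in range(n):
        u1, z1 = rng.gauss(mu1, sd1), rng.gauss(0.0, nz1)
        u2, z2 = rng.gauss(mu2, sd2), rng.gauss(0.0, nz2)
        clean.append(u1 * u2)                      # clean part of the product
        noise.append(u1 * z2 + u2 * z1 + z1 * z2)  # remaining terms are noise
    return power(clean) / power(noise)


# Hypothetical parameters: mean and std of the clean part, std of the noise.
a1 = (1.0 ** 2 + 1.0 ** 2) / 0.5 ** 2  # SNR of the first factor
a2 = (0.5 ** 2 + 1.0 ** 2) / 1.0 ** 2  # SNR of the second factor
predicted = closed_form_product_snr(a1, a2)
measured = empirical_product_snr(1.0, 1.0, 0.5, 0.5, 1.0, 1.0)
print(f"a1={a1}, a2={a2}, predicted={predicted:.3f}, measured={measured:.3f}")
```

In this configuration, both the predicted and the measured SNR of the product should fall below the smaller of the two input SNRs, in line with the theorem.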
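Corollary 1.
Let $\{s_i\}_{i=1}^{n}$ be a set of $n$ signals, where each signal $s_i = u_i + z_i$ is formed of a clean component $u_i$ and a noise component $z_i$. The following inequality holds:
$$\mathrm{SNR}\left(\prod_{i=1}^{n} s_i\right) \leq \min_{i}\left\{\mathrm{SNR}(s_i)\right\}. \tag{32}$$
Proof.
The proof results immediately by induction from Theorem 1. Note that the inequality is strict when $n \geq 2$ and $\mathrm{SNR}(s_i) > 0, \forall i$. ∎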
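To make the effect of repeated multiplications concrete, the following sketch iterates the two-signal closed form $a_1 a_2 / (a_1 + a_2 + 1)$ over several factors, which mirrors the induction step. Treating each layer as a single multiplication with a signal of fixed SNR is a deliberate simplification, and the input SNR, weight SNR and depth are hypothetical values.

```python
def product_snr(a1, a2):
    """SNR of a product of two independent signals with zero-mean noise."""
    return a1 * a2 / (a1 + a2 + 1.0)


def snr_through_layers(input_snr, weight_snrs):
    """Return the SNR after each successive multiplication (one per layer)."""
    trace = [input_snr]
    for w in weight_snrs:
        trace.append(product_snr(trace[-1], w))
    return trace


# Hypothetical values: a clean input (SNR 50) passed through five layers
# whose freshly initialized weights are assumed to have an SNR of 2 each.
trace = snr_through_layers(50.0, [2.0] * 5)
for depth, value in enumerate(trace):
    print(f"after {depth} layers: SNR = {value:.3f}")
```

The printed sequence decreases at every step and never exceeds the smallest SNR among the factors multiplied so far.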
Training Regime | Entropy (First Conv Layer) | Entropy (Last Conv Layer) |
---|---|---|
conventional | | |
LeRaC (ours) | | |
We employ Corollary 1 in the context of neural networks, where the input signal, which is expected to bear meaningful information and thus have a high SNR, is initially multiplied with random weights, which are expected to have low SNR values just after initialization. According to Corollary 1, the SNR of the resulting signal (features) gradually decreases, layer by layer. In this context, we conjecture that optimizing the weights of a given layer to learn patterns from the signal (features) it receives as input is suboptimal for layers that are sufficiently far away from the input. This happens because the features passed to the respective layer can contain a large amount of noise, which can derail the network towards adapting the weights to the noise instead of the clean signal. This phenomenon becomes more and more prevalent as the layer is placed farther away from the input. To regulate this phenomenon during the initial stages of the learning process, we propose to employ LeRaC and gradually decrease the learning rate as layers get deeper, allowing the network to optimize the earlier weights sooner. We underline that training the earlier layers also reduces the amount of noise in later layers, since the SNR in later layers is bounded by the SNR in earlier layers (according to Corollary 1). As the amount of noise in later layers is progressively diminished, we can gradually increase the learning rates of later layers, allowing them to optimize their weights to cleaner signals (meaningful patterns).
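The resulting schedule can be sketched as follows: each layer starts from its own learning rate, lower for deeper layers, and all rates rise until they meet a shared target value. This is only an illustration. The geometric interpolation rule, the function name `lerac_style_schedule` and the concrete values are assumptions made for this sketch, and the scheduler defined in the main text may differ.

```python
def lerac_style_schedule(initial_lrs, target_lr, warmup_iters):
    """Per-layer learning rates that rise geometrically to a shared target.

    initial_lrs[j] is the starting rate of layer j (index 0 = closest to the
    input). Entry t of the result holds the rates of all layers at iteration t.
    """
    schedule = []
    for t in range(warmup_iters + 1):
        frac = t / warmup_iters  # 0 at the start, 1 when the curriculum ends
        schedule.append(
            [lr0 * (target_lr / lr0) ** frac for lr0 in initial_lrs]
        )
    return schedule


# Hypothetical setup: four layers, deeper layers start with smaller rates.
initial = [1e-1, 1e-2, 1e-3, 1e-4]
schedule = lerac_style_schedule(initial, target_lr=1e-1, warmup_iters=5)
for t, rates in enumerate(schedule):
    print(t, ["%.1e" % r for r in rates])
```

After the last warm-up iteration, every layer uses the target learning rate and training continues as in the conventional regime. In a deep learning framework, such per-layer values would typically be assigned through separate optimizer parameter groups.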
Appendix B Empirical Proof
Noise quantification of early and later layers. The application of LeRaC is justified by the fact that the level of noise gradually grows with each layer during a forward pass through a neural network with randomly initialized weights. To empirically confirm this statement, we have computed the distances for the low-level (first conv) and high-level (last conv) layers between the activation maps at iteration 0 (based on random weights) and the last iteration (based on weights optimized until convergence) for ResNet-18 on CIFAR-10, while using the conventional training regime. The computed distances shown in Table 12 confirm our conjecture, namely that shallow layers contain less noise than deep layers when applying the conventional training regime.
Training Regime | Distance (First Conv Layer) | Distance (Last Conv Layer) |
---|---|---|
conventional | | |
LeRaC (ours) | | |
Entropy of low-level versus high-level features. We show a few examples of training dynamics in Figure 3. All four graphs exhibit a larger gap between CBS and LeRaC in the first half of the training process, suggesting that LeRaC plays an important role in achieving faster convergence. To assess the comparative quality of low-level versus high-level feature maps obtained either with conventional or LeRaC training, we compute the entropy of the first and last conv layers of ResNet-18 on CIFAR-10 after a fixed number of iterations. We report the computed entropy levels in Table 13. Conventional training seems to update deeper layers faster, since we observe a higher difference between the entropy levels of low-level and high-level features obtained with conventional training than with LeRaC. This indicates that LeRaC balances the training pace of low-level and high-level features. We conjecture that updating the deeper layers too soon could lead to overfitting to the noise still present in the early layers. This statement is supported by our empirical results on 12 data sets, showing that giving a chance to the early layers to converge before introducing large updates to the later layers leads to superior performance.
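For readers who wish to reproduce this kind of measurement, the sketch below computes a histogram-based Shannon entropy over the values of a flattened activation map. The estimator, the number of bins and the toy inputs are assumptions chosen for illustration; they are not necessarily the exact procedure behind Table 13.

```python
import math


def histogram_entropy(values, bins=16):
    """Shannon entropy (in bits) of a histogram over the given activations."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant map carries no information
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # The maximum value is clipped into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


# Toy flattened activation maps (hypothetical values).
flat_map = [0.5] * 64                   # constant activations
spread_map = [float(v) for v in range(64)]  # evenly spread activations

print(histogram_entropy(flat_map))    # 0.0
print(histogram_entropy(spread_map))  # 4.0
```

A constant map yields zero entropy, while values spread evenly over all 16 bins reach the maximum of $\log_2 16 = 4$ bits.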
Aside from computing the global entropy over all training samples, in Figure 7, we illustrate some activation maps with the highest and lowest entropy from the first and last conv layers for three randomly chosen examples from ImageNet. The activation maps are extracted at epoch from the ResNet-18 model trained on CIFAR-10 either with the conventional regime, the CBS regime or the LeRaC regime. In general, we observe that the low-level activation maps corresponding to LeRaC and CBS exhibit a higher degree of variability (being more distinct from each other), regardless of the entropy level (low or high). In the case of LeRaC, we believe the higher degree of variability comes from the fact that, having lower learning rates for the deeper layers, the model based on LeRaC is likely focused on finding a higher variety of patterns within the first layers to minimize the loss. Similarly, in the case of CBS, blurring the intermediary feature maps reduces the information propagated within the network. This compels the lower layers to identify and learn more distinctive patterns to minimize the loss. However, in general, the patterns found by LeRaC are more diverse. For instance, in the case of CBS, the low-level activation maps of the first image show greater similarity to each other, in contrast to those generated by LeRaC. For the third example (the image of an airplane), we observe that the activation maps with the highest entropy from the last conv layer produced by LeRaC have a higher entropy than the activation maps with the highest entropy produced by the conventional regime. This observation is in line with the results reported in Table 13, confirming that LeRaC is able to better balance the entropy of low-level and high-level features by preventing the faster convergence of the deeper layers.
Distances at an intermediate epoch versus the final epoch. As discussed above, in Table 13, we report the entropy of the low-level and high-level layers at an intermediate point of training, with and without using LeRaC to train ResNet-18 on CIFAR-10. However, we consider that using the distance to the final feature maps provides additional useful insights about how LeRaC works. To this end, we compute the Euclidean distances of both low-level and high-level features between an intermediate epoch and the final epoch, with and without LeRaC. We report the distances in Table 14. The computed distances confirm our previous observations, namely that LeRaC is capable of balancing the training pace of low-level and high-level layers.
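The distance computation itself is straightforward. The sketch below measures the Euclidean distance between two flattened activation maps of the same layer taken at different points of training. The function name and the toy values are hypothetical, and any normalization or averaging over samples used for Table 14 is not modeled here.

```python
import math


def euclidean_distance(map_a, map_b):
    """Euclidean distance between two flattened activation maps."""
    if len(map_a) != len(map_b):
        raise ValueError("activation maps must have the same number of values")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(map_a, map_b)))


# Toy flattened activations of one layer (hypothetical values).
intermediate = [0.2, 0.9, 0.4, 0.1]
final = [0.3, 1.0, 0.5, 0.0]

print(f"distance to final features: {euclidean_distance(intermediate, final):.3f}")
```

Comparing this quantity for the first and last conv layers, under each training regime, gives the kind of contrast summarized in Table 14.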
Appendix C Additional Experiments
Model | Optimizer | Training Regime | Accuracy |
---|---|---|---|
Wide-ResNet-50 | Adam | conventional | |
Wide-ResNet-50 | SGD | conventional | |
Wide-ResNet-50 | SGD | LeRaC (ours) | |
Training progress for various initial learning rates. We compare the training progress of the conventional and LeRaC training regimes. We first comparatively consider the progress of ResNet-18 on CIFAR-10, shown in Figure 8, and CIFAR-100, shown in Figure 9, respectively. LeRaC is consistently better than the conventional regime for all initial learning rate configurations, on both data sets. We next compare the progress on CIFAR-10 for ResNet-18, illustrated in Figure 8, and CvT-13 (pre-trained), illustrated in Figure 10. The training progress of LeRaC is consistently above the training progress of the conventional regime, for both ResNet-18 and CvT-13. In summary, the results showcase the benefits on the training progress offered by LeRaC across distinct models and data sets.
Model | Training Regime | CIFAR-10 | CIFAR-100 | Tiny ImageNet |
---|---|---|---|---|
CvT-13 | conventional | | | |
CvT-13 | CBS | | | |
CvT-13 | LeRaC | | | |
CvT-13 | CBS + LeRaC | | | |

Model | Training Regime | Accuracy |
---|---|---|
ResNet-18 | conventional | |
ResNet-18 | LeRaC (ours) | |
Wide-ResNet-50 | conventional | |
Wide-ResNet-50 | LeRaC (ours) | |
SGD+LeRaC versus Adam. In Table 15, we present results showing that SGD and SGD+LeRaC obtain better accuracy rates than Adam [64] for the Wide-ResNet-50 model on CIFAR-100. This indicates that a simple optimizer combined with LeRaC can obtain better results than a state-of-the-art optimizer such as Adam. This justifies our decision to use a different optimizer for each neural model (see Table 1).
Combining CBS and LeRaC. Another interesting aspect worth studying is to determine if putting the CBS and LeRaC regimes together could bring further performance gains. We study the effect of combining CBS and LeRaC for CvT-13, since both CBS and LeRaC improve this model. In Table 16, we present the results with CvT-13 on CIFAR-10, CIFAR-100 and Tiny ImageNet. The reported results show that the combination brings accuracy gains across all three data sets. We thus conclude that the combination of curriculum learning regimes is worth a try, whenever the two independent regimes boost performance.
Data augmentation on vision data sets. Following Sinha et al. [7], we did not use data augmentation for the vision data sets. We consider training data augmentation as an orthogonal method for improving results, expecting improvements for both baseline and LeRaC models. Nevertheless, since we extended the experimental settings considered in Sinha et al. [7] to other domains, we took the liberty to use data augmentation in the audio domain (see the results in Table 6). The same augmentations (noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment) are used for all audio models, ensuring a fair comparison. Moreover, we next present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100 using the following augmentations: horizontal flip, rotation, solarization, blur, sharpening and auto-contrast. The results reported in Table 17 confirm that the performance gaps in the vision domain are in the same range after introducing data augmentation. In addition, we note that data augmentation seems to be rather harmful for the Wide-ResNet-50 model, which attains better results without data augmentation.
Training Set Size | Training Regime | Accuracy |
---|---|---|
5% | conventional | |
5% | CBS | |
5% | LeRaC (ours) | |
Limited data regime. In all our experiments carried out so far, the evaluated models were trained on the complete training sets. However, it is interesting to find out how our strategy behaves in a limited data regime. To this end, we conduct another experiment to compare LeRaC with the conventional and CBS regimes in a limited data scenario, considering only 5% of the training data. We present the results for ResNet-18 on CIFAR-100 in Table 18. The results indicate that LeRaC keeps its performance edge in the limited data regime. We therefore conclude that LeRaC can also be useful when limited training data is available.
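A limited data split of this kind can be built as in the sketch below, which draws a class-balanced 5% subset of training indices. Stratified sampling, the fixed seed and the function name are assumptions of this sketch, since the sampling procedure is not detailed above. The label list only mimics the layout of the CIFAR-100 training set (100 classes with 500 images each).

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Pick the same fraction of sample indices from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, round(len(indices) * fraction))  # keep at least one sample
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Placeholder labels: 100 classes, 500 samples per class.
labels = [c for c in range(100) for _ in range(500)]
subset = stratified_subset(labels, 0.05)
print(len(subset))  # 2500
```

The same index list can then be reused for every training regime, so that all methods are compared on identical data.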
References
- Bengio et al. [2009] Bengio Y, Louradour J, Collobert R, Weston J. Curriculum Learning. In: Proceedings of ICML; 2009. p. 41–48.
- Soviany et al. [2022] Soviany P, Ionescu RT, Rota P, Sebe N. Curriculum Learning: A Survey. International Journal of Computer Vision. 2022;130(6):1526–1565.
- Mitchell [1997] Mitchell TM. Machine Learning. New York: McGraw-Hill; 1997.
- Wang et al. [2022] Wang X, Chen Y, Zhu W. A Survey on Curriculum Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(9):4555–4576.
- Burduja and Ionescu [2021] Burduja M, Ionescu RT. Unsupervised Medical Image Alignment with Curriculum Learning. In: Proceedings of ICIP; 2021. p. 3787–3791.
- Karras et al. [2018] Karras T, Aila T, Laine S, Lehtinen J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In: Proceedings of ICLR; 2018.
- Sinha et al. [2020] Sinha S, Garg A, Larochelle H. Curriculum by Smoothing. In: Proceedings of NeurIPS; 2020. p. 21653–21664.
- Krizhevsky [2009] Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto; 2009.
- Russakovsky et al. [2015] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. 2015;115(3):211–252.
- Bossard et al. [2014] Bossard L, Guillaumin M, Van Gool L. Food-101 – Mining Discriminative Components with Random Forests. In: Proceedings of ECCV; 2014. p. 446–461.
- Zhang et al. [2017] Zhang Z, Song Y, Qi H. Age progression/regression by conditional adversarial autoencoder. In: Proceedings of CVPR; 2017. p. 5810–5818.
- Everingham et al. [2010] Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision. 2010;88(2):303–338.
- Clark et al. [2019] Clark C, Lee K, Chang MW, Kwiatkowski T, Collins M, Toutanova K. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In: Proceedings of NAACL; 2019. p. 2924–2936.
- Wang et al. [2019] Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In: Proceedings of ICLR; 2019.
- Piczak [2015] Piczak KJ. ESC: Dataset for Environmental Sound Classification. In: Proceedings of ACMMM; 2015. p. 1015–1018.
- Cao et al. [2014] Cao H, Cooper DG, Keutmann MK, Gur RC, Nenkova A, Verma R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing. 2014;5(4):377–390.
- He et al. [2016] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: Proceedings of CVPR; 2016. p. 770–778.
- Zagoruyko and Komodakis [2016] Zagoruyko S, Komodakis N. Wide Residual Networks. arXiv preprint arXiv:1605.07146. 2016.
- Huang et al. [2017] Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: Proceedings of CVPR; 2017. p. 2261–2269.
- Jocher et al. [2022] Jocher G, Chaurasia A, Stoken A, Borovec J, NanoCode012, Kwon Y, et al. ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo; 2022.
- Hochreiter and Schmidhuber [1997] Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997;9(8):1735–1780.
- Wu et al. [2021] Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, et al. CvT: Introducing Convolutions to Vision Transformers. In: Proceedings of ICCV; 2021. p. 22–31.
- Devlin et al. [2019] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL; 2019. p. 4171–4186.
- Ristea et al. [2022] Ristea NC, Ionescu RT, Khan FS. SepTr: Separable Transformer for Audio Spectrogram Processing. In: Proceedings of INTERSPEECH; 2022. p. 4103–4107.
- Dogan et al. [2020] Dogan Ü, Deshmukh AA, Machura MB, Igel C. Label-similarity curriculum learning. In: Proceedings of ECCV; 2020. p. 174–190.
- Wang et al. [2023] Wang Y, Yue Y, Lu R, Liu T, Zhong Z, Song S, et al. EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones. In: Proceedings of ICCV; 2023. p. 5852–5864.
- Khan et al. [2024] Khan MA, Menouar H, Hamila R. Curriculum for Crowd Counting – Is it Worthy? In: Proceedings of VISAPP; 2024. p. 583–590.
- Khan et al. [2023] Khan M, Hamila R, Menouar H. CLIP: Train Faster with Less Data. In: Proceedings of BigComp; 2023. p. 34–39.
- Khan et al. [2023] Khan MA, Menouar H, Hamila R. LCDnet: A Lightweight Crowd Density Estimation Model for Real-time Video Surveillance. Journal of Real-Time Image Processing. 2023;20(2):29.
- Gui et al. [2017] Gui L, Baltrušaitis T, Morency LP. Curriculum Learning for Facial Expression Recognition. In: Proceedings of FG; 2017. p. 505–511.
- Jiang et al. [2018] Jiang L, Zhou Z, Leung T, Li LJ, Fei-Fei L. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In: Proceedings of ICML; 2018. p. 2304–2313.
- Shi and Ferrari [2016] Shi M, Ferrari V. Weakly Supervised Object Localization Using Size Estimates. In: Proceedings of ECCV; 2016. p. 105–121.
- Soviany et al. [2021] Soviany P, Ionescu RT, Rota P, Sebe N. Curriculum self-paced learning for cross-domain object detection. Computer Vision and Image Understanding. 2021;204:103166.
- Chen and Gupta [2015] Chen X, Gupta A. Webly Supervised Learning of Convolutional Networks. In: Proceedings of ICCV; 2015. p. 1431–1439.
- Platanios et al. [2019] Platanios EA, Stretcu O, Neubig G, Poczos B, Mitchell T. Competence-based Curriculum Learning for Neural Machine Translation. In: Proceedings of NAACL; 2019. p. 1162–1172.
- Kocmi and Bojar [2017] Kocmi T, Bojar O. Curriculum Learning and Minibatch Bucketing in Neural Machine Translation. In: Proceedings of RANLP; 2017. p. 379–386.
- Spitkovsky et al. [2009] Spitkovsky VI, Alshawi H, Jurafsky D. Baby Steps: How “Less is More” in unsupervised dependency parsing. In: Proceedings of NIPS; 2009.
- Liu et al. [2018] Liu C, He S, Liu K, Zhao J. Curriculum Learning for Natural Answer Generation. In: Proceedings of IJCAI; 2018. p. 4223–4229.
- Ranjan and Hansen [2018] Ranjan S, Hansen JHL. Curriculum Learning Based Approaches for Noise Robust Speaker Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018;26:197–210.
- Amodei et al. [2016] Amodei D, Ananthanarayanan S, Anubhai R, Bai J, Battenberg E, Case C, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In: Proceedings of ICML; 2016. p. 173–182.
- Pentina et al. [2015] Pentina A, Sharmanska V, Lampert CH. Curriculum Learning of Multiple Tasks. In: Proceedings of CVPR; 2015. p. 5492–5500.
- Jiménez-Sánchez et al. [2019] Jiménez-Sánchez A, Mateus D, Kirchhoff S, Kirchhoff C, Biberthaler P, Navab N, et al. Medical-based Deep Curriculum Learning for Improved Fracture Classification. In: Proceedings of MICCAI; 2019. p. 694–702.
- Wei et al. [2021] Wei J, Suriawinata A, Ren B, Liu X, Lisovsky M, Vaickus L, et al. Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification. In: Proceedings of WACV; 2021. p. 2472–2482.
- Cirik et al. [2016] Cirik V, Hovy E, Morency LP. Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks. arXiv preprint arXiv:1611.06204. 2016.
- Tay et al. [2019] Tay Y, Wang S, Luu AT, Fu J, Phan MC, Yuan X, et al. Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives. In: Proceedings of ACL; 2019. p. 4922–4931.
- Zhang et al. [2021] Zhang W, Wei W, Wang W, Jin L, Cao Z. Reducing BERT Computation by Padding Removal and Curriculum Learning. In: Proceedings of ISPASS; 2021. p. 90–92.
- Ionescu et al. [2016] Ionescu RT, Alexe B, Leordeanu M, Popescu M, Papadopoulos DP, Ferrari V. How Hard Can It Be? Estimating the Difficulty of Visual Search in an Image. In: Proceedings of CVPR; 2016. p. 2157–2166.
- Gong et al. [2016] Gong C, Tao D, Maybank SJ, Liu W, Kang G, Yang J. Multi-Modal Curriculum Learning for Semi-Supervised Image Classification. IEEE Transactions on Image Processing. 2016;25(7):3249–3260.
- Hacohen and Weinshall [2019] Hacohen G, Weinshall D. On The Power of Curriculum Learning in Training Deep Networks. In: Proceedings of ICML; 2019. p. 2535–2544.
- Kumar et al. [2010] Kumar M, Packer B, Koller D. Self-Paced Learning for Latent Variable Models. In: Proceedings of NIPS. vol. 23; 2010. p. 1189–1197.
- Gong et al. [2019] Gong M, Li H, Meng D, Miao Q, Liu J. Decomposition-Based Evolutionary Multiobjective Optimization to Self-Paced Learning. IEEE Transactions on Evolutionary Computation. 2019;23(2):288–302.
- Fan et al. [2017] Fan Y, He R, Liang J, Hu BG. Self-Paced Learning: An Implicit Regularization Perspective. In: Proceedings of AAAI; 2017. p. 1877–1883.
- Li et al. [2016] Li H, Gong M, Meng D, Miao Q. Multi-Objective Self-Paced Learning. In: Proceedings of AAAI; 2016. p. 1802–1808.
- Zhou et al. [2018] Zhou S, Wang J, Meng D, Xin X, Li Y, Gong Y, et al. Deep self-paced learning for person re-identification. Pattern Recognition. 2018;76:739–751.
- Jiang et al. [2015] Jiang L, Meng D, Zhao Q, Shan S, Hauptmann AG. Self-Paced Curriculum Learning. In: Proceedings of AAAI; 2015. p. 2694–2700.
- Ristea and Ionescu [2021] Ristea NC, Ionescu RT. Self-paced ensemble learning for speech and audio classification. In: Proceedings of INTERSPEECH; 2021. p. 2836–2840.
- Ma et al. [2017] Ma F, Meng D, Xie Q, Li Z, Dong X. Self-Paced Co-training. In: Proceedings of ICML. vol. 70; 2017. p. 2275–2284.
- Jiang et al. [2014] Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann AG. Self-Paced Learning with Diversity. In: Proceedings of NIPS; 2014. p. 2078–2086.
- Zhang et al. [2019] Zhang M, Yu Z, Wang H, Qin H, Zhao W, Liu Y. Automatic Digital Modulation Classification Based on Curriculum Learning. Applied Sciences. 2019;9(10).
- Wu et al. [2018] Wu L, Tian F, Xia Y, Fan Y, Qin T, Jian-Huang L, et al. Learning to Teach with Dynamic Loss Functions. In: Proceedings of NeurIPS. vol. 31; 2018. p. 6467–6478.
- Singh et al. [2015] Singh B, De S, Zhang Y, Goldstein T, Taylor G. Layer-specific adaptive learning rates for deep networks. In: Proceedings of ICMLA; 2015. p. 364–368.
- You et al. [2017] You Y, Gitman I, Ginsburg B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. 2017.
- Gotmare et al. [2019] Gotmare A, Keskar NS, Xiong C, Socher R. A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation. In: Proceedings of ICLR; 2019.
- Kingma and Ba [2015] Kingma DP, Ba JL. Adam: A Method for Stochastic Optimization. In: Proceedings of ICLR; 2015.
- Loshchilov and Hutter [2019] Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. In: Proceedings of ICLR; 2019.
- Glorot and Bengio [2010] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of AISTATS; 2010. p. 249–256.
- Wang et al. [2019] Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In: Proceedings of NeurIPS. vol. 32; 2019. p. 3266–3280.
- Rajpurkar et al. [2016] Rajpurkar P, Zhang J, Lopyrev K, Liang P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In: Proceedings of EMNLP; 2016. p. 2383–2392.
- Wang et al. [2020] Wang CY, Liao HYM, Wu YH, Chen PY, Hsieh JW, Yeh IH. CSPNet: A new backbone that can enhance learning capability of CNN. In: Proceedings of CVPRW; 2020. p. 390–391.
- Lin et al. [2014] Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common Objects in Context. In: Proceedings of ECCV; 2014. p. 740–755.
- Park et al. [2019] Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In: Proceedings of INTERSPEECH; 2019. p. 2613–2617.
- Dietterich [1998] Dietterich TG. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. 1998;10(7):1895–1923.