
AlleNoise - large-scale text classification benchmark dataset with real-world label noise

Alicja Rączkowska*  Aleksandra Osowska-Kurczab*  Jacek Szczerbiński*
Kalina Jasinska-Kobus*  Klaudia Nazarko*
* Equal contribution
Machine Learning Research
Allegro.com
{alicja.raczkowska, aleksandra.kurczab, jacek.szczerbinski,
kalina.kobus, klaudia.nazarko}@allegro.com
Abstract

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.

1 Introduction

The problem of label noise poses a sizeable challenge for classification models [frenay_classification_2014, song_learning_2022]. With modern deep neural networks, due to their capacity, it is possible to memorize all labels in a given training dataset [rolnick_deep_2018]. This, effectively, leads to overfitting to noise if the training dataset contains noisy labels, which in turn reduces the generalization capability of such models [arpit_closer_2017, zhang_understanding_generalization_2017, zhang_understanding_generalization_sequel_2021].

Most previous works on training robust classifiers have focused on analyzing relatively simple cases of synthetic noise [jindal_learning_2017, patrini_making_2017], either uniform (i.e. symmetric) or class-conditional (i.e. asymmetric). It is a common practice to evaluate these methods using popular datasets synthetically corrupted with label noise, such as MNIST [deng_mnist_2012], ImageNet [deng_imagenet_2009], CIFAR [krizhevsky_learning_2009] or SVHN [netzer_reading_2011]. However, synthetic noise is not indicative of realistic label noise and thus deciding to use noisy label methods based on such benchmarks can lead to unsatisfactory results in real-world machine learning practice. Moreover, it has been shown that these benchmark datasets are already noisy themselves [northcutt_pervasive_2021, liu_noise_text_2022], so the study of strictly synthetic noise in such a context is intrinsically flawed.

Realistic label noise is instance dependent, i.e. the labeling mistakes are caused not simply by label ambiguity, but by input uncertainty as well [goldberger_training_2017]. This is an inescapable fact when human annotators are responsible for the labeling process [krishna_embracing_2016]. However, many existing approaches for mitigating instance-dependent noise have one drawback in common - they had to, in some capacity, artificially model the noise distribution due to the lack of existing benchmark datasets [nguyen_robust_2022, gu_instance-dependent_2021, chen_beyond_2020, xia_part-dependent_2020, algan_label_2020, berthon_confidence_2021]. In addition, most of the focus in the field has been put on image classification, but with the ever-increasing importance of Transformer-based [transformer] architectures, the problem of label noise affecting the fine-tuning of natural language processing models needs to be addressed as well. There are many benchmark datasets for text data classification [maas_imdb_2011, lin_dataset_2019, wang_glue_2019, bhatia_extreme_2016], but none of them are meant for the study of label noise. In most cases, the actual level of noise in these datasets is unknown, so using them for benchmarking label noise methods is unfeasible.

Moreover, the datasets used in this research area usually contain relatively few labels. The maximum reported number of labels is 1000 [li_webvision_2017]. As such, there is a glaring lack of a benchmark dataset for studying label noise that provides realistic real-world noise, a high number of labels and text data at the same time.

We see a need for a textual benchmark dataset that would provide realistic instance-dependent noise distribution with a known level of label noise, as well as a relatively large number of target classes, with both clean and noisy labels. To this end, in this paper we provide the following main contributions:

  • We introduce AlleNoise - a benchmark dataset for multi-class text classification with real-world label noise. The dataset consists of 502,310 short texts (e-commerce product titles) belonging to 5,692 categories (taken from a real product assortment tree). It includes a noise level of 15%, stemming from mislabeled data points. This amount of noise reflects the actual noise distribution in the data source (Allegro.com e-commerce platform). For each of the mislabeled data instances, the true category label was determined by human domain experts.

  • We benchmark a comprehensive selection of well-established methods for classification with label noise against the real-world noise present in AlleNoise and compare the results to synthetic label noise generated for the same dataset. We provide evidence that the selected methods fail to mitigate real-world label noise, even though they are very effective in alleviating synthetic label noise.

Figure 1: Symmetric noise vs. AlleNoise in examples. Correct and noisy labels are marked in green and red, respectively. (a) Symmetric noise: an electric toothbrush incorrectly labeled as a winter tire is easy to spot, even for an untrained human. (b) AlleNoise: a ceiling dome is mislabeled as a pendant lamp. This error is semantically challenging and hard to detect. Note: AlleNoise dataset does not include images.

2 Related work

Several classification benchmarks with real-world instance-dependent noise have been reported in the literature. ANIMAL-10N [song_selfie_2019] is a human-labeled dataset of confusing images of animals, with 10 classes and an 8% noise level. CIFAR-10N and CIFAR-100N [wei_learning_2022] are noisy versions of the CIFAR dataset, with labels assigned by crowd-sourced human annotators. CIFAR-10N is provided in three versions, with noise levels of 9%, 18% and 40%, while CIFAR-100N has a noise level of 40%. Clothing1M [xiao_learning_2015] is a large-scale dataset of fashion images crawled from several online shops. It contains 14 classes and the estimated noise rate is 38%. Similarly, WebVision [li_webvision_2017] comprises images crawled from the web, but it is more general - it has 1000 categories of diverse images. The estimated noise level is 20%. DCIC [schmarje_annotation_2022] is a benchmark that consists of 10 real-world image datasets, with several human annotations per image. This allows for testing algorithms that utilize soft labels to mitigate various kinds of annotation errors. The maximum number of classes in the included datasets is 10.

Dataset     Modality  Total examples  Classes  Noise level  Clean label
ANIMAL10N   Images    55k             10       8%
CIFAR10N    Images    60k             10       9/18/40%
CIFAR100N   Images    60k             100      40%
WebVision   Images    2.4M            1000     ~20%
Clothing1M  Images    1M              14       ~38%
Hausa       Text      2,917           5        50.37%
Yorùbá      Text      1,908           7        33.28%
NoisyNER    Text      217k            4        unspecified
AlleNoise   Text      500k            5692     15%
Table 1: Comparison of AlleNoise to previously published datasets created for studying the problem of learning with noisy labels. All datasets contain real-world noise. AlleNoise is the biggest text classification dataset in this field, has a known level of label noise and provides clean labels in addition to the noisy ones.

With the focus in the label noise field being primarily on images, the issue of noisy text classification remains relatively unexplored. Previous works have either utilized existing classification datasets with synthetic noise [jindal_effective_2019, liu_noise_text_2022, nguyen_robust_2022] or introduced new datasets with real-world noise. NoisyNER [hedderich_analysing_2021] contains annotated named entity recognition data in the Estonian language, assigned to 4 categories. The authors do not mention the noise level, only that they provide 7 variants of real-world noise. NoisywikiHow [wu_noisywikihow_2023] is a dataset of articles scraped from the wikiHow website, with 158 accompanying article categories. The data was manually cleaned by human annotators, which eliminated the real-world noise distribution. The authors performed experiments by injecting synthetic noise into their dataset. Thus, NoisywikiHow is not directly comparable to AlleNoise. Two further datasets are Hausa and Yorùbá [hedderich_transfer_2020], text classification datasets of low-resource African languages with 5 and 7 categories, respectively. They both include real-world noise, at a level of 50.37% for the former and 33.28% for the latter.

While there are a number of text datasets containing e-commerce product data [lin_dataset_2019, nguyen_robust_2022, bhatia_extreme_2016], none of them have verified clean labels, and in most cases the noise level is unknown. Similarly, classification settings with large numbers of classes (i.e. more than 1000) have not been addressed by the existing datasets (Tab. 1).

3 AlleNoise Dataset Construction

We introduce AlleNoise - a benchmark dataset for large-scale multi-class text classification with real-world label noise. The dataset consists of 502,310 e-commerce product titles listed on Allegro.com in 5,692 assortment categories, collected in January of 2022. 15% of the products were listed in wrong categories, hence for each entry the dataset includes: the product title, the category where the product was originally listed, and the category where it should be listed according to human experts.

Additionally, we release the taxonomy of product categories in the form of a mapping (category ID → path in the category tree), which allows for fine-grained exploration of noise semantics.

Offer title                                     Category label  True category label
Emporia PURE V25 BLACK                          352             170
Metal Hanging Lid Rack Suspended                68710           321104
Miraculum Asta Plankton C Active Serum-Booster  5360            89000

Category label  Category name
352             Electronics > Phones and Accessories > GSM Accessories > Batteries
170             Electronics > Phones and Accessories > Smartphones and Cell Phones
68710           Home and Garden > Equipment > Kitchen Utensils > Pots and Pans > Lids
321104          Home and Garden > Equipment > Kitchen Utensils > Pots and Pans > Organizers
5360            Allegro > Beauty > Care > Face > Masks
89000           Allegro > Beauty > Care > Face > Serum
Figure 2: AlleNoise consists of two tables: the first table includes the true and noisy label for each product title, while the second table maps the labels to category names.
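
The two-table layout can be illustrated with a minimal sketch that resolves both labels of each offer to their paths in the category tree. The rows are copied from Figure 2; the field names (`noisy_label`, `true_label`) and the helper functions are illustrative assumptions and do not necessarily match the column names of the released files.

```python
# Rows copied from Figure 2. The field names below are illustrative
# assumptions, not necessarily the column names of the released files.
offers = [
    {"title": "Emporia PURE V25 BLACK", "noisy_label": 352, "true_label": 170},
    {"title": "Metal Hanging Lid Rack Suspended", "noisy_label": 68710, "true_label": 321104},
    {"title": "Miraculum Asta Plankton C Active Serum-Booster", "noisy_label": 5360, "true_label": 89000},
]
category_paths = {
    352: "Electronics > Phones and Accessories > GSM Accessories > Batteries",
    170: "Electronics > Phones and Accessories > Smartphones and Cell Phones",
    68710: "Home and Garden > Equipment > Kitchen Utensils > Pots and Pans > Lids",
    321104: "Home and Garden > Equipment > Kitchen Utensils > Pots and Pans > Organizers",
    5360: "Allegro > Beauty > Care > Face > Masks",
    89000: "Allegro > Beauty > Care > Face > Serum",
}


def shared_depth(path_a, path_b):
    """Count the leading taxonomy levels that two category paths share."""
    depth = 0
    for level_a, level_b in zip(path_a.split(" > "), path_b.split(" > ")):
        if level_a != level_b:
            break
        depth += 1
    return depth


def describe(offer, paths):
    """Resolve both labels of one offer to their paths in the category tree."""
    listed_in = paths[offer["noisy_label"]]
    should_be_in = paths[offer["true_label"]]
    return {
        "title": offer["title"],
        "listed_in": listed_in,
        "should_be_in": should_be_in,
        "is_mislabeled": offer["noisy_label"] != offer["true_label"],
        "shared_depth": shared_depth(listed_in, should_be_in),
    }


resolved = [describe(offer, category_paths) for offer in offers]
for row in resolved:
    print(row["title"], "|", row["listed_in"], "->", row["should_be_in"])
```

In these three examples the noisy and true categories share two to four leading levels of the tree, which is the kind of question the taxonomy mapping makes it possible to ask.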

3.1 Real-world noise

We collected 75,348 mislabeled products from two sources: 1) customer complaints about a product being listed in the wrong category - such requests usually suggest the true category label, 2) assortment clean-up by internal domain experts, employed by Allegro - products listed in the wrong category were manually moved to the correct category.

The resulting distribution of label noise is not uniform over the entire product assortment - most of the noisy instances belong to a small number of categories. Such asymmetric distribution is an inherent feature of real-world label noise. It is frequently modeled with class-conditional synthetic noise in related literature. However, since the mistakes in AlleNoise were based not only on the category name, but also on the product features, our noise distribution is in fact instance-dependent.

3.2 Clean data sampling

The 75,348 mislabeled products were complemented with 426,962 products listed in correct categories. The clean instances were sampled from the most popular items listed in the same categories as the noisy instances, proportionally to the total number of products listed in each category. The high popularity of the sampled products guarantees their correct categorization, because items that generate a lot of traffic are curated by human domain experts. Thus, the sampled distribution was representative of a subset of the whole marketplace: 5,692 categories out of over 23,000, for which label noise is particularly well known and described.
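
As a quick sanity check, the counts quoted above reproduce the dataset size and the 15% noise level. The snippet only restates arithmetic on numbers from the text; since the marketplace has "over 23,000" categories, the category share it prints should be read as an upper bound.

```python
mislabeled = 75_348
correctly_listed = 426_962
total = mislabeled + correctly_listed
noise_level = mislabeled / total

categories_in_dataset = 5_692
categories_in_marketplace = 23_000  # "over 23,000", so the share is an upper bound
category_share = categories_in_dataset / categories_in_marketplace

print(total)                    # 502310
print(f"{noise_level:.2%}")     # 15.00%
print(f"{category_share:.1%}")  # 24.7%
```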

3.3 Post-processing

We automatically translated all 500k product titles from Polish to English. Machine translation is a common part of e-commerce: many platforms incorporate it in multiple aspects of their operation [tan_ecommerce_2020, zhang_improve_2023]. Moreover, it is an established practice to publish machine-translated text in product datasets [ni_justifying_2019]. Categories related to sexually explicit content were removed from the dataset altogether. Finally, categories with fewer than 5 products were removed from the dataset to allow for five-fold cross-validation in our experiments.

4 Methods

4.1 Problem statement

Let $\mathcal{X}$ denote the input feature space, and $\mathcal{Y}$ be a set of class labels. In a typical supervised setting, each instance $x_i$ has a true class label $y_i$. However, in learning with noisy labels, $\tilde{y}_i$ is observed instead, which differs from the true $y_i$ with an unknown probability $p$ (the noise level).

In this setting, we train a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ that generalizes knowledge learnt from a dataset $\mathcal{D}$, consisting of training examples $(x_i, \tilde{y}_i)$. Because $\tilde{y}_i$ can be affected by label noise, the model's predictions $\hat{y}_i = f(x_i)$ might be corrupted by the distribution of noisy labels as well. Maximizing the robustness of such a classifier implies reducing the impact of noisy training samples on the generalization performance. In the AlleNoise dataset, $x_i$ corresponds to the product title, $\tilde{y}_i$ is the original product category, and $y_i$ is the correct category.

4.2 Synthetic noise generation

In order to compare the real-world noise directly with synthetic noise, we applied different kinds of synthetic noise to the clean version of AlleNoise: the synthetic noise was applied to each instance's true label $y_i$, yielding a new synthetic noisy label $\tilde{y}_i$. Overall, the labels were flipped for a controlled fraction $p = 15\%$ of all instances. We examined the following types of synthetic noise:

  • Symmetric noise: each instance is given a noisy label different from the original label, with uniform probability $p$.

  • Class-conditional pair-flip noise: each instance in class $j$ is given a noisy label $j+1$ with probability $p$.

  • Class-conditional nested-flip noise: we only flip categories that are close to each other in the hierarchical taxonomy of categories. For example, for the parent category Car Tires we perform a cyclic flip between its children categories: Summer → Winter → All-Season → Summer with probability $p$. Thus, the noise transition matrix is a block matrix with a small number of off-diagonal elements equal to $p$.

  • Class-conditional matrix-flip noise: the transition matrix between classes is approximated with the baseline classifier’s confusion matrix. The confusion matrix is evaluated against the clean labels on 8% of the dataset (validation split) [patrini_making_2017]. The resulting noise distribution is particularly tricky: we flip the labels between the classes that the model is most likely to confuse.
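
The first three noise types above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: labels are taken to be integers in `0..num_classes-1`, each instance is flipped independently with probability `p` (so the realized noise fraction is only approximately `p`), pair-flip wraps the last class around to class 0, and `sibling_groups` is a hypothetical list of child categories that share a parent.

```python
import random


def symmetric_noise(labels, num_classes, p, rng):
    """With probability p, replace a label with a different class chosen uniformly."""
    noisy = []
    for y in labels:
        if rng.random() < p:
            # A non-zero offset guarantees that the new label differs from y.
            y = (y + rng.randrange(1, num_classes)) % num_classes
        noisy.append(y)
    return noisy


def pair_flip_noise(labels, num_classes, p, rng):
    """With probability p, move class j to class j + 1 (wrap-around is an assumption)."""
    return [(y + 1) % num_classes if rng.random() < p else y for y in labels]


def nested_flip_noise(labels, sibling_groups, p, rng):
    """With probability p, move a class to the next sibling under the same parent."""
    next_sibling = {}
    for group in sibling_groups:
        for position, category in enumerate(group):
            next_sibling[category] = group[(position + 1) % len(group)]
    # Classes outside every sibling group are left untouched.
    return [next_sibling.get(y, y) if rng.random() < p else y for y in labels]


def flipped_fraction(clean, noisy):
    """Share of instances whose label was changed."""
    return sum(a != b for a, b in zip(clean, noisy)) / len(clean)


rng = random.Random(0)  # seeded for reproducibility
labels = [i % 6 for i in range(60_000)]  # toy data: 6 classes, 10,000 instances each
sibling_groups = [[0, 1, 2]]  # e.g. Summer -> Winter -> All-Season -> Summer

noisy_symmetric = symmetric_noise(labels, 6, 0.15, rng)
noisy_pair = pair_flip_noise(labels, 6, 0.15, rng)
noisy_nested = nested_flip_noise(labels, sibling_groups, 0.15, rng)

print(flipped_fraction(labels, noisy_symmetric))
print(flipped_fraction(labels, noisy_pair))
print(flipped_fraction(labels, noisy_nested))
```

In the toy data only half of the classes belong to a sibling group, so nested-flip corrupts roughly half as many instances as the other two variants. Matrix-flip noise is omitted here because it requires a trained baseline classifier's confusion matrix.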

4.3 Model architecture

Next, we evaluated several algorithms for training classifiers under label noise. For a fair comparison, all experiments utilized the same classifier architecture as well as training and evaluation loops. We followed a fine-tuning routine that is typical for text classification tasks. In particular, we vectorized text inputs with XLMRoberta [xlmroberta], a multilingual text encoder based on the Transformer architecture [transformer]. To provide the final class predictions, we used a single fully connected layer with a softmax activation and the number of neurons equal to the number of classes. The baseline model uses cross-entropy (CE) as a loss function.

Models were trained with the AdamW optimiser and linear LambdaLR scheduling (warmup steps = 100). We have not used any additional regularization, i.e. weight decay or dropout. Key training parameters, such as batch size ($\text{bs} = 256$) and learning rate ($\text{lr} = 10^{-4}$), were tuned to maximize the validation accuracy on the clean dataset. All models have been trained for 10 epochs. Training of the baseline model, accelerated with a single NVIDIA A100 40GB GPU, lasted for about 1 hour.
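
To make the schedule concrete, the following sketch computes the learning rate at a given optimizer step. The text only states that the scheduling is linear with 100 warmup steps; the linear decay to zero after warmup, the step count derived from a 72% training split, and the function names are assumptions made for illustration. In PyTorch, a function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.

```python
import math


def linear_warmup_then_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier at a given optimizer step."""
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp from 0 to 1
    # Linear decay from 1 at the end of warmup to 0 at the last step (assumed shape).
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))


base_lr = 1e-4
batch_size = 256
epochs = 10
warmup_steps = 100

# Assumption: the training split holds 72% of the 502,310 examples.
train_examples = int(0.72 * 502_310)
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = epochs * steps_per_epoch


def lr_at(step):
    """Absolute learning rate at a given optimizer step."""
    return base_lr * linear_warmup_then_decay(step, warmup_steps, total_steps)


for step in (0, 50, 100, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

Under these assumptions the warmup covers less than 1% of the roughly 14,000 optimizer steps.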

We used five-fold stratified cross-validation to comprehensively evaluate the results of the models trained with label noise. For each fold, the full dataset was divided into three splits: $\mathcal{D}_{train}$, $\mathcal{D}_{val}$, $\mathcal{D}_{test}$, in proportion 72% : 8% : 20%. Following the literature on learning with noisy labels [song_learning_2022], both $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ were corrupted with label noise, while $\mathcal{D}_{test}$ remained clean.
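
One way to obtain the 72% : 8% : 20% proportions is to hold out one of five stratified folds for testing and split the remaining 80% into training and validation parts at 90:10. The sketch below follows that reading; the helper names are hypothetical, and carving out the validation part at random rather than with stratification is an assumption, since the text does not specify it.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign every example to a fold so that each class is spread across all folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [0] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


def split_for_fold(fold_of, test_fold, val_share=0.1, seed=0):
    """Hold out one fold (20%) for testing; split the rest 90:10 into train and val."""
    rng = random.Random(seed + test_fold)
    test = [i for i, fold in enumerate(fold_of) if fold == test_fold]
    rest = [i for i, fold in enumerate(fold_of) if fold != test_fold]
    rng.shuffle(rest)
    n_val = round(val_share * len(rest))
    return rest[n_val:], rest[:n_val], test


# Toy data: 20 classes with 50 examples each.
labels = [category for category in range(20) for _ in range(50)]
fold_of = stratified_folds(labels)
train, val, test = split_for_fold(fold_of, test_fold=0)
print(len(train), len(val), len(test))  # 720 80 200
```

With at least five examples per class, every class appears in every test fold, which is why categories with fewer than 5 products had to be removed.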

All of the results presented in this study correspond to the last checkpoint of the model. We use the following format for presenting the experimental results: $[\text{m}] \pm [\text{s}]$, where $m$ is an average over the five cross-validation folds, while $s$ is the standard deviation. Experiments used a seeded random number generator to ensure the reproducibility of the results.
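
The reporting format can be reproduced with the standard library, as in the short sketch below. The fold scores are made-up placeholder values, not results from this study, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def summarize(fold_scores):
    """Format per-fold scores as 'mean ± standard deviation'."""
    m = statistics.mean(fold_scores)
    s = statistics.stdev(fold_scores)  # sample standard deviation (assumption)
    return f"{m:.2f} ± {s:.2f}"


# Placeholder accuracies for five folds, for illustration only.
example_scores = [70.0, 71.0, 72.0, 73.0, 74.0]
print(summarize(example_scores))  # 72.00 ± 1.58
```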

4.4 Evaluation metrics

Accuracy on the clean test set is the key metric in our study. We expect that methods that are robust to the label noise observed in the training phase should be able to improve the test accuracy when compared to the baseline model.

Additionally, to better understand the difference between synthetic and real-world noise, we collected detailed validation metrics. The validation dataset $\mathcal{D}_{val}$ contained both instances for which the observed label $\tilde{y}_i$ was incorrect ($\mathcal{D}_{val}^{\texttt{noisy}}$) and correct ($\mathcal{D}_{val}^{\texttt{clean}}$). Noisy observations from $\mathcal{D}_{val}^{\texttt{noisy}}$ were used to measure the memorization metric $\texttt{memorized}_{val}$, defined as a ratio of predictions $\hat{y}_i$ that match the noisy label $\tilde{y}_i$. Notice that our memorization metric is computed on the validation set, contrary to the training set typically used in the literature [liu_early-learning_2020]. Our metric increases when the model not only memorizes incorrect classes from the training observations, but also repeats these errors on unseen observations.
Furthermore, we compute accuracy on $\mathcal{D}_{val}^{\texttt{noisy}}$, denoted as $\texttt{correct}_{val}^{\texttt{noisy}}$, and its counterpart on the clean fraction, $\texttt{correct}_{val}^{\texttt{clean}}$.
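
The three validation metrics can be written down directly from these definitions. The following is a minimal sketch with hypothetical argument names and toy labels; it assumes that the true label, the observed label and the prediction are available for every validation instance.

```python
def validation_metrics(true_labels, observed_labels, predictions):
    """Compute memorized_val, correct_val^noisy and correct_val^clean."""
    pairs = list(zip(true_labels, observed_labels))
    noisy = [i for i, (y, y_obs) in enumerate(pairs) if y != y_obs]
    clean = [i for i, (y, y_obs) in enumerate(pairs) if y == y_obs]

    def ratio(hits, subset):
        return hits / len(subset) if subset else float("nan")

    return {
        # Share of noisy instances where the prediction repeats the noisy label.
        "memorized": ratio(sum(predictions[i] == observed_labels[i] for i in noisy), noisy),
        # Share of noisy instances where the prediction recovers the true label.
        "correct_noisy": ratio(sum(predictions[i] == true_labels[i] for i in noisy), noisy),
        # Accuracy on instances whose observed label is already correct.
        "correct_clean": ratio(sum(predictions[i] == true_labels[i] for i in clean), clean),
    }


# Toy example: 8 validation instances, 3 of which carry a noisy label.
true_labels = [0, 0, 1, 1, 2, 2, 2, 2]
observed_labels = [0, 1, 1, 0, 2, 2, 2, 1]
predictions = [0, 1, 1, 1, 2, 2, 0, 2]

metrics = validation_metrics(true_labels, observed_labels, predictions)
print(metrics)
```

Because the noisy and true labels differ on every instance of the noisy subset, a single prediction cannot count towards both `memorized` and `correct_noisy`, so the two ratios sum to at most 1.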

4.5 Benchmarked methods

We evaluated the following methods for learning with noisy labels: Self-Paced Learning (SPL) [kumar_self-paced_2010], Provably Robust Learning (PRL) [liu_learning_2021], Early Learning Regularization (ELR) [liu_early-learning_2020], Generalized Jensen-Shannon Divergence (GJSD) [englesson_generalized_2021], Co-teaching (CT) [han_co-teaching_2018], Co-teaching+ (CT+) [yu_how_2019] and Mixup (MU) [zhang_mixup_2018]. Additionally, we implemented Clipped Cross-Entropy (CCE) as a simple baseline (see Appendix A). These approaches represent a comprehensive selection of different method families: novel loss functions (GJSD), noise filtration (SPL, PRL, CCE, CT, CT+), robust regularization (ELR), data augmentation (MU) and training loop modifications (CT, CT+).

These methods are implemented with a range of technologies and software libraries. As such, in order to have a reliable and unbiased framework for comparing them, it is necessary to standardize the software implementation. To this end, we re-implemented these methods using PyTorch (version 1.13.1) and PyTorch Lightning (version 1.5.0) software libraries. We publish our re-implementations and the accompanying evaluation code on GitHub at https://github.com/allegro/AlleNoise.

To select the best hyperparameters (see Appendix A) for each of the benchmarked algorithms, we performed a tuning process on the AlleNoise dataset. We focused on maximizing the fraction of correct clean examples $\texttt{correct}_{val}^{\texttt{clean}}$ within the validation set for two noise types: 15% real-world noise and 15% symmetric noise. The tuning was performed on a single fold selected out of five cross-validation folds, yielding optimal hyperparameter values (Tab. S1). We then used these tuned values in all further experiments.

5 Results

The selected methods for learning with noisy labels were found to perform differently on AlleNoise than on several types of synthetic noise. Below we highlight those differences in performance and relate them to the dissimilarities between real-world and synthetic noise.

5.1 Synthetic noise vs AlleNoise

The selected methods were compared on the clean dataset, the four types of synthetic noise and on the real-world noise in AlleNoise (Tab. 2). The accuracy score on the clean dataset did not degrade for any of the evaluated algorithms when compared to the baseline CE. When it comes to the performance on the datasets with symmetric noise, the best method was GJSD, with CCE not too far behind. GJSD increased the accuracy by 1.31 percentage points (p.p.) over the baseline. For asymmetric noise types, the best method was consistently ELR. It significantly improved the test accuracy in comparison to CE, by 1.3 p.p. on average. Interestingly, some methods deteriorated the test accuracy. CT+ was worse than the baseline for all synthetic noise types (by 2.59 p.p., 2.12 p.p., 3.1 p.p., 2.02 p.p. for symmetric, pair-flip, nested-flip and matrix-flip noises, respectively), while SPL decreased the results for all types of asymmetric noise (by 3.63 p.p., 4.2 p.p., 5.17 p.p. for pair-flip, nested-flip and matrix-flip noises, respectively). CT+ seems to perform better for noise levels higher than 15% (see Appendix B). On AlleNoise, we observed nearly no improvement in accuracy for any of the evaluated algorithms, and CT+, PRL and SPL all deteriorated the metric (by 2.65 p.p., 2.05 p.p. and 4.61 p.p., respectively).

Method  Clean set       Symmetric       Pair-flip       Nested-flip     Matrix-flip     AlleNoise
CE      74.85 ± 0.15    71.97 ± 0.08    71.92 ± 0.08    71.77 ± 0.08    70.75 ± 0.17    63.71 ± 0.11
ELR     74.81 ± 0.11    72.15 ± 0.10    73.21 ± 0.21    73.07 ± 0.11    72.02 ± 0.17    63.72 ± 0.19
MU      74.73 ± 0.09    71.96 ± 0.08    71.95 ± 0.10    71.65 ± 0.14    70.73 ± 0.17    63.65 ± 0.12
CCE     74.80 ± 0.09    73.01 ± 0.10    71.86 ± 0.17    71.62 ± 0.10    70.61 ± 0.10    63.73 ± 0.22
CT      *74.85 ± 0.15   72.42 ± 0.13    71.99 ± 0.14    71.55 ± 0.08    70.57 ± 0.18    63.32 ± 0.25
CT+     *74.85 ± 0.15   ↓69.38 ± 0.29   ↓69.80 ± 0.24   ↓68.67 ± 2.59   ↓68.73 ± 0.27   ↓61.06 ± 0.38
PRL     *74.85 ± 0.15   71.82 ± 0.17    71.95 ± 0.15    71.73 ± 0.16    71.12 ± 0.10    ↓61.66 ± 0.17
SPL     *74.85 ± 0.15   72.56 ± 0.10    ↓68.29 ± 0.15   ↓67.57 ± 0.14   ↓65.58 ± 0.15   ↓59.10 ± 0.14
GJSD    74.63 ± 0.10    73.28 ± 0.13    71.67 ± 0.15    71.40 ± 0.10    70.55 ± 0.17    63.63 ± 0.19
Table 2: Accuracy of the evaluated methods on the clean dataset compared to various noisy datasets with 15% noise level. The noisy datasets include AlleNoise, symmetric synthetic noise, and asymmetric synthetic noises: pair-flip, nested-flip, and matrix-flip. * marks cases equivalent to the baseline CE. ↓ marks results significantly worse than the baseline CE. Best results for each noise type are bolded.
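
The percentage-point differences quoted in Section 5.1 follow directly from the mean accuracies in Table 2. The sketch below recomputes a few of them; the dictionary layout and helper names are illustrative, and only the means (not the standard deviations) are used.

```python
columns = ["Clean set", "Symmetric", "Pair-flip", "Nested-flip", "Matrix-flip", "AlleNoise"]

# Mean accuracies copied from Table 2.
table2 = {
    "CE":   [74.85, 71.97, 71.92, 71.77, 70.75, 63.71],
    "ELR":  [74.81, 72.15, 73.21, 73.07, 72.02, 63.72],
    "MU":   [74.73, 71.96, 71.95, 71.65, 70.73, 63.65],
    "CCE":  [74.80, 73.01, 71.86, 71.62, 70.61, 63.73],
    "CT":   [74.85, 72.42, 71.99, 71.55, 70.57, 63.32],
    "CT+":  [74.85, 69.38, 69.80, 68.67, 68.73, 61.06],
    "PRL":  [74.85, 71.82, 71.95, 71.73, 71.12, 61.66],
    "SPL":  [74.85, 72.56, 68.29, 67.57, 65.58, 59.10],
    "GJSD": [74.63, 73.28, 71.67, 71.40, 70.55, 63.63],
}


def delta(method, column):
    """Difference to the CE baseline in percentage points (positive means better)."""
    i = columns.index(column)
    return round(table2[method][i] - table2["CE"][i], 2)


asymmetric = ["Pair-flip", "Nested-flip", "Matrix-flip"]
elr_average_gain = sum(delta("ELR", c) for c in asymmetric) / len(asymmetric)
best_on_allenoise = max(table2, key=lambda method: table2[method][columns.index("AlleNoise")])

print(delta("GJSD", "Symmetric"))   # gain of GJSD on symmetric noise
print(round(elr_average_gain, 1))   # average gain of ELR on asymmetric noise
print(best_on_allenoise, delta(best_on_allenoise, "AlleNoise"))
```

On AlleNoise, the largest gain over the baseline is 0.02 p.p., which is far smaller than the reported standard deviations.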

5.2 Noise type impacts memorization

Figure 3: Memorization and correctness metrics as a function of the training step. (a) The value of $\texttt{memorized}_{val}$ for synthetic noise types. (b) The value of $\texttt{memorized}_{val}$ for AlleNoise. (c) The value of $\texttt{correct}_{val}^{\texttt{clean}}$ for AlleNoise. (d) The value of $\texttt{correct}_{val}^{\texttt{noisy}}$ for AlleNoise.

To better understand the difference between synthetic noise types and AlleNoise, we analyze how the $\texttt{memorized}_{val}^{\texttt{noisy}}$, $\texttt{correct}_{val}^{\texttt{noisy}}$ and $\texttt{correct}_{val}^{\texttt{clean}}$ metrics (see Section 4.4) evolve over time. Memorization and correctness should be interpreted jointly with test accuracy (Tab. 2).

Synthetic noise types are memorized to a smaller extent than the real-world AlleNoise (Fig. 3a). For the two simplest synthetic noise types, symmetric and pair-flip, the value of $\texttt{memorized}_{val}$ is negligible (very close to zero). For the other two synthetic noise types, nested-flip and matrix-flip, memorization is still low (2-8%), but there are clearly visible differences between the benchmarked methods. While ELR, CT+ and PRL all keep the value of $\texttt{memorized}_{val}^{\texttt{noisy}}$ low for both nested-flip and matrix-flip noise types, it is only ELR that achieves test accuracy higher than the baseline.

However, for AlleNoise, the situation is completely different. All the training methods display increasing $\texttt{memorized}_{val}$ values throughout the training, up to 70% (Fig. 3b). PRL, SPL and CT+ give lower memorization than the other methods, but this is not reflected in higher accuracy. While these methods correct some of the errors on noisy examples, as measured by $\texttt{correct}_{val}^{\texttt{noisy}}$ (Fig. 3d), they display $\texttt{correct}_{val}^{\texttt{clean}}$ lower than other tested approaches (Fig. 3c), and thus overall they achieve low accuracy.

These results show that reducing memorization is necessary to create noise-robust classifiers. In this context, it is clear that AlleNoise, with its real-world instance-dependent noise distribution, is a challenge for the existing methods.
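The three metrics are defined in Section 4.4, which remains the authoritative source. As a rough illustration only, the sketch below assumes one common reading: memorization is the share of mislabeled validation examples for which the model reproduces the noisy label, and correctness is the share of mislabeled (or correctly labeled) examples for which the model predicts the clean label. The function name and argument layout are hypothetical.

```python
def memorization_metrics(predictions, noisy_labels, clean_labels):
    """Return (memorized_noisy, correct_noisy, correct_clean) as fractions.

    Assumed definitions (not taken verbatim from the paper):
      memorized_noisy: prediction == noisy label, over mislabeled examples
      correct_noisy:   prediction == clean label, over mislabeled examples
      correct_clean:   prediction == clean label, over correctly labeled examples
    """
    noisy_total = memorized = corrected = 0
    clean_total = clean_hits = 0
    for pred, noisy, clean in zip(predictions, noisy_labels, clean_labels):
        if noisy != clean:  # the observed label is wrong
            noisy_total += 1
            memorized += pred == noisy
            corrected += pred == clean
        else:  # the observed label is right
            clean_total += 1
            clean_hits += pred == clean
    memorized_noisy = memorized / noisy_total if noisy_total else 0.0
    correct_noisy = corrected / noisy_total if noisy_total else 0.0
    correct_clean = clean_hits / clean_total if clean_total else 0.0
    return memorized_noisy, correct_noisy, correct_clean
```

Tracking these three values at each evaluation step gives curves of the kind discussed above.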

5.3 Noise distribution

To get even more insight into why the real-world noise in AlleNoise is more challenging than synthetic noise types, we analyzed the class distribution within our dataset. For synthetic noise types, there are very few highly-corrupted categories (Fig. 4). On the other hand, for AlleNoise, there is a significant number of such categories. The baseline model test accuracy is much lower for these classes than for other, less corrupted, categories. The set of these highly-corrupted classes is heavily populated by the following:

  • Specialized categories that can be easily mistaken for a more generic category. For example, items belonging to the class safety shoes are frequently listed in categories derby shoes, ankle boots or other. In such cases, during the training, the model sees a large number of mislabeled instances of that class and very few correctly labeled ones, which is not enough to learn correct class associations.

  • Archetypal categories that are considered the most representative examples of a broader parent category. For instance, car tires are most frequently listed in Summer tires even when they actually should belong to All-season tires or other specialized categories. In this case, the learnt representation of the class is distorted by a huge number of specialized items mislabeled as the archetypal class.

We hypothesize that these categories are the main culprits behind the poor performance of the model.

Figure 4: Noise level distribution over target categories (blue bars) shows that AlleNoise has a substantial fraction of classes with noise level over 0.5, contrary to synthetic noise. The same distribution multiplied by per-bin macro accuracy (yellow bars) shows that those specialized categories are particularly difficult to predict correctly.
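A per-category noise level of the kind summarized in Fig. 4 can be computed directly from the paired clean and noisy labels. The following is a minimal sketch, assuming that the noise level of a category is the fraction of its items (grouped by clean label) whose noisy label differs; the function names and the binning scheme are illustrative and not taken from the released code.

```python
from collections import defaultdict


def per_class_noise_level(clean_labels, noisy_labels):
    """Fraction of mislabeled items for each clean (true) category."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for clean, noisy in zip(clean_labels, noisy_labels):
        total[clean] += 1
        wrong[clean] += clean != noisy
    return {category: wrong[category] / total[category] for category in total}


def noise_level_histogram(levels, num_bins=10):
    """Count categories per noise-level bin over the range [0, 1]."""
    counts = [0] * num_bins
    for level in levels.values():
        # A noise level of exactly 1.0 falls into the last bin.
        index = min(int(level * num_bins), num_bins - 1)
        counts[index] += 1
    return counts
```

Categories that land in the bins above 0.5 correspond to the highly-corrupted classes discussed in this section.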

6 Discussion

Our experiments show that the real-world noise present in AlleNoise poses a challenge for existing methods for learning with noisy labels. We hypothesize that the main difficulties for these methods stem from two major features of AlleNoise: 1) real-world, instance-dependent noise distribution, 2) relatively large number of categories with class imbalance and a long tail. While previous works have investigated challenges 1) [wei_learning_2022] and 2) [wu_noisywikihow_2023], this paper combines both in a single dataset and evaluation study, while also applying them to text data. We hope that making AlleNoise available publicly will spark new method development, especially in directions that would address the features of our dataset.

Based on our experiments, we make several interesting observations. The methods that rely on removing examples from within a batch perform noticeably worse than other approaches. We hypothesize that this is due to the large number of classes and the unbalanced distribution of their sizes (especially the long tail of underrepresented categories) in AlleNoise: by removing samples, we lose important information that is not recoverable. This is supported by the fact that such noise filtration methods excel on simple benchmarks like CIFAR-10, which have a completely different class distribution. In order to mitigate the noise in AlleNoise, a more sophisticated approach is necessary. A promising direction seems to be the one presented by ELR. While for the real-world noise it did not increase the results above the baseline CE, it was the best algorithm for class-dependent noise types. The outstanding performance of ELR might be attributed to its target smoothing approach. The use of such soft labels may be particularly well suited to extreme classification scenarios where some of the classes are semantically close. Extending this idea to include an instance-dependent component may lead to an algorithm robust to the real-world noise in AlleNoise. Furthermore, based on the results of the memorization metric, it is evident that this realistic noise pattern needs to be tackled in a different way than synthetic noise. With the clean labels published as a part of AlleNoise, we enable researchers to further explore the issue of memorization in the presence of real-world instance-dependent noise.

7 Conclusions and future work

In this paper, we presented a new dataset for the evaluation of methods for learning with noisy labels. Our dataset, AlleNoise, contains a real-world instance-dependent noise distribution, with both clean and noisy labels, provides a large-scale classification problem, and unlike most previously available datasets in the field of learning from noisy labels, features textual data in the form of product names. We performed an evaluation of established noise-mitigation methods, which showed quantitatively that these approaches are not enough to alleviate the noise in our dataset. With AlleNoise, we hope to jump-start the development of new robust classifiers that would be able to handle demanding, real-world instance-dependent noise.

The scope of this paper is limited to BERT-based classifiers. As AlleNoise includes clean label names in addition to noisy labels, it could be used to benchmark Large Language Models in few-shot or in-context learning scenarios. We leave this as a future research direction.

Acknowledgments and Disclosure of Funding

Acknowledgements

We thank Mikołaj Koszowski for his help with translating the product titles. We also thank Karol Jędrzejewski for his help in pinpointing the location of appropriate data records in the Allegro data warehouse.

Funding

This work was funded fully by Allegro.com.

Competing interests

We declare no competing interests.


Appendix A Implementation details

Self-Paced Learning

The Self-Paced Learning (SPL) [kumar_self-paced_2010] method sets a threshold value $\lambda$ for the loss, and all examples with loss larger than $\lambda$ are skipped, since they are treated as hard to learn (because they are possibly noisy). After each training epoch, the threshold is increased by some constant multiplier. For simplification, we adjusted SPL in the following manner.

We set a parameter $\tau_{SPL}$, which controls the percentage of samples with the highest loss within a batch that are excluded. The value of $\tau_{SPL}$ should be equal to the noise level present in the training dataset. As such, at each step, we exclude a set percentage of potentially noisy examples, thus reducing the impact of label noise on the training process. We keep the value of $\tau_{SPL}$ constant throughout the training.
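The in-batch exclusion step can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the released implementation: it assumes that per-sample losses are already available as a list, that $\tau_{SPL}$ is given as a fraction in $[0, 1)$, and that the number of excluded samples is rounded to the nearest integer. The function name is hypothetical.

```python
def spl_select(losses, tau):
    """Drop the tau fraction of highest-loss samples in a batch.

    Returns the indices of the kept samples and their mean loss.
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must be in [0, 1)")
    num_drop = round(tau * len(losses))
    # Sort sample indices from lowest to highest loss.
    order = sorted(range(len(losses)), key=losses.__getitem__)
    kept = sorted(order[: len(losses) - num_drop])
    mean_loss = sum(losses[i] for i in kept) / len(kept)
    return kept, mean_loss
```

Only the kept samples would contribute to the parameter update for that batch.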

Provably Robust Learning

The Provably Robust Learning (PRL) [liu_learning_2021] algorithm works in a similar manner to SPL. We follow the authors by introducing the $\tau_{PRL}$ parameter, which controls the percentage of samples excluded from each training batch on the basis of their gradient norm. Specifically, $\tau_{PRL}\%$ of samples with the highest gradient norm are omitted, while the rest are used to update model parameters. The value of $\tau_{PRL}$ should be equal to the noise level in the training dataset.
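The gradient-norm filter can be sketched as follows. The sketch assumes that the norm is taken with respect to the logits, where the cross-entropy gradient has the closed form softmax(z) minus the one-hot label; whether the implementation uses this loss-layer gradient or a full parameter gradient is not stated here, so treat that choice, along with the function names, as an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_norm(logits, label):
    """L2 norm of d(cross-entropy)/d(logits) = softmax(logits) - onehot(label)."""
    probs = softmax(logits)
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def prl_keep_indices(batch_logits, labels, tau):
    """Indices left after omitting the tau fraction with the largest gradient norm."""
    norms = [logit_grad_norm(z, y) for z, y in zip(batch_logits, labels)]
    num_drop = round(tau * len(norms))
    order = sorted(range(len(norms)), key=norms.__getitem__)
    return sorted(order[: len(norms) - num_drop])
```

A confidently predicted sample whose label disagrees with the prediction has a large gradient norm and is therefore the first to be filtered out.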

Clipped Cross-Entropy

Since our implementation of SPL doesn’t have a hard loss threshold, we introduce a simple Clipped Cross-Entropy (CCE) baseline to check the effectiveness of such an approach. The CCE method checks if the loss is greater than some threshold $\lambda_{CCE}$. If so, the loss is clipped to that value. Otherwise, it is left unchanged. Thus, we always use all training samples, but the impact of label noise is alleviated by clipping the loss.
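The clipping rule amounts to taking the minimum of the per-sample loss and the threshold. A minimal sketch, assuming the model output has already been converted to class probabilities (the function name is illustrative):

```python
import math


def clipped_cross_entropy(probs, label, threshold):
    """Cross-entropy of one sample, capped at `threshold` (lambda_CCE)."""
    loss = -math.log(probs[label])
    return min(loss, threshold)
```

For instance, with the selected threshold of 9.5 (Tab. S1), a sample whose labeled class receives probability $10^{-6}$ contributes 9.5 instead of roughly 13.8.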

Early Learning Regularization

For Early Learning Regularization (ELR) [liu_early-learning_2020], we followed the implementation published by the authors. We compute the softmax probabilities for each sample in a batch and clamp them, then compute the soft targets via temporal ensembling and use these targets in the loss function calculation. Our implementation includes one step not present in the publication text: clamping the softmax probabilities to the range $[\epsilon, 1-\epsilon]$, where $\epsilon$ is a clamp margin parameter. Aside from this, we use the $\beta$ target momentum and $\lambda_{ELR}$ regularization parameters just as they were presented by the authors.
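The sequence of steps described above (softmax, clamping, temporal ensembling, regularized loss) can be sketched for a single sample as follows. This is an illustrative reading rather than the released code: it assumes the regularizer has the form $\lambda_{ELR}\log(1 - \langle p, t \rangle)$ from the ELR publication, that the target is updated with the renormalized clamped probabilities, and that the cross-entropy term uses the unclamped softmax. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def elr_loss(logits, label, target, beta, lam, eps):
    """Return (loss, updated_target) for one sample.

    target: running soft target for this sample (all zeros before training)
    beta:   target momentum, lam: regularization weight, eps: clamp margin
    """
    raw = softmax(logits)
    # Clamp probabilities to [eps, 1 - eps].
    probs = [min(max(p, eps), 1.0 - eps) for p in raw]
    norm = sum(probs)
    # Temporal ensembling: exponential moving average of the predictions.
    new_target = [beta * t + (1.0 - beta) * p / norm for t, p in zip(target, probs)]
    cross_entropy = -math.log(raw[label])
    # Regularizer that rewards agreement between prediction and soft target.
    regularizer = math.log(1.0 - sum(t * p for t, p in zip(new_target, probs)))
    return cross_entropy + lam * regularizer, new_target
```

In a training loop, the targets would be treated as constants (no gradient flows through them) and stored per training example between epochs.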

Generalized Jensen-Shannon Divergence Loss

The Generalized Jensen-Shannon Divergence (GJSD) [englesson_generalized_2021] loss function is a generalization of Cross-Entropy (CE) and Mean Absolute Error (MAE) losses. We follow the implementation provided by the authors, in which we use the $M$ parameter to set the number of averaged distributions and the $\pi$ parameter to adjust the weight between CE and MAE. While the authors share separate implementations for GJSD with and without consistency regularization, we implement it as a toggle to make the code more uniform. Since consistency regularization requires data augmentation and the GJSD authors described only augmentations for the image domain, we implemented several textual augmentations of our own: random token dropping, consecutive token dropping, random token swapping. However, in our experiments, we have kept consistency regularization turned off due to its detrimental effect on model convergence and test accuracy.
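For $M = 2$ and no consistency regularization, the loss reduces to a scaled Jensen-Shannon divergence between the one-hot label and the predicted distribution. The sketch below assumes the form given in the GJSD publication, with $\pi$ acting as the weight of the label distribution and a scaling constant $Z = -(1-\pi)\log(1-\pi)$; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_loss(probs, label, pi_label):
    """Scaled Jensen-Shannon loss between a one-hot label and `probs` (M = 2).

    pi_label is the weight of the label distribution, 0 < pi_label < 1.
    """
    onehot = [1.0 if i == label else 0.0 for i in range(len(probs))]
    mixture = [pi_label * e + (1.0 - pi_label) * p for e, p in zip(onehot, probs)]
    divergence = pi_label * kl_divergence(onehot, mixture) + (
        1.0 - pi_label
    ) * kl_divergence(probs, mixture)
    scale = -(1.0 - pi_label) * math.log(1.0 - pi_label)
    return divergence / scale
```

With a small weight such as the selected value of 5e-3 (Tab. S1), the loss stays close to plain cross-entropy, which is the CE end of the CE-MAE trade-off mentioned above.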

Co-teaching

While the methods described above modified the loss function in various ways, Co-teaching (CT) [han_co-teaching_2018] works in a different manner. It requires optimizing two sets of model parameters at the same time. As such, following the algorithm described by the authors, we implemented a custom model class, which manages the update of these two sets of weights and the exchange of low-loss examples at each optimization step. We keep the parameters $k$ and $\tau_{CT}$ to control the starting epoch for CT and the noise level (i.e. the percentage of low-loss examples that are exchanged between networks), respectively.
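The exchange of low-loss examples between the two networks can be sketched as an index selection over per-sample losses. The sketch assumes, as in the original Co-teaching algorithm, that each network keeps the fraction $1 - \tau_{CT}$ of its lowest-loss samples and hands them to its peer; the handling of the starting epoch $k$ is omitted, and the function names are hypothetical.

```python
def lowest_loss_indices(losses, keep_fraction):
    """Indices of the `keep_fraction` lowest-loss samples in a batch."""
    num_keep = max(1, round(keep_fraction * len(losses)))
    order = sorted(range(len(losses)), key=losses.__getitem__)
    return sorted(order[:num_keep])


def co_teaching_exchange(losses_a, losses_b, tau):
    """Return (indices used to update A, indices used to update B).

    Each network is updated on the samples that its peer considers clean.
    """
    keep_fraction = 1.0 - tau
    update_b_on = lowest_loss_indices(losses_a, keep_fraction)  # chosen by A
    update_a_on = lowest_loss_indices(losses_b, keep_fraction)  # chosen by B
    return update_a_on, update_b_on
```

The cross-wise update is the point of the method: a sample that one network has already memorized can still be rejected by the other.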

Co-teaching+

For Co-teaching+ (CT+) [yu_how_2019], we again adhere to the algorithm described by the authors. We use the same implementation framework as for CT, adjusting only the sample selection mechanism to look within examples for which there is disagreement between the two networks. Following the advice in the publication text, we use the recommended update strategy for the fraction of instances to select, which is calculated based on the epoch number, as well as parameters $k$ and $\tau_{CT+}$.
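The disagreement-based selection can be sketched as a filter applied before the low-loss exchange. In this illustration the fraction of instances to keep is passed in directly rather than derived from the epoch number, $k$ and $\tau_{CT+}$, and a batch without any disagreement simply yields empty selections; both choices are simplifying assumptions, as are the function and argument names.

```python
def co_teaching_plus_exchange(preds_a, preds_b, losses_a, losses_b, keep_fraction):
    """Return (indices used to update A, indices used to update B).

    Only samples on which the two networks disagree are candidates; among
    those, each network passes its lowest-loss samples to its peer.
    """
    disagree = [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]
    if not disagree:
        return [], []
    num_keep = max(1, round(keep_fraction * len(disagree)))
    chosen_by_a = sorted(disagree, key=losses_a.__getitem__)[:num_keep]
    chosen_by_b = sorted(disagree, key=losses_b.__getitem__)[:num_keep]
    return sorted(chosen_by_b), sorted(chosen_by_a)
```

Restricting the candidates to disagreements keeps the two networks from converging to the same solution, which is the motivation given for CT+ over CT.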

Mixup

The Mixup (MU) [zhang_mixup_2018] technique keeps the loss function (CE) and the hyperparameters of the baseline model unchanged, only augmenting the training data during the training procedure. We use in-batch augmentation, fixed per-batch mixing magnitude sampled from $\mathrm{Beta}(\alpha, \alpha)$ (where $\alpha$ is provided as input), and the mixed pairs are sampled without replacement from that distribution. Since we cannot mix input in the same way as for images, we implemented in-batch augmentation for logits. In addition, we also keep the $r_{MU}$ ratio parameter, to adjust the percentage of the batch size which is taken for augmentation in MU. Note: our hyperparameter tuning procedure resulted in setting both $\alpha$ and $r_{MU}$ to low values (Tab. S1), contrary to what is recommended by the authors.
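One possible reading of the logit-level variant is sketched below. It assumes that a single mixing coefficient is drawn per batch from $\mathrm{Beta}(\alpha, \alpha)$, that a fraction $r_{MU}$ of the batch is selected without replacement and paired through a random permutation, and that each mixed sample carries both labels together with the mixing weight. The details of the actual implementation may differ, and all names here are hypothetical.

```python
import random


def mixup_logits(batch_logits, labels, alpha, ratio, rng):
    """Mix the logits of a random subset of the batch.

    Returns (mixed_logits, targets), where each target is a tuple
    (label_a, label_b, weight_of_label_a). Unmixed samples get weight 1.0.
    """
    batch_size = len(batch_logits)
    num_mixed = round(ratio * batch_size)
    lam = rng.betavariate(alpha, alpha)  # one coefficient for the whole batch
    chosen = rng.sample(range(batch_size), num_mixed)  # without replacement
    partners = chosen[:]
    rng.shuffle(partners)
    mixed = [list(row) for row in batch_logits]
    targets = [(y, y, 1.0) for y in labels]
    for i, j in zip(chosen, partners):
        mixed[i] = [
            lam * a + (1.0 - lam) * b for a, b in zip(batch_logits[i], batch_logits[j])
        ]
        targets[i] = (labels[i], labels[j], lam)
    return mixed, targets
```

The loss for a mixed sample would then be the weighted sum of the cross-entropy terms for its two labels, as in the original Mixup formulation.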

| Method | Hyperparameters | Selected values |
|---|---|---|
| SPL | $\tau_{SPL}$ | equal to noise level |
| PRL | $\tau_{PRL}$ | equal to noise level |
| ELR | $\epsilon, \beta, \lambda_{ELR}$ | 1e-5, 0.6, 2 |
| CCE | $\lambda_{CCE}$ | 9.5 |
| MU | $\alpha, r_{MU}$ | 0.1, 0.1 |
| GJSD | $M, \pi$ | 2, 5e-3 |
| CT | $k, \tau_{CT}$ | 8, equal to noise level |
| CT+ | $k, \tau_{CT+}$ | 8, equal to noise level |

Table S1: Hyperparameter values for all benchmarked methods, selected through a tuning procedure.

Appendix B Results of experiments with higher noise level

For completeness, we evaluate the accuracy for all methods on datasets with 40% synthetic noise (Tab. S2). The best methods for this noise level are the same as for the case of 15% noise: for symmetric noise, GJSD is the best method, while for asymmetric noise types it is ELR. However, it is clear that some methods show a more noticeable effect relative to the baseline at the 40% noise level than at 15%. While MU and CCE stay close to the baseline results for all noise types and SPL underperforms in all cases, CT consistently gives an improvement over the baseline and CT+ decreases the result for the symmetric noise, but is better than the baseline for asymmetric noise types.

We also plot $\texttt{memorized}_{val}^{\texttt{noisy}}$ for those datasets (Fig. S1). For symmetric and pair-flip noise types the memorization for all methods is very low. For nested-flip and matrix-flip it is a bit higher, indicating that these two noise types are more challenging, and thus induce more memorization in the model.

Figure S1: Value of $\texttt{memorized}_{val}$ for different noise types, measured at each training step for all evaluated methods. In all cases the noise level was set at 40%.

| Method | Clean set | Symmetric | Pair-flip | Nested-flip | Matrix-flip |
|---|---|---|---|---|---|
| CE | 74.85 ± 0.15 | 67.29 ± 0.12 | 55.18 ± 0.26 | 52.87 ± 0.19 | 54.04 ± 0.23 |
| ELR | 74.81 ± 0.11 | 67.23 ± 0.18 | **66.12 ± 0.15** | **62.27 ± 0.19** | **61.72 ± 0.23** |
| MU | 74.73 ± 0.09 | 67.14 ± 0.14 | 55.28 ± 0.26 | 52.38 ± 0.24 | 54.26 ± 0.25 |
| CCE | 74.80 ± 0.09 | 68.92 ± 0.14 | 55.13 ± 0.28 | 52.07 ± 0.58 | 54.02 ± 0.18 |
| CT | *74.85 ± 0.15 | 68.60 ± 0.14 | 60.49 ± 0.24 | 58.48 ± 0.28 | 57.69 ± 0.47 |
| CT+ | *74.85 ± 0.15 | ↓64.67 ± 0.32 | 59.03 ± 0.42 | 56.16 ± 0.42 | 57.06 ± 0.29 |
| PRL | *74.85 ± 0.15 | ↓65.01 ± 0.30 | 62.22 ± 0.45 | 56.39 ± 0.40 | 51.59 ± 0.97 |
| SPL | *74.85 ± 0.15 | 65.27 ± 0.35 | ↓44.92 ± 1.52 | ↓42.29 ± 1.06 | ↓40.89 ± 0.70 |
| GJSD | 74.63 ± 0.10 | **69.80 ± 0.12** | 54.78 ± 0.30 | 51.92 ± 0.46 | 53.84 ± 0.10 |

Table S2: Accuracy of the evaluated methods on the clean dataset compared to various noisy datasets with 40% noise level. The noisy datasets include symmetric synthetic noise and asymmetric synthetic noise types: pair-flip, nested-flip, and matrix-flip. * marks cases equivalent to the baseline CE. ↓ marks results significantly worse than the baseline CE. Best results for each noise type are bolded.