SubRegWeigh: Effective and Efficient Annotation Weighing
with Subword Regularization

Kohei Tsuji¹, Tatsuya Hiraoka², Yuchang Cheng¹,³, Tomoya Iwakura¹,³

¹NAIST, ²MBZUAI, ³Fujitsu Ltd.
[email protected],
[email protected],
{cheng.yuchang, iwakura.tomoya}@fujitsu.com,

Abstract

Datasets for natural language processing (NLP) often include annotation errors. Researchers have attempted to develop methods that automatically reduce the adverse effect of such errors. However, an existing method is time-consuming because it requires many trained models to detect errors. We propose a novel method that reduces the time needed for error detection. Specifically, we use a tokenization technique called subword regularization to create pseudo-multiple models, which are used to detect errors. Our proposed method, SubRegWeigh, performs annotation weighing four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in both document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, SubRegWeigh adequately detected the pseudo-incorrect labels.


1 Introduction

Many NLP datasets consist of raw texts and annotation labels. Various NLP tasks use pairs of raw text and annotation labels for training and evaluating models. For example, in named entity recognition (NER), which is applied to practical technologies such as location detection Inkpen et al. (2017) and anonymization Mamede et al. (2016), some parts of the text are annotated as named entities (e.g., location names or personal names), and a model is then trained to extract these entities from the raw text. To achieve higher performance in NLP tasks, models should be trained or fine-tuned on a high-quality training dataset without annotation errors.

In practice, however, some popular datasets such as CoNLL-2003 contain annotation errors Wang et al. (2019). Annotation errors in the training data degrade the performance of the trained models, and annotation errors in the test data cause models to be evaluated incorrectly.

In the case of NER, these annotation errors are typically related to ambiguous categories of named entities. For example, the string "America" can refer to either a geographical location or an organization, depending on the context. Additionally, when followed by "Online", the correct entity span should be "America Online", which is the name of an organization, rather than just "America".

It has been reported that CoNLL-2003, the well-known NER dataset, contains annotation errors in both the training and test splits Wang et al. (2019); Reiss et al. (2020). More recently, researchers have also explored automated annotation using generative AI Goel et al. (2023); Bogdanov et al. (2024) and collaborative annotation involving humans and AI Naraki et al. (2024). Even with these methods, it is difficult to avoid annotation errors. Therefore, weighing methods that automatically detect annotation errors and reduce their negative effect are needed.

Such annotation weighing has recently been studied for NER. Wang et al. (2019) proposed CrossWeigh, a method that detects annotation errors in the dataset and lowers their learning priority by weighting loss values so that training is less affected by such errors. However, its computational efficiency is a shortcoming, especially given the recent NLP trend toward large pre-trained language models. We consider that more efficient annotation weighing methods can speed up the development of NLP. In addition, reducing the computational cost contributes to Green AI Schwartz et al. (2019). Furthermore, CrossWeigh cannot be applied as is to NLP tasks other than NER because its architecture is specialized to NER. Therefore, new weighing methods that can be used across various NLP tasks are required.

Figure 1: The overview of the three steps of SubRegWeigh with the number of tokenization candidates in the inference $K=3$: 1) the training step of the scouting model, 2) the inference step to examine the appropriateness of the training label, and 3) the weighing step to calculate the weights for each training sample according to the appropriateness of their labels. The calculated weights are used to train the final model.

In this study, we propose an efficient and effective method of annotation weighing, SubRegWeigh. SubRegWeigh uses subword regularization, which produces multiple segmentation results for the same text, to detect annotation errors and to weight the loss values so that training samples with errors have less effect. Figure 1 gives an overview of the proposed method. Inspired by Takase et al. (2022), we obtain multiple outputs from a single trained model by inputting multiple tokenization candidates. In other words, SubRegWeigh measures annotation reliability by observing the consistency of multiple outputs generated with subword regularization. SubRegWeigh can be regarded as a method to create pseudo-multiple models for annotation evaluation. This allows for faster weighing of annotation errors by using a single model instead of the multiple models required by CrossWeigh.

We conducted experiments on two NLP tasks: NER and text classification. The experimental results on CoNLL-2003 demonstrate that SubRegWeigh can perform annotation weighing four to five times faster than CrossWeigh. Furthermore, SubRegWeigh improves the performance of the trained model compared to CrossWeigh on some datasets in both the text classification and NER tasks. In particular, our proposed method achieved state-of-the-art (SoTA) performance on CoNLL-CW, a test dataset constructed by manually correcting annotation errors in the CoNLL-2003 test split. Experiments with pseudo-incorrect labels also showed that the proposed method can recognize annotation errors well.

2 Related Work

2.1 Annotation Error Detecting

Our work is in the line of CrossWeigh Wang et al. (2019). CrossWeigh detects annotation errors by performing $K$ -fold cross-validation on the training dataset $T$ times and weighs the loss value according to the number of wrong predictions, which degrades the effect of training samples with annotation errors in the NER task. This method requires large computational costs because it requires training $K\times T$ models for detecting annotation errors. Our work is different from CrossWeigh in that we use multiple tokenization candidates to obtain various outputs from one model instead of $K$ -fold cross-validation. Compared to CrossWeigh, the proposed method can remarkably reduce the time for detecting annotation errors because we train the model only once and use the output from pseudo-models by inputting various tokenization candidates. Furthermore, our method can use the entire training dataset to make the model because we do not employ $K$ -fold cross-validation, which contributes to the more accurate detection of annotation errors.

In addition to methods that pre-assign annotation correctness weights to the training data set, methods have been studied to correct labels in partially annotated training data using a small set of data known to have correct annotations Yang et al. (2018); Mayhew et al. (2019); Ding et al. (2023). In other lines, some methods are developed to infer the correctness of the labels during training. For instance, in the method introduced in Zhou and Chen (2021), multiple models are trained simultaneously with different initial values, and the KL divergence of outputs for the same sentence is added to the loss function. This method can be applied in conjunction with the weighting of training datasets used by CrossWeigh and our method.

For detecting annotation errors in datasets beyond NER, annotation correction methods with distant supervision Mintz et al. (2009) have been studied in the relation extraction tasks Huang and Du (2019); Qin et al. (2018). However, these methods require an additional clean dataset or are not always applicable. Furthermore, methods for correcting POS tagging tasks have also been studied Nakagawa and Matsumoto (2002); Helgadóttir et al. (2014), but these do not apply to tasks other than POS tagging. Maini et al. (2023) introduced a method to analyze the memorization of the model, which can be used to detect training samples with ambiguous labels.

2.2 Subword Regularization

The proposed method exploits tokenization for the efficient detection of annotation errors. Subword tokenization Song et al. (2021); Sennrich et al. (2016); Kudo (2018) is widely used in recent natural language processing. This technique reduces the number of unknown and low-frequency words and limits the vocabulary size by treating words as sequences of finer subwords. However, because tokenization is performed deterministically on the training data, the subword sequences that appear during inference may not always match the sequences learned during training.

Subword regularization Kudo (2018); Provilkov et al. (2020); Hiraoka (2022) alleviates this problem of deterministic tokenization. This technique generates multiple tokenization candidates for a single text. While subword regularization is typically used during the training step to learn from various tokenizations, some studies use it during inference. Takase et al. (2022) use subword regularization during inference to ensemble the outputs of a single model by inputting several different tokenization candidates to the model. Wang et al. (2021a) proposed a method to reduce the effect of tokenization differences. However, there has been no research on using subword regularization during inference to reduce noise in the training data.

Subword regularization using a unigram language model Kudo (2018) is a method to make the model more robust by sampling tokenization candidates using the unigram probability of tokens. When the model's tokenizer is not based on the unigram language model, such as Byte Pair Encoding (BPE) Sennrich et al. (2016) used in RoBERTa Liu et al. (2019) or WordPiece Song et al. (2021) used in BERT Devlin et al. (2019), we can use BPE-Dropout Provilkov et al. (2020) and MaxMatch-Dropout Hiraoka (2022) for subword regularization, respectively. BPE-Dropout randomly skips merge operations during the merging process to obtain multiple subword sequences. MaxMatch-Dropout randomly rejects the longest-match candidates for each word to obtain multiple subword sequences.
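The BPE-Dropout idea described above can be illustrated with a small sketch. This is a simplified toy version, not the implementation used in the paper: the merge table `TOY_MERGES` and the function names `bpe_dropout` and `sample_tokenizations` are hypothetical, and one merge is applied per step for brevity.

```python
import random

# Hypothetical toy merge table in priority order (earlier = higher priority).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_dropout(word, merges, p, rng):
    """Segment `word` with BPE, dropping each applicable merge with probability p."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return symbols  # nothing left to merge, or every merge was dropped
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


def sample_tokenizations(word, merges, p, n, seed=0):
    """Draw n stochastic segmentations of the same word."""
    rng = random.Random(seed)
    return [bpe_dropout(word, merges, p, rng) for _ in range(n)]
```

With `p = 0` the sketch reduces to deterministic BPE, and with `p = 1` every merge is dropped and the word stays split into characters. Intermediate values of `p` yield a mixture of segmentations that always concatenate back to the original word.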

3 Proposed Method: SubRegWeigh

The purpose of the proposed method, SubRegWeigh, is to detect incorrectly annotated samples in the training dataset and to reduce the effect of such samples during the training of the final model by weighting the loss values. The process of SubRegWeigh is composed of the following three steps:

1. Training a scouting model in the usual manner for NLP tasks with the training data. Deterministic tokenization is used for this training (§3.1).

2. Inferring the labels for the inputs in the training dataset with the trained scouting model. Herein, we tokenize each input text into $K$ different tokenization candidates. Therefore, we can collect $K$ inferred outputs, one for each tokenization candidate of a single input text (§3.2).

3. Calculating the weight for each training sample according to the number of tokenization candidates that are answered correctly (§3.3).

This process is overviewed in Figure 1. The first and second steps detect incorrect annotations by scouting the dataset before the final model training. Then, in the third step, SubRegWeigh weighs the correctness of the annotation labels and assigns a weight to each training sample. After the above steps, we can train the final model with the original training data and the calculated weights for each training sample (§3.4). Unlike CrossWeigh, SubRegWeigh can be used for various NLP tasks beyond NER because none of the steps is specialized to NER. The following sections explain each step in detail.

3.1 Training of the Scouting Model

Inspired by CrossWeigh Wang et al. (2019), we first create the scouting model $M$ (see the top left of Figure 1). This model is used to scout the correctness of the annotated labels in the training dataset in the next step. Let $(X,Y)$ be a pair of the text and the label corresponding to a single training sample. When the text is composed of $I$ words, $X=x_{1},...,x_{I}$. $Y$ is a single value in text classification (e.g., positive and negative labels for sentiment classification) or a sequence of labels $Y=y_{1},...,y_{I}$ in sequential labeling tasks such as NER.

In this step, we train the scouting model $M$ in the usual manner of training or fine-tuning for general NLP datasets. The single input text $X$ is tokenized into a sequence of subwords $X^{\prime}=s_{1},...,s_{J}$ that is composed of $J$ subwords¹. After tokenizing all texts in the training dataset, we train the scouting model $M$ with the tokenized texts and labels. Note that we use deterministic tokenization here (e.g., the default encoding of the BPE tokenizer or the $1$-best tokenization with the Viterbi algorithm for the unigram language model-based tokenizer). In other words, we do not use subword regularization (i.e., stochastic tokenization) for the training of the scouting model $M$.

¹ The numbers of words $I$ and subwords $J$ do not necessarily match, depending on the tokenization method. In the case of sequential labeling tasks with $I\neq J$, we aligned the subwords and labels following the existing literature Liu et al. (2019); Yamada et al. (2020).

3.2 Inference with the Scouting Model

In the second step, we use the trained scouting model $M$ to scout the correctness of the annotated labels in the training dataset (see the right part of Figure 1). We assume that the trained scouting model $M$ can correctly predict the labels even for differently tokenized input if the label is not wrongly annotated. Specifically, the input text $X$ is tokenized into $K$ different tokenization candidates $\{X^{\prime}_{1},...,X^{\prime}_{k},...,X^{\prime}_{K}\}$. Then each tokenization candidate $X^{\prime}_{k}$ is fed into the trained model $M$ and we obtain the corresponding prediction $\hat{Y}_{k}$². After the inference, we compare the output $\hat{Y}_{k}$ with the gold annotation $Y$. This process can be regarded as the “mistake-reweighing” process of CrossWeigh with $K$ pseudo-models, which is realized with the $K$ different tokenization candidates for the single input text and the single model.

² As with $Y$, $\hat{Y}_{k}$ can be a single label or multiple labels depending on the target task.

We can obtain various tokenization candidates using a technique of subword regularization. When we use BPE or WordPiece, BPE-Dropout and MaxMatch-Dropout can be used, respectively. When one uses a unigram language model-based tokenizer, we can also use $N$ -best tokenization instead of subword regularization.

We consider it important to diversify the $K$ tokenization candidates so that the correctness of the annotation is weighed from diverse aspects. To avoid using similar subword sequences as the inputs for the model $M$, we select $K$ mutually dissimilar candidates from a larger number $N$ of tokenization candidates. We propose the following three options for selecting such $K$ tokenization candidates.

Random:

As the most straightforward way, we randomly sample $N=K$ tokenization candidates using subword regularization and use them as inputs for the model $M$ . In other words, we do not select the tokenization candidates from the large number of candidates.

Cos-Sim:

We select $K$ tokenization candidates using the cosine similarity calculated with TF-IDF vectors. TF-IDF is calculated with each subword as a term, each subword sequence as a document, and the entire candidate set as the corpus. First, we convert the $N(>K)$ tokenization candidates into TF-IDF vectors and select, as the first candidate, the one with the smallest cosine similarity to the deterministic tokenization $X^{\prime}$. Subsequently, we select a candidate that is similar neither to $X^{\prime}$ nor to the already selected candidates. We repeat this selection until the number of selected candidates reaches $K$.
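The Cos-Sim selection can be sketched as a greedy procedure. The text does not specify how similarity to several reference sequences is aggregated, so this sketch assumes the maximum similarity to $X^{\prime}$ and to the already selected candidates. It also assumes a smoothed IDF and that $X^{\prime}$ is part of the TF-IDF corpus. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors: subword = term, subword sequence = document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption; the exact TF-IDF variant is not stated).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_cos_sim(reference, candidates, k):
    """Greedily pick k candidates dissimilar to `reference` and to each other."""
    vecs = tfidf_vectors([reference] + candidates)
    chosen_vecs = [vecs[0]]  # start from the deterministic tokenization X'
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        # Assumed criterion: minimize the worst-case similarity to anything chosen.
        best = min(
            remaining,
            key=lambda i: max(cosine(vecs[i + 1], c) for c in chosen_vecs),
        )
        selected.append(best)
        chosen_vecs.append(vecs[best + 1])
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

For instance, with the reference `["lower"]` and candidates that include `["lower"]` itself, the identical candidate has similarity 1 to the reference and is therefore picked last, if at all.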

$K$ -means:

We select $K$ tokenization candidates using $K$-means clustering. We apply the clustering to the $N(>K)$ candidates vectorized with TF-IDF. Then, for each of the $K$ clusters, we select the candidate nearest to the cluster centroid.
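The $K$-means selection can be sketched in plain Python as follows. This is a minimal illustration under assumptions: a basic Lloyd-style $K$-means with random initialization, a smoothed IDF, and squared Euclidean distance. The function names are hypothetical, and a library clustering implementation could be used instead.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Dense TF-IDF rows: subword = term, subword sequence = document."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return [[Counter(d)[t] * idf[t] for t in vocab] for d in docs]


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_kmeans(candidates, k, iters=20, seed=0):
    """Cluster candidates into k groups and return the one nearest each centroid."""
    rng = random.Random(seed)
    points = tfidf_matrix(candidates)
    centroids = [list(p) for p in rng.sample(points, k)]  # requires k <= N
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    picked = []
    for c in centroids:
        order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], c))
        # Take the nearest candidate that has not been picked already.
        picked.append(next(i for i in order if i not in picked))
    return [candidates[i] for i in picked]
```

Picking one representative per cluster is what spreads the selected candidates across different regions of the tokenization space, in contrast to Random, which may return several near-identical sequences.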

Table 6 in Appendix A shows the tokenization examples by each selection method.

3.3 Weighting for Training Samples

To reduce the effect of training samples with annotation errors, we calculate a weight for each sample depending on the results of the inference step (see the bottom left of Figure 1). Concretely, we calculate the weight $w$ corresponding to the training sample $(X,Y)$ as follows:

$$w=\max\left(w_{\mathrm{min}},\frac{1}{K}\sum_{k=1}^{K}c_{k}\right), \tag{1}$$
$$c_{k}=\begin{cases}1&\text{if}\quad\hat{Y}_{k}=Y\\ 0&\text{otherwise}\end{cases}, \tag{2}$$

where $\hat{Y}_{k}$ is the output of the scouting model $M$ when inputting the tokenization candidate $X^{\prime}_{k}$ . If all predictions were different from the original label $Y$ , the weights would be 0 and the data would not be used for training. To avoid this, we use a pre-defined minimum weight $w_{\mathrm{min}}$ . We use this minimum weight because Wang et al. (2019) reported that it is better to give a small weight to samples with annotation errors than not to use them.
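The weight computation can be sketched in a few lines. Following the description of $w_{\mathrm{min}}$ as a pre-defined minimum weight, the sketch treats it as a lower bound on the agreement ratio. The function name `annotation_weight` is hypothetical, and the default $w_{\mathrm{min}}=1/3$ is the value reported in §4.3.

```python
def annotation_weight(gold, predictions, w_min=1 / 3):
    """Weight for one training sample from K predictions of the scouting model.

    `gold` is the annotated label (a single label or a label sequence) and
    `predictions` holds one prediction per tokenization candidate.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    # c_k = 1 only if the whole prediction matches the annotation exactly.
    agreements = sum(1 for pred in predictions if pred == gold)
    ratio = agreements / len(predictions)
    # w_min acts as a floor so that no sample is discarded entirely.
    return max(w_min, ratio)
```

For example, if 7 of $K=10$ predictions reproduce the annotated label sequence, the weight is 0.7, and if none do, the weight falls back to $w_{\mathrm{min}}$.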

3.4 Training of Final Model

Using the training dataset with the weights calculated in the above section, we train the final model $M^{\prime}$, which is independent of $M$. Following CrossWeigh Wang et al. (2019), the loss value $L_{\mathrm{weighted}}$ for the training sample $(X,Y)$ with the weight $w$ is defined as follows:

$$L_{\mathrm{weighted}}(X,Y,w)=w\,L_{\mathrm{original}}(X,Y), \tag{3}$$

where $L_{\mathrm{original}}(\cdot)$ is the loss function used in the default training.
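As an illustration of Eq. (3), the following sketch scales a per-sample loss by its weight. It assumes that $L_{\mathrm{original}}$ is the cross-entropy loss of a single-label classifier and that the batch loss is the mean of the weighted per-sample losses. Both are assumptions made for this example, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def weighted_batch_loss(batch_logits, labels, weights):
    """Mean over the batch of w * L_original(X, Y), as in Eq. (3)."""
    losses = [
        w * cross_entropy(logits, y)
        for logits, y, w in zip(batch_logits, labels, weights)
    ]
    return sum(losses) / len(losses)
```

A sample with weight 1 contributes its full loss, while a sample suspected of an annotation error contributes only a fraction of it, so its gradient is scaled down by the same factor.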

4 Experiments

We examine the processing time for the annotation error detection and weighting. Besides, we evaluate the performance of NER and text classification when using the training dataset with the weights.

4.1 Dataset

We briefly overview the dataset used in the experiments here. The detailed information can be found in Appendix B.

NER:

We used CoNLL-2003 Tjong Kim Sang and De Meulder (2003). We used the official split for the training, validation, and test datasets, with the label format changed from IOB1 to IOB2. In addition, we used CoNLL-CW Wang et al. (2019) and CoNLL-2020 Liu and Ritter (2023) as test datasets³.

³ Because both of these datasets are named CoNLL++, we call them CoNLL-CW and CoNLL-2020 to distinguish them.

Text Classification:

We used SST-2 Socher et al. (2013). Training split was used for the training dataset and validation split was used for the test dataset.

4.2 Compared Methods

We compared the following five methods including three options of SubRegWeigh explained in §3.2.

For the baselines, we selected Vanilla and CrossWeigh. Vanilla is the final model trained on the original training split without any annotation weighing. CrossWeigh is the final model trained with the dataset weighed by the official code⁴ of the existing work Wang et al. (2019). Although CrossWeigh is specialized to NER, we apply it to the text classification setting without Entity Disjoint proposed in Wang et al. (2019), which excludes sentences containing named entities in the validation data from the training data in each fold. This is because Entity Disjoint is unavailable in text classification.

⁴ https://github.com/ZihanWangKi/CrossWeigh

For the proposed method, we evaluated the three options introduced in §3.2, which are represented as SubRegWeigh (Random), SubRegWeigh (Cos-Sim), and SubRegWeigh ( $K$ -means), respectively. We used these proposed methods to weight the training dataset and evaluate the final model trained with this weighted dataset as well as other baselines.

4.3 Model Settings

As the pre-trained language model, we employed $\mathrm{RoBERTa_{LARGE}}$ for both NER and text classification. In addition, $\mathrm{LUKE_{LARGE}}$ Yamada et al. (2020) is used for the NER setting because LUKE is an architecture designed for entity-related tasks. In this experimental setting, we used the same backbone model for both the scouting model and the final model⁵. For example, when RoBERTa is used as the scouting model, the final model also uses RoBERTa. The detailed model settings such as hyperparameters are provided in Appendix C.

⁵ §5.3 discusses the case where different pre-trained models are used for the scouting and final models.

CrossWeigh has three hyperparameters: the number of folds in mistake estimation $K$ , the number of iterations of mistake estimation $T$ , and the weight scaling factor $\epsilon$ . We use $K=10$ , $T=3$ , and $\epsilon=0.7$ , respectively. BPE-Dropout Provilkov et al. (2020) was used to obtain the multiple tokenization candidates for SubRegWeigh because both RoBERTa and LUKE employ BPE for their tokenizers. BPE-Dropout has a hyperparameter $p$ , where a higher $p$ results in finer segmentation (i.e., more different from the original segmentation). In the experiments, $p$ was set to $0.1$ unless otherwise noted. Additionally, by default, the hyperparameters for SubRegWeigh were set to $N=500$ , $K=10$ , and $w_{\mathrm{min}}=1/3$ .

4.4 Evaluation Metrics

We evaluate the methods from the viewpoints of speed and performance. On the speed side, we measured the time for detecting the annotation errors and generating the weighted data for the training dataset in NER⁶. On the performance side, we used the weighted data to train the final models and measured the $F_{1}$ score in NER and the accuracy score in text classification for the outputs of the models on each test dataset.

⁶ We measured the processing time with Xeon Platinum 8167M + NVIDIA V100 for RoBERTa and Xeon Gold 6230R + NVIDIA A100 for LUKE.

| Model | Method | Processing time (hh:mm) | CoNLL valid ($F_{1}$) | CoNLL test ($F_{1}$) | CoNLL-CW ($F_{1}$) | CoNLL-2020 ($F_{1}$) | SST-2 (ACC) |
|---|---|---|---|---|---|---|---|
| SoTA | - | - | - | 94.60† | 95.88†† | - | - |
| $\mathrm{RoBERTa_{LARGE}}$ | Vanilla | - | 97.26 ±0.12 | 93.54 ±0.28 | 95.27 ±0.24 | 94.80 ±0.22 | 94.68 ±0.12 |
| $\mathrm{RoBERTa_{LARGE}}$ | CrossWeigh | 30:55 | 97.11 ±0.11 | 93.40 ±0.21 | 94.99 ±0.19 | 94.93 ±0.25 | 94.31 ±0.09 |
| $\mathrm{RoBERTa_{LARGE}}$ | SubRegWeigh (Random) | 3:26 | 97.30 ±0.16 | 93.51 ±0.26 | 95.24 ±0.22 | 94.52 ±0.22 | 94.61 ±0.10 |
| $\mathrm{RoBERTa_{LARGE}}$ | SubRegWeigh (Cos-Sim) | 4:51 | 97.27 ±0.12 | 93.44 ±0.21 | 95.17 ±0.23 | 94.91 ±0.19 | 94.75 ±0.09 |
| $\mathrm{RoBERTa_{LARGE}}$ | SubRegWeigh ($K$-means) | 5:21 | 97.28 ±0.11 | 93.81 ±0.16 | 95.45 ±0.25 | 94.96 ±0.21 | 94.84 ±0.07 |
| $\mathrm{LUKE_{LARGE}}$ | Vanilla | - | 96.78 ±0.08 | 94.32 ±0.15 | 95.92 ±0.13 | 95.29 ±0.18 | - |
| $\mathrm{LUKE_{LARGE}}$ | CrossWeigh | 26:19 | 96.62 ±0.08 | 94.12 ±0.19 | 95.96 ±0.12 | 95.32 ±0.18 | - |
| $\mathrm{LUKE_{LARGE}}$ | SubRegWeigh (Random) | 2:59 | 96.54 ±0.11 | 94.22 ±0.13 | 95.94 ±0.15 | 95.24 ±0.20 | - |
| $\mathrm{LUKE_{LARGE}}$ | SubRegWeigh (Cos-Sim) | 6:19 | 96.53 ±0.10 | 94.11 ±0.16 | 95.93 ±0.13 | 95.21 ±0.25 | - |
| $\mathrm{LUKE_{LARGE}}$ | SubRegWeigh ($K$-means) | 6:36 | 96.65 ±0.09 | 94.20 ±0.15 | 96.12 ±0.18 | 95.31 ±0.14 | - |

Table 1: The speed of weighing the training dataset and the performance ($F_{1}$ or accuracy) of the final models trained with the weighted dataset. We measured the weighting time with RoBERTa on a V100 and with LUKE on an A100. We report the averaged times and scores and the standard deviations over five runs. Vanilla does not have a processing time because it performs no annotation weighing. † and †† are SoTA scores from Wang et al. (2021b) and Zhou and Chen (2021), respectively.

4.5 Experimental Results

The results of our experiments are shown in Table 1. We reported the averaged $F_{1}$ scores and accuracy scores over the five independent runs and the standard deviations.

4.5.1 Weighting Time

The column “Processing time” shows the time taken to detect the annotation errors and generate weights for each sample in the entire training dataset in NER. As shown in this column, the proposed method SubRegWeigh (Random) was more than eight times faster than CrossWeigh. Even the most time-consuming variant, SubRegWeigh ($K$-means), was about five times faster with RoBERTa and four times faster with LUKE than CrossWeigh. The random selection is the fastest among the proposed methods because it does not require any calculation of TF-IDF, cosine similarity, or $K$-means clustering. In addition, the random method generates only $K(=N=10)$ tokenization candidates, whereas the other SubRegWeigh variants need to select $K$ candidates from $N(=500)$ sampled tokenization candidates, which results in longer processing time. Although Cos-Sim and $K$-means take longer than Random, they are still remarkably faster than CrossWeigh, the existing method for automatic annotation weighing.
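The speed-up factors quoted above follow directly from the processing times in Table 1. The following sketch recomputes them. The dictionary layout and function names are illustrative only, while the time values are copied from the table.

```python
# Processing times (hh:mm) copied from Table 1.
TIMES = {
    "RoBERTa": {"CrossWeigh": "30:55", "Random": "3:26",
                "Cos-Sim": "4:51", "K-means": "5:21"},
    "LUKE": {"CrossWeigh": "26:19", "Random": "2:59",
             "Cos-Sim": "6:19", "K-means": "6:36"},
}


def to_minutes(hhmm):
    """Convert an 'hh:mm' string into a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def speedups(times):
    """Ratio of the CrossWeigh time to each SubRegWeigh variant's time."""
    base = to_minutes(times["CrossWeigh"])
    return {
        method: base / to_minutes(t)
        for method, t in times.items()
        if method != "CrossWeigh"
    }
```

Under this calculation, Random is roughly 9.0 times faster than CrossWeigh with RoBERTa and 8.8 times with LUKE, while $K$-means is roughly 5.8 times faster with RoBERTa and 4.0 times with LUKE.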

4.5.2 Performance on NER

For the evaluation of NER, the performance on CoNLL-CW is the most important because annotation errors in this evaluation split are removed, which indicates that we can compare the methods on the most unpolluted dataset. One can see that SubRegWeigh ( $K$ -means) achieves higher performance than the vanilla method on the test dataset of CoNLL-CW. Specifically, SubRegWeigh ( $K$ -means) improved the F1 scores by 0.18 points with RoBERTa and 0.2 points with LUKE compared to each vanilla baseline. This result indicates that the proposed method of annotation weighing contributes to the performance improvement in the dataset with fewer annotation errors. Furthermore, SubRegWeigh ( $K$ -means) achieves higher scores than CrossWeigh, which demonstrates that the proposed method is a reasonable alternative to the existing method.

With LUKE, and in several cases with RoBERTa, the proposed methods and CrossWeigh scored lower than Vanilla on the CoNLL validation dataset and the original test dataset, which contain annotation errors. These results indicate the remarkable negative effect of the annotation errors in the validation and test datasets and the difficulty of evaluating the models appropriately. The vanilla baseline score of LUKE exceeds the SoTA score Zhou and Chen (2021) for CoNLL-CW, which we attribute to a successful hyperparameter setting.

Among the proposed methods, Random scores lower than $K$-means in many cases. We attribute this to the random selection of tokenization candidates, which can select similar subword sequences and thus bias the inference results. Cos-Sim also scored lower than $K$-means in all cases, which shows that the naive method of selecting tokenization candidates does not contribute to performance improvement. We consider that one reason for the success of the $K$-means selection is that $K$-means clustering can cover the diverse range of subword tokenization candidates.

4.5.3 Performance on Text Classification

For the text classification task, SubRegWeigh ($K$-means) achieved the highest accuracy, and SubRegWeigh (Random) had low accuracy, similar to the results in NER. One reason for the lack of performance improvement with CrossWeigh on SST-2 is that Entity Disjoint is important for CrossWeigh but unavailable in the text classification task.

From the experimental results on the NER and text classification datasets, we conclude that the proposed method is superior to the existing method in terms of the weighting time regardless of the method of selecting tokenization candidates. In particular, with $K$-means, the weighting is approximately four to five times faster than with the existing method, making it highly cost-effective.

5 Discussion

5.1 Pseudo-incorrect Label Test

| Method | $\overline{w}_{cor}$ | $\overline{w}_{incor}$ | CoNLL test ($F_{1}$) | CoNLL-CW ($F_{1}$) |
|---|---|---|---|---|
| Vanilla | - | - | 84.15 | 85.49 |
| CrossWeigh | 0.8000 | 0.0019 | 84.28 | 85.71 |
| SubRegWeigh (Random) | 0.9278 | 0.0048 | 83.77 | 85.04 |
| SubRegWeigh (Cos-Sim) | 0.8346 | 0.0042 | 84.25 | 85.69 |
| SubRegWeigh ($K$-means) | 0.9284 | 0.0048 | 84.34 | 85.76 |

Table 2: Averaged weights and performance for the pseudo-incorrect dataset. $\overline{w}_{cor}$ is the averaged weight for the data whose labels were not replaced in the pseudo-incorrect dataset. $\overline{w}_{incor}$ is the averaged weight for the data whose labels were replaced in the pseudo-incorrect dataset.

We verified whether the proposed method can accurately detect annotation errors in the dataset. We evaluate this capability using a modified dataset with some labels flipped to incorrect labels artificially.

Assuming that the original labels are correctly annotated, we replaced 10% of the labels in the CoNLL-2003 training dataset with different labels that do not conflict with the IOB2 format (the details of the replacement are provided in Appendix D). As a result, from the 8,685 sentences, this procedure yields 3,329 sentences that include pseudo-incorrect labels, referred to as replaced, and 5,356 sentences without any replaced annotations, referred to as unreplaced.

We assign weights to this modified dataset using CrossWeigh and SubRegWeigh⁷. Then we compare the average weights between the replaced and the unreplaced samples to confirm whether the annotation error detection is correct. Here, we calculate the weights as (number of inferences that match the labels) / (number of inferences) instead of $w$ in Eq. (1). This is because we consider that the calculation using the minimum weight $w_{\mathrm{min}}$ and the weight correction in CrossWeigh could have a negative effect on the fair analysis.

⁷ The hyperparameter settings are the same as those described in §4.3 and Appendix C.

In Table 2, $\overline{w}_{incor}$ is the averaged weight assigned to the samples with pseudo-incorrect labels, which we expect to be low because the labels are incorrect. Similarly, $\overline{w}_{cor}$ is the averaged weight assigned to the samples with original labels, which we expect to be high because the labels are correct.

From the results in this table, SubRegWeigh ($K$-means) assigns the most distinct weights between the replaced and the unreplaced data. Additionally, all methods assign weights more than 100 times lower to replaced data than to unreplaced data, which indicates that all methods can detect errors effectively. CrossWeigh assigns the lowest average weight to unreplaced data. This is because CrossWeigh infers only 3 times while SubRegWeigh infers 10 times, so even one incorrect prediction reduces the CrossWeigh weight to 2/3. Among the SubRegWeigh methods, Cos-Sim assigns lower weights to sentences without pseudo-incorrect labels, but this does not translate into smaller performance improvements compared to the other methods, especially SubRegWeigh (Random). This suggests that, as long as the weights of sentences containing errors are low, a slightly lower average weight for the remaining sentences does not matter.

In addition to the investigation of weights, we also analyze the performance obtained by training final models with the modified dataset. For the training of the final models, we used the weights calculated in the manner of Eq. (1). The results are shown in the right two columns of Table 2. Because the training data is noisy, all scores are worse than those in Table 1. However, the overall tendency of the performance is similar to the results in §4. From the overall results, we cannot find a clear relationship between the averaged weights and the final performance.

5.2 Numbers of Tokenization Candidates

K

Method

Time

CoNLL

test

CoNLL

500

Random