SubRegWeigh: Effective and Efficient Annotation Weighing
with Subword Regularization

Kohei Tsuji1, Tatsuya Hiraoka2, Yuchang Cheng 1,3, Tomoya Iwakura1,3,

1NAIST, 2MBZUAI, 3Fujitsu Ltd.
[email protected],
[email protected],
{cheng.yuchang, iwakura.tomoya}@fujitsu.com,
Abstract

Many datasets for natural language processing (NLP) include annotation errors. Researchers have attempted to develop methods that automatically reduce the adverse effect of such errors. However, an existing method is time-consuming because it requires many trained models to detect errors. We propose a novel method that reduces the time needed for error detection. Specifically, we use a tokenization technique called subword regularization to create pseudo-multiple models, which are used to detect errors. Our proposed method, SubRegWeigh, can perform annotation weighting four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in both document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, SubRegWeigh adequately detected the pseudo-incorrect labels.



1 Introduction

Many datasets of natural language processing (NLP) consist of raw texts and annotation labels. Various NLP tasks exploit pairs of raw text and annotation labels for training and evaluating models. For example, in named entity recognition (NER), which is applied to various practical technologies such as location detection Inkpen et al. (2017) and anonymization Mamede et al. (2016), some parts of the text are annotated as named entities (e.g., location names or personal names), and a model is then trained to extract these entities from the raw text. To achieve higher performance in NLP tasks, models should be trained or fine-tuned on a high-quality training dataset without annotation errors.

However, some popular existing datasets such as CoNLL-2003 contain annotation errors Wang et al. (2019). When a dataset contains annotation errors, performance degrades because models are trained on incorrect training data, and/or models are evaluated incorrectly because of errors in the test data.

In the case of NER, these annotation errors are typically related to ambiguous categories of named entities. For example, the string "America" can refer to either a geographical location or an organization, depending on the context. Additionally, when followed by "Online", the correct entity span should be "America Online", which is the name of an organization, rather than just "America".

It has been reported that CoNLL-2003, the well-known NER dataset, contains annotation errors in both training and test splits Wang et al. (2019); Reiss et al. (2020). Most recently, researchers have also explored automated annotation using generative AI Goel et al. (2023); Bogdanov et al. (2024) and collaborative annotation involving humans and AI Naraki et al. (2024). Even with these methods, it is difficult to avoid annotation errors. Therefore, sophisticated weighing methods that automatically detect annotation errors and reduce their negative effect are needed.

Such methods for weighing annotation errors have recently been studied in the NER field. Wang et al. (2019) proposed CrossWeigh, a method that detects annotation errors in a dataset and adjusts their learning priority by weighting loss values so that training is not affected by such errors. However, CrossWeigh has shortcomings in computational efficiency, especially given the recent trend in NLP toward large pre-trained language models. We consider that more efficient annotation weighing methods can speed up the development of NLP. In addition, reducing the computational cost contributes to Green AI Schwartz et al. (2019). Furthermore, we cannot utilize CrossWeigh for NLP tasks other than NER because its architecture is specialized to NER. Therefore, new weighing methods that can be widely used for various NLP tasks are required.

Figure 1: Overview of the three steps of SubRegWeigh with the number of tokenization candidates in the inference set to $K=3$: 1) the training step of the scouting model, 2) the inference step to examine the appropriateness of the training label, and 3) the weighing step to calculate the weights for each training sample according to the appropriateness of their labels. The calculated weights are used to train the final model.

In this study, we propose an efficient and effective method of annotation weighing, SubRegWeigh. SubRegWeigh utilizes the idea of subword regularization based on multiple segmentation results, to detect the annotation errors and weight the loss values to degrade the effect of training samples with errors. Figure 1 overviews the proposed method. Inspired by Takase et al. (2022), we obtain the multiple outputs from a single trained model by inputting multiple variations of tokenization candidates. In other words, SubRegWeigh measures annotation reliability by observing the consistency of multiple outputs generated with subword regularization. SubRegWeigh can be regarded as a method to create pseudo-multiple models for annotation evaluation. This allows for faster weighing of annotation errors by using a single model instead of multiple models required by CrossWeigh.

We conducted experiments on two NLP tasks: NER and text classification. The experimental results on CoNLL-2003 demonstrate that SubRegWeigh can perform annotation weighing four to five times faster than CrossWeigh. Furthermore, SubRegWeigh improves the performance of the trained model compared to CrossWeigh on some datasets in both text classification and NER tasks. In particular, our proposed method achieved SoTA on CoNLL-CW, a test dataset constructed by manually correcting annotation errors in the CoNLL-2003 test split. Experiments with pseudo-incorrect labels also showed that the proposed method can recognize annotation errors well.

2 Related Work

2.1 Annotation Error Detecting

Our work is in the line of CrossWeigh Wang et al. (2019). CrossWeigh detects annotation errors by performing $K$-fold cross-validation on the training dataset $T$ times and weighs the loss value according to the number of wrong predictions, which reduces the effect of training samples with annotation errors in the NER task. This method incurs large computational costs because it requires training $K \times T$ models for detecting annotation errors. Our work differs from CrossWeigh in that we use multiple tokenization candidates to obtain various outputs from one model instead of $K$-fold cross-validation. Compared to CrossWeigh, the proposed method can remarkably reduce the time for detecting annotation errors because we train the model only once and use the outputs of pseudo-models obtained by inputting various tokenization candidates. Furthermore, our method can use the entire training dataset to train the model because we do not employ $K$-fold cross-validation, which contributes to more accurate detection of annotation errors.

In addition to methods that pre-assign annotation correctness weights to the training data set, methods have been studied to correct labels in partially annotated training data using a small set of data known to have correct annotations Yang et al. (2018); Mayhew et al. (2019); Ding et al. (2023). In other lines, some methods are developed to infer the correctness of the labels during training. For instance, in the method introduced in Zhou and Chen (2021), multiple models are trained simultaneously with different initial values, and the KL divergence of outputs for the same sentence is added to the loss function. This method can be applied in conjunction with the weighting of training datasets used by CrossWeigh and our method.

For detecting annotation errors in datasets beyond NER, annotation correction methods with distant supervision Mintz et al. (2009) have been studied in the relation extraction tasks Huang and Du (2019); Qin et al. (2018). However, these methods require an additional clean dataset or are not always applicable. Furthermore, methods for correcting POS tagging tasks have also been studied Nakagawa and Matsumoto (2002); Helgadóttir et al. (2014), but these do not apply to tasks other than POS tagging. Maini et al. (2023) introduced a method to analyze the memorization of the model, which can be used to detect training samples with ambiguous labels.

2.2 Subword Regularization

The proposed method exploits tokenization techniques for the efficient detection of annotation errors. Tokenization Song et al. (2021); Sennrich et al. (2016); Kudo (2018) is widely used in recent natural language processing. This technique reduces the number of unknown words and low-frequency words and limits the vocabulary size by treating words as sequences of finer subwords. However, because tokenization is applied deterministically to the training data, the subword sequences that appear during inference may not always match the sequences learned during training.

Subword regularization Kudo (2018); Provilkov et al. (2020); Hiraoka (2022) alleviates this problem of deterministic tokenization. This technique generates multiple tokenization candidates for a single text. While subword regularization is typically used during the training step to learn from various tokenizations, some studies use it during inference. Takase et al. (2022) use subword regularization during inference to ensemble the various outputs of a single model by feeding several different tokenization candidates to the model. Wang et al. (2021a) proposed a method to reduce the effect of tokenization differences. However, there has been no research on using subword regularization during inference to reduce noise in the training data.

Subword regularization using a unigram language model Kudo (2018) is a method to make the model more robust by sampling tokenization candidates using the unigram probability of tokens. When the model does not employ a unigram language model tokenizer, for example when it uses Byte Pair Encoding (BPE) Sennrich et al. (2016) as in RoBERTa Liu et al. (2019) or WordPiece Song et al. (2021) as in BERT Devlin et al. (2019), we can use BPE-Dropout Provilkov et al. (2020) and MaxMatch-Dropout Hiraoka (2022) for the subword regularization, respectively. BPE-Dropout randomly skips merges during the merge process to obtain multiple subword sequences. MaxMatch-Dropout randomly rejects the longest-match candidates for each word to obtain multiple subword sequences.
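To make the idea of BPE-Dropout concrete, the following is a minimal sketch on a toy merge table. The table `TOY_MERGES`, the function name `bpe_dropout`, and the example word are illustrative assumptions, not the tokenizer of RoBERTa or any library API. At every merge step, each applicable merge is dropped with probability `p`, and the highest-priority surviving merge is applied.

```python
import random

# Hypothetical toy merge table: pair -> priority rank (lower rank is applied first).
TOY_MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}


def bpe_dropout(word, merges, p, rng):
    """Segment `word` with BPE, skipping each applicable merge with probability p."""
    symbols = list(word)
    while True:
        # Collect adjacent pairs that are in the merge table and survive dropout.
        candidates = []
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and rng.random() >= p:
                candidates.append((merges[pair], i))
        if not candidates:
            return symbols
        # Apply the surviving merge with the best (lowest) rank.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


if __name__ == "__main__":
    rng = random.Random(0)
    print(bpe_dropout("lower", TOY_MERGES, 0.0, rng))  # p = 0: deterministic BPE
    for _ in range(3):
        print(bpe_dropout("lower", TOY_MERGES, 0.3, rng))  # sampled segmentations
```

With `p = 0` the sketch reduces to deterministic BPE, and with `p = 1` it falls back to a character-level segmentation, so larger `p` yields finer and more varied subword sequences.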

3 Proposed Method: SubRegWeigh

The purpose of the proposed method, SubRegWeigh, is to detect the incorrectly annotated samples in the training dataset and reduce the effect of such a sample with annotation errors during the training of the final model by weighting the loss values. The process of SubRegWeigh is composed of the following three steps:

  1. Training a scouting model in the usual manner for NLP tasks with the training data. Deterministic tokenization is used for this training (§3.1).

  2. Inferring the labels for the inputs in the training dataset with the trained scouting model. Herein, we tokenize each input text into $K$ different tokenization candidates. Therefore, we can collect $K$ inferred outputs, one for each tokenization candidate of a single input text (§3.2).

  3. Calculating the weight for each training sample according to the number of tokenization candidates for which the scouting model predicted the annotated label (§3.3).

This process is overviewed in Figure 1. The first and second steps detect the incorrect annotations by scouting the dataset before the final model training. Then, SubRegWeigh weighs the correctness of annotation labels and assigns a weight to each training sample in the third step. After the above steps, we can train the final model with the original training data and the calculated weights for each training sample (§3.4). Compared to CrossWeigh, SubRegWeigh can be used for various NLP tasks beyond NER because none of its steps is specialized to NER. The following sections explain each step in detail.

3.1 Training of the Scouting Model

Inspired by CrossWeigh Wang et al. (2019), we first create the scouting model $M$ (see the top left of Figure 1). This model is used to scout the correctness of the annotated labels in the training dataset in the next step. Let $(X, Y)$ be a pair of the text and the label corresponding to a single training sample. When the text is composed of $I$ words, $X = x_1, ..., x_I$. $Y$ is a single value in text classification (e.g., positive and negative labels for sentiment classification) or a sequence of labels $Y = y_1, ..., y_I$ in sequential labeling tasks such as NER.

In this step, we train the scouting model $M$ in the usual manner of training or fine-tuning for general NLP datasets. The single input text $X$ is tokenized into a sequence of subwords $X' = s_1, ..., s_J$ that is composed of $J$ subwords.¹ After tokenizing all texts in the training dataset, we train the scouting model $M$ with the tokenized texts and labels. Note that we use deterministic tokenization here (e.g., the default encoding of the BPE tokenizer or the $1$-best tokenization with the Viterbi algorithm for the unigram language model-based tokenizer). In other words, we do not use subword regularization (i.e., stochastic tokenization) for the training of the scouting model $M$.

¹ The numbers of words $I$ and subwords $J$ do not necessarily match, depending on the tokenization method. In the case of sequential labeling tasks with $I \neq J$, we aligned the subwords and labels following the existing literature Liu et al. (2019); Yamada et al. (2020).

3.2 Inference with the Scouting Model

In the second step, we use the trained scouting model $M$ to scout the correctness of the annotated labels in the training dataset (see the right part of Figure 1). We assume that the trained scouting model $M$ can correctly predict the labels even for differently tokenized input if the label is not wrongly annotated. Specifically, the input text $X$ is tokenized into $K$ different tokenization candidates $\{X'_1, ..., X'_k, ..., X'_K\}$. Then each tokenization candidate $X'_k$ is fed into the trained model $M$, and we obtain the corresponding prediction $\hat{Y}_k$.² After the inference, we evaluate the output $\hat{Y}_k$ against the gold annotation $Y$. This process can be regarded as the “mistake-reweighing” process of CrossWeigh with $K$ pseudo-models, which is realized with the $K$ different tokenization candidates for the single input text and the single model.

² Like $Y$, $\hat{Y}_k$ can be a single label or multiple labels depending on the target task.

We can obtain various tokenization candidates using subword regularization. When we use BPE or WordPiece, BPE-Dropout and MaxMatch-Dropout can be used, respectively. When we use a unigram language model-based tokenizer, we can also use $N$-best tokenization instead of subword regularization.

We consider it important to diversify the $K$ tokenization candidates so as to weigh the correctness of the annotation from diverse aspects. To avoid using similar subword sequences as the inputs for the model $M$, we select $K$ mutually dissimilar candidates from a large number $N$ of tokenization candidates. We propose the following three options for selecting such $K$ tokenization candidates.

Random:

As the most straightforward way, we randomly sample $N = K$ tokenization candidates using subword regularization and use them as inputs for the model $M$. In other words, we do not select the tokenization candidates from a larger pool of candidates.

Cos-Sim:

We select $K$ tokenization candidates using the cosine similarity calculated with TF-IDF vectors. TF-IDF is calculated with each subword as a term, each subword sequence as a document, and the entire candidate set as a corpus. First, we convert $N$ ($> K$) tokenization candidates into TF-IDF vectors and select as the first candidate the one with the smallest cosine similarity to the deterministic tokenization $X'$. Subsequently, we select a candidate that is similar neither to $X'$ nor to the already selected candidates. We repeat this selection until the number of candidates reaches $K$.
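The greedy Cos-Sim selection can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation: the smoothed IDF formula, the inclusion of the deterministic tokenization in the TF-IDF corpus, and the scoring rule (pick the candidate whose maximum cosine similarity to the deterministic tokenization and to the already selected candidates is smallest) are our assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors; each subword is a term, each sequence a document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption), always positive.
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_cos_sim(deterministic, candidates, k):
    """Return indices of k candidates that are dissimilar to X' and to each other."""
    vectors = tfidf_vectors([deterministic] + candidates)
    references = [vectors[0]]               # start from the deterministic tokenization
    remaining = list(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < k:
        # Greedy step: smallest worst-case similarity to everything picked so far.
        best = min(remaining,
                   key=lambda i: max(cosine(vectors[i + 1], r) for r in references))
        chosen.append(best)
        references.append(vectors[best + 1])
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    x_det = ["lower"]
    pool = [["lower"], ["low", "er"], ["l", "o", "w", "e", "r"], ["lo", "w", "er"]]
    print(select_cos_sim(x_det, pool, k=2))
```

In this toy pool, the candidate identical to the deterministic tokenization is never selected first because its cosine similarity to $X'$ is 1.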

$K$-means:

We select $K$ tokenization candidates using $K$-means clustering. We apply the clustering to $N$ ($> K$) candidates vectorized with TF-IDF. Then we select the $K$ candidates that are nearest to the centroid of each cluster.
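A minimal, dependency-free sketch of the $K$-means selection is given below. The plain Lloyd iteration, the L2-normalized TF-IDF vectors with smoothed IDF, the fixed random seed, and the tie-breaking that avoids picking the same candidate twice are all assumptions made for illustration; the function names are hypothetical, and a library clustering routine could be substituted.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Dense, L2-normalized TF-IDF rows over the vocabulary of `docs`."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    rows = []
    for d in docs:
        tf = Counter(d)
        row = [tf[t] / len(d) * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return rows


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_kmeans(candidates, k, iters=20, seed=0):
    """Cluster the candidates into k groups and return one index per centroid."""
    rows = tfidf_matrix(candidates)
    rng = random.Random(seed)
    centroids = [rows[i][:] for i in rng.sample(range(len(rows)), k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: sq_dist(r, centroids[c]))
            clusters[nearest].append(r)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    chosen = []
    for centroid in centroids:
        order = sorted(range(len(rows)), key=lambda i: sq_dist(rows[i], centroid))
        # Take the nearest candidate that has not been chosen for another centroid.
        chosen.append(next(i for i in order if i not in chosen))
    return chosen


if __name__ == "__main__":
    pool = [["lower"], ["low", "er"], ["lo", "w", "er"],
            ["l", "o", "w", "e", "r"], ["l", "ow", "er"], ["lo", "we", "r"]]
    print(select_kmeans(pool, k=3))
```

Because each selected candidate represents a different cluster, the selected set tends to cover distinct regions of the candidate pool rather than near-duplicates.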

Table 6 in Appendix A shows the tokenization examples by each selection method.

3.3 Weighting for Training Samples

To reduce the effect of training samples with annotation errors, we calculate weights for each sample depending on the results of the inference step (see the bottom left of Figure 1). Concretely, we calculate the weight $w$ corresponding to the training sample $(X, Y)$ as follows:

$$w = \max\left(w_{\mathrm{min}},\ \frac{1}{K}\sum_{k=1}^{K} c_k\right), \tag{1}$$

$$c_k = \begin{cases} 1 & \text{if}\quad \hat{Y}_k = Y \\ 0 & \text{otherwise} \end{cases}, \tag{2}$$

where $\hat{Y}_k$ is the output of the scouting model $M$ when inputting the tokenization candidate $X'_k$. If all predictions were different from the original label $Y$, the weight would be 0 and the sample would not be used for training. To avoid this, we use a pre-defined minimum weight $w_{\mathrm{min}}$. We use this minimum weight because Wang et al. (2019) reported that it is better to give a small weight to samples with annotation errors than not to use them.
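The weighting step can be illustrated with a short sketch. It assumes that $w_{\mathrm{min}}$ acts as a lower bound on the weight, as the explanation above describes, and that a prediction counts as correct only when it matches the annotation exactly (the whole label sequence in NER). The function name `subreg_weight` and the example labels are hypothetical.

```python
def subreg_weight(predictions, gold, w_min=1 / 3):
    """Weight of one training sample from K predictions of the scouting model.

    `predictions` holds one prediction per tokenization candidate; each is
    either a single label (classification) or a label sequence (NER).
    """
    k = len(predictions)
    # c_k = 1 if the k-th prediction equals the annotation, else 0.
    agreement = sum(1 for pred in predictions if pred == gold) / k
    # Assumption: w_min is a floor, so a sample is never discarded completely.
    return max(w_min, agreement)


if __name__ == "__main__":
    gold = ["B-ORG", "I-ORG", "O"]
    preds = [
        ["B-ORG", "I-ORG", "O"],   # matches the annotation
        ["B-LOC", "O", "O"],       # disagrees
        ["B-ORG", "I-ORG", "O"],   # matches the annotation
    ]
    print(subreg_weight(preds, gold))            # two of three candidates agree
    print(subreg_weight([["O", "O", "O"]] * 3, gold))  # no agreement: falls back to w_min
```

A sample whose annotation is reproduced under every tokenization keeps weight 1, while a sample that is never reproduced receives only $w_{\mathrm{min}}$.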

3.4 Training of Final Model

Using the training dataset with the weights calculated in the above section, we train the final model $M'$, which is independent of $M$. Following CrossWeigh Wang et al. (2019), the loss value $L_{\mathrm{weighted}}$ for the training sample $(X, Y)$ with the weight $w$ is defined as follows:

$$L_{\mathrm{weighted}}(X, Y, w) = w L_{\mathrm{original}}(X, Y), \tag{3}$$

where $L_{\mathrm{original}}(\cdot)$ is the loss function used in the default training.
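As a small illustration of the weighted loss, the sketch below scales a per-sample loss by the sample weight. Cross-entropy is used as a stand-in for $L_{\mathrm{original}}$, and averaging over the mini-batch is likewise an assumption; the function names and the toy logits are hypothetical.

```python
import math


def cross_entropy(logits, gold_index):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold_index]


def weighted_loss(logits, gold_index, w):
    """L_weighted(X, Y, w) = w * L_original(X, Y)."""
    return w * cross_entropy(logits, gold_index)


def batch_loss(batch):
    """Mean of weighted losses over (logits, gold_index, weight) triples (an assumption)."""
    return sum(weighted_loss(l, y, w) for l, y, w in batch) / len(batch)


if __name__ == "__main__":
    batch = [
        ([2.0, 0.1], 0, 1.0),      # label judged reliable: full contribution
        ([0.2, 1.5], 0, 1 / 3),    # label judged suspicious: down-weighted
    ]
    print(batch_loss(batch))
```

Down-weighting leaves the gradient direction of a suspicious sample unchanged and only shrinks its magnitude, so the sample still contributes a little to training.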

4 Experiments

We examine the processing time for the annotation error detection and weighting. Besides, we evaluate the performance of NER and text classification when using the training dataset with the weights.

4.1 Dataset

We briefly overview the dataset used in the experiments here. The detailed information can be found in Appendix B.

NER:

We used CoNLL-2003 Tjong Kim Sang and De Meulder (2003). We used the official split for the training, validation, and test datasets, with the label format changed from IOB1 to IOB2. In addition, we used CoNLL-CW Wang et al. (2019) and CoNLL-2020 Liu and Ritter (2023) for the test dataset.³

³ Because both of these datasets are named CoNLL++, we call them CoNLL-CW and CoNLL-2020 to distinguish them.

Text Classification:

We used SST-2 Socher et al. (2013). The training split was used as the training dataset, and the validation split was used as the test dataset.

4.2 Compared Methods

We compared the following five methods including three options of SubRegWeigh explained in §3.2.

For the baselines, we selected Vanilla and CrossWeigh. Vanilla is the final model trained on the original training split without any annotation weighing. CrossWeigh is the final model trained with the dataset weighed by the official code⁴ of the existing work Wang et al. (2019). Although CrossWeigh is specialized to NER, we apply it to the text classification setting without Entity Disjoint, the technique proposed in Wang et al. (2019) that excludes sentences containing named entities in the validation data from the training data in each fold. This is because Entity Disjoint is unavailable in text classification.

⁴ https://github.com/ZihanWangKi/CrossWeigh

For the proposed method, we evaluated the three options introduced in §3.2, which are represented as SubRegWeigh (Random), SubRegWeigh (Cos-Sim), and SubRegWeigh ($K$-means), respectively. We used these proposed methods to weight the training dataset and evaluated the final model trained with this weighted dataset in the same way as the baselines.

4.3 Model Settings

As the pre-trained language model, we employed $\mathrm{RoBERTa_{LARGE}}$ for both NER and text classification. In addition, $\mathrm{LUKE_{LARGE}}$ Yamada et al. (2020) is used for the NER setting because LUKE is an architecture for entity-related tasks. In this experimental setting, we used the same backbone model for both the scouting model and the final model.⁵ For example, when RoBERTa is used in the scouting model, the final model also uses RoBERTa. The detailed model settings such as hyperparameters are provided in Appendix C.

⁵ §5.3 discusses the case where different pre-trained models are used for the scouting and final models.

CrossWeigh has three hyperparameters: the number of folds in mistake estimation $K$, the number of iterations of mistake estimation $T$, and the weight scaling factor $\epsilon$. We use $K = 10$, $T = 3$, and $\epsilon = 0.7$, respectively. BPE-Dropout Provilkov et al. (2020) was used to obtain the multiple tokenization candidates for SubRegWeigh because both RoBERTa and LUKE employ BPE for their tokenizers. BPE-Dropout has a hyperparameter $p$, where a higher $p$ results in finer segmentation (i.e., more different from the original segmentation). In the experiments, $p$ was set to $0.1$ unless otherwise noted. Additionally, by default, the hyperparameters for SubRegWeigh were set to $N = 500$, $K = 10$, and $w_{\mathrm{min}} = 1/3$.

4.4 Evaluation Metrics

We evaluate the methods from the viewpoints of speed and performance. On the speed side, we measured the time for detecting the annotation errors and generating the weighted data for the training dataset in NER.⁶ On the performance side, we used the weighted data to train the final models and measured the $F_1$ score in NER and the accuracy score in text classification for the outputs of the models on each test dataset.

⁶ We measured the processing time with Xeon Platinum 8167M + NVIDIA V100 for RoBERTa and Xeon Gold 6230R + NVIDIA A100 for LUKE.

| Method | Processing time (hh:mm) | CoNLL valid ($F_1$) | CoNLL test ($F_1$) | CoNLL-CW ($F_1$) | CoNLL-2020 ($F_1$) | SST-2 (ACC) |
|---|---|---|---|---|---|---|
| SoTA | - | - | 94.60† | 95.88†† | - | - |
| $\mathrm{RoBERTa_{LARGE}}$ | | | | | | |
| Vanilla | - | 97.26±0.12 | 93.54±0.28 | 95.27±0.24 | 94.80±0.22 | 94.68±0.12 |
| CrossWeigh | 30:55 | 97.11±0.11 | 93.40±0.21 | 94.99±0.19 | 94.93±0.25 | 94.31±0.09 |
| SubRegWeigh (Random) | 3:26 | 97.30±0.16 | 93.51±0.26 | 95.24±0.22 | 94.52±0.22 | 94.61±0.10 |
| SubRegWeigh (Cos-Sim) | 4:51 | 97.27±0.12 | 93.44±0.21 | 95.17±0.23 | 94.91±0.19 | 94.75±0.09 |
| SubRegWeigh ($K$-means) | 5:21 | 97.28±0.11 | 93.81±0.16 | 95.45±0.25 | 94.96±0.21 | 94.84±0.07 |
| $\mathrm{LUKE_{LARGE}}$ | | | | | | |
| Vanilla | - | 96.78±0.08 | 94.32±0.15 | 95.92±0.13 | 95.29±0.18 | - |
| CrossWeigh | 26:19 | 96.62±0.08 | 94.12±0.19 | 95.96±0.12 | 95.32±0.18 | - |
| SubRegWeigh (Random) | 2:59 | 96.54±0.11 | 94.22±0.13 | 95.94±0.15 | 95.24±0.20 | - |
| SubRegWeigh (Cos-Sim) | 6:19 | 96.53±0.10 | 94.11±0.16 | 95.93±0.13 | 95.21±0.25 | - |
| SubRegWeigh ($K$-means) | 6:36 | 96.65±0.09 | 94.20±0.15 | 96.12±0.18 | 95.31±0.14 | - |

Table 1: The speed of weighing the training dataset and the performance of the final models trained with the weighted dataset ($F_1$ for NER and accuracy for SST-2). We measured the weighting time with RoBERTa on a V100 and with LUKE on an A100. We report the averaged times and scores and standard deviations over five runs. Vanilla has no processing time because it involves no annotation weighing. The best performing scores are highlighted in bold. † and †† are the SoTA scores from Wang et al. (2021b) and Zhou and Chen (2021), respectively.

4.5 Experimental Results

The results of our experiments are shown in Table 1. We report the averaged $F_1$ scores and accuracy scores over five independent runs, together with the standard deviations.

4.5.1 Weighting Time

The column “Processing time” shows the time taken to detect the annotation errors and generate weights for each sample in the entire training dataset in NER. As shown in this column, the fastest proposed method, SubRegWeigh (Random), was more than eight times faster than CrossWeigh. Even the most time-consuming variant, SubRegWeigh ($K$-means), was about five times faster with RoBERTa and about four times faster with LUKE than CrossWeigh. Random selection is the fastest among the proposed methods because it requires none of the calculations for TF-IDF, cosine similarity, or $K$-means clustering. In addition, the random method generates only $K$ ($= N = 10$) tokenization candidates, whereas the other SubRegWeigh variants need to select $K$ candidates from $N$ ($= 500$) sampled tokenization candidates, which results in a longer processing time. Although Cos-Sim and $K$-means take longer than Random, they are remarkably faster than CrossWeigh, the existing method for automatic annotation weighing.
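The speed-up factors can be re-derived from the processing times in Table 1 with a few lines of arithmetic. The helper names below are hypothetical; the times are copied from the table.

```python
def to_minutes(hhmm):
    """Convert an 'hh:mm' string to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Processing times (hh:mm) from Table 1.
TIMES = {
    "RoBERTa": {"CrossWeigh": "30:55", "Random": "3:26", "Cos-Sim": "4:51", "K-means": "5:21"},
    "LUKE": {"CrossWeigh": "26:19", "Random": "2:59", "Cos-Sim": "6:19", "K-means": "6:36"},
}


def speedups(times):
    """Ratio of the CrossWeigh time to the time of each SubRegWeigh variant."""
    base = to_minutes(times["CrossWeigh"])
    return {name: base / to_minutes(t)
            for name, t in times.items() if name != "CrossWeigh"}


if __name__ == "__main__":
    for backbone, times in TIMES.items():
        for name, ratio in speedups(times).items():
            print(f"{backbone:8s} {name:8s} {ratio:.2f}x")
```

The ratios come out at roughly 9.0 (RoBERTa) and 8.8 (LUKE) for Random, and roughly 5.8 (RoBERTa) and 4.0 (LUKE) for $K$-means.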

4.5.2 Performance on NER

For the evaluation of NER, the performance on CoNLL-CW is the most important because annotation errors in this evaluation split have been removed, which means that we can compare the methods on the least polluted dataset. One can see that SubRegWeigh ($K$-means) achieves higher performance than the vanilla method on the test dataset of CoNLL-CW. Specifically, SubRegWeigh ($K$-means) improved the $F_1$ scores by 0.18 points with RoBERTa and 0.2 points with LUKE compared to each vanilla baseline. This result indicates that the proposed method of annotation weighing contributes to the performance improvement on the dataset with fewer annotation errors. Furthermore, SubRegWeigh ($K$-means) achieves higher scores than CrossWeigh, which demonstrates that the proposed method is a reasonable alternative to the existing method.

With LUKE, both the proposed methods and CrossWeigh scored lower than Vanilla on the CoNLL validation split and the original test split, which contain annotation errors; with RoBERTa, the same holds for CrossWeigh. These results suggest that annotation errors in the validation and test datasets have a notable negative effect and make it difficult to evaluate the models appropriately. The vanilla baseline score of LUKE exceeds the SoTA score Zhou and Chen (2021) for CoNLL-CW, which we attribute to a favorable hyperparameter setting.

Among the proposed methods, Random scores lower than $K$-means in many cases. We consider that this is because the random selection of tokenization candidates can pick similar subword sequences, causing a bias in the inference results. Cos-Sim also scored lower than $K$-means in all cases, which shows that the naive method of selecting tokenization candidates does not contribute to performance improvement. We consider that one reason for the success of the $K$-means selection method is that $K$-means clustering can cover a diverse range of subword tokenization candidates.

4.5.3 Performance on Text Classification

For the text classification task, SubRegWeigh ($K$-means) achieved the highest accuracy, and SubRegWeigh (Random) had low accuracy, similar to the results in NER. One reason for the lack of performance improvement with CrossWeigh on SST-2 is that Entity Disjoint, which is important for CrossWeigh, is unavailable in the text classification task.

From the experimental results on the NER and text classification datasets, we conclude that the proposed method is superior to the existing method in terms of weighing time regardless of the method used to select tokenization candidates. In particular, with $K$-means, the method is approximately four to five times faster than the existing method, making it highly cost-effective.

5 Discussion

5.1 Pseudo-incorrect Label Test

| Method | $\overline{w}_{cor}$ | $\overline{w}_{incor}$ | CoNLL test ($F_1$) | CoNLL-CW ($F_1$) |
| --- | --- | --- | --- | --- |
| Vanilla | - | - | 84.15 | 85.49 |
| CrossWeigh | 0.8000 | 0.0019 | 84.28 | 85.71 |
| SubRegWeigh (Random) | 0.9278 | 0.0048 | 83.77 | 85.04 |
| SubRegWeigh (Cos-Sim) | 0.8346 | 0.0042 | 84.25 | 85.69 |
| SubRegWeigh ($K$-means) | 0.9284 | 0.0048 | 84.34 | 85.76 |

Table 2: Averaged weights and performance for the pseudo-incorrect dataset. $\overline{w}_{cor}$ is the averaged weight for the data whose labels were not replaced in the pseudo-incorrect dataset. $\overline{w}_{incor}$ is the averaged weight for the data whose labels were replaced in the pseudo-incorrect dataset.

We verified whether the proposed method can accurately detect annotation errors in the dataset. We evaluated this capability using a modified dataset in which some labels were artificially flipped to incorrect labels.

Assuming that the original labels are correctly annotated, we replaced 10% of the labels in the CoNLL-2003 training dataset with different labels that do not conflict with the IOB2 format (the details of the replacement are provided in Appendix D). As a result, out of 8,685 sentences, this method produced 3,329 sentences that include pseudo-incorrect labels, referred to as replaced, and 5,356 sentences without replaced annotations, referred to as unreplaced.

We assign weights to this modified dataset using CrossWeigh and SubRegWeigh (the hyperparameter settings are the same as those described in §4.3 and Appendix C). We then compared the average weights of the replaced and the unreplaced samples to confirm whether annotation errors are detected correctly. Here, we calculate the weights as (number of inferences that match the labels) / (number of inferences) instead of $w$ in Eq. (1). This is because we consider that the minimum weight $w_{min}$ and the weight correction in CrossWeigh could hinder a fair analysis.
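The ratio used in this analysis can be sketched in a few lines. The function name `agreement_weight` and the toy label sequences are illustrative assumptions, and treating an inference as a match only when the whole predicted sequence equals the annotation is also an assumption made for this sketch.

```python
from fractions import Fraction

def agreement_weight(gold, predictions):
    """Fraction of inferences whose output equals the annotated labels."""
    if not predictions:
        raise ValueError("at least one inference is required")
    matches = sum(1 for pred in predictions if list(pred) == list(gold))
    return Fraction(matches, len(predictions))

gold = ["B-ORG", "I-ORG", "O"]
wrong = ["B-LOC", "I-LOC", "O"]  # hypothetical disagreeing inference

# Three inferences with one disagreement.
three = [gold, gold, wrong]
# Ten inferences with one disagreement.
ten = [gold] * 9 + [wrong]

print(agreement_weight(gold, three), agreement_weight(gold, ten))
```

With three inferences a single disagreement already lowers the ratio to 2/3, whereas with ten inferences it only lowers it to 9/10, so the number of inferences determines how coarse the resulting weights are.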

In Table 2, $\overline{w}_{incor}$ is the averaged weight assigned to the samples with pseudo-incorrect labels, which we expect to be low because the labels are incorrect. Similarly, $\overline{w}_{cor}$ is the averaged weight assigned to the samples with original labels, which should be high because the labels are correct.

From the results in this table, SubRegWeigh ($K$-means) assigns the most distinct weights between the replaced and the unreplaced data. Additionally, all methods assign weights more than 100 times lower to replaced data than to unreplaced data. This indicates that all methods can detect errors effectively. CrossWeigh has the lowest average weight for unreplaced data. This is because CrossWeigh infers only 3 times while SubRegWeigh infers 10 times, so even one incorrect prediction reduces the CrossWeigh weight to 2/3. Among the SubRegWeigh methods, Cos-Sim assigns lower weights to sentences without pseudo-incorrect labels, but this does not translate into smaller performance improvements than the other methods, especially SubRegWeigh (Random). This suggests that, as long as the weights of sentences containing errors are low, a slightly lower average weight for the remaining sentences does not matter.

In addition to investigating the weights, we also analyzed the performance obtained by training final models on the modified dataset. For the training of the final models, we used the weights calculated according to Eq. (1). The results are shown in the two rightmost columns of Table 2. Because the training data is noisy, the scores are lower overall than those in Table 1. However, the overall trend in performance is similar to the results in §4. Taken together, we cannot find a clear relationship between the averaged weights and the final performance.
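The separation between the two averaged weights can be checked directly from the values in Table 2. The following is a small arithmetic sketch; the helper name `separation` and the dictionary layout are illustrative assumptions, while the numbers are copied from the table.

```python
# Averaged weights copied from Table 2: (w_cor, w_incor) per method.
table2 = {
    "CrossWeigh": (0.8000, 0.0019),
    "SubRegWeigh (Random)": (0.9278, 0.0048),
    "SubRegWeigh (Cos-Sim)": (0.8346, 0.0042),
    "SubRegWeigh (K-means)": (0.9284, 0.0048),
}

def separation(w_cor, w_incor):
    """Return (ratio, gap) between unreplaced and replaced average weights."""
    return w_cor / w_incor, w_cor - w_incor

for method, (w_cor, w_incor) in table2.items():
    ratio, gap = separation(w_cor, w_incor)
    print(f"{method}: ratio = {ratio:.0f}x, gap = {gap:.4f}")
```

Under these numbers every method separates the two groups by a factor above 100, and the absolute gap is largest for SubRegWeigh ($K$-means).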

5.2 Numbers of Tokenization Candidates

| $K$ | Method | Time | CoNLL test | CoNLL-CW |
| --- | --- | --- | --- | --- |
| 500 | Random | 75:41 | 93.71 | 95.26 |
| 50 | Random | 9:15 | 93.46 | 95.08 |
| 50 | Cos-Sim | 10:52 | 93.46 | 95.12 |
| 50 | $K$-means | 12:02 | 93.65 | 95.30 |
| 10 | Random | 3:26 | 93.51 | 95.24 |
| 10 | Cos-Sim | 4:51 | 93.44 | 95.17 |
| 10 | $K$-means | 5:21 | 93.81 | 95.45 |

Table 3: Difference in the speed and the performance against three options of $K$.

In the proposed method, $K$ is an important hyperparameter affecting both speed and performance. We investigated how the time for annotation weighing and the $F_1$ score of the final model differ across three options for the number of tokenization candidates $K$ on the NER dataset. We examined $K \in \{10, 50, 500\}$ with RoBERTa. The dropout rate for BPE-Dropout was $p=0.1$. Since $K$-means and Cos-Sim are techniques to reduce the variations in subword segmentation, a large value like $K=500$ makes them almost indistinguishable from the Random method. Therefore, only the Random method was investigated for $K=500$.

The results are shown in Table 3. Comparing the results for $K=10$ and $K=50$, one can see that $K$ does not have a large effect on the $F_1$ scores. However, a larger $K$ remarkably increases the time for annotation weighing. For $K=500$, the $F_1$ score improved compared to Random with $K=10$ and $K=50$ on the CoNLL and CoNLL-CW datasets. This suggests that using a large number of subword sequences can weigh annotation errors effectively. However, such a large $K$ slows down annotation weighing, creating a trade-off between speed and performance. While increasing $K$ to 500 in the Random method improved performance, $K$-means exhibited the highest performance even with a smaller $K$. This indicates that $K$-means can select sufficiently diverse subword sequences even with a small $K$.
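To make the cost side of this trade-off concrete, the weighing times of the Random method in Table 3 can be compared as ratios. This is only an arithmetic sketch: the table does not state the time unit, so the helper `to_units` (a hypothetical name) assumes a two-field base-60 reading, under which the ratios are the same whether the fields are hours and minutes or minutes and seconds.

```python
def to_units(clock):
    """Convert a 'major:minor' base-60 reading (e.g. '75:41') to minor units."""
    major, minor = clock.split(":")
    return int(major) * 60 + int(minor)

# Weighing times for the Random method, copied from Table 3.
random_times = {10: "3:26", 50: "9:15", 500: "75:41"}

base = to_units(random_times[10])
for k, clock in random_times.items():
    print(f"K={k}: {to_units(clock) / base:.1f}x the K=10 weighing time")
```

Under this reading, $K=500$ takes roughly 22 times as long as $K=10$ for the Random method.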

| Final Model | Scouting Model | CoNLL test | CoNLL-CW |
| --- | --- | --- | --- |
| RoBERTa | RoBERTa | 93.81 | 95.45 |
| RoBERTa | LUKE | 93.65 | 95.42 |
| LUKE | RoBERTa | 94.24 | 95.89 |
| LUKE | LUKE | 94.20 | 96.12 |

Table 4: Difference in the performance when using different models for scouting and final model.

| Text/Label | CrossWeigh | SubRegWeigh (Random) | SubRegWeigh (Cos-Sim) | SubRegWeigh ($K$-means) |
| --- | --- | --- | --- | --- |
| The/O foreign/O ministry/O ’s/O Shen/\underline{B-ORG} told/O Reuters/B-ORG Television/I-ORG in/O an/O interview/O he/O had/O read/O reports/O of/O Tang/B-PER ’s/O comments/O | 0.343 | 0.333 | 0.333 | 0.333 |
| NOTES/O BAYERISCHE/B-ORG VEREINSBANK/I-ORG IS/O JOINT/O LEAD/O MANAGER/O | 0.700 | 0.500 | 0.500 | 0.333 |
| Former/O Surinam/\underline{B-LOC} rebel/\underline{O} leader/O held/O after/O shooting/O ./O | 1.000 | 0.700 | 0.700 | 0.500 |

Table 5: Examples of weights assigned by each method. Each word is followed by its label. Underlined are incorrect or ambiguous labels.

5.3 Weighting with Different Backbone

In the experiment shown in §4, we used the same backbone for the scouting model $M$ used for annotation weighing and the final model $M'$. Herein, we investigated the performance difference when using different backbones on the NER dataset. Specifically, we examined the effect on $F_1$ scores when $M$ was RoBERTa and $M'$ was LUKE. Similarly, we examined the case where $M$ was LUKE and $M'$ was RoBERTa. Both models were trained with the hyperparameters listed in Table 7. We used $K$-means for selecting the tokenization candidates because it showed the best performance in the main results (§4).

The results in Table 4 indicate that, in most cases, the data weighted by the same model tends to achieve higher $F_1$ scores. Although we could save time by preparing the weighed dataset with a single model and reusing it to train models with other backbones, this observation suggests that the weighed dataset should be prepared with the backbone used for the final model to obtain the performance improvement. This also supports the importance of a fast annotation weighing method.

5.4 Qualitative Analysis

Several examples of the CoNLL-2003 training dataset weighed by each method are shown in Table 5. All methods successfully assigned low weights for clearly incorrect sentences like the top example in the table. However, for ambiguous labels like the example at the bottom of the table, CrossWeigh tended to assign high weights, whereas SubRegWeigh more frequently assigned low weights. Additionally, we discovered a specific weakness of SubRegWeigh: it tends to assign lower weights to sentences composed entirely of uppercase letters (see the example in the middle of the table). This is because only a few uppercase-only subwords are included in the tokenizer’s vocabulary, and even with a low $p$, subword regularization significantly changes the tokenization candidates for uppercase-only sentences, leading to incorrect inferences. We believe this issue is not unique to our method but rather a general problem with subword regularization, which we plan to investigate further in future research.

6 Conclusion

We proposed SubRegWeigh, a method for annotation weighing using subword regularization, which offers faster annotation weighing than existing methods. In particular, subword sequence selection using $K$-means was four to five times faster than CrossWeigh for annotation weighing, while yielding better model performance than weighing with inference on the full, large set of generated subword sequences.

In addition, the performance dropped in many cases when different models were used for the scouting and final model. This indicates that annotation weighing should be performed with the backbone of the final model when comparing models, and it underlines the importance of developing an efficient annotation weighing method.

Limitation

In this paper, we proposed a method for annotation weighing using subword regularization. However, since a deep learning model is used for error weighing, the calculated weights are not always guaranteed to be appropriate, and there is no assurance that all errors are weighed.

Ethical Considerations

Experiments presented in this work used datasets from previously published research Tjong Kim Sang and De Meulder (2003); Wang et al. (2019); Liu and Ritter (2023); Socher et al. (2013). These datasets were used to study NER or text classification models, which is consistent with their intended use. Because this study focuses on efficiently detecting annotation errors in the dataset, its potential risks and negative impact on society appear to be minimal. However, as indicated in the Limitation section, its output results may contain bias.

References

  • Bogdanov et al. (2024) Sergei Bogdanov, Alexandre Constantin, Timothée Bernard, Benoit Crabbé, and Etienne Bernard. 2024. NuNER: Entity recognition encoder pre-training via LLM-annotated data. arXiv preprint arXiv:2402.15343.
  • Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Ding et al. (2023) Liangping Ding, Giovanni Colavizza, and Zhixiong Zhang. 2023. Partial annotation learning for biomedical entity recognition. arXiv preprint arXiv:2305.13120.
  • Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • Goel et al. (2023) Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, et al. 2023. LLMs accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), pages 82–100. PMLR.
  • Helgadóttir et al. (2014) Sigrún Helgadóttir, Hrafn Loftsson, and Eiríkur Rögnvaldsson. 2014. Correcting errors in a new gold standard for tagging Icelandic text. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 2944–2948, Reykjavik, Iceland. European Language Resources Association (ELRA).
  • Hiraoka (2022) Tatsuya Hiraoka. 2022. MaxMatch-dropout: Subword regularization for WordPiece. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4864–4872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Huang and Du (2019) Yuyun Huang and Jinhua Du. 2019. Self-attention enhanced CNNs and collaborative curriculum learning for distantly supervised relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 389–398, Hong Kong, China. Association for Computational Linguistics.
  • Inkpen et al. (2017) Diana Inkpen, Ji Liu, Atefeh Farzindar, Farzaneh Kazemi, and Diman Ghazi. 2017. Location detection and disambiguation from Twitter messages. Journal of Intelligent Information Systems, 49(2):237–253.
  • Kudo (2018) Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
  • Liu and Ritter (2023) Shuheng Liu and Alan Ritter. 2023. Do CoNLL-2003 named entity taggers still work well in 2023? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8254–8271, Toronto, Canada. Association for Computational Linguistics.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Maini et al. (2023) Pratyush Maini, Michael C Mozer, Hanie Sedghi, Zachary C Lipton, J Zico Kolter, and Chiyuan Zhang. 2023. Can neural network memorization be localized? In International Conference on Machine Learning.
  • Mamede et al. (2016) Nuno Mamede, Jorge Baptista, and Francisco Dias. 2016. Automated anonymization of text documents. In 2016 IEEE Congress on Evolutionary Computation (CEC), pages 1287–1294.
  • Mayhew et al. (2019) Stephen Mayhew, Snigdha Chaturvedi, Chen-Tse Tsai, and Dan Roth. 2019. Named entity recognition with partially annotated training data. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 645–655, Hong Kong, China. Association for Computational Linguistics.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.
  • Nakagawa and Matsumoto (2002) Tetsuji Nakagawa and Yuji Matsumoto. 2002. Detecting errors in corpora using support vector machines. In COLING 2002: The 19th International Conference on Computational Linguistics.
  • Naraki et al. (2024) Yuji Naraki, Ryosuke Yamaki, Yoshikazu Ikeda, Takafumi Horie, and Hiroki Naganuma. 2024. Augmenting NER datasets with LLMs: Towards automated and refined annotation. arXiv preprint arXiv:2404.01334.
  • Provilkov et al. (2020) Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.
  • Qin et al. (2018) Pengda Qin, Weiran Xu, and William Yang Wang. 2018. Robust distant supervision relation extraction via deep reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2137–2147, Melbourne, Australia. Association for Computational Linguistics.
  • Reiss et al. (2020) Frederick Reiss, Hong Xu, Bryan Cutler, Karthik Muthuraman, and Zachary Eichenberger. 2020. Identifying incorrect labels in the CoNLL-2003 corpus. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 215–226, Online. Association for Computational Linguistics.
  • Schwartz et al. (2019) Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. CoRR, abs/1907.10597.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  • Song et al. (2021) Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, and Denny Zhou. 2021. Fast WordPiece tokenization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2089–2103, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Takase et al. (2022) Sho Takase, Tatsuya Hiraoka, and Naoaki Okazaki. 2022. Single model ensemble for subword regularized models in low-resource machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2536–2541, Dublin, Ireland. Association for Computational Linguistics.
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Wang et al. (2021a) Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2021a. Multi-view subword regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 473–482, Online. Association for Computational Linguistics.
  • Wang et al. (2021b) Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021b. Automated concatenation of embeddings for structured prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2643–2660, Online. Association for Computational Linguistics.
  • Wang et al. (2021c) Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021c. Improving named entity recognition by external context retrieving and cooperative learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1800–1812, Online. Association for Computational Linguistics.
  • Wang et al. (2019) Zihan Wang, Jingbo Shang, Liyuan Liu, Lihao Lu, Jiacheng Liu, and Jiawei Han. 2019. CrossWeigh: Training named entity tagger from imperfect annotations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5154–5163, Hong Kong, China. Association for Computational Linguistics.
  • Yamada et al. (2020) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.
  • Yang et al. (2018) Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. 2018. Distantly supervised NER with partial annotation learning and reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2159–2169, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Zhou and Chen (2021) Wenxuan Zhou and Muhao Chen. 2021. Learning from noisy labels for entity-centric information extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5381–5392, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Appendix A Example For Each Selection Method

| Method | Subwords | Number of subwords |
| --- | --- | --- |
| Default Tokenization | Japan then laid siege to the Syrian penalty area for most of the game but rarely breached the Syrian defence . | 20 |
| Random | Japan then la id s iege to t he Sy rian penalty ar ea for most of the game bu t rarely bre ache d t he Sy rian def ence . | 29 |
| Random | J a pan then l ai d s ieg e to the Syrian penal ty a rea for most of the ga me bu t rarely bre ache d th e Syrian de fen ce . | 33 |
| Random | Japan then laid siege to t he Syrian pen alt y ar ea f o r mo st of the ga me but rarely b rea ched the Sy r ian defence . | 30 |
| Cos-Sim | J ap an t hen l ai d s ie ge to th e S y rian p en alty a rea for most o f the ga me b u t r ar e ly breached t he Sy rian d ef enc e . | 43 |
| Cos-Sim | Ja pan t h e n la id siege t o the Syrian penal ty are a f or mos t of the gam e but ra rely breached t he Sy r ian defence . | 36 |
| Cos-Sim | J ap an th en l aid s ie ge to th e Syrian pen a lt y ar ea for most o f t he game b ut r are ly bre ache d th e Syrian de fen ce . | 42 |
| $K$-means | Ja pan then laid siege to t he Sy r ian penalty are a for most of t he gam e bu t r are ly breached the Syri an def enc e . | 31 |
| $K$-means | J ap an the n laid siege t o the Syrian pen al ty area for most of the game but r ar e ly bre ac hed the Syrian defence . | 36 |
| $K$-means | Japan then laid si e ge to th e Syri an p en al t y area for most o f t h e game but rarely breached the Sy ri an defence . | 34 |

Table 6: Examples of subword sequences produced by each selection method. Subword boundaries are shown as blanks.

In §3.2, we use 3 subword sequence selection methods: Random, Cos-Sim, and K-means. We show specific examples of the subword series obtained by each selection method in Table 6.
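As a rough illustration of how a clustering-based selection can pick diverse candidates, the sketch below clusters toy subword sequences and keeps one per cluster. It is not the implementation used in the experiments: the exact procedure is the one defined in §3.2, while the TF-IDF weighting formula, the plain Lloyd-style clustering, the choice of the member closest to each centroid, and all function names here are assumptions made for illustration. The toy candidates are the endings of several rows of Table 6.

```python
import math
import random
from collections import Counter

def tfidf_vectors(sequences):
    """One TF-IDF vector per subword sequence (each sequence is a document)."""
    vocab = sorted({tok for seq in sequences for tok in seq})
    n_docs = len(sequences)
    doc_freq = Counter(tok for seq in sequences for tok in set(seq))
    idf = {tok: math.log(n_docs / doc_freq[tok]) + 1.0 for tok in vocab}
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_select(sequences, k, iterations=20, seed=0):
    """Cluster candidates into at most k groups; return one sequence per group."""
    rng = random.Random(seed)
    vectors = tfidf_vectors(sequences)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            clusters[nearest].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                columns = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in columns]
    selected = []
    for c, members in enumerate(clusters):
        if members:
            # Representative = member closest to its centroid (assumed choice).
            best = min(members, key=lambda i: squared_distance(vectors[i], centroids[c]))
            selected.append(sequences[best])
    return selected

candidates = [s.split() for s in [
    "Sy rian def ence .", "Syrian de fen ce .", "Sy r ian defence .",
    "Sy rian d ef enc e .", "Syri an def enc e .", "Sy ri an defence .",
]]
for seq in kmeans_select(candidates, k=3):
    print(" ".join(seq))
```

Because each returned sequence comes from a different cluster, the selected candidates tend to differ from one another more than an arbitrary random draw would guarantee.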

Appendix B Detail About Dataset

B.1 NER

We use the training split of CoNLL-2003 Tjong Kim Sang and De Meulder (2003) as the target of the annotation weighing. This split is also used to train the final NER models evaluated on the following test datasets. The training split comprises 946 articles (14,041 sentences and 23,499 named entity labels).

We used the validation and test splits of CoNLL-2003 to evaluate the final trained NER models. The validation split contains 216 articles (3,250 sentences and 5,942 named entity labels) and the test split contains 231 articles (3,453 sentences and 5,648 named entity labels).

In addition to the original CoNLL dataset, we employ a modified version and a recently published version of this dataset for further evaluation. CoNLL-CW Wang et al. (2019) is constructed by manually correcting annotation errors in the CoNLL-2003 test split, which includes 231 articles (3,453 sentences and 5,648 named entity labels). CoNLL-2020 Liu and Ritter (2023) is constructed from articles published in 2020 using the same definitions of NER labels as CoNLL-2003, and consists of 131 articles (1,840 sentences and 4,007 named entity labels).

B.2 Text Classification

We used the training and validation splits of SST-2 Socher et al. (2013). The training split comprises 67,349 sentences. 55.8% of this split are negative labels, and the rest are positive labels. The validation split comprises 872 sentences. 50.9% of this split are negative labels, and the rest are positive labels. The test split is not used because its labels have not been published.

Appendix C Detail About Model Setting

| Model | RoBERTa | RoBERTa | LUKE |
| --- | --- | --- | --- |
| Task | NER | Text Classification | NER |
| Epoch | 5 (20)* | 5 (20)* | 5 |
| Learning rate | 1e-5 | 5e-5 | 1e-5 |
| Batch size | 32 | 32 | 16 |
| Weight decay | 0.01 | 0.01 | 0.01 |
| Params | 354M (LARGE) | 354M (LARGE) | 560M (LARGE) |
| GPU | V100 ×1 | A100 ×1 | A100 ×1 |

Table 7: Model hyperparameters. *For RoBERTa, the scouting model was trained for 5 epochs and the final model was trained for 20 epochs. Other hyperparameters are the same for the scouting and final models.

We employed RoBERTa$_{\mathrm{LARGE}}$ Liu et al. (2019) and LUKE$_{\mathrm{LARGE}}$ Yamada et al. (2020) for the scouting model $M$ and the final model $M'$. The scouting and the final model were trained independently. Each model was trained using the hyperparameters in Table 7.

We used the same architecture for the scouting and final models. In other words, the scouting model with RoBERTa$_{\mathrm{LARGE}}$ is used to weigh the annotations that are then used to train the final model with RoBERTa$_{\mathrm{LARGE}}$. When the final model is based on LUKE$_{\mathrm{LARGE}}$, we used the scouting model with LUKE$_{\mathrm{LARGE}}$.

In NER, for RoBERTa$_{\mathrm{LARGE}}$, inference was performed with a token-level classification model in which the first subword of each word was classified into BIO tags, and the outputs of the second and subsequent subwords were masked and ignored. For LUKE$_{\mathrm{LARGE}}$, inference was performed using the span classification method employed in LUKE Yamada et al. (2020), which classifies whether a span from one word to another is a named entity.
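The first-subword rule for the token-level classifier can be sketched as follows. The function names and the `word_ids` layout (one word index per subword, `None` for special tokens) are assumptions made for illustration and are not taken from the experimental code.

```python
def first_subword_mask(word_ids):
    """Mark the first subword of each word; special tokens carry None."""
    mask, previous = [], None
    for wid in word_ids:
        mask.append(wid is not None and wid != previous)
        previous = wid
    return mask

def word_level_tags(subword_tags, word_ids):
    """Keep only the tag predicted at the first subword of each word."""
    mask = first_subword_mask(word_ids)
    return [tag for tag, keep in zip(subword_tags, mask) if keep]

# "Syrian defence ." segmented as: <s> Sy rian def ence . </s>
word_ids = [None, 0, 0, 1, 1, 2, None]
# Hypothetical per-subword predictions; only first subwords are kept.
subword_tags = ["O", "B-MISC", "I-MISC", "O", "B-LOC", "O", "O"]
print(word_level_tags(subword_tags, word_ids))
```

Because exactly one prediction per word survives, the output length equals the number of words no matter how each word is segmented, so predictions obtained from different tokenization candidates can be compared against the same word-level annotation.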

Appendix D Pseudo-Incorrect Label Method

In the experiments in §5.1, some of the original labels were replaced with pseudo-incorrect labels. This section describes the replacement method. We replaced the labels so that the replacement would not conflict with the IOB2 format and the number of entities would not increase too much from the original data. Specifically, to replace $a\%$ of all labels, we used the following procedure.

  1. Select one label at random.

  2. Select one label at random from B-xxx or O that is different from the selected label, and replace the selected label with it.

  3. Apply the following operations, depending on the original label and the new label.

    (a) If changing from O to B-xxx and the label following the selected label is B-xxx, change it to I-xxx.

    (b) If changing from B-xxx or I-xxx to O and the label following the selected label is I-xxx, change it to B-xxx.

    (c) If changing from B-xxx or I-xxx to B-yyy, change the subsequent I-xxx to I-yyy.

  4. Repeat until the number of changed labels reaches $a\%$ of all labels.
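The steps above can be sketched as follows for a single label sequence. This is an illustrative reading of the procedure, not the script used in the experiments: the function names are hypothetical, and two details are assumptions made here, namely that a position is selected at most once and that labels adjusted in step 3 also count as changed labels.

```python
import random

def is_valid_iob2(labels):
    """An I-xxx tag must directly follow B-xxx or I-xxx of the same type."""
    previous = "O"
    for label in labels:
        if label.startswith("I-") and previous[2:] != label[2:]:
            return False
        previous = label
    return True

def corrupt_labels(labels, entity_types, percent, seed=0):
    """Replace about `percent`% of the labels while keeping IOB2 validity."""
    rng = random.Random(seed)
    original, labels = list(labels), list(labels)
    target = round(len(labels) * percent / 100)
    options = ["O"] + [f"B-{t}" for t in entity_types]
    untouched = set(range(len(labels)))  # assumption: never reselect a position

    def changed():
        return sum(a != b for a, b in zip(original, labels))

    while changed() < target and untouched:
        i = rng.choice(sorted(untouched))                    # step 1
        old = labels[i]
        new = rng.choice([o for o in options if o != old])   # step 2
        labels[i] = new
        untouched.discard(i)
        nxt = i + 1
        if old == "O":                                       # step 3(a)
            if nxt < len(labels) and labels[nxt] == new:
                labels[nxt] = "I-" + new[2:]
                untouched.discard(nxt)
        elif new == "O":                                     # step 3(b)
            if nxt < len(labels) and labels[nxt] == "I-" + old[2:]:
                labels[nxt] = "B-" + old[2:]
                untouched.discard(nxt)
        elif old[2:] != new[2:]:                             # step 3(c)
            while nxt < len(labels) and labels[nxt] == "I-" + old[2:]:
                labels[nxt] = "I-" + new[2:]
                untouched.discard(nxt)
                nxt += 1
    return labels

clean = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O"] * 10
noisy = corrupt_labels(clean, ["PER", "ORG", "LOC", "MISC"], percent=10, seed=1)
print(sum(a != b for a, b in zip(clean, noisy)), is_valid_iob2(noisy))
```

Since step 3 can modify additional labels in the last iteration, the number of changed labels may slightly exceed the target in this sketch.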

Appendix E Effect of Subword Regularization

| $p$ | $N$ | Method | Time | CoNLL test | CoNLL-CW |
| --- | --- | --- | --- | --- | --- |
| 0.2 | 10 | Random | 3:26 | 93.68 | 95.28 |
| 0.2 | 100 | Cos-Sim | 3:45 | 93.42 | 95.15 |
| 0.2 | 100 | $K$-means | 3:58 | 93.52 | 95.20 |
| 0.1 | 10 | Random | 3:26 | 93.51 | 95.24 |
| 0.1 | 500 | Cos-Sim | 4:51 | 93.44 | 95.17 |
| 0.1 | 500 | $K$-means | 5:21 | 93.81 | 95.45 |

Table 8: Difference in the speed and performance against the hyperparameter of subword regularization $p$.

We select a few tokenization candidates from a large number of candidates to use a wide range of different candidates from the original tokenization for the annotation weighing. Instead of using a large number of candidates, we can also obtain various candidates by changing the BPE-Dropout hyperparameter p𝑝pitalic_p. Therefore, we investigated the difference in F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score of the final model against p𝑝pitalic_p in the NER dataset. In this examination, we used p=0.2𝑝0.2p=0.2italic_p = 0.2 as the hyperparameter of BPE-Dropout, which tends to generate more different tokenization candidates than p=0.1𝑝0.1p=0.1italic_p = 0.1. the number of tokenization candidates was limited to N=100𝑁100N=100italic_N = 100 to explore the possibility of more efficient annotation weighing by increasing the p of BPE-Dropout to generate diverse subword sequences. For Random, the number of generated subword sequences N𝑁Nitalic_N always equals the number of subword sequences used for inference K𝐾Kitalic_K. Since the experiment was conducted with K=10𝐾10K=10italic_K = 10, we set to N=10𝑁10N=10italic_N = 10 in the experiment for Random.

The results are shown in Table 8. The larger $p$ improves the performance of Random, while the weighing time does not change. This suggests that Random with $p = 0.1$ did not obtain a sufficient variety of tokenization candidates. For the non-Random selection methods, the weighing time was reduced to almost that of Random. However, the $F_1$ score decreased. The non-Random selection methods use TF-IDF when comparing subword sequences. Therefore, when selecting with a high $p$ and a small $N$, the number of subwords that appear in only a single sequence increases, and the selection is drawn toward such subwords. In addition, subwords that appear in only a single sequence tend to be finely segmented, i.e., almost character-level, so a model trained with deterministic subwords cannot make correct inferences on them. This is likely the reason for the low $F_1$ scores of the non-Random selection methods with $p = 0.2$. These results suggest that using a large $N$ with a small $p$ is appropriate for balancing speed and performance.
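The pull of rare subwords under TF-IDF can be illustrated with a small pure-Python sketch. The smoothed IDF variant $\log(n / \mathrm{df}) + 1$, the function names, and the toy candidates are assumptions made for illustration; the exact TF-IDF formulation used for candidate selection is not specified in this appendix.

```python
import math
from collections import Counter

def tfidf_vectors(sequences):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for subword sequences."""
    n = len(sequences)
    # Document frequency: in how many candidate sequences each subword appears.
    df = Counter(tok for seq in sequences for tok in set(seq))
    idf = {tok: math.log(n / d) + 1.0 for tok, d in df.items()}
    vectors = []
    for seq in sequences:
        tf = Counter(seq)
        vectors.append({tok: (c / len(seq)) * idf[tok] for tok, c in tf.items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy candidates for one word: the last one is close to character level.
candidates = [["lower"], ["low", "er"], ["low", "er"], ["l", "o", "w", "er"]]
vectors, idf = tfidf_vectors(candidates)
```

In this toy pool, the subwords `l`, `o`, and `w` occur in a single candidate and therefore receive the largest IDF, so the near-character-level candidate looks maximally dissimilar from the rest and is favored by a diversity-seeking selection.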

Appendix F Effect of Minimum Weight

| $w_{min}$ | CoNLL test | CoNLL CW |
|---|---|---|
| 1 (Vanilla) | 93.54 | 95.27 |
| 0.7 | 93.58 | 95.23 |
| 0.3 | 93.81 | 95.45 |
| 0 | 93.25 | 95.08 |

Table 9: Difference in the performance against $w_{min}$.

In the experiment shown in §4, the minimum weight $w_{min}$ for SubRegWeigh was set to 1/3. Herein, we investigate the effect on the $F_1$ scores of the CoNLL-2003 test data and CoNLL-CW when $w_{min}$ is changed, using RoBERTa LARGE. We use $w_{min} = 1/3$ and $2/3$ (listed as 0.3 and 0.7 in Table 9), as well as $w_{min} = 0$, where no minimum weight is set. For comparison, Vanilla is recorded as $w_{min} = 1$, which is equivalent to not performing any weight correction, as all data are given a weight of 1.

The results are shown in Table 9. When $w_{min} = 0$, the $F_1$ scores decrease compared to the other settings. This is likely because the weight of some data becomes 0, which reduces the amount of data used for training and leaves the model insufficiently trained. This is similar to the decrease reported for CrossWeigh Wang et al. (2019) when data with low weights were discarded instead of being down-weighted. When $w_{min} = 2/3$, the $F_1$ scores also decrease compared to $w_{min} = 1/3$. This is likely because all weights fall between 2/3 and 1, making the weighted dataset almost identical to the vanilla baseline. This result indicates that $w_{min}$ should be set to a low but non-zero value.
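The three regimes discussed above can be reproduced with a small sketch of a floored weight. Note the assumptions: the weight is taken here to be the fraction of the $K$ tokenizations on which the model's prediction agrees with the annotation, clamped from below by $w_{min}$. The exact weighting formula is defined in the main text and may differ, and both function names are hypothetical.

```python
def annotation_weight(num_matches, k, w_min):
    """Weight for one example: agreement ratio over k tokenizations, floored at w_min.

    Assumption: a clamp of the agreement ratio, used only for illustration.
    """
    ratio = num_matches / k
    return max(w_min, ratio)

def weighted_mean_loss(losses, weights):
    """Average of per-example losses scaled by their annotation weights."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)
```

Under this sketch, $w_{min} = 1$ assigns weight 1 to every example (the vanilla setting), $w_{min} = 0$ lets an example with no agreement drop out of the loss entirely, and $w_{min} = 2/3$ confines all weights to the narrow range between 2/3 and 1.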

Appendix G Additional Datasets

| Method | WNUT2017 ($F_1$) | MRPC (ACC) |
|---|---|---|
| SoTA | $\mathbf{60.45}$††† | - |
| Vanilla | $60.04_{\pm 0.31}$ | $\mathbf{90.43_{\pm 0.15}}$ |
| CrossWeigh | $60.19_{\pm 0.43}$ | $90.35_{\pm 0.23}$ |
| SubRegWeigh (Random) | $60.15_{\pm 0.30}$ | $90.16_{\pm 0.23}$ |
| SubRegWeigh (Cos-Sim) | $60.05_{\pm 0.46}$ | $89.69_{\pm 0.36}$ |
| SubRegWeigh ($K$-means) | $60.29_{\pm 0.41}$ | $86.82_{\pm 0.45}$ |

Table 10: Results on the additional datasets. ††† is the SoTA score from Wang et al. (2021c).

We experimented with two additional datasets, WNUT2017 Derczynski et al. (2017) and MRPC Dolan and Brockett (2005). We use the train and test splits of WNUT2017 and the train and development splits of MRPC. The basic experimental setup is the same as in §4.3 and Appendix C, except that for WNUT2017 we use only RoBERTa and replace the URLs in the text with <URL> tags.

The results are shown in Table 10. On WNUT2017, the proposed method improves over the baselines, as in the CoNLL-2003 experiment. However, on MRPC, the proposed method has lower accuracy than both the vanilla baseline and CrossWeigh. MRPC is a task of classifying whether two sentences have the same meaning, and the same words often appear in both sentences. If subword regularization splits these words into different subwords, the scouting model cannot perform inference successfully. We attribute the accuracy deterioration on MRPC to this effect.