Estimating Agreement by Chance
for Sequence Annotation

Diya Li
Freenome Holdings, Inc
[email protected]
Carolyn Rosé
Carnegie Mellon University
[email protected]
Ao Yuan
Georgetown University
[email protected]
Chunxiao Zhou
National Institutes of Health
[email protected]
Abstract

In the field of natural language processing, correction of performance assessment for chance agreement plays a crucial role in evaluating the reliability of annotations. However, there is a notable dearth of research focusing on chance correction for assessing the reliability of sequence annotation tasks, despite their widespread prevalence in the field. To address this gap, this paper introduces a novel model for generating random annotations, which serves as the foundation for estimating chance agreement in sequence annotation tasks. Utilizing the proposed randomization model and a related comparison approach, we derive the analytical form of the distribution, enabling the computation of the probable location of each annotated text segment and subsequent chance agreement estimation. Through a combination of simulation and corpus-based evaluation, we assess its applicability and validate its accuracy and efficacy.

1 Introduction

Reliable annotation is a cornerstone of NLP research, enabling both supervised learning methods and evaluation. Though not frequently employed for evaluation of model performance in the field of NLP, one of the most widely accepted metrics for evaluation of annotation reliability is Cohen’s Kappa, which offers an assessment of inter-rater reliability that is adjusted in order to avoid offering credit for the portion of observed agreement that can be attributed to chance. Some NLP tasks, such as Named Entity Recognition or other span detection/labeling tasks, lack an appropriate chance corrected metric. This paper addresses this gap by proposing such a measure for these tasks, demonstrating its application in both simulation and CoNLL03 corpus experiments.

Numerous studies caution against using non-chance-corrected agreement metrics. They can lead to unfair task or system comparisons because of biases introduced by varying levels of chance agreement across tasks and systems (Ide and Pustejovsky, 2017; Komagata, 2002; Gates and Ahn, 2017; Rand, 1971; Lavelli et al., 2008; Artstein and Poesio, 2008). Furthermore, without correction for chance agreement, measurements tend to cluster within a narrow range, making it difficult to discern differences between approaches (Eugenio and Glass, 2004). Therefore, both estimating and correcting for chance agreement have become critical in annotation evaluation, except in cases where chance agreement is negligible.

The main contributions of our work are summarized as follows:

  • We propose a novel random annotation model that considers the specific characteristics of sequence annotation tasks as well as the annotation tendencies of different annotators. This model can be divided into sub-models, enabling us to separately address cases with or without annotation overlap. We also apply chance agreement to measure task difficulty.

  • Due to the additive nature of many popular similarity measures, we simplify the modeling of dependent annotation segments within a text. We successfully derive analytical probability distributions for random annotations, presenting a streamlined formulation that avoids redundant calculations.

  • We delve into the asymptotic properties of agreement by chance, highlighting scenarios where it can be disregarded.

  • We design and implement both simulation-based and naturalistic experiments, demonstrating that our proposed method is accurate, effective, and computationally efficient.

In the remainder of the paper, we provide a theoretical foundation for our work through a review of past literature. We then explain our methodology, and evaluate it first through a simulation study, and then through application to real-world corpora. Finally, we conclude with discussions of limitations, ethical considerations, and future research.

2 Theoretical Foundation and Motivation

Estimation of chance agreement is a key element in the evaluation of classification tasks. However, though the field of NLP features a wide variety of span detection and labeling tasks, there is a lack of widely adopted chance-corrected metrics for them.

In classification tasks, the Kappa coefficient is one of the most popular chance-corrected inter-annotator agreement measures (Komagata, 2002; Artstein and Poesio, 2008; Eugenio and Glass, 2004; Hripcsak and Rothschild, 2005; Powers, 2015). The Kappa coefficient is defined as $(A_o - A_e)/(1 - A_e)$, where $A_o$ is the observed agreement without chance agreement correction, and $A_e$ is the expected agreement assuming random annotation behavior. To estimate the chance agreement $A_e$, the key problem is how to build a random annotation model with reasonable assumptions.
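As a concrete illustration of the definition above, the following minimal sketch computes Kappa for two annotators on a classification task. It assumes the random annotation model used by Cohen's Kappa, in which each annotator draws labels independently from their own observed label distribution; the function name `cohen_kappa` and the toy label lists are illustrative and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    # A_o: fraction of items on which the two annotators agree.
    a_o = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # A_e: agreement expected if each annotator drew labels independently
    # from their own observed label distribution (assumed random model).
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    a_e = sum(freq_1[c] * freq_2[c] for c in freq_1) / (n * n)
    if a_e == 1.0:
        raise ValueError("kappa is undefined when A_e equals 1")
    return (a_o - a_e) / (1 - a_e)


# Hypothetical token-level labels, for illustration only.
ann_1 = ["LOC", "O", "O", "LOC", "O", "O"]
ann_2 = ["LOC", "O", "LOC", "LOC", "O", "O"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 3))
```

Here the label set and the number of decisions are identical for both annotators, which is exactly the property that span annotation lacks.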

Observed Random Invalid Random
Annotator 1 I visited the NIH campus in MD I visited the NIH campus in MD I visited the NIH campus in MD
Annotator 2 I visited the NIH campus in MD I visited the NIH campus in MD I visited the NIH campus in MD
Table 1: Example of a Toy Named Entity Annotation. Highlighted texts are annotations.

Chance-corrected agreement is unarguably desirable for the evaluation of complex text annotation tasks beyond classification. These tasks encompass sequence annotation tasks (Lampert et al., 2016; Esuli and Sebastiani, 2010; Dai, 2018), which involve a wide array of challenges. The complexity arises from the fact that estimating chance agreement is notably more intricate in comparison to straightforward classification tasks. In classification, the decisions to be made and the available options for each decision are uniform among annotators. However, with span prediction tasks, annotators initially identify the spans requiring labeling and subsequently assign a category to each of these spans. Discrepancies can arise at either of these stages, resulting from variations in span selection or category assignment.

Let’s consider the Named Entity Recognition (NER) task as an illustrative example. It’s important to note that the quantity and size of recognized entities can significantly differ among various annotators working on the same text. In Table 1, we provide an example of a simplified NER task with annotations from two annotators. The text comprises seven tokens, each represented by a single word. The "Observed" column in the table showcases the annotations made by these two annotators. In this toy example, annotator 1 identified and labeled two location entities: "the NIH campus" consisting of 3 tokens, and "MD" with 1 token. Meanwhile, annotator 2 identified a single entity, "the NIH campus in MD", encompassing 5 tokens.

While estimating inter-annotator agreement has become a crucial step in annotation evaluation, the challenge of estimating chance agreement for sequence annotation remains an open problem. As highlighted by numerous prior studies, the sample space for a sequence annotation task is often not well-defined (Cunningham et al., 2014).

For instance, when considering the variability in annotator preferences, some tend to combine adjacent information, while others prefer to label them as distinct spans. Additionally, some annotators choose to encompass surrounding text within a segment, whereas others aim for shorter spans. All of these factors contribute to the complexity of estimating chance agreement in the context of sequence annotation tasks.

There is very little research on estimating chance agreement for span prediction tasks like NER. To the best of our knowledge, the most comprehensive and in-depth attempts so far have been the family of Krippendorff’s Alpha coefficients. Unlike Kappa, the Alpha coefficient is grounded in the concept of disagreement, represented as $1 - D_o/D_e$, where $D_o$ stands for observed disagreement, and $D_e$ denotes expected disagreement.

In 1995, Krippendorff first attempted to extend his Alpha coefficient for classification tasks to sequence labeling tasks (Krippendorff, 1995). The approach involved concatenating all annotations by different annotators for the same text and generating two copies. One copy remained unaltered, while the other underwent all possible cyclic shifts. Krippendorff estimated the expected disagreement by comparing the differences between pairs of segments across these two sets of annotations. However, this shift-based random annotation model lacks a solid theoretical foundation and exhibits sensitivity to the location of relevant segments.

In 2016, Krippendorff introduced another data-driven approach to estimate expected disagreement (Krippendorff et al., 2016). This technique compares the dissimilarities between pairs of segments annotated by different annotators. It heavily relies on a large-scale annotation dataset. Notably, as it combines all annotation data from diverse texts indiscriminately, it cannot differentiate between different chance agreements corresponding to different annotation tasks.

In addition, Mathet proposed the gamma coefficient as a new metric for sequence labeling in 2015. The gamma coefficient paper (Mathet et al., 2015) extensively discusses the various applications and characteristics of sequence labeling tasks. Although the gamma coefficient has many contributions, such as combining an optimization of alignment in the computation of the measure, its estimation of expected chance agreement is in line with Krippendorff’s work and differs fundamentally from our approach.

It is critical to emphasize that neither of Krippendorff’s methods is suitable for sequence annotation tasks, especially within the context of information extraction. When calculating disagreement, the Alpha coefficient accounts for all disagreements between segment pairs, encompassing both relevant and irrelevant segments. In cases where relevant information is sparse, the Alpha coefficient may be disproportionately influenced by disagreements related to irrelevant information, regardless of the consistency of annotations for relevant content. However, in information extraction tasks, our primary concern typically focuses on the consistency of annotations related to portions of text with a high concentration of relevant information. In the experiments section, we will probe further into this issue by exploring the limitations of Alpha coefficients within the context of information extraction.

While the specific problem of estimating chance agreement for span prediction tasks is an open problem, we must acknowledge that some relevant research has been done in connection with classification and clustering problems that informs our work and provides a continuum that our work extends (Hennig et al., 2015; Fränti et al., 2014; Rezaei and Fränti, 2016; van der Hoef and Warrens, 2019; Warrens and van der Hoef, 2019; Meilă, 2007; Vinh et al., 2010). Estimating agreement by chance is relatively simple in classification, because the sample space is fixed and the same for each annotator.

In contrast, clustering problems present a greater challenge and bear closer resemblance to span prediction issues. From a conceptual standpoint, one could draw a parallel between elements within the same span and elements within the same cluster. The most commonly employed randomization model in clustering is the permutation model (Gates and Ahn, 2017), where all potential clusterings, each with a fixed number of clusters and fixed cluster sizes, are randomly generated with equal probability. However, what distinguishes span prediction from clustering is that the permutation model in clustering doesn’t impose any restrictions on the placement of elements within the same cluster. Elements within the same cluster can be positioned anywhere. This assumption isn’t suitable for sequence annotations, where segments typically comprise contiguous elements rather than fragmented ones. In essence, annotators treat each segment as a whole, rather than labeling each token independently.

The variation in sample spaces caused by different labeling tendencies and connectivity constraints within each segment makes this problem quite challenging, especially when annotated segments need to be non-overlapping. Therefore, considering the characteristics of span prediction tasks and different annotation tendencies, we propose a new random annotation model to fulfill these requirements.

Our random annotation model independently models each annotator’s tasks. Specifically, given the observed annotations for each task by each annotator, our random model uniformly randomizes entity positions while preserving the respective number of entities and the length of each entity.

To cater to various application requirements, we have designed two sub-models: the overlapping model and the non-overlapping model. These sub-models can accommodate situations where tasks necessitate non-overlapping spans and situations where no such requirement is specified.

For example, in Table 1, the "Random" column presents a sample of random annotations for each annotator. For annotator 1, the random annotation still consists of two entities: "NIH campus in" with 3 tokens and "visited" with 1 token, both with randomized positions. In contrast, the "Invalid Random" column in Table 1 provides examples of invalid random annotations, as neither the number nor the length of entities matches the observed annotation. It’s important to note that in the random annotation model, the number of entities and the length of each entity are fixed for each annotator for each task, but these may vary between annotators for the same task. This flexibility is a deliberate choice in the random annotation model to account for the distinct annotation tendencies of each annotator, resulting in different chance agreements.

As another motivating observation, we recognize that many similarity measures are additive. In essence, the comparison between the annotations of different annotators involves accumulating comparisons among all segment pairs annotated by different annotators. For example, one of the most popular metrics, the F1 score for binary classification, can be expressed as $2a/(2a+b+c)$, where $a$ represents the number of items labeled as positive by both annotators, and $b$ and $c$ indicate the numbers of items rated as positive by one annotator but negative by the other. It’s important to note that when the number and length of spans are both observed, the value of $2a+b+c$ is a constant. The "positive agreement" rating, denoted as $a$, reflects the cumulative sum of positive agreements for all compared segment pairs.
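The additivity argument can be checked on the toy example of Table 1. The sketch below assumes a token-level F1, spans written as hypothetical `(start, length)` pairs with 1-based starts, and no overlap among the spans of a single annotator; the helper names are illustrative. It computes $a$, $b$, $c$ from token sets, recomputes $a$ as a sum of pairwise segment overlaps, and shows that $2a+b+c$ stays at the total annotated length (4 + 5 = 9) for both the observed and the random annotation of annotator 1.

```python
def span_tokens(spans):
    """Expand (start, length) spans with 1-based starts into a set of token indices."""
    return {t for start, length in spans for t in range(start, start + length)}


def pair_overlap(span_1, span_2):
    """Number of tokens shared by two (start, length) spans."""
    (s1, a1), (s2, a2) = span_1, span_2
    return max(0, min(s1 + a1, s2 + a2) - max(s1, s2))


def abc_counts(spans_1, spans_2):
    """Token-level counts: a (both positive), b (only annotator 1), c (only annotator 2)."""
    tok_1, tok_2 = span_tokens(spans_1), span_tokens(spans_2)
    return len(tok_1 & tok_2), len(tok_1 - tok_2), len(tok_2 - tok_1)


def f1(a, b, c):
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0


# Tokens: I(1) visited(2) the(3) NIH(4) campus(5) in(6) MD(7)
observed_1 = [(3, 3), (7, 1)]   # "the NIH campus", "MD"
observed_2 = [(3, 5)]           # "the NIH campus in MD"
random_1 = [(4, 3), (2, 1)]     # "NIH campus in", "visited"

results = []
for spans_1 in (observed_1, random_1):
    a, b, c = abc_counts(spans_1, observed_2)
    # Additivity: a equals the sum of overlaps over all segment pairs.
    additive_a = sum(pair_overlap(p, q) for p in spans_1 for q in observed_2)
    results.append((a, b, c, 2 * a + b + c, additive_a, f1(a, b, c)))
    print(a, b, c, 2 * a + b + c, additive_a, round(f1(a, b, c), 3))
```

Because the denominator is fixed once the span counts and lengths are fixed, only the expectation of the additive term $a$ has to be estimated under the random model.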

To simplify the modeling of random sequence annotations, we approach each segment individually, even though each labeled segment is still influenced by constraints imposed by other labeled segments within the same text, particularly in situations where segment overlap is not allowed. We have successfully derived the analytical distribution for the location of each individually labeled segment. Additionally, we’ve observed that the probability remains relatively consistent across most segment locations, reducing the need for numerous redundant calculations. Further details will be presented in the next section.

3 Method

In this section, we provide the specification of the random annotation model for sequence annotation, also known as span prediction, and present the calculation, approximation, and asymptotic properties of chance agreement through random annotation.

Taking NER as an example, we begin by introducing random sequence annotation models for both non-overlapping and overlapping scenarios, accompanied by the mathematical definition of chance estimation. Leveraging additive similarity measures, we significantly simplify the estimation of expected chance agreement in Proposition 1, alongside its corresponding analytical formula for the distribution of random annotations in Proposition 2. In Proposition 3, we emphasize that each randomly annotated segment exhibits the same probability for most locations, with the exception of a few at the extreme ends, thus further reducing computational complexity.

Moreover, for lengthy texts with sparse annotation information, the expected chance agreement becomes so negligible that it can be safely disregarded. This assertion is substantiated in Proposition 4. The preceding conclusions primarily pertain to non-overlapping scenarios, and we briefly encapsulate the outcome for the overlapping model in Proposition 5, as its derivation is straightforward. Given space constraints, we present only the primary conclusions and concepts within this section. For detailed proofs, please consult the appendix.

We adopt NER as a representative of complex text sequence annotation tasks to demonstrate how to estimate the chance agreement or performance for sequence annotation evaluation. We are given a text $T=\{t_1 \prec t_2 \prec \ldots \prec t_n\}$ with a sequence of $n$ tokens $t_i$, $i\in\{1,\dots,n\}$, and a pre-defined tag set $C=\{c_1,\ldots,c_m\}$ with $m$ categorical tags. As a typical task in information extraction, named entity recognition aims to locate and classify segments of text $T$ into pre-defined categories $C$, such as recognizing disease, medication, and symptom information from clinical notes.

Mathematically, the annotation task for NER can be formulated as a function $\Phi: T\times C\mapsto\Omega$, where $\Omega$ is the set of all possible annotations. For any $\psi\in\Omega$, $\psi=\{\psi_{1,1},\ldots,\psi_{1,k_1},\ldots,\psi_{m,1},\ldots,\psi_{m,k_m}\}$, where $\psi$ is an annotation of segments for all pre-defined categories and $k_i$ is the number of segments for the $i$-th category. For an annotation segment $\psi_{i,j}=\{st_{i,j},a_{i,j}\}$, $st_{i,j}$ denotes the index of the first token and $a_{i,j}$ denotes the length of the $j$-th segment with the $i$-th category.
To simplify the discussion, in the following we will focus on single-tag text annotation (i.e., $m=1$, $\psi=\{\psi_1,\ldots,\psi_k\}$, $\psi_j=\{st_j,a_j\}$) since it is straightforward to generalize these techniques to multi-tag annotation as shown in the experiments.

To gauge chance agreement, we need a precise definition of random annotation. Adapting the permutation model, which is commonly used for clustering, to sequence annotation tasks is impractical due to the absence of location constraints within clusters. This conflicts with the usual intra-segment connectivity assumption in a text annotation setting. To overcome this, we propose a novel random annotation model. It accommodates annotator and task variation while upholding the coherence of text segments.

Random Sequence Annotation Model. The random annotation model is designed to keep the count and length of annotated segments consistent for each annotator within each task, while allowing variability across different annotators and tasks. It generates all feasible annotation configurations with equal probability. In other words, for a $k$-segment random annotation $\Psi=\{\Psi_1,\ldots,\Psi_k\}$ with each randomly annotated segment $\Psi_i=\{ST_i,a_i\}$, it has equal probabilities for all possible start indices $\{st_1,\ldots,st_k\}$ with fixed lengths $a_1,\ldots,a_k$.

For annotator 1 in Table 1, we have $k=2$, $a_1=3$, $ST_1\in\{1,\ldots,5\}$, and $a_2=1$, $ST_2\in\{1,\ldots,7\}$. The definition of a random annotation segment $\{ST_i,a_i\}$ indicates its connectivity. All tokens in the same segment are consecutive without gaps, and the index of the last token in the $i$-th annotated segment is $ST_i+a_i-1$. In contrast, a random cluster generated by the permutation model for clustering does not require this property. Note that the permutation of different entities is still allowed in our model as long as the tokens within each entity remain contiguous; in other words, each entity is permuted as a whole. As shown in the "Annotator 1" row of Table 1, unlike the observed two entities with 3 and 1 tokens ("the NIH campus" and "MD"), the left and right positions of the annotated entities in our random model with 3 and 1 tokens ("NIH campus in" and "visited") can be swapped, as illustrated in the "Random" column. With regard to different applications, the random annotation model can be further divided into two sub-models, namely, the overlapping model and the non-overlapping model.
The overlapping model allows segments to overlap with each other, so each $ST_i$ can take any value between $1$ and $n-a_i+1$, whereas the non-overlapping model does not allow segments to overlap, i.e., $ST_i\geq ST_j+a_j$ or $ST_j\geq ST_i+a_i$ for any $i\neq j$. Because the overlapping model is much easier to handle, we only focus on the non-overlapping model here.
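To make the two sample spaces concrete, the following sketch enumerates every configuration of start indices by brute force for annotator 1 of Table 1 ($n=7$, lengths 3 and 1). The function names and the `allow_overlap` flag are illustrative assumptions; the enumeration is only practical for short texts and is meant as a check, not as the estimation method itself.

```python
from itertools import product
from math import perm


def is_disjoint(starts, lengths):
    """True if no two segments share a token (1-based start indices)."""
    spans = sorted(zip(starts, lengths))
    return all(s1 + a1 <= s2 for (s1, a1), (s2, _) in zip(spans, spans[1:]))


def enumerate_annotations(n, lengths, allow_overlap=False):
    """Yield every tuple of start indices for segments with the given fixed lengths."""
    # Each ST_i ranges over 1 .. n - a_i + 1 so that the segment fits in the text.
    ranges = [range(1, n - a_i + 2) for a_i in lengths]
    for starts in product(*ranges):
        if allow_overlap or is_disjoint(starts, lengths):
            yield starts


n, lengths = 7, (3, 1)
overlapping = list(enumerate_annotations(n, lengths, allow_overlap=True))
non_overlapping = list(enumerate_annotations(n, lengths))
print(len(overlapping), len(non_overlapping))

# The "Random" column for annotator 1: "NIH campus in" starts at 4, "visited" at 2.
print((4, 2) in non_overlapping)
```

In the overlapping model the count is simply the product of the per-segment ranges (5 × 7 here). In the non-overlapping model each segment can be collapsed to a single slot, so the count is the number of ordered selections of $k$ slots out of $n-a+k$, which `perm(n - a + k, k)` reproduces for this example.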

The problem of estimating chance agreement for annotation evaluation can be described as follows:

Problem Definition. Assume there are two independent random annotations, $\Psi1$ for annotator 1 and $\Psi2$ for annotator 2, on the same text of length $n$. The problem is to estimate the expected similarity $E(Sim(\Psi1,\Psi2))$ based on a random non-overlapping annotation model.
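Before turning to the analytical treatment, the quantity in the problem definition can be approximated by simulation. The sketch below is a Monte Carlo baseline under stated assumptions, not the paper's analytical estimator: it uses token-level F1 as the similarity, and it samples uniformly from the non-overlapping model by collapsing each segment to a single slot and drawing an ordered selection of $k$ slots out of $n-a+k$. All function names are hypothetical.

```python
import random


def sample_annotation(n, lengths, rng):
    """Draw start indices uniformly from all non-overlapping configurations."""
    k, a = len(lengths), sum(lengths)
    if a > n:
        raise ValueError("segments do not fit in the text")
    # Slot of each segment in the compressed text, where a segment occupies one slot.
    slots = rng.sample(range(1, n - a + k + 1), k)
    starts = [0] * k
    shift = 0
    # Expand slots left to right; each earlier segment pushes later ones by a_i - 1.
    for i in sorted(range(k), key=lambda idx: slots[idx]):
        starts[i] = slots[i] + shift
        shift += lengths[i] - 1
    return starts


def token_f1(starts_1, lengths_1, starts_2, lengths_2):
    tok_1 = {t for s, a in zip(starts_1, lengths_1) for t in range(s, s + a)}
    tok_2 = {t for s, a in zip(starts_2, lengths_2) for t in range(s, s + a)}
    return 2 * len(tok_1 & tok_2) / (len(tok_1) + len(tok_2))


def chance_agreement_mc(n, lengths_1, lengths_2, trials=20000, seed=0):
    """Monte Carlo estimate of E(Sim(Psi1, Psi2)) for two independent annotators."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s1 = sample_annotation(n, lengths_1, rng)
        s2 = sample_annotation(n, lengths_2, rng)
        total += token_f1(s1, lengths_1, s2, lengths_2)
    return total / trials


# Table 1: annotator 1 has segments of length 3 and 1, annotator 2 one of length 5.
estimate = chance_agreement_mc(7, [3, 1], [5])
print(round(estimate, 3))
```

On such a short, densely annotated text the chance agreement is large, which illustrates why uncorrected agreement can be misleading. A simulation like this is useful mainly as a cross-check for closed-form results.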

In this paper, we use a right index instead of a right subscript to represent the index of annotators; for example, $k1$ represents the number of segments annotated by annotator 1, and $k2$ the number for annotator 2. We notice that many agreement measures, regardless of being token level or entity level, can be formulated as segment-wise measures, i.e., $Sim(\psi1,\psi2)=f(\phi_{1,1}(\psi1_1,\psi2_1),\ldots,\phi_{k1,k2}(\psi1_{k1},\psi2_{k2}))$, where $\psi1_i=\{st1_i,a1_i\}$ is the $i$-th annotated segment for annotator 1 and $\psi2_j=\{st2_j,a2_j\}$ is the $j$-th one for annotator 2. While it is challenging to estimate the chance agreement for a large number of dependent segments together with the random non-overlapping annotation model, the function $f$ is additive for many popular measures.
This fact allows us to process each segment individually, which greatly simplifies the estimation. We call a segment-wise measure with an additive function $f$ an additive measure.

Proposition 1. For the additive similarity measure, the expected chance agreement is $E(Sim(\Psi1,\Psi2))=f(E(\phi_{1,1}(\Psi1_1,\Psi2_1)),\ldots,E(\phi_{k1,k2}(\Psi1_{k1},\Psi2_{k2})))$.

Note that in the non-overlapping random annotation model, the position of each random annotation segment is dependent on all the other random annotation segments within the same document from the same annotator. Since we assume all possible random annotations are equally likely, the problem of estimating the location distribution for each segment is equivalent to counting the number of all possible configurations when we fix the location of the corresponding segment.

Proposition 2. For the non-overlapping random annotation model, the number of all random annotations with the $i$-th segment fixed at $ST_i=l$ is:

$$
\begin{aligned}
\Pi(ST_i=l) ={} & \pi(l-1,\,0)\,\pi(n-l-a+k,\,k-1) \\
& + \sum_{i_1\neq i}\pi(l-a_{i_1},\,1)\,\pi(n-l-a+a_{i_1}+k-1,\,k-2) \\
& + \sum_{i_1\neq i}\sum_{i_2\neq i}\pi(l-a_{i_1}-a_{i_2}+1,\,2)\,\pi(n-l-a+a_{i_1}+a_{i_2}+k-2,\,k-3) \\
& + \ldots + \pi(l-a+a_i+k-2,\,k-1)\,\pi(n-l-a_i+1,\,0),
\end{aligned}
\tag{1}
$$

where $\pi(n,r)=n!/(n-r)!$ is the number of permutations of $n$ things taken $r$ at a time, $k$ is the number of segments, $a_{i}$ denotes the length of the $i$-th segment and $a=\sum_{i}a_{i}$ is the total length of annotations. Then the corresponding probability is $p(ST_{i}=l)=\Pi(ST_{i}=l)/\pi(n-a+k,k)$, for $1\leq l\leq n-a_{i}+1$. Here we treat each text segment as a different annotation, regardless of length. If we do not need to distinguish among entities of the same length, this formula can also be applied after a simple modification.
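As a minimal sketch of how the count above might be evaluated, the Python below sums the product of permutation counts over every selection of segments placed to the left of segment $i$. It assumes that each multiple sum ranges over unordered selections of the other segments (as the proof sketch in the appendix suggests) and that $\pi(n,r)$ is taken as 0 whenever $n<r$. The function names `n_perm`, `count_configs` and `start_distribution` are illustrative and are not taken from the authors' MATLAB tool.

```python
from itertools import combinations
from math import perm


def n_perm(n, r):
    """pi(n, r) = n!/(n-r)!, taken as 0 when n < r (assumption: no valid placement)."""
    return perm(n, r) if n >= r >= 0 else 0


def count_configs(n, lengths, i, l):
    """Pi(ST_i = l): number of labelled, non-overlapping annotations of a
    length-n sequence in which segment i (0-based index) starts at token l (1-based)."""
    k, a = len(lengths), sum(lengths)
    others = [a_j for j, a_j in enumerate(lengths) if j != i]
    total = 0
    for m in range(k):                       # m segments lie to the left of segment i
        for left in combinations(others, m):
            s = sum(left)                    # tokens covered by the left segments
            total += (n_perm(l - s + m - 1, m)
                      * n_perm(n - l - a + s + k - m, k - m - 1))
    return total


def start_distribution(n, lengths, i):
    """p(ST_i = l) for l = 1, ..., n - a_i + 1."""
    k, a = len(lengths), sum(lengths)
    denom = n_perm(n - a + k, k)
    return [count_configs(n, lengths, i, l) / denom
            for l in range(1, n - lengths[i] + 2)]
```

For instance, `start_distribution(100, [1, 5, 10, 15], 0)` would return the location distribution of a length-1 segment among four segments on a 100-token sequence. The cost grows with the number of subsets of segments, so this direct form is only practical for a small number of segments.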

However, evaluating Equation 1 at every possible random location of each text segment is computationally expensive when the sequence is long. To address this, we observe that $\Pi(ST_{i}=l)$ is the same for most locations when the text length satisfies $n\gg a$.

Proposition 3. $ST_{i}$ is uniformly distributed for $a-a_{i}-k+2\leq st_{i}\leq n-a+k$, i.e., $\Pi(st_{i}=l_{1})=\Pi(st_{i}=l_{2})$ for all $a-a_{i}-k+2\leq l_{1},l_{2}\leq n-a+k$.
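The plateau can be checked independently of the analytical formula by brute force on a toy sequence. The sketch below enumerates every labelled, non-overlapping placement for a small case and tests whether the counts are constant on the stated range. The values $n=12$ and lengths 1, 2, 3 are chosen here only for illustration, and the helper names are hypothetical.

```python
from itertools import product


def brute_force_counts(n, lengths, i):
    """For each start l (1-based) of segment i, count the labelled
    non-overlapping annotations of a length-n sequence."""
    counts = [0] * (n - lengths[i] + 1)
    for starts in product(*(range(1, n - a + 2) for a in lengths)):
        tokens = [t for s, a in zip(starts, lengths) for t in range(s, s + a)]
        if len(tokens) == len(set(tokens)):      # no token is labelled twice
            counts[starts[i] - 1] += 1
    return counts


def plateau_bounds(n, lengths, i):
    """Range of starts on which the proposition states the count is constant."""
    k, a = len(lengths), sum(lengths)
    return a - lengths[i] - k + 2, n - a + k


n, lengths = 12, [1, 2, 3]
for i in range(len(lengths)):
    counts = brute_force_counts(n, lengths, i)
    lo, hi = plateau_bounds(n, lengths, i)
    is_flat = len(set(counts[lo - 1:hi])) == 1
    print(f"segment {i}: plateau [{lo}, {hi}], flat={is_flat}, counts={counts}")
```

Enumeration is exponential in the number of segments, so it serves only as a sanity check; on long sequences the plateau value would be computed once and reused for every interior location.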

We further observe that it is not necessary to estimate chance agreement in all cases. Intuitively, we expect the chance agreement to be small enough to be ignored when annotating sparse information in long texts, and we find that this is indeed the case. In most named entity recognition tasks, for example, the average number of tokens in an annotated sentence is usually larger than 20 (Roth and Yih, 2004).

Proposition 4. When $n\gg a1+a2$, the expected similarity $E(Sim(\Psi 1,\Psi 2))\to 0$, where $a1$ and $a2$ are the total lengths of all annotated segments for annotator 1 and annotator 2.

For the overlapping model, as the probability of the location of each randomly annotated segment is uniform, we can easily derive its probability distribution.

Proposition 5. For the overlapping random annotation model, $p(ST_{i}=l)=1/(n-a_{i}+1)$, for $1\leq l\leq n-a_{i}+1$.
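To illustrate how this uniform distribution feeds into a chance estimate, the sketch below computes the expected number of tokens shared by two randomly placed segments, one from each annotator. Token overlap is used here only as an illustrative pairwise term, and the two starts are assumed to be independent; neither choice is specified in this section.

```python
from fractions import Fraction


def expected_overlap(n, a1, a2):
    """Expected number of shared tokens between a length-a1 and a length-a2
    segment whose starts are independent and uniform on a length-n sequence."""
    total = 0
    for s1 in range(1, n - a1 + 2):
        for s2 in range(1, n - a2 + 2):
            # Tokens shared by [s1, s1+a1) and [s2, s2+a2).
            total += max(0, min(s1 + a1, s2 + a2) - max(s1, s2))
    return Fraction(total, (n - a1 + 1) * (n - a2 + 1))
```

Exact fractions are used so that small cases can be checked by hand; for example, two single-token segments on a three-token sequence coincide in 3 of the 9 equally likely start pairs.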

Annotation Difficulty Evaluation. Another important application of chance agreement is to define the difficulty of an annotation task from the perspective of agreement by chance. Usually, evaluating the difficulty of annotation tasks is highly subjective and there are no good quantitative indicators. We utilize the chance agreement to define the difficulty of annotation tasks as follows:

Definition. The difficulty level of an annotation task can be defined as $1-E(Sim(\Psi,\Psi))$ if there is a gold standard annotation $\Psi$, or as one minus the average expected similarity over all annotator pairs, $1-\sum_{i,j=1}^{v}E(Sim(\Psi 1,\Psi 2))/v^{2}$, where $v$ is the number of annotators.

4 Experiments

To demonstrate the accuracy and effectiveness of our approach, we conducted both simulation and corpus-based experiments. (Footnote 1: All experiments are implemented in MATLAB on a 2017 Mac Pro with a 2.9 GHz Intel Core i7 processor and 16 GB of 2133 MHz LPDDR3 memory. The evaluation tool and datasets will be released as open source after the review period.) We designed the simulation experiments to validate our probability distribution estimation for random sequence annotation. Additionally, by varying the length of text, entity length, and quantity in the simulation experiments, we demonstrated the effectiveness of chance correction, comparing it with Alpha coefficients. Ultimately, we illustrated how our chance estimation impacts the evaluation and ranking of model performance in the corpus experiment. Since the estimation of chance agreement for the overlapping model is considerably simpler than for the non-overlapping model, all experiments in this paper are configured with the non-overlapping constraint.

Specifically, to estimate the probability distribution for random text annotation, we label four segments with lengths of 1, 5, 10, and 15 on a sequence of length 100. Figure 1 shows the probability distributions of the four segments at all possible locations, calculated with the analytical formula in Proposition 2. Each distribution is approximately an inverted trapezoid, with high ends and a flat middle part, which confirms the conclusions of Propositions 2 and 3. (Footnote 2: The calculation time of the whole process is about 0.01 seconds.)

Figure 1: The probability distributions for all possible locations of each random segment in a length=100 sequence annotated with four segments. The lengths of the four segments are 1, 5, 10, 15, from left to right.
             Observed (case A)                         Observed (case B)
Annotator1   0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 0   0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
Annotator2   0 0 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0   0 0 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
Table 2: Sequence Annotation Simulation 1.
Sim1     ObsF1    ChanceF1   CorrF1   ObsD     ExpD     Alpha    ObsμD    ExpμD    μAlpha
CaseA    0.8571   0.5335     0.6938   0.0075   0.0537   0.8602   0.15     0.5313   0.7177
CaseB    0.8571   0.3544     0.7787   0.0033   0.0366   0.9090   0.10     0.4704   0.7874
Table 3: Chance Agreement Estimation for Sequence Annotation Simulation 1.
             Observed (case A)                         Observed (case B)
Annotator1   0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 1 1 0 0   0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0
Annotator2   0 0 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0   0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
Table 4: Sequence Annotation Simulation 2.
Sim2     ObsF1    ChanceF1   CorrF1   ObsD     ExpD     Alpha    ObsμD    ExpμD    μAlpha
CaseA    0.8571   0.5335     0.6938   0.0075   0.0537   0.8602   0.15     0.5313   0.7177
CaseB    0.8571   0.6455     0.5970   0.0125   0.1047   0.8806   0.15     0.5885   0.7451
Table 5: Chance Agreement Estimation for Sequence Annotation Simulation 2.
             Observed (case A)                         Observed (case B)
Annotator1   0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0   0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0
Annotator2   0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0   0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
Table 6: Sequence Annotation Simulation 3.
Sim3     ObsF1    ChanceF1   CorrF1   ObsD     ExpD     Alpha    ObsμD    ExpμD    μAlpha
CaseA    0.8571   0.1830     0.8251   0.0025   0.0388   0.9356   0.05     0.2996   0.8331
CaseB    0.8571   0.6455     0.5970   0.0125   0.1047   0.8806   0.15     0.5885   0.7451
Table 7: Chance Agreement Estimation for Sequence Annotation Simulation 3.
Gold Standard   1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Annotator1      1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Annotator2      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Table 8: Sequence Annotation Simulation 4.
Sim4         ObsF1    ChanceF1   CorrF1   ObsD     ExpD     Alpha    ObsμD    ExpμD    μAlpha
Annotator1   0.6522   0.5013     0.3026   0.1523   0.2154   0.2931   0.3902   0.5222   0.2527
Annotator2   0.6808   0.5437     0.3005   0.0268   0.2881   0.9071   0.3659   0.5365   0.3181
Table 9: Chance Agreement Estimation for Sequence Annotation Simulation 4.

The problem of chance estimation and correction is unique in that, to our knowledge, there is no real benchmark data that can be used to evaluate the performance. Therefore, most classic works in this field use synthetic data to illustrate and evaluate the effect of chance correction, such as Komagata (2002) and Artstein and Poesio (2008). Intuitively, we know that the chance agreement is related to the size of the search space, the number of annotated objects, and the lengths of the annotated objects. We design the corresponding comparison experiments by varying these three factors.

        F1-all                          F1-subset1                      F1-subset2                      Time
Model   Obs      Rank   Cor      Rank   Obs      Rank   Cor      Rank   Obs      Rank   Cor      Rank
A       0.923    3      0.901    3      0.919    2      0.911    2      0.9369   3      0.9035   4      23
B       0.905    7      0.878    7      0.889    7      0.878    7      0.9305   6      0.8938   6      23
C       0.9072   6      0.881    6      0.892    6      0.881    6      0.9320   5      0.8963   5      23
D       0.902    8      0.874    8      0.885    8      0.874    8      0.9261   7      0.8878   7      23
E       0.785    11     0.730    11     0.731    11     0.707    11     0.8537   11     0.7838   11     19
F       0.846    9      0.805    9      0.815    9      0.798    9      0.8929   9      0.8391   9      18
G       0.925    2      0.904    2      0.917    3      0.908    3      0.9414   2      0.9103   2      24
H       0.921    4      0.898    4      0.913    4      0.904    4      0.9368   4      0.9036   3      24
I       0.932    1      0.913    1      0.922    1      0.914    1      0.9500   1      0.9232   1      23
J       0.9073   5      0.882    5      0.903    5      0.894    5      0.9240   8      0.8851   8      22
K       0.802    10     0.752    10     0.759    10     0.737    10     0.8537   10     0.7854   10     16
Table 10: Chance Agreement Estimation for the CoNLL03 Dataset. Obs is the observed F1 as reported for the corresponding real NER model (A-K), and Cor is the corrected F1. Time denotes the running time for chance estimation in seconds.

We design three sets of comparison experiments by varying the length of the text (simulation 1), the number of entities (simulation 2), and the length of entities (simulation 3). In case A of simulation 1, shown in Table 2, we use 1 or 0 to indicate whether each token in the text sequence is labeled. For the same sequence of 20 tokens, annotator 1 labels 3 entities with lengths of 2, 3, and 4, and annotator 2 labels 3 entities with lengths of 3, 4, and 5. The annotations in case B are the same as in case A for both annotators; the only difference is that ten 0s are appended after the 20 tokens, that is, neither annotator labels the extra 10 tokens. As reported in Table 3, because the F1 score only considers the annotated tokens, the observed agreement (F1 score) is the same in both cases. However, since the labeled information in case B is sparser, the chance agreement in case B is smaller, and the corresponding corrected F1 score is larger, which means the agreement is higher. In simulation 2, the text length and the total number of annotated tokens remain the same, but the number of annotated entities changes from 3 in case A to 1 in case B. In simulation 3, the text length and the number of annotated entities remain the same, whereas the number of annotated tokens in case B is tripled. The results in Tables 3, 5, and 7 show that the longer the text, the more entities, or the shorter the entities, the smaller the chance agreement. This is consistent with our intuition.
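The Simulation 1 numbers can be retraced with a short sketch. It computes the token-level F1 from the binary sequences in Table 2 and then applies a kappa-style correction, $(\text{Obs}-\text{Chance})/(1-\text{Chance})$. The correction formula is not restated in this section, so it is an assumption here, although it is consistent with the values in Table 3. The chance values are copied from Table 3 rather than recomputed.

```python
def token_f1(x, y):
    """Token-level F1 between two binary label sequences of equal length."""
    both = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    return 2 * both / (sum(x) + sum(y))


def chance_corrected(observed, chance):
    """Assumed kappa-style correction: (observed - chance) / (1 - chance)."""
    return (observed - chance) / (1 - chance)


# Case A of Simulation 1 (Table 2); case B appends ten unlabelled tokens.
ann1 = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0]
ann2 = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]

obs_a = token_f1(ann1, ann2)
obs_b = token_f1(ann1 + [0] * 10, ann2 + [0] * 10)

# ChanceF1 values taken from Table 3.
corr_a = chance_corrected(obs_a, 0.5335)
corr_b = chance_corrected(obs_b, 0.3544)
print(round(obs_a, 4), round(obs_b, 4), round(corr_a, 4), round(corr_b, 4))
```

The observed F1 is identical in both cases because the padding adds no labelled tokens, so any difference in the corrected score comes entirely from the chance term.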

We also compared our results with two Alpha coefficients, namely Alpha and μAlpha (see Equations 2 and 5a in Krippendorff et al., 2016, for the specific formulas). At first glance, the Alpha coefficients exhibit a trend similar to ours in simulations 1 and 3, consistent with intuition, while their results in simulation 2 contradict intuition. However, the underlying reasons are different. Our results are derived from chance agreement estimates that align with intuition, whereas the results of the Alpha coefficients are influenced by their measurement metrics. The expected disagreement (ExpD and ExpμD), which is the critical estimate, should show an inverse trend to the expected agreement (ChanceF1), because the more agreement there is, the less disagreement there is. However, the actual results are the opposite, primarily because the Alpha coefficients include agreement for irrelevant segments, which does not align with the needs of most information extraction tasks.
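As a worked check of this trend, and assuming only the general form of Krippendorff's coefficients, $\alpha = 1 - D_o/D_e$ with observed disagreement $D_o$ and expected disagreement $D_e$, the Simulation 1 entries of Table 3 give approximately:

$$
\begin{aligned}
\text{case A:}\quad & 1 - 0.0075/0.0537 \approx 0.860, \\
\text{case B:}\quad & 1 - 0.0033/0.0366 \approx 0.910.
\end{aligned}
$$

Going from case A to case B, ExpD falls from 0.0537 to 0.0366 while ChanceF1 also falls from 0.5335 to 0.3544, so the two quantities move in the same direction rather than inversely. Small differences from the tabulated Alpha values come from rounding of the inputs.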

The main purpose of chance correction is to use different baselines for different tasks. In addition, chance correction may also change the ranking of model performance for the same task, although this is not common. As shown in Table 8, the gold standard annotation labels six entities with sizes of 3, 3, 3, 3, 3, and 16. Annotator 1 labels the five 3-token entities correctly but misses the 16-token entity. Annotator 2 labels the 16-token entity correctly but misses the five 3-token entities. Note that the observed F1 score of annotator 1 is lower than that of annotator 2, but after chance correction the order is reversed (see Table 9). Neither of the two Alpha coefficients demonstrated this capability.
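The reversal can be traced by hand from Tables 8 and 9. The sketch below assumes token-level F1, $2\cdot\text{overlap}/(\text{gold tokens}+\text{annotator tokens})$, with 31 gold tokens, and a kappa-style correction $(\text{Obs}-\text{Chance})/(1-\text{Chance})$. The correction formula is not restated in this section, so it is an assumption here, although it is consistent with the tabulated values.

$$
\begin{aligned}
\text{Annotator1:}\quad & F_1 = \frac{2\cdot 15}{31+15} \approx 0.6522, & \frac{0.6522-0.5013}{1-0.5013} &\approx 0.3026,\\
\text{Annotator2:}\quad & F_1 = \frac{2\cdot 16}{31+16} \approx 0.6809, & \frac{0.6808-0.5437}{1-0.5437} &\approx 0.3005.
\end{aligned}
$$

Table 9 reports 0.6808 for the second observed F1, and that tabulated value is used in the correction. The larger chance agreement for the single long entity (0.5437 against 0.5013) is what offsets annotator 2's higher raw score.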

To evaluate our model on real data, we estimated the chance agreement of 11 state-of-the-art NER models (Liu et al., 2021) using the CoNLL03 NER dataset (Sang and De Meulder, 2003). The results are presented in Table 10. The CoNLL03 test set comprises 3,453 sentences, each annotated with four types of entities: persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC).

We employ a micro-average approach to handle multiple sentences and entity types. This involves separately calculating the token-level observed agreement and chance agreement for each sentence and entity type. These token-level observed and chance agreements are then aggregated to compute the overall chance agreement, observed F1 score, and corrected F1 score. Note that validating chance agreement on real data without ground truth is challenging. However, the range of the F1 scores widens noticeably after chance correction.
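One possible reading of this aggregation is sketched below. It assumes that "aggregated" means summing token counts (observed overlap, expected overlap under the random model, and annotation sizes) over all sentence and entity-type pairs before forming the F1 scores, and that the correction is kappa-style. Both are assumptions made for illustration, and the record layout is hypothetical.

```python
def micro_scores(records):
    """records: list of (overlap, a1, a2, expected_overlap) tuples, one per
    sentence and entity type (hypothetical layout).

    Returns (observed F1, chance F1, corrected F1) under micro-averaging.
    """
    overlap = sum(r[0] for r in records)
    size = sum(r[1] + r[2] for r in records)       # tokens labelled by either side
    expected = sum(r[3] for r in records)
    observed_f1 = 2 * overlap / size
    chance_f1 = 2 * expected / size
    # Assumed kappa-style correction.
    corrected_f1 = (observed_f1 - chance_f1) / (1 - chance_f1)
    return observed_f1, chance_f1, corrected_f1


# Two made-up records, only to show the call.
example = [(3, 4, 3, 1.0), (2, 2, 3, 0.5)]
print(micro_scores(example))
```

Summing counts before dividing lets long, densely annotated sentences weigh more than short ones, which is the usual effect of micro-averaging.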

Furthermore, we partition the 3,453 sentences of the CoNLL03 data into two roughly equal subsets based on the chance agreement level of each sentence. Subset1 consists of sentences with a chance agreement level greater than 0.825 (equivalent to a difficulty level less than 0.175), while subset2 includes sentences with a chance agreement level less than or equal to 0.825 (equivalent to a difficulty level greater than or equal to 0.175). The results indicate notable changes in the performance ranking of the 11 NER models across the two subsets; for example, model J ranks 5th on subset1 but 8th on subset2. Additionally, the performance ranking on subset2 also varies slightly before and after chance correction, with models A and H exchanging ranks 3 and 4.
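A minimal sketch of this partition is given below, assuming the per-sentence chance agreement levels have already been estimated and stored in a list. The function names and the input layout are illustrative only.

```python
def difficulty(chance):
    """Difficulty level defined as one minus the chance agreement."""
    return 1.0 - chance


def split_by_chance(chance_by_sentence, threshold=0.825):
    """Return (subset1, subset2) as lists of sentence indices:
    subset1 has chance > threshold, subset2 has chance <= threshold."""
    subset1 = [i for i, c in enumerate(chance_by_sentence) if c > threshold]
    subset2 = [i for i, c in enumerate(chance_by_sentence) if c <= threshold]
    return subset1, subset2
```

A sentence whose chance level is exactly 0.825 falls into subset2, matching the thresholds stated above.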

5 Conclusion and Discussion

In this paper, we propose a novel sequence random annotation model that takes into account the different annotation styles of annotators and the characteristics of sequence annotations. For complex cases where labeled objects are required to be disjoint, we investigate the corresponding distribution characteristic and remove redundant calculations. We also derive an analytical formula to calculate the exact distribution. Our focus in this work is how to establish a general framework and corresponding fast algorithm for calculating similarity by chance in complex text annotations. The framework and method proposed in this paper are applicable to all additive similarity measures. Moreover, our approach can extend to nested spans by iteratively applying the same method layer by layer, ensuring compliance with the nested structure.

6 Limitations

Since chance estimation for sequence annotation is an open problem, there is very limited similar work to provide as a baseline for direct comparison. In addition, chance estimation lacks benchmark data with ground truth, although we have applied it to real data in order to demonstrate its utility. The current analysis of its effectiveness is mainly based on simulated data and whether it is consistent with human intuition. We expect that this work will stimulate more related work and benchmark data creation. The chance estimation in this paper focuses on the comparison between two annotators, and we plan to extend it to team-wise agreement for more than two annotators or systems.

7 Ethics Statement

The use of data on this project strictly adhered to ethical standards required by the National Institutes of Health (NIH).

In addition to upholding ethical principles in conducting this work, we believe this work contributes to professional standards for rigor in the field. In particular, we expect that this paper will facilitate fair comparison of various annotation tasks or systems and reduce random chance agreement caused by different annotation styles and metrics. Chance agreement can also be used as a quantitative aid to measure the difficulty of an annotation task. This provides a new perspective for evaluating different annotation tasks.

8 Acknowledgements

This study was supported by the Social Security Administration-National Institutes of Health Interagency Agreements and by the National Institutes of Health Intramural Research Program.

References

  • Artstein and Poesio (2008) Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
  • Cunningham et al. (2014) Hamish Cunningham et al. 2014. Developing language processing components with GATE version 8.
  • Dai (2018) Xiang Dai. 2018. Recognizing complex entity mentions: A review and future directions. In Proceedings of ACL 2018, Student Research Workshop, pages 37–44.
  • Esuli and Sebastiani (2010) Andrea Esuli and Fabrizio Sebastiani. 2010. Evaluating information extraction. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 100–111. Springer.
  • Eugenio and Glass (2004) Barbara Di Eugenio and Michael Glass. 2004. The kappa statistic: A second look. Computational linguistics, 30(1):95–101.
  • Fränti et al. (2014) Pasi Fränti, Mohammad Rezaei, and Qinpei Zhao. 2014. Centroid index: cluster level similarity measure. Pattern Recognition, 47(9):3034–3045.
  • Gates and Ahn (2017) Alexander J Gates and Yong-Yeol Ahn. 2017. The impact of random models on clustering similarity. The Journal of Machine Learning Research, 18(1):3049–3076.
  • Hennig et al. (2015) Christian Hennig, Marina Meila, Fionn Murtagh, and Roberto Rocci. 2015. Handbook of cluster analysis. CRC Press.
  • Hripcsak and Rothschild (2005) George Hripcsak and Adam S Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American medical informatics association, 12(3):296–298.
  • Ide and Pustejovsky (2017) Nancy Ide and James Pustejovsky. 2017. Handbook of linguistic annotation. Springer.
  • Komagata (2002) Nobo Komagata. 2002. Chance agreement and significance of the kappa statistic. URL: http://www.tcnj.edu/komagata/pub/Kappa.pdf (as of May 2004).
  • Krippendorff (1995) Klaus Krippendorff. 1995. On the reliability of unitizing continuous data. Sociological Methodology, pages 47–76.
  • Krippendorff et al. (2016) Klaus Krippendorff, Yann Mathet, Stéphane Bouvry, and Antoine Widlöcher. 2016. On the reliability of unitizing textual continua: Further developments. Quality & Quantity, 50:2347–2364.
  • Lampert et al. (2016) Thomas A Lampert, André Stumpf, and Pierre Gançarski. 2016. An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Transactions on Image Processing, 25(6):2557–2572.
  • Lavelli et al. (2008) Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne Freitag, Claudio Giuliano, Nicholas Kushmerick, Lorenza Romano, and Neil Ireson. 2008. Evaluation of machine learning-based information extraction algorithms: criticisms and recommendations. Language Resources and Evaluation, 42(4):361–393.
  • Liu et al. (2021) Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaicheng Chang, Junqi Dai, Yixin Liu, Zihuiwen Ye, Zi-Yi Dou, and Graham Neubig. 2021. Explainaboard: An explainable leaderboard for nlp. arXiv preprint arXiv:2104.06387.
  • Mathet et al. (2015) Yann Mathet, Antoine Widlöcher, and Jean-Philippe Métivier. 2015. The unified and holistic method gamma ($\gamma$) for inter-annotator agreement measure and alignment. Computational Linguistics, 41(3):437–479.
  • Meilă (2007) Marina Meilă. 2007. Comparing clusterings—an information based distance. Journal of multivariate analysis, 98(5):873–895.
  • Powers (2015) David MW Powers. 2015. What the f-measure doesn’t measure: Features, flaws, fallacies and fixes. arXiv preprint arXiv:1503.06410.
  • Rand (1971) William M Rand. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850.
  • Rezaei and Fränti (2016) Mohammad Rezaei and Pasi Fränti. 2016. Set matching measures for external cluster validity. IEEE Transactions on Knowledge and Data Engineering, 28(8):2173–2186.
  • Roth and Yih (2004) Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. Technical report, University of Illinois at Urbana-Champaign, Department of Computer Science.
  • Sang and De Meulder (2003) Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  • van der Hoef and Warrens (2019) Hanneke van der Hoef and Matthijs J Warrens. 2019. Understanding information theoretic measures for comparing clusterings. Behaviormetrika, 46(2):353–370.
  • Vinh et al. (2010) Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854.
  • Warrens and van der Hoef (2019) Matthijs J Warrens and Hanneke van der Hoef. 2019. Understanding partition comparison indices based on counting object pairs. arXiv preprint arXiv:1901.01777.

9 Appendix

Proposition 1. For the additive similarity measure, the expected chance agreement is $E(Sim(\Psi 1,\Psi 2))=f(E(\phi_{1,1}(\Psi 1_{1},\Psi 2_{1})),\ldots,E(\phi_{k1,k2}(\Psi 1_{k1},\Psi 2_{k2})))$.

Proof.

Since the function $f$ is additive, the order of the function $f$ and the expectation can be interchanged. We have $E(Sim(\Psi 1,\Psi 2))=E(f(\phi_{1,1}(\Psi 1_{1},\Psi 2_{1}),\ldots,\phi_{k1,k2}(\Psi 1_{k1},\Psi 2_{k2})))=f(E(\phi_{1,1}(\Psi 1_{1},\Psi 2_{1})),\ldots,E(\phi_{k1,k2}(\Psi 1_{k1},\Psi 2_{k2})))$.

Originally, to estimate the expectation of similarity by chance, we need to sum the similarity over a high-dimensional space of all possible random annotations, i.e., $E(Sim(\Psi 1,\Psi 2))=\sum_{\Psi 1_{1}}\ldots\sum_{\Psi 1_{k1}}\sum_{\Psi 2_{1}}\ldots\sum_{\Psi 2_{k2}}f(\cdot)\times p(\Psi 1_{1}=\psi 1_{1},\ldots,\Psi 2_{k2}=\psi 2_{k2})$. Now we can simplify it to multiple low-dimensional summations, such as $E(\phi_{i,j}(\Psi 1_{i},\Psi 2_{j}))$, under the condition of an additive measure.
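The saving can be seen on a toy case. The sketch below compares the high-dimensional route (averaging the summed similarity over every joint pair of random annotations) with the low-dimensional route (one expectation per segment pair, using only the marginal location distributions). It assumes token overlap as an illustrative additive term, independent annotators, and the non-overlapping model. The sizes and names are chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def placements(n, lengths):
    """All labelled non-overlapping start tuples (1-based) on a length-n sequence."""
    keep = []
    for starts in product(*(range(1, n - a + 2) for a in lengths)):
        tokens = [t for s, a in zip(starts, lengths) for t in range(s, s + a)]
        if len(tokens) == len(set(tokens)):
            keep.append(starts)
    return keep


def marginal(places, i):
    """Start distribution of segment i when all placements are equally likely."""
    counts = Counter(p[i] for p in places)
    return {s: Fraction(c, len(places)) for s, c in counts.items()}


def overlap(s1, a1, s2, a2):
    """Tokens shared by segments [s1, s1+a1) and [s2, s2+a2)."""
    return max(0, min(s1 + a1, s2 + a2) - max(s1, s2))


n, len1, len2 = 7, [1, 2], [2, 3]
p1, p2 = placements(n, len1), placements(n, len2)
pairs = [(i, j) for i in range(len(len1)) for j in range(len(len2))]

# High-dimensional route: average over every joint pair of annotations.
joint = Fraction(
    sum(overlap(x[i], len1[i], y[j], len2[j])
        for x in p1 for y in p2 for i, j in pairs),
    len(p1) * len(p2),
)

# Low-dimensional route: one expectation per segment pair, from marginals only.
pairwise = sum(
    m1 * m2 * overlap(s1, len1[i], s2, len2[j])
    for i, j in pairs
    for s1, m1 in marginal(p1, i).items()
    for s2, m2 in marginal(p2, j).items()
)
print(joint, pairwise)
```

The two values agree exactly even though segments of the same annotator are dependent, because additivity only requires the marginal distribution of each segment.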

Note that in the non-overlapping random annotation model, the position of each random annotation segment is dependent on all the other random annotation segments within the same document from the same annotator. Since we assume all possible random annotations are equally likely, the problem of estimating the location distribution for each segment is equivalent to counting the number of all possible configurations when we fix the location of the corresponding segment.

Proposition 2. For the non-overlapping random annotation model, the number of all random annotations with the $i$-th segment fixed at location $l$ is:

$$
\begin{aligned}
\Pi(ST_{i}=l) ={}& \pi(l-1,0)\,\pi(n-l-a+k,\,k-1) \\
&+ \sum_{i_{1}\neq i}\pi(l-a_{i_{1}},1)\,\pi(n-l-a+a_{i_{1}}+k-1,\,k-2) \\
&+ \sum_{i_{1}\neq i}\sum_{i_{2}\neq i}\pi(l-a_{i_{1}}-a_{i_{2}}+1,2)\,\pi(n-l-a+a_{i_{1}}+a_{i_{2}}+k-2,\,k-3) \\
&+ \ldots + \pi(l-a+a_{i}+k-2,\,k-1)\,\pi(n-l-a_{i}+1,0),
\end{aligned}
\tag{2}
$$

where $\pi(n,r)=n!/(n-r)!$ is the number of permutations of $n$ things taken $r$ at a time, $k$ is the number of segments, $a_{i}$ denotes the length of the $i$-th segment and $a=\sum_{i}a_{i}$ is the total length of annotations. Then the corresponding probability is $p(ST_{i}=l)=\Pi(ST_{i}=l)/\pi(n-a+k,k)$, for $1\leq l\leq n-a_{i}+1$. Here we treat each text segment as a different annotation, regardless of whether they have the same length. If we do not need to distinguish among entities of the same length, this formula can also be applied after a simple modification.

Proof sketch. We can divide all possible random annotations with $ST_{i}=l$ into $k$ disjoint sets, indexed by the number $m$ of annotation segments located to the left of the specified $i$-th segment $\psi_{i}$ ($m=0,\ldots,k-1$), with the remaining $k-m-1$ segments on the right side. The cardinality of each set with a given selection of the $m$ left annotation segments (which then determines the segments on the right) is the number of all possible annotations of the $l-1$ tokens on the left times the number for the $n-l-a_{i}+1$ tokens on the right side.

If we fix the order of the $m$ selected random annotation segments $\psi_{i_1},\ldots,\psi_{i_m}$, the random annotation of the left $l-1$ tokens is equivalent to distributing $l-1-\sum_{j=1}^{m}a_{i_j}$ objects into $m+1$ spaces: before the first annotation segment, between adjacent segments, and after the last one. This is a well-studied problem (weak composition of an integer into a fixed number of parts) with $(l-1-\sum_{j=1}^{m}a_{i_j}+m)!/(l-1-\sum_{j=1}^{m}a_{i_j})!/m!$ possible configurations. Since we treat all annotation segments as different ones, there are $m!$ permutations for the left $m$ segments and $(k-m-1)!$
for the right $k-m-1$ ones, and the cardinality of each set is $\pi(l-\sum_{j=1}^{m}a_{i_j}+m-1,m)\times\pi(n-l-a+\sum_{j=1}^{m}a_{i_j}+k-m,k-m-1)$. Based on the above derivation, the number of all possible configurations when we fix the location of a segment can be expressed by Equation 2.
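The counting argument above can be checked numerically on a small text. The following is a minimal sketch, not the authors' implementation: the subset sum is rebuilt from the cardinality expression in the proof sketch (it is not a verbatim copy of Equation 2), $\pi(n,r)$ is assumed to be 0 whenever the segments cannot fit ($n<r$), and the function names are hypothetical.

```python
from itertools import combinations, product
from math import perm


def falling(n, r):
    """pi(n, r) = n!/(n-r)!; taken as 0 when r items cannot be drawn from n."""
    return perm(n, r) if n >= r >= 0 else 0


def count_by_subsets(n, lengths, i, l):
    """Configurations with segment i starting at 1-based token l (subset sum)."""
    k, a = len(lengths), sum(lengths)
    others = [j for j in range(k) if j != i]
    total = 0
    for m in range(k):  # m = number of segments to the left of segment i
        for left in combinations(others, m):
            s = sum(lengths[j] for j in left)  # tokens covered on the left
            total += falling(l - s + m - 1, m) * falling(n - l - a + s + k - m, k - m - 1)
    return total


def count_by_enumeration(n, lengths, i, l):
    """Same count by listing every non-overlapping layout of distinct segments."""
    count = 0
    for starts in product(*(range(1, n - a_j + 2) for a_j in lengths)):
        if starts[i] != l:
            continue
        spans = sorted((s, s + a_j - 1) for s, a_j in zip(starts, lengths))
        if all(end < nxt for (_, end), (nxt, _) in zip(spans, spans[1:])):
            count += 1
    return count


# Toy setting (an assumption for illustration): 7 tokens, segments of length 2, 1, 3.
n, lengths = 7, (2, 1, 3)
for i, a_i in enumerate(lengths):
    for l in range(1, n - a_i + 2):
        assert count_by_subsets(n, lengths, i, l) == count_by_enumeration(n, lengths, i, l)
```

Dividing either count by $\pi(n-a+k,k)$ gives $p(ST_i=l)$. The enumeration is exponential in $k$ and is only meant as a sanity check on short texts.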

However, it is computationally expensive to calculate Equation 2 for all possible random locations of each text segment when the sequence is very long. To solve this issue, we find that $\Pi(ST_i=l)$ is the same for most locations when the text is of length $n\gg a$. Note that the validity of Proposition 3 does not depend on the length of the sentence; the longer the sentence, the more computation Proposition 3 saves. For short sentences, the computational cost is not significant to begin with.

Proposition 3. $ST_i$ is uniformly distributed for $a-a_i-k+2\leq st_i\leq n-a+k$, i.e., $\Pi(st_i=l_1)=\Pi(st_i=l_2)\;\forall\;a-a_i-k+2\leq l_1,l_2\leq n-a+k$.

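Proposition 3 and Proposition 3* are clearly equivalent: equality of $\Pi$ at every pair of consecutive locations in the range is the same as equality at all locations in the range.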

Proposition 3*. $\Pi(st_i=l)=\Pi(st_i=l+1)\;\forall\;a-a_i-k+2\leq l\leq n-a+k-1$.

Proof sketch. We use mathematical induction on the number of segments $k$.

Initial step: when $k=1$, $\Pi(st_1=l)=1$ and $p(st_1=l)=1/(n-a_1+1)$, for $1\leq l\leq n-a_1+1$. So Proposition 3* is true at $k=1$.

Inductive step: assume Proposition 3* holds for $k=r$. When $k=r+1$, we partition all possible configurations with $st_i=l$ into $r+1$ disjoint scenarios: the $r$ scenarios in which a different annotation segment $\psi_j$ ($j\neq i$) is immediately adjacent to $\psi_i$ on the right, i.e., $st_j=l+a_i$, and the remaining scenario in which no annotation segment is immediately adjacent to $\psi_i$ on the right. So $\Pi(st_i=l)=\sum_{j\neq i}\Pi(st_i=l\;\&\;st_j=l+a_i)+\Pi(st_i=l\;\&\;st_j\neq l+a_i,\forall j\neq i)$.

We also partition all possible configurations with $st_i=l+1$ into $r+1$ disjoint scenarios: the $r$ scenarios in which a different annotation segment $\psi_j$ ($j\neq i$) is immediately adjacent to $\psi_i$ on the left, i.e., $st_j=l+1-a_j$, and the remaining scenario in which no annotation segment is immediately adjacent to $\psi_i$ on the left. Similarly, $\Pi(st_i=l+1)=\sum_{j\neq i}\Pi(st_i=l+1\;\&\;st_j=l+1-a_j)+\Pi(st_i=l+1\;\&\;st_j\neq l+1-a_j,\forall j\neq i)$.

There is a bijection between the scenario $st_i=l\;\&\;st_j\neq l+a_i,\forall j\neq i$ and the scenario $st_i=l+1\;\&\;st_j\neq l+1-a_j,\forall j\neq i$: it is the identity mapping everywhere except on the tokens with indices from $l$ to $l+a_i$, where the annotation segment $\psi_i$ and the un-annotated token next to it swap places. Hence $\Pi(st_i=l\;\&\;st_j\neq l+a_i,\forall j\neq i)=\Pi(st_i=l+1\;\&\;st_j\neq l+1-a_j,\forall j\neq i)$.
The pair of scenarios $st_i=l\;\&\;st_j=l+a_i$ and $st_i=l+1\;\&\;st_j=l+1-a_j$ can be converted to the scenarios $st_i^{*}=l\;\&\;a_i^{*}=a_i+a_j$ and $st_i^{*}=l+1-a_j\;\&\;a_i^{*}=a_i+a_j$ by merging $\psi_i$ and $\psi_j$.
Based on the assumption that Proposition 3* holds at $k=r$, their cardinalities are equal, since there are only $r$ segments after the merge and $a-(a_i+a_j)-(k-1)+2\leq l,\,l+1-a_j\leq n-a+(k-1)$. Therefore, $\Pi(st_i=l\;\&\;st_j=l+a_i)=\Pi(st_i=l+1\;\&\;st_j=l+1-a_j)$ and Proposition 3* holds for $k=r+1$.

It is a tight bound since we have to satisfy the conditions $0\leq l-\sum_{j=1}^{m}a_{i_j}+m-1$ and $0\leq n-l-a+\sum_{j=1}^{m}a_{i_j}+k-m$ for all $0\leq m\leq k-1$ and $i_j\neq i$. This is the same as $a-a_i-k+2\leq l\leq n-a+k$.
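The flat region and its bounds can be observed directly on a toy example. The sketch below is an illustrative brute-force check under assumed values ($n=12$, segment lengths 3, 2, 2); the function names are hypothetical and it is not intended for long texts.

```python
from collections import Counter
from itertools import product


def start_counts(n, lengths, i):
    """Brute-force Pi(st_i = l): count non-overlapping layouts by start of segment i."""
    counts = Counter()
    for starts in product(*(range(1, n - a_j + 2) for a_j in lengths)):
        spans = sorted((s, s + a_j - 1) for s, a_j in zip(starts, lengths))
        if all(end < nxt for (_, end), (nxt, _) in zip(spans, spans[1:])):
            counts[starts[i]] += 1
    return counts


def flat_range(n, lengths, i):
    """Bounds of the flat region stated in Proposition 3 (1-based, inclusive)."""
    k, a = len(lengths), sum(lengths)
    return a - lengths[i] - k + 2, n - a + k


n, lengths = 12, (3, 2, 2)
for i in range(len(lengths)):
    counts = start_counts(n, lengths, i)
    lo, hi = flat_range(n, lengths, i)
    # Every location inside [lo, hi] should carry the same count.
    print(i, (lo, hi), sorted({counts[l] for l in range(lo, hi + 1)}))
```

In this example the counts just outside the range differ from the flat value, which is what the tightness remark describes.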

Figure 2: Convert the case of $k=r+1$ to the case of $k=r$ by merging two adjacent text segments $\psi_i$ and $\psi_j$. The blue box represents the segment $\psi_i$, and the red box represents the adjacent segment $\psi_j$.

Proposition 4. The expected similarity $E(Sim(\Psi 1,\Psi 2))\to 0$ when $n\gg a1+a2$, where $a1$ and $a2$ are the total lengths of all annotated segments for annotator 1 and annotator 2.

Proof sketch. From the proof of Proposition 2, we know that the number of all possible random annotations of $k$ segments with total length $a$ for a text with $n$ tokens is $\pi(n-a+k,k)$. Thus, the total number of comparisons between random annotations from annotator 1 and annotator 2 is $\pi(n-a1+k1,k1)\times\pi(n-a2+k2,k2)$ under the independent annotation assumption. The segment-wise agreement $\phi_{i_1,i_2}(\psi 1_{i_1},\psi 2_{i_2})$ is zero if there is no overlap between the $i_1$-th text segment annotated by annotator 1 and the $i_2$-th text segment annotated by annotator 2. The agreement between two annotators is zero if there is no overlap among all $k1+k2$ annotated text segments. This situation is equivalent to combining the annotation results of the two annotators and requiring no overlap among all $k1+k2$ text segments in the same text. The total number of such possible annotations is $\pi(n-a1-a2+k1+k2,k1+k2)$.
Therefore, the probability of zero chance agreement is $p(Sim(\Psi 1,\Psi 2)=0)=\pi(n-a1-a2+k1+k2,k1+k2)/\pi(n-a1+k1,k1)/\pi(n-a2+k2,k2)=\frac{(n-a1-a2+k1+k2)\times\cdots\times(n-a1-a2+1)}{(n-a1+k1)\times\cdots\times(n-a1+1)\times(n-a2+k2)\times\cdots\times(n-a2+1)}\to 1$, because both the numerator and the denominator are polynomials in $n$ of degree $k1+k2$ with leading coefficient 1, and $n\gg a1+a2\geq k1+k2$. Thus, we have $E(Sim(\Psi 1,\Psi 2))\to 0$ when $n\gg a1+a2$.
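To see how quickly this ratio approaches 1, it can be evaluated for increasing $n$. The following is a small sketch of the expression in the proof above; the function name and the example values ($a1=10$, $k1=3$, $a2=12$, $k2=4$) are assumptions chosen for illustration only.

```python
from math import perm


def prob_zero_overlap(n, a1, k1, a2, k2):
    """p(Sim = 0) for two independent random annotations of an n-token text."""
    if n < a1 + a2:
        return 0.0  # the k1 + k2 segments cannot all fit without overlap
    disjoint = perm(n - a1 - a2 + k1 + k2, k1 + k2)
    all_pairs = perm(n - a1 + k1, k1) * perm(n - a2 + k2, k2)
    return disjoint / all_pairs


# The probability of no overlap grows toward 1 as the text gets longer.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(prob_zero_overlap(n, a1=10, k1=3, a2=12, k2=4), 5))
```

Exact integer arithmetic is used for the two counts, and only the final ratio is converted to a float.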

Proposition 5. For the overlapping random annotation model, $p(ST_i=l)=1/(n-a_i+1)$, for $1\leq l\leq n-a_i+1$.

Proof sketch. This conclusion is straightforward because, without the non-overlapping constraint, a random text segment annotation with length $a_i$ can be placed at any feasible location with equal probability.

Computational complexity for random text annotation. The computational cost of calculating the probability distribution of the location of $k$ random annotated text segments is bounded by $((k-1)\times a-k^{2}+2k)\times 2^{k}\times(k-1)$ multiplications and $((k-1)\times a-k^{2}+2k)\times(2^{k}-1)$ additions.

To calculate the probability distributions for random text annotation, according to Proposition 2 and Proposition 3, we only need to calculate the probability of $a-a_i-k+2$ possible positions for each random annotated text segment with formula 1. The analytical formula is a summation of $2^{k}$ terms, and each term is equivalent to $k-1$ multiplications, so the computational complexity is bounded by $\sum_{i=1}^{k}(a-a_i-k+2)\times 2^{k}\times(k-1)=((k-1)\times a-k^{2}+2k)\times 2^{k}\times(k-1)$ multiplications and $\sum_{i=1}^{k}(a-a_i-k+2)\times(2^{k}-1)=((k-1)\times a-k^{2}+2k)\times(2^{k}-1)$ additions.
Since formula 1 is a subset convolution, it may be possible to speed up this calculation with the fast subset convolution algorithm.
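The bound is easy to tabulate for a given set of segment lengths. The sketch below simply evaluates the two expressions above; the function name and the example lengths are hypothetical, and the numbers are upper bounds on operation counts rather than measured running times.

```python
def operation_bounds(lengths):
    """Upper bounds on (multiplications, additions) from the complexity analysis."""
    k, a = len(lengths), sum(lengths)
    # Number of positions to evaluate, summed over segments:
    # sum_i (a - a_i - k + 2) = (k - 1) * a - k**2 + 2 * k
    positions = sum(a - a_i - k + 2 for a_i in lengths)
    assert positions == (k - 1) * a - k**2 + 2 * k
    return positions * 2**k * (k - 1), positions * (2**k - 1)


# The 2**k factor dominates as the number of segments grows.
for k in (3, 6, 12):
    print(k, operation_bounds((2,) * k))
```

For segments of length 3, 2, and 2, for example, the position count is 11, so the bound is $11\times 8\times 2$ multiplications and $11\times 7$ additions.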

According to the above computational complexity analysis, the probability distribution of the location of each random annotated segment can be calculated efficiently using formula 1 when the number of text segments $k$ is small. As $k$ increases, however, the computational cost grows rapidly. Fortunately, when the text sequence is long enough and the annotated information is sparse, we can use the uniform distribution to approximate the distribution.

Uniform approximation. The probability distribution of the location of a random annotated text segment can be approximated by the uniform distribution with $p(st_i=l)=1/(n-a_i+1)$, for $1\leq l\leq n-a_i+1$, if $(n-a+k)/(n-a_i+1)>\alpha$, where $\alpha$ is a preset threshold slightly less than 1, for example $\alpha=0.99$.

We observe that the probability distribution of the location of a random annotated text segment is approximately an inverted trapezoid, with the highest probabilities at both ends, and that the majority of the distribution is flat when $n\gg a$. It is straightforward to calculate $p(st_i=1)=\pi(n-a+k-1,k-1)/\pi(n-a+k,k)=1/(n-a+k)$. So the distribution can be approximated by the uniform distribution if the highest probability $1/(n-a+k)$ is close to the uniform probability $1/(n-a_i+1)$, i.e., if $(n-a+k)/(n-a_i+1)$ is close to 1, which is the case when $n\gg a$.
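The criterion amounts to a one-line test per segment. A minimal sketch, with hypothetical function names and the example threshold $\alpha=0.99$ mentioned above as the default:

```python
def can_use_uniform(n, lengths, i, alpha=0.99):
    """True if the ratio (n - a + k) / (n - a_i + 1) exceeds the threshold alpha."""
    k, a = len(lengths), sum(lengths)
    return (n - a + k) / (n - lengths[i] + 1) > alpha


def uniform_start_probs(n, a_i):
    """Uniform approximation: p(st_i = l) = 1 / (n - a_i + 1) for 1 <= l <= n - a_i + 1."""
    p = 1.0 / (n - a_i + 1)
    return {l: p for l in range(1, n - a_i + 2)}


lengths = (3, 2, 2)
print(can_use_uniform(12, lengths, 0))     # short text: ratio is 8 / 10
print(can_use_uniform(1000, lengths, 0))   # long, sparsely annotated text: ratio is 996 / 998
```

When the test fails for a segment, the exact distribution would be computed instead of the approximation.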

CoNLL03 NER dataset and system outputs. To evaluate our model on real data, we estimate the chance agreement of 11 state-of-the-art NER models on the CoNLL03 NER dataset; the results are shown in Table 10. CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The corpus consists of Reuters news stories published between August 1996 and August 1997. There are four types of annotated entities: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). We downloaded 15 system outputs for the English test set from the ExplainaBoard website after approval. Since 4 system outputs use a different sentence segmentation, we limit our comparison to the 11 system outputs that use the same sentence segmentation. The test set consists of 231 articles that include 3453 sentences.