TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

Dimitrios C. Gklezakos    Timothy Misiak    Diamond Bishop
Augmend
{dimi, tim, diamond}@augmend.com
Abstract

From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pin-pointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering, to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines. Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.

1 Introduction

The wide availability of video conferencing platforms, together with the rapid surge in the volume of hosted videos (McGrady et al., 2023), has resulted in the proliferation of self-recorded content in the form of meetings and videos. Often transcribed into text via Automatic Speech Recognition (ASR), this content offers a wealth of information waiting to be extracted.

In this work we focus on segmenting large transcripts originating from automatically transcribed, self-recorded content, into temporally contiguous but semantically distinct segments. The goal of segmentation in this context is two-fold: (a) To display content in an organized manner (i.e. automatic chapter generation) and (b) to break down large transcripts in order to satisfy the size constraints of downstream models, such as the context window limitations of commoditized Large Language Models (LLMs).

In this context, topic segmentation poses many challenges due to: (a) The noisy nature of ASR software, resulting in errors due to poor transcription or out-of-dictionary technical terms, (b) the scarcity of labeled data of a diverse distribution and the difficulty in obtaining it Gruenstein et al. (2008) and (c) the difficulty in pin-pointing the ground-truth number of segments, which can vary between human annotators even for the same transcript.

In this paper we propose TreeSeg, a novel hierarchical topic segmentation approach. TreeSeg combines utterance embeddings with divisive clustering to filter the input and identify segment transition points. Our approach is completely unsupervised, has no learnable parts and utilizes readily available off-the-shelf embedding models. TreeSeg partitions the input in a hierarchical manner and is accurate at multiple levels of segmentation resolution. In the context of automatically generating and displaying video or meeting segmentations, this hierarchical aspect of TreeSeg provides the user with the affordance to dynamically choose the desired number and resolution of the generated chapters/segments.

We evaluate our approach on two standard large meeting corpora: ICSI (Janin et al., 2003) and AMI (Mccowan et al., 2005). We demonstrate that TreeSeg outperforms its competition across the board. We also contribute a small-scale corpus of our own, TinyRec, consisting of 21 self-recorded sessions with technical content, transcribed via ASR and manually annotated. We plan to gradually extend TinyRec over time with more annotated sessions.

1.1 Related Work

Koshorek et al. (2018) adopt a supervised learning approach to topic segmentation of written text by applying an LSTM-based (Hochreiter and Schmidhuber, 1997) model to WIKI-727K, a dataset extracted from Wikipedia. Models based on transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and its variants (Liu et al., 2019) are considered by Lukasik et al. (2020) and Ghosh et al. (2024). Retkowski and Waibel (2024) use annotations obtained from YouTube videos to train a transformer-based segmentation model in a supervised manner.

Bayomi and Lawless (2018) apply agglomerative clustering to extract a hierarchy of segments from text. Hazem et al. (2020) use a bottom-up approach to segment text from medieval manuscripts. Grootendorst (2022) uses Sentence-BERT (Reimers and Gurevych, 2019) embeddings to cluster documents and extract latent topics.

Unsupervised topic segmentation of meetings has generated a lot of interest in recent years. Most recent approaches are essentially modern variants of TextTiling (Hearst, 1997), a technique that relies on similarities between adjacent utterances. TextTiling identifies topic changes by finding local similarity minima. Perhaps the closest work to our own is BertSeg, introduced in Solbiati et al. (2021). BertSeg is a modern version of TextTiling that embeds utterances using a pre-trained model, extracts overlapping blocks of embeddings and aggregates them to compute utterance similarities. HyperSeg, introduced in Park et al. (2023), also follows the TextTiling paradigm, while replacing learned embeddings with hyper-dimensional vectors derived from random word embeddings. CohereSeg (Xing and Carenini, 2021) and $\text{M}^{3}$Seg (Wang et al., 2023) are unsupervised segmentation approaches that fine-tune embeddings using what is, in essence, a contrastive learning technique. CohereSeg focuses on dialogue topic segmentation and is shown in Park et al. (2023) to perform worse than or on par with HyperSeg. At the time of writing this manuscript there is no publicly available code base for $\text{M}^{3}$Seg. Finally, Ghosh et al. (2022) compare various topic segmentation approaches on semi-structured and unstructured conversations and show that pre-training on structured data does not transfer well to unstructured data.

2 Method

Consider the linear topic segmentation setting, where the input is a temporal sequence of transcript entries/utterances $U=[U_{1},...,U_{T}]$. We will henceforth refer to $U$ as the ‘timeline’. The underlying organization of the transcript into topics is modeled as a partition $P=\{P_{k}\}_{k=1}^{K}$ of $U$ into segments. Each such segment covers a temporally contiguous set of utterances:

$$P_{k}=\{U_{t}:t_{s}(k)\leq t\leq t_{e}(k)\}$$

starting with $U_{t_{s}(k)}$ and ending with $U_{t_{e}(k)}$ (endpoints included). Each utterance belongs to exactly one segment. The goal of linear topic segmentation is to approximate this ground-truth partition.

2.1 Hierarchical Topic Segmentation

We extend the linear setting to a hierarchical version by considering nested partitions of the timeline, of increasing resolution, represented by trees.

(a) Flat partition tree: Linear topic segmentation.
(b) Deep partition tree: Hierarchical topic segmentation.
Figure 1: From linear to hierarchical topic segmentation: (a) A partition tree of depth equal to 1 corresponding to linear topic segmentation and (b) a deeper partition tree corresponding to hierarchical topic segmentation. The root node always covers the full timeline. Note that in both cases, the children of a node form a partition of the node’s segment.

A flat partition $P$ of the timeline can be viewed as a tree of depth equal to 1, where the root is the timeline itself and each child is an utterance set $P_{k}$ in the partition (see Figure 1(a)). A nested partition is represented by a deeper tree where each node $v$ corresponds to a temporally contiguous set of utterances $P_{v}$, while its children, denoted by $N(v)$, form a partition of $P_{v}$, such that: $P_{v}=\bigcup_{u\in N(v)}P_{u}$ and $\forall i,j\in N(v)\;:\;i\neq j$, $P_{i}\cap P_{j}=\varnothing$.

Figure 1(b) shows an example of such a nested partition and its corresponding tree. A partition tree has the following properties:

  1. Every sub-tree containing the root is a valid nested partition of the timeline.

  2. The leaves of every sub-tree containing the root form a valid flat partition of the timeline.

The partition tree also induces a natural order in which segments are divided into sub-segments. Consider the sub-tree that contains all nodes of depth at most $\tau\leq D(P)$, where $D(P)$ is the maximum depth of the partition tree, and let $P_{\leq\tau}$ denote its corresponding valid, nested partition. The leaves of this sub-tree form a flat partition of the timeline. As $\tau$ increases, the resolution $|P_{\leq\tau}|$ of this flat partition increases as well.
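As an illustration of how a depth threshold turns a partition tree into a flat partition, the following is a minimal sketch. The `Node` class and `flat_partition` function are hypothetical names introduced here, and the sketch assumes that the root has depth 0 and that the cut keeps all nodes of depth at most $\tau$.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A partition-tree node covering utterances start..end (inclusive)."""
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def flat_partition(root: Node, tau: int) -> List[Tuple[int, int]]:
    """Leaves of the sub-tree that keeps only nodes of depth <= tau."""
    segments: List[Tuple[int, int]] = []

    def visit(node: Node, depth: int) -> None:
        # A node acts as a leaf if it has no children or sits at the cut-off depth.
        if not node.children or depth == tau:
            segments.append((node.start, node.end))
            return
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return segments


# Toy timeline of 10 utterances with a two-level nested partition.
root = Node(1, 10, [
    Node(1, 4, [Node(1, 2), Node(3, 4)]),
    Node(5, 10),
])

print(flat_partition(root, 1))  # [(1, 4), (5, 10)]
print(flat_partition(root, 2))  # [(1, 2), (3, 4), (5, 10)]
```

In this toy tree, raising the threshold from 1 to 2 refines the first segment only, so the number of leaf segments grows from 2 to 3.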

2.2 From Linear to Hierarchical Partitions

Suppose that we have access to a linear topic segmentation model that takes in the desired partition length $K$ and identifies $K$ segments. Suppose also that as $K$ increases, additional segment boundaries are added but not deleted (techniques based on TextTiling (Hearst, 1997), such as BertSeg (Solbiati et al., 2021) and HyperSeg (Park et al., 2023), naturally exhibit this property). Then we can evaluate this model on the hierarchical segmentation task, as follows:

  • Choose a depth threshold $\tau$ and construct the corresponding partition $P_{\leq\tau}$ of maximum depth $\tau$.

  • Query the model with $K=|P_{\leq\tau}|$.

  • Evaluate the result by comparing with the flat partition induced by the leaves of $P_{\leq\tau}$.

  • Repeat for all $1\leq\tau\leq D(P)$.
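The protocol above can be sketched as a short loop. Everything in this sketch is illustrative: `evaluate_hierarchy`, `toy_model` and `miss_rate` are hypothetical names, segmentations are represented as sets of boundary positions (an assumption of this sketch), and the toy metric is only a stand-in for a real segmentation metric.

```python
from typing import Callable, Dict, Set

Boundaries = Set[int]  # positions at which a new segment starts


def evaluate_hierarchy(
    levels: Dict[int, Boundaries],                    # tau -> ground-truth boundaries
    model: Callable[[int], Boundaries],               # K -> predicted boundaries
    metric: Callable[[Boundaries, Boundaries], float],
) -> Dict[int, float]:
    """Query the model once per depth threshold and score each flat partition."""
    scores = {}
    for tau in sorted(levels):
        truth = levels[tau]
        k = len(truth) + 1          # K segments correspond to K - 1 boundaries
        scores[tau] = metric(truth, model(k))
    return scores


# Toy model that ranks boundaries once, so larger K only ever adds boundaries.
ranked = [5, 8, 3]


def toy_model(k: int) -> Boundaries:
    return set(ranked[: k - 1])


def miss_rate(truth: Boundaries, predicted: Boundaries) -> float:
    """Stand-in metric: fraction of ground-truth boundaries not predicted."""
    return len(truth - predicted) / len(truth)


levels = {1: {5}, 2: {3, 5}}
print(evaluate_hierarchy(levels, toy_model, miss_rate))  # {1: 0.0, 2: 0.5}
```

The toy model would recover both ground-truth boundaries if asked for enough segments, but it ranks boundary 8 ahead of boundary 3, so it is penalized at the second level.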

Figure 2: Inaccurate hierarchical segmentation: An example of an accurate linear, but inaccurate hierarchical approximation of the tree in Figure 1(b). Note that the leaves of the output partition match those of the ground-truth partition, however the order in which the nodes are partitioned is not respected and the hierarchical structure of the segments is not properly identified.

Intuitively, a model that accurately captures the hierarchical relations between segments will output intermediate partitions that match those induced by the limited-depth sub-trees of the ground-truth partition. Figure 2 shows an example where an output partition perfectly matches the ground-truth bottom-level partition in the linear topic segmentation setting, but fails to accurately capture its hierarchical structure.

2.3 TreeSeg

TreeSeg first embeds the transcript timeline entries $U$. Then these embeddings are combined with a divisive clustering approach, to identify appropriate splitting points and construct a deep partition tree.

2.3.1 Embedding the Transcript

TreeSeg uses an embedding model to convert transcript entries to embeddings. For the results in this paper we used OpenAI’s text-embedding-ada-002 (ADA) (Neelakantan et al., 2022), an embedding model trained in an unsupervised manner, utilizing contrastive learning techniques. The choice of an off-the-shelf, commoditized embedding model results in a pipeline with no trainable parts. Note that TreeSeg does not depend on this particular choice. Any suitable embedding model, such as that of Liu et al. (2019), can be directly plugged into our approach in the same manner.

A single utterance might not contain enough context to be embedded on its own in a meaningful way, especially in the presence of automatic transcription errors. To address this issue we extract overlapping blocks of utterances. For each utterance $u_{t}$ at position $t$ we extract a block that consists of $u_{t}$ itself and up to $W$ utterances in the immediate past. This block $[u_{t-W},...,u_{t}]$ is passed through the embedding model $f$ to obtain the block embedding $e_{t}=f([u_{t-W},...,u_{t}])$. Note that this is a point of deviation from BertSeg (Solbiati et al., 2021). While BertSeg embeds each utterance separately and aggregates them using max-pooling, TreeSeg embeds the whole block of utterances together, resulting in additional context passed to the embedding model. Repeating for every utterance in the transcript results in a temporal sequence of utterance embeddings that maintain local context. The utterance block width $W$ is the sole hyperparameter of TreeSeg.
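The block extraction step can be sketched as follows. The function names are hypothetical, the embedding model is passed in as an arbitrary callable rather than tied to any particular API, and joining the utterances of a block with a single space is an assumption of this sketch, since the text does not specify how a block is serialized.

```python
from typing import Callable, List, Sequence


def make_blocks(utterances: Sequence[str], width: int) -> List[str]:
    """For each position t, join utterance t with up to `width` preceding ones."""
    blocks = []
    for t in range(len(utterances)):
        start = max(0, t - width)  # fewer than `width` predecessors near the start
        # Assumption: utterances in a block are concatenated with a single space.
        blocks.append(" ".join(utterances[start : t + 1]))
    return blocks


def embed_timeline(
    utterances: Sequence[str],
    width: int,
    embed: Callable[[str], List[float]],
) -> List[List[float]]:
    """Embed every block as a whole, giving one embedding per utterance."""
    return [embed(block) for block in make_blocks(utterances, width)]


print(make_blocks(["a", "b", "c", "d"], width=2))
# ['a', 'a b', 'a b c', 'b c d']
```

Any function mapping a string to a vector can be supplied as `embed`, for example a thin wrapper around a hosted or local embedding model.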

2.3.2 Divisive Clustering

(a) Candidate splitting points: An example utterance embedding timeline with valid candidate splitting points shown in green. Candidates in red are excluded due to minimum size constraints.
(b) First split: The original timeline is divided into two sub-segments.
(c) Second split: A sub-segment is sub-divided again and the partition tree deepens.
Figure 3: Dividing the timeline: (a) At each step, valid candidate splitting points are identified for all leaves. (b) & (c) The optimal splitting point across all leaves is used to divide the corresponding segment into two sub-segments. The process continues until a termination criterion is met.

TreeSeg utilizes a divisive clustering approach to recursively split segments into two sub-segments, constructing a binary partition tree in the process. Below, we provide a high-level outline of this segment division process:

  • For each leaf in the current partition tree, identify the optimal splitting point according to the loss function, while respecting any constraints.

  • Pick the leaf with the best scoring candidate splitting point and split it into two sub-segments.

  • Repeat until the termination condition is met.

We use a one-dimensional clustering objective as our loss function. Consider a timeline of embeddings $E=[e_{1},...,e_{T}]$. A candidate splitting point $i$ partitions this timeline into two segments $[e_{1},...,e_{i-1}]$ and $[e_{i},...,e_{T}]$. In practice we want to avoid ending up with trivial segments, therefore we enforce a minimum viable segment size denoted by $M$ and only consider candidate points in the range $M<i\leq T-M$. The loss function for candidate point $i$ is given by:

$$\mathcal{L}(i)=\sum_{t=1}^{i-1}\|e_{t}-\mu_{\text{L}}\|_{2}^{2}+\sum_{t=i}^{T}\|e_{t}-\mu_{\text{R}}\|_{2}^{2}$$

with $\mu_{\text{L}}=\frac{1}{i-1}\sum_{t=1}^{i-1}e_{t}$ and $\mu_{\text{R}}=\frac{1}{T-i+1}\sum_{t=i}^{T}e_{t}$. The process stops when we reach the desired number of segments $K$ or when all leaf segments are of size $<2M$ (further splitting such a segment would result in at least one sub-segment below the size threshold $M$). The division process of TreeSeg is outlined in Figure 3. Figure 3(a) shows an example timeline together with the corresponding set of candidate splitting points, while Figures 3(b) and 3(c) demonstrate two successive division steps.
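To make the loss concrete, here is a direct and deliberately naive NumPy sketch that evaluates it for every admissible candidate and returns the best one. The names `split_loss` and `best_split` are hypothetical, and the sketch assumes $M\geq 1$ and embeddings stored as rows of a 2-D array.

```python
from typing import Optional

import numpy as np


def split_loss(E: np.ndarray, i: int) -> float:
    """Loss of splitting before the i-th embedding (i is 1-indexed, as in the text)."""
    left, right = E[: i - 1], E[i - 1 :]       # [e_1..e_{i-1}] and [e_i..e_T]
    return float(
        ((left - left.mean(axis=0)) ** 2).sum()
        + ((right - right.mean(axis=0)) ** 2).sum()
    )


def best_split(E: np.ndarray, M: int) -> Optional[int]:
    """Return the candidate i with M < i <= T - M that minimizes the loss."""
    T = len(E)
    candidates = range(M + 1, T - M + 1)
    if len(candidates) == 0:
        return None  # segment too small to be split under the size constraint
    return min(candidates, key=lambda i: split_loss(E, i))


# Toy 1-D "embeddings": three values near 0 followed by three values near 5.
E = np.array([[0.0], [0.1], [0.0], [5.0], [5.1], [4.9]])
print(best_split(E, M=2))  # 4
```

Each loss evaluation here costs time linear in the segment length, so scanning all candidates is quadratic; this version is meant for readability rather than speed.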

Note that agglomerative hierarchical clustering approaches are typically preferred to divisive ones for computational efficiency reasons. We opt for the divisive approach for the following reasons:

  • Divisive clustering naturally matches the hierarchical segmentation task for transcripts. The termination condition is typically met long before the utterance level is reached.

  • Since timelines are one-dimensional, the optimal splitting point can be identified with a single linear pass. Efficient implementations using cumulative sums of embedding vectors allow for fast loss function computation. Optimal splits are computed only once for every node and are maintained in a min-heap data structure.
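The two points above can be illustrated with a compact sketch of the division loop. It relies on the identity $\sum_{t}\|e_{t}-\mu\|_{2}^{2}=\sum_{t}\|e_{t}\|_{2}^{2}-\frac{1}{n}\|\sum_{t}e_{t}\|_{2}^{2}$, which lets the loss of any contiguous segment be read off prefix sums. This is not the authors' implementation: the function name is hypothetical, indices are 0-based with exclusive ends, and, since the text does not spell out how candidates from different leaves are compared, the sketch assumes that the heap is keyed on the reduction in total loss.

```python
import heapq
from typing import List, Tuple

import numpy as np


def treeseg_sketch(E: np.ndarray, K: int, M: int) -> List[Tuple[int, int]]:
    """Greedy divisive segmentation; returns sorted (start, end) leaves, end exclusive."""
    T, d = E.shape
    # Prefix sums of the embeddings and of their squared norms.
    S = np.vstack([np.zeros((1, d)), np.cumsum(E, axis=0)])
    Q = np.concatenate([[0.0], np.cumsum((E ** 2).sum(axis=1))])

    def sse(a: int, b: int) -> float:
        # Sum of squared distances to the mean over E[a:b], via the identity above.
        s = S[b] - S[a]
        return float((Q[b] - Q[a]) - s @ s / (b - a))

    def best_split(a: int, b: int):
        # c is the first index of the right part; this range mirrors M < i <= T - M
        # from the text, with the 1-indexed candidate i = c - a + 1.
        best = None
        for c in range(a + M, b - M):
            loss = sse(a, c) + sse(c, b)
            if best is None or loss < best[0]:
                best = (loss, c)
        return best

    heap: list = []

    def push(a: int, b: int) -> None:
        # The best split of each new leaf is computed once and kept in the heap.
        cand = best_split(a, b)
        if cand is not None:
            loss, c = cand
            gain = sse(a, b) - loss  # assumption: leaves are ranked by loss reduction
            heapq.heappush(heap, (-gain, a, c, b))

    leaves = {(0, T)}
    push(0, T)
    # Stop at K segments or when no leaf has an admissible splitting point left.
    while len(leaves) < K and heap:
        _, a, c, b = heapq.heappop(heap)
        leaves.remove((a, b))
        for seg in ((a, c), (c, b)):
            leaves.add(seg)
            push(*seg)
    return sorted(leaves)


E = np.array([[0.0]] * 4 + [[10.0]] * 4 + [[11.0]] * 4)
print(treeseg_sketch(E, K=2, M=2))  # [(0, 4), (4, 12)]
print(treeseg_sketch(E, K=3, M=2))  # [(0, 4), (4, 8), (8, 12)]
```

Because the procedure is greedy and deterministic, the segmentation for a smaller $K$ is always a coarsening of the one for a larger $K$, which is what makes the output hierarchical.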

Our hypothesis is that the divisive approach utilizes global information to identify strong candidates for topic shifts, with averaging over multiple embedding vectors functioning as a candidate splitting point filter.

3 Evaluation

3.1 Datasets

We evaluate TreeSeg on three datasets:

  • ICSI (Janin et al., 2003) A corpus of 75 transcribed meetings, containing manual hierarchical topic annotations up to 4 levels deep.

  • AMI (Mccowan et al., 2005) Over 100 hours of transcribed meetings, containing manual hierarchical topic annotations up to 3 levels deep.

  • TinyRec We introduce TinyRec, a dataset consisting of transcripts obtained from 21 self-recorded sessions. Each transcript contains spoken utterances as transcribed via ASR and was manually annotated with two-level topic annotations. For more details on the content and annotation guidelines for TinyRec, refer to Appendix A.

We enforce a minimum size of five utterances per segment. Segments with sizes below this threshold are automatically merged into the segment that comes immediately after them. Table 1 shows the topic annotation statistics for each dataset. Almost all transcripts in the ICSI and AMI corpora contain second-level annotations, while several of them contain third- or even fourth-level annotations. All TinyRec transcripts are annotated at two different resolutions: ‘coarse’ and ‘fine’. Table 2 shows the average number of segments at each partition level, after pruning segments below the size threshold.
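The pruning rule can be sketched on a list of segment sizes. The function name is hypothetical, and two details are assumptions of this sketch rather than statements from the text: a segment that is still too short after absorbing its predecessor keeps being merged forward, and a short final segment, which has no successor, is attached to the last kept segment.

```python
from typing import List


def merge_short_segments(sizes: List[int], min_size: int = 5) -> List[int]:
    """Merge every segment shorter than `min_size` into the segment after it."""
    merged: List[int] = []
    carry = 0  # utterances from short segments waiting to be absorbed
    for size in sizes:
        size += carry
        if size < min_size:
            carry = size      # still too short: pass it on to the next segment
        else:
            merged.append(size)
            carry = 0
    if carry:
        # Assumption: a trailing short segment is attached to the last kept one.
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)
    return merged


print(merge_short_segments([10, 3, 8, 2]))  # [10, 13]
```

The total number of utterances is preserved; only the boundaries of undersized segments are removed.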

Dataset   Avg. $|U|$   L1    L2    L3   L4
ICSI      1453.7       75    75    52   3
AMI       636.4        139   125   21   0
TinyRec   267.4        21    21    0    0
Table 1: Annotation statistics: Number of transcripts with available topic annotations per partition level (L1 through L4), together with the average utterance timeline length.

Dataset   L1     L2      L3      L4
ICSI      5.8    19.05   28.08   34.33
AMI       6.81   14.44   26.61   -
TinyRec   4.18   14.12   -       -
Table 2: Segment statistics: Average number of segments per partition level (L1 through L4), after pruning.

3.2 Methodology

We compare TreeSeg against four baselines on the hierarchical topic segmentation task. We adapt BertSeg and HyperSeg to output the top $K$ splitting points on the timeline, as described in Section 2.2. We also compare with two naive baselines: RandomSeg and EquiSeg. RandomSeg generates $K$ random segments by picking $K-1$ segment transition points at random. EquiSeg splits the timeline into equidistant segments.
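The two naive baselines are simple enough to sketch in a few lines. The function names are hypothetical; sampling the transition points uniformly without replacement and rounding the equidistant boundaries down are assumptions of this sketch.

```python
import random
from typing import List, Optional


def random_seg(T: int, K: int, rng: Optional[random.Random] = None) -> List[int]:
    """Pick K - 1 distinct transition points uniformly at random.

    A boundary b means that a new segment starts at 0-indexed utterance b.
    """
    rng = rng or random.Random()
    return sorted(rng.sample(range(1, T), K - 1))


def equi_seg(T: int, K: int) -> List[int]:
    """Place K - 1 boundaries so that segments are (nearly) equally long."""
    return [(j * T) // K for j in range(1, K)]


print(equi_seg(12, 3))  # [4, 8]
print(random_seg(12, 3, random.Random(0)))
```

Passing a seeded `random.Random` makes repeated RandomSeg runs reproducible, which is convenient when averaging over many runs.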

For evaluation we use the standard $P_{k}$ (Beeferman et al., 1999) and WinDiff (Pevzner and Hearst, 2002) metrics. For each level of resolution we query each model with the ground-truth number of segments $K$ and compare the obtained partition with the ground-truth one. We average metrics across all possible partitions, as well as on a per-level basis. We run RandomSeg 100 times and average the results.
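For readers unfamiliar with the two metrics, the following sketch computes both from sets of boundary positions. It follows the usual sliding-window formulation, but details such as how the window size is rounded differ between implementations, so the choices below are assumptions and not necessarily those used for the reported numbers. The function names are hypothetical.

```python
from typing import Set, Tuple


def _count(boundaries: Set[int], lo: int, hi: int) -> int:
    """Number of boundaries b with lo < b <= hi."""
    return sum(1 for b in boundaries if lo < b <= hi)


def pk_and_windiff(ref: Set[int], hyp: Set[int], T: int) -> Tuple[float, float]:
    """Return (Pk, WinDiff) for a timeline of T utterances; lower is better.

    A boundary b separates 0-indexed utterances b - 1 and b.
    """
    # Assumption: window size is half the average reference segment length.
    k = max(1, round(T / (2 * (len(ref) + 1))))
    windows = T - k
    pk_errors = wd_errors = 0
    for i in range(windows):
        r = _count(ref, i, i + k)   # reference boundaries inside the window
        h = _count(hyp, i, i + k)   # hypothesis boundaries inside the window
        # Pk: do the two ends of the window fall in the same segment or not?
        pk_errors += (r == 0) != (h == 0)
        # WinDiff: do the boundary counts inside the window agree?
        wd_errors += r != h
    return pk_errors / windows, wd_errors / windows


print(pk_and_windiff({4, 8}, {4, 8}, T=12))  # (0.0, 0.0)
print(pk_and_windiff({4, 8}, {4, 9}, T=12))  # (0.2, 0.2)
```

In the second call the hypothesis misplaces one boundary by a single utterance, which both metrics penalize only mildly; this tolerance to near misses is the main reason they are preferred over exact boundary matching.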

The adapted version of HyperSeg has no hyperparameters. TreeSeg and BertSeg both use a hyperparameter that regulates the utterance embedding block width. BertSeg uses two additional hyperparameters related to utterance similarity score smoothing. The first five transcripts from each dataset were denoted as the ‘development’ set and were used to determine reasonable values for these hyperparameters.

3.3 Results

                 ICSI            AMI             TinyRec
Method           Pk      Wd      Pk      Wd      Pk      Wd
RandomSeg        0.464   0.503   0.464   0.503   0.465   0.492
EquiSeg          0.482   0.508   0.478   0.506   0.505   0.513
HyperSeg         0.453   0.499   0.48    0.519   0.485   0.515
BertSeg          0.388   0.432   0.443   0.48    0.473   0.486
TreeSeg (ours)   0.31    0.353   0.355   0.396   0.367   0.382
Table 3: Hierarchical topic segmentation results: The performance of all techniques is evaluated on the ICSI, AMI and TinyRec datasets. $P_{k}$ (Pk) and WinDiff (Wd) metrics (lower is better) are aggregated over all segmentation resolution levels. Our approach (TreeSeg) clearly outperforms all baselines.

                 ICSI (Pk)                       AMI (Pk)                TinyRec (Pk)
Method           L1      L2      L3      L4      L1      L2      L3      L1      L2
RandomSeg        0.445   0.472   0.48    0.487   0.455   0.471   0.486   0.44    0.49
EquiSeg          0.446   0.5     0.505   0.492   0.464   0.492   0.492   0.492   0.518
HyperSeg         0.442   0.456   0.465   0.417   0.47    0.491   0.481   0.48    0.49
BertSeg          0.343   0.416   0.41    0.422   0.441   0.442   0.461   0.462   0.484
TreeSeg (ours)   0.28    0.325   0.326   0.392   0.35    0.356   0.38    0.336   0.399
(a) $P_{k}$

                 ICSI (Wd)                       AMI (Wd)                TinyRec (Wd)
Method           L1      L2      L3      L4      L1      L2      L3      L1      L2
RandomSeg        0.474   0.516   0.527   0.534   0.489   0.515   0.532   0.461   0.523
EquiSeg          0.47    0.531   0.53    0.519   0.493   0.52    0.515   0.502   0.524
HyperSeg         0.473   0.508   0.523   0.481   0.504   0.534   0.533   0.506   0.524
BertSeg          0.386   0.462   0.453   0.466   0.477   0.481   0.498   0.479   0.493
TreeSeg (ours)   0.314   0.375   0.372   0.441   0.387   0.403   0.421   0.352   0.413
(b) WinDiff
Table 4: Hierarchical topic segmentation results aggregated per level: The performance of all techniques is evaluated on the ICSI, AMI and TinyRec datasets. (a) $P_{k}$ (Pk) and (b) WinDiff (Wd) metrics are aggregated per segmentation resolution level. Our approach (TreeSeg) maintains strong performance across all segmentation resolutions.

Table 3 shows the $P_{k}$ and WinDiff scores for all approaches, averaged over all topic annotation resolutions as described in Section 2.2. TreeSeg clearly outperforms all baselines on all three datasets. Table 4 shows the $P_{k}$ (Table 4(a)) and WinDiff (Table 4(b)) scores of all approaches aggregated per segmentation resolution level. Note that TreeSeg maintains strong performance across all segmentation resolutions.

Our results demonstrate that TreeSeg adequately captures the hierarchical relations between segments at all levels of the hierarchy. Note also that the performance of BertSeg and HyperSeg degrades on TinyRec, a dataset that is less structured in nature than ICSI or AMI. While small in scale, TinyRec might be more representative of self-recorded content in the wild. TreeSeg maintains strong performance across all three datasets.

4 Conclusion

We introduced TreeSeg, a hierarchical segmentation approach suitable for segmenting large meeting and video transcripts. TreeSeg generates structured segmentations in the form of binary trees, capturing the hierarchical relations between segments. Our approach utilizes off-the-shelf components, contains no learnable parts and has only a single hyperparameter. We provided a rigorous definition and evaluation methodology for the hierarchical topic segmentation task. We compared TreeSeg with two related embedding-based approaches, BertSeg (Solbiati et al., 2021) and HyperSeg (Park et al., 2023), as well as two naive baselines. We introduced TinyRec, a small-scale collection of transcripts obtained from self-recorded sessions via ASR. Evaluating on ICSI, AMI and TinyRec, we demonstrated the superior performance of TreeSeg. Our work constitutes, to our knowledge, the first divisive clustering variant of TextTiling (Hearst, 1997). A promising future direction for research is that of utilizing the structure of the generated partition in downstream tasks such as summarization (Park et al., 2024) or knowledge extraction.

5 Limitations

One limitation of this work is related to the diversity of the datasets in our evaluation. ICSI and AMI are large corpora but are unlikely to capture the full diversity of self-recorded content. Contributing TinyRec is one attempt at mitigating this limitation. Another limitation is the lack of comparison with $\text{M}^{3}$Seg (Wang et al., 2023). At the time of writing this manuscript, there is no publicly available code base for $\text{M}^{3}$Seg. Evaluation metrics on ICSI and AMI vary with the resolution of the segmentation and the data extraction process, making evaluating models on the same exact data a necessity for a fair comparison. Finally, another limitation of this work is our restricted focus on ADA embeddings (Neelakantan et al., 2022). A more thorough comparison of various embedding models might yield interesting insights into the function of divisive clustering as a filter for strong topic shift candidate points.

References

  • Bayomi and Lawless (2018) Mostafa Bayomi and Séamus Lawless. 2018. C-HTS: A concept-based hierarchical text segmentation approach. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • Beeferman et al. (1999) Doug Beeferman, Adam L. Berger, and John D. Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34:177–210.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics.
  • Ghosh et al. (2022) Reshmi Ghosh, Harjeet Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, and Soundararajan Srinivasan. 2022. Topic segmentation in the wild: Towards segmentation of semi-structured & unstructured chats.
  • Ghosh et al. (2024) Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, Hansi Zeng, and Soundararajan Srinivasan. 2024. Topic segmentation of semi-structured and unstructured conversational datasets using language models. In Intelligent Systems and Applications, pages 91–104, Cham. Springer Nature Switzerland.
  • Grootendorst (2022) Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
  • Gruenstein et al. (2008) Alexander Gruenstein, John Niekrasz, and Matthew Purver. 2008. Meeting Structure Annotation, pages 247–274.
  • Hazem et al. (2020) Amir Hazem, Beatrice Daille, Dominique Stutzmann, Christopher Kermorvant, and Louis Chevalier. 2020. Hierarchical text segmentation for medieval manuscripts. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6240–6251, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Hearst (1997) Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Comput. Linguist., 23(1):33–64.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Janin et al. (2003) A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI meeting corpus. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 1, pages I–I.
  • Koshorek et al. (2018) Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. 2018. Text segmentation as a supervised learning task. ArXiv, abs/1803.09337.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Lukasik et al. (2020) Michal Lukasik, Boris Dadachev, Kishore Papineni, and Gonçalo Simões. 2020. Text segmentation by cross segment attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4707–4716, Online. Association for Computational Linguistics.
  • Mccowan et al. (2005) Iain Mccowan, J Carletta, Wessel Kraaij, Simone Ashby, S Bourban, M Flynn, M Guillemot, Thomas Hain, J Kadlec, V Karaiskos, M Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska Masson, Wilfried Post, Dennis Reidsma, and P Wellner. 2005. The AMI meeting corpus. Int’l. Conf. on Methods and Techniques in Behavioral Research.
  • McGrady et al. (2023) Ryan McGrady, Kevin Zheng, Rebecca Curran, Jason Baumgartner, and Ethan Zuckerman. 2023. Dialing for videos: A random sample of YouTube. Journal of Quantitative Description: Digital Media, 3.
  • Neelakantan et al. (2022) Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng. 2022. Text and code embeddings by contrastive pre-training. CoRR, abs/2201.10005.
  • Park et al. (2024) Seongmin Park, Kyungho Kim, Jaejin Seo, and Jihwa Lee. 2024. Unsupervised extractive dialogue summarization in hyperdimensional space. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
  • Park et al. (2023) Seongmin Park, Jinkyu Seo, and Jihwa Lee. 2023. Unsupervised dialogue topic segmentation in hyperdimensional space. In INTERSPEECH 2023. ISCA.
  • Pevzner and Hearst (2002) Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  • Retkowski and Waibel (2024) Fabian Retkowski and Alexander Waibel. 2024. From text segmentation to smart chaptering: A novel benchmark for structuring video transcriptions. ArXiv, abs/2402.17633.
  • Solbiati et al. (2021) Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, and Jacques Calì. 2021. Unsupervised topic segmentation of meetings with BERT embeddings. ArXiv, abs/2106.12978.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
  • Wang et al. (2023) Ke Wang, Xiutian Zhao, Yanghui Li, and Wei Peng. 2023. M3Seg: A maximum-minimum mutual information paradigm for unsupervised topic segmentation in ASR transcripts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7928–7934, Singapore. Association for Computational Linguistics.
  • Xing and Carenini (2021) Linzi Xing and Giuseppe Carenini. 2021. Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 167–177, Singapore and Online. Association for Computational Linguistics.

Appendix A The TinyRec Dataset

TinyRec consists of 21 self-recorded video sessions with screen-sharing that were transcribed using Automatic Speech Recognition (ASR) and manually annotated with topic boundaries at two levels of resolution. The selection criteria were the following:

  • A session needs to be at least 8 minutes long or contain at least 80 transcript entries.

  • The content of the session must be technical in nature and non-trivial.
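The length criterion is mechanical and could be checked automatically. A minimal Python sketch follows. The function name and its parameters are hypothetical and are not part of any released TinyRec tooling. The second criterion (technical, non-trivial content) remains a human judgement that the code does not attempt to model.

```python
MIN_MINUTES = 8    # minimum session duration, from the first criterion
MIN_ENTRIES = 80   # minimum number of transcript entries, from the first criterion


def meets_length_criterion(duration_minutes: float, num_entries: int) -> bool:
    """Return True if a session satisfies the length criterion.

    Hypothetical helper: a session qualifies if EITHER threshold is met.
    """
    return duration_minutes >= MIN_MINUTES or num_entries >= MIN_ENTRIES
```

Because the two thresholds are joined by "or", a short but dense session (for example, 5 minutes with 90 entries) would pass this check.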

The dataset was annotated by four different annotators. The concept of a partition was explained to each annotator. The annotators were asked to first identify a ‘coarse’ partition typically consisting of 2–5 segments for a ten-minute session. Then they were asked to further partition each ‘coarse’ segment, wherever that made sense according to their judgement, to obtain the ‘fine’, two-level partition.

The transition from ‘coarse’ to ‘fine’ segments was explained with the following example scenario:

You are recording your work update for a week during which you worked on three features. Suppose that you identified a coarse segment covering the discussion on one of these features. The fine segmentation would segment this coarse segment again into sub-segments, each discussing parts of the feature implementation or nuances in its design.

The annotators were also shown an example of a segmented transcript.
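The two-level annotation described in this appendix can be pictured as a pair of nested partitions over transcript entries. The Python sketch below is only an illustration under stated assumptions: the class, its field names, and the choice to store each partition as a list of segment start indices are hypothetical and do not describe the actual TinyRec file format. The key property it checks is that every coarse boundary is also a fine boundary, since fine segments are obtained by subdividing coarse ones.

```python
from __future__ import annotations

from dataclasses import dataclass


def _spans(starts: list[int], n: int) -> list[tuple[int, int]]:
    """Turn segment start indices into half-open (start, end) spans."""
    return list(zip(starts, starts[1:] + [n]))


@dataclass
class TwoLevelAnnotation:
    """Hypothetical container for one annotated session."""
    num_entries: int          # number of transcript entries in the session
    coarse_starts: list[int]  # entry index at which each coarse segment starts
    fine_starts: list[int]    # entry index at which each fine segment starts

    def is_valid(self) -> bool:
        for starts in (self.coarse_starts, self.fine_starts):
            # Each partition must begin at entry 0 ...
            if not starts or starts[0] != 0:
                return False
            # ... be strictly increasing ...
            if any(b <= a for a, b in zip(starts, starts[1:])):
                return False
            # ... and stay inside the transcript.
            if starts[-1] >= self.num_entries:
                return False
        # The fine partition must refine the coarse one.
        return set(self.coarse_starts) <= set(self.fine_starts)

    def fine_within_coarse(self) -> list[list[tuple[int, int]]]:
        """Group the fine spans under the coarse span that contains them."""
        fine = _spans(self.fine_starts, self.num_entries)
        coarse = _spans(self.coarse_starts, self.num_entries)
        return [[(s, e) for s, e in fine if cs <= s and e <= ce]
                for cs, ce in coarse]


# Example: 100 entries, 3 coarse segments, two of which are split further.
ann = TwoLevelAnnotation(100, [0, 40, 70], [0, 15, 40, 70, 85])
groups = ann.fine_within_coarse()
```

In this example the middle coarse segment is left undivided, mirroring the instruction that annotators subdivide a coarse segment only where it makes sense to do so.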