TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

Dimitrios C. Gklezakos Timothy Misiak Diamond Bishop
Augmend
{dimi, tim, diamond}@augmend.com

Abstract

From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pin-pointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering, to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines. Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.

1 Introduction

The wide availability of video conferencing platforms, together with the rapid surge in the volume of hosted videos (McGrady et al., 2023), has resulted in the proliferation of self-recorded content in the form of meetings and videos. Often transcribed into text via Automatic Speech Recognition (ASR), this content offers a wealth of information waiting to be extracted.

In this work we focus on segmenting large transcripts originating from automatically transcribed, self-recorded content, into temporally contiguous but semantically distinct segments. The goal of segmentation in this context is two-fold: (a) To display content in an organized manner (i.e. automatic chapter generation) and (b) to break down large transcripts in order to satisfy the size constraints of downstream models, such as the context window limitations of commoditized Large Language Models (LLMs).

In this context, topic segmentation poses many challenges due to: (a) the noisy nature of ASR software, resulting in errors due to poor transcription or out-of-dictionary technical terms, (b) the scarcity of diverse labeled data and the difficulty in obtaining it (Gruenstein et al., 2008) and (c) the difficulty in pinpointing the ground-truth number of segments, which can vary between human annotators even for the same transcript.

In this paper we propose TreeSeg, a novel hierarchical topic segmentation approach. TreeSeg combines utterance embeddings with divisive clustering to filter the input and identify segment transition points. Our approach is completely unsupervised, has no learnable parts and utilizes readily available off-the-shelf embedding models. TreeSeg partitions the input in a hierarchical manner and is accurate at multiple levels of segmentation resolution. In the context of automatically generating and displaying video or meeting segmentations, this hierarchical aspect of TreeSeg provides the user with the affordance to dynamically choose the desired number and resolution of the generated chapters/segments.

We evaluate our approach on two standard large meeting corpora: ICSI (Janin et al., 2003) and AMI (Mccowan et al., 2005). We demonstrate that TreeSeg outperforms its competition across the board. We also contribute a small-scale corpus of our own, TinyRec, consisting of $21$ self-recorded sessions with technical content, transcribed via ASR and manually annotated. We plan to gradually extend TinyRec over time with more annotated sessions.

1.1 Related Work

Koshorek et al. (2018) adopt a supervised learning approach to topic segmentation of written text by applying an LSTM-based (Hochreiter and Schmidhuber, 1997) model to WIKI-727K, a dataset extracted from Wikipedia. Models based on transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and its variants (Liu et al., 2019) are considered by Lukasik et al. (2020) and Ghosh et al. (2024). Retkowski and Waibel (2024) use annotations obtained from Youtube videos to train a transformer-based segmentation model in a supervised manner.

Bayomi and Lawless (2018) apply agglomerative clustering to extract a hierarchy of segments from text. Hazem et al. (2020) use a bottom-up approach to segment text from medieval manuscripts. Grootendorst (2022) uses Sentence-BERT (Reimers and Gurevych, 2019) embeddings to cluster documents and extract latent topics.

Unsupervised topic segmentation of meetings has generated a lot of interest in recent years. Most recent approaches are essentially modern variants of TextTiling (Hearst, 1997), a technique that relies on similarities between adjacent utterances. TextTiling identifies topic changes by finding local similarity minima. Perhaps the closest work to our own is BertSeg, introduced in Solbiati et al. (2021). BertSeg is a modern version of TextTiling that embeds utterances using a pre-trained model, extracts overlapping blocks of embeddings and aggregates them to compute utterance similarities. HyperSeg, introduced in Park et al. (2023), also follows the TextTiling paradigm, while replacing learned embeddings with hyper-dimensional vectors derived from random word embeddings. CohereSeg (Xing and Carenini, 2021) and $\text{M}^{3}$Seg (Wang et al., 2023) are unsupervised segmentation approaches that fine-tune embeddings using what is, in essence, a contrastive learning technique. CohereSeg focuses on dialogue topic segmentation and is shown in Park et al. (2023) to perform worse than or on par with HyperSeg. At the time of writing this manuscript there is no publicly available code base for $\text{M}^{3}$Seg. Finally, Ghosh et al. (2022) compare various topic segmentation approaches on semi-structured and unstructured conversations and show that pre-training on structured data does not transfer well to unstructured data.

2 Method

Consider the linear topic segmentation setting, where the input is a temporal sequence of transcript entries/utterances $U=[U_{1},...,U_{T}]$. We will henceforth refer to $U$ as the ‘timeline’. The underlying organization of the transcript into topics is modeled as a partition $P=\{P_{k}\}_{k=1}^{K}$ of $U$ into segments. Each such segment covers a temporally contiguous set of utterances:

$$P_{k}=\{U_{t}:t_{s}(k)\leq t\leq t_{e}(k)\}$$

starting with $U_{t_{s}(k)}$ and ending with $U_{t_{e}(k)}$ (endpoints included). Each utterance belongs to exactly one segment. The goal of linear topic segmentation is to approximate this ground-truth partition.

2.1 Hierarchical Topic Segmentation

We extend the linear setting to a hierarchical version by considering nested partitions of the timeline, of increasing resolution, represented by trees.

Figure 1(a): Flat partition tree (linear topic segmentation).

A flat partition $P$ of the timeline can be viewed as a tree of depth equal to $1$, where the root is the timeline itself and each child is an utterance set $P_{k}$ in the partition (see Figure 1(a)). A nested partition is represented by a deeper tree where each node $v$ corresponds to a temporally contiguous set of utterances $P_{v}$, while its children, denoted by $N(v)$, form a partition of $P_{v}$, such that: $P_{v}=\bigcup_{u\in N(v)}P_{u}$ and $\forall i,j\in N(v)\;:\;i\neq j$, $P_{i}\cap P_{j}=\varnothing$.

Figure 1(b) shows an example of such a nested partition and its corresponding tree. A partition tree has the following properties:

1. Every sub-tree containing the root is a valid nested partition of the timeline.
2. The leaves of every sub-tree containing the root form a valid flat partition of the timeline.

The partition tree also induces a natural order in which segments are divided into sub-segments. Consider the sub-tree that contains all nodes of depth at most $\tau\leq D(P)$, where $D(P)$ is the maximum depth of the partition tree, and let $P_{\leq\tau}$ denote its corresponding valid, nested partition. The leaves of this sub-tree form a flat partition of the timeline. As $\tau$ increases, the resolution $|P_{\leq\tau}|$ of this flat partition increases as well.
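The depth-limited view of a partition tree can be illustrated with a short sketch. The `Node` class, the `flat_partition` helper and the example tree below are hypothetical names and data introduced purely for illustration. The sketch assumes that the root sits at depth $0$ and that truncation keeps all nodes of depth at most $\tau$.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A partition-tree node covering utterances start..end (inclusive)."""
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def flat_partition(root: Node, tau: int) -> List[Tuple[int, int]]:
    """Return the leaves of the sub-tree that keeps nodes of depth <= tau."""
    segments: List[Tuple[int, int]] = []

    def visit(node: Node, depth: int) -> None:
        # A node acts as a leaf if it has no children or sits at the depth limit.
        if not node.children or depth == tau:
            segments.append((node.start, node.end))
            return
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return segments


# Hypothetical timeline of 10 utterances with a two-level nested partition.
tree = Node(0, 9, [
    Node(0, 5, [Node(0, 2), Node(3, 5)]),
    Node(6, 9),
])

for tau in range(3):
    print(tau, flat_partition(tree, tau))
# 0 [(0, 9)]
# 1 [(0, 5), (6, 9)]
# 2 [(0, 2), (3, 5), (6, 9)]
```

Every depth limit yields a flat partition that covers the whole timeline, and the number of segments never decreases as the limit grows.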

2.2 From Linear to Hierarchical Partitions

Suppose that we have access to a linear topic segmentation model that takes in the desired partition size $K$ and identifies $K$ segments. Suppose also that as $K$ increases, additional segment boundaries are added but not deleted. (Note that techniques based on TextTiling (Hearst, 1997), such as BertSeg (Solbiati et al., 2021) and HyperSeg (Park et al., 2023), naturally exhibit this property.) Then we can evaluate this model on the hierarchical segmentation task as follows:

• Choose a depth threshold $\tau$ and construct the corresponding partition $P_{\leq\tau}$ of maximum depth $\tau$.
• Query the model with $K=|P_{\leq\tau}|$.
• Evaluate the result by comparing with the flat partition induced by the leaves of $P_{\leq\tau}$.
• Repeat for all $1\leq\tau\leq D(P)$.

Intuitively, a model that accurately captures the hierarchical relations between segments, will output intermediate partitions that match those induced by the limited-depth sub-trees of the ground truth partition. Figure 2 shows an example where an output partition perfectly matches the ground-truth bottom-level partition in the linear topic segmentation setting, but fails to accurately capture its hierarchical structure.
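The scenario described above can be reproduced with a toy sketch. It assumes a model that returns a ranked list of boundaries (strongest first), so that the top $K-1$ boundaries give a partition into $K$ segments. For simplicity it checks for an exact match instead of computing a segmentation metric. The function names and the example data are hypothetical.

```python
from typing import Dict, List, Set


def boundaries_for_k(ranked: List[int], k: int) -> Set[int]:
    """Top k-1 boundaries of a ranked list, i.e. a partition into k segments."""
    return set(ranked[: k - 1])


def evaluate_levels(ranked: List[int], truth: Dict[int, Set[int]]) -> Dict[int, bool]:
    """For each depth tau, query with K = |P_<=tau| and test for an exact match."""
    results = {}
    for tau, true_boundaries in sorted(truth.items()):
        k = len(true_boundaries) + 1
        results[tau] = boundaries_for_k(ranked, k) == true_boundaries
    return results


# Hypothetical ground truth per depth; a boundary is the index of the first
# utterance of a new segment.
truth = {1: {6}, 2: {3, 6}}

# Hypothetical model output that ranks the finer boundary first.
ranked = [3, 6]

print(evaluate_levels(ranked, truth))  # {1: False, 2: True}
```

The model recovers the bottom-level partition exactly but places the wrong boundary first, so it fails at depth $1$.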

2.3 TreeSeg

TreeSeg first embeds the transcript timeline entries $U$ . Then these embeddings are combined with a divisive clustering approach, to identify appropriate splitting points and construct a deep partition tree.

2.3.1 Embedding the Transcript

TreeSeg uses an embedding model to convert transcript entries to embeddings. For the results in this paper we used OpenAI’s text-embedding-ada-002 (ADA) (Neelakantan et al., 2022), an embedding model trained in an unsupervised manner, utilizing contrastive learning techniques. The choice of an off-the-shelf, commoditized embedding model, results in a pipeline with no trainable parts. Note that TreeSeg does not depend on this particular choice. Any suitable embedding model such as Liu et al. (2019) can be directly plugged into our approach in the same manner.

A single utterance might not contain enough context to be embedded on its own in a meaningful way, especially in the presence of automatic transcription errors. To address this issue we extract overlapping blocks of utterances. For each utterance $u_{t}$ at position $t$ we extract a block that consists of $u_{t}$ itself and up to $W$ utterances in the immediate past. This block $[u_{t-W},...,u_{t}]$ is passed through the embedding model $f$ to obtain the block embedding $e_{t}=f([u_{t-W},...,u_{t}])$. Note that this is a point of deviation from BertSeg (Solbiati et al., 2021). While BertSeg embeds each utterance separately and aggregates them using max-pooling, TreeSeg embeds the whole block of utterances together, resulting in additional context passed to the embedding model. Repeating for every utterance in the transcript results in a temporal sequence of utterance embeddings that maintain local context. The utterance block width $W$ is the sole hyperparameter of TreeSeg.
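The block extraction step can be sketched as follows. This is a minimal illustration under stated assumptions: utterances within a block are joined with a single space (the text does not specify how a block is serialized), and `toy_embed` is a placeholder standing in for a real embedding model. All function names are hypothetical.

```python
from typing import Callable, List, Sequence


def extract_blocks(utterances: Sequence[str], width: int) -> List[str]:
    """Join each utterance with up to `width` utterances that precede it."""
    blocks = []
    for t in range(len(utterances)):
        start = max(0, t - width)  # fewer past entries are available near the start
        # Assumption: a block is serialized by joining utterances with spaces.
        blocks.append(" ".join(utterances[start : t + 1]))
    return blocks


def embed_timeline(
    utterances: Sequence[str],
    width: int,
    embed: Callable[[str], List[float]],
) -> List[List[float]]:
    """Return one block embedding e_t per utterance position t."""
    return [embed(block) for block in extract_blocks(utterances, width)]


def toy_embed(text: str) -> List[float]:
    """Placeholder (NOT a real embedding model): character and word counts."""
    return [float(len(text)), float(len(text.split()))]


print(extract_blocks(["a", "b", "c", "d"], width=2))
# ['a', 'a b', 'a b c', 'b c d']
```

Passing the embedding function as an argument reflects the point that any suitable embedding model can be plugged in.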

2.3.2 Divisive Clustering

TreeSeg utilizes a divisive clustering approach to recursively split segments into two sub-segments, constructing a binary partition tree in the process. Below, we provide a high-level outline of this segment division process:

• For each leaf in the current partition tree, identify the optimal splitting point according to the loss function, while respecting any constraints.
• Pick the leaf with the best scoring candidate splitting point and split it into two sub-segments.
• Repeat until the termination condition is met.

We use a one-dimensional clustering objective as our loss function. Consider a timeline of embeddings $E=[e_{1},...,e_{T}]$. A candidate splitting point $i$ partitions this timeline into two segments $[e_{1},...,e_{i-1}]$ and $[e_{i},...,e_{T}]$. In practice we want to avoid ending up with trivial segments, therefore we enforce a minimum viable segment size denoted by $M$ and only consider candidate points in the range $M<i\leq T-M$. The loss function for candidate point $i$ is given by:

$$\mathcal{L}(i)=\sum_{t=1}^{i-1}\|e_{t}-\mu_{\text{L}}\|_{2}^{2}+\sum_{t=i}^{T}\|e_{t}-\mu_{\text{R}}\|_{2}^{2}$$

with $\mu_{\text{L}}=\frac{1}{i-1}\sum_{t=1}^{i-1}e_{t}$ and $\mu_{\text{R}}=\frac{1}{T-i+1}\sum_{t=i}^{T}e_{t}$. The process stops when we reach the desired number of segments $K$ or when all leaf segments are of size $<2M$ (further splitting such a segment would result in at least one sub-segment below the size threshold $M$). The division process of TreeSeg is outlined in Figure 3. Figure 3(a) shows an example timeline together with the corresponding set of candidate splitting points, while Figures 3(b) and 3(c) demonstrate two successive division steps.

Note that agglomerative hierarchical clustering approaches are typically preferred to divisive ones for computational efficiency reasons. We opt for the divisive approach for the following reasons:

• Divisive clustering naturally matches the hierarchical segmentation task for transcripts. The termination condition is typically met long before the utterance level is reached.
• Since timelines are one-dimensional, the optimal splitting point can be identified with a single linear pass. Efficient implementations using cumulative sums of embedding vectors allow for fast loss function computation. Optimal splits are computed only once for every node and are maintained in a min-heap data structure.
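The role of cumulative sums can be made explicit with a short derivation. The symbols $S_{j}$ and $Q_{j}$ are notation introduced here for illustration and do not appear elsewhere in the text. The derivation is one way to arrive at a linear-pass computation, not necessarily the exact form used in any particular implementation.

```latex
% Running sums of the embeddings and of their squared norms (S_0 = 0, Q_0 = 0):
S_{j}=\sum_{t=1}^{j}e_{t},\qquad Q_{j}=\sum_{t=1}^{j}\|e_{t}\|_{2}^{2}

% For any segment of n vectors with mean \mu:
\sum_{t}\|e_{t}-\mu\|_{2}^{2}=\sum_{t}\|e_{t}\|_{2}^{2}-\frac{1}{n}\Big\|\sum_{t}e_{t}\Big\|_{2}^{2}

% Applying this identity to both sides of candidate point i:
\mathcal{L}(i)=Q_{T}-\frac{\|S_{i-1}\|_{2}^{2}}{i-1}-\frac{\|S_{T}-S_{i-1}\|_{2}^{2}}{T-i+1}
```

Once $S_{j}$ and $Q_{j}$ are stored, each candidate point costs a constant number of vector operations, so all candidates in a segment can be scored in one pass.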

Our hypothesis is that the divisive approach utilizes global information to identify strong candidates for topic shifts, with averaging over multiple embedding vectors functioning as a candidate splitting point filter.
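The division process described in this section can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the reference implementation. In particular, leaves are compared by the reduction in loss that their best split achieves (an assumption; the text only states that the best scoring candidate is picked), a boundary is the 0-based index of the first entry of the right sub-segment, and all function names are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

Vector = List[float]


def prefix_sums(embeddings: List[Vector]) -> Tuple[List[Vector], List[float]]:
    """Running sums of vectors (S) and of squared norms (Q); S[0] and Q[0] are zero."""
    S, Q = [[0.0] * len(embeddings[0])], [0.0]
    for e in embeddings:
        S.append([a + b for a, b in zip(S[-1], e)])
        Q.append(Q[-1] + sum(x * x for x in e))
    return S, Q


def sse(S: List[Vector], Q: List[float], lo: int, hi: int) -> float:
    """Sum of squared distances to the mean over embeddings[lo:hi]."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(S[hi], S[lo]))
    return (Q[hi] - Q[lo]) - diff_sq / (hi - lo)


def best_split(S, Q, lo: int, hi: int, min_size: int) -> Optional[Tuple[float, int]]:
    """Lowest-loss split of embeddings[lo:hi] as (loss, index), or None."""
    best = None
    # 0-based counterpart of the range M < i <= T - M, with s = lo + i - 1.
    for s in range(lo + min_size, hi - min_size):
        loss = sse(S, Q, lo, s) + sse(S, Q, s, hi)
        if best is None or loss < best[0]:
            best = (loss, s)
    return best


def divisive_segmentation(embeddings: List[Vector], k: int, min_size: int) -> List[int]:
    """Return up to k-1 boundaries in the order in which the splits were made."""
    S, Q = prefix_sums(embeddings)
    heap: List[Tuple[float, int, int, int]] = []

    def push(lo: int, hi: int) -> None:
        candidate = best_split(S, Q, lo, hi, min_size)
        if candidate is not None:
            loss, s = candidate
            # Assumption: rank leaves by loss reduction; negate for a min-heap.
            heapq.heappush(heap, (-(sse(S, Q, lo, hi) - loss), lo, s, hi))

    push(0, len(embeddings))
    boundaries: List[int] = []
    while heap and len(boundaries) < k - 1:
        _, lo, s, hi = heapq.heappop(heap)
        boundaries.append(s)
        push(lo, s)   # the best split of each new leaf is computed once
        push(s, hi)
    return boundaries


# Hypothetical one-dimensional embeddings forming three flat regions.
toy = [[0.0]] * 4 + [[10.0]] * 4 + [[11.0]] * 4
print(divisive_segmentation(toy, k=3, min_size=2))  # [4, 8]
```

Because boundaries are returned in split order, the first $K-1$ entries give a partition into $K$ segments for every $K$, which is the nesting property needed for hierarchical evaluation.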

3 Evaluation

3.1 Datasets

We evaluate TreeSeg on three datasets:

• ICSI (Janin et al., 2003): A corpus of $75$ transcribed meetings, containing manual hierarchical topic annotations up to $4$ levels deep.
• AMI (Mccowan et al., 2005): Over $100$ hours of transcribed meetings, containing manual hierarchical topic annotations up to $3$ levels deep.
• TinyRec: We introduce TinyRec, a dataset consisting of transcripts obtained from $21$ self-recorded sessions. Each transcript contains spoken utterances as transcribed via ASR and was manually annotated with two-level topic annotations. For more details on the content and annotation guidelines for TinyRec, refer to Appendix A.

We enforce a minimum size of five utterances per segment. Segments with sizes below this threshold are automatically merged into the segment that comes immediately after them. Table 1 shows the topic annotation statistics for each dataset. Almost all transcripts in the ICSI and AMI corpora contain second-level annotations, while several of them contain third- or even fourth-level annotations. All TinyRec transcripts are annotated at two different resolutions: ‘coarse’ and ‘fine’. Table 2 shows the average number of segments at each partition level, after pruning segments below the size threshold.
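The pruning rule can be sketched as below, with segments represented by their lengths in utterances. The handling of a short final segment is an assumption, since it has no successor to merge into: here it is folded into the preceding segment. The function name is hypothetical.

```python
from typing import List


def merge_small_segments(lengths: List[int], min_size: int = 5) -> List[int]:
    """Fold every segment shorter than `min_size` into the segment after it."""
    merged: List[int] = []
    carry = 0  # utterances from short segments waiting to join the next one
    for length in lengths:
        length += carry
        if length < min_size:
            carry = length
        else:
            merged.append(length)
            carry = 0
    if carry:
        # Assumption: a short final segment joins the preceding segment,
        # or stands alone if there is no other segment.
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)
    return merged


print(merge_small_segments([3, 10, 2, 8, 4]))  # [13, 14]
```

The total number of utterances is preserved; only the boundaries around undersized segments are removed.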

Dataset	Avg. $|U|$	L1	L2	L3	L4
ICSI	1453.7	75	75	52	3
AMI	636.4	139	125	21	0
TinyRec	267.4	21	21	0	0

Table 1: Annotation statistics: Number of transcripts with available topic annotations per partition level (L1 through L4), together with the average utterance timeline length.

Dataset	L1	L2	L3	L4
ICSI	5.8	19.05	28.08	34.33
AMI	6.81	14.44	26.61	-
TinyRec	4.18	14.12	-	-

Table 2: Segment statistics: Average number of segments per partition level (L1 through L4), after pruning.

3.2 Methodology

We compare TreeSeg against four baselines on the hierarchical topic segmentation task. We adapt BertSeg and HyperSeg to output the top $K$ splitting points on the timeline, as described in Section 2.2. We also compare with two naive baselines: RandomSeg and EquiSeg. RandomSeg generates $K$ random segments by picking $K-1$ segment transition points at random. EquiSeg splits the timeline into equidistant segments.
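The two naive baselines can be sketched as follows. The sketch assumes that a transition point is the 0-based index of the first utterance of a new segment and that EquiSeg uses integer division to place boundaries. Both are illustrative choices, and the function names are hypothetical.

```python
import random
from typing import List, Optional


def random_seg(n: int, k: int, rng: Optional[random.Random] = None) -> List[int]:
    """Pick k-1 distinct transition points uniformly at random from 1..n-1."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(1, n), k - 1))


def equi_seg(n: int, k: int) -> List[int]:
    """Place k-1 transition points so that segment lengths differ by at most one."""
    return [(j * n) // k for j in range(1, k)]


print(equi_seg(10, 4))  # [2, 5, 7]
```

Passing a seeded `random.Random` instance to `random_seg` makes repeated runs reproducible, which is convenient when averaging over many random partitions.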

For evaluation we use the standard $P_{k}$ (Beeferman et al., 1999) and WinDiff (Pevzner and Hearst, 2002) metrics. For each level of resolution we query each model with the ground-truth number of segments $K$ and compare the obtained partition with the ground-truth one. We average metrics across all possible partitions, as well as on a per-level basis. We run RandomSeg $100$ times and average the results.
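For reference, both metrics can be sketched in a few lines. This is a simplified illustration and not necessarily the implementation used for the reported results. It assumes the customary window size of half the mean reference segment length and represents a partition by one segment id per utterance. Published implementations differ in details such as rounding and boundary conventions. The function names are hypothetical.

```python
from typing import List


def labels_from_boundaries(n: int, boundaries: List[int]) -> List[int]:
    """Segment id per utterance; a boundary is the first index of a new segment."""
    starts = set(boundaries)
    labels, current = [], 0
    for t in range(n):
        if t in starts:
            current += 1
        labels.append(current)
    return labels


def window_size(ref: List[int]) -> int:
    """Half the mean reference segment length (assumed convention)."""
    num_segments = ref[-1] + 1
    return max(1, round(len(ref) / (2 * num_segments)))


def pk(ref: List[int], hyp: List[int]) -> float:
    """Fraction of windows where ref and hyp disagree on 'same segment or not'."""
    k = window_size(ref)
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(len(ref) - k)
    )
    return errors / (len(ref) - k)


def windiff(ref: List[int], hyp: List[int]) -> float:
    """Fraction of windows where ref and hyp contain different boundary counts."""
    k = window_size(ref)
    # Ids grow by one per boundary, so a difference of ids counts boundaries.
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i])
        for i in range(len(ref) - k)
    )
    return errors / (len(ref) - k)


# Hypothetical 12-utterance timeline: true boundary at 6, predicted at 4.
ref = labels_from_boundaries(12, [6])
hyp = labels_from_boundaries(12, [4])
print(round(pk(ref, hyp), 3), round(windiff(ref, hyp), 3))  # 0.444 0.444
```

Both scores lie in $[0,1]$, with $0$ meaning that the hypothesis agrees with the reference in every window.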

The adapted version of HyperSeg has no hyperparameters. TreeSeg and BertSeg both use a hyperparameter that regulates the utterance embedding block width. BertSeg uses two additional hyperparameters related to utterance similarity score smoothing. The first five transcripts from each dataset were denoted as the ‘development’ set and were used to determine reasonable values for these hyperparameters.

3.3 Results

	ICSI		AMI		TinyRec
Method	Pk	Wd	Pk	Wd	Pk	Wd
RandomSeg	0.464	0.503	0.464	0.503	0.465	0.492
EquiSeg	0.482	0.508	0.478	0.506	0.505	0.513
HyperSeg	0.453	0.499	0.48	0.519	0.485	0.515
BertSeg	0.388	0.432	0.443	0.48	0.473	0.486
TreeSeg (ours)	0.31	0.353	0.355	0.396	0.367	0.382

Table 3: Hierarchical topic segmentation results: The performance of all techniques is evaluated on the ICSI, AMI and TinyRec datasets. $P_{k}$ (Pk) and WinDiff (Wd) metrics (lower is better) are aggregated over all segmentation resolution levels. Our approach (TreeSeg) clearly outperforms all baselines.

	ICSI (Pk)				AMI (Pk)			TinyRec (Pk)
Method	L1	L2	L3	L4	L1	L2	L3	L1	L2
RandomSeg	0.445	0.472	0.48	0.487	0.455	0.471	0.486	0.44	0.49
EquiSeg	0.446	0.5	0.505	0.492	0.464	0.492	0.492	0.492	0.518
HyperSeg	0.442	0.456	0.465	0.417	0.47	0.491	0.481	0.48	0.49
BertSeg	0.343	0.416	0.41	0.422	0.441	0.442	0.461	0.462	0.484
TreeSeg (ours)	0.28	0.325	0.326	0.392	0.35	0.356	0.38	0.336	0.399

(a) $P_{k}$

	ICSI (Wd)				AMI (Wd)			TinyRec (Wd)
Method	L1	L2	L3	L4	L1	L2	L3	L1	L2
RandomSeg	0.474	0.516	0.527	0.534	0.489	0.515	0.532	0.461	0.523
EquiSeg	0.47	0.531	0.53	0.519	0.493	0.52	0.515	0.502	0.524
HyperSeg	0.473	0.508	0.523	0.481	0.504	0.534	0.533	0.506	0.524
BertSeg	0.386	0.462	0.453	0.466	0.477	0.481	0.498	0.479	0.493
TreeSeg (ours)	0.314	0.375	0.372	0.441	0.387	0.403	0.421	0.352	0.413

(b) WinDiff

Table 4: Hierarchical topic segmentation results aggregated per level: The performance of all techniques is evaluated on the ICSI, AMI and TinyRec datasets. (a) $P_{k}$ (Pk) and (b) WinDiff (Wd) metrics are aggregated per segmentation resolution level. Our approach (TreeSeg) maintains strong performance across all segmentation resolutions.

Table 3 shows the $P_{k}$ and WinDiff scores for all approaches, averaged over all topic annotation resolutions as described in Section 2.2. TreeSeg clearly outperforms all baselines on all three datasets. Table 4 shows the $P_{k}$ (Table 4(a)) and WinDiff (Table 4(b)) scores of all approaches aggregated per segmentation resolution level. Note that TreeSeg maintains strong performance across all segmentation resolutions.

Our results demonstrate that TreeSeg adequately captures the hierarchical relations between segments at all levels of the hierarchy. Note also that the performance of BertSeg and HyperSeg degrades on TinyRec, a dataset that is less structured in nature than ICSI or AMI. While small in scale, TinyRec might be more representative of self-recorded content in the wild. TreeSeg maintains strong performance across all three datasets.

4 Conclusion

We introduced TreeSeg, a hierarchical segmentation approach suitable for segmenting large meeting and video transcripts. TreeSeg generates structured segmentations in the form of binary trees, capturing the hierarchical relations between segments. Our approach utilizes off-the-shelf components, contains no learnable parts and has only a single hyperparameter. We provided a rigorous definition and evaluation methodology for the hierarchical topic segmentation task. We compared TreeSeg with two related embedding-based approaches, BertSeg (Solbiati et al., 2021) and HyperSeg (Park et al., 2023), as well as two naive baselines. We introduced TinyRec, a small-scale collection of transcripts obtained from self-recorded sessions via ASR. Evaluating on ICSI, AMI and TinyRec, we demonstrated the superior performance of TreeSeg. Our work constitutes, to our knowledge, the first divisive clustering variant of TextTiling (Hearst, 1997). A promising future direction for research is that of utilizing the structure of the generated partition in downstream tasks such as summarization (Park et al., 2024) or knowledge extraction.

5 Limitations

One limitation of this work is related to the diversity of the datasets in our evaluation. ICSI and AMI are large corpora but are unlikely to capture the full diversity of self-recorded content. Contributing TinyRec is one attempt at mitigating this limitation. Another limitation is the lack of comparison with $\text{M}^{3}$Seg (Wang et al., 2023). At the time of writing this manuscript, there is no publicly available code base for $\text{M}^{3}$Seg. Evaluation metrics on ICSI and AMI vary with the resolution of the segmentation and the data extraction process, making evaluating models on the same exact data a necessity for a fair comparison. Finally, another limitation of this work is our restricted focus on ADA embeddings (Neelakantan et al., 2022). A more thorough comparison of various embedding models might yield interesting insights into the function of divisive clustering as a filter for strong topic shift candidate points.

References

Bayomi and Lawless (2018) Mostafa Bayomi and Séamus Lawless. 2018. C-HTS: A concept-based hierarchical text segmentation approach. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Beeferman et al. (1999) Doug Beeferman, Adam L. Berger, and John D. Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34:177–210.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics.
Ghosh et al. (2022) Reshmi Ghosh, Harjeet Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, and Soundararajan Srinivasan. 2022. Topic segmentation in the wild: Towards segmentation of semi-structured & unstructured chats.
Ghosh et al. (2024) Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, Hansi Zeng, and Soundararajan Srinivasan. 2024. Topic segmentation of semi-structured and unstructured conversational datasets using language models. In Intelligent Systems and Applications, pages 91–104, Cham. Springer Nature Switzerland.
Grootendorst (2022) Maarten Grootendorst. 2022. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.
Gruenstein et al. (2008) Alexander Gruenstein, John Niekrasz, and Matthew Purver. 2008. Meeting Structure Annotation, pages 247–274.
Hazem et al. (2020) Amir Hazem, Beatrice Daille, Dominique Stutzmann, Christopher Kermorvant, and Louis Chevalier. 2020. Hierarchical text segmentation for medieval manuscripts. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6240–6251, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Hearst (1997) Marti A. Hearst. 1997. Texttiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist., 23(1):33–64.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Janin et al. (2003) A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The icsi meeting corpus. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 1, pages I–I.
Koshorek et al. (2018) Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. 2018. Text segmentation as a supervised learning task. ArXiv, abs/1803.09337.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Lukasik et al. (2020) Michal Lukasik, Boris Dadachev, Kishore Papineni, and Gonçalo Simões. 2020. Text segmentation by cross segment attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4707–4716, Online. Association for Computational Linguistics.
Mccowan et al. (2005) Iain Mccowan, J Carletta, Wessel Kraaij, Simone Ashby, S Bourban, M Flynn, M Guillemot, Thomas Hain, J Kadlec, V Karaiskos, M Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska Masson, Wilfried Post, Dennis Reidsma, and P Wellner. 2005. The ami meeting corpus. Int’l. Conf. on Methods and Techniques in Behavioral Research.
McGrady et al. (2023) Ryan McGrady, Kevin Zheng, Rebecca Curran, Jason Baumgartner, and Ethan Zuckerman. 2023. Dialing for videos: A random sample of youtube. Journal of Quantitative Description: Digital Media, 3.
Neelakantan et al. (2022) Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng. 2022. Text and code embeddings by contrastive pre-training. CoRR, abs/2201.10005.
Park et al. (2024) Seongmin Park, Kyungho Kim, Jaejin Seo, and Jihwa Lee. 2024. Unsupervised extractive dialogue summarization in hyperdimensional space. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Park et al. (2023) Seongmin Park, Jinkyu Seo, and Jihwa Lee. 2023. Unsupervised dialogue topic segmentation in hyperdimensional space. In INTERSPEECH 2023. ISCA.
Pevzner and Hearst (2002) Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
Retkowski and Waibel (2024) Fabian Retkowski and Alexander Waibel. 2024. From text segmentation to smart chaptering: A novel benchmark for structuring video transcriptions. ArXiv, abs/2402.17633.
Solbiati et al. (2021) Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, and Jacques Calì. 2021. Unsupervised topic segmentation of meetings with bert embeddings. ArXiv, abs/2106.12978.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
Wang et al. (2023) Ke Wang, Xiutian Zhao, Yanghui Li, and Wei Peng. 2023. M³Seg: A maximum-minimum mutual information paradigm for unsupervised topic segmentation in ASR transcripts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7928–7934, Singapore. Association for Computational Linguistics.
Xing and Carenini (2021) Linzi Xing and Giuseppe Carenini. 2021. Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 167–177, Singapore and Online. Association for Computational Linguistics.

Appendix A The TinyRec Dataset

TinyRec consists of $21$ self-recorded video sessions with screen-sharing that were transcribed using Automatic Speech Recognition (ASR) and manually annotated with topic annotations at two levels of resolution. The selection criteria were the following:

• A session needs to be at least $8$ minutes long or contain at least $80$ transcript entries.
• The content of the session must be technical in nature and non-trivial.

The dataset was annotated by four different annotators. The concept of a partition was explained to each annotator. The annotators were asked to first identify a ‘coarse’ partition typically consisting of $2$-$5$ segments for a ten-minute session. Then they were asked to further partition each ‘coarse’ segment, wherever that made sense according to their judgement, to obtain the ‘fine’, two-level partition.

The transition from ‘coarse’ to ‘fine’ segments was explained with the following example scenario:

You are recording your work update for a week during which you worked on three features. Suppose that you identified a coarse segment covering the discussion on one of these features. The fine segmentation would segment this coarse segment again into sub-segments, each discussing parts of the feature implementation or nuances in its design.

as well as an example of a segmented transcript.