Evaluation of Deep Clustering for Diarization of Aphasic Speech

Stud Health Technol Inform. 2019:260:81-88.

Abstract

Speaker attribution and labeling of single-channel, multi-speaker audio files is an area of active research, since the underlying problems have not yet been solved satisfactorily. This holds especially for non-standard voices and speech, such as those of children and impaired speakers. Being able to perform speaker labeling of pathological speech could enable the development of computer-assisted diagnosis and treatment systems and is thus a desirable research goal. In this manuscript we investigate the applicability of embeddings of audio signals, in the form of time and frequency-band based segments, into arbitrary vector spaces to the diarization of pathological speech. We focus on modifying an existing embedding estimator such that it can be used for diarization. This is mainly done by clustering the time- and frequency-band-dependent vectors and subsequently performing a majority vote over all frequency-dependent vectors of the same time segment to assign a speaker label. The result is evaluated on recordings of interviews of aphasia patients and language therapists. We demonstrate general applicability, with error rates close to those previously achieved in diarizing children's speech. Additionally, we propose to enhance the processing pipeline with smoothing and a more sophisticated, energy-based voting scheme.
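
The clustering and majority-vote step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding array shape (time segments, frequency bands, embedding dimension), the plain k-means routine, and the function names `kmeans` and `diarize_frames` are all assumptions made for illustration, and the embedding estimator itself is taken as given.

```python
import numpy as np


def kmeans(vectors, n_clusters, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation.

    Hypothetical helper: the abstract does not state which clustering
    algorithm is used; k-means is an assumption.
    """
    # Start from the first vector, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    for _ in range(1, n_clusters):
        dists = np.min(
            [np.sum((vectors - c) ** 2, axis=1) for c in centroids], axis=0
        )
        centroids.append(vectors[np.argmax(dists)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign every vector to its nearest centroid.
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = vectors[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels


def diarize_frames(embeddings, n_speakers=2):
    """Assign one speaker label per time segment by majority vote.

    embeddings: array of shape (n_frames, n_bins, dim), one embedding
    vector per time segment and frequency band (assumed layout).
    """
    n_frames, n_bins, dim = embeddings.shape
    # Cluster all time-frequency vectors jointly.
    flat = embeddings.reshape(n_frames * n_bins, dim)
    bin_labels = kmeans(flat, n_speakers).reshape(n_frames, n_bins)
    # Count, per time segment, how many frequency bands fall in each cluster.
    votes = np.stack(
        [(bin_labels == k).sum(axis=1) for k in range(n_speakers)], axis=1
    )
    # The cluster with the most votes wins; ties go to the lower index.
    return votes.argmax(axis=1)
```

The cluster indices returned are arbitrary, so they identify speakers only up to a permutation. The smoothing and energy-based voting proposed at the end of the abstract are not included in this sketch.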

Keywords: diarization; expressive language disorders; machine learning; medical informatics.

MeSH terms

  • Aphasia* / diagnosis
  • Child
  • Cluster Analysis*
  • Humans
  • Language
  • Speech*