-
Disentangled representation learning for multilingual speaker recognition
Authors:
Kihyun Nam,
Youkyum Kim,
Jaesung Huh,
Hee Soo Heo,
Jee-weon Jung,
Joon Son Chung
Abstract:
The goal of this paper is to learn robust speaker representations for the bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when they speak in different languages.
Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B derived from VoxCeleb that considers bilingual scenarios.
We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method only uses language pseudo-labels without manual information.
Submitted 6 June, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Look who's not talking
Authors:
Youngki Kwon,
Hee Soo Heo,
Jaesung Huh,
Bong-Jin Lee,
Joon Son Chung
Abstract:
The objective of this work is speaker diarisation of speech recordings 'in the wild'. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding is an extremely effective indicator of speech activity. The method does not require an independent model for speech activity detection, and therefore allows speaker diarisation to be performed using a unified representation for both speaker modelling and speech activity detection. We perform a number of experiments on in-house and public datasets, in which our method outperforms popular baselines.
Submitted 30 November, 2020;
originally announced November 2020.
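As a rough illustration of the embedding-norm idea in the abstract above, the following sketch marks a frame as speech when the L2 norm of its embedding reaches a threshold, then merges consecutive speech frames into segments. It is not the authors' implementation: the function names, the threshold value, the hop size, and the assumption that a larger norm indicates speech are all hypothetical choices made for this example.

```python
import math


def embedding_norm(embedding):
    """Return the L2 norm of one embedding vector."""
    return math.sqrt(sum(x * x for x in embedding))


def detect_speech(embeddings, threshold):
    """Flag each frame as speech when its embedding norm reaches the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return [embedding_norm(e) >= threshold for e in embeddings]


def merge_segments(flags, hop_seconds):
    """Merge runs of consecutive speech frames into (start, end) times in seconds."""
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # The recording ended while speech was still active.
        segments.append((start * hop_seconds, len(flags) * hop_seconds))
    return segments


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real speaker embeddings are much larger.
    frames = [[0.1, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 0.2]]
    flags = detect_speech(frames, threshold=1.0)
    print(flags)
    print(merge_segments(flags, hop_seconds=0.5))
```

Because the same embeddings would also feed the speaker model, a scheme like this needs no separate speech activity detector, which is the property the abstract highlights.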
-
Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020
Authors:
Hee Soo Heo,
Bong-Jin Lee,
Jaesung Huh,
Joon Son Chung
Abstract:
This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing. We release the training code and pre-trained models as unofficial baselines for this year's challenge.
Submitted 29 September, 2020;
originally announced September 2020.
-
Augmentation adversarial training for self-supervised speaker recognition
Authors:
Jaesung Huh,
Hee Soo Heo,
Jingu Kang,
Shinji Watanabe,
Joon Son Chung
Abstract:
The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans.
Submitted 30 October, 2020; v1 submitted 23 July, 2020;
originally announced July 2020.
-
End-to-End Lip Synchronisation Based on Pattern Classification
Authors:
You Jin Kim,
Hee Soo Heo,
Soo-Whan Chung,
Bong-Jin Lee
Abstract:
The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
Submitted 19 March, 2021; v1 submitted 18 May, 2020;
originally announced May 2020.
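The similarity matrix mentioned in the abstract above can be sketched as follows. The example computes cosine similarities between every audio frame and every video frame, then picks the offset whose diagonal has the highest mean similarity. That last step is a simple hand-written heuristic standing in for the trained classifier the abstract describes; it is closer in spirit to the sliding-window baselines than to the proposed end-to-end network. All function names, the toy features, and the sign convention (offset = video index minus audio index) are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(audio_feats, video_feats):
    """Rows index audio frames, columns index video frames."""
    return [[cosine(a, v) for v in video_feats] for a in audio_feats]


def best_diagonal_offset(matrix, max_offset):
    """Return the offset (video index minus audio index) with the highest mean similarity.

    This diagonal search is a heuristic stand-in, not the learned classifier.
    """
    n_audio = len(matrix)
    n_video = len(matrix[0])
    best_offset = 0
    best_score = float("-inf")
    for offset in range(-max_offset, max_offset + 1):
        values = [
            matrix[i][i + offset]
            for i in range(n_audio)
            if 0 <= i + offset < n_video
        ]
        if not values:
            continue
        score = sum(values) / len(values)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset


if __name__ == "__main__":
    # Toy one-hot features: audio frame i matches video frame i + 1.
    video = [[1.0 if j == i else 0.0 for j in range(5)] for i in range(5)]
    audio = video[1:]
    matrix = similarity_matrix(audio, video)
    print(best_diagonal_offset(matrix, max_offset=2))
```

In the approach the abstract describes, the matrix would instead be passed to a classifier trained jointly with the feature extractor, treating the matrix like an image.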
-
In defence of metric learning for speaker recognition
Authors:
Joon Son Chung,
Jaesung Huh,
Seongkyu Mun,
Minjae Lee,
Hee Soo Heo,
Soyeon Choe,
Chiheon Ham,
Sunghwan Jung,
Bong-Jin Lee,
Icksang Han
Abstract:
The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance.
A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of the most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and that networks trained with our proposed metric learning objective outperform state-of-the-art methods.
Submitted 24 April, 2020; v1 submitted 26 March, 2020;
originally announced March 2020.
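For readers unfamiliar with the vanilla triplet loss named in the abstract above, here is a minimal sketch of its standard form for a single triplet, using squared Euclidean distance. It illustrates the general technique only: the margin value, the distance choice, and the function names are assumptions, and this is not the metric learning objective the paper proposes.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin (a hypothetical hyperparameter here).
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(0.0, pos_dist - neg_dist + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    same_speaker = [1.0, 0.0]
    far_speaker = [3.0, 0.0]
    near_speaker = [1.0, 1.0]
    # Well-separated negative: the margin is already satisfied.
    print(triplet_loss(anchor, same_speaker, far_speaker, margin=1.0))
    # Close negative: the loss is positive and would push the embeddings apart.
    print(triplet_loss(anchor, same_speaker, near_speaker, margin=1.5))
```

Minimising this quantity over many triplets encourages the small intra-speaker and large inter-speaker distances that the abstract names as the goal for open-set recognition.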