Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings

Woo Hyun Kang; Nam Soo Kim

doi:10.3390/s19214709

Sensors (Basel). 2019 Oct 30;19(21):4709. doi: 10.3390/s19214709.

Authors

Woo Hyun Kang¹, Nam Soo Kim²

Affiliations

¹ Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea.
² Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea.

Abstract

In recent years, growing demand for voice-based authentication systems has motivated considerable research on verifying users with a short randomized pass-phrase. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion, given the Baum-Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback-Leibler (KL) divergence regularization adopted in VAE-based model training, the proposed ALI-based feature extractor uses a joint discriminator to ensure that the generated latent variable and the GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods on the TIDIGITS dataset. Experimental results show that the proposed method represents the uncertainty caused by short utterance duration better than the VAE-based method. Furthermore, the proposed approach performs well when used in combination with the standard i-vector framework.
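
The abstract refers to the Baum-Welch statistics of the input utterance without defining them. As a rough illustration only, the following sketch computes the standard zeroth-order and centered first-order statistics of an utterance against a diagonal-covariance GMM. The function name baum_welch_stats, the toy dimensions, and the random inputs are assumptions made for this example and are not taken from the paper, which may use a different configuration.

```python
import numpy as np


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and centered first-order Baum-Welch statistics (illustrative).

    frames:    (T, D) acoustic feature vectors of one utterance
    weights:   (C,)   mixture weights of a diagonal-covariance GMM
    means:     (C, D) component means
    variances: (C, D) component variances (diagonal covariances)
    """
    # Per-frame, per-component Gaussian log-density, shape (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )

    # Component posteriors per frame, normalized in the log domain for stability.
    log_post = np.log(weights) + log_gauss
    log_post -= np.max(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth-order statistics: soft frame counts per component, shape (C,).
    n = post.sum(axis=0)
    # First-order statistics centered on the component means, shape (C, D).
    f_centered = post.T @ frames - n[:, None] * means
    return n, f_centered


# Toy usage with random values (hypothetical sizes, not the paper's setup).
rng = np.random.default_rng(0)
T, C, D = 50, 4, 3
frames = rng.normal(size=(T, D))
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
n, f_centered = baum_welch_stats(frames, weights, means, variances)
```

Because each frame's posteriors sum to one, the zeroth-order statistics sum to the number of frames. A short utterance therefore yields small counts per component, which is the source of the duration-related uncertainty the abstract mentions.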

Keywords: deep learning; speaker recognition; speech embedding; unsupervised representation learning.

Grants and funding

PA-J000001-2017-101/Ministry of Science, ICT and Future Planning