Robust text-dependent speaker verification system using gender aware Siamese-Triplet Deep Neural Network

Network. 2024 Dec 29:1-40. doi: 10.1080/0954898X.2024.2438128. Online ahead of print.

Abstract

Text-dependent speaker verification is critical for high-security applications, but its authentication accuracy suffers from voice quality variations, linguistic diversity, and gender-related pitch differences. This paper introduces a Gender-Aware Siamese-Triplet Deep Neural Network (ST-DNN) architecture to address these challenges. The Gender-Aware Network uses 2D convolutional layers with ReLU activation for initial feature extraction, followed by multi-fusion dense skip connections and batch normalization that integrate features across different depths and improve discrimination between male and female speakers. A bottleneck layer then compresses the feature maps to capture gender-related characteristics. For speaker verification, separate male and female ST-DNN models are used, each incorporating Individual, Siamese, and Triplet Networks. The Individual Network extracts the characteristics unique to each utterance, the Siamese Network compares pairs of speech samples to decide whether they come from the same speaker, and the Triplet Network keeps embeddings of samples from the same speaker closely grouped, facilitating precise verification. Experimental results on the RSR2015 and RedDots Challenge 2016 datasets demonstrate significant improvements: Equal Error Rate (EER) is reduced by 32.31% to 54.55% for males and by 33.73% to 38.98% for females, and MinDCF is reduced by 53.47% to 86.36% for males and by 39.46% to 71.19% for females, validating the efficacy of the ST-DNN in real-world applications.
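As a rough illustration of two ideas named in the abstract, the sketch below shows a generic triplet margin loss, which pulls same-speaker embeddings together, and an approximate Equal Error Rate computed from verification scores. It is not the authors' implementation: the abstract does not give the loss form, distance metric, margin, or scoring method, so the Euclidean distance, the `margin` value, and the function names `triplet_loss` and `equal_error_rate` are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is closer to the same-speaker sample
    (positive) than to the different-speaker sample (negative) by at
    least `margin`.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping a threshold over the observed scores.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest to each other.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy 2-D embeddings and scores, invented purely for illustration.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
badly_separated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

With these toy inputs the well-separated triplet contributes no loss while the badly separated one is penalized, and one of four target trials overlaps with one of four impostor trials. Real systems typically interpolate the FAR and FRR curves rather than picking the nearest observed threshold.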

Keywords: Siamese network; Speaker verification; gender information; stage-wise training; triplet network.