DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Yadav, Amit Kumar Singh; Bhagtani, Kratika; Xiang, Ziyue; Bestagini, Paolo; Tubaro, Stefano; Delp, Edward J.

arXiv:2304.03323 (cs)

[Submitted on 6 Apr 2023 (v1), last revised 28 Jul 2023 (this version, v2)]

Abstract: Tools for generating high-quality synthetic speech that is perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of them use deep learning methods as a black box, without providing reasoning for the decisions they make, which limits their interpretability. In this paper, we propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a variational autoencoder trained in two stages that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map that highlights the spectrogram regions that discriminate synthetic from bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representations obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate between bona fide and synthetic speech for each of the synthesizers.

Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2304.03323 [cs.SD]
	(or arXiv:2304.03323v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2304.03323

Submission history

From: Amit Kumar Singh Yadav
[v1] Thu, 6 Apr 2023 18:37:26 UTC (25,650 KB)
[v2] Fri, 28 Jul 2023 20:38:31 UTC (21,164 KB)
