Although recent state-of-the-art systems show near-perfect performance, analysis of speaker embeddings has been lacking thus far. In this work, an in-depth analysis of speaker representations is performed by examining which features are selected. To this end, various intermediate representations of the trained model are observed using graph attentive feature aggregation, which comprises a graph attention layer and a graph pooling layer followed by a readout operation. The TIMIT dataset, which has comparatively controlled conditions (e.g., region and phoneme), is used after pre-training the model on the VoxCeleb dataset and then freezing its weight parameters. Extensive experiments reveal a consistent trend in speaker representation: the models learn to exploit sequence and phoneme information despite receiving no supervision in that direction. The results shed light on speaker embeddings, which are still considered a black box.
© 2024 Acoustical Society of America.
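To make the aggregation pipeline named in the abstract concrete, the sketch below shows one plausible form of graph attentive feature aggregation: a single-head graph attention layer over frame-level features, top-k graph pooling, and a mean/max readout. This is a minimal NumPy illustration, not the paper's implementation; the function name, the fully connected graph assumption, the L2-norm pooling score, and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attentive_aggregation(X, W, a, k):
    """Hypothetical sketch: one graph attention layer over a fully
    connected graph of frame-level features X (N, d_in), followed by
    top-k graph pooling and a mean/max readout.

    W: (d_in, d_out) projection; a: (2*d_out,) attention vector;
    k: number of nodes kept by pooling.
    Returns a fixed-size vector of length 2*d_out.
    """
    H = X @ W                              # project node features
    d = H.shape[1]
    # GAT-style logits e_ij = LeakyReLU(a^T [h_i || h_j])
    src = H @ a[:d]                        # (N,) contribution of node i
    dst = H @ a[d:]                        # (N,) contribution of node j
    E = src[:, None] + dst[None, :]        # (N, N) pairwise logits
    E = np.where(E > 0, E, 0.2 * E)        # LeakyReLU, slope 0.2
    A = softmax(E, axis=1)                 # attention weights per node
    H = A @ H                              # attentive aggregation
    # top-k pooling: keep the k nodes with the largest L2 norm
    keep = np.argsort(np.linalg.norm(H, axis=1))[-k:]
    Hk = H[keep]
    # readout: concatenate mean- and max-pooled node features
    return np.concatenate([Hk.mean(axis=0), Hk.max(axis=0)])
```

With, say, N = 10 frames of 8-dimensional features projected to 4 dimensions, the readout yields a fixed 8-dimensional summary regardless of sequence length, which is what allows such a module to probe intermediate representations of variable-length utterances.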