PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Liu, Hexin; Perera, Leibny Paola Garcia; Khong, Andy W. H.; Styles, Suzy J.; Khudanpur, Sanjeev

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2203.12366 (eess)

[Submitted on 23 Mar 2022 (v1), last revised 31 Mar 2022 (this version, v2)]

Title:PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Authors:Hexin Liu, Leibny Paola Garcia Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur

View PDF

Abstract:We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of phonotactic embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.

Comments:	Submitted to Interspeech 2022, updated to the submitted version
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2203.12366 [eess.AS]
	(or arXiv:2203.12366v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2203.12366

Submission history

From: Hexin Liu [view email]
[v1] Wed, 23 Mar 2022 12:38:38 UTC (539 KB)
[v2] Thu, 31 Mar 2022 05:59:05 UTC (694 KB)

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators