Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval

Zeng, Donghuo; Ikeda, Kazushi

arXiv:2310.13451 (cs)

[Submitted on 20 Oct 2023]

Abstract: Cross-modal retrieval models use triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass and overlook the distinction between semi-hard and hard triplets during optimization, which leads to suboptimal performance. In this paper, we introduce an approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning from semi-hard to hard triplets. In the first stage, the model is trained on a set of semi-hard triplets, starting from a low-loss base. In the second stage, we augment the embeddings using an interpolation technique. This process identifies potential hard negatives, alleviating the high loss values caused by a scarcity of hard triplets. Our approach then applies hard triplet mining in the augmented embedding space to further optimize the model. Experiments on two audio-visual datasets show an improvement of approximately 9.8% in average Mean Average Precision (MAP) over the current state-of-the-art method, MSNSCA, on the Audio-Visual Cross-Modal Retrieval (AV-CMR) task on the AVE dataset, indicating the effectiveness of the proposed method.
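The distinction between semi-hard and hard triplets, and the way interpolation can produce harder negatives, may be easier to see in a small sketch. The code below is illustrative only and is not the authors' implementation. It assumes the commonly used definitions (a negative is hard when it lies no farther from the anchor than the positive, and semi-hard when it lies farther than the positive but still inside the margin), a Euclidean distance, and a simple linear interpolation between a negative and the anchor. The abstract does not state which embeddings are interpolated or how, so the names `triplet_loss`, `classify_triplet`, and `interpolate`, the margin value, and the mixing weight are all hypothetical.

```python
import math


def triplet_loss(d_ap, d_an, margin):
    """Hinge-style triplet loss computed from anchor-positive and anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)


def classify_triplet(d_ap, d_an, margin):
    """Label a triplet using the common convention (an assumption, not taken from the paper)."""
    if d_an <= d_ap:
        return "hard"        # negative is at least as close to the anchor as the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # negative is farther than the positive but still violates the margin
    return "easy"            # margin is satisfied, so the loss is zero


def interpolate(u, v, lam):
    """Linear interpolation lam * u + (1 - lam) * v between two embeddings."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]


# Toy 2-D embeddings (made-up values for illustration).
anchor = [0.0, 0.0]
positive = [1.0, 0.0]
negative = [1.2, 0.0]
margin = 0.5

# Stage one: the original negative forms a semi-hard triplet.
d_ap = math.dist(anchor, positive)
d_an = math.dist(anchor, negative)
stage_one_kind = classify_triplet(d_ap, d_an, margin)

# Stage two (assumed form of augmentation): mix the negative with the anchor
# to obtain a synthetic negative that sits closer to the anchor.
synthetic = interpolate(negative, anchor, 0.5)
d_as = math.dist(anchor, synthetic)
stage_two_kind = classify_triplet(d_ap, d_as, margin)

print(stage_one_kind, triplet_loss(d_ap, d_an, margin))
print(stage_two_kind, triplet_loss(d_ap, d_as, margin))
```

In this toy case the original negative yields a semi-hard triplet with a small loss, while the interpolated negative yields a hard triplet with a larger loss, which mirrors the semi-hard-to-hard ordering that the two-stage curriculum describes.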

Comments:	8 pages, 6 figures
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2310.13451 [cs.SD]
	(or arXiv:2310.13451v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2310.13451

Submission history

From: Donghuo Zeng
[v1] Fri, 20 Oct 2023 12:35:54 UTC (3,647 KB)
