Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation
Authors:
Nameer Hirschkind,
Xiao Yu,
Mahesh Kumar Nandwana,
Joseph Liu,
Eloi DuBois,
Dao Le,
Nicolas Thiebaut,
Colin Sinclair,
Kyle Spence,
Charles Shang,
Zoe Abrams,
Morgan McGuire
Abstract:
We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
Submitted 14 June, 2024;
originally announced June 2024.