Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Submitted to INTERSPEECH 2024.

Abstract

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.

Definitions, lexical features, and acoustic characteristics of spontaneous behaviors

  • Filled pause: A semantically empty element of speech that delays the transfer of the speaker's message, usually expressed as "em", "uh", etc.; acoustically, the end of the character is lengthened.
  • Repetitions: Fully repeated word sequences; refers specifically to disfluent repetitions that cannot be explained or justified by Mandarin grammatical rules. Acoustically, several characters are spoken in a short, heavy manner.
  • Stutter: Speakers may hesitate or stutter when they have trouble finding the correct words; there may be pauses in the speech.
  • Prolongation: Mainly used to indicate hesitation and to emphasize the discourse focus; acoustically, the end of the character is lengthened.
  • Doubt: Indicates a questioning tone; often appears in words like "Huh? What?"; ends with a rising tone.
  • Response: Indicates a responsive tone; often appears in words like "Uh, hey!"; fast and firm speech.
  • Surprise: Expresses surprise, realization, or discovery; excited, with a rising tone at the end.
  • Positive feedback: Positively valenced content, including feedback, good news, etc.; often appears in words like "um, Yeah"; excited tone and faster speech.
  • Reminder: Reminds someone to pay attention to something; heavier tone and fast speech, to attract the other person's attention.
  • Realization: Indicates sudden understanding or enlightenment; often appears in words like "Oh, ah".
  • Sigh: Indicates helplessness or sadness; often appears in words like "hey"; depressed, with a falling tone.
  • Coquetry: Indicates coquetry, an attempt to please, or affectation; generally has a higher tone and affects the prosody of the entire sentence.
  • Snort: Indicates dissatisfaction or anger with a situation, expressed as a humming sound; very short in duration.
  • Smile: The lightest degree of laughter, generally out of politeness.
  • Cachinnation: Very loud and unrestrained laughter that occurs when a person is very happy; high-pitched tone.
  • Wry smile: A forced smile when in a bad mood; lower tone.
  • Awkward laughter: Laughter resulting from embarrassment, helplessness, or self-mockery.
  • Scoff: A laugh containing sarcasm or dissatisfaction.
  • Involuntary laughter: Laughter that cannot be controlled and is emitted involuntarily.
  • NOTE: Each laughter category has a corresponding laughter token that is input on the text-token side.
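
The taxonomy above can be written down as a small lookup structure. The following is a minimal sketch, not the authors' implementation: the enum simply mirrors the category names in the list, the grouping of the last six entries as laughter categories is our reading of their definitions, and the token format returned by `laughter_token` is hypothetical (the page states only that each laughter category has its own text-side token, not what the tokens look like).

```python
from enum import Enum


class Behavior(Enum):
    """The 19 spontaneous behavior categories named in the list above."""
    FILLED_PAUSE = "Filled pause"
    REPETITIONS = "Repetitions"
    STUTTER = "Stutter"
    PROLONGATION = "Prolongation"
    DOUBT = "Doubt"
    RESPONSE = "Response"
    SURPRISE = "Surprise"
    POSITIVE_FEEDBACK = "Positive feedback"
    REMINDER = "Reminder"
    REALIZATION = "Realization"
    SIGH = "Sigh"
    COQUETRY = "Coquetry"
    SNORT = "Snort"
    SMILE = "Smile"
    CACHINNATION = "Cachinnation"
    WRY_SMILE = "Wry smile"
    AWKWARD_LAUGHTER = "Awkward laughter"
    SCOFF = "Scoff"
    INVOLUNTARY_LAUGHTER = "Involuntary laughter"


# Assumption: these six are the "laughter categories" referred to in the note.
LAUGHTER_BEHAVIORS = frozenset({
    Behavior.SMILE,
    Behavior.CACHINNATION,
    Behavior.WRY_SMILE,
    Behavior.AWKWARD_LAUGHTER,
    Behavior.SCOFF,
    Behavior.INVOLUNTARY_LAUGHTER,
})


def laughter_token(behavior: Behavior) -> str:
    """Return a HYPOTHETICAL text-side token for a laughter category.

    The real token names used by the system are not given on this page.
    """
    if behavior not in LAUGHTER_BEHAVIORS:
        raise ValueError(f"{behavior.value!r} is not a laughter category")
    return f"<laughter:{behavior.name.lower()}>"
```

For example, `laughter_token(Behavior.SCOFF)` yields the illustrative string `<laughter:scoff>`, while a non-laughter category such as `Behavior.STUTTER` raises `ValueError`.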

    Audio samples for different models

  • FastSpeech 2 : A vanilla FastSpeech 2 trained on the spontaneous corpus; it does not explicitly model spontaneous behavior.
  • VALL-E : The neural codec language model VALL-E, trained on two Mandarin corpora and used as our baseline model.
  • Base-L : VALL-E with syntactic-aware spontaneous behavior modeling but without the prosody representations.
  • Proposed : The model proposed in this paper, which adds both syntactic-aware spontaneous behavior modeling and spontaneous prosody modeling to VALL-E.
  • NOTE: In the text, <laughter> indicates laughter. A parenthesized pair (behavior type in Chinese, behavior type in English) denotes the type of spontaneous behavior, and the text corresponding to the spontaneous behavior is bolded.
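
The annotation convention in the note above can be read mechanically. The sketch below is only an illustration of that convention as it appears in the sample tables on this page, not part of the proposed system: the function name `parse_sample` and the return layout are hypothetical, and the pattern accepts both ASCII and full-width parentheses and commas because the page's original punctuation may differ from this text rendering.

```python
import re

# Matches a label such as "(嘲笑,Scoff)": Chinese type, a comma, English type.
LABEL_RE = re.compile(r"[((]([^(),,()]+)[,,]([^()()]+)[))]")
LAUGHTER_TOKEN = "<laughter>"


def parse_sample(annotated: str):
    """Split an annotated sample into plain text, labels, and a laughter flag.

    Returns (plain_text, labels, has_laughter), where labels is a list of
    (chinese_type, english_type) tuples in order of appearance.
    """
    labels = [
        (m.group(1).strip(), m.group(2).strip())
        for m in LABEL_RE.finditer(annotated)
    ]
    # Remove the parenthesized labels but keep the <laughter> token in place.
    plain_text = LABEL_RE.sub("", annotated)
    return plain_text, labels, LAUGHTER_TOKEN in plain_text
```

Applied to a row such as `<laughter>(嘲笑,Scoff)你这个技术,还是先赢了他再说吧。`, this returns the sentence with the label removed, the single pair `("嘲笑", "Scoff")`, and a laughter flag of `True`. A row without any label yields an empty label list.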

    MOS

    Target Chinese Text | FastSpeech 2 | VALL-E | Base-L | Proposed
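    <laughter>(忍不住笑,Involuntary laughter)好啊我今天就过来找你。 [EN: Sure, I'll come over to find you today.]
    那那那(结巴,Stutter)你做买卖啊。 [EN: So, so, so you're in business, then.]
    今天是个好日子,那就做两组普拉提吧! [EN: Today is a good day, so let's do two sets of Pilates!]
    <laughter>(嘲笑,Scoff)你这个技术,还是先赢了他再说吧。 [EN: With skills like yours, you'd better beat him first before you talk.]
    <laughter>(大笑,Cachinnation)这个笑话太好笑了! [EN: This joke is so funny!]
    哦(醒悟,Realization)你原来以为这是在家里呀? [EN: Oh, so you thought this was at home?]
    嗯(赞同,Positive feedback),风景真美丽呀。 [EN: Mm, the scenery is really beautiful.]
    你只要想见,随时都可以见的呀(撒娇,Coquetry)! [EN: Whenever you want to meet, we can meet anytime!]
    喂(提醒,Reminder),是不是遇到什么麻烦了,我能帮你什么吗? [EN: Hey, have you run into some trouble? Is there anything I can help with?]
    嗯?(疑惑,Doubt)卖塑料瓶可以吗? [EN: Hm? Is it okay to sell plastic bottles?]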

    Comparison of manually annotated and model-predicted spontaneous behavior labels (ABX)

    To demonstrate that predicted labels also produce speech with reasonable spontaneous behavior, the labels are not shown in the sample text.

    Target Chinese Text | Proposed-manual | Proposed-predicted
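    等一下,呃,这个好像不是这样的。 [EN: Wait a moment, uh, this doesn't seem to be right.]
    唉,你又觉得我不乖了是吗? [EN: Sigh, you think I'm misbehaving again, don't you?]
    好的好的,芋泥啵啵奶茶大杯不加冰,稍等五分钟,马上好! [EN: Okay, okay, one large taro boba milk tea with no ice; please wait five minutes, it'll be ready soon!]
    <laughter>你怎么在这里啊。 [EN: What are you doing here?]
    哼,很多人总是一边嫌弃一边还玩。 [EN: Hmph, lots of people always complain about it yet keep playing.]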

    Ablation Study

    Investigation on spontaneous prosody modeling

    Target Chinese Text | Proposed | without spontaneous prosody modeling
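    你,你(结巴,Stutter)瞎说。 [EN: Y-you, you're talking nonsense.]
    <laughter>(忍不住笑,Involuntary laughter)不是,这个不是这样放的啦。 [EN: No, this isn't how it's supposed to be placed.]
    嗯(撒娇,Coquetry),好困,让我再睡一会吧。 [EN: Mm, I'm so sleepy, let me sleep a little longer.]
    我最近开始学瑜伽,感觉对身体和心灵都很有好处。 [EN: I recently started learning yoga, and I feel it is very good for both body and mind.]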

    Investigation on spontaneous behavior modeling

    Target Chinese Text | Proposed | without spontaneous behavior modeling
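    <laughter>(微笑,Smile)先生您的酒店在这里,请跟我走。 [EN: Sir, your hotel is here; please follow me.]
    哼(撒娇,Coquetry),你再这样我就不理你了啊。 [EN: Hmph, if you keep this up I'll stop talking to you.]
    那(填充停顿,Filled pause),你还爱他吗? [EN: So, do you still love him?]
    你周末(填充停顿,Filled pause)有什么计划?我们可以一起,(填充停顿,Filled pause),去看场电影或者散步。 [EN: What are your plans for the weekend? We could, together, go see a movie or take a walk.]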

    Controllability of spontaneous behaviors

    NOTE: Text corresponding to the spontaneous behavior is bolded.

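    Target Chinese Text | Spontaneous behavior type | Audio
    ,风景真美丽呀。 [EN: the scenery is really beautiful.] | 赞同,Positive feedback
    ,风景真美丽呀。 [EN: the scenery is really beautiful.] | 撒娇,Coquetry
    ,风景真美丽呀。 [EN: the scenery is really beautiful.] | 填充停顿,Filled pause
    <laughter>好啊,我今天就过来找你 [EN: Sure, I'll come over to find you today] | 忍不住笑,Involuntary laughter
    <laughter>好啊,我今天就过来找你 [EN: Sure, I'll come over to find you today] | 尬笑,Awkward laughter
    <laughter>好啊,我今天就过来找你 [EN: Sure, I'll come over to find you today] | 微笑,Smile
    你周末有什么计划?我们可以一起,去看场电影或者散步。 [EN: What are your plans for the weekend? We could go see a movie or take a walk together.] | 填充停顿,Filled pause
    你周末有什么计划?我们可以一起,去看场电影或者散步。 [EN: What are your plans for the weekend? We could go see a movie or take a walk together.] | 结巴,Stutter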