Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

Vikramjit Mitra (ORCID: 0000-0002-2721-3976), Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, and Erdrin Azemi
Apple, Cupertino, California, USA
(5 June 2024; 13 July 2024; 1 July 2024)
Abstract.

The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate ($RR$) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure $RR$ (the number of breaths one takes in a minute) require specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate $RR$ using bio-sensor signals as input. Speech-based estimation of $RR$ can offer an effective approach to measure the vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate $RR$ from speech segments obtained from subjects speaking to a close-talking microphone device. Data were collected from N=26 individuals, where the groundtruth $RR$ was obtained through commercial grade chest-belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time-series with low root-mean-squared error and high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate $RR$ with a low mean absolute error ($MAE$) of $\approx 1.6$ breaths/min.

Respiration rate, speech processing, convolutional neural network, recurrent neural network, foundation models.
conference: BIOKDD; August 25, 2024; Barcelona, Spain
ccs: Human-centered computing Ubiquitous and mobile computing systems and tools

1. Introduction

The lungs play a central role in speech vocalization: they act as the source of air that is pumped through the vocal tract, which acts as a filter (Stevens, 2000) to generate acoustic speech. Breathing is the source of most sounds that humans vocalize, and speech production requires control and coordination of breathing and speech articulation, also known as speech breathing (Fuchs and Rochet-Capellan, 2021). Speech breathing demands more effort than regular breathing: it is characterized by short inhalations that minimize interruptions during speech production, whereas regular breathing consists of equal phases of inhalation and exhalation (Hixon, 1987). Due to the short inhalations, the velocity of air inflow is higher than in regular breathing (Conrad and Schoenle, 1979); hence, breath sound is normally audible in speech (Arafath K. and Routray, 2019). The volume of air exhaled during speech is influenced by the length and loudness of the intended utterance, and the exhale duration depends on the linguistic intent and the sounds produced during speech production (Winkworth et al., 1994; Klatt et al., 1968). Speech production and breathing are inherently coupled, and Nallanthighal et al. (2020) aimed at sensing speech breathing patterns from the linguistic content and prosodic factors of speech.

Respiratory rate ($RR$) is a vital metric: studies have shown that $RR$ is the most valid marker of exertion (Nicolò et al., 2014, 2017), and that a reduction in $RR$ is an indicator of a person’s relaxation response (Grant and Rainville, 2009; Wielgosz et al., 2016; Kral et al., 2022, 2023) and self-reported well-being (Kral et al., 2023). Speech breathing parameters have been used for clinical applications (Solomon and Hixon, 1993) as well as for affective analysis (Goldman-Eisler, 1955; Heim et al., 1968). Prior work on breath-sound detection from audio has focused on the detection and categorization of particular breath sounds to distinguish between healthy and abnormal breath sounds (Li et al., 2017; Castro and Marti-Puig, 2014). $RR$ estimation has been investigated using both contact-based and non-contact-based sensors (Sierra et al., 2006, 2004; Ren et al., 2015; Kumar et al., 2021; Ahmed et al., 2023; Rahman et al., 2022), including nasal breath recordings and wearable microphones. In this work, we investigate estimating respiratory parameters from speech recorded using close-talking microphones, which are more likely to sense respiratory sounds in speech than distant microphones, due to their proximity to the mouth.

Prior work on speech breathing focused mostly on traditional acoustic features such as log-mel spectrograms (Nallanthighal et al., 2020), or their discrete cosine transformed counterparts (a.k.a. mel-frequency cepstral coefficients or MFCCs) (Arafath K. and Routray, 2019; Ruinskiy and Lavner, 2007; MacIntyre et al., 2020). However, with limited-size datasets, such representations make the downstream machine learning (ML) models prone to over-fitting, and as a consequence restrict their generalization capacity and robustness. Recent advances in foundation models (Bommasani et al., 2021) have resulted in a significant performance boost for speech technologies, where pre-trained model representations (Baevski et al., 2020; Hsu et al., 2021) have shown state-of-the-art performance for speech recognition (Zuluaga-Gomez et al., 2023), speaker recognition (Zuluaga-Gomez et al., 2023), and emotion recognition (Mitra et al., 2022). Representations from pre-trained foundation models have demonstrated better generalization capacity and robustness across different speech tasks, under various acoustic conditions and for multiple languages; hence, we hypothesize that such representations will be useful for the task of speech-based respiration parameter estimation.

Self-supervised learned (SSL) models such as Wav2Vec2 (Baevski et al., 2020) or HuBERT (Hsu et al., 2021) are trained on large volumes of unlabeled data and are anticipated to learn acoustic units from the training data. The learned acoustic units should be discriminable in their spectro-temporal representations, and represent distinct acoustic phonetic units (such as vowels, voiced/unvoiced consonants, pauses, aspirated noise etc.) or their sub-states.

In this work, we aim to:
(1) estimate the respiration time-series signal from speech data,
(2) obtain the $RR$ measure from speech data, and
(3) detect inhale events within the speech data.

We hypothesize that pre-trained representations should have information that can help with the above tasks and demonstrate better performance compared to standard mel-filterbank (MFB) based acoustic features given that they are pre-trained with large speech datasets.

This work demonstrates that:
(1) features from pre-trained models significantly improve $RR$ estimation from speech compared to standard acoustic features.
(2) the respiration time-series (inhale/exhale signal) can be estimated from speech using an ML model, and the estimate is highly correlated with the reference measures.
(3) saliency-driven pre-trained representations can reduce the dimensionality of the input representation space and, as a consequence, the downstream model’s parameter size.
(4) fusing pre-trained representations with standard acoustic features can improve $RR$ estimation performance.

Note that unlike prior works (Nallanthighal et al., 2020; Arafath K. and Routray, 2019) that used standard acoustic features, we demonstrate that pre-trained model representations can be used for speech breath detection and can deliver superior performance. In addition, we present a metric (breath-event error rate: $BER$) that indicates how closely the detected breath events align with the groundtruth data. Finally, we present a convolutional LSTM (Conv-LSTM) model and show that network depth and the fusion of pre-trained representations with MFB help to better estimate the breath time-series data from speech.

The rest of the paper is organized as follows: Section (2) presents the dataset used in our study, Section (3) introduces feature representations investigated and details on the acoustic model and its parameters, Section (4) presents the results, followed by conclusions in Section (5).

2. Data

Publicly available speech datasets containing respiration time-series references do not exist, hence we collected data internally. The 2020 speech paralinguistic challenge (Schuller et al., 2020) explored speech-based respiration event detection; however, the dataset used in that challenge is not publicly available. Data were collected from 26 adult speakers under realistic acoustic environments (containing background noise) in an indoor setting. The participants were American English speakers between the ages of 25 and 60, balanced by gender. Speech was recorded using microphone-enabled, wearable headphones, and the speech recordings and chest-belt measurements were collected across multiple sessions. During the data collection, participants were prompted to read a paragraph, and each reading session lasted from 45 to 90 seconds. Note that conversational speech is not considered in this study; however, we expect that findings from this work should generalize to such speech.

A strain-gauge chest-belt sensor (Vernier Go Direct Respiration Belt) was used during the data collection to obtain groundtruth reference chest contraction and relaxation (corresponding to inhalation and exhalation) measurements. Figure 1 shows a plot of a sample respiration signal spectrogram and its corresponding chest belt measurement. Due to calibration and subject variability, chest-belt measurements were observed to have variations, hence a quality check of the chest-belt measurement was performed manually and any data with erroneous measurement were removed. Chest-belt data were z-score normalized and dynamic range compressed before being used for model training. For some sessions the participant did not speak, hence they did not contain any recorded speech; such data were excluded from our experiments.

Data augmentation was performed to simulate faster and slower breathing by altering the speed of the entire audio signal. We used 25 hours of speech data from 26 speakers in this study, where data from 3 and 4 speakers (roughly one hour of speech data per speaker) were set aside for the validation and test sets, respectively, and the remaining 19 speakers were used for model training. Note that the validation and test split speakers were balanced by gender. Speech data were segmented into chunks of 30 seconds for model training, to ensure that each chunk contains at least one full breath cycle.
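The speed-based augmentation and the 30-second chunking described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the sampling rate, the speed factors, the resampling method, or how a trailing partial chunk is handled. Here we assume 16 kHz audio, plain linear interpolation (which has no anti-aliasing and also shifts pitch), and we drop any remainder shorter than 30 seconds. The function names `change_speed` and `segment` are hypothetical.

```python
import numpy as np

def change_speed(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample so the signal plays `factor` times faster (>1) or slower (<1).

    Linear interpolation is an assumption made for brevity.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal from which each output sample is read.
    src_pos = np.linspace(0.0, len(signal) - 1, num=n_out)
    return np.interp(src_pos, np.arange(len(signal)), signal)

def segment(signal: np.ndarray, rate_hz: int, chunk_s: float = 30.0) -> np.ndarray:
    """Split into non-overlapping chunks of `chunk_s` seconds.

    Dropping the trailing partial chunk is an assumption.
    """
    chunk_len = int(chunk_s * rate_hz)
    n_chunks = len(signal) // chunk_len
    return signal[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)

sr = 16000  # assumed sampling rate, not stated in the paper
audio = np.random.default_rng(0).standard_normal(sr * 65)  # 65 s placeholder

faster = change_speed(audio, 1.1)  # illustrative factors only
slower = change_speed(audio, 0.9)
chunks = segment(audio, sr)        # two full 30 s chunks, 5 s discarded
```

To keep audio and reference aligned, the chest-belt signal would presumably need to be warped by the same factor before segmentation; `segment` takes the rate as an argument so it can be reused for a reference signal sampled at a different rate.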

Figure 1. Speech spectrogram [top] and the corresponding chest-belt pressure measurement (in Newton) [bottom].
Figure 2. Histogram of $RR$ (in br/min) estimated from chest-belt data in the dataset.

2.1. Analysis

We analyzed the data used in this study to measure the variance in $RR$, both within and across speakers. Figure 2 shows the histogram of $RR$ estimated from the chest-belt data obtained from the subjects in our dataset. Figure 3 shows the variance of $RR$ by subject: $RR$ varied not only across subjects, but also within a subject across multiple sessions. The $RR$ values in the dataset ranged from 5 to 19 breaths/min (denoted as br/min).

3. Methods

3.1. Acoustic Features

The baseline acoustic features consist of 40-dimensional MFB energies, computed over a 25 ms analysis window with a frame interval of 10 ms.
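A minimal NumPy sketch of such a 40-dimensional MFB front end is given below. Only the filter count, the 25 ms window and the 10 ms frame interval come from the text. The 16 kHz sampling rate, the 512-point FFT, the Hamming window, the mel-scale formula and the log compression are assumptions made for illustration, and the names `mel_filterbank` and `log_mfb` are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        # Triangle: rises from lo to ctr, falls from ctr to hi, zero elsewhere.
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mfb(audio, sr=16000, win_s=0.025, hop_s=0.010, n_mels=40, n_fft=512):
    """Log mel-filterbank energies, shape (n_frames, n_mels)."""
    win, hop = int(round(win_s * sr)), int(round(hop_s * sr))
    n_frames = 1 + (len(audio) - win) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

# 30 s placeholder signal at the assumed 16 kHz rate.
feats = log_mfb(np.random.default_rng(0).standard_normal(16000 * 30))
```

Under these assumptions a 30-second chunk yields roughly 3000 frames of 40 values each, that is, about 100 feature frames per second.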

Figure 3. Mean and std-dev of $RR$ (in br/min) by speaker.

3.2. Features from Pre-Trained Models

We explored embeddings generated from a pre-trained Wav2Vec2-base (Wav2Vec2) model (Baevski et al., 2020); we selected Wav2Vec2-base due to its smaller size. Note that the pre-trained acoustic model was not fine-tuned on our data, and its parameters were frozen to generate the representations for our dataset. The Wav2Vec2 model, with 12 transformer layers and 768 embedding dimensions, was pre-trained on 960 hours of speech from the Librispeech dataset, and we investigated the representations obtained from the $2^{nd}$ through the last transformer layers (see https://pytorch.org/audio/stable/pipelines). Representations from the initial layers are expected to contain more acoustic information, while those from the latter layers are expected to contain more phonetic information.

Figure 4. (A) Architecture of the single-feature (Conv-LSTM) network, and (B) Feature-fused network

3.3. Model

We used a convolutional network with long short-term memory units (Conv-LSTM) consisting of as many time-convolution filters as the number of feature inputs (40 for MFB and 768 for Wav2Vec2), 128 LSTM units and 128 neurons in the embedding layer. The model architecture is shown in Figure 4.A. Additionally, we investigated feature fusion as shown in Figure 4.B. Given the ability of foundation models (such as Wav2Vec2) to learn large dimensional acoustic representations through multiple tiers of transformer layers, the downstream models trained on the foundation model representations can be simple in architecture, as reported in (Mitra et al., 2022). In this work we did not observe any evidence of performance gain from increasing model complexity (by introducing additional layers), hence we focused on exploring a simple (Conv-LSTM) architecture as shown in Figure 4.A.

Models were trained using the concordance correlation coefficient ($CCC$) (Lawrence and Lin, 1989) as the loss function (see Equation (1)). In Equation (1), $\mu_x$ and $\mu_y$ are the means, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances of the estimated and groundtruth time-series data, and $\rho$ is the correlation coefficient between the two variables. The models were trained with a mini-batch size of 64, using the Adam optimizer with a learning rate of 0.005. Early stopping was performed based on the validation-set loss.

(1) $CCC = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}.$
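Equation (1) can be computed directly from two time series, as in the NumPy sketch below. The numerator uses the identity $\rho\sigma_x\sigma_y = Cov(x, y)$. The paper states only that $CCC$ is used as the loss; turning it into a quantity to minimize as $1 - CCC$ is a common choice and is an assumption here, as are the function names.

```python
import numpy as np

def ccc(x, y) -> float:
    """Concordance correlation coefficient of Equation (1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    # Covariance equals rho * sigma_x * sigma_y.
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return float(2.0 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2))

def ccc_loss(x, y) -> float:
    """Assumed loss form: 0 for perfect agreement, up to 2 for perfect inversion."""
    return 1.0 - ccc(x, y)

t = np.linspace(0.0, 2.0 * np.pi, 200)
reference = np.sin(t)
identical = ccc(reference, reference)
shifted = ccc(reference + 0.5, reference)   # same shape, biased mean
scaled = ccc(0.5 * reference, reference)    # same shape, smaller amplitude
```

Unlike the plain correlation coefficient, which equals 1 for both the shifted and the scaled estimate, $CCC$ drops below 1 in either case, so it also penalizes offset and amplitude mismatch against the chest-belt reference.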

3.4. Salient representations

The pre-trained model embeddings have large dimensionality (for example, the Wav2Vec2 model generates 768 dimensions), resulting in increased downstream model size. To reduce the feature dimension, we obtained breath-salient representations from Wav2Vec2 by relying on the relationships between the input representation and the targets. Prior studies (Mitra and Franco, 2020; Mitra et al., 2023) have explored the input-output relationships of activations to obtain neural saliency, and we use a similar idea to obtain salient representations for respiration signal estimation. Let the $k^{th}$ dimension of the $N$-dimensional Wav2Vec2 embedding for an utterance $y$ be represented by a vector $H_{k,y} = [X_{1,k}, \dots, X_{M,k}]$, where $M$ denotes the sequence length. Let the reference respiration time-series be $L$ for utterance $y$. The cross-correlation based saliency ($CCS_k$) of the $k^{th}$ dimension is given by:

(2) $S_{CCS_k} = \left\| \frac{Cov(H_k, L)}{\sigma_{H_k}\sigma_L} \right\| + \gamma_k,$

where Equation 2 computes the absolute cross-correlation between the time-series $L$ and the embeddings $H_k$ for dimension $k$ over all utterances in the training set. $\gamma_k$ is the sum of the weighted cross-correlation between the $k^{th}$ dimension and all other dimensions, as shown in Equation 3:

(3) $\gamma_k = \frac{1}{N-1} \sum_{j=1, j \neq k}^{N} w_j \left\| \frac{Cov(H_k, H_j)}{\sigma_{H_k}\sigma_{H_j}} \right\|,$

where $w_j = \left\| \frac{Cov(H_j, L)}{\sigma_{H_j}\sigma_L} \right\|$.

In our experiments we used $S_{CCS}$ given in Equation 2 to select salient dimensions in the pre-trained representations.
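Equations 2 and 3 can be sketched in NumPy as below. The sketch assumes that the embeddings and the reference signal have already been brought to the same frame rate and stacked over the training utterances into one matrix; how the paper aggregates over utterances is not specified, so this pooling is an assumption. The function names `ccs_saliency` and `select_salient` are hypothetical.

```python
import numpy as np

def ccs_saliency(H: np.ndarray, L: np.ndarray) -> np.ndarray:
    """Saliency S_CCS per embedding dimension.

    H: (M, N) embeddings, M frames by N dimensions.
    L: (M,) reference respiration signal aligned to the same frames.
    """
    M, N = H.shape
    eps = 1e-12  # guards against constant (zero-variance) dimensions
    Hz = (H - H.mean(axis=0)) / (H.std(axis=0) + eps)
    Lz = (L - L.mean()) / (L.std() + eps)
    w = np.abs(Hz.T @ Lz) / M          # |corr(H_k, L)|: first term of Eq. 2, and w_j
    R = np.abs(Hz.T @ Hz) / M          # |corr(H_k, H_j)| for all pairs
    np.fill_diagonal(R, 0.0)           # the sum in Eq. 3 excludes j == k
    gamma = (R @ w) / (N - 1)          # Eq. 3
    return w + gamma                   # Eq. 2

def select_salient(H: np.ndarray, L: np.ndarray, keep_frac: float) -> np.ndarray:
    """Indices of the top `keep_frac` fraction of dimensions by saliency."""
    scores = ccs_saliency(H, L)
    n_keep = max(1, int(round(keep_frac * H.shape[1])))
    return np.sort(np.argsort(scores)[::-1][:n_keep])

# Toy data: 8 dimensions, one of which (index 3) tracks the reference.
rng = np.random.default_rng(0)
L_ref = np.sin(np.linspace(0.0, 20.0, 500))
H_toy = rng.standard_normal((500, 8))
H_toy[:, 3] = L_ref + 0.01 * rng.standard_normal(500)

scores = ccs_saliency(H_toy, L_ref)
kept = select_salient(H_toy, L_ref, keep_frac=0.5)
```

The $\gamma_k$ term raises the score of a dimension that is itself correlated with other respiration-relevant dimensions, so the ranking is not driven by the direct correlation with $L$ alone.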

Table 1. Baseline performance (on test-set) for respiration time-series estimation using $CCC$ and $RMSE$ measures using MFB and Wav2Vec2 representations

| Representations | Layer | Time Series: $CCC$ ↑ | Time Series: $RMSE$ ↓ | RR Estimate: $MAE$ ↓ | RR Estimate: $Acc@2bpm$ ↑ |
|---|---|---|---|---|---|
| MFB | N.A. | 0.68 | 0.13 | 2.85 | 64.1 |
| Wav2Vec2 | 2 | 0.73 | 0.12 | 2.67 | 66.4 |
| Wav2Vec2 | 3 | 0.75 | 0.12 | 2.56 | 67.2 |
| Wav2Vec2 | 4 | 0.76 | 0.11 | 2.52 | 66.2 |
| Wav2Vec2 | 5 | 0.73 | 0.12 | 2.59 | 66.0 |
| Wav2Vec2 | 6 | 0.75 | 0.12 | 2.67 | 66.3 |
| Wav2Vec2 | 7 | 0.76 | 0.11 | 2.56 | 65.5 |
| Wav2Vec2 | 8 | 0.75 | 0.12 | 2.35 | 66.7 |
| Wav2Vec2 | 9 | 0.75 | 0.13 | 2.56 | 64.8 |
| Wav2Vec2 | 10 | 0.74 | 0.13 | 2.57 | 64.6 |
| Wav2Vec2 | 11 | 0.71 | 0.12 | 2.75 | 63.8 |
| Wav2Vec2 | 12 | 0.69 | 0.12 | 2.86 | 63.5 |

Table 2. Performance on Validation set for respiration time-series estimation using $CCC$ and $RMSE$ measures using MFB and Wav2Vec2 representations

| Representations | Layer | Time Series: $CCC$ ↑ | Time Series: $RMSE$ ↓ | RR Estimate: $MAE$ ↓ | RR Estimate: $Acc@2bpm$ ↑ |
|---|---|---|---|---|---|
| MFB | N.A. | 0.57 | 0.15 | 3.61 | 61.5 |
| Wav2Vec2 | 2 | 0.62 | 0.14 | 2.89 | 62.7 |
| Wav2Vec2 | 3 | 0.63 | 0.14 | 2.64 | 64.7 |
| Wav2Vec2 | 4 | 0.66 | 0.14 | 2.32 | 67.8 |
| Wav2Vec2 | 5 | 0.65 | 0.14 | 2.54 | 68.1 |
| Wav2Vec2 | 6 | 0.66 | 0.14 | 2.60 | 66.2 |
| Wav2Vec2 | 7 | 0.67 | 0.13 | 2.35 | 67.4 |
| Wav2Vec2 | 8 | 0.66 | 0.14 | 2.43 | 69.3 |
| Wav2Vec2 | 9 | 0.63 | 0.14 | 2.55 | 66.7 |
| Wav2Vec2 | 10 | 0.59 | 0.14 | 2.57 | 65.6 |
| Wav2Vec2 | 11 | 0.59 | 0.15 | 2.71 | 63.3 |
| Wav2Vec2 | 12 | 0.58 | 0.15 | 2.91 | 62.2 |

Figure 5. Segment-level performance by number of LSTM layers for models trained with MFB and Wav2Vec2 features
Figure 6. Speech spectrogram [top] and the corresponding chest-belt time-series (groundtruth) in blue and the estimated time-series from the model in green [bottom].

4. Results

We trained baseline acoustic models using (i) MFB and (ii) Wav2Vec2 embeddings obtained from the $2^{nd}$ through $12^{th}$ transformer layers of the model. Table 1 shows the baseline respiration time-series estimation performance obtained from the MFB and Wav2Vec2 representations. For the time-series respiration signal estimation, we report $CCC$ (Lawrence and Lin, 1989) and root-mean-squared error ($RMSE$). We also evaluated the segment-level $RR$ estimation performance using the following metrics: mean-absolute error ($MAE$) and accuracy at 2 br/min error tolerance ($Acc@2bpm$). $MAE$ is computed by comparing the number of breath events detected in the estimated time-series obtained from the model with the number observed in the chest-belt groundtruth signal.

Accuracy for a segment is measured at a tolerance bound of +/-2 breaths/min (bpm) (we made this selection to have a conservative error-bound), where an estimate outside the tolerance bound is treated as an error. Table 1 shows that the pre-trained representations from Wav2Vec2 perform better than the MFB features on the test set, with a relative improvement of at least 2.4% in $CCC$ and a relative reduction of 6.8% in $RMSE$.
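The two segment-level metrics can be sketched as follows. The sketch starts from per-segment breath counts, because the paper does not specify how breath events are detected in the estimated time-series. It assumes 30-second evaluation segments (the chunk length used for training) and treats an error of exactly 2 br/min as within tolerance; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def breaths_per_min(n_breaths, segment_s: float = 30.0):
    """Convert a breath count over a segment into br/min."""
    return np.asarray(n_breaths, dtype=float) * 60.0 / segment_s

def rr_metrics(rr_est, rr_ref, tol: float = 2.0):
    """Return (MAE in br/min, Acc@tol in percent) over segments."""
    err = np.abs(np.asarray(rr_est, dtype=float) - np.asarray(rr_ref, dtype=float))
    mae = float(err.mean())
    # A segment counts as correct when its error is within the tolerance
    # (inclusive bound is an assumption).
    acc = float(100.0 * np.mean(err <= tol))
    return mae, acc

# Illustrative breath counts for four segments (not data from the paper).
est_counts = [6, 7, 5, 9]
ref_counts = [6, 6, 7, 8]
mae, acc = rr_metrics(breaths_per_min(est_counts), breaths_per_min(ref_counts))
```

With 30-second segments, one miscounted breath shifts the estimate by 2 br/min, so under these assumptions the tolerance corresponds to an error of at most one breath event per segment.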

Interestingly, Table 1 also shows that representations from different transformer layers of Wav2Vec2 had different impacts on performance: the representations from layers 4 to 9 were more effective than those from the final layers 10 through 12. The best performance was obtained from layers 4 and 7, which gave a 12.3% relative improvement in $CCC$ and a 14.3% relative reduction in $RMSE$ compared to the MFB features. Even though we used the SSL-trained Wav2Vec2 (which is not fine-tuned on any specific task), the final layers may contain more phonetic-discriminatory information, which may not be essential for breath-signal estimation (see Section 3.2). The middle layers may contain broader acoustic-level information that helps to detect breathing patterns in speech, speech activity and silent pauses; hence, they yielded better performance than the final layers. Given the findings in Table 1, we use the representations from layers 4 and 7 in the remainder of this paper to train (Conv-LSTM) models with 2 LSTM layers.

Next, we investigated the depth of the LSTM layers. Figure 5 shows that a 2-layered LSTM model performed the best overall, providing higher $Acc@2bpm$ and lower $MAE$ for all the features. Table 2 shows the validation-set performance when the MFB features and the representations from different transformer layers of Wav2Vec2 were used.

We also investigated whether saliency-driven feature selection can help to reduce the model size while retaining the model performance. Using the approach outlined in Section 3.4, we investigated pruning the input representations by keeping only 90%, 75%, 50% and 25% of them, which in turn reduced the model parameter size by 9%, 22%, 44% and 66% respectively. Table 3 shows the results obtained from selecting salient representations from Wav2Vec2 layers 4 and 7. We introduce a metric, breath error rate ($BER$), to measure the accuracy of detecting breath events. $BER$ is computed by comparing the inhalation events in the groundtruth and estimated time-series signals, where the only error types are deleted inhale events (deletion errors, $D$) and inserted inhale events (insertion errors, $I$), normalized by the total number of inhale events $N$ in the groundtruth data:

(4) $BER = \frac{I + D}{N}.$
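Equation (4) needs a rule for deciding which detected inhale events correspond to which groundtruth events. The paper does not state that rule, so the sketch below assumes a simple one: events are given as timestamps in seconds and are matched in time order when they lie within a tolerance `tol_s` of each other. Both the matching procedure and the 1-second default are assumptions. The result is scaled to percent to match how $BER$ is reported in Table 3, and the function name is hypothetical.

```python
def breath_error_rate(ref_times, est_times, tol_s: float = 1.0):
    """Return (BER in percent, insertions, deletions).

    ref_times: groundtruth inhale timestamps in seconds.
    est_times: detected inhale timestamps in seconds.
    """
    ref = sorted(ref_times)
    est = sorted(est_times)
    if not ref:
        raise ValueError("BER is undefined without groundtruth inhale events")

    matched = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tol_s:
            matched += 1          # detected event aligns with a reference event
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1                # detected event with no reference: insertion
        else:
            i += 1                # reference event with no detection: deletion

    deletions = len(ref) - matched
    insertions = len(est) - matched
    ber = 100.0 * (insertions + deletions) / len(ref)
    return ber, insertions, deletions

# Illustrative timestamps (not data from the paper).
ref_events = [2.0, 7.5, 13.0, 18.2, 24.0]
est_events = [2.3, 7.1, 10.0, 18.0]
ber, n_ins, n_del = breath_error_rate(ref_events, est_events)
```

In this toy case three detections align with reference events, the detection at 10.0 s is an insertion, and the reference events at 13.0 s and 24.0 s are deletions. Because insertions are counted against the number of reference events, $BER$ can exceed 100% when a model over-detects heavily.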
Table 3. Respiration time-series estimation performance (in $CCC$ and $RMSE$) and segmental $RR$ estimation (in $MAE$, $Acc@2bpm$ and $BER$) from Wav2Vec2 layers 4 and 7 and fusion of layer 4 with MFB, after saliency based representation selection and their corresponding parameter size reduction

| Feature | % Input Reps. | Time Series: $CCC$ ↑ | Time Series: $RMSE$ ↓ | $RR$ estimate: $MAE$ ↓ | $RR$ estimate: $Acc@2bpm$ ↑ | $RR$ estimate: $BER$ ↓ | ↓ % Rel. model size |
|---|---|---|---|---|---|---|---|
| $Wav2Vec2_{4}$ | 100 | 0.75 | 0.11 | 1.58 | 84.4 | 29.8 | 0 |
| $Wav2Vec2_{4}$ | 90 | 0.76 | 0.12 | 1.89 | 77.6 | 26.8 | 8.8 |
| $Wav2Vec2_{4}$ | 75 | 0.75 | 0.11 | 2.13 | 75.5 | 29.3 | 22.0 |
| $Wav2Vec2_{4}$ | 50 | 0.76 | 0.11 | 1.80 | 78.1 | 24.9 | 44.0 |
| $Wav2Vec2_{4}$ | 10 | 0.72 | 0.12 | 1.97 | 74.5 | 32.4 | 66.0 |
| $Wav2Vec2_{7}$ | 100 | 0.77 | 0.11 | 1.77 | 80.7 | 28.7 | 0 |
| $Wav2Vec2_{7}$ | 90 | 0.77 | 0.11 | 1.89 | 79.7 | 30.1 | 8.8 |
| $Wav2Vec2_{7}$ | 75 | 0.76 | 0.11 | 2.12 | 74.0 | 29.1 | 22.0 |
| $Wav2Vec2_{7}$ | 50 | 0.76 | 0.11 | 1.91 | 76.6 | 28.3 | 44.0 |
| $Wav2Vec2_{7}$ | 10 | 0.72 | 0.12 | 2.21 | 72.4 | 37.4 | 66.0 |
| $Wav2Vec2_{4,50}$+MFB | 50 | 0.77 | 0.11 | 1.58 | 83.9 | 22.6 | 27.4 |

Table 3 shows that the representations from layer 4 performed better than those from layer 7, especially for the segment-level $RR$ metrics ($MAE$, $Acc@2bpm$ and $BER$). Selecting the top 50% of the representations based on saliency resulted in the best $BER$, with some regression in $MAE$ and $Acc@2bpm$ compared to the model trained with the full layer 4 representations. Note that the 50% representation based model is smaller than the full-representation based model by 44% (Figure 6 shows the time-series estimate from the model). The above findings indicate that: (1) the earlier layers of Wav2Vec2 contain more respiration-relevant representations, which resulted in better performance across multiple metrics, (2) an $RR$ estimation $MAE$ as low as 1.6 bpm can be achieved using speech as input data, with an $RR$ estimation accuracy as high as 84% at a tolerance of +/-2 bpm, and (3) saliency-based representation selection can reduce the model size by 44% and provide better $BER$, but with some regression in $RR$ estimation performance. Note that for segment-level $RR$ estimation the $MAE$ and $Acc@2bpm$ obtained from MFB are 2.38 and 69.3% respectively, indicating that Wav2Vec2 representations performed better than MFBs for the segment-level metrics as well.

We investigated fusion of the 50% salient layer-4 representations with MFB features ($Wav2Vec2_{4,50}$+MFB); the result is shown in the last row of Table 3. We observed that fusion of information helped to achieve the best $BER$, with $MAE$ and $Acc@2bpm$ comparable to the best single-feature system (Wav2Vec2 layer 4), and with a 27% reduction in model parameter size. The fusion results indicate that the Wav2Vec2 and MFB representations may carry complementary information, hence their fusion resulted in improved performance.

5. Conclusion

In this work we demonstrated that the respiration signal can be estimated from speech data collected through close-talking microphones. Our results show a time-series estimation performance with a CCC as high as 0.77 and an RMSE as low as 0.11, where the groundtruth respiration signal was z-score normalized. At the segment level, we observed that RR can be estimated with an MAE of 1.6 bpm. We also observed that pre-trained representations from the Wav2Vec2 SSL model performed better than standard MFB features, providing a relative MAE reduction of 33.6% and a relative improvement in estimation CCC of 10%. Additionally, we observed that fusion of Wav2Vec2 and MFB features provided the best overall performance.
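For readers who wish to reproduce the time-series metrics, the following is a minimal sketch of the concordance correlation coefficient (Lawrence and Lin, 1989) and RMSE computed against a z-score normalized reference. The helper names, the synthetic sinusoidal signal and the use of population variance are assumptions made for illustration; the exact evaluation pipeline used in this study is not shown here.

```python
import numpy as np

def zscore(x):
    """Z-score normalize a 1-D signal (population standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length signals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return float(2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2))

# Synthetic stand-in for a respiration trace (not data from this study).
t = np.linspace(0.0, 10.0, 500)
reference = zscore(np.sin(2.0 * np.pi * 0.25 * t))
# A hypothetical estimate with reduced amplitude and a small offset.
estimate = 0.8 * reference + 0.1

ccc_value = ccc(reference, estimate)
rmse_value = rmse(reference, estimate)
print(f"CCC = {ccc_value:.3f}, RMSE = {rmse_value:.3f}")
```

Unlike the Pearson correlation, CCC penalizes both the amplitude mismatch and the offset in this example, which is why it falls below 1 even though the two signals are perfectly linearly related.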

Future studies should explore the use of representations from foundation models fine-tuned with speech data containing respiration-relevant information. Additionally, the impact of subjective variance and the models' generalization capacity should be investigated using a dataset containing a larger number of subjects than was available in the dataset used in this study. A limitation of this study is that it uses a dataset containing read speech; future work should investigate spontaneous speech for estimating the respiration signal.

References

  • Ahmed et al. (2023) Tousif Ahmed, Md Mahbubur Rahman, Ebrahim Nemati, Mohsin Yusuf Ahmed, Jilong Kuang, and Alex Jun Gao. 2023. Remote breathing rate tracking in stationary position using the motion and acoustic sensors of earables. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–22.
  • Arafath K. and Routray (2019) Mohamed Ismail Yasar Arafath K. and Aurobinda Routray. 2019. Automatic measurement of speech breathing rate. In 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 1–5.
  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449–12460.
  • Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
  • Castro and Marti-Puig (2014) J. Castro and P. Marti-Puig. 2014. Real-time Identification of Respiratory Movements through a Microphone. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal (ISSN: 2255-2863). Salamanca 3, 3 (2014).
  • Conrad and Schoenle (1979) B Conrad and Paul Schoenle. 1979. Speech and respiration. Archiv für Psychiatrie und Nervenkrankheiten 226 (1979), 251–268.
  • Fuchs and Rochet-Capellan (2021) Susanne Fuchs and Amélie Rochet-Capellan. 2021. The respiratory foundations of spoken language. Annual Review of Linguistics 7 (2021), 13–30.
  • Goldman-Eisler (1955) Frieda Goldman-Eisler. 1955. Speech-breathing activity-a measure of tension and affect during interviews. British Journal of Psychology 46, 1 (1955), 53.
  • Grant and Rainville (2009) J.A. Grant and P. Rainville. 2009. Pain sensitivity and analgesic effects of mindful states in Zen Meditators: A Cross-Sectional Study. Psychosomatic Medicine 71, 1 (2009), 106–114.
  • Heim et al. (1968) Edgar Heim, Peter H Knapp, Louis Vachon, Gordon G Globus, and S Joseph Nemetz. 1968. Emotion, breathing and speech. Journal of Psychosomatic Research 12, 4 (1968), 261–274.
  • Hixon (1987) Thomas J Hixon. 1987. Respiratory function in speech and song.
  • Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
  • Klatt et al. (1968) Dennis H Klatt, KN Stevens, and J Mead. 1968. Studies of articulatory activity and airflow during speech. Annals of the New York Academy of Sciences 155, 1 (1968), 42–55.
  • Kral et al. (2022) T. Kral, R.C. Lapate, T. Imhoff-Smith, E. Patsenko, D.W. Grupe, R. Goldman, M.A. Rosenkranz, and R. J. Davidson. 2022. Long-term mindfulness training is associated with reliable differences in resting respiration rate. Journal of Cognitive Neuroscience 34, 9 (2022), 1576–1589.
  • Kral et al. (2023) Tammi RA Kral, Helen Y Weng, Vikramjit Mitra, Theodore P Imhoff-Smith, Erdrin Azemi, Robin I Goldman, Melissa A Rosenkranz, Sarah Wu, Andrew Chen, and Richard J Davidson. 2023. Slower respiration rate is associated with higher self-reported well-being after wellness training. Scientific Reports 13, 1 (2023), 15953.
  • Kumar et al. (2021) Agni Kumar, Vikramjit Mitra, Carolyn Oliver, Adeeti Ullal, Matt Biddulph, and Irida Mance. 2021. Estimating respiratory rate from breath audio obtained through wearable microphones. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 7310–7315.
  • Lawrence and Lin (1989) I. Lawrence and K. Lin. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics (1989), 255–268.
  • Li et al. (2017) S. Li, B. Lin, C. Tsai, C. Yang, and B. Lin. 2017. Design of wearable breathing sound monitoring system for real-time wheeze detection. Sensors 17, 1 (2017), 171.
  • MacIntyre et al. (2020) Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia Hamilton, and Björn W Schuller. 2020. Deep attentive end-to-end continuous breath sensing from speech. (2020).
  • Mitra et al. (2022) Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, and Erdrin Azemi. 2022. Speech emotion: Investigating model representations, multi-task learning and knowledge distillation. arXiv preprint arXiv:2207.03334 (2022).
  • Mitra and Franco (2020) V. Mitra and H. Franco. 2020. Investigation and analysis of hyper and hypo neuron pruning to selectively update neurons during unsupervised adaptation. Digital Signal Processing 99 (2020), 102655.
  • Mitra et al. (2023) Vikramjit Mitra, Jingping Nie, and Erdrin Azemi. 2023. Investigating salient representations and label Variance in Dimensional Speech Emotion Analysis. arXiv preprint arXiv:2312.16180 (2023).
  • Nallanthighal et al. (2020) Venkata Srikanth Nallanthighal, Aki Härmä, and Helmer Strik. 2020. Speech breathing estimation using deep learning methods. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1140–1144.
  • Nicolò et al. (2014) A. Nicolò, I. Bazzucchi, M. Lenti, A. S. di Palumbo, and M. Sacchetti. 2014. Neuromuscular and metabolic responses to high-intensity intermittent cycling protocols with different work-to-rest ratios. International Journal of Sports Physiology and Performance 9, 1 (2014), 151–160.
  • Nicolò et al. (2017) A. Nicolò, S. M. Marcora, I. Bazzucchi, and M. Sacchetti. 2017. Differential control of respiratory frequency and tidal volume during high-intensity interval training. Experimental Physiology 102 (2017), 934–949.
  • Rahman et al. (2022) Md Mahbubur Rahman, Tousif Ahmed, Mohsin Yusuf Ahmed, Minh Dinh, Ebrahim Nemati, Jilong Kuang, and Jun Alex Gao. 2022. Breathebuddy: Tracking real-time breathing exercises for automated biofeedback using commodity earbuds. Proceedings of the ACM on Human-Computer Interaction 6, MHCI (2022), 1–18.
  • Ren et al. (2015) Y. Ren, C. Wang, J. Yang, and Y. Chen. 2015. Fine-grained sleep monitoring: Hearing your breathing with smartphones. In 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 1194–1202.
  • Ruinskiy and Lavner (2007) Dima Ruinskiy and Yizhar Lavner. 2007. An effective algorithm for automatic detection and exact demarcation of breath sounds in speech and song signals. IEEE transactions on audio, speech, and language processing 15, 3 (2007), 838–850.
  • Schuller et al. (2020) Björn W Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, et al. 2020. The interspeech 2020 computational paralinguistics challenge: Elderly emotion, breathing & masks. (2020).
  • Sierra et al. (2004) G. Sierra, V. Telfort, B. Popov, L. Durand, R. Agarwal, and V. Lanzo. 2004. Monitoring respiratory rate based on tracheal sounds. First experiences. In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vol. 1. IEEE, 317–320.
  • Sierra et al. (2006) G. Sierra, V. Telfort, B. Popov, M. Pelletier, P. Despault, V. Lanzo, and R. Agarwal. 2006. Comparison of respiratory rate estimation based on tracheal sounds versus a capnograph. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, 6145–6148.
  • Solomon and Hixon (1993) Nancy Pearl Solomon and Thomas J Hixon. 1993. Speech breathing in Parkinson’s disease. Journal of Speech, Language, and Hearing Research 36, 2 (1993), 294–310.
  • Stevens (2000) Kenneth N Stevens. 2000. Acoustic phonetics. Vol. 30. MIT press.
  • Wielgosz et al. (2016) J. Wielgosz, B. S. Schuyler, A. Lutz, and R. J. Davidson. 2016. Long-term mindfulness training is associated with reliable differences in resting respiration rate. Scientific Reports 6, 1 (2016), 1–6.
  • Winkworth et al. (1994) Alison L Winkworth, Pamela J Davis, Elizabeth Ellis, and Roger D Adams. 1994. Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors. Journal of Speech, Language, and Hearing Research 37, 3 (1994), 535–556.
  • Zuluaga-Gomez et al. (2023) Juan Zuluaga-Gomez, Amrutha Prasad, Iuliia Nigmatulina, Seyyed Saeed Sarfjoo, Petr Motlicek, Matthias Kleinert, Hartmut Helmke, Oliver Ohneiser, and Qingran Zhan. 2023. How does pre-trained wav2vec 2.0 perform on domain-shifted asr? an extensive benchmark on air traffic control communications. In 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 205–212.