Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

Vikramjit Mitra (ORCID 0000-0002-2721-3976), Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng and Erdrin Azemi (Apple, Cupertino, California, USA)

(5 June 2024; 13 July 2024; 1 July 2024)

Abstract.

The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate ( $RR$ ) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure $RR$ (number of breaths one takes in a minute) require specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate $RR$ using bio-sensor signals as input. Speech-based estimation of $RR$ can offer an effective approach to measure the vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate $RR$ from speech segments obtained from subjects speaking to a close-talking microphone device. Data were collected from N=26 individuals, where the groundtruth $RR$ was obtained through commercial grade chest-belts and then manually corrected for any errors. A convolutional long-short term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time-series with low root-mean-squared error and high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate $RR$ with a low mean absolute error ( $MAE$ ) ${\approx 1.6\,breaths/min}$ .

Respiration rate, speech processing, convolutional neural network, recurrent neural network, foundation models.

Conference: BIOKDD; August 25, 2024; Barcelona, Spain. CCS: Human-centered computing → Ubiquitous and mobile computing systems and tools

1. Introduction

The lungs play a central role in speech vocalization: they act as the source of air that is pumped through the vocal tract, which acts as a filter (Stevens, 2000) to generate acoustic speech. Breathing is the source of most sounds that humans vocalize, and speech production requires control and coordination of breathing and speech articulation, also known as speech breathing (Fuchs and Rochet-Capellan, 2021). Speech breathing demands more effort than regular breathing: it is characterized by short inhalations that minimize interruptions during speech production, whereas regular breathing consists of equal phases of inhalation and exhalation (Hixon, 1987). Because the inhalations are short, the velocity of air inflow is higher than in regular breathing (Conrad and Schoenle, 1979); hence, breath sound is normally audible in speech (Arafath K. and Routray, 2019). The volume of air exhaled during speech is influenced by the length and loudness of the intended utterance, and the exhale duration depends on the linguistic intent and the sounds produced during speech production (Winkworth et al., 1994; Klatt et al., 1968). Speech production and breathing are inherently coupled, and Nallanthighal et al. (2020) aimed at sensing speech breathing patterns from the linguistic content and prosodic factors of speech.

Respiratory rate ( $RR$ ) is a vital metric: studies have shown that $RR$ is the most valid marker of exertion (Nicolò et al., 2014, 2017), and that a reduction in $RR$ is an indicator of a person’s relaxation response (Grant and Rainville, 2009; Wielgosz et al., 2016; Kral et al., 2022, 2023) and self-reported well-being (Kral et al., 2023). Speech breathing parameters have been used for clinical applications (Solomon and Hixon, 1993) as well as for affective analysis (Goldman-Eisler, 1955; Heim et al., 1968). Prior work on breath-sound detection from audio has focused on the detection and categorization of particular breath sounds to distinguish between healthy and abnormal breath sounds (Li et al., 2017; Castro and Marti-Puig, 2014). $RR$ estimation has been investigated using both contact-based and non-contact-based sensors (Sierra et al., 2006, 2004; Ren et al., 2015; Kumar et al., 2021; Ahmed et al., 2023; Rahman et al., 2022), including nasal breath recordings and wearable microphones. In this work, we investigate estimating respiratory parameters from speech recorded using close-talking microphones, which are more likely to sense respiratory sounds in speech than distant microphones, due to their proximity to the mouth.

Prior work on speech-breathing focused mostly on using traditional acoustic features such as log-mel spectrograms (Nallanthighal et al., 2020), or their discrete cosine transformed counterparts (a.k.a, mel-frequency cepstral coefficients or MFCCs) (Arafath K. and Routray, 2019; Ruinskiy and Lavner, 2007; MacIntyre et al., 2020). However, in case of limited-size datasets, such representations make the downstream machine learning models prone to over-fitting, and as a consequence restrict the generalization capacity and robustness of the machine learning (ML) model. Recent advances in foundation models (Bommasani et al., 2021) have resulted in significant performance boost of speech technologies, where pre-trained model representations (Baevski et al., 2020; Hsu et al., 2021) have shown state-of-the-art performance for speech recognition (Zuluaga-Gomez et al., 2023), speaker recognition (Zuluaga-Gomez et al., 2023), and emotion recognition (Mitra et al., 2022). Representations from pre-trained foundation models have demonstrated better generalization capacity and robustness across different speech tasks, under various acoustic conditions and for multiple languages, hence we hypothesize that such representations will be quite useful for the task of speech based respiration parameter estimation.

Self-supervised learned (SSL) models such as Wav2Vec2 (Baevski et al., 2020) or HuBERT (Hsu et al., 2021) are trained on large volumes of unlabeled data and are anticipated to learn acoustic units from the training data. The learned acoustic units should be discriminable in their spectro-temporal representations, and represent distinct acoustic phonetic units (such as vowels, voiced/unvoiced consonants, pauses, aspirated noise etc.) or their sub-states.

In this work, we aim to:
(1) estimate the respiration time-series signal from speech data,
(2) obtain $RR$ measure from speech data, and
(3) detect inhale events within the speech data.

We hypothesize that pre-trained representations should have information that can help with the above tasks and demonstrate better performance compared to standard mel-filterbank (MFB) based acoustic features given that they are pre-trained with large speech datasets.

This work demonstrates that:
(1) features from pre-trained models significantly improve $RR$ estimation from speech compared to standard acoustic features.
(2) respiration time-series (inhale/exhale signal) can be estimated from speech using an ML model, that is highly correlated to the reference measures.
(3) saliency-driven pre-trained representations can reduce the dimensionality of input representation space, as a consequence can reduce the downstream model’s parameter size.
(4) fusing pre-trained representations with standard acoustic features can improve $RR$ estimation performance.

Note that unlike prior works (Nallanthighal et al., 2020; Arafath K. and Routray, 2019) that have used standard acoustic features, we demonstrate that pre-trained model representations can be used for speech breath detection, and can demonstrate superior performance. In addition, we present a metric (breath-event error rate: $BER$ ) that indicates how closely the detected breath-events align with the groundtruth data. Finally, we present a convolutional LSTM (Conv-LSTM) model and show that the network depth and the fusion of pre-trained representations with MFB help to better estimate the breath time-series data from speech.

The rest of the paper is organized as follows: Section (2) presents the dataset used in our study, Section (3) introduces feature representations investigated and details on the acoustic model and its parameters, Section (4) presents the results, followed by conclusions in Section (5).

2. Data

Publicly available speech datasets containing respiration time-series reference do not exist, hence we collected data internally. The 2020 speech paralinguistic challenge (Schuller et al., 2020) explored speech based respiration event detection, however the dataset used in that challenge is not publicly available. Data were collected from 26 adult speakers under realistic background acoustic environments (consisting of background noise) in an indoor setting. American English speakers, between the ages of 25 and 60, balanced by gender, were employed for the data collection. Speech was recorded using microphone-enabled, wearable headphones, and the speech recordings and chest-belt measurements were collected across multiple sessions. During the data collection, participants were prompted to read a paragraph, where the reading session varied from 45 to 90 seconds. Note that conversational speech is not considered in this study, however we expect that findings from this work should generalize to such speech.

A strain-gauge chest-belt sensor (Vernier Go Direct Respiration Belt) was used during the data collection to obtain groundtruth reference chest contraction and relaxation (corresponding to inhalation and exhalation) measurements. Figure 1 shows a sample speech spectrogram and its corresponding chest-belt measurement. Due to calibration and subject variability, chest-belt measurements were observed to have variations, hence a quality check of the chest-belt measurements was performed manually and any data with erroneous measurements were removed. Chest-belt data were z-score normalized and dynamic range compressed before being used for model training. For some sessions the participant did not speak, hence they did not contain any recorded speech; such data were excluded from our experiments.

Data augmentation was performed to simulate faster and slower breathing by altering the speed of the entire audio signal. We used 25 hours of speech data from 26 speakers in this study; data from 3 and 4 speakers (roughly one hour of speech data per speaker) were set aside for the validation and test sets, respectively, and the remaining 19 speakers were used for model training. Note that the validation and test split speakers were balanced by gender. Speech data were segmented into chunks of 30 seconds for model training, to ensure that each chunk contains at least one full breath cycle.
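The segmentation and speed-based augmentation described above can be sketched as follows. This is a minimal illustration and not the pipeline used in this work: the function names, the 16 kHz sample rate, and the choice to drop a trailing partial chunk are assumptions. It shows only the bookkeeping, namely that a uniform speed change by a factor $s$ shortens the signal by $1/s$ and scales the breathing rate by $s$.

```python
def split_into_chunks(n_samples, sample_rate=16000, chunk_sec=30.0):
    """Return (start, end) sample indices of consecutive full-length chunks.

    Assumption: a trailing partial chunk shorter than chunk_sec is dropped.
    """
    chunk_len = int(round(chunk_sec * sample_rate))
    last_start = n_samples - chunk_len
    return [(s, s + chunk_len) for s in range(0, last_start + 1, chunk_len)]


def after_speed_change(duration_sec, rr_br_per_min, speed_factor):
    """Duration and respiratory rate after a uniform speed change.

    speed_factor > 1 plays the audio faster (shorter signal, higher RR);
    speed_factor < 1 plays it slower (longer signal, lower RR).
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return duration_sec / speed_factor, rr_br_per_min * speed_factor


# A 75-second recording at the assumed 16 kHz yields two full 30-second chunks.
chunks = split_into_chunks(16000 * 75)
# Speeding up a 60-second segment at 12 br/min by 1.25x gives 48 s at 15 br/min.
new_duration, new_rr = after_speed_change(60.0, 12.0, 1.25)
```

Under this sketch, the chest-belt reference would need the same time scaling as the audio so that the two signals stay aligned.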

Figure 1. Speech spectrogram [top] and the corresponding chest-belt pressure measurement (in Newtons) [bottom].

2.1. Analysis

We analyzed the data used in this study to measure the variance in $RR$ , both within and across speakers. Figure 2 shows the histogram of $RR$ estimated from the chest-belt data obtained from the subjects in our dataset. Figure 3 shows the variance of $RR$ by subject, which shows that $RR$ varied not only across subjects, but also within a subject across multiple sessions. The $RR$ values in the dataset ranged from 5 to 19 breaths/min (denoted as br/min).

3. Methods

3.1. Acoustic Features

The baseline acoustic features consist of 40-dimensional MFB energies, analyzed at a 25ms window, with a frame interval of 10ms.

3.2. Features from Pre-Trained Models

We explored embeddings generated from a pre-trained Wav2Vec2-base (Wav2Vec2) model (Baevski et al., 2020); we selected Wav2Vec2-base due to its smaller size. Note that the pre-trained acoustic model was not fine-tuned to our data, and its parameters were frozen to generate the representations for our dataset. The Wav2Vec2 model was pre-trained on 960 hours of speech from the Librispeech dataset with 12 transformer layers and 768 embedding dimensions, where we investigated the representations obtained from the $2^{nd}$ through the last transformer layers (see https://pytorch.org/audio/stable/pipelines). Representations from the initial layers are expected to contain more acoustic information, while those from the latter layers are expected to contain more phonetic information.

3.3. Model

We used a convolutional network with long short-term memory units (Conv-LSTM) consisting of as many time-convolution filters as the number of feature inputs (which is 40 for MFB and 768 for Wav2Vec2), 128 LSTM units and 128 neurons in the embedding layer. The model architecture is shown in Figure 4.A. Additionally, we investigated feature fusion as shown in Figure 4.B. Given the ability of foundation models (such as Wav2Vec2) to learn large dimensional acoustic representations through multiple tiers of transformer layers, the down-stream classifiers trained on the foundation model representations can be simple in architecture, as reported in (Mitra et al., 2022). In this work we did not observe any evidence of performance gain from increasing model complexity (by introducing additional layers), hence we focused on exploring a simple Conv-LSTM architecture as shown in Figure 4.A.

Models were trained using the concordance correlation coefficient ( $CCC$ ) (Lawrence and Lin, 1989) as the loss function (see Equation (1)). In Equation (1), ${\mu_{x}}$ and ${\mu_{y}}$ are the means, ${\sigma_{x}^{2}}$ and ${\sigma_{y}^{2}}$ are the corresponding variances for the estimated and groundtruth time-series data, and ${\rho}$ is the correlation coefficient between the two variables. The models were trained with a mini-batch size of 64, using the Adam optimizer with a learning rate of 0.005. Early stopping was performed based on the validation-set loss.

$$CCC=\frac{2\rho\sigma_{x}\sigma_{y}}{\sigma_{x}^{2}+\sigma_{y}^{2}+(\mu_{x}-\mu_{y})^{2}}. \tag{1}$$
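As a minimal sketch of Equation (1), the function below computes $CCC$ for two equal-length sequences using population statistics, relying on the identity $\rho\sigma_{x}\sigma_{y}=Cov(x,y)$. The function names are illustrative, and minimizing $1-CCC$ is an assumption: the text states only that $CCC$ serves as the loss function.

```python
from statistics import fmean


def ccc(x, y):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    # rho * sigma_x * sigma_y equals the covariance of x and y.
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    denom = var_x + var_y + (mu_x - mu_y) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: treat as perfect agreement.
        return 1.0
    return 2.0 * cov_xy / denom


def ccc_loss(estimate, reference):
    """Assumed training objective: 0 for perfect concordance, up to 2 for CCC = -1."""
    return 1.0 - ccc(estimate, reference)
```

Unlike the Pearson correlation, $CCC$ also penalizes offset and scale mismatch: an estimate that tracks the reference perfectly but is shifted by a constant has $\rho=1$ yet $CCC<1$.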

3.4. Salient representations

The pre-trained model embeddings have large dimensionality, for example, Wav2Vec2 model generates 768 dimensions, resulting in increased downstream model size. To reduce the feature dimension, we obtained breath-salient representations from the Wav2Vec2, by relying on the relationships between the input representation and the targets. Prior studies (Mitra and Franco, 2020; Mitra et al., 2023) have explored the input-output relationships of activations to obtain neural saliency, and we use a similar idea to obtain salient representations for respiration signal estimation. Let the $k^{th}$ dimension of $N$ dimensional Wav2Vec2 for an utterance $y$ be represented by a vector $H_{k,y}=[X_{1,k},\dots,X_{M,k}]$ , where $M$ denotes the sequence length. Let the reference respiration time-series be $L$ for utterance $y$ . The cross-correlation based saliency ( $CCS_{k}$ ) of $k^{th}$ dimension is given by:

$$S_{CCS_{k}}=\left\|\frac{Cov({H}_{k},L)}{\sigma_{H_{k}}\sigma_{L}}\right\|+\gamma_{k}, \tag{2}$$

where Equation 2 computes the absolute cross-correlation between time-series $L$ and embeddings ${{H}_{k}}$ for dimension $k$ for all utterances in the training set. $\gamma_{k}$ is the sum of the weighted cross-correlation between the $k^{th}$ dimension and all other dimensions, as shown in Equation 3:

$$\gamma_{k}=\frac{1}{N-1}\sum_{j=1,j\neq k}^{N}w_{j}\left\|\frac{Cov({H}_{k},{H}_{j})}{\sigma_{{H}_{k}}\sigma_{{H}_{j}}}\right\|, \tag{3}$$

where, $w_{j}=\left\|\frac{Cov({{H}_{j}},L)}{\sigma_{{{H}_{j}}}\sigma_{L}}\right\|$ .

In our experiments we have used $S_{CCS}$ given in Equation 2 to select salient dimensions in pre-trained representations.
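The saliency computation in Equations 2 and 3 can be sketched in a few lines. This is an illustrative reading and not the implementation used in this work: the function names are hypothetical, the scores are computed on a single sequence (how scores are pooled across training utterances is not specified here), and ranking dimensions by score and keeping the top fraction is an assumed selection rule.

```python
from math import sqrt
from statistics import fmean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mu_a, mu_b = fmean(a), fmean(b)
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    var_a = sum((p - mu_a) ** 2 for p in a)
    var_b = sum((q - mu_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def ccs_saliency(H, L):
    """Saliency per dimension; H is a list of N dimensions, each a length-M sequence."""
    n = len(H)
    # w_j: absolute correlation of dimension j with the respiration signal L.
    w = [abs(pearson(h, L)) for h in H]
    scores = []
    for k in range(n):
        gamma = 0.0
        if n > 1:
            # Equation 3: weighted correlation of dimension k with all other dimensions.
            gamma = sum(
                w[j] * abs(pearson(H[k], H[j])) for j in range(n) if j != k
            ) / (n - 1)
        scores.append(w[k] + gamma)  # Equation 2
    return scores


def select_salient(scores, keep_fraction):
    """Indices of the top-scoring dimensions (assumed selection rule)."""
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:n_keep])
```

For example, with `L = [0, 1, 2, 3, 4]` and three dimensions of which two are perfectly (anti-)correlated with `L` and one is uncorrelated, the first two dimensions receive the highest scores and are retained at a keep fraction of about two thirds.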

Table 1. Baseline performance (on test-set) for respiration time-series estimation using $CCC$ and $RMSE$ measures using MFB and Wav2Vec2 representations

Representations	Layer	Time series $CCC\uparrow$	Time series $RMSE\downarrow$	$RR$ estimate $MAE\downarrow$	$RR$ estimate $Acc@2bpm\uparrow$
MFB	N/A	0.68	0.13	2.85	64.1
$Wav2Vec2$	2	0.73	0.12	2.67	66.4
$Wav2Vec2$	3	0.75	0.12	2.56	67.2
$Wav2Vec2$	4	0.76	0.11	2.52	66.2
$Wav2Vec2$	5	0.73	0.12	2.59	66.0
$Wav2Vec2$	6	0.75	0.12	2.67	66.3
$Wav2Vec2$	7	0.76	0.11	2.56	65.5
$Wav2Vec2$	8	0.75	0.12	2.35	66.7
$Wav2Vec2$	9	0.75	0.13	2.56	64.8
$Wav2Vec2$	10	0.74	0.13	2.57	64.6
$Wav2Vec2$	11	0.71	0.12	2.75	63.8
$Wav2Vec2$	12	0.69	0.12	2.86	63.5

Table 2. Performance on Validation set for respiration time-series estimation using $CCC$ and $RMSE$ measures using MFB and Wav2Vec2 representations

Representations	Layer	Time series $CCC\uparrow$	Time series $RMSE\downarrow$	$RR$ estimate $MAE\downarrow$	$RR$ estimate $Acc@2bpm\uparrow$
MFB	N/A	0.57	0.15	3.61	61.5
$Wav2Vec2$	2	0.62	0.14	2.89	62.7
$Wav2Vec2$	3	0.63	0.14	2.64	64.7
$Wav2Vec2$	4	0.66	0.14	2.32	67.8
$Wav2Vec2$	5	0.65	0.14	2.54	68.1
$Wav2Vec2$	6	0.66	0.14	2.60	66.2
$Wav2Vec2$	7	0.67	0.13	2.35	67.4
$Wav2Vec2$	8	0.66	0.14	2.43	69.3
$Wav2Vec2$	9	0.63	0.14	2.55	66.7
$Wav2Vec2$	10	0.59	0.14	2.57	65.6
$Wav2Vec2$	11	0.59	0.15	2.71	63.3
$Wav2Vec2$	12	0.58	0.15	2.91	62.2

4. Results

We trained baseline acoustic models using (i) MFB features and (ii) Wav2Vec2 embeddings obtained from the $2^{nd}$ through $12^{th}$ transformer layers of the model. Table 1 shows the baseline time-series estimation performance obtained from the MFB and Wav2Vec2 representations, measured using $CCC$ (Lawrence and Lin, 1989) and root-mean-squared error ( $RMSE$ ). We also evaluated segment-level $RR$ estimation performance using two metrics: mean absolute error ( $MAE$ ) and accuracy at a 2 br/min error tolerance ( $Acc@2bpm$ ). $MAE$ is computed by comparing the number of breath events detected in the time-series signal estimated by the model with the number observed in the chest-belt groundtruth signal.

Accuracy for a segment is measured at a tolerance bound of +/-2 breaths/min (bpm) (we made this selection to have a conservative error-bound), where an estimate outside the tolerance-bound is treated as an error. Table 1 shows that the pre-trained representations from Wav2Vec2 perform better than the MFB features for the test-set, and the relative improvement was at-least 2.4% increase in $CCC$ and 6.8% relative reduction in $RMSE$ .
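The segment-level metrics can be illustrated with a short sketch. The function names are hypothetical, the conversion from an event count to br/min assumes 30-second segments, and treating an error of exactly 2 br/min as correct is an assumption, since the text does not state whether the tolerance bound is inclusive.

```python
def rr_from_event_count(n_events, segment_sec=30.0):
    """Convert a breath-event count in one segment to breaths per minute."""
    return n_events * 60.0 / segment_sec


def mae(estimates, references):
    """Mean absolute error between estimated and reference RR values."""
    if len(estimates) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(references)


def acc_at_tolerance(estimates, references, tol=2.0):
    """Percentage of segments whose RR estimate is within +/- tol br/min."""
    if len(estimates) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for e, r in zip(estimates, references) if abs(e - r) <= tol)
    return 100.0 * hits / len(references)


# Three 30-second segments: breath events counted in the model output
# versus the chest-belt reference (toy numbers, not from the dataset).
est_rr = [rr_from_event_count(n) for n in (6, 8, 5)]  # 12, 16, 10 br/min
ref_rr = [rr_from_event_count(n) for n in (7, 8, 7)]  # 14, 16, 14 br/min
segment_mae = mae(est_rr, ref_rr)                     # (2 + 0 + 4) / 3
segment_acc = acc_at_tolerance(est_rr, ref_rr)        # 2 of 3 segments within bound
```

Note that with 30-second segments a single missed or spurious breath event changes the $RR$ estimate by 2 br/min, which is exactly the tolerance bound.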

Interestingly, Table 1 also shows that representations from different transformer layers of Wav2Vec2 had different impacts on performance: the representations from layers 4 to 9 were more effective than those from the final layers 10 through 12. The best performance was obtained from layers 4 and 7, which gave a 12.3% relative improvement in $CCC$ and a 14.3% relative reduction in $RMSE$ compared to the MFB features. Even though we used the SSL-trained Wav2Vec2 (which is not fine-tuned on any specific task), the final layers may contain more phonetic-discriminatory information, which may not be essential for breath-signal estimation (see Section 3.2). The middle layers may contain broader acoustic-level information that helps to detect breathing patterns in speech, speech activity and silent pauses; hence, they yielded better performance than the final layers. Given the findings in Table 1, we use the representations from layers 4 and 7 in the remainder of this paper to train Conv-LSTM models with 2 LSTM layers.

Next, we investigated the depth of the LSTM layers; Figure 5 shows that a 2-layered LSTM model performed best overall, providing higher $Acc@2bpm$ and lower $MAE$ for all the features. Table 2 shows the validation-set performance when the MFB features and representations from different transformer layers of Wav2Vec2 were used.

We also investigated if saliency-driven feature selection can help to reduce the model size, while retaining the model performance. Using the approach outlined in section 3.4 we investigated pruning input representations, by keeping only 90%, 75%, 50% and 25% of the input representations, which in turn resulted in reducing the model parameter size by 9%, 22%, 44% and 66% respectively. Table 3 shows the result obtained from selecting salient representations from Wav2Vec2 layers 4 and 7. We introduce a metric: breath error rate ( $BER$ ) to measure the accuracy of detecting breath events. $BER$ is computed by comparing the inhalation events in the groundtruth and estimated time-series signals, where we have only deletion of inhale-events (deletion errors, $D$ ) and inserted inhale-events (insertion errors, $I$ ), and use the total number of inhale events $N$ in the groundtruth data, to measure $BER$ :

$$BER=\frac{I+D}{N}. \tag{4}$$
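A minimal sketch of Equation (4) follows. The text does not specify how estimated inhale events are aligned with the groundtruth events, so the greedy time-ordered matching and the 1-second tolerance used here are assumptions, as are the function and parameter names. The function returns $BER$ as a ratio together with the insertion and deletion counts.

```python
def breath_error_rate(ref_times, est_times, tol_sec=1.0):
    """Return (BER, insertions, deletions) for inhale-event times in seconds.

    Assumption: events are matched greedily in time order, and an estimated
    event counts as correct if it lies within tol_sec of a reference event.
    """
    ref = sorted(ref_times)
    est = sorted(est_times)
    if not ref:
        raise ValueError("at least one reference inhale event is required")
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tol_sec:
            matched += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimated event with no reference nearby: insertion
        else:
            i += 1  # reference event with no estimate nearby: deletion
    deletions = len(ref) - matched
    insertions = len(est) - matched
    return (insertions + deletions) / len(ref), insertions, deletions


# Toy example: one missed inhale (at 6.0 s) and two spurious ones (8.0 s, 17.0 s).
ber, n_ins, n_del = breath_error_rate(
    ref_times=[2.0, 6.0, 10.0, 14.0],
    est_times=[2.3, 8.0, 10.4, 13.8, 17.0],
)
```

Because insertions are not bounded by $N$, the ratio can exceed 1 when the estimate contains many spurious events.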

Table 3. Respiration time-series estimation performance (in $CCC$ and $RMSE$ ) and segmental $RR$ estimation (in $MAE$ , $Acc@2bpm$ and $BER$ ) from Wav2Vec2 layers 4 and 7 and fusion of layer 4 with MFB, after saliency based representation selection and their corresponding parameter size reduction

Feature	% Input reps.	Time series $CCC\uparrow$	Time series $RMSE\downarrow$	$RR$ estimate $MAE\downarrow$	$RR$ estimate $Acc@2bpm\uparrow$	$RR$ estimate $BER\downarrow$	% Rel. model size $\downarrow$
$Wav2Vec2_{4}$	100	0.75	0.11	1.58	84.4	29.8	0
$Wav2Vec2_{4}$	90	0.76	0.12	1.89	77.6	26.8	8.8
$Wav2Vec2_{4}$	75	0.75	0.11	2.13	75.5	29.3	22.0
$Wav2Vec2_{4}$	50	0.76	0.11	1.80	78.1	24.9	44.0
$Wav2Vec2_{4}$	10	0.72	0.12	1.97	74.5	32.4	66.0
$Wav2Vec2_{7}$	100	0.77	0.11	1.77	80.7	28.7	0
$Wav2Vec2_{7}$	90	0.77	0.11	1.89	79.7	30.1	8.8
$Wav2Vec2_{7}$	75	0.76	0.11	2.12	74.0	29.1	22.0
$Wav2Vec2_{7}$	50	0.76	0.11	1.91	76.6	28.3	44.0
$Wav2Vec2_{7}$	10	0.72	0.12	2.21	72.4	37.4	66.0
$Wav2Vec2_{4,50}$ +MFB	50	0.77	0.11	1.58	83.9	22.6	27.4

Table 3 shows that the representations from layer 4 performed better than those from layer 7, especially for the segment-level $RR$ metrics ( $MAE$ , $Acc@2bpm$ and $BER$ ). Selecting the top 50% of the representations based on saliency resulted in the best $BER$ , with some regression in $MAE$ and $Acc@2bpm$ compared to the model trained with the full layer 4 representations. Note that the model based on 50% of the representations is smaller than the full-representation model by 44% (Figure 6 shows the time-series estimate from the model). The above findings indicate that: (1) the earlier layers of Wav2Vec2 contain more respiration-relevant representations, which resulted in better performance across multiple metrics, (2) an $RR$ estimation $MAE$ as low as 1.6 bpm can be achieved using speech as input data, and an $RR$ estimation accuracy as high as 84% can be obtained for a tolerance of +/-2 bpm, and (3) saliency-based representation selection can reduce the model size by 44% and provide better $BER$ , at the cost of some regression in $RR$ estimation performance. Note that for segment-level $RR$ estimation the $MAE$ and $Acc@2bpm$ obtained from MFB are 2.38 and 69.3% respectively, indicating that Wav2Vec2 representations performed better than MFBs for the segment-level metrics as well. We investigated fusion of the 50% salient layer-4 representations with MFB features ( $Wav2Vec2_{4,50}$ +MFB), with the result shown in the last row of Table 3. Fusion helped to achieve the best $BER$ , with $MAE$ and $Acc@2bpm$ comparable to the best single-feature system (Wav2Vec2 layer 4), and with a 27% reduction in model parameter size. The fusion results indicate that the Wav2Vec2 and MFB representations may have complementary information, hence their fusion resulted in improved performance.

5. Conclusion

In this work we demonstrated that the respiration signal can be estimated from speech data collected through close-talking microphones. Results from our work have shown a time-series estimation performance with $CCC$ as high as 0.77 and $RMSE$ as low as 0.11, where the groundtruth respiration signal was z-score normalized. At the segment level, we observed that $RR$ can be estimated with an $MAE$ of 1.6 bpm. We also observed that pre-trained model representations from the Wav2Vec2 SSL model performed better than standard MFB features, providing a relative $MAE$ reduction of 33.6% and a relative improvement in estimation $CCC$ of 10%. Additionally, we observed that fusion of Wav2Vec2 and MFB features provided the best overall performance.

Future studies should explore the use of representations from foundation models fine-tuned with speech data containing respiration-relevant information. Additionally, the impact of subject variance and the models’ generalization capacity should be investigated using a dataset containing a larger number of subjects than was available in this study. A limitation of this study is that it uses a dataset containing read speech; future work should investigate spontaneous speech for estimating the respiration signal.

References

Ahmed et al. (2023) Tousif Ahmed, Md Mahbubur Rahman, Ebrahim Nemati, Mohsin Yusuf Ahmed, Jilong Kuang, and Alex Jun Gao. 2023. Remote breathing rate tracking in stationary position using the motion and acoustic sensors of earables. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–22.
Arafath K. and Routray (2019) Mohamed Ismail Yasar Arafath K. and Aurobinda Routray. 2019. Automatic measurement of speech breathing rate. In 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 1–5.
Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449–12460.
Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
Castro and Marti-Puig (2014) J. Castro and P. Marti-Puig. 2014. Real-time Identification of Respiratory Movements through a Microphone. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal (ISSN: 2255-2863). Salamanca 3, 3 (2014).
Conrad and Schoenle (1979) B Conrad and Paul Schoenle. 1979. Speech and respiration. Archiv für Psychiatrie und Nervenkrankheiten 226 (1979), 251–268.
Fuchs and Rochet-Capellan (2021) Susanne Fuchs and Amélie Rochet-Capellan. 2021. The respiratory foundations of spoken language. Annual Review of Linguistics 7 (2021), 13–30.
Goldman-Eisler (1955) Frieda Goldman-Eisler. 1955. Speech-breathing activity-a measure of tension and affect during interviews. British Journal of Psychology 46, 1 (1955), 53.
Grant and Rainville (2009) J.A. Grant and P. Rainville. 2009. Pain sensitivity and analgesic effects of mindful states in Zen Meditators: A Cross-Sectional Study. Psychosomatic Medicine 71, 1 (2009), 106–114.
Heim et al. (1968) Edgar Heim, Peter H Knapp, Louis Vachon, Gordon G Globus, and S Joseph Nemetz. 1968. Emotion, breathing and speech. Journal of Psychosomatic Research 12, 4 (1968), 261–274.
Hixon (1987) Thomas J Hixon. 1987. Respiratory function in speech and song.
Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
Klatt et al. (1968) Dennis H Klatt, KN Stevens, and J Mead. 1968. Studies of articulatory activity and airflow during speech. Annals of the New York Academy of Sciences 155, 1 (1968), 42–55.
Kral et al. (2022) T. Kral, R.C. Lapate, T. Imhoff-Smith, E. Patsenko, D.W. Grupe, R. Goldman, M.A. Rosenkranz, and R. J. Davidson. 2022. Long-term mindfulness training is associated with reliable differences in resting respiration rate. Journal of Cognitive Neuroscience 34, 9 (2022), 1576–1589.
Kral et al. (2023) Tammi RA Kral, Helen Y Weng, Vikramjit Mitra, Theodore P Imhoff-Smith, Erdrin Azemi, Robin I Goldman, Melissa A Rosenkranz, Sarah Wu, Andrew Chen, and Richard J Davidson. 2023. Slower respiration rate is associated with higher self-reported well-being after wellness training. Scientific Reports 13, 1 (2023), 15953.
Kumar et al. (2021) Agni Kumar, Vikramjit Mitra, Carolyn Oliver, Adeeti Ullal, Matt Biddulph, and Irida Mance. 2021. Estimating respiratory rate from breath audio obtained through wearable microphones. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 7310–7315.
Lawrence and Lin (1989) I. Lawrence and K. Lin. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics (1989), 255–268.
Li et al. (2017) S. Li, B. Lin, C. Tsai, C. Yang, and B. Lin. 2017. Design of wearable breathing sound monitoring system for real-time wheeze detection. Sensors 17, 1 (2017), 171.
MacIntyre et al. (2020) Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia Hamilton, and Björn W Schuller. 2020. Deep attentive end-to-end continuous breath sensing from speech. (2020).
Mitra et al. (2022) Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, and Erdrin Azemi. 2022. Speech emotion: Investigating model representations, multi-task learning and knowledge distillation. arXiv preprint arXiv:2207.03334 (2022).
Mitra and Franco (2020) V. Mitra and H. Franco. 2020. Investigation and analysis of hyper and hypo neuron pruning to selectively update neurons during unsupervised adaptation. Digital Signal Processing 99 (2020), 102655.
Mitra et al. (2023) Vikramjit Mitra, Jingping Nie, and Erdrin Azemi. 2023. Investigating salient representations and label Variance in Dimensional Speech Emotion Analysis. arXiv preprint arXiv:2312.16180 (2023).
Nallanthighal et al. (2020) Venkata Srikanth Nallanthighal, Aki Härmä, and Helmer Strik. 2020. Speech breathing estimation using deep learning methods. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1140–1144.
Nicolò et al. (2014) A. Nicolò, I. Bazzucchi, M. Lenti, A. S. di Palumbo, and M. Sacchetti. 2014. Neuromuscular and metabolic responses to high-intensity intermittent cycling protocols with different work-to-rest ratios. International Journal of Sports Physiology and Performance 9, 1 (2014), 151–160.
Nicolò et al. (2017) A. Nicolò, S. M. Marcora, I. Bazzucchi, and M. Sacchetti. 2017. Differential control of respiratory frequency and tidal volume during high-intensity interval training. Experimental Physiology 102 (2017), 934–949.
Rahman et al. (2022) Md Mahbubur Rahman, Tousif Ahmed, Mohsin Yusuf Ahmed, Minh Dinh, Ebrahim Nemati, Jilong Kuang, and Jun Alex Gao. 2022. Breathebuddy: Tracking real-time breathing exercises for automated biofeedback using commodity earbuds. Proceedings of the ACM on Human-Computer Interaction 6, MHCI (2022), 1–18.
Ren et al. (2015) Y. Ren, C. Wang, J. Yang, and Y. Chen. 2015. Fine-grained sleep monitoring: Hearing your breathing with smartphones. In 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 1194–1202.
Ruinskiy and Lavner (2007) Dima Ruinskiy and Yizhar Lavner. 2007. An effective algorithm for automatic detection and exact demarcation of breath sounds in speech and song signals. IEEE transactions on audio, speech, and language processing 15, 3 (2007), 838–850.
Schuller et al. (2020) Björn W Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, et al. 2020. The interspeech 2020 computational paralinguistics challenge: Elderly emotion, breathing & masks. (2020).
Sierra et al. (2004) G. Sierra, V. Telfort, B. Popov, L. Durand, R. Agarwal, and V. Lanzo. 2004. Monitoring respiratory rate based on tracheal sounds. First experiences. In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vol. 1. IEEE, 317–320.
Sierra et al. (2006) G. Sierra, V. Telfort, B. Popov, M. Pelletier, P. Despault, V. Lanzo, and R. Agarwal. 2006. Comparison of respiratory rate estimation based on tracheal sounds versus a capnograph. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, 6145–6148.
Solomon and Hixon (1993) Nancy Pearl Solomon and Thomas J Hixon. 1993. Speech breathing in Parkinson’s disease. Journal of Speech, Language, and Hearing Research 36, 2 (1993), 294–310.
Stevens (2000) Kenneth N Stevens. 2000. Acoustic phonetics. Vol. 30. MIT press.
Wielgosz et al. (2016) J. Wielgosz, B. S. Schuyler, A. Lutz, and R. J. Davidson. 2016. Long-term mindfulness training is associated with reliable differences in resting respiration rate. Scientific Reports 6, 1 (2016), 1–6.
Winkworth et al. (1994) Alison L Winkworth, Pamela J Davis, Elizabeth Ellis, and Roger D Adams. 1994. Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors. Journal of Speech, Language, and Hearing Research 37, 3 (1994), 535–556.
Zuluaga-Gomez et al. (2023) Juan Zuluaga-Gomez, Amrutha Prasad, Iuliia Nigmatulina, Seyyed Saeed Sarfjoo, Petr Motlicek, Matthias Kleinert, Hartmut Helmke, Oliver Ohneiser, and Qingran Zhan. 2023. How does pre-trained wav2vec 2.0 perform on domain-shifted asr? an extensive benchmark on air traffic control communications. In 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 205–212.