Search | arXiv e-print repository

arXiv:2408.15803 [pdf, other]

ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation

Authors: Tiantian Feng, Tuo Zhang, Salman Avestimehr, Shrikanth S. Narayanan

Abstract: Multimodal Federated Learning frequently encounters challenges of client modality heterogeneity, leading to undesired performances for secondary modality in multimodal learning. It is particularly prevalent in audiovisual learning, with audio is often assumed to be the weaker modality in recognition tasks. To address this challenge, we introduce ModalityMirror to improve audio model performance by… ▽ More Multimodal Federated Learning frequently encounters challenges of client modality heterogeneity, leading to undesired performances for secondary modality in multimodal learning. It is particularly prevalent in audiovisual learning, with audio is often assumed to be the weaker modality in recognition tasks. To address this challenge, we introduce ModalityMirror to improve audio model performance by leveraging knowledge distillation from an audiovisual federated learning model. ModalityMirror involves two phases: a modality-wise FL stage to aggregate uni-modal encoders; and a federated knowledge distillation stage on multi-modality clients to train an unimodal student model. Our results demonstrate that ModalityMirror significantly improves the audio classification compared to the state-of-the-art FL methods such as Harmony, particularly in audiovisual FL facing video missing. Our approach unlocks the potential for exploiting the diverse modality spectrum inherent in multi-modal FL. △ Less

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2407.13227 [pdf, other]

Solving the Model Unavailable MARE using Q-Learning Algorithm

Authors: Fei Yan, Jie Gao, Tao Feng, Jianxing Liu

Abstract: In this paper, the discrete-time modified algebraic Riccati equation (MARE) is solved when the system model is completely unavailable. To achieve this, firstly a brand new iterative method based on the standard discrete-time algebraic Riccati equation (DARE) and its input weighting matrix is proposed to solve the MARE. For the single-input case, the iteration can be initialized by an arbitrary pos… ▽ More In this paper, the discrete-time modified algebraic Riccati equation (MARE) is solved when the system model is completely unavailable. To achieve this, firstly a brand new iterative method based on the standard discrete-time algebraic Riccati equation (DARE) and its input weighting matrix is proposed to solve the MARE. For the single-input case, the iteration can be initialized by an arbitrary positive input weighting if and only if the MARE has a stabilizing solution; nevertheless a pre-given input weighting matrix of a sufficiently large magnitude is used to perform the iteration for the multi-input case when the characteristic parameter belongs to a specified subset. Benefit from the developed specific iteration structure, the Q-learning (QL) algorithm can be employed to subtly solve the MARE where only the system input/output data is used thus the system model is not required. Finally, a numerical simulation example is given to verify the effectiveness of the theoretical results and the algorithm. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2406.12816 [pdf, other]

Neural Approximate Mirror Maps for Constrained Diffusion Models

Authors: Berthy T. Feng, Ricardo Baptista, Katherine L. Bouman

Abstract: Diffusion models excel at creating visually-convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a… ▽ More Diffusion models excel at creating visually-convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a diffusion model not only improves its distribution-matching accuracy but also makes it more reliable for generating valid synthetic data and solving constrained inverse problems. However, existing methods for constrained diffusion models are inflexible with different types of constraints. Recent work proposed to learn mirror diffusion models (MDMs) in an unconstrained space defined by a mirror map and to impose the constraint with an inverse mirror map, but analytical mirror maps are challenging to derive for complex constraints. We propose neural approximate mirror maps (NAMMs) for general constraints. Our approach only requires a differentiable distance function from the constraint set. We learn an approximate mirror map that pushes data into an unconstrained space and a corresponding approximate inverse that maps data back to the constraint set. A generative model, such as an MDM, can then be trained in the learned mirror space and its samples restored to the constraint set by the inverse map. We validate our approach on a variety of constraints, showing that compared to an unconstrained diffusion model, a NAMM-based MDM substantially improves constraint satisfaction. We also demonstrate how existing diffusion-based inverse-problem solvers can be easily applied in the learned mirror space to solve constrained inverse problems. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.08800 [pdf, other]

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

Authors: Tiantian Feng, Dimitrios Dimitriadis, Shrikanth Narayanan

Abstract: Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the qua… ▽ More Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using them as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio. △ Less

Submitted 29 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to 2024 INTERSPEECH; corrections to ActivityNet labels

arXiv:2406.08644 [pdf, other]

Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Authors: Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Kadiri, Shrikanth Narayanan

Abstract: Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acoustic feature processing step is required. The propo… ▽ More Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acoustic feature processing step is required. The proposed method consists of an EEG module and a speech module along with a connector. The EEG module learns to better represent EEG signals, while the speech module generates speech waveforms from model representations. The connector learns to bridge the distributions of the latent spaces of EEG and speech. The proposed framework is both simple and efficient, by allowing single-step inference, and outperforms prior works on objective metrics. A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding. The source code is available here: github.com/lee-jhwn/fesde. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: accepted to Interspeech2024

arXiv:2406.07890 [pdf, other]

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

Authors: Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

Abstract: Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker… ▽ More Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models with varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Interspeech 2024

arXiv:2406.02785 [pdf, other]

Event-horizon-scale Imaging of M87* under Different Assumptions via Deep Generative Image Priors

Authors: Berthy T. Feng, Katherine L. Bouman, William T. Freeman

Abstract: Reconstructing images from the Event Horizon Telescope (EHT) observations of M87*, the supermassive black hole at the center of the galaxy M87, depends on a prior to impose desired image statistics. However, given the impossibility of directly observing black holes, there is no clear choice for a prior. We present a framework for flexibly designing a range of priors, each bringing different biases… ▽ More Reconstructing images from the Event Horizon Telescope (EHT) observations of M87*, the supermassive black hole at the center of the galaxy M87, depends on a prior to impose desired image statistics. However, given the impossibility of directly observing black holes, there is no clear choice for a prior. We present a framework for flexibly designing a range of priors, each bringing different biases to the image reconstruction. These priors can be weak (e.g., impose only basic natural-image statistics) or strong (e.g., impose assumptions of black-hole structure). Our framework uses Bayesian inference with score-based priors, which are data-driven priors arising from a deep generative model that can learn complicated image distributions. Using our Bayesian imaging approach with sophisticated data-driven priors, we can assess how visual features and uncertainty of reconstructed images change depending on the prior. In addition to simulated data, we image the real EHT M87* data and discuss how recovered features are influenced by the choice of prior. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2404.17983 [pdf, other]

TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Authors: Tiantian Feng, Xuan Shi, Rahul Gupta, Shrikanth S. Narayanan

Abstract: Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech… ▽ More Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech data for training ASU due to concerns such as privacy. To approach this setting of enabling ASU when speech (audio) modality is missing, we propose TI-ASU, using a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU on various missing scales, both multi- and single-modality settings, and the use of LLMs. Our findings show that TI-ASU yields substantial benefits to improve ASU in scenarios where even up to 95% of training speech is missing. Moreover, we show that TI-ASU is adaptive to dropout training, improving model robustness in addressing missing speech during inference. △ Less

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.09385 [pdf, other]

A Large-Scale Evaluation of Speech Foundation Models

Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,… ▽ More The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark. △ Less

Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

arXiv:2404.00471 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447579

Score-Based Diffusion Models for Photoacoustic Tomography Image Reconstruction

Authors: Sreemanti Dey, Snigdha Saha, Berthy T. Feng, Manxiu Cui, Laure Delisle, Oscar Leong, Lihong V. Wang, Katherine L. Bouman

Abstract: Photoacoustic tomography (PAT) is a rapidly-evolving medical imaging modality that combines optical absorption contrast with ultrasound imaging depth. One challenge in PAT is image reconstruction with inadequate acoustic signals due to limited sensor coverage or due to the density of the transducer array. Such cases call for solving an ill-posed inverse reconstruction problem. In this work, we use… ▽ More Photoacoustic tomography (PAT) is a rapidly-evolving medical imaging modality that combines optical absorption contrast with ultrasound imaging depth. One challenge in PAT is image reconstruction with inadequate acoustic signals due to limited sensor coverage or due to the density of the transducer array. Such cases call for solving an ill-posed inverse reconstruction problem. In this work, we use score-based diffusion models to solve the inverse problem of reconstructing an image from limited PAT measurements. The proposed approach allows us to incorporate an expressive prior learned by a diffusion model on simulated vessel structures while still being robust to varying transducer sparsity conditions. △ Less

Submitted 30 March, 2024; originally announced April 2024.

Comments: 5 pages

Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 2470-2474

arXiv:2310.19113 [pdf, other]

Dynamic V2X Autonomous Perception from Road-to-Vehicle Vision

Authors: Jiayao Tan, Fan Lyu, Linyan Li, Fuyuan Hu, Tingliang Feng, Fenglei Xu, Rui Yao

Abstract: Vehicle-to-everything (V2X) perception is an innovative technology that enhances vehicle perception accuracy, thereby elevating the security and reliability of autonomous systems. However, existing V2X perception methods focus on static scenes from mainly vehicle-based vision, which is constrained by sensor capabilities and communication loads. To adapt V2X perception models to dynamic scenes, we… ▽ More Vehicle-to-everything (V2X) perception is an innovative technology that enhances vehicle perception accuracy, thereby elevating the security and reliability of autonomous systems. However, existing V2X perception methods focus on static scenes from mainly vehicle-based vision, which is constrained by sensor capabilities and communication loads. To adapt V2X perception models to dynamic scenes, we propose to build V2X perception from road-to-vehicle vision and present Adaptive Road-to-Vehicle Perception (AR2VP) method. In AR2VP,we leverage roadside units to offer stable, wide-range sensing capabilities and serve as communication hubs. AR2VP is devised to tackle both intra-scene and inter-scene changes. For the former, we construct a dynamic perception representing module, which efficiently integrates vehicle perceptions, enabling vehicles to capture a more comprehensive range of dynamic factors within the scene.Moreover, we introduce a road-to-vehicle perception compensating module, aimed at preserving the maximized roadside unit perception information in the presence of intra-scene changes.For inter-scene changes, we implement an experience replay mechanism leveraging the roadside unit's storage capacity to retain a subset of historical scene data, maintaining model robustness in response to inter-scene shifts. We conduct perception experiment on 3D object detection and segmentation, and the results show that AR2VP excels in both performance-bandwidth trade-offs and adaptability within dynamic environments. △ Less

Submitted 29 October, 2023; originally announced October 2023.

arXiv:2310.10835 [pdf, other]

Provable Probabilistic Imaging using Score-Based Generative Priors

Authors: Yu Sun, Zihui Wu, Yifan Chen, Berthy T. Feng, Katherine L. Bouman

Abstract: Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative… ▽ More Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we develop two PMC algorithms that can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. To improve the sampling efficiency, we introduce weighted annealing into these PMC algorithms, further developing two additional annealed PMC algorithms (APMC). We establish a theoretical analysis for characterizing the convergence behavior of PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees in terms of the Fisher information, fully compatible with the joint presence of weighted annealing, potentially non-log-concave likelihoods, and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification. △ Less

Submitted 28 August, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.01867 [pdf, other]

Audio-visual child-adult speaker classification in dyadic interactions

Authors: Anfeng Xu, Kevin Huang, Tiantian Feng, Helen Tager-Flusberg, Shrikanth Narayanan

Abstract: Interactions involving children span a wide range of important domains from learning to clinical diagnostic and therapeutic contexts. Automated analyses of such interactions are motivated by the need to seek accurate insights and offer scale and robustness across diverse and wide-ranging conditions. Identifying the speech segments belonging to the child is a critical step in such modeling. Convent… ▽ More Interactions involving children span a wide range of important domains from learning to clinical diagnostic and therapeutic contexts. Automated analyses of such interactions are motivated by the need to seek accurate insights and offer scale and robustness across diverse and wide-ranging conditions. Identifying the speech segments belonging to the child is a critical step in such modeling. Conventional child-adult speaker classification typically relies on audio modeling approaches, overlooking visual signals that convey speech articulation information, such as lip motion. Building on the foundation of an audio-only child-adult speaker classification pipeline, we propose incorporating visual cues through active speaker detection and visual processing models. Our framework involves video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific predictions. We demonstrate from extensive experiments that a visually aided classification pipeline enhances the accuracy and robustness of the classification. We show relative improvements of 2.38% and 3.97% in F1 macro score when one face and two faces are visible, respectively. △ Less

Submitted 9 October, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: In review for ICASSP 2024, 5 pages

arXiv:2309.15292 [pdf, other]

Scaling Representation Learning from Ubiquitous ECG with State-Space Models

Authors: Kleanthis Avramidis, Dominika Kunc, Bartosz Perz, Kranti Adsul, Tiantian Feng, Przemysław Kazienko, Stanisław Saganowski, Shrikanth Narayanan

Abstract: Ubiquitous sensing from wearable devices in the wild holds promise for enhancing human well-being, from diagnosing clinical conditions and measuring stress to building adaptive health promoting scaffolds. But the large volumes of data therein across heterogeneous contexts pose challenges for conventional supervised learning approaches. Representation Learning from biological signals is an emerging… ▽ More Ubiquitous sensing from wearable devices in the wild holds promise for enhancing human well-being, from diagnosing clinical conditions and measuring stress to building adaptive health promoting scaffolds. But the large volumes of data therein across heterogeneous contexts pose challenges for conventional supervised learning approaches. Representation Learning from biological signals is an emerging realm catalyzed by the recent advances in computational modeling and the abundance of publicly shared databases. The electrocardiogram (ECG) is the primary researched modality in this context, with applications in health monitoring, stress and affect estimation. Yet, most studies are limited by small-scale controlled data collection and over-parameterized architecture choices. We introduce \textbf{WildECG}, a pre-trained state-space model for representation learning from ECG signals. We train this model in a self-supervised manner with 275,000 10s ECG recordings collected in the wild and evaluate it on a range of downstream tasks. The proposed model is a robust backbone for ECG analysis, providing competitive performance on most of the tasks considered, while demonstrating efficacy in low-resource regimes. The code and pre-trained weights are shared publicly at https://github.com/klean2050/tiles_ecg_model. △ Less

Submitted 26 September, 2023; originally announced September 2023.

Comments: Pre-print, currently under review

arXiv:2309.10993 [pdf, other]

Directional Source Separation for Robust Speech Recognition on Smart Glasses

Authors: Tiantian Feng, Ju Lin, Yiteng Huang, Weipeng He, Kaustubh Kalgaonkar, Niko Moritz, Li Wan, Xin Lei, Ming Sun, Frank Seide

Abstract: Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality,… ▽ More Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatic learning directional characteristics effectively improves separation quality. We further compare the ASR performance leveraging separated outputs to noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform the joint training of the directional source separation and ASR model, achieving the best overall ASR performance. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.08108 [pdf, other]

Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting

Authors: Tiantian Feng, Shrikanth Narayanan

Abstract: Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when a… ▽ More Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Under review

arXiv:2308.12610 [pdf, other]

Emotion-Aligned Contrastive Learning Between Images and Music

Authors: Shanti Stewart, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan

Abstract: Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. While most approaches aim to match general music semantics to the input queries, only a few foc… ▽ More Traditional music search engines rely on retrieval methods that match natural language queries with music metadata. There have been increasing efforts to expand retrieval methods to consider the audio characteristics of music itself, using queries of various modalities including text, video, and speech. While most approaches aim to match general music semantics to the input queries, only a few focus on affective qualities. In this work, we address the task of retrieving emotionally-relevant music from image queries by learning an affective alignment between images and music audio. Our approach focuses on learning an emotion-aligned joint embedding space between images and music. This embedding space is learned via emotion-supervised contrastive learning, using an adapted cross-modal version of the SupCon loss. We evaluate the joint embeddings through cross-modal retrieval tasks (image-to-music and music-to-image) based on emotion labels. Furthermore, we investigate the generalizability of the learned music embeddings via automatic music tagging. Our experiments show that the proposed approach successfully aligns images and music, and that the learned embedding space is effective for cross-modal retrieval applications. △ Less

Submitted 20 September, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

Comments: 4 pages + 1 reference page, 1 figure, 3 tables. Under review for publication

arXiv:2307.16398 [pdf, other]

Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism

Authors: Rimita Lahiri, Tiantian Feng, Rajat Hebbar, Catherine Lord, So Hyun Kim, Shrikanth Narayanan

Abstract: We address the problem of detecting who spoke when in child-inclusive spoken interactions i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child… ▽ More We address the problem of detecting who spoke when in child-inclusive spoken interactions i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child speech on the child-adult classification performance. We pre-train our model with child-inclusive interactions, following two recent self-supervision algorithms, Wav2vec 2.0 and WavLM, with a contrastive loss objective. We report 9 - 13% relative improvement over the state-of-the-art baseline with regards to classification F1 scores on two clinical interaction datasets involving children with Autism. We also analyze the impact of pre-training under different conditions by evaluating our model on interactions involving different subgroups of children based on various demographic factors. △ Less

Submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.04445 [pdf, other]

Learning Behavioral Representations of Routines From Large-scale Unlabeled Wearable Time-series Data Streams using Hawkes Point Process

Authors: Tiantian Feng, Brandon M Booth, Shrikanth Narayanan

Abstract: Continuously-worn wearable sensors enable researchers to collect copious amounts of rich bio-behavioral time series recordings of real-life activities of daily living, offering unprecedented opportunities to infer novel human behavior patterns during daily routines. Existing approaches to routine discovery through bio-behavioral data rely either on pre-defined notions of activities or use addition… ▽ More Continuously-worn wearable sensors enable researchers to collect copious amounts of rich bio-behavioral time series recordings of real-life activities of daily living, offering unprecedented opportunities to infer novel human behavior patterns during daily routines. Existing approaches to routine discovery through bio-behavioral data rely either on pre-defined notions of activities or use additional non-behavioral measurements as contexts, such as GPS location or localization within the home, presenting risks to user privacy. In this work, we propose a novel wearable time-series mining framework, Hawkes point process On Time series clusters for ROutine Discovery (HOT-ROD), for uncovering behavioral routines from completely unlabeled wearable recordings. We utilize a covariance-based method to generate time-series clusters and discover routines via the Hawkes point process learning algorithm. We empirically validate our approach for extracting routine behaviors using a completely unlabeled time-series collected continuously from over 100 individuals both in and outside of the workplace during a period of ten weeks. Furthermore, we demonstrate this approach intuitively captures daily transitional relationships between physical activity states without using prior knowledge. We also show that the learned behavioral patterns can assist in illuminating an individual's personality and affect. △ Less

Submitted 10 July, 2023; originally announced July 2023.

Comments: 2023 9th ACM SIGKDD International Workshop on Mining and Learning From Time Series (MiLeTS 2023)

arXiv:2306.07791 [pdf, other]

Unlocking Foundation Models for Privacy-Enhancing Speech Understanding: An Early Study on Low Resource Speech Training Leveraging Label-guided Synthetic Speech Content

Authors: Tiantian Feng, Digbalay Bose, Xuan Shi, Shrikanth Narayanan

Abstract: Automatic Speech Understanding (ASU) leverages the power of deep learning models for accurate interpretation of human speech, leading to a wide range of speech applications that enrich the human experience. However, training a robust ASU model requires the curation of a large number of speech samples, creating risks for privacy breaches. In this work, we investigate using foundation models to assi… ▽ More Automatic Speech Understanding (ASU) leverages the power of deep learning models for accurate interpretation of human speech, leading to a wide range of speech applications that enrich the human experience. However, training a robust ASU model requires the curation of a large number of speech samples, creating risks for privacy breaches. In this work, we investigate using foundation models to assist privacy-enhancing speech computing. Unlike conventional works focusing primarily on data perturbation or distributed algorithms, our work studies the possibilities of using pre-trained generative models to synthesize speech content as training data with just label guidance. We show that zero-shot learning with training label-guided synthetic speech content remains a challenging task. On the other hand, our results demonstrate that the model trained with synthetic speech samples provides an effective initialization point for low-resource ASU training. This result reveals the potential to enhance privacy by reducing user data collection but using label-guided synthetic speech content. △ Less

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.05350 [pdf, other]

doi 10.1109/ACII59096.2023.10388152

PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models

Authors: Tiantian Feng, Shrikanth Narayanan

Abstract: Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations using self-supervised or weakly-supervised learning objectives from large-scale d… ▽ More Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations using self-supervised or weakly-supervised learning objectives from large-scale datasets. Despite the significant advances made in SER through the use of pre-trained architecture, fine-tuning these large pre-trained models for different datasets requires saving copies of entire weight parameters, rendering them impractical to deploy in real-world settings. As an alternative, this work explores parameter-efficient fine-tuning (PEFT) approaches for adapting pre-trained speech models for emotion recognition. Specifically, we evaluate the efficacy of adapter tuning, embedding prompt tuning, and LoRa (Low-rank approximation) on four popular SER testbeds. Our results reveal that LoRa achieves the best fine-tuning performance in emotion recognition while enhancing fairness and requiring only a minimal extra amount of weight parameters. Furthermore, our findings offer novel insights into future research directions in SER, distinct from existing approaches focusing on directly fine-tuning the model architecture. Our code is publicly available under: https://github.com/usc-sail/peft-ser. △ Less

Submitted 14 February, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: This work was accepted to the 11th International Conference on Affective Computing and Intelligent Interaction (ACII), 2023

arXiv:2305.14117 [pdf, other]

Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings

Authors: Anfeng Xu, Rajat Hebbar, Rimita Lahiri, Tiantian Feng, Lindsay Butler, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

Abstract: Speech processing techniques are useful for analyzing speech and language development in children with Autism Spectrum Disorder (ASD), who are often varied and delayed in acquiring these skills. Early identification and intervention are crucial, but traditional assessment methodologies such as caregiver reports are not adequate for the requisite behavioral phenotyping. Natural Language Sample (NLS… ▽ More Speech processing techniques are useful for analyzing speech and language development in children with Autism Spectrum Disorder (ASD), who are often varied and delayed in acquiring these skills. Early identification and intervention are crucial, but traditional assessment methodologies such as caregiver reports are not adequate for the requisite behavioral phenotyping. Natural Language Sample (NLS) analysis has gained attention as a promising complement. Researchers have developed benchmarks for spoken language capabilities in children with ASD, obtainable through the analysis of NLS. This paper proposes applications of speech processing technologies in support of automated assessment of children's spoken language development by classification between child and adult speech and between speech and nonverbal vocalization in NLS, with respective F1 macro scores of 82.6% and 67.8%, underscoring the potential for accurate and scalable tools for ASD research and clinical use. △ Less

Submitted 31 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023, 5 pages

arXiv:2305.11229 [pdf, other]

TrustSER: On the Trustworthiness of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition

Authors: Tiantian Feng, Rajat Hebbar, Shrikanth Narayanan

Abstract: Recent studies have explored the use of pre-trained embeddings for speech emotion recognition (SER), achieving comparable performance to conventional methods that rely on low-level knowledge-inspired acoustic features. These embeddings are often generated from models trained on large-scale speech datasets using self-supervised or weakly-supervised learning objectives. Despite the significant advan… ▽ More Recent studies have explored the use of pre-trained embeddings for speech emotion recognition (SER), achieving comparable performance to conventional methods that rely on low-level knowledge-inspired acoustic features. These embeddings are often generated from models trained on large-scale speech datasets using self-supervised or weakly-supervised learning objectives. Despite the significant advancements made in SER through the use of pre-trained embeddings, there is a limited understanding of the trustworthiness of these methods, including privacy breaches, unfair performance, vulnerability to adversarial attacks, and computational cost, all of which may hinder the real-world deployment of these systems. In response, we introduce TrustSER, a general framework designed to evaluate the trustworthiness of SER systems using deep learning methods, with a focus on privacy, safety, fairness, and sustainability, offering unique insights into future research in the field of SER. Our code is publicly available under: https://github.com/usc-sail/trust-ser. △ Less

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2302.12757 [pdf, other]

Ensemble knowledge distillation of self-supervised speech models

Authors: Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-yi Lee

Abstract: Distilled self-supervised models have shown competitive performance and efficiency in recent years. However, there is a lack of experience in jointly distilling multiple self-supervised speech models. In our work, we performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM. We tried two different aggregation techniques, layerw… ▽ More Distilled self-supervised models have shown competitive performance and efficiency in recent years. However, there is a lack of experience in jointly distilling multiple self-supervised speech models. In our work, we performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM. We tried two different aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of different teacher models and found that the former was more effective. On top of that, we proposed a multiple prediction head method for student models to predict different layer outputs of multiple teacher models simultaneously. The experimental results show that our method improves the performance of the distilled models on four downstream speech processing tasks, Phoneme Recognition, Speaker Identification, Emotion Recognition, and Automatic Speech Recognition in the hidden-set track of the SUPERB benchmark. △ Less

Submitted 24 February, 2023; originally announced February 2023.

Comments: Accepted by ICASSP 2023

arXiv:2212.09090 [pdf, other]

Exploring Workplace Behaviors through Speaking Patterns using Large-scale Multimodal Wearable Recordings: A Study of Healthcare Providers

Authors: Tiantian Feng, Shrikanth Narayanan

Abstract: Interpersonal spoken communication is central to human interaction and the exchange of information. Such interactive processes involve not only speech and spoken language but also non-verbal cues such as hand gestures, facial expressions, and nonverbal vocalization, that are used to express feelings and provide feedback. These multimodal communication signals carry a variety of information about t… ▽ More Interpersonal spoken communication is central to human interaction and the exchange of information. Such interactive processes involve not only speech and spoken language but also non-verbal cues such as hand gestures, facial expressions, and nonverbal vocalization, that are used to express feelings and provide feedback. These multimodal communication signals carry a variety of information about the people: traits like gender and age as well as about physical and psychological states and behavior. This work uses wearable multimodal sensors to investigate interpersonal communication behaviors focusing on speaking patterns among healthcare providers with a focus on nurses. We analyze longitudinal data collected from $99$ nurses in a large hospital setting over ten weeks. The results indicate that speaking pattern differences across shift schedules and working units. Moreover, results show that speaking patterns combined with physiological measures can be used to predict affect measures and life satisfaction scores. The implementation of this work can be accessed at https://github.com/usc-sail/tiles-audio-arousal. △ Less

Submitted 18 December, 2022; originally announced December 2022.

arXiv:2212.09006 [pdf, other]

doi 10.1561/116.00000084

A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness

Authors: Tiantian Feng, Rajat Hebbar, Nicholas Mehlman, Xuan Shi, Aditya Kommineni, and Shrikanth Narayanan

Abstract: Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to be considered more trustworthy for broader deployment. Specifically, concerns over priv… ▽ More Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to be considered more trustworthy for broader deployment. Specifically, concerns over privacy breaches, discriminating performance, and vulnerability to adversarial attacks have all been discovered in ML research fields. In order to address the above challenges and risks, a significant number of efforts have been made to ensure these ML systems are trustworthy, especially private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire the researchers who wish to explore further in this area. △ Less

Submitted 16 April, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

Journal ref: APSIPA Transactions on Signal and Information Processing, vol. 12, no. 3, 2023

arXiv:2211.14806 [pdf, other]

Efficient Demand Response Location Targeting for Price Spike Mitigation by Exploiting Price-demand Relationship

Authors: Yufan Zhang, Honglin Wen, Tao Feng, Yize Chen

Abstract: Demand response (DR) leverages demand-side flexibility, offering a promising approach to enhance market conditions like mitigating wholesale price spikes. However, poorly chosen DR locations can inadvertently increase electricity prices. For that, we introduce a method to rigorously select DR locations and corresponding demand reductions. We formulate a bilevel program where the upper level determ… ▽ More Demand response (DR) leverages demand-side flexibility, offering a promising approach to enhance market conditions like mitigating wholesale price spikes. However, poorly chosen DR locations can inadvertently increase electricity prices. For that, we introduce a method to rigorously select DR locations and corresponding demand reductions. We formulate a bilevel program where the upper level determines the DR locations and demand reductions while ensuring the average nodal prices meet a predetermined target. The lower level tackles an economic dispatch (ED) problem and feeds the resulting nodal prices back to the upper level based on post-DR demands. This bilevel formulation presents challenges due to the lower-level non-convexity affecting the upper-level constraints on average nodal prices. To address this, we propose to replace the lower level with a piecewise linear function representing the price-demand relationship, solving iteratively for each linear segment. This results in a tractable mixed-integer linear program. An acceleration strategy is proposed to further reduce the computation time. Numerical studies demonstrate the ability of the proposed approach to reduce prices to a desired level. Besides, we empirically show that the proposed approach is robust against inaccurate system parameters and can reduce computation time by over 50%. △ Less

Submitted 5 August, 2024; v1 submitted 27 November, 2022; originally announced November 2022.

Comments: Submitted to Applied Energy

arXiv:2211.09949 [pdf, other]

Compressing Transformer-based self-supervised models for speech processing

Authors: Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-yi Lee, Hao Tang

Abstract: Despite the success of Transformers in self- supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, but the settings and metrics are different across studies. Trade-off at various compressi… ▽ More Despite the success of Transformers in self- supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, but the settings and metrics are different across studies. Trade-off at various compression rates are also largely missing in prior work, making it difficult to compare compression techniques. In this work, we aim to provide context for the isolated results, studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report trade- off at various compression rate, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations. Our results show that compared to recent approaches, basic compression techniques are strong baselines. We further present several applications of our results, revealing properties of Transformers, such as the significance of diagonal attention heads. In addition, our results lead to a simple combination of compression techniques that improves trade-off over recent approaches. We hope the results would promote more diverse comparisons among model compression techniques and promote the use of model compression as a tool for analyzing models. Our code of compressing speech self-supervised model is available at https://github.com/nervjack2/Speech-SSL-Compression/. △ Less

Submitted 26 January, 2024; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLP)

arXiv:2210.15826 [pdf, other]

Multimodal Estimation of Change Points of Physiological Arousal in Drivers

Authors: Kleanthis Avramidis, Tiantian Feng, Digbalay Bose, Shrikanth Narayanan

Abstract: Detecting unsafe driving states, such as stress, drowsiness, and fatigue, is an important component of ensuring driving safety and an essential prerequisite for automatic intervention systems in vehicles. These concerning conditions are primarily connected to the driver's low or high arousal levels. In this study, we describe a framework for processing multimodal physiological time-series from wea… ▽ More Detecting unsafe driving states, such as stress, drowsiness, and fatigue, is an important component of ensuring driving safety and an essential prerequisite for automatic intervention systems in vehicles. These concerning conditions are primarily connected to the driver's low or high arousal levels. In this study, we describe a framework for processing multimodal physiological time-series from wearable sensors during driving and locating points of prominent change in drivers' physiological arousal state. These points of change could potentially indicate events that require just-in-time intervention. We apply time-series segmentation on heart rate and breathing rate measurements and quantify their robustness in capturing change points in electrodermal activity, treated as a reference index for arousal, as well as on self-reported stress ratings, using three public datasets. Our experiments demonstrate that physiological measures are veritable indicators of change points of arousal and perform robustly across an extensive ablation study. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Comments: 5 pages, 3 tables, 4 figures

arXiv:2210.15707 [pdf, other]

FedAudio: A Federated Learning Benchmark for Audio Tasks

Authors: Tuo Zhang, Tiantian Feng, Samiul Alam, Sunwoo Lee, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr

Abstract: Federated learning (FL) has gained substantial attention in recent years due to the data privacy concerns related to the pervasiveness of consumer devices that continuously collect data from users. While a number of FL benchmarks have been developed to facilitate FL research, none of them include audio data and audio-related tasks. In this paper, we fill this critical gap by introducing a new FL b… ▽ More Federated learning (FL) has gained substantial attention in recent years due to the data privacy concerns related to the pervasiveness of consumer devices that continuously collect data from users. While a number of FL benchmarks have been developed to facilitate FL research, none of them include audio data and audio-related tasks. In this paper, we fill this critical gap by introducing a new FL benchmark for audio tasks which we refer to as FedAudio. FedAudio includes four representative and commonly used audio datasets from three important audio tasks that are well aligned with FL use cases. In particular, a unique contribution of FedAudio is the introduction of data noises and label errors to the datasets to emulate challenges when deploying FL systems in real-world settings. FedAudio also includes the benchmark results of the datasets and a PyTorch library with the objective of facilitating researchers to fairly compare their algorithms. We hope FedAudio could act as a catalyst to inspire new FL research for audio tasks and thus benefit the acoustic and speech research community. The datasets and benchmark results can be accessed at https://github.com/zhang-tuo-pdf/FedAudio. △ Less

Submitted 8 February, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

arXiv:2210.08634 [pdf, other]

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

Authors: Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-yi Lee

Abstract: We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency. The challenge builds upon the SUPERB benchmark and implements metrics to measure the computation requirements of self-supervised learning (SSL) representation and to evaluate its generalizability and performance across the diverse SUPERB… ▽ More We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency. The challenge builds upon the SUPERB benchmark and implements metrics to measure the computation requirements of self-supervised learning (SSL) representation and to evaluate its generalizability and performance across the diverse SUPERB tasks. The SUPERB benchmark provides comprehensive coverage of popular speech processing tasks, from speech and speaker recognition to audio generation and semantic understanding. As SSL has gained interest in the speech community and showed promising outcomes, we envision the challenge to uplevel the impact of SSL techniques by motivating more practical designs of techniques beyond task performance. We summarize the results of 14 submitted models in this paper. We also discuss the main findings from those submissions and the future directions of SSL research. △ Less

Submitted 29 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: Accepted by 2022 SLT Workshop

arXiv:2204.02500 [pdf, other]

doi 10.21437/Interspeech.2022-10060

User-Level Differential Privacy against Attribute Inference Attack of Speech Emotion Recognition in Federated Learning

Authors: Tiantian Feng, Raghuveer Peri, Shrikanth Narayanan

Abstract: Many existing privacy-enhanced speech emotion recognition (SER) frameworks focus on perturbing the original speech data through adversarial training within a centralized machine learning setup. However, this privacy protection scheme can fail since the adversary can still access the perturbed data. In recent years, distributed learning algorithms, especially federated learning (FL), have gained po… ▽ More Many existing privacy-enhanced speech emotion recognition (SER) frameworks focus on perturbing the original speech data through adversarial training within a centralized machine learning setup. However, this privacy protection scheme can fail since the adversary can still access the perturbed data. In recent years, distributed learning algorithms, especially federated learning (FL), have gained popularity to protect privacy in machine learning applications. While FL provides good intuition to safeguard privacy by keeping the data on local devices, prior work has shown that privacy attacks, such as attribute inference attacks, are achievable for SER systems trained using FL. In this work, we propose to evaluate the user-level differential privacy (UDP) in mitigating the privacy leaks of the SER system in FL. UDP provides theoretical privacy guarantees with privacy parameters $ε$ and $δ$. Our results show that the UDP can effectively decrease attribute information leakage while keeping the utility of the SER system with the adversary accessing one model update. However, the efficacy of the UDP suffers when the FL system leaks more model updates to the adversary. We make the code publicly available to reproduce the results in https://github.com/usc-sail/fed-ser-leakage. △ Less

Submitted 16 May, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Journal ref: Proc. Interspeech 2022

arXiv:2203.08810 [pdf, ps, other]

doi 10.21437/Interspeech.2022-141

Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling

Authors: Tiantian Feng, Shrikanth Narayanan

Abstract: Speech Emotion Recognition (SER) application is frequently associated with privacy concerns as it often acquires and transmits speech data at the client-side to remote cloud platforms for further processing. These speech data can reveal not only speech content and affective information but the speaker's identity, demographic traits, and health status. Federated learning (FL) is a distributed machi… ▽ More Speech Emotion Recognition (SER) application is frequently associated with privacy concerns as it often acquires and transmits speech data at the client-side to remote cloud platforms for further processing. These speech data can reveal not only speech content and affective information but the speaker's identity, demographic traits, and health status. Federated learning (FL) is a distributed machine learning algorithm that coordinates clients to train a model collaboratively without sharing local data. This algorithm shows enormous potential for SER applications as sharing raw speech or speech features from a user's device is vulnerable to privacy attacks. However, a major challenge in FL is limited availability of high-quality labeled data samples. In this work, we propose a semi-supervised federated learning framework, Semi-FedSER, that utilizes both labeled and unlabeled data samples to address the challenge of limited labeled data samples in FL. We show that our Semi-FedSER can generate desired SER performance even when the local label rate l=20 using two SER benchmark datasets: IEMOCAP and MSP-Improv. △ Less

Submitted 15 March, 2022; originally announced March 2022.

Comments: This paper was submitted to Insterspeech 2022 for review

Journal ref: Proc. Interspeech 2022

arXiv:2105.08630 [pdf, other]

Fast and Accurate Single-Image Depth Estimation on Mobile Devices, Mobile AI 2021 Challenge: Report

Authors: Andrey Ignatov, Grigory Malivenko, David Plowman, Samarth Shukla, Radu Timofte, Ziyu Zhang, Yicheng Wang, Zilong Huang, Guozhong Luo, Gang Yu, Bin Fu, Yiran Wang, Xingyi Li, Min Shi, Ke Xian, Zhiguo Cao, Jin-Hua Du, Pei-Lin Wu, Chao Ge, Jiaoyang Yao, Fangwen Tu, Bo Li, Jung Eun Yoo, Kwanggyoon Seo, Jialei Xu , et al. (13 additional authors not shown)

Abstract: Depth estimation is an important computer vision problem with many practical applications to mobile devices. While many solutions have been proposed for this task, they are usually very computationally expensive and thus are not applicable for on-device inference. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based d… ▽ More Depth estimation is an important computer vision problem with many practical applications to mobile devices. While many solutions have been proposed for this task, they are usually very computationally expensive and thus are not applicable for on-device inference. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based depth estimation solutions that can demonstrate a nearly real-time performance on smartphones and IoT platforms. For this, the participants were provided with a new large-scale dataset containing RGB-depth image pairs obtained with a dedicated stereo ZED camera producing high-resolution depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the popular Raspberry Pi 4 platform with a mobile ARM-based Broadcom chipset. The proposed solutions can generate VGA resolution depth maps at up to 10 FPS on the Raspberry Pi 4 while achieving high fidelity results, and are compatible with any Android or Linux-based mobile devices. A detailed description of all models developed in the challenge is provided in this paper. △ Less

Submitted 17 May, 2021; originally announced May 2021.

Comments: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/. arXiv admin note: text overlap with arXiv:2105.07809

arXiv:2104.02735 [pdf, other]

doi 10.1109/CVPR52688.2022.01575

Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video

Authors: Berthy T. Feng, Alexander C. Ogren, Chiara Daraio, Katherine L. Bouman

Abstract: An object's interior material properties, while invisible to the human eye, determine motion observed on its surface. We propose an approach that estimates heterogeneous material properties of an object from a monocular video of its surface vibrations. Specifically, we show how to estimate Young's modulus and density throughout a 3D object with known geometry. Knowledge of how these values change… ▽ More An object's interior material properties, while invisible to the human eye, determine motion observed on its surface. We propose an approach that estimates heterogeneous material properties of an object from a monocular video of its surface vibrations. Specifically, we show how to estimate Young's modulus and density throughout a 3D object with known geometry. Knowledge of how these values change across the object is useful for simulating its motion and characterizing any defects. Traditional non-destructive testing approaches, which often require expensive instruments, generally estimate only homogenized material properties or simply identify the presence of defects. In contrast, our approach leverages monocular video to (1) identify image-space modes from an object's sub-pixel motion, and (2) directly infer spatially-varying Young's modulus and density values from the observed modes. We demonstrate our approach on both simulated and real videos. △ Less

Submitted 23 April, 2023; v1 submitted 6 April, 2021; originally announced April 2021.

arXiv:2101.08918 [pdf, other]

Performance Analysis for Cache-enabled Cellular Networks with Cooperative Transmission

Authors: Tianming Feng, Shuo Shi, Shushi Gu, Ning Zhang, Wei Xiang, Xuemai Gu

Abstract: The large amount of deployed smart devices put tremendous traffic pressure on networks. Caching at the edge has been widely studied as a promising technique to solve this problem. To further improve the successful transmission probability (STP) of cache-enabled cellular networks (CEN), we combine the cooperative transmission technique with CEN and propose a novel transmission scheme. Local channel… ▽ More The large amount of deployed smart devices put tremendous traffic pressure on networks. Caching at the edge has been widely studied as a promising technique to solve this problem. To further improve the successful transmission probability (STP) of cache-enabled cellular networks (CEN), we combine the cooperative transmission technique with CEN and propose a novel transmission scheme. Local channel state information (CSI) is introduced at each cooperative base station (BS) to enhance the strength of the signal received by the user. A tight approximation for the STP of this scheme is derived using tools from stochastic geometry. The optimal content placement strategy of this scheme is obtained using a numerical method to maximize the STP. Simulation results demonstrate the optimal strategy achieves significant gains in STP over several comparative baselines with the proposed scheme. △ Less

Submitted 21 January, 2021; originally announced January 2021.

Comments: arXiv admin note: text overlap with arXiv:2101.08669

arXiv:2004.11332 [pdf, ps, other]

UAV-Enabled Data Collection for Wireless Sensor Networks with Distributed Beamforming

Authors: Tianxin Feng, Lifeng Xie, Jianping Yao, Jie Xu

Abstract: This paper studies an unmanned aerial vehicle (UAV)-enabled wireless sensor network, in which one UAV flies in the sky to collect the data transmitted from a set of ground nodes (GNs) via distributed beamforming. We consider two scenarios with delay-tolerant and delay-sensitive applications, in which the GNs send the common/shared messages to the UAV via adaptive- and fixed-rate transmissions, res… ▽ More This paper studies an unmanned aerial vehicle (UAV)-enabled wireless sensor network, in which one UAV flies in the sky to collect the data transmitted from a set of ground nodes (GNs) via distributed beamforming. We consider two scenarios with delay-tolerant and delay-sensitive applications, in which the GNs send the common/shared messages to the UAV via adaptive- and fixed-rate transmissions, respectively. For the two scenarios, we aim to maximize the average data-rate throughput and minimize the transmission outage probability, respectively, by jointly optimizing the UAV's trajectory design and the GNs' transmit power allocation over time, subject to the UAV's flight speed constraints and the GNs' individual average power constraints. However, the two formulated problems are both non-convex and thus generally difficult to be optimally solved. To tackle this issue, we first consider the relaxed problems in the ideal case with the UAV's flight speed constraints ignored, for which the well-structured optimal solutions are obtained to reveal the fundamental performance upper bounds. It is shown that for the two approximate problems, the optimal trajectory solutions have the same multi-location-hovering structure, but with different optimal power allocation strategies. Next, for the general problems with the UAV's flight speed constraints considered, we propose efficient algorithms to obtain high-quality solutions by using the techniques from convex optimization and approximation. Finally, numerical results show that our proposed designs significantly outperform other benchmark schemes, in terms of the achieved data-rate throughput and outage probability under the two scenarios. It is also observed that when the mission period becomes sufficiently long, our proposed designs approach the performance upper bounds when the UAV's flight speed constraints are ignored. △ Less

Submitted 7 August, 2021; v1 submitted 23 April, 2020; originally announced April 2020.

Comments: Double-column, 15 pages, 8 figures, 4 tables. Accepted for publication in the IEEE Transactions on Wireless Communications. It overlaps with the former version (arXiv:2004.11332)

arXiv:2003.08474 [pdf, other]

doi 10.1038/s41597-020-00655-3

TILES-2018, a longitudinal physiologic and behavioral data set of hospital workers

Authors: Karel Mundnich, Brandon M. Booth, Michelle L'Hommedieu, Tiantian Feng, Benjamin Girault, Justin L'Hommedieu, Mackenzie Wildman, Sophia Skaaden, Amrutha Nadarajan, Jennifer L. Villatte, Tiago H. Falk, Kristina Lerman, Emilio Ferrara, Shrikanth Narayanan

Abstract: We present a novel longitudinal multimodal corpus of physiological and behavioral data collected from direct clinical providers in a hospital workplace. We designed the study to investigate the use of off-the-shelf wearable and environmental sensors to understand individual-specific constructs such as job performance, interpersonal interaction, and well-being of hospital workers over time in their… ▽ More We present a novel longitudinal multimodal corpus of physiological and behavioral data collected from direct clinical providers in a hospital workplace. We designed the study to investigate the use of off-the-shelf wearable and environmental sensors to understand individual-specific constructs such as job performance, interpersonal interaction, and well-being of hospital workers over time in their natural day-to-day job settings. We collected behavioral and physiological data from $n = 212$ participants through Internet-of-Things Bluetooth data hubs, wearable sensors (including a wristband, a biometrics-tracking garment, a smartphone, and an audio-feature recorder), together with a battery of surveys to assess personality traits, behavioral states, job performance, and well-being over time. Besides the default use of the data set, we envision several novel research opportunities and potential applications, including multi-modal and multi-task behavioral modeling, authentication through biometrics, and privacy-aware and privacy-preserving machine learning. △ Less

Submitted 18 December, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: 57 pages, 9 figures, journal paper

Journal ref: Sci Data 7, 354 (2020)

arXiv:2003.03574 [pdf, ps, other]

Outage Probability Minimization for UAV-Enabled Data Collection with Distributed Beamforming

Authors: Tianxin Feng, Lifeng Xie, Jianping Yao, Jie Xu

Abstract: This paper studies an unmanned aerial vehicle (UAV)-enabled wireless sensor network, in which one UAV flies in the sky to collect the data transmitted from a set of sensors via distributed beamforming. We consider the delay-sensitive application scenario, in which the sensors transmit the common/shared messages by using fixed data rates and adaptive transmit powers. Under this setup, we jointly op… ▽ More This paper studies an unmanned aerial vehicle (UAV)-enabled wireless sensor network, in which one UAV flies in the sky to collect the data transmitted from a set of sensors via distributed beamforming. We consider the delay-sensitive application scenario, in which the sensors transmit the common/shared messages by using fixed data rates and adaptive transmit powers. Under this setup, we jointly optimize the UAV's trajectory design and the sensors' transmit power allocation, in order to minimize the transmission outage probability, subject to the UAV's flight speed constraints and the sensors' individual average power constraints. However, the formulated outage probability minimization problem is non-convex and thus difficult to be optimally solved in general. To tackle this issue, we first consider the special problem in the ideal case with the UAV's flight speed constraints ignored, for which the well-structured optimal solution is obtained to reveal the fundamental performance upper bound. Next, for the general problem with the UAV's flight speed constraints considered, we propose an efficient algorithm to solve it sub-optimally by using the techniques of convex optimization and approximation. Finally, numerical results show that our proposed design achieves significantly reduced outage probability than other benchmark schemes. △ Less

Submitted 7 March, 2020; originally announced March 2020.

Comments: 6 pages, 4 figures, ICC 2020 workshop

arXiv:2003.00127 [pdf]

Time of arrival imaging: The proof of concept for a novel medical imaging modality

Authors: Tao Feng

Abstract: It has been shown that with the use of ultra-wideband (UWB) electromagnetic signal and time of arrival (ToA) principle, it is possible to locate medical implants given the permittivity distribution of the body. We propose a new imaging modality using the reverse process to acquire permittivity distributions as a surrogate of human anatomy. In the proposed systems, the locations of the signal sourc… ▽ More It has been shown that with the use of ultra-wideband (UWB) electromagnetic signal and time of arrival (ToA) principle, it is possible to locate medical implants given the permittivity distribution of the body. We propose a new imaging modality using the reverse process to acquire permittivity distributions as a surrogate of human anatomy. In the proposed systems, the locations of the signal source, receiver, and signal shapes are assumed to be known exactly. The measured data is recorded as the time it takes for the signal to travel from the signal source to the signal receiver. The finite-difference-time-domain (FDTD) method is used for the modeling of signal propagation within the phantom, which is used for both simulation and image reconstruction. Image reconstruction is achieved using linear regression on the training pairs, which includes randomly generated images and its corresponding arrival times generated using the FDTD approach. The linear weights of the training images are generated to minimize the difference between the arrival time of the reconstruction image and the measured arrival time. A simulation study using UWB signal with the central frequency of 300 MHz and the Shepp-Logan phantom was carried out. Ten-picosecond timing resolution is used for the simulation and image reconstruction. The quantitative difference between the arrival times of the phantom and the reconstructed image reduced with an increased iteration number. The quantitative error of the reconstructed image reached below 10% after 900 iterations, and 8.4% after 1200 iterations. With additional post-smoothing to suppress the introduced noise pattern through reconstruction, 6.5% error was achieved. In this paper, an approach that utilizes the ToA principle to achieve transmission imaging with radio waves is proposed and validated using a simulation study. △ Less

Submitted 28 February, 2020; originally announced March 2020.

arXiv:1906.08889 [pdf, other]

doi 10.1109/LRA.2019.2925555

SGANVO: Unsupervised Deep Visual Odometry and Depth Estimation with Stacked Generative Adversarial Networks

Authors: Tuo Feng, Dongbing Gu

Abstract: Recently end-to-end unsupervised deep learning methods have achieved an effect beyond geometric methods for visual depth and ego-motion estimation tasks. These data-based learning methods perform more robustly and accurately in some of the challenging scenes. The encoder-decoder network has been widely used in the depth estimation and the RCNN has brought significant improvements in the ego-motion… ▽ More Recently end-to-end unsupervised deep learning methods have achieved an effect beyond geometric methods for visual depth and ego-motion estimation tasks. These data-based learning methods perform more robustly and accurately in some of the challenging scenes. The encoder-decoder network has been widely used in the depth estimation and the RCNN has brought significant improvements in the ego-motion estimation. Furthermore, the latest use of Generative Adversarial Nets(GANs) in depth and ego-motion estimation has demonstrated that the estimation could be further improved by generating pictures in the game learning process. This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Network(SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamic due to the use of a recurrent representation across the layers. See Fig.1 for details. We select the most commonly used KITTI [1] data set for evaluation. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation. △ Less

Submitted 20 June, 2019; originally announced June 2019.

Comments: 7 pages, 4 figures,

Report number: ras.ral.19-0181.628f4a7b

arXiv:1903.08860 [pdf, ps, other]

Cognitive Wireless Power Transfer in the Presence of Reactive Primary Communication User

Authors: Tianxin Feng, Ganggang Ma, Jie Xu

Abstract: This paper studies a cognitive or secondary multi-antenna wireless power transfer (WPT) system over a multi-carrier channel, which shares the same spectrum with a primary wireless information transfer (WIT) system that employs adaptive water-filling power allocation. By controlling the transmit energy beamforming over sub-carriers (SCs), the secondary energy transmitter (S-ET) can directly charge… ▽ More This paper studies a cognitive or secondary multi-antenna wireless power transfer (WPT) system over a multi-carrier channel, which shares the same spectrum with a primary wireless information transfer (WIT) system that employs adaptive water-filling power allocation. By controlling the transmit energy beamforming over sub-carriers (SCs), the secondary energy transmitter (S-ET) can directly charge the secondary energy receiver (S-ER), even purposely interfere with the primary WIT system, such that the primary information transmitter (P-IT) can reactively adjust its power allocation (based on water-filling) to facilitate the S-ER's energy harvesting. We investigate how the secondary WPT system can exploit the primary WIT system's reactive power allocation, for improving the wireless energy harvesting performance. In particular, our objective is to maximize the total energy received at the S-ER from both the S-ET and the P-IT, by optimizing the S-ET's energy beamforming over SCs, subject to its maximum transmit power constraint, and the maximum interference power constraint imposed at the primary information receiver (P-IR) to protect the primary WIT. Although the formulated problem is non-convex and difficult to be optimally solved in general, we propose an efficient algorithm to obtain a high-quality solution by employing the Lagrange dual method together with a one-dimensional search. We also present two benchmark energy beamforming designs based on the zero-forcing (ZF) and maximum-ratio-transmission (MRT) principles, respectively, as well as the conventional design without considering the primary WIT system's reaction. Numerical results show that our proposed design leads to significantly improved energy harvesting performance at the S-ER, as compared to these benchmark schemes. △ Less

Submitted 21 March, 2019; originally announced March 2019.

Showing 1–42 of 42 results for author: Feng, T