-
Neural network based approach for solving problems in plane wave duct acoustics
Authors:
D. Veerababu,
Prasanta K. Ghosh
Abstract:
Neural networks have emerged as a tool for solving differential equations in many branches of engineering and science. But their progress in frequency domain acoustics is limited by the vanishing gradient problem that occurs at higher frequencies. This paper discusses a formulation that can address this issue. The problem of solving the governing differential equation along with the boundary conditions is posed as an unconstrained optimization problem. The acoustic field is approximated to the output of a neural network which is constructed in such a way that it always satisfies the boundary conditions. The applicability of the formulation is demonstrated on popular problems in plane wave acoustic theory. The predicted solution from the neural network formulation is compared with those obtained from the analytical solution. A good agreement is observed between the two solutions. The method of transfer learning to calculate the particle velocity from the existing acoustic pressure field is demonstrated with and without mean flow effects. The sensitivity of the training process to the choice of the activation function and the number of collocation points is studied.
Submitted 7 May, 2024;
originally announced May 2024.
-
Model Adaptation for ASR in low-resource Indian Languages
Authors:
Abhayjeet Singh,
Arjun Singh Mehta,
Ashish Khuraishi K S,
Deekshitha G,
Gauri Date,
Jai Nanavati,
Jesuraja Bandekar,
Karnalius Basumatary,
Karthika P,
Sandhya Badiger,
Sathvik Udupa,
Saurabh Kumar,
Savitha,
Prasanta Kumar Ghosh,
Prashanthi V,
Priyanka Pai,
Raoul Nanavati,
Rohan Saxena,
Sai Praneeth Reddy Mora,
Srinivasa Raghavan
Abstract:
Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages.
In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the converse could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages, Bengali and Bhojpuri. These approaches are not limited to Indian languages; the solutions are potentially applicable to various languages spoken around the world.
Submitted 16 July, 2023;
originally announced July 2023.
-
Real-Time MRI Video synthesis from time aligned phonemes with sequence-to-sequence networks
Authors:
Sathvik Udupa,
Prasanta Kumar Ghosh
Abstract:
Real-Time Magnetic resonance imaging (rtMRI) of the midsagittal plane of the mouth is of interest for speech production research. In this work, we focus on estimating utterance level rtMRI video from the spoken phoneme sequence. We obtain time-aligned phonemes from forced alignment, to obtain frame-level phoneme sequences which are aligned with rtMRI frames. We propose a sequence-to-sequence learning model with a transformer phoneme encoder and convolutional frame decoder. We then modify the learning by using intermediary features obtained from sampling from a pretrained phoneme-conditioned variational autoencoder (CVAE). We train on 8 subjects in a subject-specific manner and demonstrate the performance with a subjective test. We also use an auxiliary task of air tissue boundary (ATB) segmentation to obtain the objective scores on the proposed models. We show that the proposed method is able to generate realistic rtMRI video for unseen utterances, and adding CVAE is beneficial for learning the sequence-to-sequence mapping for subjects where the mapping is hard to learn.
Submitted 30 October, 2022;
originally announced October 2022.
-
Improved acoustic-to-articulatory inversion using representations from pretrained self-supervised learning models
Authors:
Sathvik Udupa,
Siddarth C,
Prasanta Kumar Ghosh
Abstract:
In this work, we investigate the effectiveness of pretrained Self-Supervised Learning (SSL) features for learning the mapping for acoustic to articulatory inversion (AAI). Signal processing-based acoustic features such as MFCCs have been predominantly used for the AAI task with deep neural networks. With SSL features working well for various other speech tasks such as speech recognition, emotion classification, etc., we experiment with its efficacy for AAI. We train on SSL features with transformer neural networks-based AAI models of 3 different model complexities and compare its performance with MFCCs in subject-specific (SS), pooled and fine-tuned (FT) configurations with data from 10 subjects, and evaluate with correlation coefficient (CC) score on the unseen sentence test set. We find that acoustic feature reconstruction objective-based SSL features such as TERA and DeCoAR work well for AAI, with SS CCs of these SSL features reaching close to the best FT CCs of MFCC. We also find the results consistent across different model sizes.
Submitted 30 October, 2022;
originally announced October 2022.
-
An error correction scheme for improved air-tissue boundary in real-time MRI video for speech production
Authors:
Anwesha Roy,
Varun Belagali,
Prasanta Kumar Ghosh
Abstract:
The best performance in Air-tissue boundary (ATB) segmentation of real-time Magnetic Resonance Imaging (rtMRI) videos in speech production is known to be achieved by a 3-dimensional convolutional neural network (3D-CNN) model. However, the evaluation of this model, as well as other ATB segmentation techniques reported in the literature, is done using Dynamic Time Warping (DTW) distance between the entire original and predicted contours. Such an evaluation measure may not capture local errors in the predicted contour. Careful analysis of predicted contours reveals errors in regions like the velum part of contour1 (ATB comprising the upper lip, hard palate, and velum) and the tongue base section of contour2 (ATB covering the jawline, lower lip, tongue base, and epiglottis), which are not captured in a global evaluation metric like DTW distance. In this work, we automatically detect such errors and propose a correction scheme for them. We also propose two new evaluation metrics for ATB segmentation separately in contour1 and contour2 to explicitly capture two types of errors in these contours. The proposed detection and correction strategies result in an improvement of these two evaluation metrics by 61.8% and 61.4% for contour1 and by 67.8% and 28.4% for contour2. Traditional DTW distance, on the other hand, improves by 44.6% for contour1 and 4.0% for contour2.
Submitted 8 March, 2022;
originally announced March 2022.
-
A study on native American English speech recognition by Indian listeners with varying word familiarity level
Authors:
Abhayjeet Singh,
Achuth Rao MV,
Rakesh Vaideeswaran,
Chiranjeevi Yarra,
Prasanta Kumar Ghosh
Abstract:
In this study, listeners of varied Indian nativities are asked to listen and recognize TIMIT utterances spoken by American speakers. We have three kinds of responses from each listener while they recognize an utterance: 1. Sentence difficulty ratings, 2. Speaker difficulty ratings, and 3. Transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups: Easy, Medium and Hard, based on the frequency of occurrence of the words in them. We observe that the sentence difficulty ratings, speaker difficulty ratings and WERs increase from easy to hard categories of sentences. We also compare the human speech recognition (HSR) performance with that of three automatic speech recognition (ASR) systems under the following three combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM using recordings from native American speakers and LM built on text from LIBRI speech corpus, and ASR3) AM using recordings from native American speakers and LM built on LIBRI speech and TIMIT text. We observe that HSR performance is similar to that of ASR1 whereas ASR3 achieves the best performance. Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from a few other nativities.
Submitted 8 December, 2021;
originally announced December 2021.
-
Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms
Authors:
Srikanth Raj Chetupalli,
Prashant Krishnan,
Neeraj Sharma,
Ananya Muguli,
Rohit Kumar,
Viral Nanda,
Lancelot Mark Pinto,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostic using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, are recorded using a web-application over a period of ten months. We investigate the use of statistical descriptors of simple time-frequency features for acoustic signals and binary features for the presence of symptoms. Unlike previous works, we primarily focus on the application of simple linear classifiers like logistic regression and support vector machines for acoustic data while decision tree models are employed on the symptoms data. We show that a multi-modal integration of acoustics and symptoms classifiers achieves an area-under-curve (AUC) of 92.40, a significant improvement over any individual modality. Several ablation experiments are also provided which highlight the acoustic and symptom dimensions that are important for the task of COVID-19 diagnostics.
Submitted 5 June, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Estimating articulatory movements in speech production with transformer networks
Authors:
Sathvik Udupa,
Anwesha Roy,
Abhayjeet Singh,
Aravind Illa,
Prasanta Kumar Ghosh
Abstract:
We estimate articulatory movements in speech production from different modalities - acoustics and phonemes. Acoustic-to-articulatory inversion (AAI) is a sequence-to-sequence task. On the other hand, phoneme-to-articulatory (PTA) motion estimation faces a key challenge in reliably aligning the text and the articulatory movements. To address this challenge, we explore the use of a transformer architecture - FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and articulatory movements. We also train a transformer model on AAI. We use correlation coefficient (CC) and root mean squared error (rMSE) to assess the estimation performance in comparison to existing methods on both tasks. We observe 154%, 11.8% & 4.8% relative improvement in CC with subject-dependent, pooled and fine-tuning strategies, respectively, for PTA estimation. Additionally, on the AAI task, we obtain 1.5%, 3% and 3.1% relative gain in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of having transformer architecture as representation blocks.
Submitted 12 June, 2021; v1 submitted 11 April, 2021;
originally announced April 2021.
-
Multilingual and code-switching ASR challenges for low resource Indian languages
Authors:
Anuj Diwan,
Rakesh Vaideeswaran,
Sanket Shah,
Ankita Singh,
Srinivasa Raghavan,
Shreya Khare,
Vinit Unni,
Saurabh Vyas,
Akash Rajpuria,
Chiranjeevi Yarra,
Ashish Mittal,
Prasanta Kumar Ghosh,
Preethi Jyothi,
Kalika Bali,
Vivek Seshadri,
Sunayana Sitaram,
Samarth Bharadwaj,
Jai Nanavati,
Raoul Nanavati,
Karthik Sankaranarayanan,
Tejaswi Seeram,
Basil Abraham
Abstract:
Recently, there is increasing interest in multilingual automatic speech recognition (ASR) where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labeled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics, linguistic characteristics as well as the amount of data available and how these are carefully considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
Submitted 31 March, 2021;
originally announced April 2021.
-
DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
Authors:
Ananya Muguli,
Lancelot Pinto,
Nirmala R.,
Neeraj Sharma,
Prashant Krishnan,
Prasanta Kumar Ghosh,
Rohit Kumar,
Shrirama Bhat,
Srikanth Raj Chetupalli,
Sriram Ganapathy,
Shreyas Ramoji,
Viral Nanda
Abstract:
The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These recordings were collected via crowdsourcing from multiple countries, through a website application. The challenge features two tracks, one focusing on cough sounds, and the other on using a collection of breath, sustained vowel phonation, and number counting speech recordings. In this paper, we introduce the challenge and provide a detailed description of the task, and present a baseline system for the task.
Submitted 17 June, 2021; v1 submitted 16 March, 2021;
originally announced March 2021.
-
Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates
Authors:
Abhayjeet Singh,
Aravind Illa,
Prasanta Kumar Ghosh
Abstract:
While speaking at different rates, articulators (like tongue, lips) tend to move differently and the enunciations are also of different durations. In the past, affine transformation and DNN have been used to transform articulatory movements from neutral to fast (N2F) and neutral to slow (N2S) speaking rates [1]. In this work, we improve over the existing transformation techniques by modeling rate-specific durations and their transformation using AstNet, an encoder-decoder framework with attention. In the current work, we propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories. For modeling duration variations across speaking rates, we deploy an attention network, which eliminates the need to align trajectories in different rates using DTW. We perform a phoneme-specific duration analysis to examine how well duration is transformed using the proposed AstNet. As the range of articulatory motions is correlated with speaking rate, we also analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts, to examine how well the proposed AstNet predicts the extent of articulatory movements in N2F and N2S. We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques, resulting in more accurate transformed articulatory trajectories.
Submitted 20 August, 2020; v1 submitted 4 June, 2020;
originally announced June 2020.
-
Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Authors:
Neeraj Sharma,
Prashant Krishnan,
Rohit Kumar,
Shreyas Ramoji,
Srikanth Raj Chetupalli,
Nirmala R.,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternate diagnosis tool which overcomes these limitations, and is deployable at a large scale. The prominent symptoms of COVID-19 include cough and breathing difficulties. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely, cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a website application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis is a work in progress. We believe that insights from analysis of Coswara can be effective in enabling sound based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help to diagnose COVID-19.
Submitted 11 August, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
A comparative study of estimating articulatory movements from phoneme sequences and acoustic features
Authors:
Abhayjeet Singh,
Aravind Illa,
Prasanta Kumar Ghosh
Abstract:
Unlike phoneme sequences, movements of speech articulators (lips, tongue, jaw, velum) and the resultant acoustic signal are known to encode not only the linguistic message but also carry para-linguistic information. While several works exist for estimating articulatory movement from acoustic signals, little is known about the extent to which articulatory movements can be predicted only from linguistic information, i.e., phoneme sequence. In this work, we estimate articulatory movements from three different input representations: R1) acoustic signal, R2) phoneme sequence, R3) phoneme sequence with timing information. While an attention network is used for estimating articulatory movement in the case of R2, a BLSTM network is used for R1 and R3. Experiments with ten subjects' acoustic-articulatory data reveal that the estimation techniques achieve an average correlation coefficient of 0.85, 0.81, and 0.81 in the case of R1, R2, and R3 respectively. This indicates that the attention network, although it uses only the phoneme sequence (R2) without any timing information, results in an estimation performance similar to that using the rich acoustic signal (R1), suggesting that articulatory motion is primarily driven by the linguistic message. The correlation coefficient is further improved to 0.88 when R1 and R3 are used together for estimating articulatory movements.
Submitted 19 February, 2020; v1 submitted 31 October, 2019;
originally announced October 2019.
-
The Role of Boolean Function in Fractal Formation and its Application to CDMA Wireless Communication
Authors:
Somnath Mukherjee,
Pabitra Kumar Ghosh
Abstract:
In this paper, a new transformation is generated from a three variable Boolean function 3, which is used to produce a self-similar fractal pattern of dimension 1.58. This fractal pattern is used to reconstruct the whole structural position of resources in a wireless CDMA network. This reconstruction minimizes the number of resources in the network, and so network consumption costs are reduced. Nowadays, resource control and cost minimization are still a severe problem in wireless CDMA networks. To overcome this problem, the fractal pattern produced in our research provides a complete solution for the structural position of resources in this wireless CDMA network.
Submitted 19 May, 2010; v1 submitted 30 April, 2010;
originally announced April 2010.