-
Uncertainty-Aware Shape Estimation of a Surgical Continuum Manipulator in Constrained Environments using Fiber Bragg Grating Sensors
Authors:
Alexander Schwarz,
Arian Mehrfard,
Golchehr Amirkhani,
Henry Phalen,
Justin H. Ma,
Robert B. Grupp,
Alejandro Martin-Gomez,
Mehran Armand
Abstract:
Continuum Dexterous Manipulators (CDMs) are well-suited tools for minimally invasive surgery due to their inherent dexterity and reachability. Nonetheless, their flexible structure and non-linear curvature pose significant challenges for shape-based feedback control. The use of Fiber Bragg Grating (FBG) sensors for shape sensing has shown great potential in estimating the CDM's tip position and subsequently reconstructing the shape using optimization algorithms. This optimization, however, is under-constrained and may be ill-posed for complex shapes, falling into local minima. In this work, we introduce a novel method capable of directly estimating a CDM's shape from FBG sensor wavelengths using a deep neural network. In addition, we propose the integration of uncertainty estimation to address the critical issue of uncertainty in neural network predictions. Neural network predictions are unreliable when the input sample is outside the training distribution or corrupted by noise. Recognizing such deviations is crucial when integrating neural networks within surgical robotics, as inaccurate estimations can pose serious risks to the patient. We present a robust method that not only improves upon the precision of existing techniques for FBG-based shape estimation but also incorporates a mechanism to quantify the model's confidence through uncertainty estimation. We validate the uncertainty estimation through extensive experiments, demonstrating its effectiveness and reliability on out-of-distribution (OOD) data, adding a layer of safety and precision to minimally invasive surgical robotics.
Submitted 11 May, 2024;
originally announced May 2024.
-
GALÆXI: Solving complex compressible flows with high-order discontinuous Galerkin methods on accelerator-based systems
Authors:
Daniel Kempf,
Marius Kurz,
Marcel Blind,
Patrick Kopper,
Philipp Offenhäuser,
Anna Schwarz,
Spencer Starr,
Jens Keim,
Andrea Beck
Abstract:
This work presents GALÆXI as a novel, energy-efficient flow solver for the simulation of compressible flows on unstructured meshes leveraging the parallel computing power of modern Graphics Processing Units (GPUs). GALÆXI implements the high-order Discontinuous Galerkin Spectral Element Method (DGSEM) using shock capturing with a finite-volume subcell approach to ensure the stability of the high-order scheme near shocks. This work provides details on the general code design, the parallelization strategy, and the implementation approach for the compute kernels with a focus on the element-local mappings between volume and surface data due to the unstructured mesh. GALÆXI exhibits excellent strong scaling properties up to 1024 GPUs if each GPU is assigned a minimum of one million degrees of freedom. To verify its implementation, a convergence study is performed that recovers the theoretical order of convergence of the implemented numerical schemes. Moreover, the solver is validated using both the incompressible and compressible formulations of the Taylor-Green-Vortex at Mach numbers of 0.1 and 1.25, respectively. A mesh convergence study shows that the results converge to the high-fidelity reference solution and that the results match the original CPU implementation. Finally, GALÆXI is applied to a large-scale wall-resolved large eddy simulation of a linear cascade of the NASA Rotor 37. Here, the supersonic region and shocks at the leading edge are captured accurately and robustly by the implemented shock-capturing approach. It is demonstrated that GALÆXI requires less than half of the energy to carry out this simulation in comparison to the reference CPU implementation. This renders GALÆXI a potent tool for accurate and efficient simulations of compressible flows in the realm of exascale computing and the associated new HPC architectures.
Submitted 19 April, 2024;
originally announced April 2024.
-
Creating a Digital Twin of Spinal Surgery: A Proof of Concept
Authors:
Jonas Hein,
Frédéric Giraud,
Lilian Calvet,
Alexander Schwarz,
Nicola Alessandro Cavalcanti,
Sergey Prokudin,
Mazda Farshad,
Siyu Tang,
Marc Pollefeys,
Fabio Carrillo,
Philipp Fürnstahl
Abstract:
Surgery digitalization is the process of creating a virtual replica of real-world surgery, also referred to as a surgical digital twin (SDT). It has significant applications in various fields such as education and training, surgical planning, and automation of surgical tasks. In addition, SDTs are an ideal foundation for machine learning methods, enabling the automatic generation of training data. In this paper, we present a proof of concept (PoC) for surgery digitalization that is applied to an ex-vivo spinal surgery. The proposed digitalization focuses on the acquisition and modelling of the geometry and appearance of the entire surgical scene. We employ five RGB-D cameras for dynamic 3D reconstruction of the surgeon, a high-end camera for 3D reconstruction of the anatomy, an infrared stereo camera for surgical instrument tracking, and a laser scanner for 3D reconstruction of the operating room and data fusion. We justify the proposed methodology, discuss the challenges faced and further extensions of our prototype. While our PoC partially relies on manual data curation, its high quality and great potential motivate the development of automated methods for the creation of SDTs.
Submitted 22 May, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
AnatoMix: Anatomy-aware Data Augmentation for Multi-organ Segmentation
Authors:
Chang Liu,
Fuxin Fan,
Annette Schwarz,
Andreas Maier
Abstract:
Multi-organ segmentation in medical images is a widely researched task and can save clinicians much manual effort in daily routines. Automating the organ segmentation process using deep learning (DL) is a promising solution, and state-of-the-art segmentation models are achieving promising accuracy. In this work, we propose a novel data augmentation strategy for increasing the generalizability of multi-organ segmentation datasets, namely AnatoMix. By object-level matching and manipulation, our method is able to generate new images with correct anatomy, i.e. organ segmentation mask, exponentially increasing the size of the segmentation dataset. Initial experiments investigate how our method influences segmentation performance on a public CT dataset. Our augmentation method can lead to a mean Dice score of 76.1, compared with 74.8 for the baseline method.
Submitted 5 March, 2024;
originally announced March 2024.
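The abstract above reports its result as a mean Dice score. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient averaged over organs. The function names `dice_score` and `mean_dice`, the flat-list mask representation, and the empty-mask convention are illustrative assumptions and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A n B| / (|A| + |B|) for two binary masks.

    Masks are flat sequences of 0/1 values (an assumption made for brevity).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, organ_ids):
    """Average the per-organ Dice score over the given organ label ids."""
    scores = []
    for organ in organ_ids:
        pred_mask = [int(p == organ) for p in pred_labels]
        target_mask = [int(t == organ) for t in target_labels]
        scores.append(dice_score(pred_mask, target_mask))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy label maps with background 0 and two hypothetical organs 1 and 2.
    prediction = [0, 1, 1, 2, 2, 2]
    reference = [0, 1, 2, 2, 2, 0]
    print(mean_dice(prediction, reference, organ_ids=[1, 2]))
```

Scores such as 76.1 are usually this quantity multiplied by 100; whether the paper averages per organ, per case, or both is not stated in the abstract.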
-
Promptformer: Prompted Conformer Transducer for ASR
Authors:
Sergio Duarte-Torres,
Arunasish Sen,
Aman Rana,
Lukas Drude,
Alejandro Gomez-Alanis,
Andreas Schwarz,
Leif Rädel,
Volker Leutnant
Abstract:
Context cues carry information which can improve multi-turn interactions in automatic speech recognition (ASR) systems. In this paper, we introduce a novel mechanism inspired by hyper-prompting to fuse textual context with acoustic representations in the attention mechanism. Results on a test set with multi-turn interactions show that our method achieves 5.9% relative word error rate reduction (rWERR) over a strong baseline. We show that our method does not degrade in the absence of context and leads to improvements even if the model is trained without context. We further show that leveraging a pre-trained sentence-piece model for context embedding generation can outperform an external BERT model.
Submitted 14 January, 2024;
originally announced January 2024.
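Several entries in this listing quote a relative word error rate reduction (rWERR). The sketch below shows how that figure is conventionally derived from two word error rates. The helper names and the example numbers (a baseline WER of 10.0% and a new WER of 9.41%) are hypothetical and chosen only so that the arithmetic lands on 5.9%; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance, kept to one previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("turn on the light", "turn off the light"))
    # Hypothetical WERs in percent, not taken from the paper.
    print(relative_wer_reduction(baseline_wer=10.0, new_wer=9.41))
```

Note that a relative reduction says nothing about the absolute error level: the same 5.9% could describe a drop from 10.0% to 9.41% or from 2.0% to 1.88%.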
-
Towards Exascale CFD Simulations Using the Discontinuous Galerkin Solver FLEXI
Authors:
Marcel Blind,
Min Gao,
Daniel Kempf,
Patrick Kopper,
Marius Kurz,
Anna Schwarz,
Andrea Beck
Abstract:
Modern high-order discretizations bear considerable potential for the exascale era due to their high fidelity and the high, local computational load that allows for computational efficiency in massively parallel simulations. To this end, the discontinuous Galerkin (DG) framework FLEXI was selected to demonstrate exascale readiness within the Center of Excellence for Exascale CFD (CEEC) by simulating shock buffet on a three-dimensional wing segment at transonic flight conditions. This paper summarizes the recent progress made to enable the simulation of this challenging exascale problem. For this, it is first demonstrated that FLEXI scales excellently to over 500 000 CPU cores on HAWK at the HLRS. To tackle the considerable resolution requirements near the wall, a novel wall model is proposed that takes compressibility effects into account and yields decent results for the simulation of a NACA 64A-110 airfoil. To address the shocks in the domain, a finite-volume-based shock capturing method was implemented in FLEXI, which is validated here using the simulation of a linear compressor cascade at supersonic flow conditions, where the method is demonstrated to yield efficient, robust and accurate results. Lastly, we present the TensorFlow-Fortran-Binding (TFFB) as an easy-to-use library to deploy trained machine learning models in Fortran solvers such as FLEXI.
Submitted 22 June, 2023;
originally announced June 2023.
-
Personalized Predictive ASR for Latency Reduction in Voice Assistants
Authors:
Andreas Schwarz,
Di He,
Maarten Van Segbroeck,
Mohammed Hethnawi,
Ariya Rastrow
Abstract:
Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.
Submitted 23 May, 2023;
originally announced May 2023.
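The prefetching idea described in the abstract above (cache a response for a preliminary hypothesis, deliver it only if the final hypothesis matches) can be illustrated with a small sketch. The class `PrefetchingResponder`, its method names, and the `wasted_calls` counter are hypothetical and stand in for whatever downstream system a real voice assistant uses; the paper's predictive and personalized variants are not modeled here.

```python
class PrefetchingResponder:
    """Toy model of response prefetching keyed on ASR hypotheses."""

    def __init__(self, generate_response):
        # generate_response is the (assumed expensive) downstream call.
        self._generate = generate_response
        self._cache = {}
        self.wasted_calls = 0  # prefetched responses that were never delivered

    def on_preliminary(self, hypothesis):
        """Prefetch and cache a response for a preliminary hypothesis."""
        if hypothesis not in self._cache:
            self._cache[hypothesis] = self._generate(hypothesis)

    def on_final(self, hypothesis):
        """Return (response, cache_hit) for the final hypothesis."""
        hit = hypothesis in self._cache
        if hit:
            response = self._cache.pop(hypothesis)  # latency saved
        else:
            response = self._generate(hypothesis)  # full latency is paid
        # Anything still cached was prefetched in vain: this is the cost side.
        self.wasted_calls += len(self._cache)
        self._cache.clear()
        return response, hit


if __name__ == "__main__":
    responder = PrefetchingResponder(lambda text: "response to: " + text)
    responder.on_preliminary("play jazz")
    print(responder.on_final("play jazz"))   # cache hit
    responder.on_preliminary("play")
    print(responder.on_final("play jazz"))   # cache miss, one wasted prefetch
    print(responder.wasted_calls)
```

The counter makes the tradeoff named in the abstract visible: every successful prefetch saves latency, while every failed one costs a downstream call that produced nothing useful.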
-
Contextual-Utterance Training for Automatic Speech Recognition
Authors:
Alejandro Gomez-Alanis,
Lukas Drude,
Andreas Schwarz,
Rupak Vignesh Swaminathan,
Simon Wiesler
Abstract:
Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to do an implicit adaptation to the speaker, topic and acoustic environment. Also, we propose a dual-mode contextual-utterance training technique for streaming ASR systems. The proposed approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which is able to see both past and future contextual utterances, to the student, which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40 ms relative, respectively.
Submitted 27 October, 2022;
originally announced October 2022.
-
Multi-channel Opus compression for far-field automatic speech recognition with a fixed bitrate budget
Authors:
Lukas Drude,
Jahn Heymann,
Andreas Schwarz,
Jean-Marc Valin
Abstract:
Automatic speech recognition (ASR) in the cloud allows the use of larger models and more powerful multi-channel signal processing front-ends compared to on-device processing. However, it also adds an inherent latency due to the transmission of the audio signal, especially when transmitting multiple channels of a microphone array. One way to reduce the network bandwidth requirements is client-side compression with a lossy codec such as Opus. However, this compression can have a detrimental effect especially on multi-channel ASR front-ends, due to the distortion and loss of spatial information introduced by the codec. In this publication, we propose an improved approach for the compression of microphone array signals based on Opus, using a modified joint channel coding approach and additionally introducing a multi-channel spatial decorrelating transform to reduce redundancy in the transmission. We illustrate the effect of the proposed approach on the spatial information retained in multi-channel signals after compression, and evaluate the performance on far-field ASR with a multi-channel beamforming front-end. We demonstrate that our approach can lead to a 37.5 % bitrate reduction or a 5.1 % relative word error rate reduction for a fixed bitrate budget in a seven channel setup.
Submitted 15 June, 2021;
originally announced June 2021.
-
Robust level-3 BLAS Inverse Iteration from the Hessenberg Matrix
Authors:
Angelika Schwarz
Abstract:
Inverse iteration is known to be an effective method for computing eigenvectors corresponding to simple and well-separated eigenvalues. In the non-symmetric case, the solution of shifted Hessenberg systems is a central step. Existing inverse iteration solvers approach the solution of the shifted Hessenberg systems with either RQ or LU factorizations and, once factored, solve the corresponding systems. This approach has limited level-3 BLAS potential since distinct shifts have distinct factorizations. This paper rearranges the RQ approach such that data shared between distinct shifts is exposed. Thereby the backward substitution with the triangular R factor can be expressed mostly with matrix-matrix multiplications (level-3 BLAS). The resulting algorithm computes eigenvectors in a tiled, overflow-free, and task-parallel fashion. The numerical experiments show that the new algorithm outperforms existing inverse iteration solvers for the computation of both real and complex eigenvectors.
Submitted 13 January, 2021;
originally announced January 2021.
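For readers unfamiliar with the method named in the abstract above, the following is a minimal textbook sketch of inverse iteration: repeatedly solve the shifted system (A - sigma*I) x = v and renormalize. It uses plain Gaussian elimination on a small dense matrix and real arithmetic only. It is not the tiled, overflow-free, level-3 BLAS algorithm of the paper, and the function names, the slightly perturbed shift, and the test matrix are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):  # backward substitution
        tail = sum(m[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (m[k][n] - tail) / m[k][k]
    return x


def inverse_iteration(a, shift, iterations=5):
    """Approximate the eigenvector of `a` whose eigenvalue is closest to `shift`."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [1.0] * n
    for _ in range(iterations):
        v = solve(shifted, v)
        norm = max(abs(x) for x in v)  # infinity-norm scaling
        v = [x / norm for x in v]
    return v


if __name__ == "__main__":
    # Tridiagonal (hence upper Hessenberg) matrix with eigenvalue 3.
    a = [[2.0, 1.0, 0.0],
         [1.0, 3.0, 1.0],
         [0.0, 1.0, 4.0]]
    # The shift is perturbed slightly so the shifted matrix stays nonsingular.
    v = inverse_iteration(a, shift=2.999)
    print(v)  # proportional to (1, 1, -1), up to sign
```

Each distinct shift requires its own factorization in this naive form, which is exactly the limitation the abstract describes for existing solvers.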
-
Improving RNN-T ASR Accuracy Using Context Audio
Authors:
Andreas Schwarz,
Ilya Sklyar,
Simon Wiesler
Abstract:
We present a training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) which allows the encoder network to learn to exploit context audio from a stream, using segmented or partially labeled sequences of the stream during training. We show that the use of context audio during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting for a voice assistant ASR system. We investigate the effect of the proposed training approach on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. To gain further insight into the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context, we also visualize RNN-T loss gradients with respect to the input.
Submitted 15 June, 2021; v1 submitted 20 November, 2020;
originally announced November 2020.
-
A Neural Network based Shock Detection and Localization Approach for Discontinuous Galerkin Methods
Authors:
Andrea D. Beck,
Jonas Zeifang,
Anna Schwarz,
David G. Flad
Abstract:
The stable and accurate approximation of discontinuities such as shocks on a finite computational mesh is a challenging task. Detection of shocks or strong discontinuities in the flow solution is typically achieved through a priori troubled cell indicators, which guide the subsequent action of an appropriate shock capturing mechanism. Arriving at a stable and accurate solution often requires empirically based parameter tuning and adjustments of the indicator settings to the discretization and solution at hand. In this work, we propose to separate the task of shock detection and shock capturing more strongly and aim to develop a shock indicator that is robust, accurate, requires minimal user input and is suitable for high-order element-based methods like discontinuous Galerkin and flux reconstruction methods. The novel indicator is learned from analytical data through a supervised learning strategy; its input is given by the high-order solution field, its output is an element-local map of the shock position. We use state-of-the-art methods from edge detection in image analysis based on deep convolutional multiscale networks and deep supervision to train the indicators. The resulting networks are then used as black-box indicators, showing their robustness and accuracy on well-established canonical test cases. All simulations are run ab initio using the developed indicators, showing that they also provide stability during the strongly transient phases. In particular for high-order schemes with large cells and considerable inner-cell resolution capabilities, we demonstrate how the additional accurate prediction of the position of the shock front can be exploited to guide inner-element shock capturing strategies.
Submitted 20 January, 2020;
originally announced January 2020.
-
Robust Task-Parallel Solution of the Triangular Sylvester Equation
Authors:
Angelika Schwarz,
Carl Christian Kjelgaard Mikkelsen
Abstract:
The Bartels-Stewart algorithm is a standard approach to solving the dense Sylvester equation. It reduces the problem to the solution of the triangular Sylvester equation. The triangular Sylvester equation is solved with a variant of backward substitution. Backward substitution is prone to overflow. Overflow can be avoided by dynamic scaling of the solution matrix. An algorithm which prevents overflow is said to be robust. The standard library LAPACK contains the robust scalar sequential solver dtrsyl. This paper derives a robust, level-3 BLAS-based task-parallel solver. By adding overflow protection, our robust solver closes the gap between problems solvable by LAPACK and problems solvable by existing non-robust task-parallel solvers. We demonstrate that our robust solver achieves performance similar to that of non-robust solvers.
Submitted 25 May, 2019;
originally announced May 2019.
-
Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments
Authors:
Hendrik Barfuss,
Christian Huemmer,
Andreas Schwarz,
Walter Kellermann
Abstract:
Speech recognition in adverse real-world environments is highly affected by reverberation and nonstationary background noise. A well-known strategy to reduce such undesired signal components in multi-microphone scenarios is spatial filtering of the microphone signals. In this article, we demonstrate that an additional coherence-based postfilter, which is applied to the beamformer output signal to remove diffuse interference components from the latter, is an effective means to further improve the recognition accuracy of modern deep learning speech recognition systems. To this end, the recently updated 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3) baseline speech recognition system is extended by a coherence-based postfilter and the postfilter's impact on the word error rates is investigated for the noisy environments provided by CHiME-3. To determine the time- and frequency-dependent postfilter gains, we use a Direction-of-Arrival (DOA)-dependent and a DOA-independent estimator of the coherent-to-diffuse power ratio as an approximation of the short-time signal-to-noise ratio. Our experiments show that incorporating coherence-based postfiltering into the CHiME-3 baseline speech recognition system leads to a significant reduction of the word error rate scores for the noisy and reverberant environments provided as part of CHiME-3.
Submitted 7 August, 2017; v1 submitted 12 April, 2016;
originally announced April 2016.
-
Robust coherence-based spectral enhancement for distant speech recognition
Authors:
Hendrik Barfuss,
Christian Huemmer,
Andreas Schwarz,
Walter Kellermann
Abstract:
In this contribution to the 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3) we extend the acoustic front-end of the CHiME-3 baseline speech recognition system by a coherence-based Wiener filter which is applied to the output signal of the baseline beamformer. To compute the time- and frequency-dependent postfilter gains, the ratio between direct and diffuse signal components at the output of the baseline beamformer is estimated and used as an approximation of the short-time signal-to-noise ratio. The proposed spectral enhancement technique is evaluated with respect to word error rates of the CHiME-3 challenge baseline speech recognition system using real speech recorded in public environments. Results confirm the effectiveness of the coherence-based postfilter when integrated into the front-end signal enhancement.
Submitted 23 September, 2015;
originally announced September 2015.
-
A model for the temporal evolution of the spatial coherence in decaying reverberant sound fields
Authors:
Sam Nees,
Andreas Schwarz,
Walter Kellermann
Abstract:
Reverberant sound fields are often modeled as isotropic. However, it has been observed that spatial properties change during the decay of the sound field energy, due to non-isotropic attenuation in non-ideal rooms. In this letter, a model for the spatial coherence between two sensors in a decaying reverberant sound field is developed for rectangular rooms. The modeled coherence function depends on room dimensions, surface reflectivity and orientation of the sensor pair, but is independent of the position of source and sensors in the room. The model includes the spherically isotropic (diffuse) and cylindrically isotropic sound field models as special cases.
Submitted 27 July, 2015;
originally announced July 2015.
-
Binaural coherent-to-diffuse-ratio estimation for dereverberation using an ITD model
Authors:
Chengshi Zheng,
Andreas Schwarz,
Walter Kellermann,
Xiaodong Li
Abstract:
Most previously proposed dual-channel coherent-to-diffuse-ratio (CDR) estimators are based on a free-field model. When used for binaural signals, e.g., for dereverberation in binaural hearing aids, their performance may degrade due to the influence of the head, even when the direction-of-arrival of the desired speaker is exactly known. In this paper, the head shadowing effect is taken into account for CDR estimation by using a simplified model for the frequency-dependent interaural time difference and a model for the binaural coherence of the diffuse noise field. Evaluation of CDR-based dereverberation with measured binaural impulse responses indicates that the proposed binaural CDR estimators can improve PESQ scores.
Submitted 11 June, 2015;
originally announced June 2015.
-
Coherent-to-Diffuse Power Ratio Estimation for Dereverberation
Authors:
Andreas Schwarz,
Walter Kellermann
Abstract:
The estimation of the time- and frequency-dependent coherent-to-diffuse power ratio (CDR) from the measured spatial coherence between two omnidirectional microphones is investigated. Known CDR estimators are formulated in a common framework, illustrated using a geometric interpretation in the complex plane, and investigated with respect to bias and robustness towards model errors. Several novel unbiased CDR estimators are proposed, and it is shown that knowledge of either the direction of arrival (DOA) of the target source or the coherence of the noise field is sufficient for unbiased CDR estimation. The validity of the model for the application of CDR estimates to dereverberation is investigated using measured and simulated impulse responses. A CDR-based dereverberation system is presented and evaluated using signal-based quality measures as well as automatic speech recognition accuracy. The results show that the proposed unbiased estimators have a practical advantage over existing estimators, and that the proposed DOA-independent estimator can be used for effective blind dereverberation.
Submitted 13 February, 2015; v1 submitted 12 February, 2015;
originally announced February 2015.
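To make the quantity in the abstract above concrete, the sketch below builds the usual two-microphone signal model and inverts it. It assumes a spherically isotropic (diffuse) noise field, whose coherence between two omnidirectional microphones is sin(x)/x with x = 2*pi*f*d/c, a fully coherent target with a pure phase coherence, and a mixture coherence that is the CDR-weighted average of the two. The function names, the phase sign convention, and the example values are assumptions. The inversion shown is the idealized model relationship, not any of the specific estimators proposed in the paper, which have to cope with estimation errors in the measured coherence.

```python
import cmath
import math


def diffuse_coherence(freq_hz, mic_distance_m, speed_of_sound=343.0):
    """Coherence of a spherically isotropic noise field: sin(x) / x."""
    x = 2.0 * math.pi * freq_hz * mic_distance_m / speed_of_sound
    return 1.0 if x == 0.0 else math.sin(x) / x


def direct_coherence(freq_hz, tdoa_s):
    """Coherence of a fully coherent source with the given time difference of arrival."""
    return cmath.exp(1j * 2.0 * math.pi * freq_hz * tdoa_s)


def mixed_coherence(cdr, gamma_s, gamma_n):
    """Model: coherence of the mixture as a CDR-weighted average."""
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)


def cdr_from_coherence(gamma_x, gamma_s, gamma_n):
    """Invert the mixture model; undefined when gamma_x equals gamma_s (infinite CDR)."""
    return ((gamma_n - gamma_x) / (gamma_x - gamma_s)).real


if __name__ == "__main__":
    # Hypothetical setup: 8 cm spacing, 1 kHz, 0.1 ms time difference of arrival.
    gamma_n = diffuse_coherence(1000.0, 0.08)
    gamma_s = direct_coherence(1000.0, 1e-4)
    gamma_x = mixed_coherence(2.0, gamma_s, gamma_n)
    print(cdr_from_coherence(gamma_x, gamma_s, gamma_n))  # recovers a CDR close to 2
```

With an exact mixture coherence the ratio is real and equals the CDR; with a coherence estimated from noisy data it generally is not, which is why bias and robustness of the estimators are the subject of the paper.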
-
Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
Authors:
Andreas Schwarz,
Christian Huemmer,
Roland Maas,
Walter Kellermann
Abstract:
We propose a spatial diffuseness feature for deep neural network (DNN)-based automatic speech recognition to improve recognition accuracy in reverberant and noisy environments. The feature is computed in real-time from multiple microphone signals without requiring knowledge or estimation of the direction of arrival, and represents the relative amount of diffuse noise in each time and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model leads to a reduced word error rate for the REVERB challenge corpus, both compared to logmelspec features extracted from noisy signals, and features enhanced by spectral subtraction.
Submitted 16 February, 2015; v1 submitted 9 October, 2014;
originally announced October 2014.