-
The Nah Bandit: Modeling User Non-compliance in Recommendation Systems
Authors:
Tianyue Zhou,
Jung-Hoon Cho,
Cathy Wu
Abstract:
Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendation if it is not to her liking, and to fall back to her baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether. This paper thus introduces the Nah Bandit, a tongue-in-cheek reference to describe a Bandit problem where users can say `nah' to the recommendation and opt for their preferred option instead. As such, this problem lies in between a typical bandit setup and supervised learning. We model the user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT)$, achieving superior theoretical performance in the short term compared to the LinUCB algorithm. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches. This advancement reveals that effective use of non-compliance feedback can accelerate preference learning and improve recommendation accuracy. This work lays the foundation for future research in Nah Bandit, providing a robust framework for more effective recommendation systems.
Submitted 14 August, 2024;
originally announced August 2024.
-
A Unified Framework for Synthesizing Multisequence Brain MRI via Hybrid Fusion
Authors:
Jihoon Cho,
Jonghye Woo,
Jinah Park
Abstract:
Multisequence Magnetic Resonance Imaging (MRI) provides a reliable diagnosis in clinical applications through complementary information within sequences. However, in practice, the absence of certain MR sequences is a common problem that can lead to inconsistent analysis results. In this work, we propose a novel unified framework for synthesizing multisequence MR images, called Hybrid Fusion GAN (HF-GAN). We introduce a hybrid fusion encoder designed to ensure the disentangled extraction of complementary and modality-specific information, along with a channel attention-based feature fusion module that integrates the features into a common latent space handling the complexity from combinations of accessible MR sequences. Common feature representations are transformed into a target latent space via the modality infuser to synthesize missing MR sequences. We have performed experiments on multisequence brain MRI datasets from healthy individuals and patients diagnosed with brain tumors. Experimental results show that our method outperforms state-of-the-art methods in both quantitative and qualitative comparisons. In addition, a detailed analysis of our framework demonstrates the superiority of our designed modules and their effectiveness for use in data imputation tasks.
Submitted 21 June, 2024;
originally announced June 2024.
-
Articulatory Encodec: Coding Speech through Vocal Tract Kinematics
Authors:
Cheol Jun Cho,
Peter Wu,
Tejas S. Prabhune,
Dhruv Agarwal,
Gopala K. Anumanchipalli
Abstract:
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- Articulatory Encodec. Articulatory Encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
Submitted 20 August, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer
Authors:
Keon Lee,
Dong Won Kim,
Jaehyeon Kim,
Jaewoong Cho
Abstract:
Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
Submitted 17 June, 2024;
originally announced June 2024.
-
Learned Pulse Shaping Design for PAPR Reduction in DFT-s-OFDM
Authors:
Fabrizio Carpi,
Soheil Rostami,
Joonyoung Cho,
Siddharth Garg,
Elza Erkip,
Charlie Jianzhong Zhang
Abstract:
High peak-to-average power ratio (PAPR) is one of the main factors limiting cell coverage for cellular systems, especially in the uplink direction. Discrete Fourier transform spread orthogonal frequency-domain multiplexing (DFT-s-OFDM) with spectrally-extended frequency-domain spectrum shaping (FDSS) is one of the efficient techniques deployed to lower the PAPR of the uplink waveforms. In this work, we propose a machine learning-based framework to determine the FDSS filter, optimizing a tradeoff between the symbol error rate (SER), the PAPR, and the spectral flatness requirements. Our end-to-end optimization framework considers multiple important design constraints, including the Nyquist zero-ISI (inter-symbol interference) condition. The numerical results show that learned FDSS filters lower the PAPR compared to conventional baselines, with minimal SER degradation. Tuning the parameters of the optimization also helps us understand the fundamental limitations and characteristics of the FDSS filters for PAPR reduction.
Submitted 24 April, 2024;
originally announced April 2024.
-
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
Authors:
Jaehyeon Kim,
Keon Lee,
Seungjun Chung,
Jaewoong Cho
Abstract:
With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances.
Submitted 3 April, 2024;
originally announced April 2024.
-
Disentangled Multimodal Brain MR Image Translation via Transformer-based Modality Infuser
Authors:
Jihoon Cho,
Xiaofeng Liu,
Fangxu Xing,
Jinsong Ouyang,
Georges El Fakhri,
Jinah Park,
Jonghye Woo
Abstract:
Multimodal Magnetic Resonance (MR) Imaging plays a crucial role in disease diagnosis due to its ability to provide complementary information by analyzing a relationship between multimodal images on the same subject. Acquiring all MR modalities, however, can be expensive, and, during a scanning session, certain MR images may be missed depending on the study protocol. The typical solution would be to synthesize the missing modalities from the acquired images such as using generative adversarial networks (GANs). Yet, GANs constructed with convolutional neural networks (CNNs) are likely to suffer from a lack of global relationships and mechanisms to condition the desired modality. To address this, in this work, we propose a transformer-based modality infuser designed to synthesize multimodal brain MR images. In our method, we extract modality-agnostic features from the encoder and then transform them into modality-specific features using the modality infuser. Furthermore, the modality infuser captures long-range relationships among all brain structures, leading to the generation of more realistic images. We carried out experiments on the BraTS 2018 dataset, translating between four MR modalities, and our experimental results demonstrate the superiority of our proposed method in terms of synthesis quality. In addition, we conducted experiments on a brain tumor segmentation task and different conditioning methods.
Submitted 1 February, 2024;
originally announced February 2024.
-
NLCG-Net: A Model-Based Zero-Shot Learning Framework for Undersampled Quantitative MRI Reconstruction
Authors:
Xinrui Jiang,
Yohan Jun,
Jaejin Cho,
Mengze Gao,
Xingwang Yong,
Berkin Bilgic
Abstract:
Typical quantitative MRI (qMRI) methods estimate parameter maps after image reconstruction, which is prone to biases and error propagation. We propose a Nonlinear Conjugate Gradient (NLCG) optimizer for model-based T2/T1 estimation, which incorporates U-Net regularization trained in a scan-specific manner. This end-to-end method directly estimates qMRI maps from undersampled k-space data using mono-exponential signal modeling with zero-shot scan-specific neural network regularization to enable high-fidelity T1 and T2 mapping. T2 and T1 mapping results demonstrate the ability of the proposed NLCG-Net to improve estimation quality compared to subspace reconstruction at high accelerations.
Submitted 22 January, 2024;
originally announced January 2024.
-
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Authors:
Jiachen Lian,
Carly Feng,
Naasir Farooqi,
Steve Li,
Anshul Kashyap,
Cheol Jun Cho,
Peter Wu,
Robbie Netzorg,
Tingle Li,
Gopala Krishna Anumanchipalli
Abstract:
Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word-level and phonetic-level. However, current research in dysfluency modeling primarily focuses on either transcription or detection, and the performance of each aspect remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.
Submitted 20 December, 2023;
originally announced December 2023.
-
Temporal Transfer Learning for Traffic Optimization with Coarse-grained Advisory Autonomy
Authors:
Jung-Hoon Cho,
Sirui Li,
Jeongyun Kim,
Cathy Wu
Abstract:
The recent development of connected and automated vehicle (CAV) technologies has spurred investigations to optimize dense urban traffic to maximize vehicle speed and throughput. This paper explores advisory autonomy, in which real-time driving advisories are issued to the human drivers, thus achieving near-term performance of automated vehicles. Due to the complexity of traffic systems, recent studies of coordinating CAVs have resorted to leveraging deep reinforcement learning (RL). Coarse-grained advisory is formalized as zero-order holds, and we consider a range of hold duration from 0.1 to 40 seconds. However, despite the similarity of the higher frequency tasks on CAVs, a direct application of deep RL fails to be generalized to advisory autonomy tasks. To overcome this, we utilize zero-shot transfer, training policies on a set of source tasks--specific traffic scenarios with designated hold durations--and then evaluating the efficacy of these policies on different target tasks. We introduce Temporal Transfer Learning (TTL) algorithms to select source tasks for zero-shot transfer, systematically leveraging the temporal structure to solve the full range of tasks. TTL selects the most suitable source tasks to maximize the performance of the range of tasks. We validate our algorithms on diverse mixed-traffic scenarios, demonstrating that TTL more reliably solves the tasks than baselines. This paper underscores the potential of coarse-grained advisory autonomy with TTL in traffic flow optimization.
Submitted 1 August, 2024; v1 submitted 27 November, 2023;
originally announced December 2023.
-
Incentive Design for Eco-driving in Urban Transportation Networks
Authors:
M. Umar B. Niazi,
Jung-Hoon Cho,
Munther A. Dahleh,
Roy Dong,
Cathy Wu
Abstract:
Eco-driving emerges as a cost-effective and efficient strategy to mitigate greenhouse gas emissions in urban transportation networks. Acknowledging the persuasive influence of incentives in shaping driver behavior, this paper presents the `eco-planner,' a digital platform devised to promote eco-driving practices in urban transportation. At the outset of their trips, users provide the platform with their trip details and travel time preferences, enabling the eco-planner to formulate personalized eco-driving recommendations and corresponding incentives, while adhering to its budgetary constraints. Upon trip completion, incentives are transferred to users who comply with the recommendations and effectively reduce their emissions. By comparing our proposed incentive mechanism with a baseline scheme that offers uniform incentives to all users, we demonstrate that our approach achieves superior emission reductions and increased user compliance with a smaller budget.
Submitted 16 May, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Hybrid-Fusion Transformer for Multisequence MRI
Authors:
Jihoon Cho,
Jinah Park
Abstract:
Medical segmentation has grown exponentially through the advent of a fully convolutional network (FCN), and we have now reached a turning point through the success of Transformer. However, the different characteristics of the modality have not been fully integrated into Transformer for medical segmentation. In this work, we propose the novel hybrid fusion Transformer (HFTrans) for multisequence MRI image segmentation. We take advantage of the differences among multimodal MRI sequences and utilize the Transformer layers to integrate the features extracted from each modality as well as the features of the early fused modalities. We validate the effectiveness of our hybrid-fusion method in three-dimensional (3D) medical segmentation. Experiments on two public datasets, BraTS2020 and MRBrainS18, show that the proposed method outperforms previous state-of-the-art methods on the task of brain tumor segmentation and brain structure segmentation.
Submitted 2 November, 2023;
originally announced November 2023.
-
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Authors:
Cheol Jun Cho,
Abdelrahman Mohamed,
Shang-Wen Li,
Alan W Black,
Gopala K. Anumanchipalli
Abstract:
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space and the units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt a "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames exhibit salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. When compared to previous models, our model outperforms in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.
Submitted 16 January, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
Authors:
Cheol Jun Cho,
Abdelrahman Mohamed,
Alan W Black,
Gopala K. Anumanchipalli
Abstract:
Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained black boxes, but many recent studies have begun "probing" models like HuBERT, to correlate their internal representations to different aspects of speech. In this paper, we show "inference of articulatory kinematics" as a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction is largely overlapping across the language of the data used to train the model, with a preference for languages with similar phonological systems. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferable across speakers, even across genders, languages, and dialects, showing the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues into language-agnostic universal models for speech engineering that are interpretable and grounded in speech science.
Submitted 16 January, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Addressing Feature Imbalance in Sound Source Separation
Authors:
Jaechang Kim,
Jeongyeon Hwang,
Soheun Yi,
Jaewoong Cho,
Jungseul Ok
Abstract:
Neural networks often suffer from a feature preference problem, where they tend to overly rely on specific features to solve a task while disregarding other features, even if those neglected features are essential for the task. Feature preference problems have primarily been investigated in classification tasks. However, we observe that feature preference also occurs in high-dimensional regression tasks, specifically, source separation. To mitigate feature preference in source separation, we propose FEAture BAlancing by Suppressing Easy feature (FEABASE). This approach enables efficient data utilization by learning hidden information about the neglected feature. We evaluate our method in a multi-channel source separation task, where feature preference between the spatial feature and the timbre feature appears.
Submitted 4 October, 2023; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs
Authors:
Abrar Majeedi,
Babak Naderi,
Yasaman Hosseinkashi,
Juhee Cho,
Ruben Alvarez Martinez,
Ross Cutler
Abstract:
Machine learning-based video codecs have made significant progress in the past few years. A critical area in the development of ML-based video codecs is an accurate evaluation metric that does not require an expensive and slow subjective test. We show that existing evaluation metrics that were designed and trained on DSP-based video codecs are not highly correlated to subjective opinion when used with ML video codecs due to the video artifacts being quite different between ML and video codecs. We provide a new dataset of ML video codec videos that have been accurately labeled for quality. We also propose a new full reference video quality assessment (FRVQA) model that achieves a Pearson Correlation Coefficient (PCC) of 0.99 and a Spearman's Rank Correlation Coefficient (SRCC) of 0.99 at the model level. We make the dataset and FRVQA model open source to help accelerate research in ML video codecs, and so that others can further improve the FRVQA model.
Submitted 1 September, 2023;
originally announced September 2023.
-
Neural Latent Aligner: Cross-trial Alignment for Learning Representations of Complex, Naturalistic Neural Data
Authors:
Cheol Jun Cho,
Edward F. Chang,
Gopala K. Anumanchipalli
Abstract:
Understanding the neural implementation of complex human behaviors is one of the major goals in neuroscience. To this end, it is crucial to find a true representation of the neural data, which is challenging due to the high complexity of behaviors and the low signal-to-noise ratio (SNR) of the signals. Here, we propose a novel unsupervised learning framework, Neural Latent Aligner (NLA), to find well-constrained, behaviorally relevant neural representations of complex behaviors. The key idea is to align representations across repeated trials to learn cross-trial consistent information. Furthermore, we propose a novel, fully differentiable time warping model (TWM) to resolve the temporal misalignment of trials. When applied to intracranial electrocorticography (ECoG) of natural speaking, our model learns better representations for decoding behaviors than the baseline models, especially in lower dimensional space. The TWM is empirically validated by measuring behavioral coherence between aligned trials. The proposed framework learns more cross-trial consistent representations than the baselines, and when visualized, the manifold reveals shared neural trajectories across trials.
Submitted 11 August, 2023;
originally announced August 2023.
-
Improved Multi-Shot Diffusion-Weighted MRI with Zero-Shot Self-Supervised Learning Reconstruction
Authors:
Jaejin Cho,
Yohan Jun,
Xiaoqing Wang,
Caique Kobayashi,
Berkin Bilgic
Abstract:
Diffusion MRI is commonly performed using echo-planar imaging (EPI) due to its rapid acquisition time. However, the resolution of diffusion-weighted images is often limited by magnetic field inhomogeneity-related artifacts and blurring induced by T2- and T2*-relaxation effects. To address these limitations, multi-shot EPI (msEPI) combined with parallel imaging techniques is frequently employed. Nevertheless, reconstructing msEPI can be challenging due to phase variation between multiple shots. In this study, we introduce a novel msEPI reconstruction approach called zero-MIRID (zero-shot self-supervised learning of Multi-shot Image Reconstruction for Improved Diffusion MRI). This method jointly reconstructs msEPI data by incorporating deep learning-based image regularization techniques. The network incorporates CNN denoisers in both k- and image-spaces, while leveraging virtual coils to enhance image reconstruction conditioning. By employing a self-supervised learning technique and dividing sampled data into three groups, the proposed approach achieves superior results compared to the state-of-the-art parallel imaging method, as demonstrated in an in-vivo experiment.
Submitted 22 September, 2023; v1 submitted 9 August, 2023;
originally announced August 2023.
-
EchoVest: Real-Time Sound Classification and Depth Perception Expressed through Transcutaneous Electrical Nerve Stimulation
Authors:
Jesse Choe,
Siddhant Sood,
Ryan Park
Abstract:
Over 1.5 billion people worldwide live with hearing impairment. Despite various technologies that have been created for individuals with such disabilities, most of these technologies are either extremely expensive or inaccessible for everyday use in low-medium income countries. In order to combat this issue, we have developed a new assistive device, EchoVest, for blind/deaf people to intuitively become more aware of their environment. EchoVest transmits vibrations to the user's body by utilizing transcutaneous electric nerve stimulation (TENS) based on the source of the sounds. EchoVest also provides various features, including sound localization, sound classification, noise reduction, and depth perception. We aimed to outperform CNN-based machine-learning models, the most commonly used machine learning model for classification tasks, in accuracy and computational costs. To do so, we developed and employed a novel audio pipeline that adapts the Audio Spectrogram Transformer (AST) model, an attention-based model, for our sound classification purposes, and Fast Fourier Transforms for noise reduction. The application of Otsu's Method helped us find the optimal thresholds for background noise sound filtering and gave us much greater accuracy. In order to calculate direction and depth accurately, we applied Complex Time Difference of Arrival algorithms and SOTA localization. Our last improvement was to use blind source separation to make our algorithms applicable to multiple microphone inputs. The final algorithm achieved state-of-the-art results on numerous checkpoints, including a 95.7% accuracy on the ESC-50 dataset for environmental sound classification.
Submitted 10 July, 2023;
originally announced July 2023.
-
Zero-DeepSub: Zero-Shot Deep Subspace Reconstruction for Rapid Multiparametric Quantitative MRI Using 3D-QALAS
Authors:
Yohan Jun,
Yamin Arefeen,
Jaejin Cho,
Shohei Fujita,
Xiaoqing Wang,
P. Ellen Grant,
Borjan Gagoski,
Camilo Jaimes,
Michael S. Gee,
Berkin Bilgic
Abstract:
Purpose: To develop and evaluate methods for 1) reconstructing 3D-quantification using an interleaved Look-Locker acquisition sequence with T2 preparation pulse (3D-QALAS) time-series images using a low-rank subspace method, which enables accurate and rapid T1 and T2 mapping, and 2) improving the fidelity of subspace QALAS by combining scan-specific deep-learning-based reconstruction and subspace modeling. Methods: A low-rank subspace method for 3D-QALAS (i.e., subspace QALAS) and zero-shot deep-learning subspace method (i.e., Zero-DeepSub) were proposed for rapid and high fidelity T1 and T2 mapping and time-resolved imaging using 3D-QALAS. Using an ISMRM/NIST system phantom, the accuracy and reproducibility of the T1 and T2 maps estimated using the proposed methods were evaluated by comparing them with reference techniques. The reconstruction performance of the proposed subspace QALAS using Zero-DeepSub was evaluated in vivo and compared with conventional QALAS at high reduction factors of up to 9-fold. Results: Phantom experiments showed that subspace QALAS had good linearity with respect to the reference methods while reducing biases and improving precision compared to conventional QALAS, especially for T2 maps. Moreover, in vivo results demonstrated that subspace QALAS had better g-factor maps and could reduce voxel blurring, noise, and artifacts compared to conventional QALAS and showed robust performance at up to 9-fold acceleration with Zero-DeepSub, which enabled whole-brain T1, T2, and PD mapping at 1 mm isotropic resolution within 2 min of scan time. Conclusion: The proposed subspace QALAS along with Zero-DeepSub enabled high fidelity and rapid whole-brain multiparametric quantification and time-resolved imaging.
Submitted 23 January, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI
Authors:
Ahmed W. Moawad,
Anastasia Janas,
Ujjwal Baid,
Divya Ramakrishnan,
Rachit Saluja,
Nader Ashraf,
Leon Jekel,
Raisa Amiruddin,
Maruf Adewole,
Jake Albrecht,
Udunna Anazodo,
Sanjay Aneja,
Syed Muhammad Anwar,
Timothy Bergquist,
Evan Calabrese,
Veronica Chiang,
Verena Chung,
Gian Marco Marco Conte,
Farouk Dako,
James Eddy,
Ivan Ezhov,
Ariana Familiar,
Keyvan Farahani,
Juan Eugenio Iglesias,
Zhifan Jiang
, et al. (206 additional authors not shown)
Abstract:
The translation of AI-generated brain metastases (BM) segmentation into clinical practice relies heavily on diverse, high-quality annotated medical imaging datasets. The BraTS-METS 2023 challenge has gained momentum for testing and benchmarking algorithms using rigorously annotated internationally compiled real-world datasets. This study presents the results of the segmentation challenge and characterizes the challenging cases that impacted the performance of the winning algorithms. Untreated brain metastases on standard anatomic MRI sequences (T1, T2, FLAIR, T1PG) from eight contributed international datasets were annotated in a stepwise manner: published UNET algorithms, student, neuroradiologist, final approver neuroradiologist. Segmentations were ranked based on lesion-wise Dice and Hausdorff distance (HD95) scores. False positives (FP) and false negatives (FN) were rigorously penalized, receiving a score of 0 for Dice and a fixed penalty of 374 for HD95. Eight datasets comprising 1303 studies were annotated, with 402 studies (3076 lesions) released on Synapse as publicly available datasets to challenge competitors. Additionally, 31 studies (139 lesions) were held out for validation, and 59 studies (218 lesions) were used for testing. Segmentation accuracy was measured as rank across subjects, with the winning team achieving a LesionWise mean score of 7.9. Common errors among the leading teams included false negatives for small lesions and misregistration of masks in space. The BraTS-METS 2023 challenge successfully curated well-annotated, diverse datasets and identified common errors, facilitating the translation of BM segmentation across varied clinical environments and providing personalized volumetric reports to patients undergoing BM treatment.
Submitted 17 June, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Measurement-based Close-in Path Loss Modeling with Diffraction for Rural Long-distance Communications
Authors:
Jaedon Park,
Hong-Bae Jeon,
Jungho Cho,
Chan-Byoung Chae
Abstract:
In this letter, we investigate rural large-scale path loss models based on measurements taken in spring in a rural area of central South Korea. In particular, we develop new close-in (CI) path loss models incorporating a diffraction component. The transmitter used in the measurement system is located on a hill and utilizes omnidirectional antennas operating at 1400 and 2250 MHz. The receiver is also equipped with omnidirectional antennas and measures at a total of 3,858 positions (1,262 for LOS and 2,596 for NLOS) at 1400 MHz and 4,957 positions (1,427 for LOS and 3,530 for NLOS) at 2250 MHz. This research demonstrates that the newly developed CI path loss models incorporating a diffraction component significantly reduce standard deviations (STD) and are independent of frequency, especially for LOS beyond the first meter of propagation, making them suitable for use with frequencies up to the millimeter-wave range.
△ Less
Submitted 1 May, 2023;
originally announced May 2023.
-
Perspective Projection-Based 3D CT Reconstruction from Biplanar X-rays
Authors:
Daeun Kyung,
Kyungmin Jo,
Jaegul Choo,
Joonseok Lee,
Edward Choi
Abstract:
X-ray computed tomography (CT) is one of the most common imaging techniques used to diagnose various diseases in the medical field. Its high contrast sensitivity and spatial resolution allow the physician to observe details of body parts such as bones, soft tissue, blood vessels, etc. As it involves potentially harmful radiation exposure to patients and surgeons, however, reconstructing a 3D CT volume from perpendicular 2D X-ray images is considered a promising alternative, thanks to its lower radiation risk and better accessibility. This is highly challenging though, since it requires reconstruction of 3D anatomical information from 2D images with limited views, where all the information is overlapped. In this paper, we propose PerX2CT, a novel CT reconstruction framework from X-ray that reflects the perspective projection scheme. Our proposed method provides a different combination of features for each coordinate, which implicitly allows the model to obtain information about the 3D location. We reveal the potential to reconstruct the selected part of CT with high resolution by properly using the coordinate-wise local and global features. Our approach shows potential for use in clinical applications with low computational complexity and fast inference time, demonstrating superior performance to baselines in multiple evaluation metrics.
Submitted 9 March, 2023;
originally announced March 2023.
-
SSL-QALAS: Self-Supervised Learning for Rapid Multiparameter Estimation in Quantitative MRI Using 3D-QALAS
Authors:
Yohan Jun,
Jaejin Cho,
Xiaoqing Wang,
Michael Gee,
P. Ellen Grant,
Berkin Bilgic,
Borjan Gagoski
Abstract:
Purpose: To develop and evaluate a method for rapid estimation of multiparametric T1, T2, proton density (PD), and inversion efficiency (IE) maps from 3D-quantification using an interleaved Look-Locker acquisition sequence with T2 preparation pulse (3D-QALAS) measurements using self-supervised learning (SSL) without the need for an external dictionary. Methods: An SSL-based QALAS mapping method (SSL-QALAS) was developed for rapid and dictionary-free estimation of multiparametric maps from 3D-QALAS measurements. The accuracy of the reconstructed quantitative maps using dictionary matching and SSL-QALAS was evaluated by comparing the estimated T1 and T2 values with those obtained from the reference methods on an ISMRM/NIST phantom. The SSL-QALAS and the dictionary matching methods were also compared in vivo, and generalizability was evaluated by comparing the scan-specific, pre-trained, and transfer learning models. Results: Phantom experiments showed that both the dictionary matching and SSL-QALAS methods produced T1 and T2 estimates that had a strong linear agreement with the reference values in the ISMRM/NIST phantom. Further, SSL-QALAS showed similar performance to dictionary matching in reconstructing the T1, T2, PD, and IE maps on in vivo data. Rapid reconstruction of multiparametric maps was enabled by inferring the data using a pre-trained SSL-QALAS model within 10 s. Fast scan-specific tuning was also demonstrated by fine-tuning the pre-trained model with the target subject's data within 15 min. Conclusion: The proposed SSL-QALAS method enabled rapid reconstruction of multiparametric maps from 3D-QALAS measurements without an external dictionary or labeled ground-truth training data.
Submitted 23 January, 2024; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Speaker-Independent Acoustic-to-Articulatory Speech Inversion
Authors:
Peter Wu,
Li-Wei Chen,
Cheol Jun Cho,
Shinji Watanabe,
Louis Goldstein,
Alan W Black,
Gopala K. Anumanchipalli
Abstract:
To build speech processing methods that can handle speech as naturally as humans, researchers have explored multiple ways of building an invertible mapping from speech to an interpretable space. The articulatory space is a promising inversion target, since this space captures the mechanics of speech production. To this end, we build an acoustic-to-articulatory inversion (AAI) model that leverages self-supervision to generalize to unseen speakers. Our approach obtains 0.784 correlation on an electromagnetic articulography (EMA) dataset, improving the state-of-the-art by 12.5%. Additionally, we show the interpretability of these representations through directly comparing the behavior of estimated representations with speech production behavior. Finally, we propose a resynthesis-based AAI evaluation metric that does not rely on articulatory labels, demonstrating its efficacy with an 18-speaker dataset.
Submitted 24 July, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Hybrid Paradigm-based Brain-Computer Interface for Robotic Arm Control
Authors:
Byeong-Hoo Lee,
Jeong-Hyun Cho,
Byung-Hee Kwon
Abstract:
Brain-computer interface (BCI) uses brain signals to communicate with external devices without actual control. In particular, BCI is one of the interfaces for controlling a robotic arm. In this study, we propose a knowledge distillation-based framework to manipulate a robotic arm through hybrid paradigm-induced EEG signals for practical use. The teacher model is designed to decode input data hierarchically and transfer knowledge to the student model. To this end, soft labels and distillation loss functions are applied to the student model training. According to experimental results, the student model achieved the best performance among the singular architecture-based methods. This confirms that hierarchical models and knowledge distillation can improve the performance of a simple architecture. Since it is uncertain what knowledge is transferred, it is important to clarify this part in future studies.
Submitted 14 December, 2022;
originally announced December 2022.
-
Target-centered Subject Transfer Framework for EEG Data Augmentation
Authors:
Kang Yin,
Byeong-Hoo Lee,
Byoung-Hee Kwon,
Jeong-Hyun Cho
Abstract:
Data augmentation approaches are widely explored for the enhancement of decoding electroencephalogram signals. In subject-independent brain-computer interface systems, domain adaptation and generalization are utilized to shift source subjects' data distribution to match the target subject as an augmentation. However, previous works either introduce noise (e.g., by noise addition or generation with random noise) or modify target data, and thus cannot depict the target data distribution well and hinder further analysis. In this paper, we propose a target-centered subject transfer framework as a data augmentation approach. A subset of source data is first constructed to maximize the source-target relevance. Then, the generative model is applied to transfer the data to the target domain. The proposed framework enriches the explainability of the target domain by adding extra real data instead of noise. It shows superior performance compared with other data augmentation methods. Extensive experiments are conducted to verify the effectiveness and robustness of our approach as a promising tool for further research.
Submitted 23 November, 2022;
originally announced December 2022.
-
3D-EPI Blip-Up/Down Acquisition (BUDA) with CAIPI and Joint Hankel Structured Low-Rank Reconstruction for Rapid Distortion-Free High-Resolution T2* Mapping
Authors:
Zhifeng Chen,
Congyu Liao,
Xiaozhi Cao,
Benedikt A. Poser,
Zhongbiao Xu,
Wei-Ching Lo,
Manyi Wen,
Jaejin Cho,
Qiyuan Tian,
Yaohui Wang,
Yanqiu Feng,
Ling Xia,
Wufan Chen,
Feng Liu,
Berkin Bilgic
Abstract:
Purpose: This work aims to develop a novel distortion-free 3D-EPI acquisition and image reconstruction technique for fast and robust, high-resolution, whole-brain imaging as well as quantitative T2* mapping. Methods: 3D-Blip-Up and -Down Acquisition (3D-BUDA) sequence is designed for both single- and multi-echo 3D GRE-EPI imaging using multiple shots with blip-up and -down readouts to encode B0 field map information. Complementary k-space coverage is achieved using controlled aliasing in parallel imaging (CAIPI) sampling across the shots. For image reconstruction, an iterative hard-thresholding algorithm is employed to minimize the cost function that combines field map information informed parallel imaging with the structured low-rank constraint for multi-shot 3D-BUDA data. Extending 3D-BUDA to multi-echo imaging permits T2* mapping. For this, we propose constructing a joint Hankel matrix along both echo and shot dimensions to improve the reconstruction. Results: Experimental results on in vivo multi-echo data demonstrate that, by performing joint reconstruction along with both echo and shot dimensions, reconstruction accuracy is improved compared to standard 3D-BUDA reconstruction. CAIPI sampling is further shown to enhance the image quality. For T2* mapping, T2* values from 3D-Joint-CAIPI-BUDA and reference multi-echo GRE are within limits of agreement as quantified by Bland-Altman analysis. Conclusions: The proposed technique enables rapid 3D distortion-free high-resolution imaging and T2* mapping. Specifically, 3D-BUDA enables 1-mm isotropic whole-brain imaging in 22 s at 3 T and 9 s on a 7 T scanner. The combination of multi-echo 3D-BUDA with CAIPI acquisition and joint reconstruction enables distortion-free whole-brain T2* mapping in 47 s at 1.1x1.1x1.0 mm3 resolution.
Submitted 1 December, 2022;
originally announced December 2022.
-
Time-efficient, High Resolution 3T Whole Brain Quantitative Relaxometry using 3D-QALAS with Wave-CAIPI Readouts
Authors:
Jaejin Cho,
Borjan Gagoski,
Tae Hyung Kim,
Fuyixue Wang,
Daniel Nico Splitthoff,
Wei-Ching Lo,
Wei Liu,
Daniel Polak,
Stephen Cauley,
Kawin Setsompop,
P. Ellen Grant,
Berkin Bilgic
Abstract:
Purpose: Volumetric, high-resolution, quantitative mapping of brain tissue relaxation properties is hindered by long acquisition times and signal-to-noise ratio (SNR) challenges. This study, for the first time, combines the time-efficient wave-CAIPI readouts into the 3D-quantification using an interleaved Look-Locker acquisition sequence with a T2 preparation pulse (3D-QALAS) acquisition scheme, enabling full brain quantitative T1, T2 and proton density (PD) maps at 1.15 mm3 isotropic voxels in only 3 minutes. Methods: Wave-CAIPI readouts were embedded in the standard 3D-QALAS encoding scheme, enabling full brain quantitative parameter maps (T1, T2, and PD) at acceleration factors of R=3x2 with minimum SNR loss due to g-factor penalties. The quantitative parameter maps were estimated using a dictionary-based mapping algorithm incorporating inversion efficiency and B1 field inhomogeneity. The quantitative maps using the accelerated protocol were quantitatively compared against those obtained from the conventional 3D-QALAS sequence using GRAPPA acceleration of R=2 in the ISMRM/NIST phantom and ten healthy volunteers. Results: When tested in both the ISMRM/NIST phantom and ten healthy volunteers, the quantitative maps using the accelerated protocol showed excellent agreement with those obtained from conventional 3D-QALAS at RGRAPPA=2. Conclusion: 3D-QALAS enhanced with wave-CAIPI readouts enables time-efficient, full brain quantitative T1, T2, and PD mapping at 1.15 mm3 in 3 minutes at R=3x2 acceleration. When tested on the NIST phantom and ten healthy volunteers, the quantitative maps obtained from the accelerated wave-CAIPI 3D-QALAS protocol showed very similar values to those obtained from the standard 3D-QALAS (R=2) protocol, indicating the robustness and reliability of the proposed methods.
Submitted 27 January, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech
Authors:
Cheol Jun Cho,
Peter Wu,
Abdelrahman Mohamed,
Gopala K. Anumanchipalli
Abstract:
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech, which can readily be utilized by diverse downstream tasks. To understand such utilities, various analyses have been done for speech SSL models to reveal what information is encoded in the learned representations and how. Although the scope of previous analyses is extensive in acoustic, phonetic, and semantic perspectives, the physical grounding by speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach where we measure articulatory score as an average correlation of linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further layer-wise analyses on the two most successful models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from the recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes are sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
Submitted 20 July, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Enemy Spotted: in-game gun sound dataset for gunshot classification and localization
Authors:
Junwoo Park,
Youngwoo Cho,
Gyuhyeon Sim,
Hojoon Lee,
Jaegul Choo
Abstract:
Recently, deep learning-based methods have drawn huge attention because they are simple yet achieve high performance without domain knowledge in sound classification and localization tasks. However, a lack of gun sounds in existing datasets has been a major obstacle to implementing a support system to spot criminals from their gunshots by leveraging deep learning models. Since gunshots are rare and unpredictable, it is impractical to collect gun sounds in the real world. As an alternative, gun sounds can be obtained from an FPS game that is designed to mimic real-world warfare. Recent FPS games offer a realistic environment where we can safely collect gunshot data while simulating even dangerous situations. By exploiting the advantage of the game environment, we construct a gunshot dataset, namely BGG, for the firearm classification and gunshot localization tasks. The BGG dataset consists of 37 different types of firearms, distances, and directions between the sound source and a receiver. We carefully verify that the in-game gunshot data has sufficient information to identify the location and type of gunshots by training several sound classification and localization baselines on the BGG dataset. Afterward, we demonstrate that the accuracy of real-world firearm classification and localization tasks can be enhanced by utilizing the BGG dataset.
Submitted 16 February, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech
Authors:
Jaejin Cho,
Jesús Villalba,
Laureano Moro-Velazquez,
Najim Dehak
Abstract:
In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, this does not ensure that all the negative samples belong to classes different from the anchor class without labels. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to x-vector trained in a supervised manner. When transferred to down-stream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects during transfer learning such as dividing the fine-tuning process into steps, chunk lengths, or augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning all at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying speech segments at least with a specific length are required for better performance per application. Augmentation was helpful in SER.
Submitted 10 August, 2022;
originally announced August 2022.
-
Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations
Authors:
Jaejin Cho,
Raghavendra Pappagari,
Piotr Żelasko,
Laureano Moro-Velazquez,
Jesús Villalba,
Najim Dehak
Abstract:
Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. Among the most successful are contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, without labels it is hard to ensure that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised learning method on an unlabeled speech corpus to learn utterance-level embeddings. We used DIstillation with NO labels (DINO), proposed in computer vision, and adapted it to the speech domain. Unlike the contrastive methods, DINO does not require negative sampling. These embeddings were evaluated on speaker verification and emotion recognition. In speaker verification, the unsupervised DINO embedding with cosine scoring provided 4.38% EER on the VoxCeleb1 test trial. This outperforms the best contrastive self-supervised method by 40% relative in EER. An iterative pseudo-labeling training pipeline, not requiring speaker labels, further improved the EER to 1.89%. In emotion recognition, the DINO embedding achieved micro-F1 scores of 60.87%, 79.21%, and 56.98% on IEMOCAP, Crema-D, and MSP-Podcast, respectively. The results imply the generality of the DINO embedding to different speech applications.
Submitted 10 August, 2022;
originally announced August 2022.
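The abstract above reports results as equal error rate (EER) under cosine scoring. As a reading aid, here is a minimal sketch of how those two quantities are commonly computed. It is not the authors' evaluation code: the function names and the simple threshold sweep are assumptions made for illustration, and real toolkits usually interpolate the ROC curve instead.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted when score >= threshold. The EER is taken at the
    threshold where false-acceptance and false-rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (hypothetical, not from the paper): one same-speaker trial scores
# low and one different-speaker trial scores high, so the two error rates
# meet at 25%.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

On the figures quoted in the abstract, going from 4.38% to 1.89% EER is a relative reduction of (4.38 - 1.89) / 4.38, about 57%.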
-
Comparative Validation of AI and non-AI Methods in MRI Volumetry to Diagnose Parkinsonian Syndromes
Authors:
Joomee Song,
Juyoung Hahm,
Jisoo Lee,
Chae Yeon Lim,
Myung Jin Chung,
Jinyoung Youn,
Jin Whan Cho,
Jong Hyeon Ahn,
Kyung-Su Kim
Abstract:
Automated segmentation and volumetry of brain magnetic resonance imaging (MRI) scans are essential for the diagnosis of Parkinson's disease (PD) and Parkinson's plus syndromes (P-plus). To enhance the diagnostic performance, we adopted deep learning (DL) models for brain segmentation and compared their performance with the gold-standard non-DL method. We collected brain MRI scans of healthy controls (n=105) and patients with PD (n=105), multiple system atrophy (n=132), and progressive supranuclear palsy (n=69) at Samsung Medical Center from January 2017 to December 2020. Using the gold-standard non-DL model, FreeSurfer (FS), we segmented six brain structures: midbrain, pons, caudate, putamen, pallidum, and third ventricle, and used them as annotation data for the DL models, the representative V-Net and UNETR. The Dice scores and area under the curve (AUC) for differentiating normal, PD, and P-plus cases were calculated. The segmentation times of V-Net and UNETR for the six brain structures per patient were 3.48 ± 0.17 and 48.14 ± 0.97 s, respectively, being at least 300 times faster than FS (15,735 ± 1.07 s). Dice scores of both DL models were sufficiently high (>0.85), and their AUCs for disease classification were superior to that of FS. For classification of normal vs. P-plus and PD vs. multiple system atrophy (cerebellar type), the DL models and FS showed AUCs above 0.8. DL significantly reduces the analysis time without compromising the performance of brain segmentation and differential diagnosis. Our findings may contribute to the adoption of DL brain MRI segmentation in clinical settings and advance brain research.
Submitted 23 July, 2022;
originally announced July 2022.
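The abstract above uses the Dice score to compare DL segmentations against the FreeSurfer reference. The following is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), for two binary masks. The flat 0/1 list representation and the function name are assumptions for illustration; the paper's actual pipeline works on 3D volumes and is not shown in the listing.

```python
def dice_score(pred_mask, ref_mask):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred_mask) != len(ref_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred_mask, ref_mask) if p and r)
    total = sum(1 for p in pred_mask if p) + sum(1 for r in ref_mask if r)
    if total == 0:
        # Both masks empty: treat as perfect agreement (a convention, assumed here).
        return 1.0
    return 2.0 * intersection / total


# Toy example: the prediction marks two voxels, the reference marks one of them.
pred = [1, 1, 0, 0]
ref = [1, 0, 0, 0]
score = dice_score(pred, ref)  # 2 * 1 / (2 + 1)
```

A score of 1 means the masks coincide and 0 means they do not overlap, so the >0.85 threshold quoted above indicates close agreement with the reference labels.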
-
Domain Agnostic Few-shot Learning for Speaker Verification
Authors:
Seunghan Yang,
Debasmit Das,
Janghoon Cho,
Hyoungwoo Park,
Sungrack Yun
Abstract:
Deep learning models for verification systems often fail to generalize to new users and new environments, even though they learn highly discriminative features. To address this problem, we propose a few-shot domain generalization framework that learns to tackle distribution shift for new users and new domains. Our framework consists of domain-specific and domain-aggregation networks, which are the experts on specific and combined domains, respectively. By using these networks, we generate episodes that mimic the presence of both novel users and novel domains in the training phase to eventually produce better generalization. To save memory, we reduce the number of domain-specific networks by clustering similar domains together. Upon extensive evaluation on artificially generated noise domains, we can explicitly show generalization ability of our framework. In addition, we apply our proposed methods to the existing competitive architecture on the standard benchmark, which shows further performance improvements.
Submitted 27 June, 2022;
originally announced June 2022.
-
Factorization Approach for Sparse Spatio-Temporal Brain-Computer Interface
Authors:
Byeong-Hoo Lee,
Jeong-Hyun Cho,
Byoung-Hee Kwon,
Seong-Whan Lee
Abstract:
Recently, advanced technologies have shown great potential in solving various problems given a large amount of data. However, these technologies have yet to show competitive performance in brain-computer interfaces (BCIs), which deal with brain signals. Brain signals are difficult to collect in large quantities; in particular, the amount of information is sparse in spontaneous BCIs. In addition, we conjecture that high spatial and temporal similarities between tasks increase the prediction difficulty. We define this problem as the sparse condition. To solve this, a factorization approach is introduced to allow the model to obtain distinct representations from the latent space. To this end, we propose two feature extractors: a class-common module, trained through adversarial learning and acting as a generator, and a class-specific module, which uses a loss function generated from classification so that features are extracted with traditional methods. To minimize the latent space shared by the class-common and class-specific features, the model is trained under an orthogonal constraint. As a result, EEG signals are factorized into two separate latent spaces. Evaluations were conducted on a single-arm motor imagery dataset. The results demonstrate that factorizing the EEG signal allows the model to extract rich and decisive features under the sparse condition.
Submitted 16 June, 2022;
originally announced June 2022.
-
Restructuring TCAD System: Teaching Traditional TCAD New Tricks
Authors:
Sanghoon Myung,
Wonik Jang,
Seonghoon Jin,
Jae Myung Choe,
Changwook Jeong,
Dae Sin Kim
Abstract:
Traditional TCAD simulation has succeeded in predicting and optimizing the device performance; however, it still faces a massive challenge - a high computational cost. There have been many attempts to replace TCAD with deep learning, but it has not yet been completely replaced. This paper presents a novel algorithm restructuring the traditional TCAD system. The proposed algorithm predicts three-dimensional (3-D) TCAD simulation in real-time while capturing a variance, enables deep learning and TCAD to complement each other, and fully resolves convergence errors.
Submitted 19 April, 2022;
originally announced April 2022.
-
On Digital Subcarrier Multiplexing under A Bandwidth Limitation and ASE Noise
Authors:
Junho Cho,
Xi Chen,
Greg Raybon,
Son Thai Le
Abstract:
We show that digital subcarrier multiplexing (DSM) systems require much greater complexity for Nyquist pulse shaping than single-carrier (SC) systems, and it is a misconception that both systems use the same bandwidth when using the same pulse shaping. Through back-to-back (B2B) experiments with realistic transmitter (TX) modules and amplified spontaneous emission (ASE) noise loading, we show that even with optimized waterfilling and entropy loading, DSM does not achieve a larger net data rate (NDR) compared to SC when only ASE noise exists in the channel in long-haul transmission scenarios.
Submitted 25 February, 2022;
originally announced February 2022.
-
ML-based Anomaly Detection in Optical Fiber Monitoring
Authors:
Khouloud Abdelli,
Joo Yeon Cho,
Carsten Tropschug
Abstract:
Secure and reliable data communication in optical networks is critical for high-speed internet. We propose a data-driven approach for anomaly detection and fault identification in optical networks to diagnose physical attacks such as fiber breaks and optical tapping. The proposed methods include autoencoder-based anomaly detection and an attention-based bidirectional gated recurrent unit algorithm for fiber fault identification and localization. We verify the efficiency of our methods by experiments under various attack scenarios using real operational data.
Submitted 23 February, 2022;
originally announced February 2022.
-
Wave-Encoded Model-based Deep Learning for Highly Accelerated Imaging with Joint Reconstruction
Authors:
Jaejin Cho,
Borjan Gagoski,
Taehyung Kim,
Qiyuan Tian,
Stephen Robert Frost,
Itthi Chatnuntawech,
Berkin Bilgic
Abstract:
Purpose: To propose a wave-encoded model-based deep learning (wave-MoDL) strategy for highly accelerated 3D imaging and joint multi-contrast image reconstruction, and further extend this to enable rapid quantitative imaging using an interleaved look-locker acquisition sequence with T2 preparation pulse (3D-QALAS).
Method: The recently introduced MoDL technique successfully incorporates convolutional neural network (CNN)-based regularizers into physics-based parallel imaging reconstruction using a small number of network parameters. Wave-CAIPI is an emerging parallel imaging method that accelerates imaging by employing sinusoidal gradients in the phase- and slice-encoding directions during the readout to take better advantage of 3D coil sensitivity profiles. In wave-MoDL, we propose to combine the wave-encoding strategy with unrolled network constraints to accelerate the acquisition while enforcing wave-encoded data consistency. We further extend wave-MoDL to reconstruct multi-contrast data with controlled aliasing in parallel imaging (CAIPI) sampling patterns, leveraging similarity between multiple images to improve the reconstruction quality.
Result: Wave-MoDL enables a 47-second MPRAGE acquisition at 1 mm resolution at 16-fold acceleration. For quantitative imaging, wave-MoDL permits a 2-minute acquisition for T1, T2, and proton density mapping at 1 mm resolution at 12-fold acceleration, from which contrast weighted images can be synthesized as well.
Conclusion: Wave-MoDL allows rapid MR acquisition and high-fidelity image reconstruction and may facilitate clinical and neuroscientific applications by incorporating unrolled neural networks into wave-CAIPI reconstruction.
Submitted 6 February, 2022;
originally announced February 2022.
-
Predicting Future CSI Feedback For Highly-Mobile Massive MIMO Systems
Authors:
Yu Zhang,
Ahmed Alkhateeb,
Pranav Madadi,
Jeongho Jeon,
Joonyoung Cho,
Charlie Zhang
Abstract:
Massive multiple-input multiple-output (MIMO) systems are promising for providing unprecedentedly high data rates. To achieve their full potential, the transceiver needs complete channel state information (CSI) to perform transmit/receive precoding/combining. This requirement, however, is challenging in practical systems due to the unavoidable processing and feedback delays, which often degrade the performance to a great extent, especially in high-mobility scenarios. In this paper, we develop a deep learning based channel prediction framework that proactively predicts the downlink channel state information based on the past observed channel sequence. At its core, the model adopts a 3-D convolutional neural network (CNN) based architecture to efficiently learn the temporal, spatial and frequency correlations of downlink channel samples, based on which accurate channel prediction can be performed. Simulation results highlight the potential of the developed learning model in extracting information and predicting future downlink channels directly from the observed past channel sequence, which significantly improves the performance compared to the sample-and-hold approach and mitigates the impact of the dynamic communication environment.
Submitted 5 February, 2022;
originally announced February 2022.
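The abstract above compares its learned predictor against a sample-and-hold baseline, which simply reuses the most recent channel estimate for every future step. A minimal sketch of that baseline follows, scored with normalized mean squared error (NMSE). The scalar complex channel, the function names, and the choice of NMSE as the metric are assumptions for illustration; the paper's channels are multi-dimensional and its exact metric is not stated in the listing.

```python
def sample_and_hold(history, horizon):
    """Predict the next `horizon` channel samples by repeating the last observation."""
    if not history:
        raise ValueError("history must contain at least one sample")
    return [history[-1]] * horizon


def nmse(predicted, truth):
    """Normalized MSE: sum |pred - true|^2 divided by sum |true|^2."""
    error = sum(abs(p - t) ** 2 for p, t in zip(predicted, truth))
    reference = sum(abs(t) ** 2 for t in truth)
    return error / reference


# Toy complex channel coefficients (hypothetical values).
past = [1 + 0j, 1 + 1j]
future_truth = [1 + 1j, 0 + 1j]
held = sample_and_hold(past, horizon=2)
baseline_nmse = nmse(held, future_truth)  # error 1 over reference energy 3
```

The faster the channel changes between feedback instants, the larger this baseline error becomes, which is the gap a learned predictor aims to close.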
-
Memory-guided Image De-raining Using Time-Lapse Data
Authors:
Jaehoon Cho,
Seungryong Kim,
Kwanghoon Sohn
Abstract:
This paper addresses the problem of single image de-raining, that is, the task of recovering clean and rain-free background scenes from a single image obscured by rain artifacts. Although recent advances adopt real-world time-lapse data to overcome the need for paired rain-clean images, they fall short of fully exploiting the time-lapse data. The main cause is that, in terms of network architectures, they cannot capture long-term rain streak information in the time-lapse data during training owing to the lack of memory components. To address this problem, we propose a novel network architecture based on a memory network that explicitly helps to capture long-term rain streak information in the time-lapse data. Our network comprises encoder-decoder networks and a memory network. The features extracted from the encoder are read and updated in the memory network, which contains several memory items that store rain streak-aware feature representations. With the read/update operation, the memory network retrieves the memory items relevant to the queries, enabling the memory items to represent the various rain streaks included in the time-lapse data. To boost the discriminative power of memory features, we also present a novel background selective whitening (BSW) loss for capturing only rain streak information in the memory network by erasing the background information. Experimental results on standard benchmarks demonstrate the effectiveness and superiority of our approach.
Submitted 5 January, 2022;
originally announced January 2022.
-
Recognition of Tactile-related EEG Signals Generated by Self-touch
Authors:
Myoung-Ki Kim,
Jeong-Hyun Cho,
Hye-Bin Shin
Abstract:
Touch is the first of the human senses and one of the most indispensable. However, compared to sight and hearing, it is often neglected. In particular, since humans use the tactile sense of the skin to recognize and manipulate objects, it is very difficult to recognize or skillfully manipulate objects without tactile sensation. In addition, the importance of and interest in haptic technology related to touch have been increasing with the development of technologies such as VR and AR in recent years. So far, the focus has been only on haptic technology based on mechanical devices. In particular, there are not many studies on tactile sensation in the field of EEG-based brain-computer interfaces. There have been some studies that measured the surface roughness of artificial structures in relation to EEG-based tactile sensation. However, most studies have used passive contact methods in which the object moves while the human subject remains still. Additionally, there have been no EEG-based tactile studies of active skin touch. In reality, we directly move our hands to feel the sense of touch. Therefore, as a preliminary study for our future research, we collected EEG signals for tactile sensation upon skin touch based on active touch, and compared and analyzed differences in brain changes during touch and movement tasks. Through time-frequency analysis and statistical analysis, significant differences in power changes in alpha, beta, gamma, and high-gamma regions were observed. In addition, major spatial differences were observed in the sensory-motor region of the brain.
Submitted 13 December, 2021;
originally announced December 2021.
-
On the Kurtosis of Modulation Formats for Characterizing the Nonlinear Fiber Propagation
Authors:
Junho Cho,
Robert Tkach
Abstract:
Knowing only two high-order statistical moments of modulation symbols, often represented by the fourth moment called "kurtosis", the overestimation of nonlinear interference (NLI) in a Gaussian noise (GN) model due to Gaussian signaling assumption can be corrected through an enhanced GN (EGN) model. However, in some modern optical communication systems where the transmitted modulation symbols are statistically correlated, such as in systems that use probabilistic constellation shaping (PCS) with finite-length sphere shaping, the kurtosis-based EGN model produces significant inaccuracies in analytical prediction of NLI. In this paper, we show that for correlated modulation symbols, the NLI can be more accurately estimated by substituting a statistical measure called windowed kurtosis into the EGN model, instead of the conventional kurtosis. Remarkably, the optimal window length for windowed kurtosis is found to be consistent with the self-phase modulation (SPM) and cross-phase modulation (XPM) characteristic times in various system configurations. The findings can be used in practice to analytically evaluate and design NLI-tolerant modulation formats.
Submitted 7 December, 2021;
originally announced December 2021.
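The abstract above relies on the kurtosis of the modulation symbols. As an illustration, here is a minimal sketch that computes the conventional kurtosis of a constellation, taken here as the normalized fourth moment E|x|^4 / (E|x|^2)^2. That normalization is an assumption on our part (it is the form commonly used with the EGN model), and the windowed kurtosis proposed in the paper is not reproduced because the listing does not define it.

```python
def kurtosis(symbols, probabilities=None):
    """Normalized fourth moment E|x|^4 / (E|x|^2)^2 of a complex constellation.

    `probabilities` defaults to uniform signaling; pass a shaped distribution
    to model probabilistic constellation shaping.
    """
    if probabilities is None:
        probabilities = [1.0 / len(symbols)] * len(symbols)
    # |x|^2 computed from real and imaginary parts to avoid a square root.
    energies = [s.real ** 2 + s.imag ** 2 for s in symbols]
    second_moment = sum(p * e for p, e in zip(probabilities, energies))
    fourth_moment = sum(p * e ** 2 for p, e in zip(probabilities, energies))
    return fourth_moment / second_moment ** 2


# Square constellations built from odd-integer amplitude levels.
qpsk = [complex(i, q) for i in (-1, 1) for q in (-1, 1)]
qam16 = [complex(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)]

k_qpsk = kurtosis(qpsk)    # constant-modulus format: exactly 1
k_qam16 = kurtosis(qam16)  # 132 / 100 = 1.32
```

A circularly symmetric complex Gaussian input has a value of 2 under this definition, so constellations with smaller kurtosis sit further from the Gaussian signaling assumption of the GN model.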
-
Turbo Autoencoder with a Trainable Interleaver
Authors:
Karl Chahine,
Yihan Jiang,
Pooja Nuti,
Hyeji Kim,
Joonyoung Cho
Abstract:
A critical aspect of reliable communication involves the design of codes that allow transmissions to be robustly and computationally efficiently decoded under noisy conditions. Advances in the design of reliable codes have been driven by coding theory and have been sporadic. Recently, it has been shown that channel codes comparable to modern codes can be learned solely via deep learning. In particular, Turbo Autoencoder (TURBOAE), introduced by Jiang et al., has been shown to achieve the reliability of Turbo codes for Additive White Gaussian Noise channels. In this paper, we focus on applying the idea of TURBOAE to various practical channels, such as fading channels and chirp noise channels. We introduce TURBOAE-TI, a novel neural architecture that combines TURBOAE with a trainable interleaver design. We develop a carefully designed training procedure and a novel interleaver penalty function that are crucial in learning the interleaver and TURBOAE jointly. We demonstrate that TURBOAE-TI outperforms TURBOAE and LTE Turbo codes for several channels of interest. We also provide interpretation analysis to better understand TURBOAE-TI.
Submitted 22 November, 2021;
originally announced November 2021.
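For readers unfamiliar with the interleaver mentioned in the abstract above, the sketch below shows the basic operation with a fixed pseudo-random permutation. It is not TURBOAE-TI: the trainable interleaver and its penalty function are not described in the listing, so this only illustrates what an interleaver does, and all names here are hypothetical.

```python
import random


def make_permutation(length, seed=0):
    """Build a fixed pseudo-random permutation of range(length)."""
    rng = random.Random(seed)
    permutation = list(range(length))
    rng.shuffle(permutation)
    return permutation


def interleave(sequence, permutation):
    """Output position k takes the input element at index permutation[k]."""
    return [sequence[index] for index in permutation]


def deinterleave(sequence, permutation):
    """Invert `interleave` by writing each element back to its source index."""
    restored = [None] * len(sequence)
    for position, source_index in enumerate(permutation):
        restored[source_index] = sequence[position]
    return restored


message = [1, 0, 1, 1, 0, 0, 1, 0]
perm = make_permutation(len(message), seed=1)
scrambled = interleave(message, perm)
recovered = deinterleave(scrambled, perm)  # equals `message`
```

In a Turbo-style code the second encoder sees the interleaved sequence, so the choice of permutation shapes how errors are spread; learning that permutation jointly with the code is the idea the abstract describes.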
-
The JHU submission to VoxSRC-21: Track 3
Authors:
Jejin Cho,
Jesus Villalba,
Najim Dehak
Abstract:
This technical report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the one proposed by the first-place team in last year's VoxSRC2020 challenge. The main difference is that a recently proposed non-contrastive self-supervised method from computer vision (CV), distillation with no labels (DINO), is used to train our initial model, which outperformed last year's contrastive learning based on momentum contrast (MoCo). Also, this requires only a few iterations in the iterative clustering stage, where pseudo labels for supervised embedding learning are updated based on the clusters of the embeddings generated from a model that is continually fine-tuned over iterations. In the final stage, Res2Net50 is trained on the final pseudo labels from the iterative clustering stage. This is our best submitted model to the challenge, showing 1.89, 6.50, and 6.89 in EER(%) on the VoxCeleb1 test o, VoxSRC-21 validation, and test trials, respectively.
Submitted 27 September, 2021;
originally announced September 2021.
-
Single-ended Coherent Receiver
Authors:
Son Thai Le,
Vahid Aref,
Junho Cho
Abstract:
Commercial coherent receivers utilize balanced photodetectors (PDs) with high single-port rejection ratio (SPRR) to mitigate the signal-signal beat interference (SSBI) due to the square-law detection process. As the symbol rates of coherent transponders are increased to 100 Gbaud and beyond, maintaining a high SPRR in a cost-effective manner becomes more and more challenging. One potential approach for solving this problem is to leverage the concept of a single-ended coherent receiver (SER), where single-ended PDs are used instead of the balanced PDs. In this case, the resulting SSBI should be mitigated in the digital domain. In this paper, we show that SSBI can be effectively mitigated using various low-complexity techniques, such as direct field reconstruction (DFR), clipped iterative SSBI cancellation (CIC) and gradient descent (GD). In addition, we present a self-calibration technique for SERs which can be extended for characterizing the optical-to-electrical (O/E) response of a conventional balanced coherent receiver (BR). Using the developed techniques, we then experimentally demonstrate a 90 Gbaud probabilistically constellation shaped 64-QAM (PCS-64QAM) transmission using a SER, achieving a net data rate of 882 Gb/s over 100 km of standard single mode fiber (SSMF). The sensitivity penalty compared to the BR is below 0.5 dB. We expect that when the symbol rate is increased further, a SER can potentially outperform a BR, especially when applied to cost-sensitive commercial pluggable coherent transceivers.
Submitted 12 September, 2021;
originally announced September 2021.
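The abstract above attributes SSBI to square-law detection. A minimal sketch of where that term comes from is given below, assuming a deliberately simplified model: one ideal photodiode, a single complex signal sample, and a local oscillator (LO) field added directly to it. A real single-ended coherent receiver uses an optical hybrid and several photodiodes, and none of the paper's mitigation techniques (DFR, CIC, GD) are implemented here.

```python
def single_ended_photocurrent(signal, local_oscillator):
    """Ideal square-law detection: the output is |LO + signal|^2."""
    field = local_oscillator + signal
    return field.real ** 2 + field.imag ** 2


def decompose_photocurrent(signal, local_oscillator):
    """Split |LO + s|^2 into its three terms.

    |LO + s|^2 = |LO|^2 + 2 Re(conj(LO) * s) + |s|^2
                 (DC)     (useful beat term)   (SSBI)
    """
    dc = local_oscillator.real ** 2 + local_oscillator.imag ** 2
    beat = 2.0 * (local_oscillator.conjugate() * signal).real
    ssbi = signal.real ** 2 + signal.imag ** 2
    return dc, beat, ssbi


# Hypothetical sample values chosen only to make the arithmetic easy to follow.
s = 0.3 + 0.4j
lo = 2 + 0j
current = single_ended_photocurrent(s, lo)  # 2.3^2 + 0.4^2 = 5.45
dc, beat, ssbi = decompose_photocurrent(s, lo)  # 4.0, 1.2, 0.25
```

A balanced photodetector pair subtracts two such outputs so that the DC and SSBI terms cancel and only the beat term remains; with a single-ended photodiode the |s|^2 term stays in the output and has to be removed digitally, as the abstract describes.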
-
BUDA-SAGE with self-supervised denoising enables fast, distortion-free, high-resolution T2, T2*, para- and dia-magnetic susceptibility mapping
Authors:
Zijing Zhang,
Long Wang,
Jaejin Cho,
Congyu Liao,
Hyeong-Geol Shin,
Xiaozhi Cao,
Jongho Lee,
Jinmin Xu,
Tao Zhang,
Huihui Ye,
Kawin Setsompop,
Huafeng Liu,
Berkin Bilgic
Abstract:
Our goal is to rapidly obtain high-resolution T2, T2* and quantitative susceptibility mapping (QSM) source separation maps with whole-brain coverage and high geometric fidelity. We propose Blip Up-Down Acquisition for Spin And Gradient Echo imaging (BUDA-SAGE), an efficient echo-planar imaging (EPI) sequence for quantitative mapping. The acquisition includes multiple T2*-, T2'- and T2-weighted contrasts. We alternate the phase-encoding polarities across the interleaved shots in this multi-shot navigator-free acquisition. A field map estimated from interim reconstructions was incorporated into the joint multi-shot EPI reconstruction with a structured low rank constraint to eliminate geometric distortion. A self-supervised MR-Self2Self (MR-S2S) neural network (NN) was utilized to perform denoising after BUDA reconstruction to boost SNR. Employing Slider encoding allowed us to reach 1 mm isotropic resolution by performing super-resolution reconstruction on BUDA-SAGE volumes acquired with 2 mm slice thickness. Quantitative T2 and T2* maps were obtained using Bloch dictionary matching on the reconstructed echoes. QSM was estimated using nonlinear dipole inversion (NDI) on the gradient echoes. Starting from the estimated R2 and R2* maps, R2' information was derived and used in source separation QSM reconstruction, which provided additional para- and dia-magnetic susceptibility maps. In vivo results demonstrate the ability of BUDA-SAGE to provide whole-brain, distortion-free, high-resolution multi-contrast images and quantitative T2 and T2* maps, as well as para- and dia-magnetic susceptibility maps. Derived quantitative maps showed comparable values to conventional mapping methods in phantom and in vivo measurements. BUDA-SAGE acquisition with self-supervised denoising and Slider encoding enabled rapid, distortion-free, whole-brain T2 and T2* mapping at 1 mm³ isotropic resolution in 90 seconds.
Submitted 9 September, 2021; v1 submitted 28 August, 2021;
originally announced August 2021.
-
Highly Accelerated EPI with Wave Encoding and Multi-shot Simultaneous Multi-Slice Imaging
Authors:
Jaejin Cho,
Congyu Liao,
Qiyuan Tian,
Zijing Zhang,
Jinmin Xu,
Wei-Ching Lo,
Benedikt A. Poser,
V. Andrew Stenger,
Jason Stockmann,
Kawin Setsompop,
Berkin Bilgic
Abstract:
We introduce wave encoded acquisition and reconstruction techniques for highly accelerated echo planar imaging (EPI) with reduced g-factor penalty and image artifacts. Wave-EPI involves playing sinusoidal gradients during the EPI readout while employing interslice shifts as in blipped-CAIPI acquisitions. This spreads the aliasing in all spatial directions, thereby taking better advantage of 3D coil sensitivity profiles. The amount of voxel spreading that can be achieved by the wave gradients during the short EPI readout period is constrained by the slew rate of the gradient coils and peripheral nerve stimulation (PNS) monitor. We propose to use a half-cycle sinusoidal gradient to increase the amount of voxel spreading that can be achieved while respecting the slew and stimulation constraints. Extending wave-EPI to multi-shot acquisition minimizes geometric distortion and voxel blurring at high in-plane resolution, while structured low-rank regularization mitigates shot-to-shot phase variations without additional navigators. We propose to use different point spread functions (PSFs) for the k-space lines with positive and negative polarities, which are calibrated with a FLEET-based reference scan and allow for addressing gradient imperfections. Wave-EPI provided whole-brain single-shot gradient echo (GE) and multi-shot spin echo (SE) EPI acquisitions at high acceleration factors and was combined with g-Slider slab encoding to boost the SNR level in 1mm isotropic diffusion imaging. Relative to blipped-CAIPI, wave-EPI reduced average and maximum g-factors by up to 1.21- and 1.37-fold, respectively. In conclusion, wave-EPI allows highly accelerated single- and multi-shot EPI with reduced g-factor and artifacts and may facilitate clinical and neuroscientific applications of EPI by improving the spatial and temporal resolution in functional and diffusion imaging.
Submitted 3 June, 2021;
originally announced June 2021.
-
Scan Specific Artifact Reduction in K-space (SPARK) Neural Networks Synergize with Physics-based Reconstruction to Accelerate MRI
Authors:
Yamin Arefeen,
Onur Beker,
Jaejin Cho,
Heng Yu,
Elfar Adalsteinsson,
Berkin Bilgic
Abstract:
Purpose: To develop a scan-specific model that estimates and corrects k-space errors made when reconstructing accelerated Magnetic Resonance Imaging (MRI) data.
Methods: Scan-Specific Artifact Reduction in k-space (SPARK) trains a convolutional neural network to estimate and correct k-space errors made by an input reconstruction technique by back-propagating from the mean-squared-error loss between an auto-calibration signal (ACS) and the input technique's reconstructed ACS. First, SPARK is applied to GRAPPA and demonstrates improved robustness over other scan-specific models, such as RAKI and residual-RAKI. Subsequent experiments demonstrate that SPARK synergizes with residual-RAKI to improve reconstruction performance. SPARK also improves reconstruction quality when applied to advanced acquisition and reconstruction techniques like 2D virtual coil (VC-) GRAPPA, 2D LORAKS, 3D GRAPPA without an integrated ACS region, and 2D/3D wave-encoded images.
Results: SPARK yields 1.5x - 2x RMSE reduction when applied to GRAPPA and improves robustness to ACS size for various acceleration rates in comparison to other scan-specific techniques. When applied to advanced reconstruction techniques such as residual-RAKI, 2D VC-GRAPPA and LORAKS, SPARK achieves up to 20% RMSE improvement. SPARK with 3D GRAPPA also improves performance by ~2x and perceived image quality without a fully sampled ACS region. Finally, SPARK synergizes with non-Cartesian 2D and 3D wave-encoding imaging by reducing RMSE by 20-25% and providing qualitative improvements.
Conclusion: SPARK synergizes with physics-based acquisition and reconstruction techniques to improve accelerated MRI by training scan-specific models to estimate and correct reconstruction errors in k-space.
Submitted 28 April, 2022; v1 submitted 2 April, 2021;
originally announced April 2021.