-
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Authors:
Shengpeng Ji,
Ziyue Jiang,
Xize Cheng,
Yifu Chen,
Minghui Fang,
Jialong Zuo,
Qian Yang,
Ruiqi Li,
Ziang Zhang,
Xiaoda Yang,
Rongjie Huang,
Yidi Jiang,
Qian Chen,
Siqi Zheng,
Wen Wang,
Zhou Zhao
Abstract:
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one second of audio at a 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
Submitted 29 August, 2024;
originally announced August 2024.
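The compression figures in the WavTokenizer abstract can be turned into a per-token time span with simple arithmetic. The sketch below is only an illustration of that arithmetic; the function name `samples_per_token` is hypothetical and is not taken from the WavTokenizer code base.

```python
def samples_per_token(sample_rate_hz: int, tokens_per_second: int) -> float:
    """Return how many waveform samples a single discrete token covers."""
    return sample_rate_hz / tokens_per_second


# Figures from the abstract: 24 kHz audio, 40 or 75 tokens per second.
for tokens_per_second in (40, 75):
    span = samples_per_token(24_000, tokens_per_second)
    print(f"{tokens_per_second} tokens/s -> {span:.0f} samples per token")
```

At 40 tokens per second each token spans 600 samples (25 ms), and at 75 tokens per second each token spans 320 samples (about 13.3 ms).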
-
NuSegDG: Integration of Heterogeneous Space and Gaussian Kernel for Domain-Generalized Nuclei Segmentation
Authors:
Zhenye Lou,
Qing Xu,
Zekun Jiang,
Xiangjian He,
Zhen Chen,
Yi Wang,
Chenxin Li,
Maggie M. He,
Wenting Duan
Abstract:
Domain-generalized nuclei segmentation refers to the generalizability of models to unseen domains based on knowledge learned from source domains and is challenged by various image conditions, cell types, and staining strategies. Recently, the Segment Anything Model (SAM) has achieved great success in universal image segmentation by interactive prompt modes (e.g., point and box). Despite its strengths, the original SAM presents limited adaptation to medical images. Moreover, SAM requires providing manual bounding box prompts for each object to produce satisfactory segmentation masks, so it is laborious in nuclei segmentation scenarios. To address these limitations, we propose a domain-generalizable framework for nuclei image segmentation, abbreviated to NuSegDG. Specifically, we first devise a Heterogeneous Space Adapter (HS-Adapter) to learn multi-dimensional feature representations of different nuclei domains by injecting a small number of trainable parameters into the image encoder of SAM. To alleviate the labor-intensive requirement of manual prompts, we introduce a Gaussian-Kernel Prompt Encoder (GKP-Encoder) to generate density maps driven by a single point, which guides segmentation predictions by mixing position prompts and semantic prompts. Furthermore, we present a Two-Stage Mask Decoder (TSM-Decoder) to effectively convert semantic masks to instance maps without the manual demand for morphological shape refinement. Based on our experimental evaluations, the proposed NuSegDG demonstrates state-of-the-art performance in nuclei instance segmentation, exhibiting superior domain generalization capabilities. The source code is available at https://github.com/xq141839/NuSegDG.
Submitted 24 August, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Automatic Mitigation of Dynamic Atmospheric Turbulence Using Optical Phase Conjugation for Coherent Free-Space Optical Communications
Authors:
Huibin Zhou,
Xinzhou Su,
Yuxiang Duan,
Yue Zuo,
Zile Jiang,
Muralekrishnan Ramakrishnan,
Jan Tepper,
Volker Ziegler,
Robert W. Boyd,
Moshe Tur,
Alan E. Willner
Abstract:
Coherent detection can provide enhanced receiver sensitivity and spectral efficiency in free-space optical (FSO) communications. However, turbulence can cause modal power coupling effects on a Gaussian data beam and significantly degrade the mixing efficiency between the data beam and a Gaussian local oscillator (LO) in the coherent detector. Optical phase conjugation (OPC) in a photorefractive crystal can "automatically" mitigate turbulence by: (a) recording a back-propagated turbulence-distorted probe beam, and (b) creating a phase-conjugate beam that has the inverse phase distortion of the medium as the transmitted data beam. However, previously reported crystal-based OPC approaches for FSO links have demonstrated either: (i) a relatively fast response time of 35 ms but at a relatively low data rate (e.g., <1 Mbit/s), or (ii) a relatively high data rate of 2-Gbit/s but at a slow response time (e.g., >60 s). Here, we report an OPC approach for the automatic mitigation of dynamic turbulence that enables both a high data rate (8 Gbit/s) data beam and a rapid (<5 ms) response time. For a similar data rate, this represents a 10,000-fold faster response time than previous reports, thereby enabling mitigation for dynamic effects. In our approach, the transmitted pre-distorted phase-conjugate data beam is generated by four-wave mixing in a GaAs crystal of three input beams: a turbulence-distorted probe beam, a Gaussian reference beam regenerated from the probe beam, and a Gaussian data beam carrying a high-speed data channel. We experimentally demonstrate our approach in an 8-Gbit/s quadrature-phase-shift-keying coherent FSO link through emulated dynamic turbulence. Our results show ~10-dB improvement in the mixing efficiency of the LO with the data beam under dynamic turbulence with a bandwidth of up to ~260 Hz (Greenwood frequency).
Submitted 17 August, 2024;
originally announced August 2024.
-
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Authors:
Jiawei Huang,
Chen Zhang,
Yi Ren,
Ziyue Jiang,
Zhenhui Ye,
Jinglin Liu,
Jinzheng He,
Xiang Yin,
Zhou Zhao
Abstract:
Voice conversion aims to modify the source speaker's voice to resemble the target speaker while preserving the original speech content. Despite notable advancements in voice conversion these days, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: in step one, the model is trained with monolingual speech data; then, steps two and three take inspiration from back translation and construct a cyclical process to disentangle the timbre and other information (content, prosody, and other language-related information) in the absence of multi-lingual data from the same speaker. Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts, demonstrating the system's efficacy and the viability of the three-step approach with cycle consistency. Audio samples can be found on our demo page (mullivc.github.io).
Submitted 8 August, 2024;
originally announced August 2024.
-
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
Authors:
Qian Yang,
Jialong Zuo,
Zhe Su,
Ziyue Jiang,
Mingze Li,
Zhou Zhao,
Feiyang Chen,
Zhefeng Wang,
Baoxing Huai
Abstract:
We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. We have established a robust baseline, through the prompting mechanism, that can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody with arbitrary text input. The open source MSceneSpeech Dataset and audio samples of our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.
Submitted 18 July, 2024;
originally announced July 2024.
-
RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark
Authors:
Yuan-Hao Ho,
Jen-Hao Cheng,
Sheng Yao Kuan,
Zhongyu Jiang,
Wenhao Chai,
Hsiang-Wei Huang,
Chih-Lung Lin,
Jenq-Neng Hwang
Abstract:
Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy-preserving, rendering the method more conducive to practical deployments. This paper presents a Radar Tensor-based human pose (RT-Pose) dataset and an open-source benchmarking framework. The RT-Pose dataset comprises 4D radar tensors, LiDAR point clouds, and RGB images, and is collected for a total of 72k frames across 240 sequences with six different complexity-level actions. The 4D radar tensor provides raw spatio-temporal information, differentiating it from other radar point cloud-based datasets. We develop an annotation process using RGB images and LiDAR point clouds to accurately label 3D human skeletons. In addition, we propose HRRadarPose, the first single-stage architecture that extracts the high-resolution representation of 4D radar tensors in 3D space to aid human keypoint estimation. HRRadarPose outperforms previous radar-based HPE work on the RT-Pose benchmark. The overall HRRadarPose performance on the RT-Pose dataset, as reflected in a mean per joint position error (MPJPE) of 9.91cm, indicates the persistent challenges in achieving accurate HPE in complex real-world scenarios. RT-Pose is available at https://huggingface.co/datasets/uwipl/RT-Pose.
Submitted 18 July, 2024;
originally announced July 2024.
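The RT-Pose abstract reports a mean per joint position error (MPJPE). As a minimal sketch of how that metric is commonly defined (the mean Euclidean distance between predicted and ground-truth joints), the following uses a hypothetical `mpjpe` helper and made-up toy coordinates; it is not the benchmark's own evaluation code.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching 3D joints (same units as input)."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


# Toy skeleton with two joints, coordinates in centimetres (illustrative values).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(predicted, target))  # joint errors are 3 cm and 4 cm, so the mean is 3.5
```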
-
Data Alchemy: Mitigating Cross-Site Model Variability Through Test Time Data Calibration
Authors:
Abhijeet Parida,
Antonia Alomar,
Zhifan Jiang,
Pooneh Roshanitabrizi,
Austin Tapp,
Maria Ledesma-Carbayo,
Ziyue Xu,
Syed Muhammed Anwar,
Marius George Linguraru,
Holger R. Roth
Abstract:
Deploying deep learning-based imaging tools across various clinical sites poses significant challenges due to inherent domain shifts and regulatory hurdles associated with site-specific fine-tuning. For histopathology, stain normalization techniques can mitigate discrepancies, but they often fall short of eliminating inter-site variations. Therefore, we present Data Alchemy, an explainable stain normalization method combined with test time data calibration via a template learning framework to overcome barriers in cross-site analysis. Data Alchemy handles shifts inherent to multi-site data and minimizes them without needing to change the weights of the normalization or classifier networks. Our approach extends to unseen sites in various clinical settings where data domain discrepancies are unknown. Extensive experiments highlight the efficacy of our framework in tumor classification in hematoxylin and eosin-stained patches. Our explainable normalization method boosts the classification task's area under the precision-recall curve (AUPR) by 0.165, from 0.545 to 0.710. Additionally, Data Alchemy further reduces the multisite classification domain gap by improving the 0.710 AUPR by an additional 0.142, raising overall classification performance from 0.545 to 0.852. Our Data Alchemy framework can popularize precision medicine with minimal operational overhead by allowing for the seamless integration of pre-trained deep learning-based clinical tools across multiple sites.
Submitted 18 July, 2024;
originally announced July 2024.
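The AUPR figures quoted in the Data Alchemy abstract chain together as two successive gains. The short check below simply replays that arithmetic with the numbers from the abstract; the variable names are illustrative only.

```python
# AUPR values and gains as reported in the abstract.
baseline_aupr = 0.545
normalization_gain = 0.165
calibration_gain = 0.142

after_normalization = round(baseline_aupr + normalization_gain, 3)
after_calibration = round(after_normalization + calibration_gain, 3)
total_gain = round(after_calibration - baseline_aupr, 3)

print(after_normalization, after_calibration, total_gain)
```

The two steps give 0.710 and then 0.852, for a combined gain of 0.307 over the 0.545 starting point.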
-
BraTS-PEDs: Results of the Multi-Consortium International Pediatric Brain Tumor Segmentation Challenge 2023
Authors:
Anahita Fathi Kazerooni,
Nastaran Khalili,
Xinyang Liu,
Debanjan Haldar,
Zhifan Jiang,
Anna Zapaishchykova,
Julija Pavaine,
Lubdha M. Shah,
Blaise V. Jones,
Nakul Sheth,
Sanjay P. Prabhu,
Aaron S. McAllister,
Wenxin Tu,
Khanak K. Nandolia,
Andres F. Rodriguez,
Ibraheem Salman Shaikh,
Mariana Sanchez Montano,
Hollie Anne Lai,
Maruf Adewole,
Jake Albrecht,
Udunna Anazodo,
Hannah Anderson,
Syed Muhammed Anwar,
Alejandro Aristizabal,
Sina Bagheri,
et al. (55 additional authors not shown)
Abstract:
Pediatric central nervous system tumors are the leading cause of cancer-related deaths in children. The five-year survival rate for high-grade glioma in children is less than 20%. The development of new treatments is dependent upon multi-institutional collaborative clinical trials requiring reproducible and accurate centralized response assessment. We present the results of the BraTS-PEDs 2023 challenge, the first Brain Tumor Segmentation (BraTS) challenge focused on pediatric brain tumors. This challenge utilized data acquired from multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. BraTS-PEDs 2023 aimed to evaluate volumetric segmentation algorithms for pediatric brain gliomas from magnetic resonance imaging using standardized quantitative performance evaluation metrics employed across the BraTS 2023 challenges. The top-performing AI approaches for pediatric tumor analysis included ensembles of nnU-Net and Swin UNETR, Auto3DSeg, or nnU-Net with a self-supervised framework. The BraTS-PEDs 2023 challenge fostered collaboration between clinicians (neuro-oncologists, neuroradiologists) and AI/imaging scientists, promoting faster data sharing and the development of automated volumetric analysis techniques. These advancements could significantly benefit clinical trials and improve the care of children with brain tumors.
Submitted 16 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions
Authors:
Hareem Nisar,
Syed Muhammad Anwar,
Zhifan Jiang,
Abhijeet Parida,
Ramon Sanchez-Jacob,
Vishwesh Nath,
Holger R. Roth,
Marius George Linguraru
Abstract:
Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis, which currently hinders the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax -- a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnoses. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open- and closed-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
Submitted 2 August, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Authors:
Meng Wang,
Tian Lin,
Aidi Lin,
Kai Yu,
Yuanyuan Peng,
Lianyu Wang,
Cheng Chen,
Ke Zou,
Huiyu Liang,
Man Chen,
Xue Yao,
Meiqin Zhang,
Binwei Huang,
Chaoxin Zheng,
Peixin Zhang,
Wei Chen,
Yilong Luo,
Yifan Chen,
Honghe Xia,
Tingkun Shi,
Qi Zhang,
Jinming Guo,
Xiaolin Chen,
Jingcheng Wang,
Yih Chung Tham,
et al. (24 additional authors not shown)
Abstract:
Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.
Submitted 30 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
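The RetiZero abstract reports Top5 and Top3 scores. As a rough sketch of how a top-k accuracy is usually computed (a prediction counts as correct when the true class is among the k highest-scoring classes), the following uses a hypothetical `top_k_accuracy` helper and invented toy scores; the paper's exact evaluation protocol may differ.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three samples, three classes (illustrative similarity scores, not real data).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [2, 0, 0]
for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, labels, k))
```

Accuracy can only stay the same or rise as k grows, which is why Top5 figures are not directly comparable with Top1 or Top3 figures.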
-
AudioMarkBench: Benchmarking Robustness of Audio Watermarking
Authors:
Hongbin Liu,
Moyang Guo,
Zhengyuan Jiang,
Lun Wang,
Neil Zhenqiang Gong
Abstract:
The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audios. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present AudioMarkBench, the first systematic benchmark for evaluating the robustness of audio watermarking against watermark removal and watermark forgery. AudioMarkBench includes a new dataset created from Common-Voice across languages, biological sexes, and ages, 3 state-of-the-art watermarking methods, and 15 types of perturbations. We benchmark the robustness of these methods against the perturbations in no-box, black-box, and white-box settings. Our findings highlight the vulnerabilities of current watermarking techniques and emphasize the need for more robust and fair audio watermarking solutions. Our dataset and code are publicly available at https://github.com/moyangkuo/AudioMarkBench.
Submitted 11 June, 2024;
originally announced June 2024.
-
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
Authors:
Shengpeng Ji,
Jialong Zuo,
Minghui Fang,
Siqi Zheng,
Qian Chen,
Wen Wang,
Ziyue Jiang,
Hai Huang,
Xize Cheng,
Rongjie Huang,
Zhou Zhao
Abstract:
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. The SMSD module, which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style-controllable dataset and some replicated baseline models, and we propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech.
Submitted 3 June, 2024;
originally announced June 2024.
-
Safety-Critical Control of Euler-Lagrange Systems Subject to Multiple Obstacles and Velocity Constraints
Authors:
Zhi Liu,
Si Wu,
Tengfei Liu,
Zhong-Ping Jiang
Abstract:
This paper studies the safety-critical control problem for Euler-Lagrange (EL) systems subject to multiple ball obstacles and velocity constraints in accordance with affordable velocity ranges. A key strategy is to exploit the underlying inner-outer-loop structure for the design of a new cascade controller for the class of EL systems. In particular, the outer-loop controller is developed based on quadratic programming (QP) to avoid ball obstacles and generate velocity reference signals fulfilling the velocity limitation. Taking full advantage of the conservation-of-energy property, a nonlinear velocity-tracking controller is designed to form the inner loop. One major difficulty is caused by the possible non-Lipschitz continuity of the standard QP algorithm when there are multiple constraints. To solve this problem, we propose a refined QP algorithm with the feasible set reshaped by an appropriately chosen positive basis such that the feasibility is retained while the resulting outer-loop controller is locally Lipschitz. It is proved that the constraint-satisfaction problem is solvable as long as the ball obstacles satisfy a mild distance condition. The proposed design is validated by numerical simulation and an experiment based on a 2-link planar manipulator.
Submitted 3 June, 2024;
originally announced June 2024.
-
Constructive Safety Control
Authors:
Si Wu,
Tengfei Liu,
Zhong-Ping Jiang
Abstract:
This paper proposes a constructive approach to safety control of nonlinear cascade systems subject to multiple state constraints. New design ingredients include a unified characterization of safety and stability for systematic designs of safety controllers, and a novel technique of reshaping the feasible sets of quadratically constrained quadratic programming induced from safety control. The proposed method guarantees Lipschitz continuity of virtual control laws, enabling a stepwise constructive design. A refined nonlinear small-gain synthesis is employed to address the nonlinear uncertain interconnections between the resulting subsystems corresponding to different virtual control laws, and to guarantee the achievement of the safety control objective. When the safety constraints are removed, the proposed approach coincides with the standard constructive nonlinear control. The proposed safety-control algorithm is experimentally validated in a testbed involving a vertical takeoff and landing (VTOL) vehicle taking off in narrow spaces.
Submitted 3 June, 2024;
originally announced June 2024.
-
Singular Perturbation: When the Perturbation Parameter Becomes a State-Dependent Function
Authors:
Tengfei Liu,
Zhong-Ping Jiang
Abstract:
This paper presents a new systematic framework for nonlinear singularly perturbed systems in which state-dependent perturbation functions are used instead of constant perturbation coefficients. Under this framework, general results are obtained for the global robust stability and input-to-state stability of nonlinear singularly perturbed systems. Interestingly, the proposed methodology provides innovative solutions beyond traditional singular perturbation theory for emerging control problems arising from nonlinear integral control, feedback optimization, and formation-based source seeking.
Submitted 2 June, 2024;
originally announced June 2024.
-
Analysis of the BraTS 2023 Intracranial Meningioma Segmentation Challenge
Authors:
Dominic LaBella,
Ujjwal Baid,
Omaditya Khanna,
Shan McBurney-Lin,
Ryan McLean,
Pierre Nedelec,
Arif Rashid,
Nourel Hoda Tahon,
Talissa Altes,
Radhika Bhalerao,
Yaseen Dhemesh,
Devon Godfrey,
Fathi Hilal,
Scott Floyd,
Anastasia Janas,
Anahita Fathi Kazerooni,
John Kirkpatrick,
Collin Kent,
Florian Kofler,
Kevin Leu,
Nazanin Maleki,
Bjoern Menze,
Maxence Pajot,
Zachary J. Reitman,
Jeffrey D. Rudie
, et al. (96 additional authors not shown)
Abstract:
We describe the design and results from the BraTS 2023 Intracranial Meningioma Segmentation Challenge. The BraTS Meningioma Challenge differed from prior BraTS Glioma challenges in that it focused on meningiomas, which are typically benign extra-axial tumors with diverse radiologic and anatomical presentation and a propensity for multiplicity. Nine participating teams each developed deep-learning automated segmentation models using image data from the largest multi-institutional, systematically expert-annotated, multilabel, multi-sequence meningioma MRI dataset to date, which included 1000 training set cases, 141 validation set cases, and 283 hidden test set cases. Each case included T2, T2/FLAIR, T1, and T1Gd brain MRI sequences with associated tumor compartment labels delineating enhancing tumor, non-enhancing tumor, and surrounding non-enhancing T2/FLAIR hyperintensity. Participant automated segmentation models were evaluated and ranked based on a scoring system evaluating lesion-wise metrics including Dice similarity coefficient (DSC) and 95% Hausdorff distance. The top-ranked team had a lesion-wise median DSC of 0.976, 0.976, and 0.964 for enhancing tumor, tumor core, and whole tumor, respectively, and a corresponding average DSC of 0.899, 0.904, and 0.871, respectively. These results serve as state-of-the-art benchmarks for future pre-operative meningioma automated segmentation algorithms. Additionally, we found that 1286 of 1424 cases (90.3%) had at least 1 compartment voxel abutting the edge of the skull-stripped image, which requires further investigation into optimal pre-processing face anonymization steps.
Submitted 15 May, 2024;
originally announced May 2024.
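Note on the metric named in the abstract above: the Dice similarity coefficient for two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below is a plain voxel-wise version in pure Python, offered only as an illustration. It is not the challenge's lesion-wise scoring, and the function name, the flat 0/1 mask representation, and the empty-mask convention are all assumptions made for this example.

```python
def dice_coefficient(pred, truth):
    """Voxel-wise Dice similarity coefficient for two flat binary masks.

    pred, truth: equal-length sequences of 0/1 (or False/True) values.
    This is an illustrative helper, not the BraTS lesion-wise metric.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Voxels labelled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the positive-voxel counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping voxel, 2 predicted and 1 true positive: 2*1 / (2+1).
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

A lesion-wise variant would first split each mask into connected components and score them separately; that step is omitted here.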
-
GaitMotion: A Multitask Dataset for Pathological Gait Forecasting
Authors:
Wenwen Zhang,
Hao Zhang,
Zenan Jiang,
Jing Wang,
Amir Servati,
Peyman Servati
Abstract:
Gait benchmarks support many promising research fields, such as gait recognition and humanoid locomotion. Despite the growing focus on gait analysis, the research community is hindered by the limitations of the currently available databases, which mostly consist of videos or images with limited labeling. In this paper, we introduce GaitMotion, a multitask dataset leveraging wearable sensors to capture the real-time movement of patients with pathological gait. This dataset offers extensive ground-truth labeling for multiple tasks, including step/stride segmentation and step/stride length prediction, and gives researchers a more holistic understanding of gait disturbances linked to neurological impairments. The wearable gait analysis suit captures the gait cycle, pattern, and parameters for both normal and pathological subjects. This data may prove beneficial for healthcare products focused on patient progress monitoring and post-disease recovery, as well as for forensics technologies aimed at person reidentification, and biomechanics research to aid in the development of humanoid robotics. Moreover, the analysis considers the drift in data distribution across individual subjects. This drift can be attributed to each participant's unique behavioral habits or potential displacement of the sensor. Stride length variances for normal, Parkinson's, and stroke patients are compared to recognize the pathological walking pattern. As the baseline and benchmark, we provide stride length prediction errors of 14.1, 13.3, and 12.2 centimeters for normal, Parkinson's, and stroke gaits, respectively. We also analyze the gait characteristics of normal and pathological gaits in terms of the gait cycle and gait parameters.
Submitted 9 May, 2024;
originally announced May 2024.
-
EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery
Authors:
Mirza Tanzim Sami,
Da Yan,
Saugat Adhikari,
Lyuheng Yuan,
Jiao Han,
Zhe Jiang,
Jalal Khalil,
Yang Zhou
Abstract:
Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which cannot segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectral features. Thanks to the digital elevation model (DEM) data readily available from sources such as the United States Geological Survey (USGS), this work explores the use of an elevation map to improve flood extent mapping. We propose EvaNet, an elevation-guided segmentation model based on the encoder-decoder architecture with two novel techniques: (1) a loss function encoding the physical law of gravity that if a location is flooded (resp. dry), then its adjacent locations with a lower (resp. higher) elevation must also be flooded (resp. dry); (2) a new (de)convolution operation that integrates the elevation map through a location-sensitive gating mechanism to regulate how strongly spectral features flow through adjacent layers. Extensive experiments show that EvaNet significantly outperforms the U-Net baselines, and works as a perfect drop-in replacement for U-Net in existing solutions to flood extent mapping.
Submitted 12 May, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
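Note on the gravity rule stated in the EvaNet abstract above: for two adjacent locations, the lower one should be at least as flooded as the higher one. One way to turn that into a differentiable-style penalty is a hinge on the difference of predicted flood probabilities. The sketch below is a minimal pure-Python illustration under that assumption. It is not the loss function from the paper, and the function name, the 4-neighbour grid layout, and the averaging are hypothetical choices made for this example.

```python
def elevation_violation_penalty(prob, elev):
    """Mean hinge penalty for neighbour pairs that violate the gravity rule.

    prob: 2D list of predicted flood probabilities in [0, 1].
    elev: 2D list of elevations with the same shape as prob.
    A pair is penalised when the higher cell is predicted more flooded
    than its lower neighbour. Illustrative only, not the EvaNet loss.
    """
    rows, cols = len(prob), len(prob[0])
    total, pairs = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            # Right and down neighbours only, so each pair is visited once.
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if r2 >= rows or c2 >= cols:
                    continue
                if elev[r][c] == elev[r2][c2]:
                    continue  # equal elevation: no ordering is assumed
                if elev[r][c] < elev[r2][c2]:
                    low, high = prob[r][c], prob[r2][c2]
                else:
                    low, high = prob[r2][c2], prob[r][c]
                # Violation size: how much more flooded the higher cell is.
                total += max(0.0, high - low)
                pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    elevation = [[5.0, 1.0]]
    print(elevation_violation_penalty([[1.0, 0.0]], elevation))  # high cell flooded, low cell dry
    print(elevation_violation_penalty([[0.0, 1.0]], elevation))  # consistent with gravity
```

The single hinge term covers both directions of the rule: a flooded high cell next to a dry low cell is penalised, whichever of the two statements (flooded implies lower flooded, dry implies higher dry) is used to describe it.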
-
The Brain Tumor Segmentation in Pediatrics (BraTS-PEDs) Challenge: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)
Authors:
Anahita Fathi Kazerooni,
Nastaran Khalili,
Xinyang Liu,
Deep Gandhi,
Zhifan Jiang,
Syed Muhammed Anwar,
Jake Albrecht,
Maruf Adewole,
Udunna Anazodo,
Hannah Anderson,
Ujjwal Baid,
Timothy Bergquist,
Austin J. Borja,
Evan Calabrese,
Verena Chung,
Gian-Marco Conte,
Farouk Dako,
James Eddy,
Ivan Ezhov,
Ariana Familiar,
Keyvan Farahani,
Andrea Franson,
Anurag Gottipati,
Shuvanjan Haldar,
Juan Eugenio Iglesias
, et al. (46 additional authors not shown)
Abstract:
Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. Here we present the CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs challenge, focused on pediatric brain tumors with data acquired across multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. The CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs challenge brings together clinicians and AI/imaging scientists to lead to faster development of automated segmentation techniques that could benefit clinical trials, and ultimately the care of children with brain tumors.
Submitted 11 July, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving
Authors:
Shuyao Shi,
Neiwen Ling,
Zhehao Jiang,
Xuan Huang,
Yuze He,
Xiaoguang Zhao,
Bufang Yang,
Chen Bian,
Jingfei Xia,
Zhenyu Yan,
Raymond Yeung,
Guoliang Xing
Abstract:
Recently, smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components carefully designed to overcome various system and physical challenges. Soar can leverage the existing operational infrastructure like street lampposts for a lower barrier to adoption. Soar adopts a new communication architecture that comprises a bi-directional multi-hop I2I network and a downlink I2V broadcast service, which are designed based on off-the-shelf 802.11ac interfaces in an integrated manner. Soar also features a hierarchical DL task management framework to achieve desirable load balancing among nodes and enable them to collaborate efficiently to run multiple data-intensive autonomous driving applications. We deployed a total of 18 Soar nodes on existing lampposts on campus, which have been operational for over two years. Our real-world evaluation shows that Soar can support a diverse set of autonomous driving applications and achieve desirable real-time performance and high communication reliability. Our findings and experiences in this work offer key insights into the development and deployment of next-generation smart roadside infrastructure and autonomous driving systems.
Submitted 21 April, 2024;
originally announced April 2024.
-
YNetr: Dual-Encoder architecture on Plain Scan Liver Tumors (PSLT)
Authors:
Wen Sheng,
Zhong Zheng,
Jiajun Liu,
Han Lu,
Hanyuan Zhang,
Zhengyong Jiang,
Zhihong Zhang,
Daoping Zhu
Abstract:
Background: Liver tumors are abnormal growths in the liver that can be either benign or malignant, with liver cancer being a significant health concern worldwide. However, there is no dataset for plain scan segmentation of liver tumors, nor any related algorithms. To fill this gap, we propose Plain Scan Liver Tumors (PSLT) and YNetr. Methods: A collection of 40 liver tumor plain scan segmentation datasets was assembled and annotated. Concurrently, we utilized the Dice coefficient as the metric for assessing the segmentation outcomes produced by YNetr, which has the advantage of capturing different frequency information. Results: The YNetr model achieved a Dice coefficient of 62.63% on the PSLT dataset, surpassing the other publicly available models by a margin of 1.22%. Comparative evaluations were conducted against a range of models including UNet 3+, XNet, UNetr, Swin UNetr, Trans-BTS, COTr, nnUNetv2 (2D), nnUNetv2 (3D fullres), MedNext (2D) and MedNext (3D fullres). Conclusions: We not only propose a dataset named PSLT (Plain Scan Liver Tumors), but also explore a structure called YNetr that utilizes the wavelet transform to extract different frequency information and achieves SOTA results on PSLT in our experiments.
Submitted 4 July, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Invisible Needle Detection in Ultrasound: Leveraging Mechanism-Induced Vibration
Authors:
Chenyang Li,
Dianye Huang,
Angelos Karlas,
Nassir Navab,
Zhongliang Jiang
Abstract:
In clinical applications that involve ultrasound-guided intervention, the visibility of the needle can be severely impeded due to steep insertion and strong distractors such as speckle noise and anatomical occlusion. To address this challenge, we propose VibNet, a learning-based framework tailored to enhance the robustness and accuracy of needle detection in ultrasound images, even when the target becomes invisible to the naked eye. Inspired by Eulerian Video Magnification techniques, we utilize an external step motor to induce low-amplitude periodic motion on the needle. These subtle vibrations offer the potential to generate robust frequency features for detecting the motion patterns around the needle. To robustly and precisely detect the needle leveraging these vibrations, VibNet integrates learning-based Short-Time-Fourier-Transform and Hough-Transform modules to achieve successive sub-goals, including motion feature extraction in the spatiotemporal space, frequency feature aggregation, and needle detection in the Hough space. Based on the results obtained on distinct ex vivo porcine and bovine tissue samples, the proposed algorithm exhibits superior detection performance with efficient computation and generalization capability.
Submitted 21 March, 2024;
originally announced March 2024.
-
The Effect of Different Optimization Strategies to Physics-Constrained Deep Learning for Soil Moisture Estimation
Authors:
Jianxin Xie,
Bing Yao,
Zheyu Jiang
Abstract:
Soil moisture is a key hydrological parameter that has significant importance to human society and the environment. Accurate modeling and monitoring of soil moisture in crop fields, especially in the root zone (top 100 cm of soil), is essential for improving agricultural production and crop yield with the help of precision irrigation and farming tools. Realizing the full sensor data potential depends greatly on advanced analytical and predictive domain-aware models. In this work, we propose a physics-constrained deep learning (P-DL) framework to integrate physics-based principles on water transport and water sensing signals for effective reconstruction of the soil moisture dynamics. We adopt three different optimizers, namely Adam, RMSprop, and GD, to minimize the loss function of P-DL during the training process. In the illustrative case study, we demonstrate that the empirical convergence of the Adam optimizer outperforms that of the other optimization methods in both mini-batch and full-batch training.
Submitted 12 March, 2024;
originally announced March 2024.
-
Physics-constrained Active Learning for Soil Moisture Estimation and Optimal Sensor Placement
Authors:
Jianxin Xie,
Bing Yao,
Zheyu Jiang
Abstract:
Soil moisture is a crucial hydrological state variable that has significant importance to the global environment and agriculture. Precise monitoring of soil moisture in crop fields is critical to reducing agricultural drought and improving crop yield. In-situ soil moisture sensors, which are buried at pre-determined depths and distributed across the field, are promising solutions for monitoring soil moisture. However, high-density sensor deployment is neither economically feasible nor practical. Thus, to achieve a higher spatial resolution of soil moisture dynamics using a limited number of sensors, we integrate a physics-based agro-hydrological model based on Richards' equation in a physics-constrained deep learning framework to accurately predict soil moisture dynamics in the soil's root zone. This approach ensures that soil moisture estimates align well with sensor observations while obeying physical laws at the same time. Furthermore, to strategically identify the locations for sensor placement, we introduce a novel active learning framework that combines space-filling design and physics residual-based sampling to maximize data acquisition potential with limited sensors. Our numerical results demonstrate that integrating Physics-constrained Deep Learning (P-DL) with an active learning strategy within a unified framework--named the Physics-constrained Active Learning (P-DAL) framework--significantly improves the predictive accuracy and effectiveness of field-scale soil moisture monitoring using in-situ sensors.
Submitted 11 March, 2024;
originally announced March 2024.
-
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Authors:
Chunhui Wang,
Chang Zeng,
Bowen Zhang,
Ziyang Ma,
Yefan Zhu,
Zifeng Cai,
Jian Zhao,
Zhonglin Jiang,
Yong Chen
Abstract:
Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments (Demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.
Submitted 9 March, 2024;
originally announced March 2024.
-
A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing
Authors:
Limin Jiang,
Yi Shi,
Haiqin Hu,
Qingyu Deng,
Siyi Xu,
Yintao Liu,
Feng Yuan,
Si Wang,
Yihao Shen,
Fangfang Ye,
Shan Cao,
Zhiyuan Jiang
Abstract:
Wireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and more recently, graphic processing units (GPUs), provide various degrees of parallelism, yet they both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBPs cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A pack-and-ship approach is presented under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access and execute manner. We also propose a multi-level dataflow model and the related scheduling scheme to manage and allocate the heterogeneous hardware resources. Experiment results demonstrate that our prototype achieves $2\times$ and $2.3\times$ speedup in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts in several critical WBP benchmarks. Additionally, a link-level throughput of $288$ Mbps can be achieved with a $45$-core configuration.
Submitted 28 February, 2024;
originally announced February 2024.
-
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
Authors:
Shengpeng Ji,
Minghui Fang,
Ziyue Jiang,
Siqi Zheng,
Qian Chen,
Rongjie Huang,
Jialung Zuo,
Shulei Wang,
Zhou Zhao
Abstract:
In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codec, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) most codec models are trained on only 1,000 hours of data, whereas most speech language models are trained on 60,000 hours; 2) achieving good reconstruction performance requires the utilization of numerous codebooks, which increases the burden on downstream speech language models; 3) the initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In Language-Codec, we introduce a Mask Channel Residual Vector Quantization (MCRVQ) mechanism along with improved Fourier transform structures and larger training datasets to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe significant outperformance across extensive evaluations. Furthermore, we validate the efficiency of Language-Codec on downstream speech language models. The source code and pre-trained models can be accessed at https://github.com/jishengpeng/languagecodec .
Submitted 27 April, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
Authors:
Shengpeng Ji,
Ziyue Jiang,
Hanting Wang,
Jialong Zuo,
Zhou Zhao
Abstract:
Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, the first fast, lightweight, and robust zero-shot text-to-speech system designed for mobile devices. Specifically: 1) leveraging a discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text and prompt speech through cross-attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generation speed and speech quality. MobileSpeech achieves an RTF of 0.09 on a single A100 GPU, and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/ .
Submitted 2 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
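Note on the RTF figure quoted in the MobileSpeech abstract above: real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below only illustrates that arithmetic. The function name and the example timings are assumptions chosen for this note, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Real-time factor: processing time divided by generated audio duration.

    Both arguments are in seconds. Values below 1.0 mean the system
    produces audio faster than it takes to play it back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if synthesis_seconds < 0:
        raise ValueError("synthesis_seconds must be non-negative")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical timings: 10 s of speech generated in 0.9 s of compute
    # corresponds to an RTF of about 0.09.
    print(round(real_time_factor(0.9, 10.0), 2))
```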
-
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Authors:
Qian Yang,
Jin Xu,
Wenrui Liu,
Yunfei Chu,
Ziyue Jiang,
Xiaohuan Zhou,
Yichong Leng,
Yuanjun Lv,
Zhou Zhao,
Chang Zhou,
Jingren Zhou
Abstract:
Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded advancements in this field. Previous models primarily focus on assessing different fundamental tasks, such as Automatic Speech Recognition (ASR), and lack an assessment of the open-ended generative capabilities centered around audio. Thus, it is challenging to track the progression in the Large Audio-Language Models (LALMs) domain and to provide guidance for future improvement. In this paper, we introduce AIR-Bench (Audio InstRuction Benchmark), the first benchmark designed to evaluate the ability of LALMs to understand various types of audio signals (including human speech, natural sounds, and music), and furthermore, to interact with humans in the textual format. AIR-Bench encompasses two dimensions: foundation and chat benchmarks. The former consists of 19 tasks with approximately 19k single-choice questions, intending to inspect the basic single-task ability of LALMs. The latter contains 2k instances of open-ended question-and-answer data, directly assessing the comprehension of the model on complex audio and its capacity to follow instructions. Both benchmarks require the model to generate hypotheses directly. We design a unified framework that leverages advanced language models, such as GPT-4, to evaluate the scores of generated hypotheses given the meta-information of the audio. Experimental results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation. By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can provide insights into the direction of future research.
Submitted 26 July, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Quantitative Metrics for Benchmarking Medical Image Harmonization
Authors:
Abhijeet Parida,
Zhifan Jiang,
Roger J. Packer,
Robert A. Avery,
Syed M. Anwar,
Marius G. Linguraru
Abstract:
Image harmonization is an important preprocessing strategy to address domain shifts arising from data acquired using different machines and scanning protocols in medical imaging. However, benchmarking the effectiveness of harmonization techniques has been a challenge due to the lack of widely available standardized datasets with ground truths. In this context, we propose three metrics: two intensity harmonization metrics and one anatomy preservation metric for medical images during harmonization, where no ground truths are required. Through extensive studies on a dataset with available harmonization ground truth, we demonstrate that our metrics are correlated with established image quality assessment metrics. We show how these novel metrics may be applied to real-world scenarios where no harmonization ground truth exists. Additionally, we provide insights into different interpretations of the metric values, shedding light on their significance in the context of the harmonization process. As a result of our findings, we advocate for the adoption of these quantitative harmonization metrics as a standard for benchmarking the performance of image harmonization techniques.
Submitted 6 February, 2024;
originally announced February 2024.
-
Annotating sleep states in children from wrist-worn accelerometer data using Machine Learning
Authors:
Ashwin Ram,
Sundar Sripada V. S.,
Shuvam Keshari,
Zizhe Jiang
Abstract:
Sleep detection and annotation are crucial for researchers to understand sleep patterns, especially in children. With modern wrist-worn watches comprising built-in accelerometers, sleep logs can be collected. However, the annotation of these logs into distinct sleep events: onset and wakeup, proves to be challenging. These annotations must be automated, precise, and scalable. We propose to model the accelerometer data using different machine learning (ML) techniques such as support vectors, boosting, ensemble methods, and more complex approaches involving LSTMs and Region-based CNNs. Later, we aim to evaluate these approaches using the Event Detection Average Precision (EDAP) score (similar to the IOU metric) to eventually compare the predictive power and model performance.
Submitted 9 December, 2023;
originally announced December 2023.
-
Lightweight Speaker Verification Using Transformation Module with Feature Partition and Fusion
Authors:
Yanxiong Li,
Zhongjie Jiang,
Qisheng Huang,
Wenchang Cao,
Jialong Li
Abstract:
Although many efforts have been made to decrease model complexity for speaker verification, it is still challenging to deploy speaker verification systems with satisfactory results on low-resource terminals. We design a transformation module that performs feature partition and fusion to implement lightweight speaker verification. The transformation module consists of multiple simple but effective operations, such as convolution, pooling, mean, concatenation, normalization, and element-wise summation. It works in a plug-and-play way, and can be easily implanted into a wide variety of models to reduce the model complexity while maintaining the model error. First, the input feature is split into several low-dimensional feature subsets to decrease the model complexity. Then, each feature subset is updated by fusing it with the inter-feature-subset correlational information to enhance its representational capability. Finally, the updated feature subsets are independently fed into the block (one or several layers) of the model for further processing. The features that are output from the current block of the model are processed according to the steps above before they are fed into the next block of the model. Experimental data are selected from two public speech corpora (namely VoxCeleb1 and VoxCeleb2). Results show that implanting the transformation module into three models (namely AMCRN, ResNet34, and ECAPA-TDNN) for speaker verification slightly increases the model error and significantly decreases the model complexity. On the whole, our proposed method outperforms baseline methods in memory requirement and computational complexity, with a lower equal error rate. It also generalizes well across truncated segments of various lengths.
Submitted 6 December, 2023;
originally announced December 2023.
-
Dynamic Operating Envelopes Embedded Peer-to-Peer-to-Grid Energy Trading
Authors:
Zhisen Jiang,
Ye Guo,
Hongbin Sun,
Jianxiao Wang
Abstract:
A novel decentralized peer-to-peer-to-grid (P2P2G) trading mechanism considering distribution network integrity is proposed. In order to direct prosumers' peer-to-peer (P2P) trading behavior to be grid-friendly, the proposed method incorporates Dynamic Operating Envelopes (DOEs) into the existing P2P2G trading. Moreover, DOEs are determined through negotiations between the distribution system operator (DSO) and prosumers alongside the process of P2P trading, avoiding compromising prosumers' privacy and network parameters leakage. To reduce communication costs during P2P trading, a variant of the alternating direction method of multipliers (ADMM), i.e., communication-censored ADMM (COCA) is used to solve the P2P2G trading problem. Finally, the DOE price is shown to be comprised of several economically interpretable components. Simulations validate the effectiveness of the proposed mechanism.
Submitted 23 November, 2023;
originally announced November 2023.
-
Software-Defined Virtual Synchronous Condenser
Authors:
Zimin Jiang,
Peng Zhang,
Yifan Zhou,
Łukasz Kocewiak,
Divya Kurthakoti Chandrashekhara,
Marie-Lou Picherit,
Zefan Tang,
Kenneth B. Bowes,
Guangya Yang
Abstract:
Synchronous condensers (SCs) play important roles in integrating wind energy into relatively weak power grids. However, the design of SCs usually depends on specific application requirements and may not be adaptive enough to the frequently changing grid conditions caused by the transition from conventional to renewable power generation. This paper devises a software-defined virtual synchronous condenser (SDViSC) method to address these challenges. Our contributions are fourfold: 1) the design of a virtual synchronous condenser (ViSC) to enable full converter wind turbines to provide built-in SC functionalities; 2) engineering SDViSCs to transfer hardware-based ViSC controllers into software services, where a Tustin transformation-based software-defined control algorithm guarantees accurate tracking of fast dynamics under limited communication bandwidth; 3) a software-defined networking-enhanced SDViSC communication scheme to allow enhanced communication reliability and reduced communication bandwidth occupation; and 4) a prototype of SDViSC on our real-time, cyber-in-the-loop digital twin of a large wind farm in an RTDS environment. Extensive test results validate the excellent performance of SDViSC in supporting reliable and resilient operations of wind farms under various physical and cyber conditions.
Submitted 17 November, 2023; v1 submitted 15 November, 2023;
originally announced November 2023.
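
The Tustin (bilinear) transformation named in the abstract above maps a continuous-time transfer function to discrete time by substituting s = (2/T)(z - 1)/(z + 1). The abstract does not specify the ViSC controller itself, so the following is only a minimal sketch on an assumed first-order low-pass filter H(s) = 1/(tau*s + 1); the names `tustin_lowpass`, `tau`, and `dt` are hypothetical and chosen for illustration.

```python
def tustin_lowpass(inputs, tau, dt):
    """Filter `inputs` with H(s) = 1/(tau*s + 1), discretized by the Tustin rule.

    Substituting s = (2/dt) * (z - 1) / (z + 1) yields the difference equation
        y[n] = ((k - 1) * y[n-1] + u[n] + u[n-1]) / (k + 1),  with k = 2*tau/dt.
    This is an assumed example filter, not the controller from the paper.
    """
    if tau <= 0 or dt <= 0:
        raise ValueError("tau and dt must be positive")
    k = 2.0 * tau / dt
    outputs = []
    prev_u = 0.0  # u[n-1], zero initial condition
    prev_y = 0.0  # y[n-1], zero initial condition
    for u in inputs:
        y = ((k - 1.0) * prev_y + u + prev_u) / (k + 1.0)
        outputs.append(y)
        prev_u, prev_y = u, y
    return outputs


# Unit-step response sampled every 10 ms with a 100 ms time constant.
step_response = tustin_lowpass([1.0] * 200, tau=0.1, dt=0.01)
```

Because the bilinear map sends s = 0 to z = 1, the discretized filter keeps the unit DC gain of the continuous one, so the step response settles at 1.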
-
Sensing Mutual Information with Random Signals in Gaussian Channels
Authors:
Lei Xie,
Fan Liu,
Zhanyuan Xie,
Zheng Jiang,
Shenghui Song
Abstract:
Sensing performance is typically evaluated by classical metrics, such as Cramer-Rao bound and signal-to-clutter-plus-noise ratio. The recent development of the integrated sensing and communication (ISAC) framework motivated the efforts to unify the metric for sensing and communication, where researchers have proposed to utilize mutual information (MI) to measure the sensing performance with deterministic signals. However, the need to communicate in ISAC systems necessitates the use of random signals for sensing applications and the closed-form evaluation for the sensing mutual information (SMI) with random signals is not yet available in the literature. This paper investigates the achievable performance and precoder design for sensing applications with random signals. For that purpose, we first derive the closed-form expression for the SMI with random signals by utilizing random matrix theory. The result reveals some interesting physical insights regarding the relation between the SMI with deterministic and random signals. The derived SMI is then utilized to optimize the precoder by leveraging a manifold-based optimization approach. The effectiveness of the proposed methods is validated by simulation results.
Submitted 13 November, 2023;
originally announced November 2023.
-
Reconfigurable Intelligent Surface & Edge -- An Introduction of an EM manipulation structure on obstacles' edge
Authors:
Tianqi Xiang,
Zhiwei Jiang,
Weijun Hong,
Xin Zhang,
Yuehong Gao
Abstract:
Reconfigurable Intelligent Surface (RIS) or metasurface is one of the important enabling technologies in mobile cellular networks that can effectively enhance the signal coverage performance in obstructed regions, and it is generally deployed on surfaces different from obstacles to redirect electromagnetic (EM) waves by reflection, or covered on objects' surfaces to manipulate EM waves by refraction. In this paper, Reconfigurable Intelligent Surface & Edge (RISE) is proposed to extend RIS' abilities of reflection and refraction over surfaces to diffraction around obstacles' edge for better adaptation to specific coverage scenarios. Based on that, this paper analyzes the performance of several different deployment locations and EM manipulation structure designs for different coverage scenarios. Then a novel EM manipulation structure deployed at the obstacles' edge is proposed to achieve static EM environment modification. Simulations validate the preference of the schemes for different scenarios and the new structure achieves better coverage performance than other typical structures in the static scheme.
Submitted 3 November, 2023;
originally announced November 2023.
-
Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer
Authors:
Junwei You,
Ying Chen,
Zhuoyu Jiang,
Zhangchi Liu,
Zilin Huang,
Yifeng Ding,
Bin Ran
Abstract:
Effective classification of autonomous vehicle (AV) driving behavior emerges as a critical area for diagnosing AV operation faults, enhancing autonomous driving algorithms, and reducing accident rates. This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze AV driving behavior. The proposed GAF-ViT model consists of three key components: GAF Transformer Module, Channel Attention Module, and Multi-Channel ViT Module. These modules collectively convert representative sequences of multivariate behavior into multi-channel images and employ image recognition techniques for behavior classification. A channel attention mechanism is applied to multi-channel images to discern the impact of various driving behavior features. Experimental evaluation on the Waymo Open Dataset of trajectories demonstrates that the proposed model achieves state-of-the-art performance. Furthermore, an ablation study effectively substantiates the efficacy of individual modules within the model.
Submitted 21 October, 2023;
originally announced October 2023.
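
The abstract above describes converting multivariate behavior sequences into multi-channel images. A Gramian Angular Field does this by rescaling each series to [-1, 1], taking the angle phi = arccos(x), and forming the matrix cos(phi_i + phi_j). The following is a minimal sketch of the standard summation variant, not the authors' GAF Transformer Module; the function names and the (channels, time) input layout are assumptions.

```python
import numpy as np


def gramian_angular_summation_field(series):
    """Return the GASF image of a 1-D series as a (T, T) array."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    scaled = np.clip(scaled, -1.0, 1.0)  # guard against rounding overshoot
    phi = np.arccos(scaled)
    # Entry (i, j) is cos(phi_i + phi_j).
    return np.cos(phi[:, None] + phi[None, :])


def to_multichannel_image(features):
    """Stack one GASF per feature; `features` has shape (channels, time)."""
    return np.stack([gramian_angular_summation_field(f) for f in features])


# Two hypothetical behavior features (e.g. speed and acceleration), 4 time steps.
image = to_multichannel_image([[0.0, 1.0, 2.0, 3.0], [5.0, 3.0, 4.0, 1.0]])
```

Each feature becomes one image channel, so a series of length T with C features yields a C x T x T array that an image model can consume.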
-
Longitudinal gOSNR Monitoring by Receiver-side Digital Signal Processing in Multi-Span Optical Transmission System
Authors:
Choloong Hahn,
Junho Chang,
Zhiping Jiang
Abstract:
We propose the world's first longitudinal gOSNR estimation using a correlation template method at the receiver (Rx), without any monitoring devices located in the middle of the link. The proposed method is experimentally demonstrated in a 12-span link with a commercial transceiver.
Submitted 10 October, 2023;
originally announced October 2023.
-
Secondary frequency control of islanded microgrid considering wind and solar stochastics
Authors:
Cheng Zhong,
Zhifu Jiang,
Xiangyu Zhang,
Jikai Chen,
Yang Li
Abstract:
With the high penetration of wind and photovoltaic distributed generation (DG) in the microgrid, stochasticity and low inertia emerge, bringing more challenges, especially when the microgrid operates as an isolated island. Nevertheless, the reserve power of DGs in deloading control mode can be utilized for frequency regulation and for mitigating frequency excursions. This paper proposes a model predictive control (MPC) secondary frequency control method that considers the stochastic nature of wind and solar power generation. An extended state-space matrix including the unknown stochastic power disturbance is established, and a Kalman filter is used to observe the unknown disturbance. The maximum available power of wind and solar DGs is estimated to establish real-time variable constraints that prevent DG output power from exceeding its limits. By setting proper weight coefficients, wind and photovoltaic DGs are given priority to participate in secondary frequency control. The distributed restorative power of each DG is obtained by solving a quadratic programming (QP) optimization problem with variable constraints. Finally, a microgrid simulation model including multiple PV and wind DGs is built, run in various scenarios, and compared with the traditional secondary frequency control method. The simulation results validate that the proposed method can enhance the frequency recovery speed and reduce the frequency deviation, especially in scenarios with severe photovoltaic and wind fluctuations.
Submitted 7 October, 2023;
originally announced October 2023.
-
Small-Disturbance Input-to-State Stability of Perturbed Gradient Flows: Applications to LQR Problem
Authors:
Leilei Cui,
Zhong-Ping Jiang,
Eduardo D. Sontag
Abstract:
This paper studies the effect of perturbations on the gradient flow of a general nonlinear programming problem, where the perturbation may arise from inaccurate gradient estimation in the setting of data-driven optimization. Under suitable conditions on the objective function, the perturbed gradient flow is shown to be small-disturbance input-to-state stable (ISS), which implies that, in the presence of a small-enough perturbation, the trajectories of the perturbed gradient flow must eventually enter a small neighborhood of the optimum. This work was motivated by the question of robustness of direct methods for the linear quadratic regulator problem, and specifically the analysis of the effect of perturbations caused by gradient estimation or round-off errors in policy optimization. We show small-disturbance ISS for three of the most common optimization algorithms: standard gradient flow, natural gradient flow, and Newton gradient flow.
Submitted 16 April, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Authors:
Rui Liu,
Jiatian Xi,
Ziyue Jiang,
Haizhou Li
Abstract:
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE techniques, the current techniques have focused on reducing the difference between the generated speech segment and the reference target in the editing region, ignoring its local and global fluency in the context and original utterance. To maintain the speech fluency, we propose a fluency speech editing model, termed \textit{FluentEditor}, by considering fluency-aware training criterion in the TSE training. Specifically, the \textit{acoustic consistency constraint} aims to smooth the transition between the edited region and its neighboring acoustic segments consistent with the ground truth, while the \textit{prosody consistency constraint} seeks to ensure that the prosody attributes within the edited regions remain consistent with the overall style of the original utterance. The subjective and objective experimental results on VCTK demonstrate that our \textit{FluentEditor} outperforms all advanced baselines in terms of naturalness and fluency. The audio samples and code are available at \url{https://github.com/Ai-S2-Lab/FluentEditor}.
Submitted 21 September, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
GAN-based Algorithm for Efficient Image Inpainting
Authors:
Zhengyang Han,
Zehao Jiang,
Yuan Ju
Abstract:
The global pandemic caused by the spread of COVID-19 has posed challenges in a new dimension for facial recognition, as people have started to wear masks. Under such conditions, the authors consider utilizing machine learning in image inpainting to tackle the problem by completing the face that is covered by a mask. In particular, the autoencoder has great potential for retaining important, general features of the image, alongside the generative power of the generative adversarial network (GAN). The authors implement a combination of the two models, context encoders, explain how it combines the power of the two models, and train the model with 50,000 images of influencers' faces, yielding a solid result that still leaves room for improvement. Furthermore, the authors discuss some shortcomings of the model and possible improvements, as well as areas of study for future investigation from an applied perspective and directions to further enhance and refine the model.
Submitted 13 September, 2023;
originally announced September 2023.
-
FSD: An Initial Chinese Dataset for Fake Song Detection
Authors:
Yuankun Xie,
Jingjing Zhou,
Xiaolin Lu,
Zhenghao Jiang,
Yuxin Yang,
Haonan Cheng,
Long Ye
Abstract:
Singing voice synthesis and singing voice conversion have significantly advanced, revolutionizing musical experiences. However, the rise of "Deepfake Songs" generated by these technologies raises concerns about authenticity. Unlike Audio DeepFake Detection (ADD), the field of song deepfake detection lacks specialized datasets or methods for song authenticity verification. In this paper, we initially construct a Chinese Fake Song Detection (FSD) dataset to investigate the field of song deepfake detection. The fake songs in the FSD dataset are generated by five state-of-the-art singing voice synthesis and singing voice conversion methods. Our initial experiments on FSD revealed the ineffectiveness of existing speech-trained ADD models for the task of song deepfake detection. Thus, we employ the FSD dataset for the training of ADD models. We subsequently evaluate these models under two scenarios: one with the original songs and another with separated vocal tracks. Experimental results show that song-trained ADD models exhibit a 38.58% reduction in average equal error rate compared to speech-trained ADD models on the FSD test set.
Submitted 6 September, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
Authors:
Shengpeng Ji,
Jialong Zuo,
Minghui Fang,
Ziyue Jiang,
Feiyang Chen,
Xinyu Duan,
Baoxing Huai,
Zhou Zhao
Abstract:
Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speeches that meet certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises due to the scarcity of high-quality speech datasets with natural text style prompt and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompt in natural text descriptions with five style factors and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes. 2) Furthermore, to address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle. This architecture treats text controllable TTS as a language model task, utilizing audio codec codes as an intermediate representation to replace the conventional mel-spectrogram. Finally, we successfully demonstrate the ability of the proposed model by showing a comparable performance in the controllable TTS task. Audio samples are available at https://sall-e.github.io/
Submitted 28 August, 2023;
originally announced August 2023.
-
Harmonization Across Imaging Locations (HAIL): One-Shot Learning for Brain MRI
Authors:
Abhijeet Parida,
Zhifan Jiang,
Syed Muhammad Anwar,
Nicholas Foreman,
Nicholas Stence,
Michael J. Fisher,
Roger J. Packer,
Robert A. Avery,
Marius George Linguraru
Abstract:
For machine learning-based prognosis and diagnosis of rare diseases, such as pediatric brain tumors, it is necessary to gather medical imaging data from multiple clinical sites that may use different devices and protocols. Deep learning-driven harmonization of radiologic images relies on generative adversarial networks (GANs). However, GANs notoriously generate pseudo structures that do not exist in the original training data, a phenomenon known as "hallucination". To prevent hallucination in medical imaging, such as magnetic resonance images (MRI) of the brain, we propose a one-shot learning method where we utilize neural style transfer for harmonization. At test time, the method uses one image from a clinical site to generate an image that matches the intensity scale of the collaborating sites. Our approach combines learning a feature extractor, neural style transfer, and adaptive instance normalization. We further propose a novel strategy to evaluate the effectiveness of image harmonization approaches with evaluation metrics that both measure image style harmonization and assess the preservation of anatomical structures. Experimental results demonstrate the effectiveness of our method in preserving patient anatomy while adjusting the image intensities to a new clinical site. Our general harmonization model can be used on unseen data from new sites, making it a valuable tool for real-world medical applications and clinical trials.
Submitted 21 August, 2023;
originally announced August 2023.
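
The abstract above names adaptive instance normalization (AdaIN) as one ingredient of the harmonization method. In its standard form, AdaIN shifts and scales each channel of a content feature map so that its mean and standard deviation match those of a style feature map. The following is a minimal NumPy sketch of that standard formulation, not the authors' implementation; the function name, the (channels, height, width) layout, and the `eps` value are assumptions.

```python
import numpy as np


def adaptive_instance_norm(content, style, eps=1e-5):
    """Match per-channel mean/std of `content` to those of `style`.

    Both inputs are feature maps shaped (channels, height, width).
    """
    content = np.asarray(content, dtype=float)
    style = np.asarray(style, dtype=float)
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content statistics away, then apply the style statistics.
    normalized = (content - c_mean) / (c_std + eps)
    return s_std * normalized + s_mean


# Hypothetical feature maps standing in for two scanner sites.
rng = np.random.default_rng(0)
content_features = rng.normal(0.0, 1.0, size=(3, 8, 8))
style_features = rng.normal(5.0, 2.0, size=(3, 8, 8))
harmonized = adaptive_instance_norm(content_features, style_features)
```

Only channel-wise statistics change, while the spatial arrangement of the content features is left intact, which is consistent with the abstract's goal of adjusting intensities while preserving anatomy.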
-
DefCor-Net: Physics-Aware Ultrasound Deformation Correction
Authors:
Zhongliang Jiang,
Yue Zhou,
Dongliang Cao,
Nassir Navab
Abstract:
The recovery of morphologically accurate anatomical images from deformed ones is challenging in ultrasound (US) image acquisition, but crucial to accurate and consistent diagnosis, particularly in the emerging field of computer-assisted diagnosis. This article presents a novel anatomy-aware deformation correction approach based on a coarse-to-fine, multi-scale deep neural network (DefCor-Net). To achieve pixel-wise performance, DefCor-Net incorporates biomedical knowledge by estimating pixel-wise stiffness online using a U-shaped feature extractor. The deformation field is then computed using polynomial regression by integrating the measured force applied by the US probe. Based on real-time estimation of pixel-by-pixel tissue properties, the learning-based approach enables the potential for anatomy-aware deformation correction. To demonstrate the effectiveness of the proposed DefCor-Net, images recorded at multiple locations on forearms and upper arms of six volunteers are used to train and validate DefCor-Net. The results demonstrate that DefCor-Net can significantly improve the accuracy of deformation correction to recover the original geometry (Dice Coefficient: from $14.3\pm20.9$ to $82.6\pm12.1$ when the force is $6N$).
Submitted 7 August, 2023;
originally announced August 2023.
-
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Authors:
Ziyue Jiang,
Jinglin Liu,
Yi Ren,
Jinzheng He,
Zhenhui Ye,
Shengpeng Ji,
Qian Yang,
Chen Zhang,
Pengfei Wei,
Chunfeng Wang,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) Previous works on zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, so that neither can be transferred independently of the other. This paper introduces Mega-TTS 2, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS 2 can not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but also consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables the transfer of various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found at https://boostprompt.github.io/boostprompt/.
Submitted 10 April, 2024; v1 submitted 14 July, 2023;
originally announced July 2023.
-
Thoracic Cartilage Ultrasound-CT Registration using Dense Skeleton Graph
Authors:
Zhongliang Jiang,
Chenyang Li,
Xuesong Li,
Nassir Navab
Abstract:
Autonomous ultrasound (US) imaging has gained increased interest recently, and it has been seen as a potential solution to overcome the limitations of free-hand US examinations, such as inter-operator variations. However, it is still challenging to accurately map planned paths from a generic atlas to individual patients, particularly for thoracic applications with high acoustic-impedance bone structures under the skin. To address this challenge, a graph-based non-rigid registration is proposed to enable transferring planned paths from the atlas to the current setup by explicitly considering subcutaneous bone surface features instead of the skin surface. To this end, the sternum and cartilage branches are segmented using template matching to assist coarse alignment of US and CT point clouds. Afterward, a directed graph is generated based on the CT template. Then, the self-organizing map using geographical distance is successively performed twice to extract the optimal graph representations for CT and US point clouds, individually. To evaluate the proposed approach, five cartilage point clouds from distinct patients are employed. The results demonstrate that the proposed graph-based registration can effectively map trajectories from CT to the current setup for displaying US views through limited intercostal space. The non-rigid registration error in terms of Hausdorff distance (Mean$\pm$SD) is 9.48$\pm$0.27 mm, and the path transferring error in terms of Euclidean distance is 2.21$\pm$1.11 mm.
Submitted 7 July, 2023;
originally announced July 2023.
-
Motion Magnification in Robotic Sonography: Enabling Pulsation-Aware Artery Segmentation
Authors:
Dianye Huang,
Yuan Bi,
Nassir Navab,
Zhongliang Jiang
Abstract:
Ultrasound (US) imaging is widely used for diagnosing and monitoring arterial diseases, mainly due to the advantages of being non-invasive, radiation-free, and real-time. In order to provide additional information to assist clinicians in diagnosis, the tubular structures are often segmented from US images. To improve the artery segmentation accuracy and stability during scans, this work presents a novel pulsation-assisted segmentation neural network (PAS-NN) by explicitly taking advantage of the cardiac-induced motions. Motion magnification techniques are employed to amplify the subtle motion within the frequency band of interest to extract the pulsation signals from sequential US images. The extracted real-time pulsation information can help to locate the arteries on cross-section US images; therefore, we explicitly integrated the pulsation into the proposed PAS-NN as attention guidance. Notably, a robotic arm is necessary to provide stable movement during US imaging, since magnifying the target motions from US images captured along a scan path is not feasible manually due to hand tremor. To validate the proposed robotic US system for imaging arteries, experiments are carried out on volunteers' carotid and radial arteries. The results demonstrated that the PAS-NN could achieve results comparable to the state of the art on the carotid artery and can effectively improve the segmentation performance for small vessels (radial artery).
Submitted 7 July, 2023;
originally announced July 2023.
-
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Authors:
Ziyue Jiang,
Yi Ren,
Zhenhui Ye,
Jinglin Liu,
Chen Zhang,
Qian Yang,
Shengpeng Ji,
Rongjie Huang,
Chunfeng Wang,
Xiang Yin,
Zejun Ma,
Zhou Zhao
Abstract:
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latents using an audio codec and use autoregressive language models or diffusion models to generate them, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and each of them should be modeled using a module with appropriate inductive biases. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained with large-scale wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we still choose the spectrogram, as it separates the phase and other attributes very well. Phase can be appropriately constructed by the GAN-based vocoder and does not need to be modeled by the language model. 2) We model the timbre using global vectors since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
Submitted 6 June, 2023;
originally announced June 2023.