Search | arXiv e-print repository

Utilizing Speaker Profiles for Impersonation Audio Detection

Authors: Hao Gu, JiangYan Yi, Chenglong Wang, Yong Ren, Jianhua Tao, Xinrui Yan, Yujie Chen, Xiaohui Zhang

Abstract: Fake audio detection is an emerging active topic. A growing body of literature aims to detect fake utterances, which are mostly generated by text-to-speech (TTS) or voice conversion (VC). However, countermeasures against impersonation remain an underexplored area. Impersonation is a type of fake audio in which an imitator replicates specific traits and the speech style of a target speaker. Unlike TTS and VC, which often leave digital traces or signal artifacts, impersonation involves live human beings producing entirely natural speech, rendering the detection of impersonation audio a challenging task. Thus, we propose a novel method that integrates speaker profiles into the process of impersonation audio detection. Speaker profiles are inherent characteristics that are challenging for impersonators to mimic accurately, such as the speaker's age and job. We aim to leverage these features to extract discriminative information for detecting impersonation audio. Moreover, no large impersonated speech corpus is available for quantitative study of the impact of impersonation. To address this gap, we further design the first large-scale, diverse-speaker Chinese impersonation dataset, named ImPersonation Audio Detection (IPAD), to advance the community's research on impersonation audio detection. We evaluate several existing fake audio detection methods on our proposed dataset IPAD, demonstrating its necessity and the challenges it poses. Additionally, our findings reveal that incorporating speaker profiles can significantly enhance the model's performance in detecting impersonation audio.

Submitted 30 August, 2024; originally announced August 2024.

Comments: Accepted by ACM MM 2024

arXiv:2408.16809 [pdf, other]

See or Guess: Counterfactually Regularized Image Captioning

Authors: Qian Cao, Xu Chen, Ruihua Song, Xiting Wang, Xinting Huang, Yuchen Ren

Abstract: Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.

Submitted 29 August, 2024; originally announced August 2024.

Comments: Accepted by ACM MM 2024

arXiv:2408.16300 [pdf, other]

A Distance Similarity-based Genetic Optimization Algorithm for Satellite Ground Network Planning Considering Feeding Mode

Authors: Yingying Ren, Qiuli Li, Yangyang Guo, Witold Pedrycz, Lining Xing, Anfeng Liu, Yanjie Song

Abstract: With the rapid development of the satellite industry, the information transmission network based on communication satellites has gradually become a major part of the future satellite ground integration network. However, the low transmission efficiency of the satellite data relay back mission is currently constraining the construction of the system and needs to be solved urgently. Effectively planning the task of satellite ground networking by reasonably scheduling resources is crucial for the efficient transmission of task data. In this paper, we aim to provide a task execution scheme that maximizes the profit of the networking task for satellite ground network planning considering feeding mode (SGNPFM). To solve the SGNPFM problem, a mixed-integer planning model with the objective of maximizing the gain of the link-building task is constructed, which considers various constraints of the satellite in the feed-switching mode. Based on the problem characteristics, we propose a distance similarity-based genetic optimization algorithm (DSGA), which considers the state characteristics between the tasks and introduces a weighted Euclidean distance method to determine the similarity between the tasks. To obtain more high-quality solutions, different similarity evaluation methods are designed to assist the algorithm in intelligently screening individuals. The DSGA also uses an adaptive crossover strategy based on a similarity mechanism, which guides the algorithm to achieve efficient population search. In addition, a task scheduling algorithm considering the feed-switching mode is designed for decoding the algorithm to generate a high-quality scheme. The results of simulation experiments show that the DSGA can effectively solve the SGNPFM problem.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 25 pages
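
Illustrative note on the DSGA entry above: the abstract mentions a weighted Euclidean distance for measuring similarity between tasks but gives no formula. The sketch below shows only the generic technique. The feature names, the weights, and the mapping from distance to a similarity score are assumptions made for illustration and are not taken from the paper.

```python
import math


def weighted_euclidean_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length feature vectors."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("a, b and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def similarity(a, b, weights):
    """Map a distance to a score in (0, 1]; identical vectors score 1.0.

    The 1 / (1 + d) mapping is an illustrative assumption, not the paper's.
    """
    return 1.0 / (1.0 + weighted_euclidean_distance(a, b, weights))


# Hypothetical task features: (start time, duration, priority).
task_a = (0.0, 3.0, 1.0)
task_b = (4.0, 3.0, 1.0)
# Hypothetical weights: start time counts less than the other features.
weights = (0.25, 1.0, 1.0)

print(weighted_euclidean_distance(task_a, task_b, weights))
print(similarity(task_a, task_b, weights))
```

A score of this kind could, for example, be used to decide which pairs of individuals to cross over, but how DSGA actually uses the similarity is described only in the paper itself.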

arXiv:2408.14035 [pdf, other]

FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry

Authors: Chunran Zheng, Wei Xu, Zuhao Zou, Tong Hua, Chongjian Yuan, Dongjiao He, Bingyang Zhou, Zheng Liu, Jiarong Lin, Fangcheng Zhu, Yunfan Ren, Rong Wang, Fanle Meng, Fu Zhang

Abstract: This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry framework to achieve accurate and robust state estimation in SLAM tasks and provide great potential in real-time, onboard robotic applications. FAST-LIVO2 fuses the IMU, LiDAR and image measurements efficiently through an ESIKF. To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FAST-LIVO2 employs an on-demanding raycast operation and estimates the image exposure time in real time. Lastly, we detail three applications of FAST-LIVO2: UAV onboard navigation demonstrating the system's computation efficiency for real-time onboard navigation, airborne mapping showcasing the system's mapping accuracy, and 3D model rendering (mesh-based and NeRF-based) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application on GitHub to benefit the robotics community.

Submitted 28 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

Comments: 30 pages, 31 figures, due to the limitation that 'The abstract field cannot exceed 1,920 characters', the abstract presented here is shorter than the one in the PDF file
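
Illustrative note on the FAST-LIVO2 entry above: the abstract refers to a sequential update strategy in the Kalman filter. As a rough intuition only, the toy sketch below applies two measurements to a scalar state one after the other and compares the result with fusing them in a single step. It is a linear, one-dimensional example with made-up numbers, assuming independent measurement noise; it is not the paper's ESIKF, which is iterated and operates on a much richer state.

```python
def kalman_update(mean, var, z, meas_var):
    """One scalar Kalman measurement update for the model z = x + noise."""
    gain = var / (var + meas_var)
    new_mean = mean + gain * (z - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


def sequential_update(mean, var, measurements):
    """Apply each (value, noise variance) measurement in turn."""
    for z, meas_var in measurements:
        mean, var = kalman_update(mean, var, z, meas_var)
    return mean, var


def batch_update(mean, var, measurements):
    """Fuse all measurements at once in information form, for comparison."""
    info = 1.0 / var + sum(1.0 / r for _, r in measurements)
    weighted = mean / var + sum(z / r for z, r in measurements)
    return weighted / info, 1.0 / info


# Hypothetical prior and two measurements from different sensors,
# given as (measured value, noise variance).
prior_mean, prior_var = 0.0, 4.0
measurements = [(1.0, 1.0), (2.0, 4.0)]

print(sequential_update(prior_mean, prior_var, measurements))
print(batch_update(prior_mean, prior_var, measurements))
```

In this linear toy case the two routes agree up to floating-point rounding, which is why processing sensors one at a time can be a convenient way to handle measurements of different dimensions.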

arXiv:2408.11986 [pdf, other]

Magnetic proximity coupling to defects in a two-dimensional semiconductor

Authors: Muhammad Hassan Shaikh, Matthew Whalen, Dai Q. Ho, Aqiq Ishraq, Collin Maurtua, Kenji Watanabe, Takashi Taniguchi, Yafei Ren, Anderson Janotti, John Xiao, Chitraleema Chakraborty

Abstract: The ultrathin structure and efficient spin dynamics of two-dimensional (2D) antiferromagnetic (AFM) materials hold unprecedented opportunities for ultrafast memory devices, artificial intelligence circuits, and novel computing technology. For example, chromium thiophosphate (CrPS4) is one of the most promising 2D A-type AFM materials due to its robust stability in diverse environmental conditions and net out-of-plane magnetic moment in each layer, attributed to anisotropy in crystal axes (a and b). However, their net zero magnetic moment poses a challenge for detecting the Néel state that is used to encode information. In this study, we demonstrate the detection of the Néel vector by detecting the magnetic order of the surface layer by employing defects in tungsten diselenide (WSe2). These defects are ideal candidates for optically active transducers to probe the magnetic order due to their narrow linewidth and high susceptibility to magnetic fields. We observed spin-polarized charge transfer in the heterostructure of bulk CrPS4 and single-layer WSe2 indicating type-II band alignment as supported by density functional theory (DFT) calculations. In the A-type AFM regime, the intensity of both right-handed and left-handed circularly polarized light emanating from the sample remains constant as a function of the applied magnetic field, indicating a constant polarized transition behavior. Our results showcase a new approach to optically characterizing the magnetic states of 2D bulk AFM material, highlighting avenues for future research and technological applications.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11758 [pdf, other]

MambaCSR: Dual-Interleaved Scanning for Compressed Image Super-Resolution With SSMs

Authors: Yulin Ren, Xin Li, Mengxi Guo, Bingchen Li, Shijie Zhao, Zhibo Chen

Abstract: We present MambaCSR, a simple but effective framework based on Mamba for the challenging compressed image super-resolution (CSR) task. Particularly, the scanning strategies of Mamba are crucial for effective contextual knowledge modeling in the restoration process despite it relying on selective state space modeling for all tokens. In this work, we propose an efficient dual-interleaved scanning paradigm (DIS) for CSR, which is composed of two scanning strategies: (i) hierarchical interleaved scanning is designed to comprehensively capture and utilize the most potential contextual information within an image by simultaneously taking advantage of the local window-based and sequential scanning methods; (ii) horizontal-to-vertical interleaved scanning is proposed to reduce the computational cost by leaving the redundancy between the scanning of different directions. To overcome the non-uniform compression artifacts, we also propose position-aligned cross-scale scanning to model multi-scale contextual information. Experimental results on multiple benchmarks have shown the great performance of our MambaCSR in the compressed image super-resolution task. The code will soon be available at https://github.com/renyulin-f/MambaCSR.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.10591 [pdf, ps, other]

On differential geometry of non-degenerate CR manifolds

Authors: Yuxin Dong, Yibin Ren

Abstract: In this paper, we consider a non-degenerate CR manifold (M,H(M),J) with a given pseudo-Hermitian 1-form θ, and endow the CR distribution H(M) with any Hermitian metric h instead of the Levi form L_{θ}. This induces a natural Riemannian metric g_{h,θ} on M compatible with the structure. The synthetic object (M,θ,J,h) will be called a pseudo-Hermitian manifold, which generalizes the usual notion of pseudo-Hermitian manifold (M,θ,J,L_{θ}) in the literature. Our purpose is to investigate the differential-geometric aspect of pseudo-Hermitian manifolds. By imitating Hermitian geometry, we find a canonical connection on (M,θ,J,h), which generalizes the Tanaka-Webster connection on (M,θ,J,L_{θ}). We define the pseudo-Kähler 2-form by g_{h,θ} and J; and introduce the notion of a pseudo-Kähler manifold, which is an analogue of a Kähler manifold. It turns out that (M,θ,J,L_{θ}) is pseudo-Kählerian. Using the structure equations of the canonical connection, we derive some curvature and torsion properties of a pseudo-Hermitian manifold, in particular of a pseudo-Kähler manifold. Then some known results in Riemannian geometry are generalized to the pseudo-Hermitian case. These results include some Cartan type results. As an application, we give a new proof for the classification of Sasakian space forms.

Submitted 20 August, 2024; originally announced August 2024.

Comments: 36 pages, Comments welcome

arXiv:2408.10124 [pdf, other]

Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Authors: Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang

Abstract: Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2408.09738 [pdf]

Room-Temperature Multiferroic Skyrmions in LiNbO3 with enhancement in electric-optical property

Authors: Yalong Yu, Bo Xiong, Siqi Wu, Yekai Ren, Nuo Chen, Qingjiao Mi, Zhaojie Zheng, Kangping Lou, Rui Wang, Tao Chu

Abstract: LiNbO3 (LN) is renowned for its exceptional ferroelectric properties, particularly its notable linear electro-optical (EO) effect, which is highly advantageous for various applications such as high-speed communication, optical computation, and quantum information processing. Compared to its ferroelectric properties, the magnetism of LN has attracted less interest due to its weak ferromagnetic nature. Theoretical studies suggest that LN may exhibit a novel magnetoelectric coupling via ferroelectrically-induced ferromagnetism. However, this mechanism has not yet been experimentally validated in any material, presenting significant challenges for research. In this study, we provide the first experimental evidence supporting the mechanism of ferroelectrically-induced ferromagnetism in LN, including observations of the Dzyaloshinskii-Moriya interaction (DMI) and magnetoelectric coupling. Additionally, we have identified various multiferroic skyrmions, within which ferroelectric polarization signals are detectable. These signals can be influenced by the magnetic vortex structures, indicating a magnetoelectric coupling nature. Currently, they are the only multiferroic skyrmions that remain stable at room temperature. Moreover, these magnetic textures significantly affect the ferroelectric properties, as demonstrated by an enhancement of the linear electro-optic effect of LN by over 200%. Given the novel magnetoelectric coupling mechanism, the potential of multiferroic skyrmions in spintronics and advanced data storage, and the extensive use of LN EO modulators, our research has significant implications for condensed matter physics, multiferroic materials, and optoelectronics.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2408.08202 [pdf, other]

Towards Practical Human Motion Prediction with LiDAR Point Clouds

Authors: Xiao Han, Yiming Ren, Yichen Yao, Yujing Sun, Yuexin Ma

Abstract: Human motion prediction is crucial for human-centric multimedia understanding and interaction. Current methods typically rely on ground truth human poses as observed input, which is not practical for real-world scenarios where only raw visual sensor data is available. To implement these methods in practice, a preliminary pose estimation phase is essential. However, such two-stage approaches often lead to performance degradation due to the accumulation of errors. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes the density of information, resulting in the loss of fine-grained features. In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which receives the raw LiDAR point cloud as input and forecasts future 3D human poses directly. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motions for further refinement of prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.06787 [pdf, other]

Unlock the Power of Frozen LLMs in Knowledge Graph Completion

Authors: Bo Xue, Yi Xu, Yunchong Song, Yiming Pang, Yuyang Ren, Jiaxin Ding, Luoyi Fu, Xinbing Wang

Abstract: Classical knowledge graph completion (KGC) methods rely solely on structural information, struggling with the inherent sparsity of knowledge graphs (KGs). Large Language Models (LLMs) learn extensive knowledge from large corpora with powerful context modeling, which is ideal for mitigating the limitations of previous methods. Directly fine-tuning LLMs offers great capability but comes at the cost of huge time and memory consumption, while utilizing frozen LLMs yields suboptimal results. In this work, we aim to leverage LLMs for KGC effectively and efficiently. We capture the context-aware hidden states of knowledge triples by employing prompts to stimulate the intermediate layers of LLMs. We then train a data-efficient classifier on these hidden states to harness the inherent capabilities of frozen LLMs in KGC. We also generate entity descriptions with subgraph sampling on KGs, reducing the ambiguity of triplets and enriching the knowledge representation. Extensive experiments on standard benchmarks showcase the efficiency and effectiveness of our approach. We outperform classical KGC methods on most datasets and match the performance of fine-tuned LLMs. Additionally, compared to fine-tuned LLMs, we boost GPU memory efficiency by 188× and speed up training+inference by 13.48×.

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2408.04967 [pdf, other]

ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

Authors: Jiangyan Yi, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou

Abstract: The growing prominence of the field of audio deepfake detection is driven by its wide range of applications, notably in protecting the public from potential fraud and other malicious activities, prompting the need for greater attention and research in this area. The ADD 2023 challenge goes beyond binary real/fake classification by emulating real-world scenarios, such as the identification of manipulated intervals in partially fake audio and determining the source responsible for generating any fake audio, both with real-life implications, notably in audio forensics, law enforcement, and construction of reliable and trustworthy evidence. To further foster research in this area, in this article, we describe the dataset that was used in the fake game, manipulation region location and deepfake algorithm recognition tracks of the challenge. We also focus on the analysis of the technical methodologies by the top-performing participants in each task and note the commonalities and differences in their approaches. Finally, we discuss the current technical limitations as identified through the technical analysis, and provide a roadmap for future research directions. The dataset is available for download.

Submitted 9 August, 2024; originally announced August 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2408.04708 [pdf, other]

MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Authors: Jiawei Huang, Chen Zhang, Yi Ren, Ziyue Jiang, Zhenhui Ye, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao

Abstract: Voice conversion aims to modify the source speaker's voice to resemble the target speaker while preserving the original speech content. Despite notable advancements in voice conversion these days, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: In step one the model is trained with monolingual speech data; then, steps two and three take inspiration from back translation, construct a cyclical process to disentangle the timbre and other information (content, prosody, and other language-related information) in the absence of multi-lingual data from the same speaker. Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts, demonstrating the system's efficacy and the viability of the three-step approach with cycle consistency. Audio samples can be found on our demo page (mullivc.github.io).

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2408.04334 [pdf, other]

A Node-Based Polar List Decoder with Frame Interleaving and Ensemble Decoding Support

Authors: Yuqing Ren, Leyu Zhang, Ludovic Damien Blanc, Yifei Shen, Xinwei Li, Alexios Balatsoukas-Stimming, Chuan Zhang, Andreas Burg

Abstract: Node-based successive cancellation list (SCL) decoding has received considerable attention in wireless communications for its significant reduction in decoding latency, particularly with 5G New Radio (NR) polar codes. However, the existing node-based SCL decoders are constrained by sequential processing, leading to complicated and data-dependent computational units that introduce unavoidable stalls, reducing hardware efficiency. In this paper, we present a frame-interleaving hardware architecture for a generalized node-based SCL decoder. By efficiently reusing otherwise idle computational units, two independent frames can be decoded simultaneously, resulting in a significant throughput gain. Based on this new architecture, we further exploit graph ensembles to diversify the decoding space, thus enhancing the error-correcting performance with a limited list size. Two dynamic strategies are proposed to eliminate the residual stalls in the decoding schedule, which eventually results in nearly 2x throughput compared to the state-of-the-art baseline node-based SCL decoder. To impart the decoder rate flexibility, we develop a novel online instruction generator to identify the generalized nodes and produce instructions on-the-fly. The corresponding 28 nm FD-SOI ASIC SCL decoder with a list size of 8 has a core area of 1.28 mm² and operates at 692 MHz. It is compatible with all 5G NR polar codes and achieves a throughput of 3.34 Gbps and an area efficiency of 2.62 Gbps/mm² for uplink (1024, 512) codes, which is 1.41x and 1.69x better than the state-of-the-art node-based SCL decoders.

Submitted 8 August, 2024; originally announced August 2024.

Comments: 13 pages, 16 figures, accepted by IEEE Transactions on Circuits and Systems I: Regular Papers

arXiv:2408.02431 [pdf, other]

Moire exciton polaritons in twisted photonic lattices at room temperature

Authors: Chunzi Xing, Yu Wang, Tobias Schneider, Xiaokun Zhai, Xinzheng Zhang, Zhenyu Xiong, Hao Wu, Yuan Ren, Haitao Dai, Xiao Wang, Anlian Pan, Stefan Schumacher, Xuekai Ma, Tingge Gao

Abstract: Moiré lattices attract intensive attention in the double graphene/TMD layers and photonic crystals due to the interesting exotic physics within these structures. However, precise measurement of the moiré ground states, excited states and Bloch bands in the twisted photonic lattices is still elusive. In this work we report the strong coupling between the excitons of CsPbBr3 microplates and the photonic modes of the moiré lattice at room temperature. Depending on the coupling strength between the nearest potential sites, we observe staggered moiré polariton ground states, excited states trapped in the potential sites and moiré polariton bands across the twisted photonic lattice. In addition, the phase locking of moiré zero (stable in-phase) states and moiré pi (metastable antiphase) states with different spatial distributions is measured. The moiré polariton distribution can be tuned in the shape of a parallelogram by controlling the depth and width of the potential in one photonic lattice with the other one fixed. Our work lays the foundation to study moiré exciton polariton Wigner crystals and Luttinger liquid in twisted photonic lattices at room temperature.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.02036 [pdf, other]

LEGO: Self-Supervised Representation Learning for Scene Text Images

Authors: Yujin Ren, Jiaxin Zhang, Lianwen Jin

Abstract: In recent years, significant progress has been made in scene text recognition by data-driven methods. However, due to the scarcity of annotated real-world data, the training of these methods predominantly relies on synthetic data. The distribution gap between synthetic and real data constrains the further performance improvement of these methods in real-world applications. To tackle this problem, a highly promising approach is to utilize massive amounts of unlabeled real data for self-supervised training, which has been widely proven effective in many NLP and CV tasks. Nevertheless, generic self-supervised methods are unsuitable for scene text images due to their sequential nature. To address this issue, we propose a Local Explicit and Global Order-aware self-supervised representation learning method (LEGO) that accounts for the characteristics of scene text images. Inspired by the human cognitive process of learning words, which involves spelling, reading, and writing, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features, respectively. The entire pre-training process is optimized by using a consistent Text Knowledge Codebook. Extensive experiments validate that LEGO outperforms previous scene text self-supervised methods. The recognizer incorporated with our pre-trained model achieves superior or comparable performance compared to state-of-the-art scene text recognition methods on six benchmarks. Furthermore, we demonstrate that LEGO can achieve superior performance in other text-related tasks.

Submitted 4 August, 2024; originally announced August 2024.

arXiv:2408.00788 [pdf, other]

SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network

Authors: Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo Xu, Guoqi Li

Abstract: Brain-inspired Spiking Neural Networks (SNNs) have demonstrated their effectiveness and efficiency in vision, natural language, and speech understanding tasks, indicating their capacity to "see", "listen", and "read". In this paper, we design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to "speak". A major obstacle to using SNN for such generative tasks lies in the demand for models to grasp long-term dependencies. The serial nature of spiking neurons, however, leads to the invisibility of information at future spiking time steps, limiting SNN models to capture sequence dependencies solely within the same time step. We term this phenomenon "partial-time dependency". To address this issue, we introduce Spiking Temporal-Sequential Attention (STSA) in SpikeVoice. To the best of our knowledge, SpikeVoice is the first TTS work in the SNN field. We perform experiments using four well-established datasets that cover both Chinese and English languages, encompassing scenarios with both single-speaker and multi-speaker configurations. The results demonstrate that SpikeVoice can achieve results comparable to Artificial Neural Networks (ANN) with only 10.5% of the energy consumption of ANN.

Submitted 17 July, 2024; originally announced August 2024.

Comments: 9 pages

arXiv:2408.00661 [pdf, other]

Neuromorphic detection and cooling of microparticle arrays

Authors: Yugang Ren, Benjamin Siegel, Ronghao Yin, Muddassar Rashid, James Millen

Abstract: Micro-objects levitated in a vacuum are an exciting platform for precision sensing due to their low dissipation motion and the potential for control at the quantum level. Arrays of such sensors would allow noise cancellation, directionality, increased sensitivity and in the quantum regime the potential to exploit correlation and entanglement. We use neuromorphic detection via a single event-based camera to record the motion of an array of levitated microspheres. We present the first truly scalable method for multiparticle control by implementing real-time feedback to cool the motion of three objects simultaneously.

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.21491 [pdf]

Generative Expressive Conversational Speech Synthesis

Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

Abstract: Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often need to design complex network architectures and meticulously optimize the modules within them. In addition, due to the limitations of small-scale datasets containing scripted recording styles, they often fail to simulate real natural conversational styles. To address the above issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence, that includes both semantic and style knowledge, of response for the agent. After that, the expressive conversational speech is synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset called NCSSD, that includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It encompasses both Chinese and English languages, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of the NCSSD and the effectiveness of our GPT-Talker. Both subjective and objective evaluations demonstrate that our model outperforms other state-of-the-art CSS systems significantly in terms of naturalness and expressiveness. The Code, Dataset, and Pre-trained Model are available at: https://github.com/AI-S2-Lab/GPT-Talker.

Submitted 31 July, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

Comments: 14 pages, 6 figures, 8 tables. Accepted by ACM MM 2024

arXiv:2407.20481 [pdf, ps, other]

Stronger sum uncertainty relations for non-Hermitian operators

Authors: Xiao-Feng Song, Yi-Fang Ren, Shuang Liu, Xi-Hao Chen, Yusuf Turek

Abstract: Unlike the uncertainty relationships of two arbitrary incompatible observables represented by the product of variances in the past, representing them by the sum of variances is better as it guarantees to be nontrivial for two incompatible operators in some special cases. Although the uncertainty relation formulated as the sum of variances for unitary operators has been confirmed, its general forms for arbitrary non-Hermitian operators have not yet been investigated in detail. Thus, this study develops four sum uncertainty relations for arbitrary non-Hermitian operators acting on system states by utilizing an appropriate Hilbert-space metric. The compatible forms of our sum inequalities with the conventional quantum mechanics are also provided via $G$-metric formalism. Concrete examples demonstrate the validity of the proposed sum uncertainty relations in both $\mathcal{PT}$-symmetric and $\mathcal{PT}$-broken phases. The proposed methods and results can help the reader to understand in-depth the usefulness of $G$-metric formalism in non-Hermitian quantum mechanics and the sum uncertainty relations of incompatible operators within.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19973 [pdf, other]

Spontaneous spin superconductor state in ABCA-stacked tetralayer graphene

Authors: Shuai Li, Yuan-Hang Ren, Ao-Long Li, Hua Jiang

Abstract: We theoretically demonstrate a spontaneous spin superconductor (SC) state in ABCA-stacked tetralayer graphene, under sequential effects of electron-electron (e-e) and electron-hole (e-h) interactions. First of all, we examine the ferromagnetic (FM) exchange instability and phase diagram of the system induced by the long-range e-e interaction. At non- or low-doping levels, the interaction tends to stabilize a FM phase with the coexisting electron and hole carriers. Superior to bilayer and trilayer systems, tetralayer graphene has a larger FM phase region and spin splitting, making it more advantageous to realize the spin SC state. Subsequently, we prove that the FM phase becomes unstable when attractive e-h interaction is considered. As a consequence, the spin SC state can be spontaneously formed at low temperature, where spin-triplet exciton pairs act as the equivalent of Cooper pairs. We further develop a consistent BCS-type theory for the spin SC state in ABCA-stacked graphene. The predicted spin superconducting gap can reach about $7.0$ meV, with a critical temperature of about 45 K for the non-doped system. Finally, we demonstrate a spin-current Josephson effect in the ABCA-stacked graphene spin SC heterojunction. Our findings enrich the prospective spin SC candidate materials, illuminating more possibilities for achieving non-dissipative super-spintronics.

Submitted 29 July, 2024; originally announced July 2024.

Comments: 15 pages,7 figures

arXiv:2407.19691 [pdf, ps, other]

Detection of Electron Paramagnetic Resonance of Two Electron Spins Using a Single NV Center in Diamond

Authors: Yuhang Ren, Susumu Takahashi

Abstract: An interacting spin system is a great testbed for fundamental quantum physics and applications in quantum sensing and quantum simulation. For these investigations, detailed information on the interactions, e.g. the number of spins and their interaction strengths, is often required. In this study, we present the identification and characterization of a single nitrogen-vacancy (NV) center coupled to two electron spins. In the experiment, we first identify a well isolated single NV center, and characterize its spin decoherence time. Then we perform NV-detected electron paramagnetic resonance (EPR) spectroscopy to detect surrounding electron spins. From the analysis of the NV-EPR signal, we determine the number of detected spins and their interaction strengths precisely. Moreover, the spectral analysis indicates candidates of the detected spins to be diamond surface spins. This study demonstrates a promising approach for the identification and characterization of an interacting spin system for realizing entangled sensing, using electron spins as quantum reporters.

Submitted 29 July, 2024; originally announced July 2024.

Comments: 15 pages, 5 figures, submitted to APL Quantum

arXiv:2407.18939 [pdf]

Promoting AI Competencies for Medical Students: A Scoping Review on Frameworks, Programs, and Tools

Authors: Yingbo Ma, Yukyeong Song, Jeremy A. Balch, Yuanfang Ren, Divya Vellanki, Zhenhong Hu, Meghan Brennan, Suraj Kolla, Ziyuan Guan, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Parisa Rashidi, Tyler J. Loftus, Azra Bihorac, Benjamin Shickel

Abstract: As more clinical workflows continue to be augmented by artificial intelligence (AI), AI literacy among physicians will become a critical requirement for ensuring safe and ethical AI-enabled patient care. Despite the evolving importance of AI in healthcare, the extent to which it has been adopted into traditional and often-overloaded medical curricula is currently unknown. In a scoping review of 1,699 articles published between January 2016 and June 2024, we identified 18 studies which propose guiding frameworks, and 11 studies documenting real-world instruction, centered around the integration of AI into medical education. We found that comprehensive guidelines will require greater clinical relevance and personalization to suit medical student interests and career trajectories. Current efforts highlight discrepancies in the teaching guidelines, emphasizing AI evaluation and ethics over technical topics such as data science and coding. Additionally, we identified several challenges associated with integrating AI training into the medical education program, including a lack of guidelines to define medical students' AI literacy, a perceived lack of proven clinical value, and a scarcity of qualified instructors. With this knowledge, we propose an AI literacy framework to define competencies for medical students. To prioritize relevant and personalized AI education, we categorize literacy into four dimensions: Foundational, Practical, Experimental, and Ethical, with learning objectives tailored to the pre-clinical, clinical, and clinical research stages of medical education. This review provides a road map for developing practical and relevant education strategies for building an AI-competent healthcare workforce.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 25 pages, 2 figures, 3 tables

arXiv:2407.18469 [pdf, ps, other]

On Asymptotic Analysis of Perturbed Sweeping Processes with Application to Optimization

Authors: Zhaoyue Xia, Jun Du, Chunxiao Jiang, H. Vincent Poor, Yong Ren

Abstract: Convergence analysis of constrained optimization methods from the dynamical systems viewpoint has attracted considerable attention because it provides a geometric demonstration towards the shadowing trajectory of a numerical scheme. In this work, we establish a tight connection between a continuous-time nonsmooth dynamical system called a perturbed sweeping process (PSP) and a proximal stochastic approximation scheme. Theoretical results are obtained by analyzing the asymptotic pseudo trajectory of a PSP. We show that under mild assumptions a proximal stochastic approximation scheme converges to an internally chain transitive invariant set of the corresponding PSP. Furthermore, given the existence of a Lyapunov function $V$ with respect to a set $Λ$, convergence to $Λ$ can be established if $V(Λ)$ has an empty interior. Based on these theoretical results, we are able to provide a useful framework for convergence analysis of proximal gradient methods. Illustrative examples are provided to determine the convergence of proximal variants of gradient methods (including accelerated gradient methods). Finally, numerical simulations are conducted to confirm the validity of theoretical analysis.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.17386 [pdf, other]

Data-driven stellar intrinsic colors and dust reddenings for spectro-photometric data: From the blue-edge method to a machine-learning approach

Authors: He Zhao, Shu Wang, Biwei Jiang, Jun Li, Dongwei Fan, Yi Ren, Xiaoxiao Ma

Abstract: Intrinsic colors (ICs) of stars are essential for the studies on both stellar physics and dust reddening. In this work, we developed an XGBoost model to predict the ICs with the atmospheric parameters $T_{\rm eff}$, ${\rm log}\,g$, and $\rm [M/H]$. The model was trained and tested for three colors at Gaia and 2MASS bands with 1,040,446 low-reddening sources. The atmospheric parameters were determined by the Gaia DR3 GSP-phot module and were validated by comparing with APOGEE and LAMOST. We further confirmed that the biases in GSP-phot parameters, especially for $\rm [M/H]$, do not present a significant impact on the IC prediction. The generalization error of the model estimated by the test set is 0.014 mag for $(G_{\rm BP}\,{-}\,G_{\rm RP})_0$, 0.050 mag for $(G_{\rm BP}\,{-}\,K_{\rm S})_0$, and 0.040 mag for $(J\,{-}\,K_{\rm S})_0$. The model was applied to a sample containing 5,714,528 reddened stars with stellar parameters from Andrae et al. (2023) to calculate ICs and reddenings. The high consistency in the comparison of $E(J\,{-}\,K_{\rm S})$ between our results and literature values further validates the accuracy of the XGBoost model. The variation of $E(G_{\rm BP}\,{-}\,K_{\rm S})/E(G_{\rm BP}\,{-}\,G_{\rm RP})$, a representation of the extinction law, with Galactic longitude is found on large scales. This work preliminarily presents the feasibility and the accuracy of the machine-learning approach for IC and dust reddening calculation, whose products could be widely applied to spectro-photometric data. The data sets and trained model can be accessed via https://doi.org/10.5281/zenodo.12787594. The models for more bands will be completed in the following works.

Submitted 24 July, 2024; originally announced July 2024.

Comments: 23 pages, 1 table, 11 figures, 2 appendices, accepted for publication in ApJ

arXiv:2407.16004 [pdf, other]

Theory of electric polarization induced by magnon transport in two-dimensional honeycomb antiferromagnets

Authors: D. Quang To, Federico Garcia-Gaitan, Yafei Ren, Joshua M. O. Zide, John Q. Xiao, Branislav K. Nikolić, Garnett W. Bryant, Matthew F. Doty

Abstract: We introduce a quantum mechanical formalism for computing the electric polarization arising in two-dimensional (2D) antiferromagnets (AFMs) as a result of both spin and orbital transport effect of magnons. We first show that an applied temperature gradient in a 2D collinear honeycomb AFM gives rise to accumulations of magnons having both orbital moment and spin moment at the edges of the 2D AFM as the manifestation of the magnon Nernst effects. We then use our formalism to show that such magnon Nernst effects, in the presence of the Dzyaloshinskii-Moriya Interaction (DMI), induce electric polarization that can be measured experimentally. Finally, we demonstrate the value of this formalism by using it to predict the properties of two 2D honeycomb AFMs, one each with Néel and zigzag order. These results integrate advances in magnonics, spinorbitronics, and orbitronics to create a unified framework that can be used to understand and control the manipulation of magnetic states in 2D AFMs through the magnon spin and orbital degrees of freedom.

Submitted 22 July, 2024; originally announced July 2024.

Comments: 11 pages, 5 figures

arXiv:2407.14560 [pdf, other]

Automated and Holistic Co-design of Neural Networks and ASICs for Enabling In-Pixel Intelligence

Authors: Shubha R. Kharel, Prashansa Mukim, Piotr Maj, Grzegorz W. Deptuch, Shinjae Yoo, Yihui Ren, Soumyajit Mandal

Abstract: Extreme edge-AI systems, such as those in readout ASICs for radiation detection, must operate under stringent hardware constraints such as micron-level dimensions, sub-milliwatt power, and nanosecond-scale speed while providing clear accuracy advantages over traditional architectures. Finding ideal solutions means identifying optimal AI and ASIC design choices from a design space that has explosively expanded during the merger of these domains, creating non-trivial couplings which together act upon a small set of solutions as constraints tighten. It is impractical, if not impossible, to manually determine ideal choices among possibilities that easily exceed billions even in small-size problems. Existing methods to bridge this gap have leveraged theoretical understanding of hardware to f architecture search. However, the assumptions made in computing such theoretical metrics are too idealized to provide sufficient guidance during the difficult search for a practical implementation. Meanwhile, theoretical estimates for many other crucial metrics (like delay) do not even exist and are similarly variable, dependent on parameters of the process design kit (PDK). To address these challenges, we present a study that employs intelligent search using multi-objective Bayesian optimization, integrating both neural network search and ASIC synthesis in the loop. This approach provides reliable feedback on the collective impact of all cross-domain design choices. We showcase the effectiveness of our approach by finding several Pareto-optimal design choices for effective and efficient neural networks that perform real-time feature extraction from input pulses within the individual pixels of a readout ASIC.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 18 pages, 17 figures

arXiv:2407.14239 [pdf, other]

KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models

Authors: Kemou Jiang, Xuan Cai, Zhiyong Cui, Aoyong Li, Yilong Ren, Haiyang Yu, Hao Yang, Daocheng Fu, Licheng Wen, Pinlong Cai

Abstract: Large language models (LLMs) as autonomous agents offer a novel avenue for tackling real-world challenges in a knowledge-driven manner. These LLM-enhanced methodologies excel in generalization and interpretability. However, the complexity of driving tasks often necessitates the collaboration of multiple, heterogeneous agents, underscoring the need for such LLM-driven agents to engage in cooperative knowledge sharing and cognitive synergy. Despite the promise of LLMs, current applications predominantly center around single agent scenarios. To broaden the horizons of knowledge-driven strategies and bolster the generalization capabilities of autonomous agents, we propose the KoMA framework consisting of multi-agent interaction, multi-step planning, shared-memory, and ranking-based reflection modules to enhance multi-agents' decision-making in complex driving scenarios. Based on the framework's generated text descriptions of driving scenarios, the multi-agent interaction module enables LLM agents to analyze and infer the intentions of surrounding vehicles, akin to human cognition. The multi-step planning module enables LLM agents to analyze and obtain final action decisions layer by layer to ensure consistent goals for short-term action decisions. The shared memory module can accumulate collective experience to make superior decisions, and the ranking-based reflection module can evaluate and improve agent behavior with the aim of enhancing driving safety and efficiency. The KoMA framework not only enhances the robustness and adaptability of autonomous driving agents but also significantly elevates their generalization capabilities across diverse scenarios. Empirical results demonstrate the superiority of our approach over traditional methods, particularly in its ability to handle complex, unpredictable driving environments without extensive retraining.

Submitted 19 July, 2024; originally announced July 2024.

Comments: 13 pages, 18 figures

arXiv:2407.13985 [pdf]

Cluster Sliding Ferroelectricity in Trilayer Quasi-Hexagonal C60

Authors: Xuefei Wang, Yanhan Ren, Shi Qiu, Fan Zhang, Xueao Li, Junfeng Gao, Weiwei Gao, Jijun Zhao

Abstract: Electric polarization typically originates from non-centrosymmetric charge distributions. Since chemical bonds between atoms of the same elements favor centrosymmetric crystal structures and symmetrically distributed electron charges, elemental ferroelectrics are extremely rare. In comparison to atoms, elemental clusters are less symmetric and typically have various preferred orientations in crystals. Consequently, the assembly of clusters with different orientations tends to break the inversion symmetry. Based on this concept, we show that sliding ferroelectricity naturally emerges in trilayer quasi-hexagonal phase (qHP) C60, a cluster-assembled carbon allotrope recently synthesized. Trilayer qHP C60's have several stable polar structures, which are distinguishable in second-harmonic generation (SHG) responses. Compared to previously found elemental ferroelectrics, trilayer qHP C60's have sizable band gaps and some of them have both switchable out-of-plane and in-plane polarizations. Remarkably, the out-of-plane and in-plane polarizations are decoupled, enabling an easy-to-implement construction of Van der Waals homostructures with ferroelectrically switchable chirality.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 5 figures

arXiv:2407.13637 [pdf]

Autonomous self-evolving research on biomedical data: the DREAM paradigm

Authors: Luojia Deng, Yijie Wu, Yongyong Ren, Hui Lu

Abstract: In contemporary biomedical research, the efficiency of data-driven approaches is hindered by large data volumes, tool selection complexity, and human resource limitations, necessitating the development of fully autonomous research systems to meet complex analytical needs. Such a system should include the ability to autonomously generate research questions, write analytical code, configure the computational environment, judge and interpret the results, and iteratively generate in-depth questions or solutions, all without human intervention. Here we developed DREAM, the first biomedical Data-dRiven self-Evolving Autonomous systeM, which can independently conduct scientific research without human involvement. Utilizing a clinical dataset and two omics datasets, DREAM demonstrated its ability to raise and deepen scientific questions, with difficulty scores for clinical data questions surpassing top published articles by 5.7% and outperforming GPT-4 and bioinformatics graduate students by 58.6% and 56.0%, respectively. Overall, DREAM has a success rate of 80% in autonomous clinical data mining. Humans can also participate in different steps of DREAM to achieve more personalized goals. After evolution, 10% of the questions exceeded the average scores of top published article questions on originality and complexity. In the autonomous environment configuration of the eight bioinformatics workflows, DREAM exhibited an 88% success rate, whereas GPT-4 failed to configure any workflows. On the clinical dataset, DREAM was over 10,000 times more efficient than the average scientist with a single computer core, and capable of revealing new discoveries. As a self-evolving autonomous research system, DREAM provides an efficient and reliable solution for future biomedical research. This paradigm may also have a revolutionary impact on other data-driven scientific research fields.

Submitted 10 August, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

Comments: 11 pages, 4 figures, content added, typos in figure corrected, references revised and font changed

arXiv:2407.13108 [pdf, other]

UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt

Authors: Xin Li, Bingchen Li, Yeying Jin, Cuiling Lan, Hanxin Zhu, Yulin Ren, Zhibo Chen

Abstract: Compressed Image Super-resolution (CSR) aims to simultaneously super-resolve the compressed images and tackle the challenging hybrid distortions caused by compression. However, existing works on CSR usually focus on a single compression codec, i.e., JPEG, ignoring the diverse traditional or learning-based codecs in practical applications, e.g., HEVC, VVC, HIFIC, etc. In this work, we propose the first universal CSR framework, dubbed UCIP, with dynamic prompt learning, intending to jointly support the CSR distortions of any compression codecs/modes. Particularly, an efficient dynamic prompt strategy is proposed to mine the content/spatial-aware task-adaptive contextual information for the universal CSR task, using only a small number of prompts with spatial size 1x1. To simplify contextual information mining, we introduce a novel MLP-like framework backbone for our UCIP by adapting the Active Token Mixer (ATM) to CSR tasks for the first time, where the global information modeling is only taken in horizontal and vertical directions with offset prediction. We also build an all-in-one benchmark dataset for the CSR task by collecting datasets with 6 popular, diverse traditional and learning-based codecs, including JPEG, HEVC, VVC, HIFIC, etc., resulting in 23 common degradations. Extensive experiments have shown the consistent and excellent performance of our UCIP on universal CSR tasks. The project can be found at https://lixinustc.github.io/UCIP.github.io

Submitted 17 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.11367 [pdf, other]

Enhancement of nonclassical properties of two-mode squeezed vacuum state with postselected von Neumann measurement

Authors: Janarbek Yuanbek, Yi-Fang Ren, Ahmad Abliz, Yusuf Turek

Abstract: We investigate the effects of weak value amplification on the nonclassical properties of two-mode squeezing vacuum state. To show the advantages of the two-mode squeezing vacuum state based post-selective weak measurements.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10833 [pdf, other]

MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

Authors: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

Abstract: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develops the powerful mixture-of-experts (MoE) prompt module, where some basic prompts cooperate to excavate the task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, the degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design the visual-to-text adapter for MoE-DiffIR, which aims to adapt the embedding of low-quality images from the visual domain to the textual domain as the textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct one comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR have demonstrated the excellent robustness and texture restoration capability of our proposed MoE-DiffIR. The project can be found at https://renyulin-f.github.io/MoE-DiffIR.github.io/.

Submitted 15 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.10490 [pdf, other]

Learning Dynamics of LLM Finetuning

Authors: Yi Ren, Danica J. Sutherland

Abstract: Learning dynamics, which describes how the learning of specific training examples influences the model's prediction of other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during finetuning, by analyzing the step-wise decomposition and accumulated influence among different responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. The analysis not only explains where the benefits of these methods come from but also inspires a simple, effective method to further improve the alignment performance. Code for experiments is available at https://github.com/Joshua-Ren/Learning_dynamics_LLM.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 32 pages

arXiv:2407.09833 [pdf, other]

LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment

Authors: Yiming Ren, Xiao Han, Yichen Yao, Xiaoxiao Long, Yujing Sun, Yuexin Ma

Abstract: LiDAR-based human motion capture has garnered significant interest in recent years for its practicability in large-scale and unconstrained environments. However, most methods rely on cleanly segmented human point clouds as input, and the accuracy and smoothness of their motion results are compromised when faced with noisy data, rendering them unsuitable for practical applications. To address these limitations and enhance the robustness and precision of motion capture with noise interference, we introduce LiveHPS++, an innovative and effective solution based on a single LiDAR system. Benefiting from three meticulously designed modules, our method can learn dynamic and kinematic features from human movements, and further enable the precise capture of coherent human motions in open settings, making it highly applicable to real-world scenarios. Through extensive experiments, LiveHPS++ has proven to significantly surpass existing state-of-the-art methods across various datasets, establishing a new benchmark in the field.

Submitted 13 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.09697 [pdf, other]

Uplifting Range-View-based 3D Semantic Segmentation in Real-Time with Multi-Sensor Fusion

Authors: Shiqi Tan, Hamidreza Fazlali, Yixuan Xu, Yuan Ren, Bingbing Liu

Abstract: Range-View (RV)-based 3D point cloud segmentation is widely adopted due to its compact data form. However, RV-based methods fall short in providing robust segmentation for the occluded points and suffer from distortion of projected RGB images due to the sparse nature of 3D point clouds. To alleviate these problems, we propose a new LiDAR and Camera Range-view-based 3D point cloud semantic segmentation method (LaCRange). Specifically, a distortion-compensating knowledge distillation (DCKD) strategy is designed to remedy the adverse effect of RV projection of RGB images. Moreover, a context-based feature fusion module is introduced for robust and preservative sensor fusion. Finally, in order to address the limited resolution of RV and its insufficiency of 3D topology, a new point refinement scheme is devised for proper aggregation of features in 2D and augmentation of point features in 3D. We evaluated the proposed method on large-scale autonomous driving datasets, i.e., SemanticKITTI and nuScenes. In addition to being real-time, the proposed method achieves state-of-the-art results on the nuScenes benchmark.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09361 [pdf, ps, other]

Nonreciprocal phonons in PT-symmetric antiferromagnet

Authors: Yafei Ren, Daniyar Saparov, Qian Niu

Abstract: Phonon nonreciprocity, indicating different transport properties along opposite directions, has been observed in experiments under a magnetic field. We show that nonreciprocal acoustic phonons can also exist without a magnetic field or net magnetization. We focus on PT-symmetric antiferromagnets that break both time-reversal symmetry T and inversion symmetry P. We identify crucial contributions in phenomenological elastic theory, dubbed flexo-viscosity and flexo-torque, that induce phonon nonreciprocity without changing the phonon polarization. The microscopic origin of these contributions is the molecular Berry curvature, manifested as emergent nonlocal magnetic fields on phonons. The symmetry breaking originating from spin order is transferred to the phonon system through spin-orbit coupling, where the orbital degree of freedom affects the lattice dynamics directly. By electrically modifying the spin-orbit coupling, we show that both the phonon nonreciprocity and helicity can be controlled and enhanced. Importantly, the phonon nonreciprocity is an odd function of the Néel vector, serving as an indicator of the order parameter.

Submitted 12 July, 2024; originally announced July 2024.

Comments: 5 pages, 2 figures

arXiv:2407.08239 [pdf, other]

An Unsupervised Domain Adaptation Method for Locating Manipulated Region in partially fake Audio

Authors: Siding Zeng, Jiangyan Yi, Jianhua Tao, Yujie Chen, Shan Liang, Yong Ren, Xiaohui Zhang

Abstract: When the task of locating manipulation regions in partially-fake audio (PFA) involves cross-domain datasets, the performance of deep learning models drops significantly due to the shift between the source and target domains. To address this issue, existing approaches often employ data augmentation before training. However, they overlook the characteristics of the target domain that are absent from the source domain. Inspired by the mixture-of-experts model, we propose an unsupervised method named Samples mining with Diversity and Entropy (SDE). Our method first learns from a collection of diverse experts that achieve great performance from different perspectives in the source domain, but with ambiguity on target samples. We leverage these diverse experts to select the most informative samples by calculating their entropy. Furthermore, we introduce a label generation method tailored to these selected samples, which are then incorporated into the source-domain training process to integrate target-domain information. We applied our method to a cross-domain partially fake audio detection dataset, ADD2023Track2. By introducing 10% of unknown samples from the target domain, we achieved an F1 score of 43.84%, which represents a relative increase of 77.2% compared to the second-best method.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.07464 [pdf, other]

Video-to-Audio Generation with Hidden Alignment

Authors: Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Abstract: Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model VTA-LDM built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization alignment, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into the impact of different data augmentation methods on enhancing the generation framework's overall capacity. We showcase possibilities to advance the challenge of generating synchronized audio from semantic and temporal perspectives. We hope these insights will serve as a stepping stone toward developing more realistic and accurate audio-visual generation models.

Submitted 10 July, 2024; originally announced July 2024.

Comments: https://sites.google.com/view/vta-ldm

arXiv:2407.06516 [pdf, other]

VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Authors: Yibo Liu, Zheyuan Yang, Guile Wu, Yuan Ren, Kejian Lin, Bingbing Liu, Yang Liu, Jinjun Shan

Abstract: Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods cannot address this problem well because they learn generation merely from image RGB information without a deeper understanding of in-the-wild vehicles (such as car models, manufacturers, etc.). This leads to poor zero-shot prediction capability when handling real-world observations with occlusion or tricky viewing angles. To solve this problem, in this work, we propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create photorealistic 3D vehicle assets for autonomous driving. VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction and the rich image prior knowledge in the Diffusion model for structure and appearance generation. In particular, we utilize a multi-expert Diffusion Models strategy to generate the structure information and employ a subject-driven structure-controlled generation mechanism to model appearance information. As a result, without the necessity to learn from a large-scale image-to-3D vehicle dataset collected from the real world, VQA-Diff still has a robust zero-shot image-to-novel-view generation ability. We conduct experiments on various datasets, including Pascal 3D+, Waymo, and Objaverse, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.

Submitted 10 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05349 [pdf]

Stable room-temperature multiferroic skyrmions in lithium niobate with enhanced Pockels effect

Authors: Yalong Yu, Bo Xiong, Siqi Wu, Yekai Ren, Nuo Chen, Qingjiao Mi, Kangping Lou, Rui Wang, Tao Chu

Abstract: Lithium Niobate (LN) is a ferroelectric material with exceptional electrical characteristics, including high piezoelectricity, high Pockels effect, etc. These properties make it a promising platform for numerous fields such as high-speed communication, optical computation, and quantum information processing. Besides these, the introduction of magnetic structures to LN holds significant potential to achieve magnetoelectric coupling, which can be applied in magnetic memory and data-processing devices with high efficiency. Here, for the first time, we observe a special topological magnetic structure called magnetic skyrmion in LN (SK-LN) by the combination of magnetic field annealing and rapid annealing processes. Compared to the magnetic skyrmions reported in magnetic systems, SK-LN exhibits exceptionally high stability. Additionally, the center of the magnetic vortex exhibits spontaneous ferroelectric polarization, indicating its multiferroic characteristic. With the excitation of these multiferroic skyrmions, the modulation efficiency of the electro-optical (EO) modulator fabricated on thin film lithium niobate on insulator (LNOI) wafer was found to be enhanced from 1.98 V*cm to 0.63 V*cm. It is considered that the multiferroic skyrmions significantly enhance the Pockels coefficient of LN to 101 pm/V, nearly three times the result (32 pm/V) reported previously.

Submitted 7 July, 2024; originally announced July 2024.

Report number: submit/5797581

arXiv:2407.05089 [pdf, other]

Bayesian network-guided sparse regression with flexible varying effects

Authors: Yangfan Ren, Christine B. Peterson, Marina Vannucci

Abstract: In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates including sex and dietary intake variables to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2407.04575 [pdf, other]

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Authors: Rubing Shen, Yanzhen Ren, Zongkun Sun

Abstract: Generative adversarial network (GAN) based vocoders have attracted significant attention in speech synthesis for their high quality and fast inference speed. However, there still exist many noticeable spectral artifacts, resulting in the quality decline of synthesized speech. In this work, we adopt a novel GAN-based vocoder designed for few artifacts and high fidelity, called FA-GAN. To suppress the aliasing artifacts caused by non-ideal upsampling layers in high-frequency components, we introduce the anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss to assist in the modeling of phase information. Experimental results reveal that FA-GAN outperforms the compared approaches in promoting audio quality and alleviating spectral artifacts, and exhibits superior performance when applied to unseen speaker scenarios.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.04216 [pdf, other]

Safe MPC Alignment with Human Directional Feedback

Authors: Zhixian Xie, Wenlong Zhang, Yi Ren, Zhaoran Wang, George J. Pappas, Wanxin Jin

Abstract: In safety-critical robot planning or control, manually specifying safety constraints or learning them from demonstrations can be challenging. In this paper, we propose a certifiable alignment method for a robot to learn a safety constraint in its model predictive control (MPC) policy with human online directional feedback. To our knowledge, it is the first method to learn safety constraints from human feedback. The proposed method is based on an empirical observation: human directional feedback, when available, tends to guide the robot toward safer regions. The method only requires the direction of human feedback to update the learning hypothesis space. It is certifiable, providing an upper bound on the total number of human feedback in the case of successful learning of safety constraints, or declaring the misspecification of the hypothesis space, i.e., the true implicit safety constraint cannot be found within the specified hypothesis space. We evaluated the proposed method using numerical examples and user studies in two developed simulation games. Additionally, we implemented and tested the proposed method on a real-world Franka robot arm performing mobile water-pouring tasks in a user study. The simulation and experimental results demonstrate the efficacy and efficiency of our method, showing that it enables a robot to successfully learn safety constraints with a small handful (tens) of human directional corrections.

Submitted 4 July, 2024; originally announced July 2024.

Comments: 18 pages, submission to T-RO

arXiv:2407.03000 [pdf, other]

VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

Authors: Zhe Hu, Yixiao Ren, Jing Li, Yu Yin

Abstract: This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02839 [pdf, other]

CRUISE on Quantum Computing for Feature Selection in Recommender Systems

Authors: Jiayang Niu, Jie Li, Ke Deng, Yongli Ren

Abstract: Using Quantum Computers to solve problems in Recommender Systems that classical computers cannot address is a worthwhile research topic. In this paper, we use Quantum Annealers to address the feature selection problem in recommendation algorithms. This feature selection problem is a Quadratic Unconstrained Binary Optimization (QUBO) problem. By incorporating Counterfactual Analysis, we significantly improve the performance of the item-based KNN recommendation algorithm compared to using pure Mutual Information. Extensive experiments have demonstrated that the use of Counterfactual Analysis holds great promise for addressing such problems.

Submitted 3 July, 2024; originally announced July 2024.

Comments: accepted by QuantumCLEF 2024

arXiv:2407.02598 [pdf, other]

AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction

Authors: Mustafa Khan, Hamidreza Fazlali, Dhruv Sharma, Tongtong Cao, Dongfeng Bai, Yuan Ren, Bingbing Liu

Abstract: Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse views. We propose AutoSplat, a framework employing Gaussian splatting to achieve highly realistic reconstructions of autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Visit our project page at https://autosplat.github.io/.

Submitted 3 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01816 [pdf, ps, other]

Asymptotic behaviors of subcritical branching killed Brownian motion with drift

Authors: Haojie Hou, Yan-Xia Ren, Renming Song, Yaping Zhu

Abstract: In this paper, we study asymptotic behaviors of a subcritical branching killed Brownian motion with drift $-ρ$ and offspring distribution $\{p_k:k\ge 0\}$. Let $\widetilde{ζ}^{-ρ}$ be the extinction time of this subcritical branching killed Brownian motion, $\widetilde{M}_t^{-ρ}$ the maximal position of all the particles alive at time $t$ and $\widetilde{M}^{-ρ}:=\max_{t\ge 0}\widetilde{M}_t^{-ρ}$ the all-time maximal position. Let $\mathbb{P}_x$ be the law of this subcritical branching killed Brownian motion when the initial particle is located at $x\in (0,\infty)$. Under the assumption $\sum_{k=1}^\infty k (\log k) p_k <\infty$, we establish the decay rates of $\mathbb{P}_x(\widetilde{ζ}^{-ρ}>t)$ and $\mathbb{P}_x(\widetilde{M}^{-ρ}>y)$ as $t$ and $y$ tend to $\infty$ respectively. We also establish the decay rate of $\mathbb{P}_x(\widetilde{M}_t^{-ρ}>z(t,ρ))$ as $t\to\infty$, where $z(t,ρ)=\sqrt{t}z-ρt$ for $ρ\leq 0$ and $z(t,ρ)=z$ for $ρ>0$. As a consequence, we obtain a Yaglom-type limit theorem.

Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00167 [pdf, other]

Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Authors: Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang

Abstract: In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Accepted for the AI Applications in Public Health and Social Services workshop at the 22nd International Conference on Artificial Intelligence in Medicine (AIME 2024)

arXiv:2407.00072 [pdf, other]

Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation

Authors: Yu Bai, Yukai Miao, Li Chen, Dan Li, Yanyu Ren, Hongtao Xie, Ce Yang, Xuhui Cai

Abstract: In Greek mythology, Pistis symbolized good faith, trust, and reliability. Drawing inspiration from these principles, Pistis-RAG is a scalable multi-stage framework designed to address the challenges of large-scale retrieval-augmented generation (RAG) systems. This framework consists of distinct stages: matching, pre-ranking, ranking, reasoning, and aggregating. Each stage contributes to narrowing the search space, prioritizing semantically relevant documents, aligning with the large language model's (LLM) preferences, supporting complex chain-of-thought (CoT) methods, and combining information from multiple sources. Our ranking stage introduces a significant innovation by recognizing that semantic relevance alone may not lead to improved generation quality, due to the sensitivity of the few-shot prompt order, as noted in previous research. This critical aspect is often overlooked in current RAG frameworks. We argue that the alignment issue between LLMs and external knowledge ranking methods is tied to the model-centric paradigm dominant in RAG systems. We propose a content-centric approach, emphasizing seamless integration between LLMs and external information sources to optimize content transformation for specific tasks. Our novel ranking stage is designed specifically for RAG systems, incorporating principles of information retrieval while considering the unique business scenarios reflected in LLM preferences and user feedback. We simulated feedback signals on the MMLU benchmark, resulting in a 9.3% performance improvement. Our model and code will be open-sourced on GitHub. Additionally, experiments on real-world, large-scale data validate the scalability of our framework.

Submitted 1 August, 2024; v1 submitted 21 June, 2024; originally announced July 2024.
