Search | arXiv e-print repository

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Authors: Zhiyuan Yan, Yandan Zhao, Shen Chen, Xinghe Fu, Taiping Yao, Shouhong Ding, Li Yuan

Abstract: Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with two-stream 3D-Conv with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods and show that our approach can generalize well to previously unseen forgery videos, even the just-released (in 2024) SoTAs. We release our code and pretrained weights at https://github.com/YZY-stack/StA4Deepfake.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.17052 [pdf, other]

Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Authors: Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, Chen Li

Abstract: The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we termed "blendfake", encouraging models to learn generic forgery artifacts like blending boundary. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from "real to blendfake to deepfake" to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between "real-to-fake" transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.16853 [pdf, other]

RIS-Aided Backscattering Tag-to-Tag Networks: Performance Analysis

Authors: Masoud Kaveh, Farshad Rostami Ghadi, Zheng Yan, Riku Jantti

Abstract: Backscattering tag-to-tag networks (BTTNs) represent a passive radio frequency identification (RFID) system that enables direct communication between tags within an external radio frequency (RF) field. However, low spectral efficiency and short-range communication capabilities, along with the ultra-low power nature of the tags, create significant challenges for reliable and practical applications of BTTNs. To address these challenges, this paper introduces integrating an indoor reconfigurable intelligent surface (RIS) into BTTN and studying RIS's impact on the system's performance. To that end, we first derive compact analytical expressions of the probability density function (PDF) and cumulative distribution function (CDF) for the received signal-to-noise ratio (SNR) at the receiver tag by exploiting the moment matching technique. Then, based on the derived PDF and CDF, we further derive analytical expressions of outage probability (OP), bit error rate (BER), and average capacity (AC) rate. Eventually, the Monte Carlo simulation is used to validate the accuracy of the analytical results, revealing that utilizing RIS can greatly improve the performance of BTTNs in terms of AC, BER, OP, and coverage region relative to traditional BTTNs setups that do not incorporate RIS.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.15487 [pdf, ps, other]

A strong structural stability of $C_{2k+1}$-free graphs

Authors: Zilong Yan, Yuejian Peng

Abstract: Füredi and Gunderson showed that $ex(n, C_{2k+1})$ is achieved only on $K_{\lfloor\frac{n}{2}\rfloor, \lceil\frac{n}{2}\rceil}$ if $n\ge 4k-2$. It is natural to study how far a $C_{2k+1}$-free graph is from being bipartite. Let $T^*(r, n)$ be obtained by adding a suspension $K_{r}$ with $1$ suspension point to $K_{\lfloor\frac{n-r+1}{2}\rfloor, \lceil\frac{n-r+1}{2}\rceil}$. We show that for integers $r, k$ with $3\le r\le 2k-4$ and $n\ge 20(r+2)^2k$, if $G$ is a $C_{2k+1}$-free $n$-vertex graph with $e(G)\ge e(T^*(r, n))$, then $G$ is obtained by adding suspensions to a bipartite graph one by one and the total number of vertices in all suspensions minus intersection points is no more than $r-1$. In other words, $G=B\bigcup\limits_{i=1}^p G_i$, where $B$ is a bipartite graph, $G_1$ is a suspension to $B$, $G_j$ is a suspension to $B\bigcup\limits_{i=1}^{j-1} G_i$ for $2\le j\le p$ and $\sum\limits_{i=1}^p \vert V(G_i)-V(G_i)\cap V(B\bigcup\limits_{i=1}^{j-1} G_i) \vert\le r-1$. Furthermore, $\sum\limits_{i=1}^p \vert V(G_i)-V(G_i)\cap V(B\bigcup\limits_{i=1}^{j-1} G_i) \vert= r-1$ if and only if $G=T^*(r, n)$. Let $d_2(G)=\min\{|T|: T\subseteq V(G), G-T \ \text{is bipartite}\}$ and $\gamma_2(G)=\min\{|E|: E\subseteq E(G), G-E \ \text{is bipartite}\}$. Our structural stability result implies that $d_2(G)\le r-1$ and $\gamma_2(G)\le {\lceil\frac{r}{2}\rceil \choose 2}+{\lfloor\frac{r}{2}\rfloor \choose 2}$ under the same condition, which is a recent result of Ren-Wang-Wang-Yang [SIAM J. Discrete Math. 38 (2024)]. They proved $d_2(G)\le r-1$ and $\gamma_2(G)\le {\lceil\frac{r}{2}\rceil \choose 2}+{\lfloor\frac{r}{2}\rfloor \choose 2}$ separately. We introduce a new concept strong-$2k$-core which is the key that we can give a stronger structural stability result but a simpler proof.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.12301 [pdf, other]

Compact star in noninteger power model of $f(R)$ gravity

Authors: Yong-Xiang Cui, Zu Yan, Kota Numajiri, Taishi Katsuragawa, Shin'ichi Nojiri

Abstract: We investigate compact stars in the noninteger power (NIP) model of $f(R)$ gravity theory, which includes the higher-curvature correction to the Einstein-Hilbert action. The mass-radius relation of the compact stars in the NIP model predicts large deviations from those in the general relativity in the low-mass region, potentially allowing us to test the NIP model by future astrophysical observations. We also study the nonvanishing scalar hair surrounding the compact star and demonstrate that the chameleon mechanism works efficiently. They result in distinct scalar profiles inside and outside the star, which implies screening the fifth force mediated by the scalar field.

Submitted 22 August, 2024; originally announced August 2024.

Comments: 35 pages, 27 figures

Report number: KEK-TH-2646, KEK-Cosmo-0354

arXiv:2408.08228 [pdf, other]

Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

Authors: Zixuan Pan, Jun Xia, Zheyu Yan, Guoyue Xu, Yawen Wu, Zhenge Jia, Jianxu Chen, Yiyu Shi

Abstract: Reconstruction-based methods, particularly those leveraging autoencoders, have been widely adopted to perform anomaly detection in brain MRI. While most existing works try to improve detection accuracy by proposing new model structures or algorithms, we tackle the problem through image quality assessment, an underexplored perspective in the field. We propose a fusion quality loss function that combines Structural Similarity Index Measure loss with l1 loss, offering a more comprehensive evaluation of reconstruction quality. Additionally, we introduce a data pre-processing strategy that enhances the average intensity ratio (AIR) between normal and abnormal regions, further improving the distinction of anomalies. By fusing the aforementioned two methods, we devise the image quality assessment (IQA) approach. The proposed IQA approach achieves significant improvements (>10%) in terms of Dice coefficient (DICE) and Area Under the Precision-Recall Curve (AUPRC) on the BraTS21 (T2, FLAIR) and MSULB datasets when compared with state-of-the-art methods. These results highlight the importance of invoking the comprehensive image quality assessment in medical anomaly detection and provide a new perspective for future research in this field.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.07197 [pdf, other]

Hybrid Magnonics with Localized Spoof Surface Plasmon Polaritons

Authors: Yuzan Xiong, Andrew Christy, Zixin Yan, Amin Pishehvar, Muntasir Mahdi, Junming Wu, James F. Cahoon, Binbin Yang, Michael C. Hamilton, Xufeng Zhang, Wei Zhang

Abstract: Hybrid magnonic systems have emerged as a promising direction for information propagation with preserved coherence. Due to high tunability of magnons, their interactions with microwave photons can be engineered to probe novel phenomena based on strong photon-magnon coupling. Improving the photon-magnon coupling strength can be done by tuning the structure of microwave resonators to better interact with the magnon counterpart. Planar resonators have been explored due to their potential for on-chip integration, but only common modes from stripline-based resonators have been used. Here, we present a microwave spiral resonator supporting the spoof localized surface plasmons (LSPs) and implement it to the investigation of photon-magnon coupling for hybrid magnonic applications. We showcase strong magnon-LSP photon coupling using a ferrimagnetic yttrium iron garnet sphere. We discuss the dependence of the spiral resonator design to the engineering capacity of the photon mode frequency and spatial field distributions, via both experiment and simulation. By the localized photon mode profiles, the resulting magnetic field concentrates near the surface dielectrics, giving rise to an enhanced magnetic filling factor. The strong coupling and large engineering space render the spoof LSPs an interesting contender in developing novel hybrid magnonic systems and functionalities.

Submitted 13 August, 2024; originally announced August 2024.

Comments: 13 pages, 13 figures

arXiv:2408.06779 [pdf, other]

ED$^4$: Explicit Data-level Debiasing for Deepfake Detection

Authors: Jikang Cheng, Ying Zhang, Qin Zou, Zhiyuan Yan, Chao Liang, Zhongyuan Wang, Chen Li

Abstract: Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the discovered content and specific-forgery bias, we reveal a novel spatial bias, where detectors inertly anticipate observing structural forgery clues appearing at the image center, also can lead to the poor generalization of existing methods. We present ED$^4$, a simple and effective strategy, to address aforementioned biases explicitly at the data level in a unified framework rather than implicit disentanglement via network design. In particular, we develop ClockMix to produce facial structure preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted feature to be consistent. As a model-agnostic debiasing strategy, ED$^4$ is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2408.06550 [pdf, other]

Stretch or Vibrate? Rendering Spatial Information of Static and Moving Objects in VR via Haptic Feedback for Blind People

Authors: Jiasheng Li, Zining Zhang, Zeyu Yan, Yuhang Zhao, Huaishu Peng

Abstract: Perceiving spatial information of a virtual object (e.g., direction, distance) is critical yet challenging for blind users seeking an immersive virtual reality experience. To facilitate VR accessibility for blind users, in this paper, we investigate the effectiveness of two types of haptic cues--vibrotactile and skin-stretch cues--in conveying the spatial information of a virtual object when applied to the dorsal side of a blind user's hand. We conducted a user study with 10 blind users to investigate how they perceive static and moving objects in VR with a custom-made haptic apparatus. Our results reveal that blind users can more accurately understand an object's location and movement when receiving skin-stretch cues, as opposed to vibrotactile cues. We discuss the pros and cons of both types of haptic cues and conclude with design recommendations for future haptic solutions for VR accessibility.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.03286 [pdf, other]

Biomedical SAM 2: Segment Anything in Biomedical Images and Videos

Authors: Zhiling Yan, Weixiang Sun, Rong Zhou, Zhengqing Yuan, Kai Zhang, Yiwei Li, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun

Abstract: Medical image segmentation and video object segmentation are essential for diagnosing and analyzing diseases by identifying and measuring biological structures. Recent advances in natural domain have been driven by foundation models like the Segment Anything Model 2 (SAM-2). To explore the performance of SAM-2 in biomedical applications, we designed three evaluation pipelines for single-frame 2D image segmentation, multi-frame 3D image segmentation and multi-frame video segmentation with varied prompt designs, revealing SAM-2's limitations in medical contexts. Consequently, we developed BioSAM-2, an enhanced foundation model optimized for biomedical data based on SAM-2. Our experiments show that BioSAM-2 not only surpasses the performance of existing state-of-the-art foundation models but also matches or even exceeds specialist models, demonstrating its efficacy and potential in the medical domain.

Submitted 17 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

arXiv:2408.03285 [pdf, other]

doi 10.1145/3654777.3676440

JetUnit: Rendering Diverse Force Feedback in Virtual Reality Using Water Jets

Authors: Zining Zhang, Jiasheng Li, Zeyu Yan, Jun Nishida, Huaishu Peng

Abstract: We propose JetUnit, a water-based VR haptic system designed to produce force feedback with a wide spectrum of intensities and frequencies through water jets. The key challenge in designing this system lies in optimizing parameters to enable the haptic device to generate force feedback that closely replicates the most intense force produced by direct water jets while ensuring the user remains dry. In this paper, we present the key design parameters of the JetUnit wearable device determined through a set of quantitative experiments and a perception study. We further conducted a user study to assess the impact of integrating our haptic solutions into virtual reality experiences. The results revealed that, by adhering to the design principles of JetUnit, the water-based haptic system is capable of delivering diverse force feedback sensations, significantly enhancing the immersive experience in virtual reality.

Submitted 6 August, 2024; originally announced August 2024.

Journal ref: ACM UIST 2024

arXiv:2408.01960 [pdf, other]

AnomalySD: Few-Shot Multi-Class Anomaly Detection with Stable Diffusion Model

Authors: Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su

Abstract: Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume the availability of sufficient normal data for training. This assumption may not hold true due to the cost of labeling or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, we seek help from Stable Diffusion (SD) model due to its capability of zero/few-shot inpainting, which can be leveraged to inpaint anomalous regions as normal. In this paper, a few-shot multi-class anomaly detection framework that adopts Stable Diffusion model is proposed, named AnomalySD. To adapt SD to anomaly detection task, we design different hierarchical text descriptions and the foreground mask mechanism for fine-tuning SD. In the inference stage, to accurately mask anomalous regions for inpainting, we propose multi-scale mask strategy and prototype-guided mask strategy to handle diverse anomalous regions. Hierarchical text prompts are also utilized to guide the process of inpainting in the inference stage. The anomaly score is estimated based on inpainting result of all masks. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach. We achieved anomaly classification and segmentation results of 93.6%/94.8% AUROC on the MVTec-AD dataset and 86.1%/96.5% AUROC on the VisA dataset under multi-class and one-shot settings.

Submitted 4 August, 2024; originally announced August 2024.

Comments: 8 pages, 4 figures

arXiv:2408.01618 [pdf, ps, other]

Magnetic order-dependent giant tunneling magnetoresistance and electroresistance in van der Waals antiferromagnetic-multiferroic tunnel junctions

Authors: Zhi Yan, Dan Qiao, Wentian Lu, Xinlong Dong, Xiaohong Xu

Abstract: Antiferromagnetic spintronics exhibits ultra-high operational speed and stability in a magnetic field, holding promise for the realization of next-generation ultra-high-speed magnetic storage. However, theoretical exploration of the electronic transport properties of antiferromagnetic-multiferroic tunnel junction (AMFTJ) devices remains largely unexplored. Here, we design an antiferromagnet/ferroelectric barrier/antiferromagnet van der Waals heterojunction, renamed vdW AMFTJ, using a bilayer MnBi$_2$Te$_4$/In$_2$Se$_3$/bilayer MnBi$_2$Te$_4$ (MBT-2L/IS/MBT-2L) as the prototype. Based on first-principles calculations using the nonequilibrium Green's function method combined with density functional theory, we theoretically investigate the spin-resolved electronic transport properties of this AMFTJ. By manipulating the various possible magnetization directions of the multilayer antiferromagnetic MnBi$_2$Te$_4$ and the ferroelectric polarization direction of the In$_2$Se$_3$ within the junction, sixteen distinct non-volatile resistance states can be revealed and manipulated by applying external biaxial strain and bias voltage. We predict maximum tunneling magnetoresistance (electroresistance) values of $3.79\times10^{4}$% ($2.41\times10^{5}$%) in the equilibrium state, which can increase up to $5.01\times10^{5}$% ($4.97\times10^{5}$%) under external bias voltage. Furthermore, the perfect spin filtering effect is also present in our AMFTJ. Our results highlight the tremendous potential of the MBT-2L/IS/MBT-2L vdW AMFTJ in non-volatile memory, expanding the application avenues for antiferromagnetic spintronic devices.

Submitted 2 August, 2024; originally announced August 2024.

arXiv:2408.01607 [pdf]

Deep Learning Meets OBIA: Tasks, Challenges, Strategies, and Perspectives

Authors: Lei Ma, Ziyun Yan, Mengmeng Li, Tao Liu, Liqin Tan, Xuan Wang, Weiqiang He, Ruikun Wang, Guangjun He, Heng Lu, Thomas Blaschke

Abstract: Deep learning has gained significant attention in remote sensing, especially in pixel- or patch-level applications. Despite initial attempts to integrate deep learning into object-based image analysis (OBIA), its full potential remains largely unexplored. In this article, as OBIA usage becomes more widespread, we conducted a comprehensive review and expansion of its task subdomains, with or without the integration of deep learning. Furthermore, we have identified and summarized five prevailing strategies to address the challenge of deep learning's limitations in directly processing unstructured object data within OBIA, and this review also recommends some important future research directions. Our goal with these endeavors is to inspire more exploration in this fascinating yet overlooked area and facilitate the integration of deep learning into OBIA processing workflows.

Submitted 2 August, 2024; originally announced August 2024.

arXiv:2408.01246 [pdf, other]

MapComp: A Secure View-based Collaborative Analytics Framework for Join-Group-Aggregation

Authors: Xinyu Peng, Feng Han, Li Peng, Weiran Liu, Zheng Yan, Kai Kang, Xinyuan Zhang, Guoxing Wei, Jianling Sun, Jinfei Liu

Abstract: This paper introduces MapComp, a novel view-based framework to facilitate join-group-aggregation (JGA) queries for collaborative analytics. Through specially crafted materialized view for join and novel design of group-aggregation (GA) protocols, MapComp removes duplicated join workload and expedites subsequent GA, improving the efficiency of JGA query execution. To support continuous data updates, our materialized view offers payload-independence feature and brings in significant efficiency improvement of view refreshing with free MPC overhead. This feature also allows further acceleration for GA, where we devised multiple novel protocols that outperform prior works. Notably, our work represents the first endeavor to expedite secure collaborative JGA queries using materialized views. Our experiments demonstrate a significant advantage of MapComp, achieving up to a 2189.9x efficiency improvement compared to the non-view based baseline when executing queries eight times.

Submitted 15 August, 2024; v1 submitted 2 August, 2024; originally announced August 2024.

Comments: 12 pages

arXiv:2408.01077 [pdf, other]

PhysMamba: State Space Duality Model for Remote Physiological Measurement

Authors: Zhixin Yan, Yan Zhong, Hongbin Xu, Wenjun Zhang, Lin Shu, Hongbin Xu, Wenxiong Kang

Abstract: Remote Photoplethysmography (rPPG) is a non-contact technique for extracting physiological signals from facial videos, used in applications like emotion monitoring, medical assistance, and anti-face spoofing. Unlike controlled laboratory settings, real-world environments often contain motion artifacts and noise, affecting the performance of existing rPPG methods. To address this, we propose PhysMamba, a dual-Pathway time-frequency interaction model via State Space Duality. This method allows the network to learn richer, more representative features, enhancing robustness in noisy conditions. To facilitate information exchange and feature complementation between the two pathways, we design an improved algorithm: Cross-Attention State Space Duality (CASSD). We conduct comparative experiments on the PURE, UBFC-rPPG, and MMPD datasets. Experimental results show that PhysMamba achieves state-of-the-art performance, particularly in complex environments, demonstrating its potential in practical remote physiological signal measurement applications.

Submitted 17 August, 2024; v1 submitted 2 August, 2024; originally announced August 2024.

arXiv:2407.21783 [pdf, other]

The Llama 3 Herd of Models

Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.21415 [pdf, other]

In situ Qubit Frequency Tuning Circuit for Scalable Superconducting Quantum Computing: Scheme and Experiment

Authors: Lei Jiang, Yu Xu, Shaowei Li, Zhiguang Yan, Ming Gong, Tao Rong, Chenyin Sun, Tianzuo Sun, Tao Jiang, Hui Deng, Chen Zha, Jin Lin, Fusheng Chen, Qingling Zhu, Yangsen Ye, Hao Rong, Kai Yan, Sirui Cao, Yuan Li, Shaojun Guo, Haoran Qian, Yisen Hu, Yulin Wu, Yuhuai Li, Gang Wu, et al. (8 additional authors not shown)

Abstract: Frequency-tunable qubits play a significant role in scalable superconducting quantum processors. The state-of-the-art room-temperature electronics for tuning qubit frequency suffer from scalability limits, such as heating problems and the linear growth of control cables. Here we propose a scalable scheme to tune the qubit frequency by using an in situ superconducting circuit, which is based on a radio frequency superconducting quantum interference device (rf-SQUID). We demonstrate both theoretically and experimentally that the qubit frequency can be modulated by inputting several single pulses into the rf-SQUID. Compared with the traditional scheme, our scheme not only solves the heating problem, but also provides the potential to exponentially reduce the number of cables inside the dilution refrigerator and the room-temperature electronics resources for tuning qubit frequency, which is achieved by a time-division-multiplex (TDM) scheme combining rf-SQUIDs with switch arrays. With such a TDM scheme, the number of cables could be reduced from the usual $\sim 3n$ to $\sim \log_2{(3n)} + 1$ for two-dimensional quantum processors comprising $n$ qubits and $\sim 2n$ couplers. Our work paves the way for large-scale control of superconducting quantum processors.

Submitted 31 July, 2024; originally announced July 2024.

Comments: 9 pages, 6 figures
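
The cable-count scaling quoted in the abstract above can be made concrete with a short calculation. The sketch below only evaluates the two approximate expressions, $\sim 3n$ and $\sim \log_2{(3n)} + 1$, for a few processor sizes. The function names are hypothetical, and because both formulas are order-of-magnitude estimates, the output should be read as a rough comparison rather than an exact cable count.

```python
import math


def cables_conventional(n: int) -> int:
    """Approximate cable count for the usual scheme: ~3n (n qubits plus ~2n couplers)."""
    return 3 * n


def cables_tdm(n: int) -> float:
    """Approximate cable count for the TDM scheme: ~log2(3n) + 1."""
    return math.log2(3 * n) + 1


if __name__ == "__main__":
    # Compare the two estimates for a few illustrative processor sizes.
    for n in (10, 100, 1000):
        print(f"n={n}: conventional ~{cables_conventional(n)}, TDM ~{cables_tdm(n):.1f}")
```

In practice a cable count must be an integer, so a real design would round the TDM estimate up; how the rounding is done is an assumption not stated in the abstract.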

arXiv:2407.20262 [pdf]

A Neural-Network-Embedded Equivalent Circuit Model for Lithium-ion Battery State Estimation

Authors: Zelin Guo, Yiyan Li, Zheng Yan, Mo-Yuen Chow

Abstract: Equivalent Circuit Model (ECM) has been widely used in battery modeling and state estimation because of its simplicity, stability and interpretability. However, ECM may generate large estimation errors in extreme working conditions such as freezing environment temperature and complex charging/discharging behaviors, in which scenarios the electrochemical characteristics of the battery become extremely complex and nonlinear. In this paper, we propose a hybrid battery model by embedding neural networks as 'virtual electronic components' into the classical ECM to enhance the model nonlinear-fitting ability and adaptability. First, the structure of the proposed hybrid model is introduced, where the embedded neural networks are targeted to fit the residuals of the classical ECM. Second, an iterative offline training strategy is designed to train the hybrid model by merging the battery state space equation into the neural network loss function. Last, the battery online state of charge (SOC) estimation is achieved based on the proposed hybrid model to demonstrate its application value. Simulation results based on a real-world battery dataset show that the proposed hybrid model can achieve 29%-64% error reduction for SOC estimation under different operating conditions at varying environment temperatures.

Submitted 24 July, 2024; originally announced July 2024.

Comments: 8 pages

arXiv:2407.18866 [pdf, other]

A Comment on Deriving the Gibbons-Hawking-York Term From the String Worldsheet

Authors: Amr Ahmadain, Vasudev Shyam, Zihan Yan

Abstract: In this note, we show that the noncovariant metric boundary term obtained from the nonlinear sigma model worldsheet derivation of the bulk off-shell sphere partition function is closely related to the Einstein boundary term in the Gamma-Gamma noncovariant action. In fact, when expressed in terms of the trace of the extrinsic curvature tensor, we illustrate that this boundary term has one-half the coefficient of the Gibbons-Hawking-York boundary term required such that the total (bulk plus boundary) off-shell classical action has a well-posed variational principle with Dirichlet boundary conditions.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 11 pages

arXiv:2407.16260 [pdf, other]

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Authors: Zizheng Yan, Jiapeng Zhou, Fanpeng Meng, Yushuang Wu, Lingteng Qiu, Zisheng Ye, Shuguang Cui, Guanying Chen, Xiaoguang Han

Abstract: Text-to-3D generation has recently seen significant progress. To enhance its practicality in real-world applications, it is crucial to generate multiple independent objects with interactions, similar to layer-compositing in 2D image editing. However, existing text-to-3D methods struggle with this task, as they are designed to generate either non-independent objects or independent objects lacking spatially plausible interactions. Addressing this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent objects with interactions. DreamDissector accepts a multi-object text-to-3D NeRF as input and produces independent textured meshes. To achieve this, we introduce the Neural Category Field (NeCF) for disentangling the input NeRF. Additionally, we present the Category Score Distillation Sampling (CSDS), facilitated by a Deep Concept Mining (DCM) module, to tackle the concept gap issue in diffusion models. By leveraging NeCF and CSDS, we can effectively derive sub-NeRFs from the original scene. Further refinement enhances geometry and texture. Our experimental results validate the effectiveness of DreamDissector, providing users with novel means to control 3D synthesis at the object level and potentially opening avenues for various creative applications in the future.

Submitted 23 July, 2024; originally announced July 2024.

Comments: ECCV 2024. Project page: https://chester256.github.io/dreamdissector

arXiv:2407.14796 [pdf, other]

PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates

Authors: Junjie Shi, Caozhi Shang, Zhaobin Sun, Li Yu, Xin Yang, Zengqiang Yan

Abstract: Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which to design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code is available at https://github.com/Jun-Jie-Shi/PASSION.

Submitted 20 July, 2024; originally announced July 2024.

Comments: Accepted by ACM MM 2024

arXiv:2407.14769 [pdf, other]

A Two-Phase Visualization System for Continuous Human-AI Collaboration in Sequelae Analysis and Modeling

Authors: Yang Ouyang, Chenyang Zhang, He Wang, Tianle Ma, Chang Jiang, Yuheng Yan, Zuoqin Yan, Xiaojuan Ma, Chuhan Shi, Quan Li

Abstract: In healthcare, AI techniques are widely used for tasks like risk assessment and anomaly detection. Despite AI's potential as a valuable assistant, its role in complex medical data analysis often oversimplifies human-AI collaboration dynamics. To address this, we collaborated with a local hospital, engaging six physicians and one data scientist in a formative study. From this collaboration, we propose a framework integrating two-phase interactive visualization systems: one for Human-Led, AI-Assisted Retrospective Analysis and another for AI-Mediated, Human-Reviewed Iterative Modeling. This framework aims to enhance understanding and discussion around effective human-AI collaboration in healthcare.

Submitted 20 July, 2024; originally announced July 2024.

Comments: To appear at the IEEE VIS Conference 2024

arXiv:2407.13691 [pdf, other]

Unsupervised and Interpretable Synthesizing for Electrical Time Series Based on Information Maximizing Generative Adversarial Nets

Authors: Zhenghao Zhou, Yiyan Li, Runlong Liu, Zheng Yan, Mo-Yuen Chow

Abstract: Generating synthetic data has become a popular alternative solution to deal with the difficulties in accessing and sharing field measurement data in power systems. However, to make the generation results controllable, existing methods (e.g. Conditional Generative Adversarial Nets, cGAN) require a labeled dataset to train the model, which is demanding in practice because much field measurement data lacks descriptive labels. In this paper, we introduce the Information Maximizing Generative Adversarial Nets (infoGAN) to achieve interpretable feature extraction and controllable synthetic data generation based on an unlabeled electrical time series dataset. Features with clear physical meanings can be automatically extracted by maximizing the mutual information between the input latent code and the classifier output of infoGAN. Then the extracted features are used to control the generation results, similar to a vanilla cGAN framework. The case study is based on time series datasets of power load and renewable energy output. Results demonstrate that infoGAN can extract both discrete and continuous features with clear physical meanings, as well as generate realistic synthetic time series that satisfy given features.

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.13338 [pdf, other]

Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

Authors: Baicheng Li, Zike Yan, Dong Wu, Hanqing Jiang, Hongbin Zha

Abstract: Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to the expressive representation power and the innovative paradigm of continual learning. However, deploying such a system within a dynamic environment has not been well-studied. Such challenges are intractable even for conventional algorithms since observations from different views with dynamic objects involved break the geometric and photometric consistency, whereas the consistency lays the foundation for jointly optimizing the camera pose and the map parameters. In this paper, we best exploit the characteristics of continual learning and propose a novel SLAM framework for dynamic environments. While past efforts have been made to avoid catastrophic forgetting by exploiting an experience replay strategy, we view forgetting as a desirable characteristic. By adaptively controlling the replayed buffer, the ambiguity caused by moving objects can be easily alleviated through forgetting. We restrain the replay of the dynamic objects by introducing a continually-learned classifier for dynamic object identification. The iterative optimization of the neural map and the classifier notably improves the robustness of the SLAM system under a dynamic environment. Experiments on challenging datasets verify the effectiveness of the proposed framework.

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.12446 [pdf, other]

Non-parametric regularization for class imbalance federated medical image classification

Authors: Jeffry Wicaksana, Zengqiang Yan, Kwang-Ting Cheng

Abstract: Limited training data and severe class imbalance pose significant challenges to developing clinically robust deep learning models. Federated learning (FL) addresses the former by enabling different medical clients to collaboratively train a deep model without sharing privacy-sensitive data. However, class imbalance worsens due to variation in inter-client class distribution. We propose federated learning with non-parametric regularization (FedNPR and FedNPR-Per, a personalized version of FedNPR) to regularize the feature extractor and enhance useful and discriminative signal in the feature space. Our extensive experiments show that FedNPR outperforms the existing state-of-the-art FL approaches in class imbalance skin lesion classification and intracranial hemorrhage identification. Additionally, the non-parametric regularization module consistently improves the performance of existing state-of-the-art FL approaches. We believe that NPR is a valuable tool in FL under clinical settings.

Submitted 17 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.00738

arXiv:2407.12441 [pdf, ps, other]

Dynamics of discrete solitons in the fractional discrete nonlinear Schrödinger equation with the quasi-Riesz derivative

Authors: Ming Zhong, Boris A. Malomed, Zhenya Yan

Abstract: We elaborate a fractional discrete nonlinear Schrödinger (FDNLS) equation based on an appropriately modified definition of the Riesz fractional derivative, which is characterized by its Lévy index (LI). This FDNLS equation represents a novel discrete system, in which the nearest-neighbor coupling is combined with long-range interactions, that decay as the inverse square of the separation between lattice sites. The system may be realized as an array of parallel quasi-one-dimensional Bose-Einstein condensates composed of atoms or small molecules carrying, respectively, a permanent magnetic or electric dipole moment. The dispersion relation (DR) for lattice waves and the corresponding propagation band in the system's linear spectrum are found in an exact form for all values of LI. The DR is consistent with the continuum limit, differing in the range of wavenumbers. Formation of single-site and two-site discrete solitons is explored, starting from the anti-continuum limit and continuing the analysis in the numerical form up to the existence boundary of the discrete solitons. Stability of the solitons is identified in terms of eigenvalues for small perturbations, and verified in direct simulations. Mobility of the discrete solitons is considered too, by means of an estimate of the system's Peierls-Nabarro potential barrier, and with the help of direct simulations. Collisions between persistently moving discrete solitons are also studied.

Submitted 17 July, 2024; originally announced July 2024.

Comments: 15 pages, 8 figures (to be published in Phys. Rev. E, 2024)

arXiv:2407.12280 [pdf, ps, other]

Juhl type formulas for curved Ovsienko--Redou operators

Authors: Shane Chern, Zetian Yan

Abstract: We prove Juhl type formulas for the curved Ovsienko--Redou operators and their linear analogues, which indicate the associated formal self-adjointness, thereby confirming two conjectures of Case, Lin, and Yuan. We also offer an extension of Juhl's original formula for the GJMS operators.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 37 pages. Comments are welcome

MSC Class: Primary 58J70; Secondary 53A40; 33C20

arXiv:2407.10166 [pdf, other]

A general theory for infernal points in non-Hermitian systems

Authors: Shu-Xuan Wang, Zhongbo Yan

Abstract: The coalescence of eigenstates is a unique phenomenon in non-Hermitian systems. Remarkably, it has been noticed in some non-Hermitian systems under open boundary conditions that the whole set of eigenstates can coalesce to only a few eigenstates. In the parameter space, the point at which such a coalescence of macroscopic eigenstates occurs is dubbed an infernal point. In this paper, based on the non-Bloch band theory and amoeba formulation, we establish the criteria for the presence of infernal points in one-dimensional and higher dimensional open-boundary non-Hermitian systems. In addition, we find an explanation of the extreme localization of the wave functions and unveil the mechanism for the coalescence of enormous eigenstates at the infernal points. Our work provides a general theory for infernal points in open-boundary non-Hermitian systems in arbitrary dimensions, and hence paves the way to study the intriguing infernal points systematically.

Submitted 14 July, 2024; originally announced July 2024.

Comments: 7+9 pages, 2+3 figures

arXiv:2407.08421 [pdf, other]

X-ray spectral and timing evolution during the 2018 outburst of MAXI J1820+070

Authors: YaXing Li, Zhen Yan, ChenXu Gao, Wenfei Yu

Abstract: We made use of high-cadence observations from $Insight$-HXMT and $NICER$ to scrutinize the spectral and timing evolution during the 2018 outburst of the black hole X-ray binary (BHXRB) MAXI J1820+070. Its hardness-intensity diagram (HID) displays a "q"-like track including all the spectral states, along with a unique loop in the hard state. The tracks observed in the HID are anticipated in the evolution of the components responsible for Compton and reflection emission. This is substantiated by the relationship between the X-ray luminosity $L_\mathrm{X}$ and photon index $\Gamma$, as well as the relationship between X-ray luminosity $L_\mathrm{X}$ and the ratio of Compton to disk luminosities $L_\mathrm{C}/L_\mathrm{D}$. Both of these relationships exhibit a pattern reminiscent of the HID. During the hard state, the hardness (also $\Gamma$) is determined by either the reflection component ($R_{f}>1$) or the Compton component ($R_{f}<1$), depending on the value of the reflection fraction $R_{f}$. So the distinctive evolution of $R_{f}$ leads to the unique loop in the HID (also in the $L_\mathrm{X}$--$\Gamma$ plane) of the hard state. Additionally, we found a negative correlation between the frequency of the type-C quasi-periodic oscillation (QPO) ($\nu_{\mathrm{C,QPO}}$) and the optical depth of the Compton emission ($\tau$), and a positive correlation between $\nu_{\mathrm{C,QPO}}$ and $\Gamma$. These correlations strongly suggest a coupling between the QPO properties and the underlying process responsible for Comptonization.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 14 pages, 10 figures, submitted to MNRAS

arXiv:2407.05407 [pdf, other]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.

Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

arXiv:2407.04942 [pdf, other]

FOSP: Fine-tuning Offline Safe Policy through World Models

Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang

Abstract: Model-based Reinforcement Learning (RL) has shown its high training efficiency and capability of handling high-dimensional tasks. Regarding safety issues, safe model-based RL can achieve nearly zero-cost performance and effectively manage the trade-off between performance and safety. Nevertheless, prior works still pose safety challenges due to the online exploration in real-world deployment. To address this, some offline RL methods have emerged as solutions, which learn from a static dataset in a safe way by avoiding interactions with the environment. In this paper, we aim to further enhance safety during the deployment stage for vision-based robotic tasks by fine-tuning an offline-trained policy. We incorporate in-sample optimization, model-based policy expansion, and reachability guidance to construct a safe offline-to-online framework. Moreover, our method proves to improve the generalization of offline policy in unseen safety-constrained scenarios. Finally, the efficiency of our method is validated on simulation benchmarks with five vision-only tasks and a real robot by solving some deployment problems using limited data.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 21 pages

arXiv:2407.04242 [pdf, other]

Fine-grained Context and Multi-modal Alignment for Freehand 3D Ultrasound Reconstruction

Authors: Zhongnuo Yan, Xin Yang, Mingyuan Luo, Jiongquan Chen, Rusi Chen, Lian Liu, Dong Ni

Abstract: Fine-grained spatio-temporal learning is crucial for freehand 3D ultrasound reconstruction. Previous works mainly resorted to coarse-grained spatial features and separated temporal dependency learning, and struggled with fine-grained spatio-temporal learning. Mining spatio-temporal information in fine-grained scales is extremely challenging due to learning difficulties in long-range dependencies. In this context, we propose a novel method to exploit the long-range dependency management capabilities of the state space model (SSM) to address the above challenge. Our contribution is three-fold. First, we propose ReMamba, which mines multi-scale spatio-temporal information by devising a multi-directional SSM. Second, we propose an adaptive fusion strategy that introduces multiple inertial measurement units as auxiliary temporal information to enhance spatio-temporal perception. Last, we design an online alignment strategy that encodes the temporal information as pseudo labels for multi-modal alignment to further improve reconstruction performance. Extensive experimental validations on two large-scale datasets show remarkable improvement from our method over competitors.

Submitted 5 July, 2024; originally announced July 2024.

Comments: Accepted at MICCAI 2024. This is the submitted manuscript and the preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections

arXiv:2407.04051 [pdf, other]

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.

Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Work in progress. Authors are listed in alphabetical order by family name

arXiv:2407.03699 [pdf, other]

Generalized Robust Fundus Photography-based Vision Loss Estimation for High Myopia

Authors: Zipei Yan, Zhile Liang, Zhengji Liu, Shuai Wang, Rachel Ka-Man Chun, Jizhou Li, Chea-su Kee, Dong Liang

Abstract: High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited availability of VF data, existing VF estimation models fail to generalize well, particularly when facing out-of-distribution data across diverse centers and populations. To tackle this challenge, we propose a novel, parameter-efficient framework to enhance the generalized robustness of VF estimation on both in- and out-of-distribution data. Specifically, we design a Refinement-by-Denoising (RED) module for feature refinement and adaptation from pretrained vision models, aiming to learn high-entropy feature representations and to mitigate the domain gap effectively and efficiently. Through independent validation on two distinct real-world datasets from separate centers, our method significantly outperforms existing approaches in RMSE, MAE and correlation coefficient for both internal and external validation. Our proposed framework benefits both in- and out-of-distribution VF estimation, offering significant clinical implications and potential utility in real-world ophthalmic practices.

Submitted 17 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted by MICCAI 2024, code: https://github.com/yanzipei/VF_RED

arXiv:2407.02280 [pdf, other]

FedIA: Federated Medical Image Segmentation with Heterogeneous Annotation Completeness

Authors: Yangyang Xiang, Nannan Wu, Li Yu, Xin Yang, Kwang-Ting Cheng, Zengqiang Yan

Abstract: Federated learning has emerged as a compelling paradigm for medical image segmentation, particularly in light of increasing privacy concerns. However, most of the existing research relies on relatively stringent assumptions regarding the uniformity and completeness of annotations across clients. Contrary to this, this paper highlights a prevalent challenge in medical practice: incomplete annotations. Such annotations can introduce incorrectly labeled pixels, potentially undermining the performance of neural networks in supervised learning. To tackle this issue, we introduce a novel solution, named FedIA. Our insight is to conceptualize incomplete annotations as noisy data (i.e., low-quality data), with a focus on mitigating their adverse effects. We begin by evaluating the completeness of annotations at the client level using a designed indicator. Subsequently, we enhance the influence of clients with more comprehensive annotations and implement corrections for incomplete ones, thereby ensuring that models are trained on accurate data. Our method's effectiveness is validated through its superior performance on two extensively used medical image segmentation datasets, outperforming existing solutions. The code is available at https://github.com/HUSTxyy/FedIA.

Submitted 3 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: Early accepted by MICCAI 2024

arXiv:2406.18995 [pdf, other]

FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity

Authors: Zhaobin Sun, Nannan Wu, Junjie Shi, Li Yu, Xin Yang, Kwang-Ting Cheng, Zengqiang Yan

Abstract: Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly available medical datasets validate the superiority of FedMLP over both state-of-the-art federated semi-supervised and noisy label learning approaches under task heterogeneity. Code is available at https://github.com/szbonaldo/FedMLP.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Early accepted by MICCAI 2024

arXiv:2406.18361 [pdf, other]

Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg

Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

arXiv:2406.16168 [pdf, other]

An All-MLP Sequence Modeling Architecture That Excels at Copying

Authors: Chenwei Cui, Zehao Yan, Gedeon Muhawenayo, Hannah Kerner

Abstract: Recent work demonstrated Transformers' ability to efficiently copy strings of exponential sizes, distinguishing them from other architectures. We present the Causal Relation Network (CausalRN), an all-MLP sequence modeling architecture that can match Transformers on the copying task. Extending Relation Networks (RNs), we implemented key innovations to support autoregressive sequence modeling while maintaining computational feasibility. We discovered that exponentially-activated RNs are reducible to linear time complexity, and pre-activation normalization induces an infinitely growing memory pool, similar to a KV cache. In an ablation study, we found that both exponential activation and pre-activation normalization are indispensable for Transformer-level copying. Our findings provide new insights into what actually constitutes strong in-context retrieval.

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024 Next Generation of Sequence Modeling Architectures Workshop

arXiv:2406.15994 [pdf, other]

The delayed radio emission in the black hole X-ray binary MAXI J1348$-$630

Authors: Bei You, Shuai-kang Yang, Zhen Yan, Xinwu Cao, Andrzej A. Zdziarski

Abstract: We explore the coupling between the accretion flow and the jet in black hole X-ray binary (BHXRB) MAXI J1348-630 by analyzing the X-ray and radio observations during its 2019 outburst. We measure the time delay between the radio and Comptonization fluxes with the interpolated cross-correlation function. For the first time, we find that the radio emission lags behind the X-ray Comptonization emission by about 3 days during the rising phase covering the rising hard state and the following soft state. Such a long radio delay indicates that the Comptonization emission most likely originates from the advection-dominated accretion flow rather than the jet in this source. The Comptonization luminosity $L_{\rm C}$ in 0.1-100 keV and the radio luminosity $L_{\rm R}$ at 5.5 GHz, after considering the radio delay of $\sim 3$ days, follow the correlation with a slope $β= 3.04 \pm 0.93$, which is much steeper than the previously reported $β= 0.6$ or 1.40 using the total luminosity in the limited band (e.g., 1-10 keV) in the literature. This highlights the necessity of considering (1) the time delay, (2) the spectral decomposition, and (3) the broad energy band, in the radio-X-ray correlation analysis. As the jet reappears during the decaying phase (covering the soft state and the following decaying hard state) and the mini-outburst, the Comptonization and the radio emission appear to be almost simultaneous. Moreover, the radio-Compton correlation during the mini-outburst becomes shallow, with the correlation slope $β= 1.11 \pm 0.15$. These indicate an intrinsic difference in the accretion-jet coupling physics between the main outburst and the mini-outburst.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 10 pages, 4 figures, Accepted for publication in ApJ Letters

arXiv:2406.13495 [pdf, other]

DF40: Toward Next-Generation Deepfake Detection

Authors: Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Li Yuan, Chengjie Wang, Shouhong Ding, Yunsheng Wu

Abstract: We propose a new comprehensive benchmark to revolutionize the current deepfake detection field to the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we found the dataset (both train and test) can be the "primary culprit" due to: (1) forgery diversity: Deepfake techniques are commonly referred to as both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC). Most existing datasets only contain partial types, with limited forgery methods implemented; (2) forgery realism: The dominant training dataset, FF++, contains old forgery techniques from the past five years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection of nowadays' SoTA deepfakes; (3) evaluation protocol: Most detection works perform evaluations on one type, e.g., train and test on face-swapping only, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse and large-scale deepfake dataset called DF40, which comprises 40 distinct deepfake techniques. We then conduct comprehensive evaluations using 4 standard evaluation protocols and 7 representative detectors, resulting in over 2,000 evaluations. Through these evaluations, we analyze from various perspectives, leading to 12 new insightful findings contributing to the field. We also open up 5 valuable yet previously underexplored research questions to inspire future works.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13275 [pdf, other]

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.

Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.12477 [pdf, other]

An atypical low-frequency QPO detected in the hard state of MAXI J1348-630 with $Insight$-HXMT

Authors: Xin-Lei Wang, Zhen Yan, Fu-Guo Xie, Jun-Feng Wang, Ren-Yi Ma

Abstract: Based on the $Insight$-HXMT archival data, we have detected a new atypical low-frequency quasi-periodic oscillation (LFQPO) in the black hole X-ray binary MAXI J1348$-$630. The new LFQPO is detected in all the three instruments of $Insight$-HXMT with a combined significance of 3--5 $σ$, covering a wide energy range of 1--100 keV. The fractional root-mean-square (RMS) seems to decrease with energy. It exclusively appears in the hard state during both the main and mini outburst, spanning an X-ray intensity range by a factor of 10, and a very narrow hardness range. The frequency of this new type of LFQPO is moderately stable, in the range of 0.08--0.15 Hz. We discussed different models for the LFQPO, and found none is able to explain the observed properties of this new type of LFQPO.

Submitted 19 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: 20 pages, 6 figures. Accepted by ApJ

arXiv:2406.11495 [pdf, other]

Online Context Learning for Socially-compliant Navigation

Authors: Iaroslav Okunevich, Alexandre Lombard, Tomas Krajnik, Yassine Ruichek, Zhi Yan

Abstract: Robot social navigation needs to adapt to different human factors and environmental contexts. However, since these factors and contexts are difficult to predict and cannot be exhaustively enumerated, traditional learning-based methods have difficulty in ensuring the social attributes of robots in long-term and cross-environment deployments. This letter introduces an online context learning method that aims to empower robots to adapt to new social environments online. The proposed method adopts a two-layer structure. The bottom layer is built using a deep reinforcement learning-based method to ensure the output of basic robot navigation commands. The upper layer is implemented using an online robot learning-based method to socialize the control commands suggested by the bottom layer. Experiments using a community-wide simulator show that our method outperforms the state-of-the-art ones. Experimental results in the most challenging scenarios show that our method improves the performance of the state-of-the-art by 8%. The source code of the proposed method, the data used, and the tools for the per-training step will be publicly available at https://github.com/Nedzhaken/SOCSARL-OL.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 8 pages, 4 figures, 1 table, 1 algorithm

arXiv:2406.08698 [pdf, other]

Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: In this work we search for signals generated by ultra-heavy dark matter in the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma-ray emission from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter, as they have low fluxes of astrophysical $γ$-ray background while containing a large amount of dark matter. By analyzing more than 700 days of observational data at LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 12 figures, accepted by PRL

arXiv:2406.08563 [pdf, other]

Field-sensitive dislocation bound states in two-dimensional $d$-wave altermagnets

Authors: Di Zhu, Dongling Liu, Zheng-Yang Zhuang, Zhigang Wu, Zhongbo Yan

Abstract: When a two-dimensional $d$-wave altermagnet is grown on a substrate, the interplay of momentum-dependent spin splittings arising from altermagnetism and Rashba spin-orbit coupling gives rise to a nodal band structure with band degeneracies enforced by a $C_{4z}\mathcal{T}$ symmetry. If we break the $C_{4z}\mathcal{T}$ symmetry by an exchange field, the band degeneracies are found to be immediately lifted, leading to a topological band structure characterized by nontrivial strong and weak topological indices. Remarkably, both the strong topological index and the $Z_{2}$-valued weak topological indices depend sensitively on the direction of the exchange field. As a consequence of the bulk-defect correspondence, we find that the unique dependence of weak topological indices on the exchange field in this system dictates that the presence or absence of topological bound states at lattice dislocations also depends sensitively on the direction of the exchange field. When the substrate is an $s$-wave superconductor, we find that a similar dependence of band topology on the exchange field gives rise to field-sensitive dislocation Majorana zero modes. As topological dislocation bound states are easily detectable by scanning tunneling microscopy, our findings unveil a promising experimental diagnosis of altermagnetic materials among an ever growing list of candidates.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 9 pages, 5 figures

arXiv:2406.07487 [pdf, other]

GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Authors: Hang Yao, Ming Liu, Haolin Wang, Zhicun Yin, Zifei Yan, Xiaopeng Hong, Wangmeng Zuo

Abstract: Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. From the global perspective, the difficulty of reconstructing images with different anomalies is uneven. Therefore, instead of utilizing the same setting for all samples, we propose to predict a particular denoising step for each sample by evaluating the difference between image contents and the priors extracted from diffusion models. From the local perspective, reconstructing abnormal regions differs from normal areas even in the same image. Theoretically, the diffusion model predicts a noise for each step, typically following a standard Gaussian distribution. However, due to the difference between the anomaly and its potential normal counterpart, the predicted noise in abnormal regions will inevitably deviate from the standard Gaussian distribution. To this end, we propose introducing synthetic abnormal samples in training to encourage the diffusion models to break through the limitation of standard Gaussian distribution, and a spatial-adaptive feature fusion scheme is utilized during inference. With the above modifications, we propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection, which introduces appealing flexibility and achieves anomaly-free reconstruction while retaining as much normal information as possible. Extensive experiments are conducted on three commonly used anomaly detection datasets (MVTec-AD, MPDD, and VisA) and a printed circuit board dataset (PCB-Bank) we integrated, showing the effectiveness of the proposed method.

Submitted 2 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by ECCV 2024, code and models: https://github.com/hyao1/GLAD. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

arXiv:2406.07012 [pdf, other]

Bridging Language Gaps in Audio-Text Retrieval

Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.

Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Interspeech 2024

arXiv:2406.06992 [pdf, other]

Scaling up masked audio encoder learning for general audio classification

Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at https://github.com/richermans/dasheng/.

Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Interspeech 2024

arXiv:2406.06544 [pdf, other]

TSB: Tiny Shared Block for Efficient DNN Deployment on NVCIM Accelerators

Authors: Yifan Qin, Zheyu Yan, Zixuan Pan, Wujie Wen, Xiaobo Sharon Hu, Yiyu Shi

Abstract: Compute-in-memory (CIM) accelerators using non-volatile memory (NVM) devices offer promising solutions for energy-efficient and low-latency Deep Neural Network (DNN) inference execution. However, practical deployment is often hindered by the challenge of dealing with the massive amount of model weight parameters impacted by the inherent device variations within non-volatile computing-in-memory (NVCIM) accelerators. This issue significantly offsets their advantages by increasing training overhead, the time and energy needed for mapping weights to device states, and diminishing inference accuracy. To mitigate these challenges, we propose the "Tiny Shared Block (TSB)" method, which integrates a small shared 1x1 convolution block into the DNN architecture. This block is designed to stabilize feature processing across the network, effectively reducing the impact of device variation. Extensive experimental results show that TSB achieves over 20x inference accuracy gap improvement, over 5x training speedup, and weights-to-device mapping cost reduction while requiring less than 0.4% of the original weights to be write-verified during programming, when compared with state-of-the-art baseline solutions. Our approach provides a practical and efficient solution for deploying robust DNN models on NVCIM accelerators, making it a valuable contribution to the field of energy-efficient AI hardware.

Submitted 21 August, 2024; v1 submitted 8 May, 2024; originally announced June 2024.

Comments: 9 pages, accepted to IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2024)
