Search | arXiv e-print repository

Whisper-KDQ: A Lightweight Whisper via Guided Knowledge Distillation and Quantization for Efficient ASR

Authors: Hang Shao, Wei Wang, Bei Liu, Xun Gong, Haoyu Wang, Yanmin Qian

Abstract: Due to the rapid development of computing hardware resources and the dramatic growth of data, pre-trained models in speech recognition, such as Whisper, have significantly improved the performance of speech recognition tasks. However, these models usually have a high computational overhead, making it difficult to execute effectively on resource-constrained devices. To speed up inference and reduce model size while maintaining performance, we propose a novel guided knowledge distillation and quantization for large pre-trained model Whisper. The student model selects distillation and quantization layers based on quantization loss and distillation loss, respectively. We compressed $\text{Whisper}_\text{small}$ to $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$ levels, making $\text{Whisper}_\text{small}$ 5.18x/10.48x smaller, respectively. Moreover, compared to the original $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$, there is also a relative character error rate (CER) reduction of 11.3% and 14.0% for the new compressed model respectively.

Submitted 18 May, 2023; originally announced May 2023.
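
The relative CER reduction quoted in the abstract above is a ratio against the baseline error rate, not an absolute difference. A minimal Python sketch of that arithmetic follows; the function name and the sample error rates are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative error-rate reduction of new_cer against baseline_cer, as a fraction."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical error rates for illustration only (not results from the paper).
example = relative_reduction(0.20, 0.15)
```

Under these made-up inputs, an absolute drop of 5 CER points corresponds to a 25% relative reduction.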

arXiv:2305.10704 [pdf, other]

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

Authors: Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

Abstract: This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In AED-EEND system, we incorporate the target speaker enrollment information used in target speaker voice activity detection (TS-VAD) to calculate the attractor, which can mitigate the speaker permutation problem and facilitate easier model convergence. In the training process, we propose a teacher-forcing strategy to obtain the enrollment information using the ground-truth label. Furthermore, we propose three heuristic decoding methods to identify the enrollment area for each speaker during the evaluation process. Additionally, we enhance the attractor calculation network LSTM used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. By utilizing such an attention-based attractor decoder, our proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5s of enrollment data.

Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted by InterSpeech 2023

arXiv:2305.00437 [pdf, ps, other]

doi 10.1103/PhysRevResearch.5.L022013

Temperature-Dependent and Magnetism-Controlled Fermi Surface Changes in Magnetic Weyl Semimetals

Authors: Nan Zhang, Xianyong Ding, Fangyang Zhan, Houpu Li, Hongyu Li, Kaixin Tang, Yingcai Qian, Senyang Pan, Xiaoliang Xiao, Jinglei Zhang, Rui Wang, Ziji Xiang, Xianhui Chen

Abstract: The coupling between band structure and magnetism can lead to intricate Fermi surface modifications. Here we report on the comprehensive study of the Shubnikov-de Haas (SdH) effect in two rare-earth-based magnetic Weyl semimetals, NdAlSi and CeAlSi$_{0.8}$Ge$_{0.2}$. The results show that the temperature evolution of topologically nontrivial Fermi surfaces strongly depends on magnetic configurations. In NdAlSi, the SdH frequencies vary with temperature in both the paramagnetic state and the magnetically ordered state with a chiral spin texture, but become temperature independent in the high-field fully polarized state. In CeAlSi$_{0.8}$Ge$_{0.2}$, SdH frequencies are temperature-dependent only in the ferromagnetic state with magnetic fields applied along the $c$ axis. First-principles calculations suggest that the notable temperature and magnetic-configuration dependence of Fermi surface morphology can be attributed to strong exchange coupling between the conduction electrons and local magnetic moments.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 8 pages, 5 figures

Journal ref: Phys. Rev. Research 5, L022013 (2023)

arXiv:2304.14508 [pdf]

3D Brainformer: 3D Fusion Transformer for Brain Tumor Segmentation

Authors: Rui Nian, Guoyao Zhang, Yao Sui, Yuqi Qian, Qiuying Li, Mingzhang Zhao, Jianhui Li, Ali Gholipour, Simon K. Warfield

Abstract: Magnetic resonance imaging (MRI) is critically important for brain mapping in both scientific research and clinical studies. Precise segmentation of brain tumors facilitates clinical diagnosis, evaluations, and surgical planning. Deep learning has recently emerged to improve brain tumor segmentation and achieved impressive results. Convolutional architectures are widely used to implement those neural networks. By the nature of limited receptive fields, however, those architectures struggle to represent long-range spatial dependencies of the voxel intensities in MRI images. Transformers have been leveraged recently to address the above limitations of convolutional networks. Unfortunately, the majority of current Transformers-based methods in segmentation are performed with 2D MRI slices, instead of 3D volumes. Moreover, it is difficult to incorporate the structures between layers because each head is calculated independently in the Multi-Head Self-Attention mechanism (MHSA). In this work, we proposed a 3D Transformer-based segmentation approach. We developed a Fusion-Head Self-Attention mechanism (FHSA) to combine each attention head through attention logic and weight mapping, for the exploration of the long-range spatial dependencies in 3D MRI images. We implemented a plug-and-play self-attention module, named the Infinite Deformable Fusion Transformer Module (IDFTM), to extract features on any deformable feature maps. We applied our approach to the task of brain tumor segmentation, and assessed it on the public BRATS datasets. The experimental results demonstrated that our proposed approach achieved superior performance, in comparison to several state-of-the-art segmentation methods.

Submitted 27 April, 2023; originally announced April 2023.

Comments: 10 pages, 4 figures

MSC Class: 68T07 ACM Class: I.4.6; I.5.1

arXiv:2304.12259 [pdf, other]

doi 10.1038/s41467-024-47558-0

Imaging 3D Chemistry at 1 nm Resolution with Fused Multi-Modal Electron Tomography

Authors: Jonathan Schwartz, Zichao Wendy Di, Yi Jiang, Jason Manassa, Jacob Pietryga, Yiwen Qian, Min Gee Cho, Jonathan L. Rowell, Huihuo Zheng, Richard D. Robinson, Junsi Gu, Alexey Kirilin, Steve Rozeveld, Peter Ercius, Jeffrey A. Fessler, Ting Xu, Mary Scott, Robert Hovden

Abstract: Measuring the three-dimensional (3D) distribution of chemistry in nanoscale matter is a longstanding challenge for metrological science. The inelastic scattering events required for 3D chemical imaging are too rare, requiring high beam exposure that destroys the specimen before an experiment completes. Even larger doses are required to achieve high resolution. Thus, chemical mapping in 3D has been unachievable except at lower resolution with the most radiation-hard materials. Here, high-resolution 3D chemical imaging is achieved near or below one nanometer resolution in a Au-Fe$_3$O$_4$ metamaterial, Co$_3$O$_4$ - Mn$_3$O$_4$ core-shell nanocrystals, and ZnS-Cu$_{0.64}$S$_{0.36}$ nanomaterial using fused multi-modal electron tomography. Multi-modal data fusion enables high-resolution chemical tomography often with 99% less dose by linking information encoded within both elastic (HAADF) and inelastic (EDX / EELS) signals. Now sub-nanometer 3D resolution of chemistry is measurable for a broad class of geometrically and compositionally complex materials.

Submitted 18 June, 2024; v1 submitted 24 April, 2023; originally announced April 2023.

Journal ref: Nat Commun 15, 3555 (2024)

arXiv:2304.11550 [pdf, other]

Provable Reach-avoid Controllers Synthesis Based on Inner-approximating Controlled Reach-avoid Sets

Authors: Jianqiang Ding, Taoran Wu, Yuping Qian, Lijun Zhang, Bai Xue

Abstract: In this paper, we propose an approach for synthesizing provable reach-avoid controllers, which drive a deterministic system operating in an unknown environment to safely reach a desired target set. The approach falls within the reachability analysis framework and is based on the computation of inner-approximations of controlled reach-avoid sets (CRSs). Given a target set and a safe set, the controlled reach-avoid set is the set of states such that starting from each of them, there exists at least one controller to ensure that the system can enter the target set while staying inside the safe set before the target hitting time. Therefore, the boundary of the controlled reach-avoid set acts as a barrier, separating states capable of achieving the reach-avoid objective from those that are not, and thus the computed inner-approximation provides a viable space for the system to achieve the reach-avoid objective. Our approach for synthesizing reach-avoid controllers mainly consists of three steps. We first learn a safe set of states in the unknown environment from sensor measurements based on a support vector machine approach. Then, based on the learned safe set and target set, we compute an inner-approximation of the CRS. Finally, we synthesize controllers online to ensure that the system will reach the target set by evolving inside the computed inner-approximation. The proposed method is demonstrated on a Dubins car system.

Submitted 23 April, 2023; originally announced April 2023.

arXiv:2304.06342 [pdf, other]

RoSI: Recovering 3D Shape Interiors from Few Articulation Images

Authors: Akshay Gadi Patil, Yiming Qian, Shan Yang, Brian Jackson, Eric Bennett, Hao Zhang

Abstract: The dominant majority of 3D models that appear in gaming, VR/AR, and those we use to train geometric deep learning algorithms are incomplete, since they are modeled as surface meshes and missing their interior structures. We present a learning framework to recover the shape interiors (RoSI) of existing 3D models with only their exteriors from multi-view and multi-articulation images. Given a set of RGB images that capture a target 3D object in different articulated poses, possibly from only a few views, our method infers the interior planes that are observable in the input images. Our neural architecture is trained in a category-agnostic manner and it consists of a motion-aware multi-view analysis phase including pose, depth, and motion estimations, followed by interior plane detection in images and 3D space, and finally multi-view plane fusion. In addition, our method also predicts part articulations and is able to realize and even extrapolate the captured motions on the target 3D object. We evaluate our method by quantitative and qualitative comparisons to baselines and alternative solutions, as well as testing on untrained object categories and real image inputs to assess its generalization capabilities.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.05754 [pdf, other]

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Authors: Bing Han, Zhengyang Chen, Yanmin Qian

Abstract: Automatic speaker verification task has made great achievements using deep learning approaches with the large-scale manually annotated dataset. However, it's very difficult and expensive to collect a large amount of well-labeled data for system building. In this paper, we propose a novel and advanced self-supervised learning framework which can construct a high performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. Then, we introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iteration learning stage, due to a mass of unreliable labels from clustering, the quality of pseudo labels is important for the system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with GMM and obtain the loss-gate threshold dynamically to distinguish the reliable and unreliable labels. Besides, we adopt the model predictions to correct the unreliable label, for better utilizing the unreliable data rather than dropping them directly. Moreover, we extend the DLG-LC to multi-modality to further improve the performance. The experiments are performed on the commonly used Voxceleb dataset. Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvement on Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system even achieves comparable results with the fully supervised system, but without using any human labeled data.

Submitted 12 April, 2023; originally announced April 2023.

Comments: Submitted to TASLP on July 19, 2022

arXiv:2304.04162 [pdf, other]

Design of Two-Level Incentive Mechanisms for Hierarchical Federated Learning

Authors: Shunfeng Chu, Jun Li, Kang Wei, Yuwen Qian, Kunlun Wang, Feng Shu, Wen Chen

Abstract: Hierarchical Federated Learning (HFL) is a distributed machine learning paradigm tailored for multi-tiered computation architectures, which supports massive access of devices' models simultaneously. To enable efficient HFL, it is crucial to design suitable incentive mechanisms to ensure that devices actively participate in local training. However, there are few studies on incentive mechanism design for HFL. In this paper, we design two-level incentive mechanisms for the HFL with a two-tiered computing structure to encourage the participation of entities in each tier in the HFL training. In the lower-level game, we propose a coalition formation game to jointly optimize the edge association and bandwidth allocation problem, and obtain efficient coalition partitions by the proposed preference rule, which can be proven to be stable by exact potential game. In the upper-level game, we design the Stackelberg game algorithm, which not only determines the optimal number of edge aggregations for edge servers to maximize their utility, but also optimizes the unit reward provided for the edge aggregation performance to ensure the interests of cloud servers. Furthermore, numerical results indicate that the proposed algorithms can achieve better performance than the benchmark schemes.

Submitted 16 January, 2024; v1 submitted 9 April, 2023; originally announced April 2023.

arXiv:2304.03981 [pdf, other]

Uncertainty-inspired Open Set Learning for Retinal Anomaly Identification

Authors: Meng Wang, Tian Lin, Lianyu Wang, Aidi Lin, Ke Zou, Xinxing Xu, Yi Zhou, Yuanyuan Peng, Qingquan Meng, Yiming Qian, Guoyao Deng, Zhiqun Wu, Junhong Chen, Jianhong Lin, Mingzhi Zhang, Weifang Zhu, Changqing Zhang, Daoqiang Zhang, Rick Siow Mong Goh, Yong Liu, Chi Pui Pang, Xinjian Chen, Haoyu Chen, Huazhu Fu

Abstract: Failure to recognize samples from the classes unseen during training is a major limitation of artificial intelligence in the real-world implementation for recognition and classification of retinal anomalies. We established an uncertainty-inspired open-set (UIOS) model, which was trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calculated an uncertainty score to express its confidence. Our UIOS model with thresholding strategy achieved an F1 score of 99.55%, 97.01% and 91.91% for the internal testing set, external target categories (TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1 score of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS correctly predicted high uncertainty scores, which would prompt the need for a manual check in the datasets of non-target categories retinal diseases, low-quality fundus images, and non-fundus images. UIOS provides a robust method for real-world screening of retinal anomalies.

Submitted 29 August, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

arXiv:2304.03359 [pdf, other]

Approximate Wireless Communication for Federated Learning

Authors: Xiang Ma, Haijian Sun, Rose Qingyang Hu, Yi Qian

Abstract: This paper presents an approximate wireless communication scheme for federated learning (FL) model aggregation in the uplink transmission. We consider a realistic channel that reveals bit errors during FL model exchange in wireless networks. Our study demonstrates that random bit errors during model transmission can significantly affect FL performance. To overcome this challenge, we propose an approximate communication scheme based on the mathematical and statistical proof that machine learning (ML) model gradients are bounded under certain constraints. This bound enables us to introduce a novel encoding scheme for float-to-binary representation of gradient values and their QAM constellation mapping. Besides, since FL gradients are error-resilient, the proposed scheme simply delivers gradients with errors when the channel quality is satisfactory, eliminating extensive error-correcting codes and/or retransmission. The direct benefits include less overhead and lower latency. The proposed scheme is well-suited for resource-constrained devices in wireless networks. Through simulations, we show that the proposed scheme is effective in reducing the impact of bit errors on FL performance and saves at least half the time compared with transmission with error correction and retransmission to achieve the same learning performance. In addition, we investigated the effectiveness of bit protection mechanisms in high-order modulation when Gray coding is employed and found that this approach considerably enhances learning performance.

Submitted 6 April, 2023; originally announced April 2023.
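
The abstract above refers to Gray coding in high-order modulation. As background, a Gray code assigns labels so that neighbouring constellation levels differ in exactly one bit, which means a symbol mistaken for its nearest neighbour corrupts only one bit. The sketch below shows the standard binary-reflected Gray code in Python; it is not the paper's encoding or bit-protection scheme, and the function names are hypothetical.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to its binary-reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Labels for 4 amplitude levels on one axis (an illustrative choice,
# e.g. one axis of a square 16-QAM constellation).
labels = [to_gray(i) for i in range(4)]
```

Each adjacent pair in `labels` has a Hamming distance of 1, which is the property that limits the damage of nearest-neighbour symbol errors.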

arXiv:2304.02835 [pdf, other]

doi 10.1145/3543507.3583521

GIF: A General Graph Unlearning Strategy via Influence Function

Authors: Jiancan Wu, Yi Yang, Yuchun Qian, Yongduo Sui, Xiang Wang, Xiangnan He

Abstract: With the greater emphasis on privacy and security in our society, the problem of graph unlearning -- revoking the influence of specific data on the trained GNN model, is drawing increasing attention. However, ranging from machine unlearning to recently emerged graph unlearning methods, existing efforts either resort to retraining paradigm, or perform approximate erasure that fails to consider the inter-dependency between connected neighbors or imposes constraints on GNN structure, therefore hard to achieve satisfying performance-complexity trade-offs. In this work, we explore the influence function tailored for graph unlearning, so as to improve the unlearning efficacy and efficiency for graph unlearning. We first present a unified problem formulation of diverse graph unlearning tasks with respect to node, edge, and feature. Then, we recognize the crux to the inability of traditional influence function for graph unlearning, and devise Graph Influence Function (GIF), a model-agnostic unlearning method that can efficiently and accurately estimate parameter changes in response to an $ε$-mass perturbation in deleted data. The idea is to supplement the objective of the traditional influence function with an additional loss term of the influenced neighbors due to the structural dependency. Further deductions on the closed-form solution of parameter changes provide a better understanding of the unlearning mechanism. We conduct extensive experiments on four representative GNN models and three benchmark datasets to justify the superiority of GIF for diverse graph unlearning tasks in terms of unlearning efficacy, model utility, and unlearning efficiency. Our implementations are available at https://github.com/wujcan/GIF-torch/.

Submitted 5 April, 2023; originally announced April 2023.

Comments: Accepted by WWW 2023

arXiv:2304.02162 [pdf, other]

doi 10.1109/TIP.2024.3393390

Learning to Recover Spectral Reflectance from RGB Images

Authors: Dong Huo, Jian Wang, Yiming Qian, Yee-Hong Yang

Abstract: This paper tackles spectral reflectance recovery (SRR) from RGB images. Since capturing ground-truth spectral reflectance and camera spectral sensitivity are challenging and costly, most existing approaches are trained on synthetic images and utilize the same parameters for all unseen testing images, which are suboptimal especially when the trained models are tested on real images because they never exploit the internal information of the testing images. To address this issue, we adopt a self-supervised meta-auxiliary learning (MAXL) strategy that fine-tunes the well-trained network parameters with each testing image to combine external with internal information. To the best of our knowledge, this is the first work that successfully adapts the MAXL strategy to this problem. Instead of relying on naive end-to-end training, we also propose a novel architecture that integrates the physical relationship between the spectral reflectance and the corresponding RGB images into the network based on our mathematical analysis. Besides, since the spectral reflectance of a scene is independent of its illumination while the corresponding RGB images are not, we recover the spectral reflectance of a scene from its RGB images captured under multiple illuminations to further reduce the unknown. Qualitative and quantitative evaluations demonstrate the effectiveness of our proposed network and of the MAXL. Our code and data are available at https://github.com/Dong-Huo/SRR-MAXL.

Submitted 22 April, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: IEEE Transactions on Image Processing (TIP), 2024

arXiv:2304.01849 [pdf, other]

Semiparametric efficient estimation of genetic relatedness with machine learning methods

Authors: Xu Guo, Yiyuan Qian, Hongwei Shi, Weichao Yang, Niwen Zhou

Abstract: In this paper, we propose semiparametric efficient estimators of genetic relatedness between two traits in a model-free framework. Most existing methods require specifying certain parametric models involving the traits and genetic variants. However, the bias due to model misspecification may yield misleading statistical results. Moreover, the semiparametric efficient bounds for estimators of genetic relatedness are still lacking. In this paper, we develop semiparametric efficient estimators with machine learning methods and construct valid confidence intervals for two important measures of genetic relatedness: genetic covariance and genetic correlation, allowing both continuous and discrete responses. Based on the derived efficient influence functions of genetic relatedness, we propose a consistent estimator of the genetic covariance as long as one of the genetic values is consistently estimated. The data of two traits may be collected from the same group or different groups of individuals. Various numerical studies are performed to illustrate our introduced procedures. We also apply proposed procedures to analyze Carworth Farms White mice genome-wide association study data.

Submitted 2 June, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: 46 pages, 9 tables, 1 figure

arXiv:2303.15790 [pdf, other]

doi 10.1007/s11467-023-1333-z

STCF Conceptual Design Report: Volume 1 -- Physics & Detector

Authors: M. Achasov, X. C. Ai, R. Aliberti, L. P. An, Q. An, X. Z. Bai, Y. Bai, O. Bakina, A. Barnyakov, V. Blinov, V. Bobrovnikov, D. Bodrov, A. Bogomyagkov, A. Bondar, I. Boyko, Z. H. Bu, F. M. Cai, H. Cai, J. J. Cao, Q. H. Cao, Z. Cao, Q. Chang, K. T. Chao, D. Y. Chen, H. Chen, et al. (413 additional authors not shown)

Abstract: The Super $τ$-Charm facility (STCF) is an electron-positron collider proposed by the Chinese particle physics community. It is designed to operate in a center-of-mass energy range from 2 to 7 GeV with a peak luminosity of $0.5\times 10^{35}{\rm cm}^{-2}{\rm s}^{-1}$ or higher. The STCF will produce a data sample about a factor of 100 larger than that by the present $τ$-Charm factory -- the BEPCII, providing a unique platform for exploring the asymmetry of matter-antimatter (charge-parity violation), in-depth studies of the internal structure of hadrons and the nature of non-perturbative strong interactions, as well as searching for exotic hadrons and physics beyond the Standard Model. The STCF project in China is under development with an extensive R&D program. This document presents the physics opportunities at the STCF, describes conceptual designs of the STCF detector system, and discusses future plans for detector R&D and physics case studies.

Submitted 5 October, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

Journal ref: Front. Phys. 19(1), 14701 (2024)

arXiv:2303.13033 [pdf, other]

doi 10.1007/978-3-031-43895-0_21

Federated Uncertainty-Aware Aggregation for Fundus Diabetic Retinopathy Staging

Authors: Meng Wang, Lianyu Wang, Xinxing Xu, Ke Zou, Yiming Qian, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the prediction. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients for learning a global representation of fundus images, while a novel temperature-warmed uncertainty head (TWEU) is utilized for each client for local personalized staging criteria. Our TWEU employs an evidential deep layer to produce the uncertainty score with the DR staging results for client reliability evaluation. Furthermore, we developed a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to construct a dataset for federated DR staging to satisfy the real non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.

Submitted 22 July, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

Report number: 978-3-031-43894-3

Journal ref: Medical Image Computing and Computer Assisted Intervention (MICCAI 2023)
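
The entry above describes an evidential head that produces an uncertainty score and an aggregation step whose weights depend on each client's uncertainty. The following is a minimal sketch of that idea, not the paper's TWEU or UAW modules. It assumes the standard subjective-logic uncertainty u = K / S for a Dirichlet with parameters alpha_k = e_k + 1, and a hypothetical weighting rule in which a client's weight is proportional to one minus its mean uncertainty. The function names are illustrative only.

```python
from typing import List


def dirichlet_uncertainty(evidence: List[float]) -> float:
    """Subjective-logic uncertainty u = K / S for non-negative class evidence.

    alpha_k = e_k + 1 and S = sum(alpha_k); u is 1.0 when there is no evidence.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def uncertainty_weighted_average(
    client_params: List[List[float]], client_uncertainty: List[float]
) -> List[float]:
    """Average client parameter vectors with weights proportional to (1 - u).

    The weighting rule is an assumption made for illustration; the abstract
    does not specify how FedUAA maps uncertainty scores to weights.
    """
    if len(client_params) != len(client_uncertainty) or not client_params:
        raise ValueError("need one uncertainty value per client")
    raw = [1.0 - u for u in client_uncertainty]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one client must have uncertainty below 1")
    weights = [r / total for r in raw]
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]
```

With this rule, a client whose predictions carry little evidence (uncertainty near 1) contributes almost nothing to the aggregated encoder, which is one simple way to express "client reliability" numerically.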

arXiv:2303.12370 [pdf, other]

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Authors: Sixun Dong, Huazhang Hu, Dongze Lian, Weixin Luo, Yicheng Qian, Shenghua Gao

Abstract: Sequential video understanding, as an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding where the accurate time-stamp level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, where the video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces the matching between each action and its description. As the frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondence and supervise the network training with the pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR

Submitted 28 March, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: CVPR 2023. Code: https://github.com/svip-lab/WeakSVR

arXiv:2303.12245 [pdf, other]

Error Analysis of Physics-Informed Neural Networks for Approximating Dynamic PDEs of Second Order in Time

Authors: Yanxia Qian, Yongchao Zhang, Yunqing Huang, Suchuan Dong

Abstract: We consider the approximation of a class of dynamic partial differential equations (PDE) of second order in time by the physics-informed neural network (PINN) approach, and provide an error analysis of PINN for the wave equation, the Sine-Gordon equation and the linear elastodynamic equation. Our analyses show that, with feed-forward neural networks having two hidden layers and the $\tanh$ activation function, the PINN approximation errors for the solution field, its time derivative and its gradient field can be effectively bounded by the training loss and the number of training data points (quadrature points). Our analyses further suggest new forms for the training loss function, which contain certain residuals that are crucial to the error estimate but would be absent from the canonical PINN loss formulation. Adopting these new forms for the loss function leads to a variant PINN algorithm. We present ample numerical experiments with the new PINN algorithm for the wave equation, the Sine-Gordon equation and the linear elastodynamic equation, which show that the method can capture the solution well.

Submitted 21 March, 2023; originally announced March 2023.

Comments: 46 pages, 14 figures, 3 tables
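
For orientation, the canonical PINN loss that the entry above contrasts against can be written for the wave equation roughly as follows. This is a sketch under assumed notation (domain $D$, final time $T$, network $u_\theta$, source $f$, initial data $\psi_1, \psi_2$, boundary data $u_d$, and collocation counts $N_{int}, N_{tb}, N_{sb}$ are all introduced here for illustration); it does not reproduce the additional residual terms that the paper proposes, which the abstract does not spell out.

```latex
% Wave equation on D x [0, T] with initial and boundary data (assumed notation)
\begin{aligned}
  u_{tt} - \Delta u &= f            && \text{in } D \times [0, T], \\
  u(x, 0) = \psi_1(x), \quad u_t(x, 0) &= \psi_2(x) && \text{in } D, \\
  u &= u_d                          && \text{on } \partial D \times [0, T].
\end{aligned}

% Canonical PINN training loss: mean squared residuals at collocation points
\mathcal{L}(\theta) =
  \frac{1}{N_{int}} \sum_{n=1}^{N_{int}}
    \bigl| \partial_{tt} u_\theta - \Delta u_\theta - f \bigr|^2 (x_n, t_n)
  + \frac{1}{N_{tb}} \sum_{n=1}^{N_{tb}}
    \Bigl( \bigl| u_\theta(x_n, 0) - \psi_1(x_n) \bigr|^2
         + \bigl| \partial_t u_\theta(x_n, 0) - \psi_2(x_n) \bigr|^2 \Bigr)
  + \frac{1}{N_{sb}} \sum_{n=1}^{N_{sb}}
    \bigl| u_\theta(x_n, t_n) - u_d(x_n, t_n) \bigr|^2 .
```

The three sums penalize the interior PDE residual, the two initial conditions, and the boundary condition, respectively.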

arXiv:2303.10949 [pdf, other]

Code-Switching Text Generation and Injection in Mandarin-English ASR

Authors: Haibin Yu, Yuxuan Hu, Yao Qian, Ma Jin, Linquan Liu, Shujie Liu, Yu Shi, Yanmin Qian, Edward Lin, Michael Zeng

Abstract: Code-switching speech refers to a means of expression by mixing two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces. Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models, i.e., a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets, and that tying speech and text latent spaces is superior to TTS conversion on the evaluation set whose data are more homogeneous with the training set.

Submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.10741 [pdf]

Computer Vision Estimation of Emotion Reaction Intensity in the Wild

Authors: Yang Qian, Ali Kargarandehkordi, Onur Cezmi Mutlu, Saimourya Surabhi, Mohammadmahdi Honarmand, Dennis Paul Wall, Peter Washington

Abstract: Emotions play an essential role in human communication. Developing computer vision models for automatic recognition of emotion expression can aid in a variety of domains, including robotics, digital behavioral healthcare, and media analytics. There are three types of emotional representations which are traditionally modeled in affective computing research: Action Units, Valence Arousal (VA), and Categorical Emotions. As part of an effort to move beyond these representations towards more fine-grained labels, we describe our submission to the newly introduced Emotional Reaction Intensity (ERI) Estimation challenge in the 5th competition for Affective Behavior Analysis in-the-Wild (ABAW). We developed four deep neural networks trained in the visual domain and a multimodal model trained with both visual and audio features to predict emotion reaction intensity. Our best performing model on the Hume-Reaction dataset achieved an average Pearson correlation coefficient of 0.4080 on the test set using a pre-trained ResNet50 model. This work provides a first step towards the development of production-grade models which predict emotion reaction intensities rather than discrete emotion categories.

Submitted 2 August, 2023; v1 submitted 19 March, 2023; originally announced March 2023.
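
The entry above reports an average Pearson correlation coefficient as its evaluation metric. A minimal sketch of how such a score could be computed is given below. It assumes that the average is taken over per-emotion columns of a predictions matrix, which the abstract does not state explicitly, and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(
    preds: Sequence[Sequence[float]], targets: Sequence[Sequence[float]]
) -> float:
    """Average Pearson r over columns (assumed here to be one column per emotion).

    Rows are samples; preds and targets must have the same shape.
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must have the same number of rows")
    num_cols = len(preds[0])
    scores = [
        pearson_r([row[c] for row in preds], [row[c] for row in targets])
        for c in range(num_cols)
    ]
    return sum(scores) / num_cols
```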

arXiv:2303.08372 [pdf, other]

Target Sound Extraction with Variable Cross-modality Clues

Authors: Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng

Abstract: Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources. It often uses a model conditioned on a fixed form of target sound clues, such as a sound class label, which limits the ways in which users can interact with the model to specify the target sounds. To leverage a variable number of clues across modalities available in the inference phase, including a video, a sound event class, and a text caption, we propose a unified transformer-based TSE model architecture, where a multi-clue attention module integrates all the clues across the modalities. Since there is no off-the-shelf benchmark to evaluate our proposed approach, we build a dataset based on public corpora, Audioset and AudioCaps. Experimental results for seen and unseen target-sound evaluation sets show that our proposed TSE model can effectively deal with a varying number of clues, which improves the TSE performance and robustness against partially compromised clues.

Submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.07623 [pdf, other]

Uncertainty-weighted Multi-tasking for $T_{1ρ}$ and $T_2$ Mapping in the Liver with Self-supervised Learning

Authors: Chaoxing Huang, Yurui Qian, Jian Hou, Baiyan Jiang, Queenie Chan, Vincent WS Wong, Winnie CW Chu, Weitian Chen

Abstract: Multi-parametric mapping of MRI relaxations in the liver has the potential of revealing pathological information of the liver. A self-supervised learning based multi-parametric mapping method is proposed to map $T_{1ρ}$ and $T_2$ simultaneously, by utilising the relaxation constraint in the learning process. Data noise of different mapping tasks is utilised to make the model uncertainty-aware, which adaptively weights different mapping tasks during learning. The method was examined on a dataset of 51 patients with non-alcoholic fatty liver disease. Results showed that the proposed method can produce comparable parametric maps to the traditional multi-contrast pixel-wise fitting method, with a reduced number of images and less computation time. The uncertainty weighting also improves the model performance. It has the potential of accelerating MRI quantitative imaging.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2303.05906 [pdf, other]

doi 10.1103/PhysRevD.108.043007

Symmetry and bipolar motion in collective neutrino flavor oscillations

Authors: Zewei Xiong, Meng-Ru Wu, Yong-Zhong Qian

Abstract: We identify a geometric symmetry on the two-flavor Bloch sphere for collective flavor oscillations of a homogeneous dense neutrino gas. Based on this symmetry, analytical solutions to the periodic bipolar flavor evolution are derived. Using numerical calculations, we show that for configurations without this symmetry, the flavor evolution displays deviations from the bipolar flavor motion or even exhibits aperiodic patterns. We also discuss the implication of our finding for more general three-flavor and inhomogeneous cases.

Submitted 11 June, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

Comments: 11 pages, 4 figures

arXiv:2303.02693 [pdf, other]

Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition

Authors: Junyan Wang, Zhenhong Sun, Yichen Qian, Dong Gong, Xiuyu Sun, Ming Lin, Maurice Pagnucco, Yang Song

Abstract: 3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture the temporal information, 3D convolutions are computed along the sequences, leading to cubically growing and expensive computations. To reduce the computational cost, previous methods resort to manually designed 3D/2D CNN structures with approximations or automatic search, which sacrifice the modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs considering the model complexity. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score, based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in spatial and temporal dimensions, through dynamically leveraging the correlation between the feature map size and kernel size depth-wisely. Highly efficient and expressive 3D CNN architectures, i.e., entropy-based 3D CNNs (E3D family), can then be efficiently searched by maximizing the STEntr-Score under a given computational budget, via an evolutionary algorithm without training the network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at https://github.com/alibaba/lightweight-neural-architecture-search.

Submitted 5 March, 2023; originally announced March 2023.

Comments: This manuscript has been accepted at ICLR 2023

arXiv:2302.14498 [pdf, other]

Effective Community Search on Large Attributed Bipartite Graphs

Authors: Zongyu Xu, Yihao Zhang, Long Yuan, Yuwen Qian, Zi Chen, Mingliang Zhou, Qin Mao, Weibin Pan

Abstract: Community search over bipartite graphs has attracted significant interest recently. In many applications, such as user-item bipartite graphs in e-commerce and customer-movie bipartite graphs on movie rating websites, nodes tend to have attributes, yet previous community search algorithms on bipartite graphs ignore attributes, so the returned results have poor cohesion with respect to their node attributes. In this paper, we study the community search problem on attributed bipartite graphs. Given a query vertex $q$, we aim to find attributed $\left(α,β\right)$-communities of $G$, where the structure cohesiveness of the community is described by an $\left(α,β\right)$-core model, and the attribute similarity of two groups of nodes in the subgraph is maximized. In order to retrieve attributed communities from bipartite graphs, we first propose a basic algorithm composed of two steps: the generation and verification of candidate keyword sets. We then propose two improved query algorithms, Inc and Dec. Inc exploits the anti-monotonicity property of attributed bipartite graphs, while Dec adopts a different generating method and verifying order of candidate keyword sets. After evaluating our solutions on eight large graphs, the experimental results demonstrate that our methods are effective and efficient in querying the attributed communities on bipartite graphs.

Submitted 28 February, 2023; v1 submitted 28 February, 2023; originally announced February 2023.
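
The entry above relies on the $\left(α,β\right)$-core as its structural cohesiveness model. As a point of reference, the sketch below computes the $\left(α,β\right)$-core of a bipartite graph by the usual peeling procedure, assuming the common definition in which every remaining upper-layer vertex has degree at least α and every remaining lower-layer vertex has degree at least β. It is not the paper's Inc or Dec algorithm: it ignores attributes and does not restrict the result to the component containing a query vertex. The function name and edge-list input format are illustrative.

```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, Set, Tuple


def alpha_beta_core(
    edges: Iterable[Tuple[Hashable, Hashable]], alpha: int, beta: int
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return the (upper, lower) vertex sets of the (alpha, beta)-core.

    Each edge is an (upper_vertex, lower_vertex) pair. Vertices whose degree
    falls below the threshold of their layer are peeled off repeatedly.
    """
    upper_adj = defaultdict(set)
    lower_adj = defaultdict(set)
    for u, v in edges:
        upper_adj[u].add(v)
        lower_adj[v].add(u)

    deg_u = {u: len(nbrs) for u, nbrs in upper_adj.items()}
    deg_l = {v: len(nbrs) for v, nbrs in lower_adj.items()}
    removed_u: Set[Hashable] = set()
    removed_l: Set[Hashable] = set()
    queue = deque()

    # Seed the queue with every vertex that already violates its threshold.
    for u, d in deg_u.items():
        if d < alpha:
            removed_u.add(u)
            queue.append(("U", u))
    for v, d in deg_l.items():
        if d < beta:
            removed_l.add(v)
            queue.append(("L", v))

    # Removing a vertex lowers its neighbours' degrees, which may cascade.
    while queue:
        side, x = queue.popleft()
        if side == "U":
            for v in upper_adj[x]:
                if v in removed_l:
                    continue
                deg_l[v] -= 1
                if deg_l[v] < beta:
                    removed_l.add(v)
                    queue.append(("L", v))
        else:
            for u in lower_adj[x]:
                if u in removed_u:
                    continue
                deg_u[u] -= 1
                if deg_u[u] < alpha:
                    removed_u.add(u)
                    queue.append(("U", u))

    return set(deg_u) - removed_u, set(deg_l) - removed_l
```

Each edge is examined a constant number of times, so the peeling runs in time linear in the size of the graph.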

arXiv:2302.13763 [pdf, other]

Efficient and Low Overhead Website Fingerprinting Attacks and Defenses based on TCP/IP Traffic

Authors: Guodong Huang, Chuan Ma, Ming Ding, Yuwen Qian, Chunpeng Ge, Liming Fang, Zhe Liu

Abstract: Website fingerprinting attack is an extensively studied technique used in a web browser to analyze traffic patterns and thus infer confidential information about users. Several website fingerprinting attacks based on machine learning and deep learning tend to use the most typical features to achieve a satisfactory performance of attacking rate. However, these attacks suffer from several practical implementation factors, such as a skillful pre-processing step or a clean dataset. To defend against such attacks, random packet defense (RPD) with a high cost of excessive network overhead is usually applied. In this work, we first propose a practical filter-assisted attack against RPD, which can filter out the injected noises using the statistical characteristics of TCP/IP traffic. Then, we propose a list-assisted defensive mechanism to defend against the proposed attack method. To achieve a configurable trade-off between the defense and the network overhead, we further improve the list-based defense by a traffic splitting mechanism, which can combat the mentioned attacks as well as save a considerable amount of network overhead. In the experiments, we collect real-life traffic patterns using three mainstream browsers, i.e., Microsoft Edge, Google Chrome, and Mozilla Firefox, and extensive results conducted on the closed and open-world datasets show the effectiveness of the proposed algorithms in terms of defense accuracy and network efficiency.

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.11787 [pdf, other]

Empathetic Response Generation via Emotion Cause Transition Graph

Authors: Yushan Qian, Bo Wang, Ting-En Lin, Yinhe Zheng, Ying Zhu, Dongming Zhao, Yuexian Hou, Yuchuan Wu, Yongbin Li

Abstract: Empathetic dialogue is a human-like behavior that requires the perception of both affective factors (e.g., emotion status) and cognitive factors (e.g., cause of the emotion). Whereas early work concerned emotion status, the latest approaches study emotion causes in empathetic dialogue. These approaches focus on understanding and duplicating emotion causes in the context to show empathy for the speaker. However, instead of only repeating the contextual causes, real empathic responses often demonstrate a logical and emotion-centered transition from the causes in the context to those in the responses. In this work, we propose an emotion cause transition graph to explicitly model the natural transition of emotion causes between two adjacent turns in empathetic dialogue. With this graph, the concept words of the emotion causes in the next turn can be predicted and used by a specifically designed concept-aware decoder to generate the empathic response. Automatic and human experimental results on the benchmark dataset demonstrate that our method produces more empathetic, coherent, informative, and specific responses than existing models.

Submitted 23 February, 2023; originally announced February 2023.

Comments: Accepted to ICASSP 2023

arXiv:2302.08629 [pdf, other]

Physics-based parameterized neural ordinary differential equations: prediction of laser ignition in a rocket combustor

Authors: Yizhou Qian, Jonathan Wang, Quentin Douasbin, Eric Darve

Abstract: In this work, we present a novel physics-based data-driven framework for reduced-order modeling of laser ignition in a model rocket combustor based on parameterized neural ordinary differential equations (PNODE). Deep neural networks are embedded as functions of high-dimensional parameters of laser ignition to predict various terms in a 0D flow model including the heat source function, pre-exponential factors, and activation energy. Using the governing equations of a 0D flow model, our PNODE needs only a limited number of training samples and predicts trajectories of various quantities such as temperature, pressure, and mass fractions of species while satisfying physical constraints. We validate our physics-based PNODE on solution snapshots of high-fidelity Computational Fluid Dynamics (CFD) simulations of laser-induced ignition in a prototype rocket combustor. We compare the performance of our physics-based PNODE with that of kernel ridge regression and fully connected neural networks. Our results show that our physics-based PNODE provides solutions with lower mean absolute errors of average temperature over time, thus improving the prediction of successful laser ignition with high-dimensional parameters.

Submitted 3 May, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2302.02978 [pdf, other]

doi 10.18653/v1/2023.findings-emnlp.354

MuG: A Multimodal Classification Benchmark on Game Data with Tabular, Textual, and Visual Fields

Authors: Jiaying Lu, Yongchen Qian, Shifan Zhao, Yuanzhe Xi, Carl Yang

Abstract: Previous research has demonstrated the advantages of integrating data from multiple sources over traditional unimodal data, leading to the emergence of numerous novel multimodal applications. We propose a multimodal classification benchmark MuG with eight datasets that allows researchers to evaluate and improve their models. These datasets are collected from four different genres of games and cover tabular, textual, and visual modalities. We conduct multi-aspect data analysis to provide insights into the benchmark, including label balance ratios, percentages of missing features, distributions of data within each modality, and the correlations between labels and input modalities. We further present experimental results obtained by several state-of-the-art unimodal classifiers and multimodal classifiers, which demonstrate the challenging and multimodal-dependent properties of the benchmark. MuG is released at https://github.com/lujiaying/MUG-Bench with the data, tutorials, and implemented baselines.

Submitted 17 October, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Journal ref: In Findings of the Association for Computational Linguistics: EMNLP 2023

arXiv:2302.00613 [pdf]

doi 10.1109/PVSC48320.2023.10359832

Impact of Surface Roughness in Measuring Optoelectronic Characteristics of Thin-Film Solar Cells

Authors: David Magginetti, Seokmin Jeon, Yohan Yoon, Ashif Choudhury, Ashraful Mamun, Yang Qian, Jordan Gerton, Heayoung Yoon

Abstract: Microstructural properties of thin-film absorber layers play a vital role in developing high-performance solar cells. Scanning probe microscopy is frequently used for measuring spatially inhomogeneous properties of thin-film solar cells. While powerful, the nanoscale probe can be sensitive to the roughness of samples, introducing convoluted signals and unintended artifacts into the measurement. Here, we apply a glancing-angle focused ion beam (FIB) technique to reduce the surface roughness of CdTe while preserving the subsurface optoelectronic properties of the solar cells. We compare the nanoscale optoelectronic properties before and after the FIB polishing. Simultaneously collected Kelvin-probe force microscopy (KPFM) and atomic force microscopy (AFM) images show that the contact potential difference (CPD) of pristine CdTe (peak-to-valley roughness of approximately 600 nm) follows the topography. In contrast, the CPD map of polished CdTe (roughness of approximately 20 nm) is independent of the surface roughness. We demonstrate that the smooth CdTe surface also enables high-resolution photoluminescence (PL) imaging at a resolution much smaller than individual grains (< 1 micrometer). Our finite-difference time-domain (FDTD) simulations illustrate how the local light excitation interacts with CdTe surfaces. Our work supports the view that low-angle FIB polishing can be beneficial in studying buried sub-microstructural properties of thin-film solar cells, provided care is taken regarding possible ion-beam damage near the surface.

Submitted 1 February, 2023; originally announced February 2023.

Comments: 4 pages, 4 figures

Journal ref: 2023 IEEE 50th Photovoltaic Specialists Conference

arXiv:2301.13356 [pdf, other]

Inference Time Evidences of Adversarial Attacks for Forensic on Transformers

Authors: Hugo Lemarchant, Liangzi Li, Yiming Qian, Yuta Nakashima, Hajime Nagahara

Abstract: Vision Transformers (ViTs) are becoming a very popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification. However, although early works implied that this network structure had increased robustness against adversarial attacks, some works argue ViTs are still vulnerable. This paper presents our first attempt toward detecting adversarial attacks during inference time using the network's input and outputs as well as latent features. We design four quantifications (or derivatives) of input, output, and latent vectors of ViT-based models that provide a signature of the inference, which could be beneficial for the attack detection, and empirically study their behavior over clean samples and adversarial samples. The results demonstrate that the quantifications from input (images) and output (posterior probabilities) are promising for distinguishing clean and adversarial samples, while latent vectors offer less discriminative power, though they give some insights on how adversarial perturbations work.

Submitted 30 January, 2023; originally announced January 2023.

arXiv:2301.12798 [pdf, other]

Reliable Federated Disentangling Network for Non-IID Domain Feature

Authors: Meng Wang, Kai Yu, Chun-Mei Feng, Yiming Qian, Ke Zou, Lianyu Wang, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Federated learning (FL), as an effective decentralized distributed learning approach, enables multiple institutions to jointly train a model without sharing their local data. However, the domain feature shift caused by different acquisition devices/clients substantially degrades the performance of the FL model. Furthermore, most existing FL approaches aim to improve accuracy without considering reliability (e.g., confidence or uncertainty). The predictions are thus unreliable when deployed in safety-critical applications. Therefore, to improve the performance of FL under non-IID domain features while making the model more reliable, we propose a novel reliable federated disentangling network, termed RFedDis, which utilizes feature disentangling to capture the global domain-invariant cross-client representation and preserve local client-specific feature learning. Meanwhile, to effectively integrate the decoupled features, an uncertainty-aware decision fusion is also introduced to guide the network in dynamically integrating the decoupled features at the evidence level, while producing a reliable prediction with an estimated uncertainty. To the best of our knowledge, our proposed RFedDis is the first work to develop an FL approach based on evidential uncertainty combined with feature disentangling, which enhances the performance and reliability of FL in non-IID domain features. Extensive experimental results show that our proposed RFedDis provides outstanding performance with a high degree of reliability as compared to other state-of-the-art FL approaches.

Submitted 19 September, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

arXiv:2301.10460 [pdf, other]

HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling

Authors: Fenggen Yu, Yiming Qian, Francisca Gil-Ureta, Brian Jackson, Eric Bennett, Hao Zhang

Abstract: We present the first active learning tool for fine-grained 3D part labeling, a problem which challenges even the most advanced deep learning (DL) methods due to the significant structural variations among the small and intricate parts. For the same reason, the necessary data annotation effort is tremendous, motivating approaches to minimize human involvement. Our labeling tool iteratively verifies or modifies part labels predicted by a deep neural network, with human feedback continually improving the network prediction. To effectively reduce human efforts, we develop two novel features in our tool, hierarchical and symmetry-aware active labeling. Our human-in-the-loop approach, coined HAL3D, achieves 100% accuracy (barring human errors) on any test set with pre-defined hierarchical part labels, with 80% time-saving over manual effort.

Submitted 1 April, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted to ICCV 2023

arXiv:2301.10181 [pdf, other]

Interpretable Tsetlin Machine-based Premature Ventricular Contraction Identification

Authors: Jinbao Zhang, Xuan Zhang, Lei Jiao, Ole-Christoffer Granmo, Yongjun Qian, Fan Pan

Abstract: Neural network-based models have found wide use in automatic long-term electrocardiogram (ECG) analysis. However, such black box models are inadequate for analysing physiological signals where credibility and interpretability are crucial. Indeed, how to make ECG analysis transparent is still an open problem. In this study, we develop a Tsetlin machine (TM) based architecture for premature ventricular contraction (PVC) identification by analysing long-term ECG signals. The architecture is transparent by describing patterns directly with logical AND rules. To validate the accuracy of our approach, we compare the TM performance with those of convolutional neural networks (CNNs). Our numerical results demonstrate that TM provides comparable performance with CNNs on the MIT-BIH database. To validate interpretability, we provide explanatory diagrams that show how TM makes the PVC identification from confirming and invalidating patterns. We argue that these are compatible with medical knowledge so that they can be readily understood and verified by a medical doctor. Accordingly, we believe this study paves the way for machine learning (ML) for ECG analysis in clinical practice.

Submitted 20 January, 2023; originally announced January 2023.

arXiv:2301.06059 [pdf, other]

Learning Audio-Driven Viseme Dynamics for 3D Face Animation

Authors: Linchao Bao, Haoxian Zhang, Yue Qian, Tangli Xue, Changhai Chen, Xuefei Zhe, Di Kang

Abstract: We present a novel audio-driven facial animation approach that can generate realistic lip-synchronized 3D facial animations from the input audio. Our approach learns viseme dynamics from speech videos, produces animator-friendly viseme curves, and supports multilingual speech inputs. The core of our approach is a novel parametric viseme fitting algorithm that utilizes phoneme priors to extract viseme parameters from speech videos. With the guidance of phonemes, the extracted viseme curves correlate better with phonemes and are thus more controllable and friendly to animators. To support multilingual speech inputs and generalizability to unseen voices, we take advantage of deep audio feature models pretrained on multiple languages to learn the mapping from audio to viseme curves. Our audio-to-curves mapping achieves state-of-the-art performance even when the input audio suffers from distortions of volume, pitch, speed, or noise. Lastly, a viseme scanning approach for acquiring high-fidelity viseme assets is presented for efficient speech animation production. We show that the predicted viseme curves can be applied to different viseme-rigged characters to yield various personalized animations with realistic and natural facial motions. Our approach is artist-friendly and can be easily integrated into typical animation production workflows, including blendshape or bone based animation.

Submitted 15 January, 2023; originally announced January 2023.

Comments: Project page: https://linchaobao.github.io/viseme2023/

arXiv:2301.04907 [pdf, other]

Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation

Authors: Yushan Qian, Bo Wang, Shangzhao Ma, Wu Bin, Shuo Zhang, Dongming Zhao, Kun Huang, Yuexian Hou

Abstract: Towards human-like dialogue systems, current emotional dialogue approaches jointly model emotion and semantics with a unified neural network. This strategy tends to generate safe responses due to the mutual restriction between emotion and semantics, and requires a large-scale emotion-annotated dialogue corpus, which is rare. Inspired by the "think twice" behavior in human dialogue, we propose a two-stage conversational agent for the generation of emotional dialogue. Firstly, a dialogue model trained without the emotion-annotated dialogue corpus generates a prototype response that meets the contextual semantics. Secondly, the first-stage prototype is modified by a controllable emotion refiner with the empathy hypothesis. Experimental results on the DailyDialog and EmpatheticDialogues datasets demonstrate that the proposed conversational agent outperforms the comparison models in emotion generation and maintains the semantic performance in automatic and human evaluations.

Submitted 8 June, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: Accepted to AAMAS2023

arXiv:2212.13860 [pdf]

Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain

Authors: Chengzhi Zhang, Yi Xiang, Wenke Hao, Zhicheng Li, Yuchen Qian, Yuzhuo Wang

Abstract: Future work sentences (FWS) are the particular sentences in academic papers that contain the author's description of their proposed follow-up research direction. This paper presents methods to automatically extract FWS from academic papers and classify them according to the different future directions embodied in the paper's content. FWS recognition methods will enable subsequent researchers to locate future work sentences more accurately and quickly and reduce the time and cost of acquiring the corpus. Current work on automatic identification of future work sentences is relatively scarce, and the existing research cannot accurately identify FWS from academic papers, and thus cannot conduct data mining on a large scale. Furthermore, there are many aspects to the content of future work, and the subdivision of the content is conducive to the analysis of specific development directions. In this paper, Natural Language Processing (NLP) is used as a case study, and FWS are extracted from academic papers and classified into different types. We manually build an annotated corpus with six different types of FWS. Then, automatic recognition and classification of FWS are implemented using machine learning models, and the performance of these models is compared based on the evaluation metrics. The results show that the Bernoulli Bayesian model has the best performance in the automatic recognition task, with the Macro F1 reaching 90.73%, and the SCIBERT model has the best performance in the automatic classification task, with the weighted average F1 reaching 72.63%. Finally, we extract keywords from FWS and gain a deep understanding of the key content described in FWS, and we also demonstrate that content determination in FWS will be reflected in the subsequent research work by measuring the similarity between future work sentences and the abstracts.

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2212.12984 [pdf, other]

MC-Nonlocal-PINNs: handling nonlocal operators in PINNs via Monte Carlo sampling

Authors: Xiaodong Feng, Yue Qian, Wanfang Shen

Abstract: We propose Monte Carlo Nonlocal physics-informed neural networks (MC-Nonlocal-PINNs), a generalization of MC-fPINNs in \cite{guo2022monte}, for solving general nonlocal models such as integral equations and nonlocal PDEs. As in MC-fPINNs, our MC-Nonlocal-PINNs handle the nonlocal operators in a Monte Carlo way, resulting in a very stable approach for high dimensional problems. We present a variety of test problems, including high dimensional Volterra type integral equations, hypersingular integral equations and nonlocal PDEs, to demonstrate the effectiveness of our approach.

Submitted 25 December, 2022; originally announced December 2022.

Comments: 23 pages, 13 figures

arXiv:2212.08892 [pdf, other]

Flattening-Net: Deep Regular 2D Representation for 3D Point Cloud Analysis

Authors: Qijian Zhang, Junhui Hou, Yue Qian, Yiming Zeng, Juyong Zhang, Ying He

Abstract: Point clouds are characterized by irregularity and unstructuredness, which pose challenges in efficient data exploitation and discriminative feature extraction. In this paper, we present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology as a completely regular 2D point geometry image (PGI) structure, in which coordinates of spatial points are captured in colors of image pixels. Intuitively, Flattening-Net implicitly approximates a locally smooth 3D-to-2D surface flattening process while effectively preserving neighborhood consistency. As a generic representation modality, PGI inherently encodes the intrinsic property of the underlying manifold structure and facilitates surface-style point feature aggregation. To demonstrate its potential, we construct a unified learning framework directly operating on PGIs to achieve diverse types of high-level and low-level downstream applications driven by specific task networks, including classification, segmentation, reconstruction, and upsampling. Extensive experiments demonstrate that our methods perform favorably against the current state-of-the-art competitors. We will make the code and data publicly available at https://github.com/keeganhk/Flattening-Net.

Submitted 7 February, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

Comments: Accepted to TPAMI

arXiv:2212.08472 [pdf, other]

doi 10.1109/TGRS.2023.3243062

One-Stage Cascade Refinement Networks for Infrared Small Target Detection

Authors: Yimian Dai, Xiang Li, Fei Zhou, Yulei Qian, Yaohong Chen, Jian Yang

Abstract: Single-frame InfraRed Small Target (SIRST) detection has been a challenging task due to a lack of inherent characteristics, imprecise bounding box regression, a scarcity of real-world datasets, and sensitive localization evaluation. In this paper, we propose a comprehensive solution to these challenges. First, we find that the existing anchor-free label assignment method is prone to mislabeling small targets as background, leading to their omission by detectors. To overcome this issue, we propose an all-scale pseudo-box-based label assignment scheme that relaxes the constraints on scale and decouples the spatial assignment from the size of the ground-truth target. Second, motivated by the structured prior of feature pyramids, we introduce the one-stage cascade refinement network (OSCAR), which uses the high-level head as soft proposals for the low-level refinement head. This allows OSCAR to process the same target in a cascade coarse-to-fine manner. Finally, we present a new research benchmark for infrared small target detection, consisting of the SIRST-V2 dataset of real-world, high-resolution single-frame targets, the normalized contrast evaluation metric, and the DeepInfrared toolkit for detection. We conduct extensive ablation studies to evaluate the components of OSCAR and compare its performance to state-of-the-art model-driven and data-driven methods on the SIRST-V2 benchmark. Our results demonstrate that a top-down cascade refinement framework can improve the accuracy of infrared small target detection without sacrificing efficiency. The DeepInfrared toolkit, dataset, and trained models are available at https://github.com/YimianDai/open-deepinfrared to advance further research in this field.

Submitted 31 December, 2022; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: Submitted to TGRS

arXiv:2212.08266 [pdf, other]

doi 10.1103/PhysRevD.108.L021303

Signature of Collapsars as Sources for High-energy Neutrinos and $r$-process Nuclei

Authors: Gang Guo, Yong-Zhong Qian, Meng-Ru Wu

Abstract: If collapsars are sources for both high-energy (HE) neutrinos and $r$-process nuclei, the profuse low-energy antineutrinos from $β$-decay of the newly-synthesized nuclei can annihilate the HE neutrinos. Considering HE neutrinos produced at internal shocks induced by intermittent mildly-magnetized jets, we show that such annihilation suppresses the overall HE neutrino spectrum at $\gtrsim 300$ TeV and produces a corresponding flavor composition of $(F_{ν_e+\barν_e}: F_{ν_μ+\barν_μ}: F_{ν_τ+\barν_τ})_\star \approx (1 : 10 : 1)$ at source. We find that the emergent HE neutrino flux can well fit the diffuse flux observed at IceCube if contributions from all similar sources are taken into account. Our results highlight the unique role of HE neutrinos in supporting collapsars as sources for $r$-process nuclei, and can be tested by accurate measurement of the diffuse HE neutrino flux spectrum and flavor composition, as well as detection of HE neutrinos from individual sources.

Submitted 23 July, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: 10 pages, 7 figures

Journal ref: Phys. Rev. D 108, L021303 (2023)

arXiv:2212.04019 [pdf, other]

Silicon-based decoder for polarization-encoding quantum key distribution

Authors: Yongqiang Du, Xun Zhu, Xin Hua, Zhengeng Zhao, Xiao Hu, Yi Qian, Xi Xiao, Kejin Wei

Abstract: Silicon-based polarization-encoding quantum key distribution (QKD) has been widely studied, owing to its low cost and robustness. However, prior studies have utilized off-chip devices to demodulate the quantum states or perform polarization compensation, given the difficulty of fabricating polarization-independent components on the chip. In this paper, we propose a fully chip-based decoder for polarization-encoding QKD. The chip realizes a polarization state analyzer and compensates for the BB84 protocol without requiring additional hardware. It is based on a polarization-to-path conversion method that uses a polarization splitter-rotator. The chip was fabricated using a standard silicon photonics foundry; it has a compact design and is suitable for mass production. In the experimental stability test, an average quantum bit error rate of 0.56% was achieved through continuous operation for 10 h without any polarization feedback. Furthermore, using the developed feedback algorithm, the chip enabled the automatic compensation of the fiber polarization drift, which was emulated by a random fiber polarization scrambler. In the case of the QKD demonstration, we obtained a finite-key secret rate of 240 bps over a fiber spool of 100 km. This study represents an important step toward the integrated, practical, and large-scale deployment of QKD systems.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 11 pages

arXiv:2212.02163 [pdf, other]

2D Human Pose Estimation with Explicit Anatomical Keypoints Structure Constraints

Authors: Zhangjian Ji, Zilong Wang, Ming Zhang, Yapeng Chen, Yuhua Qian

Abstract: Recently, human pose estimation has mainly focused on how to design a more effective deep network structure as a human feature extractor, and most designed feature extraction networks only introduce the position of each anatomical keypoint to guide their training process. However, we found that some human anatomical keypoints keep their topology invariant, which can help to localize them more accurately when detecting the keypoints on the feature map. To the best of our knowledge, no literature has specifically studied this. Thus, in this paper, we present a novel 2D human pose estimation method with explicit anatomical keypoints structure constraints, which introduces into the loss objective a topology constraint term consisting of the differences between the keypoint-to-keypoint distances and directions and their ground truth. More importantly, our proposed model can be plugged into most existing bottom-up or top-down human pose estimation methods and improve their performance. Extensive experiments on the COCO keypoint benchmark dataset show that our methods perform favorably against most existing bottom-up and top-down human pose estimation methods. In particular, when our model is plugged into Lite-HRNet, its AP scores rise by 2.9% and 3.3% on the COCO val2017 and test-dev2017 datasets, respectively.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.12277 [pdf, other]

Semantic Guided Level-Category Hybrid Prediction Network for Hierarchical Image Classification

Authors: Peng Wang, Jingzhou Chen, Yuntao Qian

Abstract: Hierarchical classification (HC) assigns each object multiple labels organized into a hierarchical structure. The existing deep learning based HC methods usually predict an instance starting from the root node until a leaf node is reached. However, in the real world, images degraded by noise, occlusion, blur, or low resolution may not provide sufficient information for the classification at subordinate levels. To address this issue, we propose a novel semantic guided level-category hybrid prediction network (SGLCHPN) that can jointly perform the level and category prediction in an end-to-end manner. SGLCHPN comprises two modules: a visual transformer that extracts feature vectors from the input images, and a semantic guided cross-attention module that uses category word embeddings as queries to guide learning category-specific representations. In order to evaluate the proposed method, we construct two new datasets in which images span a broad range of quality and thus are labeled to different levels (depths) in the hierarchy according to their individual quality. Experimental results demonstrate the effectiveness of our proposed HC method.

Submitted 31 March, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.09412 [pdf, other]

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Abstract: Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history with a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input and achieve the best performance. The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP2023

arXiv:2211.04219 [pdf, other]

Nimbus: Toward Speed Up Function Signature Recovery via Input Resizing and Multi-Task Learning

Authors: Yi Qian, Ligeng Chen, Yuyang Wang, Bing Mao

Abstract: Function signature recovery is important for many binary analysis tasks such as control-flow integrity enforcement, clone detection, and bug finding. Existing works try to replace rule-based methods with learning-based methods to reduce human effort. They made considerable efforts to enhance the system's performance, which also brings the side effect of higher resource consumption. However, recovering the function signature is more about providing information for subsequent tasks, and both efficiency and performance are significant. In this paper, we first propose a method called Nimbus for efficient function signature recovery that reduces the whole-process resource consumption as far as possible without performance loss. Thanks to information bias and task relation (i.e., the relation between parameter count and parameter type recovery), we utilize selective inputs and introduce a multi-task learning (MTL) structure for function signature recovery to reduce computational resource consumption and fully leverage mutual information. Our experimental results show that, with only about one-eighth of the processing time of the state-of-the-art method, we even achieve about 1% more prediction accuracy over all function signature recovery tasks.

Submitted 8 November, 2022; originally announced November 2022.

arXiv:2211.01267 [pdf, other]

Multi-Vector Retrieval as Sparse Alignment

Authors: Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, Vincent Y. Zhao

Abstract: Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. 'kind' from 'kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (≤ 8) further improves the performance up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of AligneR help us to keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2211.00815 [pdf, other]

Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022

Authors: Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian

Abstract: Many speaker recognition challenges have been held to assess the speaker verification system in the wild and probe the performance limit. The VoxCeleb Speaker Recognition Challenge (VoxSRC), based on VoxCeleb, is the most popular. Besides, another challenge called the CN-Celeb Speaker Recognition Challenge (CNSRC) is also held this year, which is based on the Chinese celebrity multi-genre dataset CN-Celeb. This year, our team participated in both speaker verification closed tracks in CNSRC 2022 and VoxSRC 2022, and achieved 1st place and 3rd place, respectively. In most system reports, the authors usually only provide a description of their systems but lack an effective analysis of their methods. In this paper, we will outline how to build a strong speaker verification challenge system and give a detailed analysis of each method compared with some other popular technical means.

Submitted 1 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted by InterSpeech 2023

arXiv:2210.17016 [pdf, other]

Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit

Authors: Hongji Wang, Chengdong Liang, Shuai Wang, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, Yanmin Qian

Abstract: Speaker modeling is essential for many related tasks, such as speaker recognition and speaker diarization. The dominant modeling approach is fixed-dimensional vector representation, i.e., speaker embedding. This paper introduces a research and production oriented speaker embedding learning toolkit, Wespeaker. Wespeaker contains the implementation of scalable data management, state-of-the-art speaker embedding models, loss functions, and scoring back-ends, with highly competitive results achieved by structured recipes which were adopted in the winning systems in several speaker verification challenges. The application to other downstream tasks such as speaker diarization is also exhibited in the related recipe. Moreover, CPU- and GPU-compatible deployment codes are integrated for production-oriented development. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.

Submitted 1 November, 2022; v1 submitted 30 October, 2022; originally announced October 2022.

arXiv:2210.15936 [pdf, other]

A comprehensive study on self-supervised distillation for speaker representation learning

Authors: Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, Michael Zeng

Abstract: In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning with no labels has become an increasingly promising way to solve this. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function and thus are more attractive. In this paper, we present a comprehensive study on self-distilled self-supervised speaker representation learning, especially on critical data augmentation. Our proposed strategy of audio perturbation augmentation has pushed the performance of the speaker representation to a new limit. The experimental results show that our model can achieve a new SoTA on the VoxCeleb1 speaker verification evaluation benchmark (i.e., equal error rate (EER) of 2.505%, 2.473%, and 4.791% for trials Vox1-O, Vox1-E, and Vox1-H, respectively), discarding any speaker labels in the training phase.

Submitted 25 November, 2022; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted by SLT2022

Showing 151–200 of 652 results for author: Qian, Y