Search | arXiv e-print repository

Integrating AI in College Education: Positive yet Mixed Experiences with ChatGPT

Authors: Xinrui Song, Jiajin Zhang, Pingkun Yan, Juergen Hahn, Uwe Kruger, Hisham Mohamed, Ge Wang

Abstract: The integration of artificial intelligence (AI) chatbots into higher education marks a shift towards a new generation of pedagogical tools, mirroring the arrival of milestones like the internet. With the launch of ChatGPT-4 Turbo in November 2023, we developed a ChatGPT-based teaching application (https://chat.openai.com/g/g-1imx1py4K-chatge-medical-imaging) and integrated it into our undergraduat… ▽ More The integration of artificial intelligence (AI) chatbots into higher education marks a shift towards a new generation of pedagogical tools, mirroring the arrival of milestones like the internet. With the launch of ChatGPT-4 Turbo in November 2023, we developed a ChatGPT-based teaching application (https://chat.openai.com/g/g-1imx1py4K-chatge-medical-imaging) and integrated it into our undergraduate medical imaging course in the Spring 2024 semester. This study investigates the use of ChatGPT throughout a semester-long trial, providing insights into students' engagement, perception, and the overall educational effectiveness of the technology. We systematically collected and analyzed data concerning students' interaction with ChatGPT, focusing on their attitudes, concerns, and usage patterns. The findings indicate that ChatGPT offers significant advantages such as improved information access and increased interactivity, but its adoption is accompanied by concerns about the accuracy of the information provided and the necessity for well-defined guidelines to optimize its use. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.03658 [pdf, other]

GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

Authors: Jianhao Yan, Pingchuan Yan, Yulong Chen, Judy Li, Xianchao Zhu, Yue Zhang

Abstract: This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We a… ▽ More This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe the imbalanced performance across different languages and domains, with GPT-4's translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translation given by GPT-4 and human translators, and find that GPT-4 translator suffers from literal translations, but human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.00557 [pdf, other]

Explaining Chest X-ray Pathology Models using Textual Concepts

Authors: Vijay Sadashivaiah, Mannudeep K. Kalra, Pingkun Yan, James A. Hendler

Abstract: Deep learning models have revolutionized medical imaging and diagnostics, yet their opaque nature poses challenges for clinical adoption and trust. Amongst approaches to improve model interpretability, concept-based explanations aim to provide concise and human understandable explanations of any arbitrary classifier. However, such methods usually require a large amount of manually collected data w… ▽ More Deep learning models have revolutionized medical imaging and diagnostics, yet their opaque nature poses challenges for clinical adoption and trust. Amongst approaches to improve model interpretability, concept-based explanations aim to provide concise and human understandable explanations of any arbitrary classifier. However, such methods usually require a large amount of manually collected data with concept annotation, which is often scarce in the medical domain. In this paper, we propose Conceptual Counterfactual Explanations for Chest X-ray (CoCoX) that leverage existing vision-language models (VLM) joint embedding space to explain black-box classifier outcomes without the need for annotated datasets. Specifically, we utilize textual concepts derived from chest radiography reports and a pre-trained chest radiography-based VLM to explain three common cardiothoracic pathologies. We demonstrate that the explanations generated by our method are semantically meaningful and faithful to underlying pathologies. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2407.00541 [pdf]

Answering real-world clinical questions using large language model based systems

Authors: Yen Sia Low, Michael L. Jackson, Rebecca J. Hyde, Robert E. Brown, Neil M. Sanghavi, Julian D. Baldwin, C. William Pike, Jananee Muralidharan, Gavin Hui, Natasha Alexander, Hadeel Hassan, Rahul V. Nene, Morgan Pike, Courtney J. Pokrzywa, Shivam Vedak, Adam Paul Yan, Dong-han Yao, Amy R. Zipursky, Christina Dinh, Philip Ballentine, Dan C. Derieg, Vladimir Polony, Rehan N. Chawdry, Jordan Davies, Brigham B. Hyde , et al. (2 additional authors not shown)

Abstract: Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-bas… ▽ More Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Comments: 28 pages (2 figures, 3 tables) inclusive of 8 pages of supplemental materials (4 supplemental figures and 4 supplemental tables)

arXiv:2407.00514 [pdf, ps, other]

Combining Classical and Probabilistic Independence Reasoning to Verify the Security of Oblivious Algorithms (Extended Version)

Authors: Pengbo Yan, Toby Murray, Olga Ohrimenko, Van-Thuan Pham, Robert Sison

Abstract: We consider the problem of how to verify the security of probabilistic oblivious algorithms formally and systematically. Unfortunately, prior program logics fail to support a number of complexities that feature in the semantics and invariant needed to verify the security of many practical probabilistic oblivious algorithms. We propose an approach based on reasoning over perfectly oblivious approxi… ▽ More We consider the problem of how to verify the security of probabilistic oblivious algorithms formally and systematically. Unfortunately, prior program logics fail to support a number of complexities that feature in the semantics and invariant needed to verify the security of many practical probabilistic oblivious algorithms. We propose an approach based on reasoning over perfectly oblivious approximations, using a program logic that combines both classical Hoare logic reasoning and probabilistic independence reasoning to support all the needed features. We formalise and prove our new logic sound in Isabelle/HOL and apply our approach to formally verify the security of several challenging case studies beyond the reach of prior methods for proving obliviousness. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19631 [pdf, other]

Personalized Interpretation on Federated Learning: A Virtual Concepts approach

Authors: Peng Yan, Guodong Long, Jing Jiang, Michael Blumenstein

Abstract: Tackling non-IID data is an open challenge in federated learning research. Existing FL methods, including robust FL and personalized FL, are designed to improve model performance without consideration of interpreting non-IID across clients. This paper aims to design a novel FL method to robust and interpret the non-IID data across clients. Specifically, we interpret each client's dataset as a mixt… ▽ More Tackling non-IID data is an open challenge in federated learning research. Existing FL methods, including robust FL and personalized FL, are designed to improve model performance without consideration of interpreting non-IID across clients. This paper aims to design a novel FL method to robust and interpret the non-IID data across clients. Specifically, we interpret each client's dataset as a mixture of conceptual vectors that each one represents an interpretable concept to end-users. These conceptual vectors could be pre-defined or refined in a human-in-the-loop process or be learnt via the optimization procedure of the federated learning system. In addition to the interpretability, the clarity of client-specific personalization could also be applied to enhance the robustness of the training process on FL system. The effectiveness of the proposed method have been validated on benchmark datasets. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.00258 [pdf, other]

Artemis: Towards Referential Understanding in Complex Videos

Authors: Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian

Abstract: Videos carry rich visual information including object description, action, interaction, etc., but the existing multimodal large language models (MLLMs) fell short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language quest… ▽ More Videos carry rich visual information including object description, action, interaction, etc., but the existing multimodal large language models (MLLMs) fell short in referential understanding scenarios such as video-based referring. In this paper, we present Artemis, an MLLM that pushes video-based referential understanding to a finer level. Given a video, Artemis receives a natural-language question with a bounding box in any video frame and describes the referred target in the entire video. The key to achieving this goal lies in extracting compact, target-specific video features, where we set a solid baseline by tracking and selecting spatiotemporal features from the video. We train Artemis on the newly established VideoRef45K dataset with 45K video-QA pairs and design a computationally efficient, three-stage training procedure. Results are promising both quantitatively and qualitatively. Additionally, we show that \model can be integrated with video grounding and text summarization tools to understand more complex scenarios. Code and data are available at https://github.com/qiujihao19/Artemis. △ Less

Submitted 31 May, 2024; originally announced June 2024.

Comments: 19 pages, 14 figures. Code and data are available at https://github.com/qiujihao19/Artemis

arXiv:2405.18533 [pdf, other]

Cardiovascular Disease Detection from Multi-View Chest X-rays with BI-Mamba

Authors: Zefan Yang, Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan

Abstract: Accurate prediction of Cardiovascular disease (CVD) risk in medical imaging is central to effective patient health management. Previous studies have demonstrated that imaging features in computed tomography (CT) can help predict CVD risk. However, CT entails notable radiation exposure, which may result in adverse health effects for patients. In contrast, chest X-ray emits significantly lower level… ▽ More Accurate prediction of Cardiovascular disease (CVD) risk in medical imaging is central to effective patient health management. Previous studies have demonstrated that imaging features in computed tomography (CT) can help predict CVD risk. However, CT entails notable radiation exposure, which may result in adverse health effects for patients. In contrast, chest X-ray emits significantly lower levels of radiation, offering a safer option. This rationale motivates our investigation into the feasibility of using chest X-ray for predicting CVD risk. Convolutional Neural Networks (CNNs) and Transformers are two established network architectures for computer-aided diagnosis. However, they struggle to model very high resolution chest X-ray due to the lack of large context modeling power or quadratic time complexity. Inspired by state space sequence models (SSMs), a new class of network architectures with competitive sequence modeling power as Transfomers and linear time complexity, we propose Bidirectional Image Mamba (BI-Mamba) to complement the unidirectional SSMs with opposite directional information. BI-Mamba utilizes parallel forward and backwark blocks to encode longe-range dependencies of multi-view chest X-rays. We conduct extensive experiments on images from 10,395 subjects in National Lung Screening Trail (NLST). Results show that BI-Mamba outperforms ResNet-50 and ViT-S with comparable parameter size, and saves significant amount of GPU memory during training. Besides, BI-Mamba achieves promising performance compared with previous state of the art in CT, unraveling the potential of chest X-ray for CVD risk prediction. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Early accepted paper for MICCAI 2024

arXiv:2405.15728 [pdf, other]

Disease-informed Adaptation of Vision-Language Models

Authors: Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan

Abstract: In medical image analysis, the expertise scarcity and the high cost of data annotation limits the development of large artificial intelligence models. This paper investigates the potential of transfer learning with pre-trained vision-language models (VLMs) in this domain. Currently, VLMs still struggle to transfer to the underrepresented diseases with minimal presence and new diseases entirely abs… ▽ More In medical image analysis, the expertise scarcity and the high cost of data annotation limits the development of large artificial intelligence models. This paper investigates the potential of transfer learning with pre-trained vision-language models (VLMs) in this domain. Currently, VLMs still struggle to transfer to the underrepresented diseases with minimal presence and new diseases entirely absent from the pretraining dataset. We argue that effective adaptation of VLMs hinges on the nuanced representation learning of disease concepts. By capitalizing on the joint visual-linguistic capabilities of VLMs, we introduce disease-informed contextual prompting in a novel disease prototype learning framework. This approach enables VLMs to grasp the concepts of new disease effectively and efficiently, even with limited data. Extensive experiments across multiple image modalities showcase notable enhancements in performance compared to existing techniques. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Early Accepted by MICCAI 2024

arXiv:2405.13467 [pdf, other]

AdaFedFR: Federated Face Recognition with Adaptive Inter-Class Representation Learning

Authors: Di Qiu, Xinyang Lin, Kaiye Wang, Xiangxiang Chu, Pengfei Yan

Abstract: With the growing attention on data privacy and communication security in face recognition applications, federated learning has been introduced to learn a face recognition model with decentralized datasets in a privacy-preserving manner. However, existing works still face challenges such as unsatisfying performance and additional communication costs, limiting their applicability in real-world scena… ▽ More With the growing attention on data privacy and communication security in face recognition applications, federated learning has been introduced to learn a face recognition model with decentralized datasets in a privacy-preserving manner. However, existing works still face challenges such as unsatisfying performance and additional communication costs, limiting their applicability in real-world scenarios. In this paper, we propose a simple yet effective federated face recognition framework called AdaFedFR, by devising an adaptive inter-class representation learning algorithm to enhance the generalization of the generic face model and the efficiency of federated training under strict privacy-preservation. In particular, our work delicately utilizes feature representations of public identities as learnable negative knowledge to optimize the local objective within the feature space, which further encourages the local model to learn powerful representations and optimize personalized models for clients. Experimental results demonstrate that our method outperforms previous approaches on several prevalent face recognition benchmarks within less than 3 communication rounds, which shows communication-friendly and great efficiency. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.11344 [pdf]

LiPost: Improved Content Understanding With Effective Use of Multi-task Contrastive Learning

Authors: Akanksha Bindal, Sudarshan Ramanujam, Dave Golland, TJ Hazen, Tina Jiang, Fengyu Zhang, Peng Yan

Abstract: In enhancing LinkedIn core content recommendation models, a significant challenge lies in improving their semantic understanding capabilities. This paper addresses the problem by leveraging multi-task learning, a method that has shown promise in various domains. We fine-tune a pre-trained, transformer-based LLM using multi-task contrastive learning with data from a diverse set of semantic labeling… ▽ More In enhancing LinkedIn core content recommendation models, a significant challenge lies in improving their semantic understanding capabilities. This paper addresses the problem by leveraging multi-task learning, a method that has shown promise in various domains. We fine-tune a pre-trained, transformer-based LLM using multi-task contrastive learning with data from a diverse set of semantic labeling tasks. We observe positive transfer, leading to superior performance across all tasks when compared to training independently on each. Our model outperforms the baseline on zero shot learning and offers improved multilingual support, highlighting its potential for broader application. The specialized content embeddings produced by our model outperform generalized embeddings offered by OpenAI on Linkedin dataset and tasks. This work provides a robust foundation for vertical teams across LinkedIn to customize and fine-tune the LLM to their specific applications. Our work offers insights and best practices for the field to build on. △ Less

Submitted 13 July, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

arXiv:2404.08450 [pdf, other]

Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

Authors: Xianhua He, Dashuang Liang, Song Yang, Zhanlong Hao, Hui Ma, Binjie Mao, Xi Li, Yao Wang, Pengfei Yan, Ajian Liu

Abstract: Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to dev… ▽ More Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: 10 pages with 6 figures, Accepted by CVPRW 2024

arXiv:2404.08361 [pdf, other]

Large-Scale Multi-Domain Recommendation: an Automatic Domain Feature Extraction and Personalized Integration Framework

Authors: Dongbo Xi, Zhen Chen, Yuexian Wang, He Cui, Chong Peng, Fuzhen Zhuang, Peng Yan

Abstract: Feed recommendation is currently the mainstream mode for many real-world applications (e.g., TikTok, Dianping), it is usually necessary to model and predict user interests in multiple scenarios (domains) within and even outside the application. Multi-domain learning is a typical solution in this regard. While considerable efforts have been made in this regard, there are still two long-standing cha… ▽ More Feed recommendation is currently the mainstream mode for many real-world applications (e.g., TikTok, Dianping), it is usually necessary to model and predict user interests in multiple scenarios (domains) within and even outside the application. Multi-domain learning is a typical solution in this regard. While considerable efforts have been made in this regard, there are still two long-standing challenges: (1) Accurately depicting the differences among domains using domain features is crucial for enhancing the performance of each domain. However, manually designing domain features and models for numerous domains can be a laborious task. (2) Users typically have limited impressions in only a few domains. Extracting features automatically from other domains and leveraging them to improve the predictive capabilities of each domain has consistently posed a challenging problem. In this paper, we propose an Automatic Domain Feature Extraction and Personalized Integration (DFEI) framework for the large-scale multi-domain recommendation. The framework automatically transforms the behavior of each individual user into an aggregation of all user behaviors within the domain, which serves as the domain features. Unlike offline feature engineering methods, the extracted domain features are higher-order representations and directly related to the target label. Besides, by personalized integration of domain features from other domains for each user and the innovation in the training mode, the DFEI framework can yield more accurate conversion identification. Experimental results on both public and industrial datasets, consisting of over 20 domains, clearly demonstrate that the proposed framework achieves significantly better performance compared with SOTA baselines. Furthermore, we have released the source code of the proposed framework at https://github.com/xidongbo/DFEI. △ Less

Submitted 14 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: 8 pages

arXiv:2404.03181 [pdf, other]

MonoCD: Monocular 3D Object Detection with Complementary Depths

Authors: Longfei Yan, Pei Yan, Shengzhou Xiong, Xuanyu Xiang, Yihua Tan

Abstract: Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formu… ▽ More Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3d object detectors. Code is available at https://github.com/elvintanhust/MonoCD. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024

arXiv:2404.02655 [pdf, other]

Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Authors: Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu

Abstract: Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate. In this paper, we decompose the language model confidence into the \textit{Uncertainty} about the question and the… ▽ More Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence, where the expressed confidence does not accurately calibrate with their correctness rate. In this paper, we decompose the language model confidence into the \textit{Uncertainty} about the question and the \textit{Fidelity} to the answer generated by language models. Then, we propose a plug-and-play method to estimate the confidence of language models. Our method has shown good calibration performance by conducting experiments with 6 RLHF-LMs on four MCQA datasets. Moreover, we propose two novel metrics, IPR and CE, to evaluate the calibration of the model, and we have conducted a detailed discussion on \textit{Truly Well-Calibrated Confidence}. Our method could serve as a strong baseline, and we hope that this work will provide some insights into the model confidence calibration. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: 17 pages, 13 figures

arXiv:2403.19499 [pdf, other]

Client-supervised Federated Learning: Towards One-model-for-all Personalization

Authors: Peng Yan, Guodong Long

Abstract: Personalized Federated Learning (PerFL) is a new machine learning paradigm that delivers personalized models for diverse clients under federated learning settings. Most PerFL methods require extra learning processes on a client to adapt a globally shared model to the client-specific personalized model using its own local data. However, the model adaptation process in PerFL is still an open challen… ▽ More Personalized Federated Learning (PerFL) is a new machine learning paradigm that delivers personalized models for diverse clients under federated learning settings. Most PerFL methods require extra learning processes on a client to adapt a globally shared model to the client-specific personalized model using its own local data. However, the model adaptation process in PerFL is still an open challenge in the stage of model deployment and test time. This work tackles the challenge by proposing a novel federated learning framework to learn only one robust global model to achieve competitive performance to those personalized models on unseen/test clients in the FL system. Specifically, we design a new Client-Supervised Federated Learning (FedCS) to unravel clients' bias on instances' latent representations so that the global model can learn both client-specific and client-agnostic knowledge. Experimental study shows that the FedCS can learn a robust FL global model for the changing data distributions of unseen/test clients. The FedCS's global model can be directly deployed to the test clients while achieving comparable performance to other personalized FL methods that require model adaptation. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.00274 [pdf, other]

CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

Authors: Xi Liu, Ying Guo, Cheng Zhen, Tong Li, Yingying Ao, Pengfei Yan

Abstract: Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversion.The applications of listener agent generation in virtual interaction have promoted many works achieving the diverse and fine-grained motion generation. However, they can only manipulate motions through simple emotional labels, but… ▽ More Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversion.The applications of listener agent generation in virtual interaction have promoted many works achieving the diverse and fine-grained motion generation. However, they can only manipulate motions through simple emotional labels, but cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g. identity, personality) which can be freely customized by users, this limits their realism. In this paper, we propose a user-friendly framework called CustomListener to realize the free-form text prior guided listener generation. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait module (SDP), which interacts with speaker information to transform static text into dynamic portrait token with completion rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation Module (PGG) to maintain the consistency of customized listener attributes through the motion prior, and utilize a diffusion-based structure conditioned on the portrait token and the motion prior to realize the controllable generation. To train and evaluate our model, we have constructed two text-annotated listening head datasets based on ViCo and RealTalk, which provide text-video paired labels. Extensive experiments have verified the effectiveness of our model. △ Less

Submitted 29 March, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2403.00209 [pdf, other]

ChartReformer: Natural Language-Driven Chart Image Editing

Authors: Pengyu Yan, Mahesh Bhosale, Jay Lal, Bikhyat Adhikari, David Doermann

Abstract: Chart visualizations are essential for data interpretation and communication; however, most charts are only accessible in image format and lack the corresponding data tables and supplementary information, making it difficult to alter their appearance for different application scenarios. To eliminate the need for original underlying data and information to perform chart editing, we propose ChartRef… ▽ More Chart visualizations are essential for data interpretation and communication; however, most charts are only accessible in image format and lack the corresponding data tables and supplementary information, making it difficult to alter their appearance for different application scenarios. To eliminate the need for original underlying data and information to perform chart editing, we propose ChartReformer, a natural language-driven chart image editing solution that directly edits the charts from the input images with the given instruction prompts. The key in this method is that we allow the model to comprehend the chart and reason over the prompt to generate the corresponding underlying data table and visual attributes for new charts, enabling precise edits. Additionally, to generalize ChartReformer, we define and standardize various types of chart editing, covering style, layout, format, and data-centric edits. The experiments show promising results for the natural language-driven chart image editing. △ Less

Submitted 1 May, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: Published in ICDAR 2024. Code and model are available at https://github.com/pengyu965/ChartReformer

arXiv:2402.15687 [pdf, other]

General Purpose Image Encoder DINOv2 for Medical Image Registration

Authors: Xinrui Song, Xuanang Xu, Pingkun Yan

Abstract: Existing medical image registration algorithms rely on either dataset specific training or local texture-based features to align images. The former cannot be reliably implemented without large modality-specific training datasets, while the latter lacks global semantics thus could be easily trapped at local minima. In this paper, we present a training-free deformable image registration method, DINO… ▽ More Existing medical image registration algorithms rely on either dataset specific training or local texture-based features to align images. The former cannot be reliably implemented without large modality-specific training datasets, while the latter lacks global semantics thus could be easily trapped at local minima. In this paper, we present a training-free deformable image registration method, DINO-Reg, leveraging a general purpose image encoder DINOv2 for image feature extraction. The DINOv2 encoder was trained using the ImageNet data containing natural images. We used the pretrained DINOv2 without any finetuning. Our method feeds the DINOv2 encoded features into a discrete optimizer to find the optimal deformable registration field. We conducted a series of experiments to understand the behavior and role of such a general purpose image encoder in the application of image registration. Combined with handcrafted features, our method won the first place in the recent OncoReg Challenge. To our knowledge, this is the first application of general vision foundation models in medical image registration. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.00137 [pdf, other]

Multimodal Neurodegenerative Disease Subtyping Explained by ChatGPT

Authors: Diego Machado Reyes, Hanqing Chao, Juergen Hahn, Li Shen, Pingkun Yan

Abstract: Alzheimer's disease (AD) is the most prevalent neurodegenerative disease; yet its currently available treatments are limited to stopping disease progression. Moreover, effectiveness of these treatments is not guaranteed due to the heterogenetiy of the disease. Therefore, it is essential to be able to identify the disease subtypes at a very early stage. Current data driven approaches are able to cl… ▽ More Alzheimer's disease (AD) is the most prevalent neurodegenerative disease; yet its currently available treatments are limited to stopping disease progression. Moreover, effectiveness of these treatments is not guaranteed due to the heterogenetiy of the disease. Therefore, it is essential to be able to identify the disease subtypes at a very early stage. Current data driven approaches are able to classify the subtypes at later stages of AD or related disorders, but struggle when predicting at the asymptomatic or prodromal stage. Moreover, most existing models either lack explainability behind the classification or only use a single modality for the assessment, limiting scope of its analysis. Thus, we propose a multimodal framework that uses early-stage indicators such as imaging, genetics and clinical assessments to classify AD patients into subtypes at early stages. Similarly, we build prompts and use large language models, such as ChatGPT, to interpret the findings of our model. In our framework, we propose a tri-modal co-attention mechanism (Tri-COAT) to explicitly learn the cross-modal feature associations. Our proposed model outperforms baseline models and provides insight into key cross-modal feature associations supported by known biological mechanisms. △ Less

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.08407 [pdf, other]

Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining

Authors: Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C. Kot, Shijian Lu

Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting novel categories from a distinct domain using only limited exemplars. In this paper, we undertake a comprehensive study of CD-FSS and uncover two crucial insights: (i) the necessity of a fine-tuning stage to effectively transfer the learned meta-knowledge across domains, and (ii) the overfitting risk during the naïve fin… ▽ More Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting novel categories from a distinct domain using only limited exemplars. In this paper, we undertake a comprehensive study of CD-FSS and uncover two crucial insights: (i) the necessity of a fine-tuning stage to effectively transfer the learned meta-knowledge across domains, and (ii) the overfitting risk during the naïve fine-tuning due to the scarcity of novel category examples. With these insights, we propose a novel cross-domain fine-tuning strategy that addresses the challenging CD-FSS tasks. We first design Bi-directional Few-shot Prediction (BFP), which establishes support-query correspondence in a bi-directional manner, crafting augmented supervision to reduce the overfitting risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which is a recursive framework to capture the support-query correspondence iteratively, targeting maximal exploitation of supervisory signals from the sparse novel category samples. Extensive empirical evaluations show that our method significantly outperforms the state-of-the-arts (+7.8\%), which verifies that IFA tackles the cross-domain challenges and mitigates the overfitting simultaneously. The code is available at: https://github.com/niejiahao1998/IFA. △ Less

Submitted 13 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted by CVPR 2024

arXiv:2312.12484 [pdf, other]

SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks

Authors: Peishen Yan, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Ningxin Hu, Mohammad R. Haghighat, Haibing Guan

Abstract: Federated Learning (FL) is becoming a popular paradigm for leveraging distributed data and preserving data privacy. However, due to the distributed characteristic, FL systems are vulnerable to Byzantine attacks that compromised clients attack the global model by uploading malicious model updates. With the development of layer-level and parameter-level fine-grained attacks, the attacks' stealthines… ▽ More Federated Learning (FL) is becoming a popular paradigm for leveraging distributed data and preserving data privacy. However, due to the distributed characteristic, FL systems are vulnerable to Byzantine attacks that compromised clients attack the global model by uploading malicious model updates. With the development of layer-level and parameter-level fine-grained attacks, the attacks' stealthiness and effectiveness have been significantly improved. The existing defense mechanisms solely analyze the model-level statistics of individual model updates uploaded by clients to mitigate Byzantine attacks, which are ineffective against fine-grained attacks due to unawareness or overreaction. To address this problem, we propose SkyMask, a new attack-agnostic robust FL system that firstly leverages fine-grained learnable masks to identify malicious model updates at the parameter level. Specifically, the FL server freezes and multiplies the model updates uploaded by clients with the parameter-level masks, and trains the masks over a small clean dataset (i.e., root dataset) to learn the subtle difference between benign and malicious model updates in a high-dimension space. Our extensive experiments involve different models on three public datasets under state-of-the-art (SOTA) attacks, where the results show that SkyMask achieves up to 14% higher testing accuracy compared with SOTA defense strategies under the same attacks and successfully defends against attacks with malicious clients of a high fraction up to 80%. Code is available at https://github.com/KoalaYan/SkyMask. △ Less

Submitted 18 July, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted by ECCV2024

arXiv:2312.11927 [pdf, other]

Empowering Dual-Level Graph Self-Supervised Pretraining with Motif Discovery

Authors: Pengwei Yan, Kaisong Song, Zhuoren Jiang, Yangyang Kang, Tianqianjin Lin, Changlong Sun, Xiaozhong Liu

Abstract: While self-supervised graph pretraining techniques have shown promising results in various domains, their application still experiences challenges of limited topology learning, human knowledge dependency, and incompetent multi-level interactions. To address these issues, we propose a novel solution, Dual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which introduces a unique… ▽ More While self-supervised graph pretraining techniques have shown promising results in various domains, their application still experiences challenges of limited topology learning, human knowledge dependency, and incompetent multi-level interactions. To address these issues, we propose a novel solution, Dual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which introduces a unique dual-level pretraining structure that orchestrates node-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM autonomously uncovers significant graph motifs through an edge pooling module, aligning learned motif similarities with graph kernel-based similarities. A cross-matching task enables sophisticated node-motif interactions and novel representation learning. Extensive experiments on 15 datasets validate DGPM's effectiveness and generalizability, outperforming state-of-the-art methods in unsupervised representation learning and transfer learning settings. The autonomously discovered motifs demonstrate the potential of DGPM to enhance robustness and interpretability. △ Less

Submitted 19 December, 2023; originally announced December 2023.

Comments: 14 pages, 6 figures, accepted by AAAI'24

arXiv:2312.08317 [pdf, other]

Prompt Engineering-assisted Malware Dynamic Analysis Using GPT-4

Authors: Pei Yan, Shunquan Tan, Miaohui Wang, Jiwu Huang

Abstract: Dynamic analysis methods effectively identify shelled, wrapped, or obfuscated malware, thereby preventing them from invading computers. As a significant representation of dynamic malware behavior, the API (Application Programming Interface) sequence, comprised of consecutive API calls, has progressively become the dominant feature of dynamic analysis methods. Though there have been numerous deep l… ▽ More Dynamic analysis methods effectively identify shelled, wrapped, or obfuscated malware, thereby preventing them from invading computers. As a significant representation of dynamic malware behavior, the API (Application Programming Interface) sequence, comprised of consecutive API calls, has progressively become the dominant feature of dynamic analysis methods. Though there have been numerous deep learning models for malware detection based on API sequences, the quality of API call representations produced by those models is limited. These models cannot generate representations for unknown API calls, which weakens both the detection performance and the generalization. Further, the concept drift phenomenon of API calls is prominent. To tackle these issues, we introduce a prompt engineering-assisted malware dynamic analysis using GPT-4. In this method, GPT-4 is employed to create explanatory text for each API call within the API sequence. Afterward, the pre-trained language model BERT is used to obtain the representation of the text, from which we derive the representation of the API sequence. Theoretically, this proposed method is capable of generating representations for all API calls, excluding the necessity for dataset training during the generation process. Utilizing the representation, a CNN-based detection model is designed to extract the feature. We adopt five benchmark datasets to validate the performance of the proposed model. The experimental results reveal that the proposed detection algorithm performs better than the state-of-the-art method (TextCNN). Specifically, in cross-database experiments and few-shot learning experiments, the proposed model achieves excellent detection performance and almost a 100% recall rate for malware, verifying its superior generalization performance. The code is available at: github.com/yan-scnu/Prompted_Dynamic_Detection. △ Less

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.06462 [pdf, other]

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Authors: Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang

Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral… ▽ More Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss according to the inherent rules of temporal. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIou on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/. △ Less

Submitted 7 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: CVPR 2024 Highlight. 13 pages, 10 figures

arXiv:2311.03679 [pdf, other]

Unsupervised convolutional neural network fusion approach for change detection in remote sensing images

Authors: Weidong Yan, Pei Yan, Li Cao

Abstract: With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, so it is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection.… ▽ More With the rapid development of deep learning, a variety of change detection methods based on deep learning have emerged in recent years. However, these methods usually require a large number of training samples to train the network model, so it is very expensive. In this paper, we introduce a completely unsupervised shallow convolutional neural network (USCNN) fusion approach for change detection. Firstly, the bi-temporal images are transformed into different feature spaces by using convolution kernels of different sizes to extract multi-scale information of the images. Secondly, the output features of bi-temporal images at the same convolution kernels are subtracted to obtain the corresponding difference images, and the difference feature images at the same scale are fused into one feature image by using 1 * 1 convolution layer. Finally, the output features of different scales are concatenated and a 1 * 1 convolution layer is used to fuse the multi-scale information of the image. The model parameters are obtained by a redesigned sparse function. Our model has three features: the entire training process is conducted in an unsupervised manner, the network architecture is shallow, and the objective function is sparse. Thus, it can be seen as a kind of lightweight network model. Experimental results on four real remote sensing datasets indicate the feasibility and effectiveness of the proposed approach. △ Less

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.00353 [pdf, other]

LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Authors: Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye Wang, Pengfei Yan

Abstract: Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across generated video frames by image diffusion models. Previous methods typically adopt cross-frame attention, \emph{i.e.,} sharing the \textit{key} and \textit{value} tokens across attentions of different frames, to enc… ▽ More Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across generated video frames by image diffusion models. Previous methods typically adopt cross-frame attention, \emph{i.e.,} sharing the \textit{key} and \textit{value} tokens across attentions of different frames, to encourage the temporal consistency. However, in those works, temporal inconsistency issue may not be thoroughly solved, rendering the fidelity of generated videos limited.%The current state of the art cross-frame attention method aims at maintaining fine-grained visual details across frames, but it is still challenged by the temporal coherence problem. In this paper, we find the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named \textit{LatentWarp}. Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space to constrain the query tokens. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of last frame to align with the current frame during the denoising process. As a result, the corresponding regions across the adjacent frames can share closely-related query tokens and attention outputs, which can further improve latent-level consistency to enhance visual temporal coherence of generated videos. Extensive experiment results demonstrate the superiority of \textit{LatentWarp} in achieving video-to-video translation with temporal coherence. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2309.01207 [pdf, other]

Spectral Adversarial MixUp for Few-Shot Unsupervised Domain Adaptation

Authors: Jiajin Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, Pingkun Yan

Abstract: Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model train… ▽ More Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model training. In this paper, we propose a novel method for Few-Shot Unsupervised Domain Adaptation (FSUDA), where only a limited number of unlabeled target domain samples are available for training. To accomplish this challenging task, first, a spectral sensitivity map is introduced to characterize the generalization weaknesses of models in the frequency domain. We then developed a Sensitivity-guided Spectral Adversarial MixUp (SAMix) method to generate target-style images to effectively suppresses the model sensitivity, which leads to improved model generalizability in the target domain. We demonstrated the proposed method and rigorously evaluated its performance on multiple tasks using several public datasets. △ Less

Submitted 3 September, 2023; originally announced September 2023.

Comments: Accepted by MICCAI 2023

arXiv:2308.01971 [pdf, other]

SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

Authors: Saleem Ahmed, Pengyu Yan, David Doermann, Srirangaraj Setlur, Venu Govindaraju

Abstract: We introduce a novel bottom-up approach for the extraction of chart data. Our model utilizes images of charts as inputs and learns to detect keypoints (KP), which are used to reconstruct the components within the plot area. Our novelty lies in detecting a fusion of continuous and discrete KP as predicted heatmaps. A combination of sparse and dense per-pixel objectives coupled with a uni-modal self… ▽ More We introduce a novel bottom-up approach for the extraction of chart data. Our model utilizes images of charts as inputs and learns to detect keypoints (KP), which are used to reconstruct the components within the plot area. Our novelty lies in detecting a fusion of continuous and discrete KP as predicted heatmaps. A combination of sparse and dense per-pixel objectives coupled with a uni-modal self-attention-based feature-fusion layer is applied to learn KP embeddings. Further leveraging deep metric learning for unsupervised clustering, allows us to segment the chart plot area into various objects. By further matching the chart components to the legend, we are able to obtain the data series names. A post-processing threshold is applied to the KP embeddings to refine the object reconstructions and improve accuracy. Our extensive experiments include an evaluation of different modules for KP estimation and the combination of deep layer aggregation and corner pooling approaches. The results of our experiments provide extensive evaluation for the task of real-world chart data extraction. △ Less

Submitted 3 August, 2023; originally announced August 2023.

Comments: Accepted ORAL at ICDAR 23

arXiv:2307.14634 [pdf, other]

Fact-Checking of AI-Generated Reports

Authors: Razi Mahmood, Ge Wang, Mannudeep Kalra, Pingkun Yan

Abstract: With advances in generative artificial intelligence (AI), it is now possible to produce realistic-looking automated reports for preliminary reads of radiology images. This can expedite clinical workflows, improve accuracy and reduce overall costs. However, it is also well-known that such models often hallucinate, leading to false findings in the generated reports. In this paper, we propose a new m… ▽ More With advances in generative artificial intelligence (AI), it is now possible to produce realistic-looking automated reports for preliminary reads of radiology images. This can expedite clinical workflows, improve accuracy and reduce overall costs. However, it is also well-known that such models often hallucinate, leading to false findings in the generated reports. In this paper, we propose a new method of fact-checking of AI-generated reports using their associated images. Specifically, the developed examiner differentiates real and fake sentences in reports by learning the association between an image and sentences describing real or potentially fake findings. To train such an examiner, we first created a new dataset of fake reports by perturbing the findings in the original ground truth radiology reports associated with images. Text encodings of real and fake sentences drawn from these reports are then paired with image encodings to learn the mapping to real/fake labels. The utility of such an examiner is demonstrated for verifying automatically generated reports by detecting and removing fake sentences. Future generative AI approaches can use the resulting tool to validate their reports leading to a more responsible use of AI in expediting clinical workflows. △ Less

Submitted 27 July, 2023; originally announced July 2023.

Comments: 10 pages, 3 figures, 3 tables

arXiv:2307.14039 [pdf, other]

Controllable Guide-Space for Generalizable Face Forgery Detection

Authors: Ying Guo, Cheng Zhen, Pengfei Yan

Abstract: Recent studies on face forgery detection have shown satisfactory performance for methods involved in training datasets, but are not ideal enough for unknown domains. This motivates many works to improve the generalization, but forgery-irrelevant information, such as image background and identity, still exists in different domain features and causes unexpected clustering, limiting the generalizatio… ▽ More Recent studies on face forgery detection have shown satisfactory performance for methods involved in training datasets, but are not ideal enough for unknown domains. This motivates many works to improve the generalization, but forgery-irrelevant information, such as image background and identity, still exists in different domain features and causes unexpected clustering, limiting the generalization. In this paper, we propose a controllable guide-space (GS) method to enhance the discrimination of different forgery domains, so as to increase the forgery relevance of features and thereby improve the generalization. The well-designed guide-space can simultaneously achieve both the proper separation of forgery domains and the large distance between real-forgery domains in an explicit and controllable manner. Moreover, for better discrimination, we use a decoupling module to weaken the interference of forgery-irrelevant correlations between domains. Furthermore, we make adjustments to the decision boundary manifold according to the clustering degree of the same domain features within the neighborhood. Extensive experiments in multiple in-domain and cross-domain settings confirm that our method can achieve state-of-the-art generalization. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV 2023

arXiv:2307.13693 [pdf, other]

Evaluating Large Language Models for Radiology Natural Language Processing

Authors: Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, Lu Zhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen , et al. (20 additional authors not shown)

Abstract: The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a compreh… ▽ More The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain. △ Less

Submitted 27 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.10954 [pdf, other]

Soft-tissue Driven Craniomaxillofacial Surgical Planning

Authors: Xi Fang, Daeseung Kim, Xuanang Xu, Tianshu Kuang, Nathan Lampen, Jungwook Lee, Hannah H. Deng, Jaime Gateno, Michael A. K. Liebschner, James J. Xia, Pingkun Yan

Abstract: In CMF surgery, the planning of bony movement to achieve a desired facial outcome is a challenging task. Current bone driven approaches focus on normalizing the bone with the expectation that the facial appearance will be corrected accordingly. However, due to the complex non-linear relationship between bony structure and facial soft-tissue, such bone-driven methods are insufficient to correct fac… ▽ More In CMF surgery, the planning of bony movement to achieve a desired facial outcome is a challenging task. Current bone driven approaches focus on normalizing the bone with the expectation that the facial appearance will be corrected accordingly. However, due to the complex non-linear relationship between bony structure and facial soft-tissue, such bone-driven methods are insufficient to correct facial deformities. Despite efforts to simulate facial changes resulting from bony movement, surgical planning still relies on iterative revisions and educated guesses. To address these issues, we propose a soft-tissue driven framework that can automatically create and verify surgical plans. Our framework consists of a bony planner network that estimates the bony movements required to achieve the desired facial outcome and a facial simulator network that can simulate the possible facial changes resulting from the estimated bony movement plans. By combining these two models, we can verify and determine the final bony movement required for planning. The proposed framework was evaluated using a clinical dataset, and our experimental results demonstrate that the soft-tissue driven approach greatly improves the accuracy and efficacy of surgical planning when compared to the conventional bone-driven approach. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: Early accepted by MICCAI 2023

arXiv:2307.05638 [pdf, other]

doi 10.1109/ACCESS.2023.3349132

A Comprehensive Survey of Deep Transfer Learning for Anomaly Detection in Industrial Time Series: Methods, Applications, and Directions

Authors: Peng Yan, Ahmed Abdulkadir, Paul-Philipp Luley, Matthias Rosenthal, Gerrit A. Schatte, Benjamin F. Grewe, Thilo Stadelmann

Abstract: Automating the monitoring of industrial processes has the potential to enhance efficiency and optimize quality by promptly detecting abnormal events and thus facilitating timely interventions. Deep learning, with its capacity to discern non-trivial patterns within large datasets, plays a pivotal role in this process. Standard deep learning methods are suitable to solve a specific task given a spec… ▽ More Automating the monitoring of industrial processes has the potential to enhance efficiency and optimize quality by promptly detecting abnormal events and thus facilitating timely interventions. Deep learning, with its capacity to discern non-trivial patterns within large datasets, plays a pivotal role in this process. Standard deep learning methods are suitable to solve a specific task given a specific type of data. During training, deep learning demands large volumes of labeled data. However, due to the dynamic nature of the industrial processes and environment, it is impractical to acquire large-scale labeled data for standard deep learning training for every slightly different case anew. Deep transfer learning offers a solution to this problem. By leveraging knowledge from related tasks and accounting for variations in data distributions, the transfer learning framework solves new tasks with little or even no additional labeled data. The approach bypasses the need to retrain a model from scratch for every new setup and dramatically reduces the labeled data requirement. This survey first provides an in-depth review of deep transfer learning, examining the problem settings of transfer learning and classifying the prevailing deep transfer learning methods. Moreover, we delve into applications of deep transfer learning in the context of a broad spectrum of time series anomaly detection tasks prevalent in primary industrial domains, e.g., manufacturing process monitoring, predictive maintenance, energy management, and infrastructure facility monitoring. We discuss the challenges and limitations of deep transfer learning in industrial contexts and conclude the survey with practical directions and actionable suggestions to address the need to leverage diverse time series data for anomaly detection in an increasingly dynamic production environment. △ Less

Submitted 10 January, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: 27 pages, 8 figures, 2 tables, published in IEEE Acess

ACM Class: I.2.0; I.2.4

Journal ref: IEEE Acess 12 (2024) 3768-3789

arXiv:2306.05480 [pdf, other]

Artificial General Intelligence for Medical Imaging

Authors: Xiang Li, Lu Zhang, Zihao Wu, Zhengliang Liu, Lin Zhao, Yixuan Yuan, Jun Liu, Gang Li, Dajiang Zhu, Pingkun Yan, Quanzheng Li, Wei Liu, Tianming Liu, Dinggang Shen

Abstract: In this review, we explore the potential applications of Artificial General Intelligence (AGI) models in healthcare, focusing on foundational Large Language Models (LLMs), Large Vision Models, and Large Multimodal Models. We emphasize the importance of integrating clinical expertise, domain knowledge, and multimodal capabilities into AGI models. In addition, we lay out key roadmaps that guide the… ▽ More In this review, we explore the potential applications of Artificial General Intelligence (AGI) models in healthcare, focusing on foundational Large Language Models (LLMs), Large Vision Models, and Large Multimodal Models. We emphasize the importance of integrating clinical expertise, domain knowledge, and multimodal capabilities into AGI models. In addition, we lay out key roadmaps that guide the development and deployment of healthcare AGI models. Throughout the review, we provide critical perspectives on the potential challenges and pitfalls associated with deploying large-scale AGI models in the medical field. This comprehensive review aims to offer insights into the future implications of AGI in medical imaging, healthcare and beyond. △ Less

Submitted 2 July, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.03570 [pdf, other]

Personalization Disentanglement for Federated Learning: An explainable perspective

Authors: Peng Yan, Guodong Long

Abstract: Personalized federated learning (PFL) jointly trains a variety of local models through balancing between knowledge sharing across clients and model personalization per client. This paper addresses PFL via explicit disentangling latent representations into two parts to capture the shared knowledge and client-specific personalization, which leads to more reliable and effective PFL. The disentangleme… ▽ More Personalized federated learning (PFL) jointly trains a variety of local models through balancing between knowledge sharing across clients and model personalization per client. This paper addresses PFL via explicit disentangling latent representations into two parts to capture the shared knowledge and client-specific personalization, which leads to more reliable and effective PFL. The disentanglement is achieved by a novel Federated Dual Variational Autoencoder (FedDVA), which employs two encoders to infer the two types of representations. FedDVA can produce a better understanding of the trade-off between global knowledge sharing and local personalization in PFL. Moreover, it can be integrated with existing FL methods and turn them into personalized models for heterogeneous downstream tasks. Extensive experiments validate the advantages caused by disentanglement and show that models trained with disentangled representations substantially outperform those vanilla methods. △ Less

Submitted 13 July, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.12650 [pdf, other]

When Federated Recommendation Meets Cold-Start Problem: Separating Item Attributes and User Interactions

Authors: Chunxu Zhang, Guodong Long, Tianyi Zhou, Zijian Zhang, Peng Yan, Bo Yang

Abstract: Federated recommendation system usually trains a global model on the server without direct access to users' private data on their own devices. However, this separation of the recommendation model and users' private data poses a challenge in providing quality service, particularly when it comes to new items, namely cold-start recommendations in federated settings. This paper introduces a novel meth… ▽ More Federated recommendation system usually trains a global model on the server without direct access to users' private data on their own devices. However, this separation of the recommendation model and users' private data poses a challenge in providing quality service, particularly when it comes to new items, namely cold-start recommendations in federated settings. This paper introduces a novel method called Item-aligned Federated Aggregation (IFedRec) to address this challenge. It is the first research work in federated recommendation to specifically study the cold-start scenario. The proposed method learns two sets of item representations by leveraging item attributes and interaction records simultaneously. Additionally, an item representation alignment mechanism is designed to align two item representations and learn the meta attribute network at the server within a federated learning framework. Experiments on four benchmark datasets demonstrate IFedRec's superior performance for cold-start scenarios. Furthermore, we also verify IFedRec owns good robustness when the system faces limited client participation and noise injection, which brings promising practical application potential in privacy-protection enhanced federated recommendation systems. The implementation code is available △ Less

Submitted 24 February, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted as a regular paper of WWW'24

arXiv:2305.07866 [pdf, other]

GPFedRec: Graph-guided Personalization for Federated Recommendation

Authors: Chunxu Zhang, Guodong Long, Tianyi Zhou, Zijjian Zhang, Peng Yan, Bo Yang

Abstract: The federated recommendation system is an emerging AI service architecture that provides recommendation services in a privacy-preserving manner. Using user-relation graphs to enhance federated recommendations is a promising topic. However, it is still an open challenge to construct the user-relation graph while preserving data locality-based privacy protection in federated settings. Inspired by a… ▽ More The federated recommendation system is an emerging AI service architecture that provides recommendation services in a privacy-preserving manner. Using user-relation graphs to enhance federated recommendations is a promising topic. However, it is still an open challenge to construct the user-relation graph while preserving data locality-based privacy protection in federated settings. Inspired by a simple motivation, similar users share a similar vision (embeddings) to the same item set, this paper proposes a novel Graph-guided Personalization for Federated Recommendation (GPFedRec). The proposed method constructs a user-relation graph from user-specific personalized item embeddings at the server without accessing the users' interaction records. The personalized item embedding is locally fine-tuned on each device, and then a user-relation graph will be constructed by measuring the similarity among client-specific item embeddings. Without accessing users' historical interactions, we embody the data locality-based privacy protection of vanilla federated learning. Furthermore, a graph-guided aggregation mechanism is designed to leverage the user-relation graph and federated optimization framework simultaneously. Extensive experiments on five benchmark datasets demonstrate GPFedRec's superior performance. The in-depth study validates that GPFedRec can generally improve existing federated recommendation methods as a plugin while keeping user privacy safe. Code is available to ease reproducibility △ Less

Submitted 18 June, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

Comments: Accepted as a regular paper of KDD'24

arXiv:2305.04151 [pdf, other]

doi 10.1007/978-3-031-41676-7_13

Context-Aware Chart Element Detection

Authors: Pengyu Yan, Saleem Ahmed, David Doermann

Abstract: As a prerequisite of chart data extraction, the accurate detection of chart basic elements is essential and mandatory. In contrast to object detection in the general image domain, chart element detection relies heavily on context information as charts are highly structured data visualization formats. To address this, we propose a novel method CACHED, which stands for Context-Aware Chart Element De… ▽ More As a prerequisite of chart data extraction, the accurate detection of chart basic elements is essential and mandatory. In contrast to object detection in the general image domain, chart element detection relies heavily on context information as charts are highly structured data visualization formats. To address this, we propose a novel method CACHED, which stands for Context-Aware Chart Element Detection, by integrating a local-global context fusion module consisting of visual context enhancement and positional context encoding with the Cascade R-CNN framework. To improve the generalization of our method for broader applicability, we refine the existing chart element categorization and standardized 18 classes for chart basic elements, excluding plot elements. Our CACHED method, with the updated category of chart elements, achieves state-of-the-art performance in our experiments, underscoring the importance of context in chart element detection. Extending our method to the bar plot detection task, we obtain the best result on the PMC test dataset. △ Less

Submitted 8 September, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

Comments: Published in ICDAR 2023. Code and model are available at https://github.com/pengyu965/ChartDete

arXiv:2304.02649 [pdf, other]

Specialty-Oriented Generalist Medical AI for Chest CT Screening

Authors: Chuang Niu, Qing Lyu, Christopher D. Carothers, Parisa Kaviani, Josh Tan, Pingkun Yan, Mannudeep K. Kalra, Christopher T. Whitlow, Ge Wang

Abstract: Modern medical records include a vast amount of multimodal free text clinical data and imaging data from radiology, cardiology, and digital pathology. Fully mining such big data requires multitasking; otherwise, occult but important aspects may be overlooked, adversely affecting clinical management and population healthcare. Despite remarkable successes of AI in individual tasks with single-modal… ▽ More Modern medical records include a vast amount of multimodal free text clinical data and imaging data from radiology, cardiology, and digital pathology. Fully mining such big data requires multitasking; otherwise, occult but important aspects may be overlooked, adversely affecting clinical management and population healthcare. Despite remarkable successes of AI in individual tasks with single-modal data, the progress in developing generalist medical AI remains relatively slow to combine multimodal data for multitasks because of the dual challenges of data curation and model architecture. The data challenge involves querying and curating multimodal structured and unstructured text, alphanumeric, and especially 3D tomographic scans on an individual patient level for real-time decisions and on a scale to estimate population health statistics. The model challenge demands a scalable and adaptable network architecture to integrate multimodal datasets for diverse clinical tasks. Here we propose the first-of-its-kind medical multimodal-multitask foundation model (M3FM) with application in lung cancer screening and related tasks. After we curated a comprehensive multimodal multitask dataset consisting of 49 clinical data types including 163,725 chest CT series and 17 medical tasks involved in LCS, we develop a multimodal question-answering framework as a unified training and inference strategy to synergize multimodal information and perform multiple tasks via free-text prompting. M3FM consistently outperforms the state-of-the-art single-modal task-specific models, identifies multimodal data elements informative for clinical tasks and flexibly adapts to new tasks with a small out-of-distribution dataset. As a specialty-oriented generalist medical AI model, M3FM paves the way for similar breakthroughs in other areas of medicine, closing the gap between specialists and the generalist. △ Less

Submitted 24 April, 2024; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.17225 [pdf, other]

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Authors: Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, Xingang Wang

Abstract: Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods devote to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between v… ▽ More Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods devote to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly in the inference procedure. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, which outperforms the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen class on COCO. △ Less

Submitted 30 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR 2023; camera-ready version

arXiv:2301.08143 [pdf, other]

Dual Personalization on Federated Recommendation

Authors: Chunxu Zhang, Guodong Long, Tianyi Zhou, Peng Yan, Zijian Zhang, Chengqi Zhang, Bo Yang

Abstract: Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions are used to combine distributed recommendation algorithms and privacy-preserving mechanisms. Thus it inherently takes the form of heavyweight models at the server and hinders the deployment of on-device intelligent models to end-u… ▽ More Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions are used to combine distributed recommendation algorithms and privacy-preserving mechanisms. Thus it inherently takes the form of heavyweight models at the server and hinders the deployment of on-device intelligent models to end-users. This paper proposes a novel Personalized Federated Recommendation (PFedRec) framework to learn many user-specific lightweight models to be deployed on smart devices rather than a heavyweight model on a server. Moreover, we propose a new dual personalization mechanism to effectively learn fine-grained personalization on both users and items. The overall learning process is formulated into a unified federated optimization framework. Specifically, unlike previous methods that share exactly the same item embeddings across users in a federated system, dual personalization allows mild finetuning of item embeddings for each user to generate user-specific views for item representations which can be integrated into existing federated recommendation methods to gain improvements immediately. Experiments on multiple benchmark datasets have demonstrated the effectiveness of PFedRec and the dual personalization mechanism. Moreover, we provide visualizations and in-depth analysis of the personalization techniques in item embedding, which shed novel insights on the design of recommender systems in federated settings. The code is available. △ Less

Submitted 13 May, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

Comments: Accepted as a regular paper of IJCAI23

arXiv:2212.00850 [pdf, other]

When Neural Networks Fail to Generalize? A Model Sensitivity Perspective

Authors: Jiajin Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, Pingkun Yan

Abstract: Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario,namely Single Domain Generalization (Single-DG), where only a single source domain is available for training. To tackle this challenge, we first try to understand when neural networks fail to generalize? We empirically… ▽ More Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario,namely Single Domain Generalization (Single-DG), where only a single source domain is available for training. To tackle this challenge, we first try to understand when neural networks fail to generalize? We empirically ascertain a property of a model that correlates strongly with its generalization that we coin as "model sensitivity". Based on our analysis, we propose a novel strategy of Spectral Adversarial Data Augmentation (SADA) to generate augmented images targeted at the highly sensitive frequencies. Models trained with these hard-to-learn samples can effectively suppress the sensitivity in the frequency space, which leads to improved generalization performance. Extensive experiments on multiple public datasets demonstrate the superiority of our approach, which surpasses the state-of-the-art single-DG methods. △ Less

Submitted 1 December, 2022; originally announced December 2022.

Comments: Accepted by AAAI 2023

arXiv:2210.05738 [pdf, other]

Distance Map Supervised Landmark Localization for MR-TRUS Registration

Authors: Xinrui Song, Xuanang Xu, Sheng Xu, Baris Turkbey, Bradford J. Wood, Thomas Sanford, Pingkun Yan

Abstract: In this work, we propose to explicitly use the landmarks of prostate to guide the MR-TRUS image registration. We first train a deep neural network to automatically localize a set of meaningful landmarks, and then directly generate the affine registration matrix from the location of these landmarks. For landmark localization, instead of directly training a network to predict the landmark coordinate… ▽ More In this work, we propose to explicitly use the landmarks of prostate to guide the MR-TRUS image registration. We first train a deep neural network to automatically localize a set of meaningful landmarks, and then directly generate the affine registration matrix from the location of these landmarks. For landmark localization, instead of directly training a network to predict the landmark coordinates, we propose to regress a full-resolution distance map of the landmark, which is demonstrated effective in avoiding statistical bias to unsatisfactory performance and thus improving performance. We then use the predicted landmarks to generate the affine transformation matrix, which outperforms the clinicians' manual rigid registration by a significant margin in terms of TRE. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Comments: Submitted to SPIE Medical Imaging 2023

arXiv:2210.01685 [pdf, other]

doi 10.1007/978-3-031-16449-1_54

Deep Learning-based Facial Appearance Simulation Driven by Surgically Planned Craniomaxillofacial Bony Movement

Authors: Xi Fang, Daeseung Kim, Xuanang Xu, Tianshu Kuang, Hannah H. Deng, Joshua C. Barber, Nathan Lampen, Jaime Gateno, Michael A. K. Liebschner, James J. Xia, Pingkun Yan

Abstract: Simulating facial appearance change following bony movement is a critical step in orthognathic surgical planning for patients with jaw deformities. Conventional biomechanics-based methods such as the finite-element method (FEM) are labor intensive and computationally inefficient. Deep learning-based approaches can be promising alternatives due to their high computational efficiency and strong mode… ▽ More Simulating facial appearance change following bony movement is a critical step in orthognathic surgical planning for patients with jaw deformities. Conventional biomechanics-based methods such as the finite-element method (FEM) are labor intensive and computationally inefficient. Deep learning-based approaches can be promising alternatives due to their high computational efficiency and strong modeling capability. However, the existing deep learning-based method ignores the physical correspondence between facial soft tissue and bony segments and thus is significantly less accurate compared to FEM. In this work, we propose an Attentive Correspondence assisted Movement Transformation network (ACMT-Net) to estimate the facial appearance by transforming the bony movement to facial soft tissue through a point-to-point attentive correspondence matrix. Experimental results on patients with jaw deformity show that our proposed method can achieve comparable facial change prediction accuracy compared with the state-of-the-art FEM-based approach with significantly improved computational efficiency. △ Less

Submitted 4 October, 2022; originally announced October 2022.

Comments: MICCAI 2022 Young Scientist Publication Award

arXiv:2208.06127 [pdf, other]

An investigation on selecting audio pre-trained models for audio captioning

Authors: Peiran Yan, Shengchen Li

Abstract: Audio captioning is a task that generates description of audio based on content. Pre-trained models are widely used in audio captioning due to high complexity. Unless a comprehensive system is re-trained, it is hard to determine how well pre-trained models contribute to audio captioning system. To prevent the time consuming and energy consuming process of retraining, it is necessary to propose a p… ▽ More Audio captioning is a task that generates description of audio based on content. Pre-trained models are widely used in audio captioning due to high complexity. Unless a comprehensive system is re-trained, it is hard to determine how well pre-trained models contribute to audio captioning system. To prevent the time consuming and energy consuming process of retraining, it is necessary to propose a preditor of performance for the pre-trained model in audio captioning. In this paper, a series of pre-trained models are investigated for the correlation between extracted audio features and the performance of audio captioning. A couple of predictor is proposed based on the experiment results.The result demonstrates that the kurtosis and skewness of audio features extracted may act as an indicator of the performance of audio captioning systems for pre-trained audio due to the high correlation between kurtosis and skewness of audio features and the performance of audio captioning systems. △ Less

Submitted 12 August, 2022; originally announced August 2022.

Comments: 5 pages, 7 figures

arXiv:2208.02343 [pdf]

Improvements to enhance robustness of third-order scale-independent WENO-Z schemes

Authors: Qin Li, Xiao Huang, Pan Yan, Guozhuo Tan, Yi Duan, Yancheng You

Abstract: Although there are many improvements to WENO3-Z that target the achievement of optimal order in the occurrence of the first-order critical point (CP1), they mainly address resolution performance, while the robustness of schemes is of less concern and lacks understanding accordingly. In light of our analysis considering the occurrence of critical points within grid intervals, we theoretically prove… ▽ More Although there are many improvements to WENO3-Z that target the achievement of optimal order in the occurrence of the first-order critical point (CP1), they mainly address resolution performance, while the robustness of schemes is of less concern and lacks understanding accordingly. In light of our analysis considering the occurrence of critical points within grid intervals, we theoretically prove that it is impossible for a scale-independent scheme that has the stencil of WENO3-Z to fulfill the above order achievement, and current scale-dependent improvements barely fulfill the job when CP1 occurs at the middle of the grid cell. In order to achieve scale-independent improvements, we devise new smoothness indicators that increase the error order from 2 to 4 when CP1 occurs and perform more stably. Meanwhile, we construct a new global smoothness indicator that increases the error order from 4 to 5 similarly, through which new nonlinear weights with regard to WENO3-Z are derived and new scale-independents improvements, namely WENO-ZES2 and -ZES3, are acquired. Through 1D scalar and Euler tests, as well as 2D computations, in comparison with typical scale-dependent improvement, the following performances of the proposed schemes are demonstrated: The schemes can achieve third-order accuracy at CP1 no matter its location in the stencil, indicate high resolution in resolving flow subtleties, and manifest strong robustness in hypersonic simulations (e.g., the accomplishment of computations on hypersonic half-cylinder flow with Mach numbers reaching 16 and 19, respectively, as well as essentially non-oscillatory solutions of inviscid sharp double cone flow at M=9.59), which contrasts the comparative WENO3-Z improvement. △ Less

Submitted 1 August, 2022; originally announced August 2022.

arXiv:2207.05231 [pdf, other]

Regression Metric Loss: Learning a Semantic Representation Space for Medical Images

Authors: Hanqing Chao, Jiajin Zhang, Pingkun Yan

Abstract: Regression plays an essential role in many medical imaging applications for estimating various clinical risk or measurement scores. While training strategies and loss functions have been studied for the deep neural networks in medical image classification tasks, options for regression tasks are very limited. One of the key challenges is that the high-dimensional feature representation learned by e… ▽ More Regression plays an essential role in many medical imaging applications for estimating various clinical risk or measurement scores. While training strategies and loss functions have been studied for the deep neural networks in medical image classification tasks, options for regression tasks are very limited. One of the key challenges is that the high-dimensional feature representation learned by existing popular loss functions like Mean Squared Error or L1 loss is hard to interpret. In this paper, we propose a novel Regression Metric Loss (RM-Loss), which endows the representation space with the semantic meaning of the label space by finding a representation manifold that is isometric to the label space. Experiments on two regression tasks, i.e. coronary artery calcium score estimation and bone age assessment, show that RM-Loss is superior to the existing popular regression losses on both performance and interpretability. Code is available at https://github.com/DIAL-RPI/Regression-Metric-Loss. △ Less

Submitted 11 July, 2022; originally announced July 2022.

Comments: Accepted by MICCAI2022

arXiv:2206.07156 [pdf, other]

doi 10.1109/TMI.2023.3270140

Federated Multi-organ Segmentation with Inconsistent Labels

Authors: Xuanang Xu, Hannah H. Deng, Jaime Gateno, Pingkun Yan

Abstract: Federated learning is an emerging paradigm allowing large-scale decentralized learning without sharing data across different data owners, which helps address the concern of data privacy in medical image analysis. However, the requirement for label consistency across clients by the existing methods largely narrows its application scope. In practice, each clinical site may only annotate certain orga… ▽ More Federated learning is an emerging paradigm allowing large-scale decentralized learning without sharing data across different data owners, which helps address the concern of data privacy in medical image analysis. However, the requirement for label consistency across clients by the existing methods largely narrows its application scope. In practice, each clinical site may only annotate certain organs of interest with partial or no overlap with other sites. Incorporating such partially labeled data into a unified federation is an unexplored problem with clinical significance and urgency. This work tackles the challenge by using a novel federated multi-encoding U-Net (Fed-MENU) method for multi-organ segmentation. In our method, a multi-encoding U-Net (MENU-Net) is proposed to extract organ-specific features through different encoding sub-networks. Each sub-network can be seen as an expert of a specific organ and trained for that client. Moreover, to encourage the organ-specific features extracted by different sub-networks to be informative and distinctive, we regularize the training of the MENU-Net by designing an auxiliary generic decoder (AGD). Extensive experiments on six public abdominal CT datasets show that our Fed-MENU method can effectively obtain a federated learning model using the partially labeled datasets with superior performance to other models trained by either localized or centralized learning methods. Source code is publicly available at https://github.com/DIAL-RPI/Fed-MENU. △ Less

Submitted 25 May, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: v1: 10 pages, 5 figures; v2: 14 pages, 5 figures, accepted by IEEE Transactions on Medical Imaging (TMI), published version available at https://doi.org/10.1109/TMI.2023.3270140, source code available at https://github.com/DIAL-RPI/Fed-MENU

arXiv:2205.13117 [pdf, other]

Learn to Cluster Faces via Pairwise Classification

Authors: Junfu Liu, Di Qiu, Pengfei Yan, Xiaolin Wei

Abstract: Face clustering plays an essential role in exploiting massive unlabeled face data. Recently, graph-based face clustering methods are getting popular for their satisfying performances. However, they usually suffer from excessive memory consumption especially on large-scale graphs, and rely on empirical thresholds to determine the connectivities between samples in inference, which restricts their ap… ▽ More Face clustering plays an essential role in exploiting massive unlabeled face data. Recently, graph-based face clustering methods are getting popular for their satisfying performances. However, they usually suffer from excessive memory consumption especially on large-scale graphs, and rely on empirical thresholds to determine the connectivities between samples in inference, which restricts their applications in various real-world scenes. To address such problems, in this paper, we explore face clustering from the pairwise angle. Specifically, we formulate the face clustering task as a pairwise relationship classification task, avoiding the memory-consuming learning on large-scale graphs. The classifier can directly determine the relationship between samples and is enhanced by taking advantage of the contextual information. Moreover, to further facilitate the efficiency of our method, we propose a rank-weighted density to guide the selection of pairs sent to the classifier. Experimental results demonstrate that our method achieves state-of-the-art performances on several public clustering benchmarks at the fastest speed and shows a great advantage in comparison with graph-based clustering methods on memory consumption. △ Less

Submitted 25 May, 2022; originally announced May 2022.

Comments: Accepted by ICCV2021

Showing 1–50 of 98 results for author: Yan, P