-
Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory
Authors:
Yongxin Deng,
Xihe Qiu,
Xiaoyu Tan,
Jing Pan,
Chen Jue,
Zhijun Fang,
Yinghui Xu,
Wei Chu,
Yuan Qi
Abstract:
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical t…
▽ More
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experimentation, we have confirmed the presence of the implicit bias problem in LLMs and demonstrated the effectiveness of our BTBR approach.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.
-
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Authors:
Chen Ju,
Haicheng Wang,
Haozhe Cheng,
Xu Chen,
Zhonghua Zhai,
Weilin Huang,
Jinsong Lan,
Shuai Xiao,
Bo Zheng
Abstract:
Vision-Language Large Models (VLMs) recently become primary backbone of AI, due to the impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede potentials in the real-world scenarios. To achieve acceleration for VLMs, most existing methods focus on the model perspective: pruning, distillation, quantization, but completely overlook the data-perspective…
▽ More
Vision-Language Large Models (VLMs) recently become primary backbone of AI, due to the impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede potentials in the real-world scenarios. To achieve acceleration for VLMs, most existing methods focus on the model perspective: pruning, distillation, quantization, but completely overlook the data-perspective redundancy. To fill the overlook, this paper pioneers the severity of data redundancy, and designs one plug-and-play Turbo module guided by information degree to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two crucial factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates data duplication between sequential tokens; while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with high information degree carry less redundancy and stronger semantics. For VLMs' calculation, Turbo works as a user-friendly plug-in that sorts data referring to information degree, utilizing only top-level ones to save costs. Its advantages are multifaceted, e.g., being generally compatible to various VLMs across understanding and generation, simple use without re-training and trivial engineering efforts. On multiple VLMs benchmarks, we fully experiment to demonstrate the good acceleration of Turbo, under negligible performance drop.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
Authors:
Haozhe Cheng,
Cheng Ju,
Haicheng Wang,
Jinxiang Liu,
Mengting Chen,
Qiang Hu,
Xiaoyun Zhang,
Yanfeng Wang
Abstract:
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) recently gains increasing attention, with the development of vision-language pre-trainings. To enable generalization of arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. Howe…
▽ More
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) recently gains increasing attention, with the development of vision-language pre-trainings. To enable generalization of arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill the research gap, this paper pioneers to evaluate existing methods by simulating multi-level noises of various types, and reveals their poor robustness. To tackle the noisy OVAR task, we further propose one novel DENOISER framework, covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via one decoding process, i.e., propose text candidates, then utilize inter-modal and intra-modal information to vote for the best. At the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternately iterate between generative and discriminative parts for progressive refinements. The denoised text classes help OVAR models classify visual samples more accurately; in return, classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
Cell Variational Information Bottleneck Network
Authors:
Zhonghua Zhai,
Chen Ju,
Jinsong Lan,
Shuai Xiao
Abstract:
In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As l…
▽ More
In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As layers going deeper, the regularization effect will gradually increase, instead of directly adding excessive regular constraints to the output layer of the model as in Deep VIB. Under each VIB cell, the feedforward process learns an independent mean term and an standard deviation term, and predicts the Gaussian distribution based on them. The feedback process is based on reparameterization trick for effective training. This work performs an extensive analysis on MNIST dataset to verify the effectiveness of each VIB cells, and provides an insightful analysis on how the VIB cells affect mutual information. Experiments conducted on CIFAR-10 also prove that our cellVIB is robust against noisy labels during training and against corrupted images during testing. Then, we validate our method on PACS dataset, whose results show that the VIB cells can significantly improve the generalization performance of the basic model. Finally, in a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
△ Less
Submitted 29 March, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
Authors:
Mengting Chen,
Xi Chen,
Zhonghua Zhai,
Chen Ju,
Xuewen Hong,
Jinsong Lan,
Shuai Xiao
Abstract:
This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Different from previous methods, Wear-Any-Way is a customizable solution. Besides generating high-fidelity results, our method supports users to precisely manipulate the wearing style. To achieve this goal, we first construct a strong pipeline for standard virtual try-on, supporting single/multiple garment try-on and…
▽ More
This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Different from previous methods, Wear-Any-Way is a customizable solution. Besides generating high-fidelity results, our method supports users to precisely manipulate the wearing style. To achieve this goal, we first construct a strong pipeline for standard virtual try-on, supporting single/multiple garment try-on and model-to-model settings in complicated scenarios. To make it manipulable, we propose sparse correspondence alignment which involves point-based control to guide the generation for specific locations. With this design, Wear-Any-Way gets state-of-the-art performance for the standard setting and provides a novel interaction form for customizing the wearing style. For instance, it supports users to drag the sleeve to make it rolled up, drag the coat to make it open, and utilize clicks to control the style of tuck, etc. Wear-Any-Way enables more liberated and flexible expressions of the attires, holding profound implications in the fashion industry.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Authors:
Jinxiang Liu,
Yikun Liu,
Fei Zhang,
Chen Ju,
Ya Zhang,
Yanfeng Wang
Abstract:
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories bas…
▽ More
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs, DFs have long temporal distances from the labeled frame, which share semantic-similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
△ Less
Submitted 16 March, 2024;
originally announced March 2024.
-
Heterogeneity-aware Cross-school Electives Recommendation: a Hybrid Federated Approach
Authors:
Chengyi Ju,
Jiannong Cao,
Yu Yang,
Zhen-Qun Yang,
Ho Man Lee
Abstract:
In the era of modern education, addressing cross-school learner diversity is crucial, especially in personalized recommender systems for elective course selection. However, privacy concerns often limit cross-school data sharing, which hinders existing methods' ability to model sparse data and address heterogeneity effectively, ultimately leading to suboptimal recommendations. In response, we propo…
▽ More
In the era of modern education, addressing cross-school learner diversity is crucial, especially in personalized recommender systems for elective course selection. However, privacy concerns often limit cross-school data sharing, which hinders existing methods' ability to model sparse data and address heterogeneity effectively, ultimately leading to suboptimal recommendations. In response, we propose HFRec, a heterogeneity-aware hybrid federated recommender system designed for cross-school elective course recommendations. The proposed model constructs heterogeneous graphs for each school, incorporating various interactions and historical behaviors between students to integrate context and content information. We design an attention mechanism to capture heterogeneity-aware representations. Moreover, under a federated scheme, we train individual school-based models with adaptive learning settings to recommend tailored electives. Our HFRec model demonstrates its effectiveness in providing personalized elective recommendations while maintaining privacy, as it outperforms state-of-the-art models on both open-source and real-world datasets.
△ Less
Submitted 19 February, 2024;
originally announced February 2024.
-
Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models
Authors:
Chenyang Gao,
Brecht Desplanques,
Chelsea J. -T. Ju,
Aman Chadha,
Andreas Stolcke
Abstract:
Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtim…
▽ More
Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
Authors:
Chen Ju,
Haicheng Wang,
Zeqian Li,
Xu Chen,
Zhonghua Zhai,
Weilin Huang,
Shuai Xiao
Abstract:
Vision-Language Large Models (VLMs) have become primary backbone of AI, due to the impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede potentials in real-world scenarios. To achieve acceleration for VLMs, most existing methods focus on the model perspective: pruning, distillation, quantification, but completely overlook the data-perspective redund…
▽ More
Vision-Language Large Models (VLMs) have become primary backbone of AI, due to the impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede potentials in real-world scenarios. To achieve acceleration for VLMs, most existing methods focus on the model perspective: pruning, distillation, quantification, but completely overlook the data-perspective redundancy. To fill the overlook, this paper pioneers the severity of data redundancy, and designs one plug-and-play Turbo module guided by information degree to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two key factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates the data duplication between sequential tokens; while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with high information degree carry less redundancy and stronger semantics. For VLMs' calculation, Turbo works as a user-friendly plug-in that sorts data referring to information degree, utilizing only top-level ones to save costs. Its advantages are multifaceted, e.g., being generally compatible to various VLMs across understanding and generation, simple use without retraining and trivial engineering efforts. On multiple public VLMs benchmarks, we conduct extensive experiments to reveal the gratifying acceleration of Turbo, under negligible performance drop.
△ Less
Submitted 12 December, 2023;
originally announced December 2023.
-
Enhancing Cross-domain Click-Through Rate Prediction via Explicit Feature Augmentation
Authors:
Xu Chen,
Zida Cheng,
Jiangchao Yao,
Chen Ju,
Weilin Huang,
Jinsong Lan,
Xiaoyi Zeng,
Shuai Xiao
Abstract:
Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains such as parameter-sharing that regularizes the model training in target domain. More effectively, recent researchers propose explicit techniq…
▽ More
Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains such as parameter-sharing that regularizes the model training in target domain. More effectively, recent researchers propose explicit techniques to extract user interest knowledge and transfer this knowledge to target domain. However, the proposed method mainly faces two issues: 1) it usually requires a super domain, i.e. an extremely large source domain, to cover most users or items of target domain, and 2) the extracted user interest knowledge is static no matter what the context is in target domain. These limitations motivate us to develop a more flexible and efficient technique to explicitly transfer knowledge. In this work, we propose a cross-domain augmentation network (CDAnet) being able to perform explicit knowledge transfer between two domains. Specifically, CDAnet contains a designed translation network and an augmentation network which are trained sequentially. The translation network computes latent features from two domains and learns meaningful cross-domain knowledge of each input in target domain by using a designed cross-supervised feature translator. Later the augmentation network employs the explicit cross-domain knowledge as augmented information to boost the target domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been conducted online A/B test in image2product retrieval at Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth and a relative 1.26% GMV increase.
△ Less
Submitted 18 February, 2024; v1 submitted 29 November, 2023;
originally announced December 2023.
-
How to Do Machine Learning with Small Data? -- A Review from an Industrial Perspective
Authors:
Ivan Kraljevski,
Yong Chul Ju,
Dmitrij Ivanov,
Constanze Tschöpe,
Matthias Wolff
Abstract:
Artificial intelligence experienced a technological breakthrough in science, industry, and everyday life in the recent few decades. The advancements can be credited to the ever-increasing availability and miniaturization of computational resources that resulted in exponential data growth. However, because of the insufficient amount of data in some cases, employing machine learning in solving compl…
▽ More
Artificial intelligence experienced a technological breakthrough in science, industry, and everyday life in the recent few decades. The advancements can be credited to the ever-increasing availability and miniaturization of computational resources that resulted in exponential data growth. However, because of the insufficient amount of data in some cases, employing machine learning in solving complex tasks is not straightforward or even possible. As a result, machine learning with small data experiences rising importance in data science and application in several fields. The authors focus on interpreting the general term of "small data" and their engineering and industrial application role. They give a brief overview of the most important industrial applications of machine learning and small data. Small data is defined in terms of various characteristics compared to big data, and a machine learning formalism was introduced. Five critical challenges of machine learning with small data in industrial applications are presented: unlabeled data, imbalanced data, missing data, insufficient data, and rare events. Based on those definitions, an overview of the considerations in domain representation and data acquisition is given along with a taxonomy of machine learning approaches in the context of small data.
△ Less
Submitted 13 November, 2023;
originally announced November 2023.
-
Radiology-Llama2: Best-in-Class Large Language Model for Radiology
Authors:
Zhengliang Liu,
Yiwei Li,
Peng Shu,
Aoxiao Zhong,
Longtao Yang,
Chao Ju,
Zihao Wu,
Chong Ma,
Jie Luo,
Cheng Chen,
Sekeun Kim,
Jiang Hu,
Haixing Dai,
Lin Zhao,
Dajiang Zhu,
Jun Liu,
Wei Liu,
Dinggang Shen,
Tianming Liu,
Quanzheng Li,
Xiang Li
Abstract:
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 architecture and further trained on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiological findings. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and Op…
▽ More
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 architecture and further trained on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiological findings. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance compared to other generative language models, with a Rouge-1 score of 0.4834 on MIMIC-CXR and 0.4185 on OpenI. Additional assessments by radiology experts highlight the model's strengths in understandability, coherence, relevance, conciseness, and clinical utility. The work illustrates the potential of localized language models designed and tuned for specialized domains like radiology. When properly evaluated and deployed, such models can transform fields like radiology by automating rote tasks and enhancing human expertise.
△ Less
Submitted 29 August, 2023;
originally announced September 2023.
-
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation
Authors:
Chaofan Ma,
Yuhuan Yang,
Chen Ju,
Fei Zhang,
Ya Zhang,
Yanfeng Wang
Abstract:
Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and c…
▽ More
Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent studies have explored vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names. For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training. However, exceptions often happen when encountering ambiguity for brief or incomplete names, new words that are not present in the pre-trained lexicons, and difficult-to-describe categories for users. To address these issues, this work proposes a novel attribute decomposition-aggregation framework, AttrSeg, inspired by human cognition in understanding new concepts. Specifically, in the decomposition stage, we decouple class names into diverse attribute descriptions to complement semantic contexts from multiple perspectives. Two attribute construction strategies are designed: using large language models for common categories, and involving manually labeling for human-invented categories. In the aggregation stage, we group diverse attributes into an integrated global description, to form a discriminative classifier that distinguishes the target object from others. One hierarchical aggregation architecture is further proposed to achieve multi-level aggregations, leveraging the meticulously designed clustering module. The final results are obtained by computing the similarity between aggregated attributes and images embeddings. To evaluate the effectiveness, we annotate three types of datasets with attribute descriptions, and conduct extensive experiments and ablation studies. The results show the superior performance of attribute decomposition-aggregation.
△ Less
Submitted 5 January, 2024; v1 submitted 31 August, 2023;
originally announced September 2023.
-
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Authors:
Jinxiang Liu,
Chen Ju,
Chaofan Ma,
Yanfeng Wang,
Yu Wang,
Ya Zhang
Abstract:
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in the video frames using audio cues. However, current fusion-based methods have the performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel \textbf{Au}dio-aware query-enhanced \textbf{TR}ansformer (AuTR…
▽ More
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in the video frames using audio cues. However, current fusion-based methods have the performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel \textbf{Au}dio-aware query-enhanced \textbf{TR}ansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on the segmentation of the pinpointed sounding objects based on audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
△ Less
Submitted 24 July, 2023;
originally announced July 2023.
-
Efficient parallel implementation of the multiplicative weight update method for graph-based linear programs
Authors:
Caleb Ju,
Serif Yesil,
Mengyuan Sun,
Chandra Chekuri,
Edgar Solomonik
Abstract:
Positive linear programs (LPs) model many graph and operations research problems. One can solve for a $(1+ε)$-approximation for positive LPs, for any selected $ε$, in polylogarithmic depth and near-linear work via variations of the multiplicative weight update (MWU) method. Despite extensive theoretical work on these algorithms through the decades, their empirical performance is not well understoo…
▽ More
Positive linear programs (LPs) model many graph and operations research problems. One can solve for a $(1+ε)$-approximation for positive LPs, for any selected $ε$, in polylogarithmic depth and near-linear work via variations of the multiplicative weight update (MWU) method. Despite extensive theoretical work on these algorithms through the decades, their empirical performance is not well understood.
In this work, we implement and test an efficient parallel algorithm for solving positive LP relaxations, and apply it to graph problems such as densest subgraph, bipartite matching, vertex cover and dominating set. We accelerate the algorithm via a new step size search heuristic. Our implementation uses sparse linear algebra optimization techniques such as fusion of vector operations and use of sparse format. Furthermore, we devise an implicit representation for graph incidence constraints. We demonstrate the parallel scalability with the use of threading OpenMP and MPI on the Stampede2 supercomputer. We compare this implementation with exact libraries and specialized libraries for the above problems in order to evaluate MWU's practical standing for both accuracy and performance among other methods. Our results show this implementation is faster than general purpose LP solvers (IBM CPLEX, Gurobi) in all of our experiments, and in some instances, outperforms state-of-the-art specialized parallel graph algorithms.
△ Less
Submitted 12 February, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Multi-Modal Prototypes for Open-World Semantic Segmentation
Authors:
Yuhuan Yang,
Chaofan Ma,
Chen Ju,
Fei Zhang,
Jiangchao Yao,
Ya Zhang,
Yanfeng Wang
Abstract:
In semantic segmentation, generalizing a visual system to both seen categories and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual aspect or characterizing the informative clues from the textual aspect (e.g., the class names). Nevert…
▽ More
In semantic segmentation, generalizing a visual system to both seen categories and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual aspect or characterizing the informative clues from the textual aspect (e.g., the class names). Nevertheless, both two lines neglect the complementary intrinsic of low-level visual and high-level language information, while the explorations that consider visual and textual modalities as a whole to promote predictions are still limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for open-world semantic segmentation, and build a novel prototype-based segmentation framework to realize this promise. To be specific, unlike the straightforward combination of bi-modal clues, we decompose the high-level language information as multi-aspect prototypes and aggregate the low-level visual information as more semantic prototypes, on basis of which, a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate to promote the prediction. Based on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture. Extensive experiments on both PASCAL-$5^i$ and COCO-$20^i$ datasets show the consistent superiority of the proposed method compared with the previous state-of-the-art approaches, and a range of ablation studies thoroughly dissects each component in our framework both quantitatively and qualitatively that verify their effectiveness.
△ Less
Submitted 11 July, 2024; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Radiology-GPT: A Large Language Model for Radiology
Authors:
Zhengliang Liu,
Aoxiao Zhong,
Yiwei Li,
Longtao Yang,
Chao Ju,
Zihao Wu,
Chong Ma,
Peng Shu,
Cheng Chen,
Sekeun Kim,
Haixing Dai,
Lin Zhao,
Lichao Sun,
Dajiang Zhu,
Jun Liu,
Wei Liu,
Dinggang Shen,
Xiang Li,
Quanzheng Li,
Tianming Liu
Abstract:
We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA. It exhibits significant versatility in radiological diagnosis, research, and communication. This work serves as a catalyst…
▽ More
We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA. It exhibits significant versatility in radiological diagnosis, research, and communication. This work serves as a catalyst for future developments in clinical NLP. The successful implementation of Radiology-GPT is indicative of the potential of localizing generative large language models, specifically tailored for distinctive medical specialties, while ensuring adherence to privacy standards such as HIPAA. The prospect of developing individualized, large-scale language models that cater to specific needs of various hospitals presents a promising direction. The fusion of conversational competence and domain-specific knowledge in these models is set to foster future development in healthcare AI. A demo of Radiology-GPT is available at https://huggingface.co/spaces/allen-eric/radiology-gpt.
△ Less
Submitted 19 March, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Annotation-free Audio-Visual Segmentation
Authors:
Jinxiang Liu,
Yu Wang,
Chen Ju,
Chaofan Ma,
Ya Zhang,
Weidi Xie
Abstract:
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. To tackle the task, it involves a comprehensive consideration of both the data and model aspects. In this paper, first, we initiate a novel pipeline for generating artificial data for the AVS task without extra manual annotations. We leve…
▽ More
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. To tackle the task, it involves a comprehensive consideration of both the data and model aspects. In this paper, first, we initiate a novel pipeline for generating artificial data for the AVS task without extra manual annotations. We leverage existing image segmentation and audio datasets and match the image-mask pairs with its corresponding audio samples using category labels in segmentation datasets, that allows us to effortlessly compose (image, audio, mask) triplets for training AVS models. The pipeline is annotation-free and scalable to cover a large number of categories. Additionally, we introduce a lightweight model SAMA-AVS which adapts the pre-trained segment anything model~(SAM) to the AVS task. By introducing only a small number of trainable parameters with adapters, the proposed model can effectively achieve adequate audio-visual fusion and interaction in the encoding stage with vast majority of parameters fixed. We conduct extensive experiments, and the results show our proposed model remarkably surpasses other competing methods. Moreover, by using the proposed model pretrained with our synthetic data, the performance on real AVSBench data is further improved, achieving 83.17 mIoU on S4 subset and 66.95 mIoU on MS3 set. The project page is https://jinxiang-liu.github.io/anno-free-AVS/.
△ Less
Submitted 7 October, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Category-Oriented Representation Learning for Image to Multi-Modal Retrieval
Authors:
Zida Cheng,
Chen Ju,
Shuai Xiao,
Xu Chen,
Zhonghua Zhai,
Xiaoyi Zeng,
Weilin Huang,
Junchi Yan
Abstract:
The rise of multi-modal search requests from users has highlighted the importance of multi-modal retrieval (i.e. image-to-text or text-to-image retrieval), yet the more complex task of image-to-multi-modal retrieval, crucial for many industry applications, remains under-explored. To address this gap and promote further research, we introduce and define the concept of Image-to-Multi-Modal Retrieval…
▽ More
The rise of multi-modal search requests from users has highlighted the importance of multi-modal retrieval (i.e. image-to-text or text-to-image retrieval), yet the more complex task of image-to-multi-modal retrieval, crucial for many industry applications, remains under-explored. To address this gap and promote further research, we introduce and define the concept of Image-to-Multi-Modal Retrieval (IMMR), a process designed to retrieve rich multi-modal (i.e. image and text) documents based on image queries. We focus on representation learning for IMMR and analyze three key challenges for it: 1) skewed data and noisy label in real-world industrial data, 2) the information-inequality between image and text modality of documents when learning representations, 3) effective and efficient training in large-scale industrial contexts. To tackle the above challenges, we propose a novel framework named organizing categories and learning by classification for retrieval (OCLEAR). It consists of three components: 1) a novel category-oriented data governance scheme coupled with a large-scale classification-based learning paradigm, which handles the skewed and noisy data from a data perspective. 2) model architecture specially designed for multi-modal learning, where information-inequality between image and text modality of documents is considered for modality fusion. 3) a hybrid parallel training approach for tackling large-scale training in industrial scenario. The proposed framework achieves SOTA performance on public datasets and has been deployed in a real-world industrial e-commence system, leading to significant business growth. Code will be made publicly available.
△ Less
Submitted 9 June, 2024; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Multi-modal Prompting for Low-Shot Temporal Action Localization
Authors:
Chen Ju,
Zeqian Li,
Peisen Zhao,
Ya Zhang,
Xiaopeng Zhang,
Qi Tian,
Yanfeng Wang,
Weidi Xie
Abstract:
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying the action instances from arbitrary categories within some untrimmed videos, even not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-voc…
▽ More
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying the action instances from arbitrary categories within some untrimmed videos, even not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flows, RGB and texts, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoid lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models), or visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, outperforming existing state-of-the-art approaches by one significant margin.
△ Less
Submitted 21 March, 2023;
originally announced March 2023.
-
DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery
Authors:
Chaofan Ma,
Yuhuan Yang,
Chen Ju,
Fei Zhang,
Jinxiang Liu,
Yu Wang,
Ya Zhang,
Yanfeng Wang
Abstract:
Learning from a large corpus of data, pre-trained models have achieved impressive progress nowadays. As popular generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledgeable diffusion models for mainstream discriminative tasks, i.e., unsupervised object discovery: saliency segmentation an…
▽ More
Learning from a large corpus of data, pre-trained models have achieved impressive progress nowadays. As popular generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledgeable diffusion models for mainstream discriminative tasks, i.e., unsupervised object discovery: saliency segmentation and object localization. However, the challenges exist as there is one structural difference between generative and discriminative models, which limits the direct use. Besides, the lack of explicitly labeled data significantly limits performance in unsupervised settings. To tackle these issues, we introduce DiffusionSeg, one novel synthesis-exploitation framework containing two-stage strategies. To alleviate data insufficiency, we synthesize abundant images, and propose a novel training-free AttentionCut to obtain masks in the first synthesis stage. In the second exploitation stage, to bridge the structural gap, we use the inversion technique, to map the given image back to diffusion features. These features can be directly used by downstream architectures. Extensive experiments and ablation studies demonstrate the superiority of adapting diffusion for unsupervised object discovery.
△ Less
Submitted 17 March, 2023;
originally announced March 2023.
-
Constraint and Union for Partially-Supervised Temporal Sentence Grounding
Authors:
Chen Ju,
Haicheng Wang,
Jinxiang Liu,
Chaofan Ma,
Ya Zhang,
Peisen Zhao,
Jianlong Chang,
Qi Tian
Abstract:
Temporal sentence grounding aims to detect the event timestamps described by the natural language query from given untrimmed videos. The existing fully-supervised setting achieves great performance but requires expensive annotation costs; while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance with less annotation cost, this paper introduces an inter…
▽ More
Temporal sentence grounding aims to detect the event timestamps described by the natural language query from given untrimmed videos. The existing fully-supervised setting achieves great performance but requires expensive annotation costs; while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance with less annotation cost, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip or even single-frame labels are available during training. To take full advantage of partial labels, we propose a novel quadruple constraint pipeline to comprehensively shape event-query aligned representations, covering intra- and inter-samples, uni- and multi-modalities. The former raises intra-cluster compactness and inter-cluster separability; while the latter enables event-background separation and event-query gather. To achieve more powerful performance with explicit grounding optimization, we further introduce a partial-full union framework, i.e., bridging with an additional fully-supervised branch, to enjoy its impressive grounding bonus, and be robust to partial annotations. Extensive experiments and ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision and our superior performance.
△ Less
Submitted 20 February, 2023;
originally announced February 2023.
-
Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization
Authors:
Chen Ju,
Kunhao Zheng,
Jinxiang Liu,
Peisen Zhao,
Ya Zhang,
Jianlong Chang,
Yanfeng Wang,
Qi Tian
Abstract:
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from th…
▽ More
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. And as a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Graph Neural Networks on SPD Manifolds for Motor Imagery Classification: A Perspective from the Time-Frequency Analysis
Authors:
Ce Ju,
Cuntai Guan
Abstract:
The motor imagery (MI) classification has been a prominent research topic in brain-computer interfaces based on electroencephalography (EEG). Over the past few decades, the performance of MI-EEG classifiers has seen gradual enhancement. In this study, we amplify the geometric deep learning-based MI-EEG classifiers from the perspective of time-frequency analysis, introducing a new architecture call…
▽ More
The motor imagery (MI) classification has been a prominent research topic in brain-computer interfaces based on electroencephalography (EEG). Over the past few decades, the performance of MI-EEG classifiers has seen gradual enhancement. In this study, we amplify the geometric deep learning-based MI-EEG classifiers from the perspective of time-frequency analysis, introducing a new architecture called Graph-CSPNet. We refer to this category of classifiers as Geometric Classifiers, highlighting their foundation in differential geometry stemming from EEG spatial covariance matrices. Graph-CSPNet utilizes novel manifold-valued graph convolutional techniques to capture the EEG features in the time-frequency domain, offering heightened flexibility in signal segmentation for capturing localized fluctuations. To evaluate the effectiveness of Graph-CSPNet, we employ five commonly-used publicly available MI-EEG datasets, achieving near-optimal classification accuracies in nine out of eleven scenarios. The Python repository can be found at https://github.com/GeometricBCI/Tensor-CSPNet-and-Graph-CSPNet.
△ Less
Submitted 20 August, 2023; v1 submitted 25 October, 2022;
originally announced November 2022.
-
Adversarial Reweighting for Speaker Verification Fairness
Authors:
Minho Jin,
Chelsea J. -T. Ju,
Zeya Chen,
Yi-Chieh Liu,
Jasha Droppo,
Andreas Stolcke
Abstract:
We address performance fairness for speaker verification using the adversarial reweighting (ARW) method. ARW is reformulated for speaker verification with metric learning, and shown to improve results across different subgroups of gender and nationality, without requiring annotation of subgroups in the training data. An adversarial network learns a weight for each training sample in the batch so t…
▽ More
We address performance fairness for speaker verification using the adversarial reweighting (ARW) method. ARW is reformulated for speaker verification with metric learning, and shown to improve results across different subgroups of gender and nationality, without requiring annotation of subgroups in the training data. An adversarial network learns a weight for each training sample in the batch so that the main learner is forced to focus on poorly performing instances. Using a min-max optimization algorithm, this method improves overall speaker verification fairness. We present three different ARWformulations: accumulated pairwise similarity, pseudo-labeling, and pairwise weighting, and measure their performance in terms of equal error rate (EER) on the VoxCeleb corpus. Results show that the pairwise weighting method can achieve 1.08% overall EER, 1.25% for male and 0.67% for female speakers, with relative EER reductions of 7.7%, 10.1% and 3.0%, respectively. For nationality subgroups, the proposed algorithm showed 1.04% EER for US speakers, 0.76% for UK speakers, and 1.22% for all others. The absolute EER gap between gender groups was reduced from 0.70% to 0.58%, while the standard deviation over nationality groups decreased from 0.21 to 0.19.
△ Less
Submitted 15 July, 2022;
originally announced July 2022.
-
A Support Vector Model of Pruning Trees Evaluation Based on OTSU Algorithm
Authors:
Yuefei Chen,
Xinli Zheng,
Chunhua Ju,
Fuguang Bao
Abstract:
The tree pruning process is the key to promoting fruits' growth and improving their productions due to effects on the photosynthesis efficiency of fruits and nutrition transportation in branches. Currently, pruning is still highly dependent on human labor. The workers' experience will strongly affect the robustness of the performance of the tree pruning. Thus, it is a challenge for workers and far…
▽ More
The tree pruning process is the key to promoting fruits' growth and improving their productions due to effects on the photosynthesis efficiency of fruits and nutrition transportation in branches. Currently, pruning is still highly dependent on human labor. The workers' experience will strongly affect the robustness of the performance of the tree pruning. Thus, it is a challenge for workers and farmers to evaluate the pruning performance. Intended for a better solution to the problem, this paper presents a novel pruning classification strategy model called "OTSU-SVM" to evaluate the pruning performance based on the shadows of branches and leaves. This model considers not only the available illuminated area of the tree but also the uniformity of the illuminated area of the tree. More importantly, our group implements OTSU algorithm into the model, which highly reinforces robustness of the evaluation of this model. In addition, the data from the pear trees in the Yuhang District, Hangzhou is also used in the experiment. In this experiment, we prove that the OTSU-SVM has good accuracy with 80% and high performance in the evaluation of the pruning for the pear trees. It can provide more successful pruning if applied into the orchard. A successful pruning can broaden the illuminated area of individual fruit, and increase nutrition transportation from the target branch, dramatically elevating the weights and production of the fruits.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Authors:
Jinxiang Liu,
Chen Ju,
Weidi Xie,
Ya Zhang
Abstract:
We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables to learn useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representat…
▽ More
We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables to learn useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations~({\em transformation invariance}); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied on input video frames~({\em transformation equivariance}). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performances, even competitive with the supervised approach in audio retrieval. This reveals the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications. \textit{All codes will be available}.
△ Less
Submitted 15 August, 2022; v1 submitted 25 June, 2022;
originally announced June 2022.
-
Improving Subgraph Representation Learning via Multi-View Augmentation
Authors:
Yili Shen,
Xiao Liu,
Cheng-Wei Ju,
Jiaxu Yan,
Jun Yi,
Zhou Lin,
Hui Guan
Abstract:
Subgraph representation learning based on Graph Neural Network (GNN) has exhibited broad applications in scientific advancements, such as predictions of molecular structure-property relationships and collective cellular function. In particular, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks. Still, they have rarely been explo…
▽ More
Subgraph representation learning based on Graph Neural Network (GNN) has exhibited broad applications in scientific advancements, such as predictions of molecular structure-property relationships and collective cellular function. In particular, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks. Still, they have rarely been explored in the existing GNN-based subgraph representation learning studies. In this study, we develop a novel multi-view augmentation mechanism to improve subgraph representation learning models and thus the accuracy of downstream prediction tasks. Our augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve highly improved training efficiency, scalability, and accuracy. Benchmark experiments on several real-world biological and physiological datasets demonstrate the superiority of our proposed multi-view augmentation techniques in subgraph representation learning.
△ Less
Submitted 13 October, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Tensor-CSPNet: A Novel Geometric Deep Learning Framework for Motor Imagery Classification
Authors:
Ce Ju,
Cuntai Guan
Abstract:
Deep learning (DL) has been widely investigated in a vast majority of applications in electroencephalography (EEG)-based brain-computer interfaces (BCIs), especially for motor imagery (MI) classification in the past five years. The mainstream DL methodology for the MI-EEG classification exploits the temporospatial patterns of EEG signals using convolutional neural networks (CNNs), which have remar…
▽ More
Deep learning (DL) has been widely investigated in a vast majority of applications in electroencephalography (EEG)-based brain-computer interfaces (BCIs), especially for motor imagery (MI) classification in the past five years. The mainstream DL methodology for the MI-EEG classification exploits the temporospatial patterns of EEG signals using convolutional neural networks (CNNs), which have remarkably succeeded in visual images. However, since the statistical characteristics of visual images depart radically from EEG signals, a natural question arises whether an alternative network architecture exists apart from CNNs. To address this question, we propose a novel geometric deep learning (GDL) framework called Tensor-CSPNet, which characterizes spatial covariance matrices derived from EEG signals on symmetric positive definite (SPD) manifolds and fully captures the temporospatiofrequency patterns using existing deep neural networks on SPD manifolds, integrating with experiences from many successful MI-EEG classifiers to optimize the framework. In the experiments, Tensor-CSPNet attains or slightly outperforms the current state-of-the-art performance on the cross-validation and holdout scenarios in two commonly-used MI-EEG datasets. Moreover, the visualization and interpretability analyses also exhibit the validity of Tensor-CSPNet for the MI-EEG classification. To conclude, in this study, we provide a feasible answer to the question by generalizing the DL methodologies on SPD manifolds, which indicates the start of a specific GDL methodology for the MI-EEG classification.
△ Less
Submitted 23 September, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Deep Optimal Transport for Domain Adaptation on SPD Manifolds
Authors:
Ce Ju,
Cuntai Guan
Abstract:
The machine learning community has shown increasing interest in addressing the domain adaptation problem on symmetric positive definite (SPD) manifolds. This interest is primarily driven by the complexities of neuroimaging data generated from brain signals, which often exhibit shifts in data distribution across recording sessions. These neuroimaging data, represented by signal covariance matrices,…
▽ More
The machine learning community has shown increasing interest in addressing the domain adaptation problem on symmetric positive definite (SPD) manifolds. This interest is primarily driven by the complexities of neuroimaging data generated from brain signals, which often exhibit shifts in data distribution across recording sessions. These neuroimaging data, represented by signal covariance matrices, possess the mathematical properties of symmetry and positive definiteness. However, applying conventional domain adaptation methods is challenging because these mathematical properties can be disrupted when operating on covariance matrices. In this study, we introduce a novel geometric deep learning-based approach utilizing optimal transport on SPD manifolds to manage discrepancies in both marginal and conditional distributions between the source and target domains. We evaluate the effectiveness of this approach in three cross-session brain-computer interface scenarios and provide visualized results for further insights. The GitHub repository of this study can be accessed at https://github.com/GeometricBCI/Deep-Optimal-Transport-for-Domain-Adaptation-on-SPD-Manifolds.
△ Less
Submitted 3 June, 2024; v1 submitted 14 January, 2022;
originally announced January 2022.
-
Prompting Visual-Language Models for Efficient Video Understanding
Authors:
Chen Ju,
Tengda Han,
Kunhao Zheng,
Ya Zhang,
Weidi Xie
Abstract:
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model, and exploit its powerful ability for resource-hungry video understanding tasks, with minimal t…
▽ More
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model, and exploit its powerful ability for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed as continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite optimising significantly fewer parameters.
△ Less
Submitted 15 July, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Implicit Regularization of Bregman Proximal Point Algorithm and Mirror Descent on Separable Data
Authors:
Yan Li,
Caleb Ju,
Ethan X. Fang,
Tuo Zhao
Abstract:
Bregman proximal point algorithm (BPPA) has witnessed emerging machine learning applications, yet its theoretical understanding has been largely unexplored. We study the computational properties of BPPA through learning linear classifiers with separable data, and demonstrate provable algorithmic regularization of BPPA. For any BPPA instantiated with a fixed Bregman divergence, we provide a lower b…
▽ More
Bregman proximal point algorithm (BPPA) has witnessed emerging machine learning applications, yet its theoretical understanding has been largely unexplored. We study the computational properties of BPPA through learning linear classifiers with separable data, and demonstrate provable algorithmic regularization of BPPA. For any BPPA instantiated with a fixed Bregman divergence, we provide a lower bound of the margin obtained by BPPA with respect to an arbitrarily chosen norm. The obtained margin lower bound differs from the maximal margin by a multiplicative factor, which inversely depends on the condition number of the distance-generating function measured in the dual norm. We show that the dependence on the condition number is tight, thus demonstrating the importance of divergence in affecting the quality of the learned classifiers. We then extend our findings to mirror descent, for which we establish similar connections between the margin and Bregman divergence, together with a non-asymptotic analysis. Numerical experiments on both synthetic and real-world datasets are provided to support our theoretical findings. To the best of our knowledge, the aforementioned findings appear to be new in the literature of algorithmic regularization.
△ Less
Submitted 24 August, 2023; v1 submitted 15 August, 2021;
originally announced August 2021.
-
Communication lower bounds for nested bilinear algorithms via rank expansion of Kronecker products
Authors:
Caleb Ju,
Yifan Zhang,
Edgar Solomonik
Abstract:
We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed size subset of columns of a matrix, for each of the three matrices encoding a bilinea…
▽ More
We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed size subset of columns of a matrix, for each of the three matrices encoding a bilinear algorithm. This framework provides lower bounds for a class of dependency directed acyclic graphs (DAGs) corresponding to the execution of a given bilinear algorithm, in contrast to other approaches that yield bounds for specific DAGs. However, our lower bounds only apply to executions that do not compute the same DAG node multiple times. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product derived from lower bounds on the rank expansion of the Kronecker product's operands. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for contraction of partially symmetric tensors.
△ Less
Submitted 28 September, 2023; v1 submitted 20 July, 2021;
originally announced July 2021.
-
Clinical Named Entity Recognition using Contextualized Token Representations
Authors:
Yichao Zhou,
Chelsea Ju,
J. Harry Caufield,
Kevin Shih,
Calvin Chen,
Yizhou Sun,
Kai-Wei Chang,
Peipei Ping,
Wei Wang
Abstract:
The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of side-effect on medications including identification of novel phenomena and human-focused information extraction. Existing approaches…
▽ More
The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of side-effect on medications including identification of novel phenomena and human-focused information extraction. Existing approaches in extracting the entities of interests focus on using static word embeddings to represent each word. However, one word can have different interpretations that depend on the context of the sentences. Evidently, static word embeddings are insufficient to integrate the diverse interpretation of a word. To overcome this challenge, the technique of contextualized word embedding has been introduced to better capture the semantic meaning of each word based on its context. Two of these language models, ELMo and Flair, have been widely used in the field of Natural Language Processing to generate the contextualized word embeddings on domain-generic documents. However, these embeddings are usually too general to capture the proximity among vocabularies of specific domains. To facilitate various downstream applications using clinical case reports (CCRs), we pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair) using the clinical-related corpus from the PubMed Central. Explicit experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
△ Less
Submitted 23 June, 2021;
originally announced June 2021.
-
Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition
Authors:
Ruirui Li,
Chelsea J. -T. Ju,
Zeya Chen,
Hongda Mao,
Oguz Elibol,
Andreas Stolcke
Abstract:
By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of mod…
▽ More
By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution we propose a fusion of embeddings network foenet architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
△ Less
Submitted 18 June, 2021;
originally announced June 2021.
-
Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization
Authors:
Chen Ju,
Peisen Zhao,
Siheng Chen,
Ya Zhang,
Xiaoyun Zhang,
Qi Tian
Abstract:
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level action category labels. Most of previous methods ignore the incompleteness issue of Class Activation Sequences (CAS), suffering from trivial localization results. To solve this issue, we introduce an adaptive mutual supervision framework (AMS) with two branches, where the base branch a…
▽ More
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level action category labels. Most of previous methods ignore the incompleteness issue of Class Activation Sequences (CAS), suffering from trivial localization results. To solve this issue, we introduce an adaptive mutual supervision framework (AMS) with two branches, where the base branch adopts CAS to localize the most discriminative action regions, while the supplementary branch localizes the less discriminative action regions through a novel adaptive sampler. The adaptive sampler dynamically updates the input of the supplementary branch with a sampling weight sequence negatively correlated with the CAS from the base branch, thereby prompting the supplementary branch to localize the action regions underestimated by the base branch. To promote mutual enhancement between these two branches, we construct mutual location supervision. Each branch leverages location pseudo-labels generated from the other branch as localization supervision. By alternately optimizing the two branches in multiple iterations, we progressively complete action regions. Extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate that the proposed AMS method significantly outperforms the state-of-the-art methods.
△ Less
Submitted 6 April, 2021;
originally announced April 2021.
-
Ternary Hashing
Authors:
Chang Liu,
Lixin Fan,
Kam Woh Ng,
Yilun Jin,
Ce Ju,
Tianyu Zhang,
Chee Seng Chan,
Qiang Yang
Abstract:
This paper proposes a novel ternary hash encoding for learning to hash methods, which provides a principled more efficient coding scheme with performances better than those of the state-of-the-art binary hashing counterparts. Two kinds of axiomatic ternary logic, Kleene logic and Łukasiewicz logic are adopted to calculate the Ternary Hamming Distance (THD) for both the learning/encoding and testin…
▽ More
This paper proposes a novel ternary hash encoding for learning to hash methods, which provides a principled more efficient coding scheme with performances better than those of the state-of-the-art binary hashing counterparts. Two kinds of axiomatic ternary logic, Kleene logic and Łukasiewicz logic are adopted to calculate the Ternary Hamming Distance (THD) for both the learning/encoding and testing/querying phases. Our work demonstrates that, with an efficient implementation of ternary logic on standard binary machines, the proposed ternary hashing is compared favorably to the binary hashing methods with consistent improvements of retrieval mean average precision (mAP) ranging from 1\% to 5.9\% as shown in CIFAR10, NUS-WIDE and ImageNet100 datasets.
△ Less
Submitted 19 March, 2021; v1 submitted 16 March, 2021;
originally announced March 2021.
-
Bio-JOIE: Joint Representation Learning of Biological Knowledge Bases
Authors:
Junheng Hao,
Chelsea Ju,
Muhao Chen,
Yizhou Sun,
Carlo Zaniolo,
Wei Wang
Abstract:
The widespread of Coronavirus has led to a worldwide pandemic with a high mortality rate. Currently, the knowledge accumulated from different studies about this virus is very limited. Leveraging a wide-range of biological knowledge, such as gene ontology and protein-protein interaction (PPI) networks from other closely related species presents a vital approach to infer the molecular impact of a ne…
▽ More
The widespread of Coronavirus has led to a worldwide pandemic with a high mortality rate. Currently, the knowledge accumulated from different studies about this virus is very limited. Leveraging a wide-range of biological knowledge, such as gene ontology and protein-protein interaction (PPI) networks from other closely related species presents a vital approach to infer the molecular impact of a new species. In this paper, we propose the transferred multi-relational embedding model Bio-JOIE to capture the knowledge of gene ontology and PPI networks, which demonstrates superb capability in modeling the SARS-CoV-2-human protein interactions. Bio-JOIE jointly trains two model components. The knowledge model encodes the relational facts from the protein and GO domains into separated embedding spaces, using a hierarchy-aware encoding technique employed for the GO terms. On top of that, the transfer model learns a non-linear transformation to transfer the knowledge of PPIs and gene ontology annotations across their embedding spaces. By leveraging only structured knowledge, Bio-JOIE significantly outperforms existing state-of-the-art methods in PPI type prediction on multiple species. Furthermore, we also demonstrate the potential of leveraging the learned representations on clustering proteins with enzymatic function into enzyme commission families. Finally, we show that Bio-JOIE can accurately identify PPIs between the SARS-CoV-2 proteins and human proteins, providing valuable insights for advancing research on this new disease.
△ Less
Submitted 7 March, 2021;
originally announced March 2021.
-
Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses
Authors:
Chen Ju,
Peisen Zhao,
Ya Zhang,
Yanfeng Wang,
Qi Tian
Abstract:
Point-Level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance. Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels. However, such a framework inevitably suffers from a large solution space. This paper attempts to explore the proposal-based prediction paradi…
▽ More
Point-Level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance. Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels. However, such a framework inevitably suffers from a large solution space. This paper attempts to explore the proposal-based prediction paradigm for point-level annotations, which has the advantage of more constrained solution space and consistent predictions among neighboring frames. The point-level annotations are first used as the keypoint supervision to train a keypoint detector. At the location prediction stage, a simple but effective mapper module, which enables back-propagation of training errors, is then introduced to bridge the fully-supervised framework with weak supervision. To our best of knowledge, this is the first work to leverage the fully-supervised paradigm for the point-level setting. Experiments on THUMOS14, BEOID, and GTEA verify the effectiveness of our proposed method both quantitatively and qualitatively, and demonstrate that our method outperforms state-of-the-art methods.
△ Less
Submitted 15 December, 2020;
originally announced December 2020.
-
Rethinking Uncertainty in Deep Learning: Whether and How it Improves Robustness
Authors:
Yilun Jin,
Lixin Fan,
Kam Woh Ng,
Ce Ju,
Qiang Yang
Abstract:
Deep neural networks (DNNs) are known to be prone to adversarial attacks, for which many remedies are proposed. While adversarial training (AT) is regarded as the most robust defense, it suffers from poor performance both on clean examples and under other types of attacks, e.g. attacks with larger perturbations. Meanwhile, regularizers that encourage uncertain outputs, such as entropy maximization…
▽ More
Deep neural networks (DNNs) are known to be prone to adversarial attacks, for which many remedies are proposed. While adversarial training (AT) is regarded as the most robust defense, it suffers from poor performance both on clean examples and under other types of attacks, e.g. attacks with larger perturbations. Meanwhile, regularizers that encourage uncertain outputs, such as entropy maximization (EntM) and label smoothing (LS) can maintain accuracy on clean examples and improve performance under weak attacks, yet their ability to defend against strong attacks is still in doubt. In this paper, we revisit uncertainty promotion regularizers, including EntM and LS, in the field of adversarial learning. We show that EntM and LS alone provide robustness only under small perturbations. Contrarily, we show that uncertainty promotion regularizers complement AT in a principled manner, consistently improving performance on both clean examples and under various attacks, especially attacks with large perturbations. We further analyze how uncertainty promotion regularizers enhance the performance of AT from the perspective of Jacobian matrices $\nabla_X f(X;θ)$, and find out that EntM effectively shrinks the norm of Jacobian matrices and hence promotes robustness.
△ Less
Submitted 26 November, 2020;
originally announced November 2020.
-
Non-local convolutional neural networks (nlcnn) for speaker recognition
Authors:
Haici Yang,
Hongda Mao,
Ruirui Li,
Chelsea J. T. Ju,
Oguz Elibol
Abstract:
Speaker recognition is the process of identifying a speaker based on the voice. The technology has attracted more attention with the recent increase in popularity of smart voice assistants, such as Amazon Alexa. In the past few years, various convolutional neural network (CNN) based speaker recognition algorithms have been proposed and achieved satisfactory performance. However, convolutional oper…
▽ More
Speaker recognition is the process of identifying a speaker based on the voice. The technology has attracted more attention with the recent increase in popularity of smart voice assistants, such as Amazon Alexa. In the past few years, various convolutional neural network (CNN) based speaker recognition algorithms have been proposed and achieved satisfactory performance. However, convolutional operations are building blocks that typically perform on a local neighborhood at a time and thus miss to capture global, long-range interactions at the feature level which are critical for understanding the pattern in a speaker's voice. In this work, we propose to apply Non-local Convolutional Neural Networks (NLCNN) to improve the capability of capturing long-range dependencies at the feature level, therefore improving speaker recognition performance. Specifically, we introduce non-local blocks where the output response of a position is computed as a weighted sum of the input features at all positions. Combining non-local blocks with pre-defined CNN networks, we investigate the effectiveness of NLCNN models. Without extensive tuning, the proposed NLCNN models outperform state-of-the-art speaker recognition algorithms on the public Voxceleb dataset. What's more, we investigate different types of non-local operations applied to the frequency-time domain, time domain, frequency domain and frame-level respectively. Among them, time domain is the most effective one for speaker recognition applications.
△ Less
Submitted 19 May, 2021; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Survey: Geometric Foundations of Data Reduction
Authors:
Ce Ju
Abstract:
This survey is written in summer, 2016. The purpose of this survey is to briefly introduce nonlinear dimensionality reduction (NLDR) in data reduction. The first two NLDR were respectively published in Science in 2000 in which they solve the similar reduction problem of high-dimensional data endowed with the intrinsic nonlinear structure. The intrinsic nonlinear structure is always interpreted as…
▽ More
This survey is written in summer, 2016. The purpose of this survey is to briefly introduce nonlinear dimensionality reduction (NLDR) in data reduction. The first two NLDR were respectively published in Science in 2000 in which they solve the similar reduction problem of high-dimensional data endowed with the intrinsic nonlinear structure. The intrinsic nonlinear structure is always interpreted as a concept in manifolds from geometry and topology in theoretical mathematics by computer scientists and theoretical physicists. In 2001, the concept of Manifold Learning first appears as an NLDR method called Laplacian Eigenmaps. In a typical manifold learning setup, the data set, also called the observation set, is distributed on or near a low dimensional manifold M embedded in RD, which yields that each observation has a D-dimensional representation. The goal of manifold learning is to reduce these observations as a compact lower-dimensional representation based on the geometric information. The reduction procedure is called the spectral manifold learning. In this paper, we derive each spectral manifold learning with the matrix and operator representation, and we then discuss the convergence behavior of each method in a geometric uniform language. Hence, the survey is named Geometric Foundations of Data Reduction.
△ Less
Submitted 20 March, 2022; v1 submitted 16 August, 2020;
originally announced August 2020.
-
Privacy Threats Against Federated Matrix Factorization
Authors:
Dashan Gao,
Ben Tan,
Ce Ju,
Vincent W. Zheng,
Qiang Yang
Abstract:
Matrix Factorization has been very successful in practical recommendation applications and e-commerce. Due to data shortage and stringent regulations, it can be hard to collect sufficient data to build performant recommender systems for a single company. Federated learning provides the possibility to bridge the data silos and build machine learning models without compromising privacy and security.…
▽ More
Matrix Factorization has been very successful in practical recommendation applications and e-commerce. Due to data shortage and stringent regulations, it can be hard to collect sufficient data to build performant recommender systems for a single company. Federated learning provides the possibility to bridge the data silos and build machine learning models without compromising privacy and security. Participants sharing common users or items collaboratively build a model over data from all the participants. There have been some works exploring the application of federated learning to recommender systems and the privacy issues in collaborative filtering systems. However, the privacy threats in federated matrix factorization are not studied. In this paper, we categorize federated matrix factorization into three types based on the partition of feature space and analyze privacy threats against each type of federated matrix factorization model. We also discuss privacy-preserving approaches. As far as we are aware, this is the first study of privacy threats of the matrix factorization method in the federated learning framework.
△ Less
Submitted 3 July, 2020;
originally announced July 2020.
-
Rethinking Privacy Preserving Deep Learning: How to Evaluate and Thwart Privacy Attacks
Authors:
Lixin Fan,
Kam Woh Ng,
Ce Ju,
Tianyu Zhang,
Chang Liu,
Chee Seng Chan,
Qiang Yang
Abstract:
This paper investigates capabilities of Privacy-Preserving Deep Learning (PPDL) mechanisms against various forms of privacy attacks. First, we propose to quantitatively measure the trade-off between model accuracy and privacy losses incurred by reconstruction, tracing and membership attacks. Second, we formulate reconstruction attacks as solving a noisy system of linear equations, and prove that a…
▽ More
This paper investigates capabilities of Privacy-Preserving Deep Learning (PPDL) mechanisms against various forms of privacy attacks. First, we propose to quantitatively measure the trade-off between model accuracy and privacy losses incurred by reconstruction, tracing and membership attacks. Second, we formulate reconstruction attacks as solving a noisy system of linear equations, and prove that attacks are guaranteed to be defeated if condition (2) is unfulfilled. Third, based on theoretical analysis, a novel Secret Polarization Network (SPN) is proposed to thwart privacy attacks, which pose serious challenges to existing PPDL methods. Extensive experiments showed that model accuracies are improved on average by 5-20% compared with baseline mechanisms, in regimes where data privacy are satisfactorily protected.
△ Less
Submitted 23 June, 2020; v1 submitted 20 June, 2020;
originally announced June 2020.
-
Privacy-Preserving Technology to Help Millions of People: Federated Prediction Model for Stroke Prevention
Authors:
Ce Ju,
Ruihui Zhao,
Jichao Sun,
Xiguang Wei,
Bo Zhao,
Yang Liu,
Hongshan Li,
Tianjian Chen,
Xinwei Zhang,
Dashan Gao,
Ben Tan,
Han Yu,
Chuning He,
Yuan Jin
Abstract:
Prevention of stroke with its associated risk factors has been one of the public health priorities worldwide. Emerging artificial intelligence technology is being increasingly adopted to predict stroke. Because of privacy concerns, patient data are stored in distributed electronic health record (EHR) databases, voluminous clinical datasets, which prevent patient data from being aggregated and rest…
▽ More
Prevention of stroke with its associated risk factors has been one of the public health priorities worldwide. Emerging artificial intelligence technology is being increasingly adopted to predict stroke. Because of privacy concerns, patient data are stored in distributed electronic health record (EHR) databases, voluminous clinical datasets, which prevent patient data from being aggregated and restrains AI technology to boost the accuracy of stroke prediction with centralized training data. In this work, our scientists and engineers propose a privacy-preserving scheme to predict the risk of stroke and deploy our federated prediction model on cloud servers. Our system of federated prediction model asynchronously supports any number of client connections and arbitrary local gradient iterations in each communication round. It adopts federated averaging during the model training process, without patient data being taken out of the hospitals during the whole process of model training and forecasting. With the privacy-preserving mechanism, our federated prediction model trains over all the healthcare data from hospitals in a certain city without actual data sharing among them. Therefore, it is not only secure but also more accurate than any single prediction model that trains over the data only from one single hospital. Especially for small hospitals with few confirmed stroke cases, our federated model boosts model performance by 10%~20% in several machine learning metrics. To help stroke experts comprehend the advantage of our prediction system more intuitively, we developed a mobile app that collects the key information of patients' statistics and demonstrates performance comparisons between the federated prediction model and the single prediction model during the federated training process.
△ Less
Submitted 14 December, 2020; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Federated Transfer Learning for EEG Signal Classification
Authors:
Ce Ju,
Dashan Gao,
Ravikiran Mane,
Ben Tan,
Yang Liu,
Cuntai Guan
Abstract:
The success of deep learning (DL) methods in the Brain-Computer Interfaces (BCI) field for classification of electroencephalographic (EEG) recordings has been restricted by the lack of large datasets. Privacy concerns associated with EEG signals limit the possibility of constructing a large EEG-BCI dataset by the conglomeration of multiple small ones for jointly training machine learning models. H…
▽ More
The success of deep learning (DL) methods in the Brain-Computer Interfaces (BCI) field for classification of electroencephalographic (EEG) recordings has been restricted by the lack of large datasets. Privacy concerns associated with EEG signals limit the possibility of constructing a large EEG-BCI dataset by the conglomeration of multiple small ones for jointly training machine learning models. Hence, in this paper, we propose a novel privacy-preserving DL architecture named federated transfer learning (FTL) for EEG classification that is based on the federated learning framework. Working with the single-trial covariance matrix, the proposed architecture extracts common discriminative information from multi-subject EEG data with the help of domain adaptation techniques. We evaluate the performance of the proposed architecture on the PhysioNet dataset for 2-class motor imagery classification. While avoiding the actual data sharing, our FTL approach achieves 2% higher classification accuracy in a subject-adaptive analysis. Also, in the absence of multi-subject data, our architecture provides 6% better accuracy compared to other state-of-the-art DL architectures.
△ Less
Submitted 25 January, 2021; v1 submitted 26 April, 2020;
originally announced April 2020.
-
A Hybrid Systems-based Hierarchical Control Architecture for Heterogeneous Field Robot Teams
Authors:
Chanyoung Ju,
Hyoung Il Son
Abstract:
Field robot systems have recently been applied to a wide range of research fields. Making such systems more automated, advanced, and activated requires cooperation among heterogeneous robots. Classic control theory is inefficient in managing large-scale complex dynamic systems. Therefore, the supervisory control theory based on discrete event system needs to be introduced to overcome this limitati…
▽ More
Field robot systems have recently been applied to a wide range of research fields. Making such systems more automated, advanced, and activated requires cooperation among heterogeneous robots. Classic control theory is inefficient in managing large-scale complex dynamic systems. Therefore, the supervisory control theory based on discrete event system needs to be introduced to overcome this limitation. In this study, we propose a hybrid systems-based hierarchical control architecture through a supervisory control-based high-level controller and a traditional control-based low-level controller. The hybrid systems and its dynamics are modeled through a formal method called hybrid automata, and the behavior specifications expressing the control objectives for cooperation are designed. Additionally, a modular supervisor that is more scalable and maintainable than a centralized supervisory controller was synthesized. The proposed hybrid systems and hierarchical control architecture were implemented, validated, and then evaluated for performance through the physics-based simulator. Experimental results confirmed that the heterogeneous field robot team satisfied the given specifications and presented systematic results, validating the efficiency of the proposed control architecture.
△ Less
Submitted 20 February, 2020;
originally announced February 2020.
-
Extending iLQR method with control delay
Authors:
Cheng Ju,
Yan Qin,
Chunjiang Fu
Abstract:
Iterative linear quadradic regulator(iLQR) has become a benchmark method to deal with nonlinear stochastic optimal control problem. However, it does not apply to delay system. In this paper, we extend the iLQR theory and prove new theorem in case of input signal with fixed delay. Which could be beneficial for machine learning or optimal control application to real time robot or human assistive dev…
▽ More
Iterative linear quadradic regulator(iLQR) has become a benchmark method to deal with nonlinear stochastic optimal control problem. However, it does not apply to delay system. In this paper, we extend the iLQR theory and prove new theorem in case of input signal with fixed delay. Which could be beneficial for machine learning or optimal control application to real time robot or human assistive device.
△ Less
Submitted 15 February, 2020;
originally announced February 2020.
-
Bottom-Up Temporal Action Localization with Mutual Regularization
Authors:
Peisen Zhao,
Lingxi Xie,
Chen Ju,
Ya Zhang,
Yanfeng Wang,
Qi Tian
Abstract:
Recently, temporal action localization (TAL), i.e., finding specific action segments in untrimmed videos, has attracted increasing attentions of the computer vision community. State-of-the-art solutions for TAL involves evaluating the frame-level probabilities of three action-indicating phases, i.e. starting, continuing, and ending; and then post-processing these predictions for the final localiza…
▽ More
Recently, temporal action localization (TAL), i.e., finding specific action segments in untrimmed videos, has attracted increasing attentions of the computer vision community. State-of-the-art solutions for TAL involves evaluating the frame-level probabilities of three action-indicating phases, i.e. starting, continuing, and ending; and then post-processing these predictions for the final localization. This paper delves deep into this mechanism, and argues that existing methods, by modeling these phases as individual classification tasks, ignored the potential temporal constraints between them. This can lead to incorrect and/or inconsistent predictions when some frames of the video input lack sufficient discriminative information. To alleviate this problem, we introduce two regularization terms to mutually regularize the learning procedure: the Intra-phase Consistency (IntraC) regularization is proposed to make the predictions verified inside each phase; and the Inter-phase Consistency (InterC) regularization is proposed to keep consistency between these phases. Jointly optimizing these two terms, the entire framework is aware of these potential constraints during an end-to-end optimization process. Experiments are performed on two popular TAL datasets, THUMOS14 and ActivityNet1.3. Our approach clearly outperforms the baseline both quantitatively and qualitatively. The proposed regularization also generalizes to other TAL methods (e.g., TSA-Net and PGCN). code: https://github.com/PeisenZhao/Bottom-Up-TAL-with-MR
△ Less
Submitted 25 February, 2021; v1 submitted 17 February, 2020;
originally announced February 2020.
-
Derivation and Analysis of Fast Bilinear Algorithms for Convolution
Authors:
Caleb Ju,
Edgar Solomonik
Abstract:
The prevalence of convolution in applications within signal processing, deep neural networks, and numerical solvers has motivated the development of numerous fast convolution algorithms. In many of these problems, convolution is performed on terabytes or petabytes of data, so even constant factors of improvement can significantly reduce the computation time. We leverage the formalism of bilinear a…
▽ More
The prevalence of convolution in applications within signal processing, deep neural networks, and numerical solvers has motivated the development of numerous fast convolution algorithms. In many of these problems, convolution is performed on terabytes or petabytes of data, so even constant factors of improvement can significantly reduce the computation time. We leverage the formalism of bilinear algorithms to describe and analyze all of the most popular approaches. This unified lens permits us to study the relationship between different variants of convolution as well as to derive error bounds and analyze the cost of the various algorithms. We provide new derivations, which predominantly leverage matrix and tensor algebra, to describe the Winograd family of convolution algorithms as well as reductions between 1D and multidimensional convolution. We provide cost and error bounds as well as experimental numerical studies. Our experiments for two of these algorithms, the overlap-add approach and Winograd convolution algorithm with polynomials of degree greater than one, show that fast convolution algorithms can rival the accuracy of the fast Fourier transform (FFT) without using complex arithmetic. These algorithms can be used for convolution problems with multidimensional inputs or for filters larger than size of four, extending the state-of-the-art in Winograd-based convolution algorithms.
△ Less
Submitted 2 July, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.