Search | arXiv e-print repository

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Authors: Jie Yang, Xuesong Niu, Nan Jiang, Ruimao Zhang, Siyuan Huang

Abstract: Existing 3D human object interaction (HOI) datasets and models simply align global descriptions with the long HOI sequence, while lacking a detailed understanding of intermediate states and the transitions between states. In this paper, we argue that fine-grained semantic alignment, which utilizes state-level descriptions, offers a promising paradigm for learning semantically rich HOI representati… ▽ More Existing 3D human object interaction (HOI) datasets and models simply align global descriptions with the long HOI sequence, while lacking a detailed understanding of intermediate states and the transitions between states. In this paper, we argue that fine-grained semantic alignment, which utilizes state-level descriptions, offers a promising paradigm for learning semantically rich HOI representations. To achieve this, we introduce Semantic-HOI, a new dataset comprising over 20K paired HOI states with fine-grained descriptions for each HOI state and the body movements that happen between two consecutive states. Leveraging the proposed dataset, we design three state-level HOI tasks to accomplish fine-grained semantic alignment within the HOI sequence. Additionally, we propose a unified model called F-HOI, designed to leverage multimodal instructions and empower the Multi-modal Large Language Model to efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) It employs a unified task formulation that supports the use of versatile multimodal inputs. (2) It maintains consistency in HOI across 2D, 3D, and linguistic spaces. (3) It utilizes fine-grained textual supervision for direct optimization, avoiding intricate modeling of HOI states. Extensive experiments reveal that F-HOI effectively aligns HOI states with fine-grained semantic descriptions, adeptly tackling understanding, reasoning, generation, and reconstruction tasks. △ Less

Submitted 17 July, 2024; originally announced July 2024.

Comments: ECCV24

arXiv:2407.11163 [pdf, other]

Exact Label Recovery in Euclidean Random Graphs

Authors: Julia Gaudio, Charlie Guan, Xiaochun Niu, Ermin Wei

Abstract: In this paper, we propose a family of label recovery problems on weighted Euclidean random graphs. The vertices of a graph are embedded in $\mathbb{R}^d$ according to a Poisson point process, and are assigned to a discrete community label. Our goal is to infer the vertex labels, given edge weights whose distributions depend on the vertex labels as well as their geometric positions. Our general mod… ▽ More In this paper, we propose a family of label recovery problems on weighted Euclidean random graphs. The vertices of a graph are embedded in $\mathbb{R}^d$ according to a Poisson point process, and are assigned to a discrete community label. Our goal is to infer the vertex labels, given edge weights whose distributions depend on the vertex labels as well as their geometric positions. Our general model provides a geometric extension of popular graph and matrix problems, including submatrix localization and $\mathbb{Z}_2$-synchronization, and includes the Geometric Stochastic Block Model (proposed by Sankararaman and Baccelli) as a special case. We study the fundamental limits of exact recovery of the vertex labels. Under a mild distinctness of distributions assumption, we determine the information-theoretic threshold for exact label recovery, in terms of a Chernoff-Hellinger divergence criterion. Impossibility of recovery below the threshold is proven by a unified analysis using a Cramér lower bound. Achievability above the threshold is proven via an efficient two-phase algorithm, where the first phase computes an almost-exact labeling through a local propagation scheme, while the second phase refines the labels. The information-theoretic threshold is dictated by the performance of the so-called genie estimator, which decodes the label of a single vertex given all the other labels. This shows that our proposed models exhibit the local-to-global amplification phenomenon. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2307.11196

arXiv:2407.07589 [pdf]

MSC-LIO: An MSCKF-Based LiDAR-Inertial Odometry with Same-Plane-Point Tracking

Authors: Tisheng Zhang, Man Yuan, Linfu Wei, Hailiang Tang, Xiaoji Niu

Abstract: The multi-state constraint Kalman filter (MSCKF) has been proven to be more efficient than graph optimization for visual-based odometry while with similar accuracy. However, it has not yet been properly considered and studied for LiDAR-based odometry. In this paper, we propose a novel tightly coupled LiDAR-inertial odometry based on the MSCKF framework, named MSC-LIO. An efficient LiDAR same-plane… ▽ More The multi-state constraint Kalman filter (MSCKF) has been proven to be more efficient than graph optimization for visual-based odometry while with similar accuracy. However, it has not yet been properly considered and studied for LiDAR-based odometry. In this paper, we propose a novel tightly coupled LiDAR-inertial odometry based on the MSCKF framework, named MSC-LIO. An efficient LiDAR same-plane-point (LSPP) tracking method, without explicit feature extraction, is present for frame-to-frame data associations. The tracked LSPPs are employed to build an LSPP measurement model, which constructs a multi-state constraint. Besides, we propose an effective point-velocity-based LiDAR-IMU time-delay (LITD) estimation method, which is derived from the proposed LSPP tracking method. Extensive experiments were conducted on both public and private datasets. The results demonstrate that the proposed MSC-LIO yields higher accuracy and efficiency than the state-of-the-art methods. The ablation experiment results indicate that the data-association efficiency is improved by nearly 3 times using the LSPP tracking method. Besides, the proposed LITD estimation method can effectively and accurately estimate the LITD. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: 9 pages

arXiv:2407.04411 [pdf, other]

Waterfall: Framework for Robust and Scalable Text Watermarking

Authors: Gregory Kang Ruey Lau, Xinyuan Niu, Hieu Dao, Jiangwei Chen, Chuan-Sheng Foo, Bryan Kian Hsiang Low

Abstract: Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are not robust enough against such attacks nor scalable to millions of u… ▽ More Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are not robust enough against such attacks nor scalable to millions of users for practical implementation. In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Waterfall comprises several key innovations, such as being the first to use LLM as paraphrasers for watermarking along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability and scalability. We empirically demonstrate that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods, and also showed how it could be directly applied to the watermarking of code. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.14473 [pdf, other]

Data-Centric AI in the Age of Large Language Models

Authors: Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low

Abstract: This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific… ▽ More This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.12331 [pdf, other]

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

Authors: Weizhi Fei, Xueyan Niu, Guoqing Xie, Yanhua Zhang, Bo Bai, Lei Deng, Wei Han

Abstract: Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introd… ▽ More Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introduce a novel approach that re-imagines information retrieval through dynamic in-context editing, inspired by recent breakthroughs in knowledge editing. By treating lengthy contexts as malleable external knowledge, our method interactively gathers and integrates relevant information, thereby enabling LLMs to perform sophisticated reasoning steps. Experimental results demonstrate that our method effectively empowers context-limited LLMs, such as Llama2, to engage in multi-hop reasoning with improved performance, which outperforms state-of-the-art context window extrapolation methods and even compares favorably to more advanced commercial long-context models. Our interactive method not only enhances reasoning capabilities but also mitigates the associated training and computational costs, making it a pragmatic solution for enhancing LLMs' reasoning within expansive contexts. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.08255 [pdf, other]

M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

Abstract: Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world… ▽ More Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

arXiv:2406.07069 [pdf, other]

Optimal Gait Control for a Tendon-driven Soft Quadruped Robot by Model-based Reinforcement Learning

Authors: Xuezhi Niu, Kaige Tan, Lei Feng

Abstract: This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four Compressible Tendon-driven Soft Actuators (CTSAs). Improving our previous studies of using model-free reinforcement learning for gait control, we employ model-based reinforcement learning (MBRL) to further enhance the performance of the gait controller. Compared to rigid robots, the propos… ▽ More This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four Compressible Tendon-driven Soft Actuators (CTSAs). Improving our previous studies of using model-free reinforcement learning for gait control, we employ model-based reinforcement learning (MBRL) to further enhance the performance of the gait controller. Compared to rigid robots, the proposed soft quadruped robot has better safety, less weight, and a simpler mechanism for fabrication and control. However, the primary challenge lies in developing sophisticated control algorithms to attain optimal gait control for fast and stable locomotion. The research employs a multi-stage methodology, including state space restriction, data-driven model training, and reinforcement learning algorithm development. Compared to benchmark methods, the proposed MBRL algorithm, combined with post-training, significantly improves the efficiency and performance of gait control policies. The developed policy is both robust and adaptable to the robot's deformable morphology. The study concludes by highlighting the practical applicability of these findings in real-world scenarios. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.07065 [pdf, other]

Optimal Gait Design for a Soft Quadruped Robot via Multi-fidelity Bayesian Optimization

Authors: Kaige Tan, Xuezhi Niu, Qinglei Ji, Lei Feng, Martin Törngren

Abstract: This study focuses on the locomotion capability improvement in a tendon-driven soft quadruped robot through an online adaptive learning approach. Leveraging the inverse kinematics model of the soft quadruped robot, we employ a central pattern generator to design a parametric gait pattern, and use Bayesian optimization (BO) to find the optimal parameters. Further, to address the challenges of model… ▽ More This study focuses on the locomotion capability improvement in a tendon-driven soft quadruped robot through an online adaptive learning approach. Leveraging the inverse kinematics model of the soft quadruped robot, we employ a central pattern generator to design a parametric gait pattern, and use Bayesian optimization (BO) to find the optimal parameters. Further, to address the challenges of modeling discrepancies, we implement a multi-fidelity BO approach, combining data from both simulation and physical experiments throughout training and optimization. This strategy enables the adaptive refinement of the gait pattern and ensures a smooth transition from simulation to real-world deployment for the controller. Moreover, we integrate a computational task off-loading architecture by edge computing, which reduces the onboard computational and memory overhead, to improve real-time control performance and facilitate an effective online learning process. The proposed approach successfully achieves optimal walking gait design for physical deployment with high efficiency, effectively addressing challenges related to the reality gap in soft robotics. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2405.15338 [pdf, other]

SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Authors: Xinlei Niu, Jing Zhang, Christian Walder, Charles Patrick Martin

Abstract: We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the gen… ▽ More We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: \url{https://XinleiNIU.github.io/demo-SoundLoCD/}. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.11442 [pdf, other]

Unifying 3D Vision-Language Understanding via Promptable Queries

Authors: Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

Abstract: A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified m… ▽ More A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% ([email protected]), Multi3DRefer by 11.7% ([email protected]), and Scan2Cap by 13.4% ([email protected]). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: Project page: https://pq3d.github.io

arXiv:2405.08707 [pdf, other]

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Authors: Xueyan Niu, Bo Bai, Lei Deng, Wei Han

Abstract: Increasing the size of a Transformer model does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, improved generalization ability occurs as the model memorizes the training samples. We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models. We m… ▽ More Increasing the size of a Transformer model does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, improved generalization ability occurs as the model memorizes the training samples. We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models. We model the behavior of Transformers with associative memories using Hopfield networks, such that each transformer block effectively conducts an approximate nearest-neighbor search. Based on this, we design an energy function analogous to that in the modern continuous Hopfield network which provides an insightful explanation for the attention mechanism. Using the majorization-minimization technique, we construct a global energy function that captures the layered architecture of the Transformer. Under specific conditions, we show that the minimum achievable cross-entropy loss is bounded from below by a constant approximately equal to 1. We substantiate our theoretical results by conducting experiments with GPT-2 on various data sizes, as well as training vanilla Transformers on a dataset of 2M tokens. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.08295 [pdf, other]

SpeechVerse: A Large-scale Generalizable Audio Language Model

Authors: Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Abstract: Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore devel… ▽ More Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks. △ Less

Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Single Column, 13 page

arXiv:2404.15637 [pdf, other]

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

Authors: Xinlei Niu, Jing Zhang, Charles Patrick Martin

Abstract: We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pretrained speaker encoder… ▽ More We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pretrained speaker encoder and optimises style text embeddings to align with the speaker style information through contrastive learning in parallel. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice in various social media platforms. A comprehensive ablation study further validates the effectiveness of our method. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.12713 [pdf, other]

Energy Conserved Failure Detection for NS-IoT Systems

Authors: Guojin Liu, Jianhong Zhou, Hang Su, Biaohong Xiong, Xianhua Niu

Abstract: Nowadays, network slicing (NS) technology has gained widespread adoption within Internet of Things (IoT) systems to meet diverse customized requirements. In the NS based IoT systems, the detection of equipment failures necessitates comprehensive equipment monitoring, which leads to significant resource utilization, particularly within large-scale IoT ecosystems. Thus, the imperative task of reduci… ▽ More Nowadays, network slicing (NS) technology has gained widespread adoption within Internet of Things (IoT) systems to meet diverse customized requirements. In the NS based IoT systems, the detection of equipment failures necessitates comprehensive equipment monitoring, which leads to significant resource utilization, particularly within large-scale IoT ecosystems. Thus, the imperative task of reducing failure rates while optimizing monitoring costs has emerged. In this paper, we propose a monitor application function (MAF) based dynamic dormancy monitoring mechanism for the novel NS-IoT system, which is based on a network data analysis function (NWDAF) framework defined in Rel-17. Within the NS-IoT system, all nodes are organized into groups, and multiple MAFs are deployed to monitor each group of nodes. We also propose a dormancy monitor mechanism to mitigate the monitoring energy consumption by placing the MAFs, which is monitoring non-failure devices, in a dormant state. We propose a reinforcement learning based PPO algorithm to guide the dynamic dormancy of MAFs. Simulation results demonstrate that our dynamic dormancy strategy maximizes energy conservation, while proposed algorithm outperforms alternatives in terms of efficiency and stability. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.04681 [pdf, other]

Computation and Critical Transitions of Rate-Distortion-Perception Functions With Wasserstein Barycenter

Authors: Chunhui Chen, Xueyan Niu, Wenhao Ye, Hao Wu, Bo Bai

Abstract: The information rate-distortion-perception (RDP) function characterizes the three-way trade-off between description rate, average distortion, and perceptual quality measured by discrepancy between probability distributions. We study several variants of the RDP functions through the lens of optimal transport. By transforming the information RDP function into a Wasserstein Barycenter problem, we ide… ▽ More The information rate-distortion-perception (RDP) function characterizes the three-way trade-off between description rate, average distortion, and perceptual quality measured by discrepancy between probability distributions. We study several variants of the RDP functions through the lens of optimal transport. By transforming the information RDP function into a Wasserstein Barycenter problem, we identify the critical transitions when one of the constraints becomes inactive and demonstrate several critical transition properties of the RDP variants. Further, the non-strictly convexity brought by the perceptual constraint can be regularized by an entropy regularization term. We prove that the entropy regularized model converges to the original problem and propose an alternating iteration method based on the Sinkhorn algorithm to numerically solve the regularized optimization problem. Experimental results demonstrate the effectiveness and accuracy of the proposed algorithms. As a practical application of our theory, we incorporate our numerical method into a reverse data hiding problem, where a secret message is imperceptibly embedded into the image with guarantees of the perceptual fidelity. △ Less

Submitted 9 April, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2304.14611. This paper was presented in part at the 2023 IEEE International Symposium on Information Theory

arXiv:2404.01204 [pdf, other]

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

Authors: Chen Yang, Junzhuo Li, Xinyao Niu, Xinrun Du, Songyang Gao, Haoran Zhang, Zhaoliang Chen, Xingwei Qu, Ruibin Yuan, Yizhi Li, Jiaheng Liu, Stephen W. Huang, Shawn Yue, Wenhu Chen, Jie Fu, Ge Zhang

Abstract: Uncovering early-stage metrics that reflect final model performance is one core principle for large-scale pretraining. The existing scaling law demonstrates the power-law correlation between pretraining loss and training flops, which serves as an important indicator of the current training state for large language models. However, this principle only focuses on the model's compression properties o… ▽ More Uncovering early-stage metrics that reflect final model performance is one core principle for large-scale pretraining. The existing scaling law demonstrates the power-law correlation between pretraining loss and training flops, which serves as an important indicator of the current training state for large language models. However, this principle only focuses on the model's compression properties on the training data, resulting in an inconsistency with the ability improvements on the downstream tasks. Some follow-up works attempted to extend the scaling-law to more complex metrics (such as hyperparameters), but still lacked a comprehensive analysis of the dynamic differences among various capabilities during pretraining. To address the aforementioned limitations, this paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints. Through this analysis, we confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes, up to 67 billion parameters. In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints. This initiative offers valuable resources to the research community and facilitates the verification and exploration of LLM pretraining by open-source researchers. Besides, we provide empirical summaries, including performance comparisons of different models and capabilities, and tuition of key metrics for different training phases. Based on these findings, we provide a more user-friendly strategy for evaluating the optimization state, offering guidance for establishing a stable pretraining process. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.05916 [pdf, other]

GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

Authors: Hao Lu, Xuesong Niu, Jiyao Wang, Yin Wang, Qingyong Hu, Jiaqi Tang, Yuting Zhang, Kaishen Yuan, Bin Huang, Zitong Yu, Dengbo He, Shuiguang Deng, Hao Chen, Yingcong Chen, Shiguang Shan

Abstract: Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite its success in language understanding, it is critical to evaluate the performance of downstream tasks for better human-centric applications. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing,… ▽ More Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite its success in language understanding, it is critical to evaluate the performance of downstream tasks for better human-centric applications. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing, spanning from visual affective tasks and reasoning tasks. The results show that \gpt has high accuracy in facial action unit recognition and micro-expression detection while its general facial expression recognition performance is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study and demonstrate the versatility and potential of \gpt for handling advanced tasks in emotion recognition and related fields by integrating with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at https://github.com/EnVision-Research/GPT4Affectivity. △ Less

Submitted 10 April, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.04652 [pdf, other]

Yi: Open Foundation Models by 01.AI

Authors: 01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie , et al. (7 additional authors not shown)

Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU,… ▽ More We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.15159 [pdf, other]

Machine Unlearning of Pre-trained Large Language Models

Authors: Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, Xiang Yue

Abstract: This study investigates the concept of the `right to be forgotten' within the context of large language models (LLMs). We explore machine unlearning as a pivotal solution, with a focus on pre-trained models--a notably under-researched area. Our research delineates a comprehensive framework for machine unlearning in pre-trained LLMs, encompassing a critical analysis of seven diverse unlearning meth… ▽ More This study investigates the concept of the `right to be forgotten' within the context of large language models (LLMs). We explore machine unlearning as a pivotal solution, with a focus on pre-trained models--a notably under-researched area. Our research delineates a comprehensive framework for machine unlearning in pre-trained LLMs, encompassing a critical analysis of seven diverse unlearning methods. Through rigorous evaluation using curated datasets from arXiv, books, and GitHub, we establish a robust benchmark for unlearning performance, demonstrating that these methods are over $10^5$ times more computationally efficient than retraining. Our results show that integrating gradient ascent with gradient descent on in-distribution data improves hyperparameter robustness. We also provide detailed guidelines for efficient hyperparameter tuning in the unlearning process. Our findings advance the discourse on ethical AI practices, offering substantive insights into the mechanics of machine unlearning for pre-trained LLMs and underscoring the potential for responsible AI development. △ Less

Submitted 30 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: ACL 2024 main. Code and data at https://github.com/yaojin17/Unlearning_LLM

arXiv:2402.10171 [pdf, other]

Data Engineering for Scaling Language Models to 128K Context

Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng

Abstract: We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contex… ▽ More We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize \textit{domain balance} and \textit{length upsampling}. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: Code at https://github.com/FranxYao/Long-Context-Data-Engineering

arXiv:2402.08934 [pdf, other]

Extreme Video Compression with Pre-trained Diffusion Models

Authors: Bohan Li, Yiming Liu, Xueyan Niu, Bo Bai, Lei Deng, Deniz Gündüz

Abstract: Diffusion models have achieved remarkable success in generating high quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neural… ▽ More Diffusion models have achieved remarkable success in generating high quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neural compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, considering perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Frechet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: https://github.com/ElesionKyrie/Extreme-Video-Compression-With-Prediction-Using-Pre-trainded-Diffusion-Models- △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.00085 [pdf, ps, other]

doi 10.1109/ACCESS.2024.3376418

Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning

Authors: Xuecheng Niu, Akinori Ito, Takashi Nose

Abstract: Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. How to grasp dialog policy within limited dialog experiences remains an obstacle that makes the agent training process less efficient. In addition, most previous frameworks start training by randomly choosing training samples, which differs from the hu… ▽ More Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. How to grasp dialog policy within limited dialog experiences remains an obstacle that makes the agent training process less efficient. In addition, most previous frameworks start training by randomly choosing training samples, which differs from the human learning method and hurts the efficiency and stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework based on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for SC-DDQ and DDQ, respectively, following two opposite training strategies: classic curriculum learning and its reverse version. Our results show that by introducing scheduled learning and curiosity, the new framework leads to a significant improvement over the DDQ and Deep Q-learning(DQN). Surprisingly, we found that traditional curriculum learning was not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ. To analyze our results, we adopted the entropy of sampled actions to depict action exploration and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance. △ Less

Submitted 20 May, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

Comments: Accepted to IEEE Access

Journal ref: IEEE Access, vol. 12, pp. 46940-46952, 2024

arXiv:2401.11491 [pdf]

BA-LINS: A Frame-to-Frame Bundle Adjustment for LiDAR-Inertial Navigation

Authors: Hailiang Tang, Tisheng Zhang, Liqiang Wang, Man Yuan, Xiaoji Niu

Abstract: Bundle Adjustment (BA) has been proven to improve the accuracy of the LiDAR mapping. However, the BA method has not yet been properly employed in a dead-reckoning navigation system. In this paper, we present a frame-to-frame (F2F) BA for LiDAR-inertial navigation, named BA-LINS. Based on the direct F2F point-cloud association, the same-plane points are associated among the LiDAR keyframes. Hence,… ▽ More Bundle Adjustment (BA) has been proven to improve the accuracy of the LiDAR mapping. However, the BA method has not yet been properly employed in a dead-reckoning navigation system. In this paper, we present a frame-to-frame (F2F) BA for LiDAR-inertial navigation, named BA-LINS. Based on the direct F2F point-cloud association, the same-plane points are associated among the LiDAR keyframes. Hence, the F2F plane-point BA measurement can be constructed using the same-plane points. The LiDAR BA and the inertial measurement unit (IMU)-preintegration measurements are tightly integrated under the framework of factor graph optimization. An effective adaptive covariance estimation algorithm for LiDAR BA measurements is proposed to further improve the accuracy. We conduct exhaustive real-world experiments on public and private datasets to examine the proposed BA-LINS. The results demonstrate that BA-LINS yields superior accuracy to state-of-the-art methods. Compared to the baseline system FF-LINS, the absolute translation accuracy and state-estimation efficiency of BA-LINS are improved by 29.5% and 28.7% on the private dataset, respectively. Besides, the ablation experiment results exhibit that the proposed adaptive covariance estimation algorithm can notably improve the accuracy and robustness of BA-LINS. △ Less

Submitted 10 February, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

Comments: 14 pages, 14 figures

arXiv:2401.09340 [pdf, other]

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Authors: Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and int… ▽ More 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io. △ Less

Submitted 6 March, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: 21 pages

arXiv:2401.05731 [pdf]

On Grobner-Shirshov bases for Markov semirings

Authors: Xiaohui Niu, Wenxi Li, Zhongzhi Wang

Abstract: In order to investigate the relationship between Shannon information measure of random variables, scholars such as Yeung utilized information diagrams to explore the structured representation of information measures, establishing correspondences with sets. However, this method has limitations when studying information measures of five or more random variables. In this paper, we consider employing… ▽ More In order to investigate the relationship between Shannon information measure of random variables, scholars such as Yeung utilized information diagrams to explore the structured representation of information measures, establishing correspondences with sets. However, this method has limitations when studying information measures of five or more random variables. In this paper, we consider employing algebraic methods to study the relationship of information measures of random variables. By introducing a semiring generated by random variables, we establish correspondences between sets and elements of the semiring. Utilizing the Grobner-Shirshov basis, we present the structure of the semiring and its standard form. Furthermore, we delve into the structure of the semiring generated under Markov chain conditions (referred to as Markov semiring), obtaining its standard form. △ Less

Submitted 11 January, 2024; originally announced January 2024.

MSC Class: 16Y60; 16Z10; 94A15

arXiv:2312.11539 [pdf, other]

KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn't Know

Authors: Shangshang Zheng, He Bai, Yizhe Zhang, Yi Su, Xiaochuan Niu, Navdeep Jaitly

Abstract: Measuring the alignment between a Knowledge Graph (KG) and Large Language Models (LLMs) is an effective method to assess the factualness and identify the knowledge blind spots of LLMs. However, this approach encounters two primary challenges including the translation of KGs into natural language and the efficient evaluation of these extensive and complex structures. In this paper, we present KGLen… ▽ More Measuring the alignment between a Knowledge Graph (KG) and Large Language Models (LLMs) is an effective method to assess the factualness and identify the knowledge blind spots of LLMs. However, this approach encounters two primary challenges including the translation of KGs into natural language and the efficient evaluation of these extensive and complex structures. In this paper, we present KGLens--a novel framework aimed at measuring the alignment between KGs and LLMs, and pinpointing the LLMs' knowledge deficiencies relative to KGs. KGLens features a graph-guided question generator for converting KGs into natural language, along with a carefully designed sampling strategy based on parameterized KG structure to expedite KG traversal. We conducted experiments using three domain-specific KGs from Wikidata, which comprise over 19,000 edges, 700 relations, and 21,000 entities. Our analysis across eight LLMs reveals that KGLens not only evaluates the factual accuracy of LLMs more rapidly but also delivers in-depth analyses on topics, temporal dynamics, and relationships. Furthermore, human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving 95.7% of the accuracy rate. △ Less

Submitted 16 February, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.09870 [pdf, other]

CABBA: Compatible Authenticated Bandwidth-efficient Broadcast protocol for ADS-B

Authors: Mikaëla Ngamboé, Xiao Niu, Benoit Joly, Steven P Biegler, Paul Berthier, Rémi Benito, Greg Rice, José M Fernandez, Gabriela Nicolescu

Abstract: The Automatic Dependent Surveillance-Broadcast (ADS-B) is a surveillance technology that becomes mandatory in many airspaces. It improves safety, increases efficiency and reduces air traffic congestion by broadcasting aircraft navigation data. Yet, ADS-B is vulnerable to spoofing attacks as it lacks mechanisms to ensure the integrity and authenticity of the data being supplied. None of the existin… ▽ More The Automatic Dependent Surveillance-Broadcast (ADS-B) is a surveillance technology that becomes mandatory in many airspaces. It improves safety, increases efficiency and reduces air traffic congestion by broadcasting aircraft navigation data. Yet, ADS-B is vulnerable to spoofing attacks as it lacks mechanisms to ensure the integrity and authenticity of the data being supplied. None of the existing cryptographic solutions fully meet the backward compatibility and bandwidth preservation requirements of the standard. Hence, we propose the Compatible Authenticated Bandwidth-efficient Broadcast protocol for ADS-B (CABBA), an improved approach that integrates TESLA, phase-overlay modulation techniques and certificate-based PKI. As a result, entity authentication, data origin authentication, and data integrity are the security services that CABBA offers. To assess compliance with the standard, we designed an SDR-based implementation of CABBA and performed backward compatibility tests on commercial and general aviation (GA) ADS-B in receivers. Besides, we calculated the 1090ES band's activity factor and analyzed the channel occupancy rate according to ITU-R SM.2256-1 recommendation. Also, we performed a bit error rate analysis of CABBA messages. The results suggest that CABBA is backward compatible, does not incur significant communication overhead, and has an error rate that is acceptable for Eb/No values above 14 dB. △ Less

Submitted 12 February, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: The paper has been submitted to IEEE Transactions on Aerospace and Electronic Systems

arXiv:2312.09571 [pdf, other]

Extending Context Window of Large Language Models via Semantic Compression

Authors: Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, Wei Han

Abstract: Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational c… ▽ More Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational costs or requiring fine-tuning. Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.04597 [pdf, other]

TrustFed: A Reliable Federated Learning Framework with Malicious-Attack Resistance

Authors: Hangn Su, Jianhong Zhou, Xianhua Niu, Gang Feng

Abstract: As a key technology in 6G research, federated learning (FL) enables collaborative learning among multiple clients while ensuring individual data privacy. However, malicious attackers among the participating clients can intentionally tamper with the training data or the trained model, compromising the accuracy and trustworthiness of the system. To address this issue, in this paper, we propose a hie… ▽ More As a key technology in 6G research, federated learning (FL) enables collaborative learning among multiple clients while ensuring individual data privacy. However, malicious attackers among the participating clients can intentionally tamper with the training data or the trained model, compromising the accuracy and trustworthiness of the system. To address this issue, in this paper, we propose a hierarchical audit-based FL (HiAudit-FL) framework, with the aim to enhance the reliability and security of the learning process. The hierarchical audit process includes two stages, namely model-audit and parameter-audit. In the model-audit stage, a low-overhead audit method is employed to identify suspicious clients. Subsequently, in the parameter-audit stage, a resource-consuming method is used to detect all malicious clients with higher accuracy among the suspicious ones. Specifically, we execute the model audit method among partial clients for multiple rounds, which is modeled as a partial observation Markov decision process (POMDP) with the aim to enhance the robustness and accountability of the decision-making in complex and uncertain environments. Meanwhile, we formulate the problem of identifying malicious attackers through a multi-round audit as an active sequential hypothesis testing problem and leverage a diffusion model-based AI-Enabled audit selection strategy (ASS) to decide which clients should be audited in each round. To accomplish efficient and effective audit selection, we design a DRL-ASS algorithm by incorporating the ASS in a deep reinforcement learning (DRL) framework. Our simulation results demonstrate that HiAudit-FL can effectively identify and handle potential malicious users accurately, with small system overhead. △ Less

Submitted 6 December, 2023; originally announced December 2023.

Comments: 13 pages, 9figures

arXiv:2312.01809 [pdf]

SE-LIO: Semantics-enhanced Solid-State-LiDAR-Inertial Odometry for Tree-rich Environments

Authors: Tisheng Zhang, Linfu Wei, Hailiang Tang, Liqiang Wang, Man Yuan, Xiaoji Niu

Abstract: In this letter, we propose a semantics-enhanced solid-state-LiDAR-inertial odometry (SE-LIO) in tree-rich environments. Multiple LiDAR frames are first merged and compensated with the inertial navigation system (INS) to increase the point-cloud coverage, thus improving the accuracy of semantic segmentation. The unstructured point clouds, such as tree leaves and dynamic objects, are then removed wi… ▽ More In this letter, we propose a semantics-enhanced solid-state-LiDAR-inertial odometry (SE-LIO) in tree-rich environments. Multiple LiDAR frames are first merged and compensated with the inertial navigation system (INS) to increase the point-cloud coverage, thus improving the accuracy of semantic segmentation. The unstructured point clouds, such as tree leaves and dynamic objects, are then removed with the semantic information. Furthermore, the pole-like point clouds, primarily tree trunks, are modeled as cylinders to improve positioning accuracy. An adaptive piecewise cylinder-fitting method is proposed to accommodate environments with a high prevalence of curved tree trunks. Finally, the iterated error-state Kalman filter (IESKF) is employed for state estimation. Point-to-cylinder and point-to-plane constraints are tightly coupled with the prior constraints provided by the INS to obtain the maximum a posteriori estimation. Targeted experiments are conducted in complex campus and park environments to evaluate the performance of SE-LIO. The proposed methods, including removing the unstructured point clouds and the adaptive cylinder fitting, yield improved accuracy. Specifically, the positioning accuracy of the proposed SE-LIO is improved by 43.1% compared to the plane-based LIO. △ Less

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.14337 [pdf, other]

TVT: Training-Free Vision Transformer Search on Tiny Datasets

Authors: Zimian Wei, Hengyue Pan, Lujun Li, Peijie Dong, Zhiliang Tian, Xin Niu, Dongsheng Li

Abstract: Training-free Vision Transformer (ViT) architecture search is presented to search for a better ViT with zero-cost proxies. While ViTs achieve significant distillation gains from CNN teacher models on small datasets, the current zero-cost proxies in ViTs do not generalize well to the distillation training paradigm according to our experimental observations. In this paper, for the first time, we inv… ▽ More Training-free Vision Transformer (ViT) architecture search is presented to search for a better ViT with zero-cost proxies. While ViTs achieve significant distillation gains from CNN teacher models on small datasets, the current zero-cost proxies in ViTs do not generalize well to the distillation training paradigm according to our experimental observations. In this paper, for the first time, we investigate how to search in a training-free manner with the help of teacher models and devise an effective Training-free ViT (TVT) search framework. Firstly, we observe that the similarity of attention maps between ViT and ConvNet teachers affects distill accuracy notably. Thus, we present a teacher-aware metric conditioned on the feature attention relations between teacher and student. Additionally, TVT employs the L2-Norm of the student's weights as the student-capability metric to improve ranking consistency. Finally, TVT searches for the best ViT for distilling with ConvNet teachers via our teacher-aware metric and student-capability metric, resulting in impressive gains in efficiency and effectiveness. Extensive experiments on various tiny datasets and search spaces show that our TVT outperforms state-of-the-art training-free search methods. The code will be released. △ Less

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.00697 [pdf, other]

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Authors: Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico

Abstract: Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combin… ▽ More Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems, show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation

arXiv:2310.03748 [pdf]

Phase Synchrony Component Self-Organization in Brain Computer Interface

Authors: Xu Niu, Na Lu, Huan Luo, Ruofan Yan

Abstract: Phase synchrony information plays a crucial role in analyzing functional brain connectivity and identifying brain activities. A widely adopted feature extraction pipeline, composed of preprocessing, selection of EEG acquisition channels, and phase locking value (PLV) calculation, has achieved success in motor imagery classification (MI). However, this pipeline is manual and reliant on expert knowl… ▽ More Phase synchrony information plays a crucial role in analyzing functional brain connectivity and identifying brain activities. A widely adopted feature extraction pipeline, composed of preprocessing, selection of EEG acquisition channels, and phase locking value (PLV) calculation, has achieved success in motor imagery classification (MI). However, this pipeline is manual and reliant on expert knowledge, limiting its convenience and adaptability to different application scenarios. Moreover, most studies have employed mediocre data-independent spatial filters to suppress noise, impeding the exploration of more significant phase synchronization phenomena. To address the issues, we propose the concept of phase synchrony component self-organization, which enables the adaptive learning of data-dependent spatial filters for automating both the preprocessing and channel selection procedures. Based on this concept, the first deep learning end-to-end network is developed, which directly extracts phase synchrony-based features from raw EEG signals and perform classification. The network learns optimal filters during training, which are obtained when the network achieves peak classification results. Extensive experiments have demonstrated that our network outperforms state-of-the-art methods. Remarkably, through the learned optimal filters, significant phase synchronization phenomena can be observed. Specifically, by calculating the PLV between a pair of signals extracted from each sample using two of the learned spatial filters, we have obtained an average PLV exceeding 0.87 across all tongue MI samples. This high PLV indicates a groundbreaking discovery in the synchrony pattern of tongue MI. △ Less

Submitted 11 October, 2023; v1 submitted 21 September, 2023; originally announced October 2023.

arXiv:2309.15889 [pdf, other]

High Perceptual Quality Wireless Image Delivery with Denoising Diffusion Models

Authors: Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz

Abstract: We consider the image transmission problem over a noisy wireless channel via deep learning-based joint source-channel coding (DeepJSCC) along with a denoising diffusion probabilistic model (DDPM) at the receiver. Specifically, we are interested in the perception-distortion trade-off in the practical finite block length regime, in which separate source and channel coding can be highly suboptimal. W… ▽ More We consider the image transmission problem over a noisy wireless channel via deep learning-based joint source-channel coding (DeepJSCC) along with a denoising diffusion probabilistic model (DDPM) at the receiver. Specifically, we are interested in the perception-distortion trade-off in the practical finite block length regime, in which separate source and channel coding can be highly suboptimal. We introduce a novel scheme that utilizes the range-null space decomposition of the target image. We transmit the range-space of the image after encoding and employ DDPM to progressively refine its null space contents. Through extensive experiments, we demonstrate significant improvements in distortion and perceptual quality of reconstructed images compared to standard DeepJSCC and the state-of-the-art generative learning-based method. We will publicly share our source code to facilitate further research and reproducibility. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: 6 pages, 4 figures

arXiv:2309.04842 [pdf, other]

Leveraging Large Language Models for Exploiting ASR Uncertainty

Authors: Pranay Dighe, Yi Su, Shangshang Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik

Abstract: While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the… ▽ More While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM; followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using 1-best ASR hypothesis; thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications. △ Less

Submitted 12 September, 2023; v1 submitted 9 September, 2023; originally announced September 2023.

Comments: Added references

arXiv:2309.03040 [pdf, other]

Automated CVE Analysis for Threat Prioritization and Impact Prediction

Authors: Ehsan Aghaei, Ehab Al-Shaer, Waseem Shadid, Xi Niu

Abstract: The Common Vulnerabilities and Exposures (CVE) are pivotal information for proactive cybersecurity measures, including service patching, security hardening, and more. However, CVEs typically offer low-level, product-oriented descriptions of publicly disclosed cybersecurity vulnerabilities, often lacking the essential attack semantic information required for comprehensive weakness characterization… ▽ More The Common Vulnerabilities and Exposures (CVE) are pivotal information for proactive cybersecurity measures, including service patching, security hardening, and more. However, CVEs typically offer low-level, product-oriented descriptions of publicly disclosed cybersecurity vulnerabilities, often lacking the essential attack semantic information required for comprehensive weakness characterization and threat impact estimation. This critical insight is essential for CVE prioritization and the identification of potential countermeasures, particularly when dealing with a large number of CVEs. Current industry practices involve manual evaluation of CVEs to assess their attack severities using the Common Vulnerability Scoring System (CVSS) and mapping them to Common Weakness Enumeration (CWE) for potential mitigation identification. Unfortunately, this manual analysis presents a major bottleneck in the vulnerability analysis process, leading to slowdowns in proactive cybersecurity efforts and the potential for inaccuracies due to human errors. In this research, we introduce our novel predictive model and tool (called CVEDrill) which revolutionizes CVE analysis and threat prioritization. CVEDrill accurately estimates the CVSS vector for precise threat mitigation and priority ranking and seamlessly automates the classification of CVEs into the appropriate CWE hierarchy classes. By harnessing CVEDrill, organizations can now implement cybersecurity countermeasure mitigation with unparalleled accuracy and timeliness, surpassing in this domain the capabilities of state-of-the-art tools like ChaptGPT. △ Less

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2308.08244 [pdf, other]

A Hybrid Wireless Image Transmission Scheme with Diffusion

Authors: Xueyan Niu, Xu Wang, Deniz Gündüz, Bo Bai, Weichao Chen, Guohua Zhou

Abstract: We propose a hybrid joint source-channel coding (JSCC) scheme, in which the conventional digital communication scheme is complemented with a generative refinement component to improve the perceptual quality of the reconstruction. The input image is decomposed into two components: the first is a coarse compressed version, and is transmitted following the conventional separation based approach. An a… ▽ More We propose a hybrid joint source-channel coding (JSCC) scheme, in which the conventional digital communication scheme is complemented with a generative refinement component to improve the perceptual quality of the reconstruction. The input image is decomposed into two components: the first is a coarse compressed version, and is transmitted following the conventional separation based approach. An additional component is obtained through the diffusion process by adding independent Gaussian noise to the input image, and is transmitted using DeepJSCC. The decoder combines the two signals to produce a high quality reconstruction of the source. Experimental results show that the hybrid design provides bandwidth savings and enables graceful performance improvement as the channel quality improves. △ Less

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.07770 [pdf, other]

Multi-scale Promoted Self-adjusting Correlation Learning for Facial Action Unit Detection

Authors: Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, Jingyu Yang

Abstract: Facial Action Unit (AU) detection is a crucial task in affective computing and social robotics as it helps to identify emotions expressed through facial expressions. Anatomically, there are innumerable correlations between AUs, which contain rich information and are vital for AU detection. Previous methods used fixed AU correlations based on expert experience or statistical rules on specific bench… ▽ More Facial Action Unit (AU) detection is a crucial task in affective computing and social robotics as it helps to identify emotions expressed through facial expressions. Anatomically, there are innumerable correlations between AUs, which contain rich information and are vital for AU detection. Previous methods used fixed AU correlations based on expert experience or statistical rules on specific benchmarks, but it is challenging to comprehensively reflect complex correlations between AUs via hand-crafted settings. There are alternative methods that employ a fully connected graph to learn these dependencies exhaustively. However, these approaches can result in a computational explosion and high dependency with a large dataset. To address these challenges, this paper proposes a novel self-adjusting AU-correlation learning (SACL) method with less computation for AU detection. This method adaptively learns and updates AU correlation graphs by efficiently leveraging the characteristics of different levels of AU motion and emotion representation information extracted in different stages of the network. Moreover, this paper explores the role of multi-scale learning in correlation information extraction, and design a simple yet effective multi-scale feature learning (MSFL) method to promote better performance in AU detection. By integrating AU correlation information with multi-scale features, the proposed method obtains a more robust feature representation for the final AU detection. Extensive experiments show that the proposed method outperforms the state-of-the-art methods on widely used AU detection benchmark datasets, with only 28.7\% and 12.0\% of the parameters and FLOPs of the best method, respectively. The code for this method is available at \url{https://github.com/linuxsino/Self-adjusting-AU}. △ Less

Submitted 15 August, 2023; originally announced August 2023.

Comments: 13pages, 7 figures

arXiv:2308.00183 [pdf, other]

Hovering Control of Flapping Wings in Tandem with Multi-Rotors

Authors: Aniket Dhole, Bibek Gupta, Adarsh Salagame, Xuejian Niu, Yizhe Xu, Kaushik Venkatesh, Paul Ghanem, Ioannis Mandralis, Eric Sihite, Alireza Ramezani

Abstract: This work briefly covers our efforts to stabilize the flight dynamics of Northeastern's tailless bat-inspired micro aerial vehicle, Aerobat. Flapping robots are not new. A plethora of examples is mainly dominated by insect-style design paradigms that are passively stable. However, Aerobat, in addition for being tailless, possesses morphing wings that add to the inherent complexity of flight contro… ▽ More This work briefly covers our efforts to stabilize the flight dynamics of Northeastern's tailless bat-inspired micro aerial vehicle, Aerobat. Flapping robots are not new. A plethora of examples is mainly dominated by insect-style design paradigms that are passively stable. However, Aerobat, in addition for being tailless, possesses morphing wings that add to the inherent complexity of flight control. The robot can dynamically adjust its wing platform configurations during gait cycles, increasing its efficiency and agility. We employ a guard design with manifold small thrusters to stabilize Aerobat's position and orientation in hovering, a flapping system in tandem with a multi-rotor. For flight control purposes, we take an approach based on assuming the guard cannot observe Aerobat's states. Then, we propose an observer to estimate the unknown states of the guard which are then used for closed-loop hovering control of the Guard-Aerobat platform. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.11196 [pdf, other]

Exact Community Recovery in the Geometric SBM

Authors: Julia Gaudio, Xiaochun Niu, Ermin Wei

Abstract: We study the problem of exact community recovery in the Geometric Stochastic Block Model (GSBM), where each vertex has an unknown community label as well as a known position, generated according to a Poisson point process in $\mathbb{R}^d$. Edges are formed independently conditioned on the community labels and positions, where vertices may only be connected by an edge if they are within a prescrib… ▽ More We study the problem of exact community recovery in the Geometric Stochastic Block Model (GSBM), where each vertex has an unknown community label as well as a known position, generated according to a Poisson point process in $\mathbb{R}^d$. Edges are formed independently conditioned on the community labels and positions, where vertices may only be connected by an edge if they are within a prescribed distance of each other. The GSBM thus favors the formation of dense local subgraphs, which commonly occur in real-world networks, a property that makes the GSBM qualitatively very different from the standard Stochastic Block Model (SBM). We propose a linear-time algorithm for exact community recovery, which succeeds down to the information-theoretic threshold, confirming a conjecture of Abbe, Baccelli, and Sankararaman. The algorithm involves two phases. The first phase exploits the density of local subgraphs to propagate estimated community labels among sufficiently occupied subregions, and produces an almost-exact vertex labeling. The second phase then refines the initial labels using a Poisson testing procedure. Thus, the GSBM enjoys local to global amplification just as the SBM, with the advantage of admitting an information-theoretically optimal, linear-time algorithm. △ Less

Submitted 5 January, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.10554 [pdf, other]

EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization

Authors: Peijie Dong, Lujun Li, Zimian Wei, Xin Niu, Zhiliang Tian, Hengyue Pan

Abstract: Mixed-Precision Quantization~(MQ) can achieve a competitive accuracy-complexity trade-off for models. Conventional training-based search methods require time-consuming candidate training to search optimized per-layer bit-width configurations in MQ. Recently, some training-free approaches have presented various MQ proxies and significantly improve search efficiency. However, the correlation between… ▽ More Mixed-Precision Quantization~(MQ) can achieve a competitive accuracy-complexity trade-off for models. Conventional training-based search methods require time-consuming candidate training to search optimized per-layer bit-width configurations in MQ. Recently, some training-free approaches have presented various MQ proxies and significantly improve search efficiency. However, the correlation between these proxies and quantization accuracy is poorly understood. To address the gap, we first build the MQ-Bench-101, which involves different bit configurations and quantization results. Then, we observe that the existing training-free proxies perform weak correlations on the MQ-Bench-101. To efficiently seek superior proxies, we develop an automatic search of proxies framework for MQ via evolving algorithms. In particular, we devise an elaborate search space involving the existing proxies and perform an evolution search to discover the best correlated MQ proxy. We proposed a diversity-prompting selection strategy and compatibility screening protocol to avoid premature convergence and improve search efficiency. In this way, our Evolving proxies for Mixed-precision Quantization~(EMQ) framework allows the auto-generation of proxies without heavy tuning and expert knowledge. Extensive experiments on ImageNet with various ResNet and MobileNet families demonstrate that our EMQ obtains superior performance than state-of-the-art mixed-precision methods at a significantly reduced cost. The code will be released. △ Less

Submitted 19 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV2023

arXiv:2307.06632 [pdf]

FF-LINS: A Consistent Frame-to-Frame Solid-State-LiDAR-Inertial State Estimator

Authors: Hailiang Tang, Tisheng Zhang, Xiaoji Niu, Liqiang Wang, Linfu Wei, Jingnan Liu

Abstract: Most of the existing LiDAR-inertial navigation systems are based on frame-to-map registrations, leading to inconsistency in state estimation. The newest solid-state LiDAR with a non-repetitive scanning pattern makes it possible to achieve a consistent LiDAR-inertial estimator by employing a frame-to-frame data association. In this letter, we propose a robust and consistent frame-to-frame LiDAR-ine… ▽ More Most of the existing LiDAR-inertial navigation systems are based on frame-to-map registrations, leading to inconsistency in state estimation. The newest solid-state LiDAR with a non-repetitive scanning pattern makes it possible to achieve a consistent LiDAR-inertial estimator by employing a frame-to-frame data association. In this letter, we propose a robust and consistent frame-to-frame LiDAR-inertial navigation system (FF-LINS) for solid-state LiDARs. With the INS-centric LiDAR frame processing, the keyframe point-cloud map is built using the accumulated point clouds to construct the frame-to-frame data association. The LiDAR frame-to-frame and the inertial measurement unit (IMU) preintegration measurements are tightly integrated using the factor graph optimization, with online calibration of the LiDAR-IMU extrinsic and time-delay parameters. The experiments on the public and private datasets demonstrate that the proposed FF-LINS achieves superior accuracy and robustness than the state-of-the-art systems. Besides, the LiDAR-IMU extrinsic and time-delay parameters are estimated effectively, and the online calibration notably improves the pose accuracy. The proposed FF-LINS and the employed datasets are open-sourced on GitHub (https://github.com/i2Nav-WHU/FF-LINS). △ Less

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.03824 [pdf, other]

Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters

Authors: Zhenyu Sun, Xiaochun Niu, Ermin Wei

Abstract: Generalization performance is a key metric in evaluating machine learning models when applied to real-world applications. Good generalization indicates the model can predict unseen data correctly when trained under a limited number of data. Federated learning (FL), which has emerged as a popular distributed learning framework, allows multiple devices or clients to train a shared model without viol… ▽ More Generalization performance is a key metric in evaluating machine learning models when applied to real-world applications. Good generalization indicates the model can predict unseen data correctly when trained under a limited number of data. Federated learning (FL), which has emerged as a popular distributed learning framework, allows multiple devices or clients to train a shared model without violating privacy requirements. While the existing literature has studied extensively the generalization performances of centralized machine learning algorithms, similar analysis in the federated settings is either absent or with very restrictive assumptions on the loss functions. In this paper, we aim to analyze the generalization performances of federated learning by means of algorithmic stability, which measures the change of the output model of an algorithm when perturbing one data point. Three widely-used algorithms are studied, including FedAvg, SCAFFOLD, and FedProx, under convex and non-convex loss functions. Our analysis shows that the generalization performances of models trained by these three algorithms are closely related to the heterogeneity of clients' datasets as well as the convergence behaviors of the algorithms. Particularly, in the i.i.d. setting, our results recover the classical results of stochastic gradient descent (SGD). △ Less

Submitted 6 June, 2023; originally announced June 2023.

Comments: Submitted to NeurIPS 2023

arXiv:2306.02568 [pdf, other]

Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming

Authors: Xinlei Niu, Christian Walder, Jing Zhang, Charles Patrick Martin

Abstract: We propose the stochastic optimal path which solves the classical optimal path problem by a probability-softening solution. This unified approach transforms a wide range of DP problems into directed acyclic graphs in which all paths follow a Gibbs distribution. We show the equivalence of the Gibbs distribution to a message-passing algorithm by the properties of the Gumbel distribution and give all… ▽ More We propose the stochastic optimal path which solves the classical optimal path problem by a probability-softening solution. This unified approach transforms a wide range of DP problems into directed acyclic graphs in which all paths follow a Gibbs distribution. We show the equivalence of the Gibbs distribution to a message-passing algorithm by the properties of the Gumbel distribution and give all the ingredients required for variational Bayesian inference of a latent path, namely Bayesian dynamic programming (BDP). We demonstrate the usage of BDP in the latent space of variational autoencoders (VAEs) and propose the BDP-VAE which captures structured sparse optimal paths as latent variables. This enables end-to-end training for generative tasks in which models rely on unobserved structural information. At last, we validate the behavior of our approach and showcase its applicability in two real-world applications: text-to-speech and singing voice synthesis. Our implementation code is available at \url{https://github.com/XinleiNIU/LatentOptimalPathsBayesianDP}. △ Less

Submitted 25 June, 2024; v1 submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted by ICML 2024

arXiv:2305.17131 [pdf, other]

doi 10.18653/v1/2023.acl-short.126

RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation

Authors: Gabriele Sarti, Phu Mon Htut, Xing Niu, Benjamin Hsu, Anna Currey, Georgiana Dinu, Maria Nadejde

Abstract: Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised… ▽ More Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023

Journal ref: Proceedings of ACL (2023) 1476-1490

arXiv:2305.13547 [pdf, other]

Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks

Authors: Haoqi Zheng, Qihuang Zhong, Liang Ding, Zhiliang Tian, Xin Niu, Dongsheng Li, Dacheng Tao

Abstract: Text classification tasks often encounter few shot scenarios with limited labeled data, and addressing data scarcity is crucial. Data augmentation with mixup has shown to be effective on various text classification tasks. However, most of the mixup methods do not consider the varying degree of learning difficulty in different stages of training and generate new samples with one hot labels, resulti… ▽ More Text classification tasks often encounter few shot scenarios with limited labeled data, and addressing data scarcity is crucial. Data augmentation with mixup has shown to be effective on various text classification tasks. However, most of the mixup methods do not consider the varying degree of learning difficulty in different stages of training and generate new samples with one hot labels, resulting in the model over confidence. In this paper, we propose a self evolution learning (SE) based mixup approach for data augmentation in text classification, which can generate more adaptive and model friendly pesudo samples for the model training. SE focuses on the variation of the model's learning ability. To alleviate the model confidence, we introduce a novel instance specific label smoothing approach, which linearly interpolates the model's output and one hot labels of the original samples to generate new soft for label mixing up. Through experimental analysis, in addition to improving classification accuracy, we demonstrate that SE also enhances the model's generalize ability. △ Less

Submitted 27 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2305.12644 [pdf]

PO-VINS: An Efficient Pose-Only LiDAR-Enhanced Visual-Inertial State Estimator

Authors: Hailiang Tang, Xiaoji Niu, Tisheng Zhang, Liqiang Wang, Guan Wang, Jingnan Liu

Abstract: The pose-only (PO) visual representation has been proven to be equivalent to the classical multiple-view geometry, while significantly improving computational efficiency. However, its applicability for real-world navigation in large-scale complex environments has not yet been demonstrated. In this study, we present an efficient pose-only LiDAR-enhanced visual-inertial navigation system (PO-VINS) t… ▽ More The pose-only (PO) visual representation has been proven to be equivalent to the classical multiple-view geometry, while significantly improving computational efficiency. However, its applicability for real-world navigation in large-scale complex environments has not yet been demonstrated. In this study, we present an efficient pose-only LiDAR-enhanced visual-inertial navigation system (PO-VINS) to enhance the real-time performance of the state estimator. In the visual-inertial state estimator (VISE), we propose a pose-only visual-reprojection measurement model that only contains the inertial measurement unit (IMU) pose and extrinsic-parameter states. We further integrated the LiDAR-enhanced method to construct a pose-only LiDAR-depth measurement model. Real-world experiments were conducted in large-scale complex environments, demonstrating that the proposed PO-VISE and LiDAR-enhanced PO-VISE reduce computational complexity by more than 50% and over 20%, respectively. Additionally, the PO-VINS yields the same accuracy as conventional methods. These results indicate that the pose-only solution is efficient and applicable for real-time visual-inertial state estimation. △ Less

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2305.11808 [pdf, other]

Pseudo-Label Training and Model Inertia in Neural Machine Translation

Authors: Benjamin Hsu, Anna Currey, Xing Niu, Maria Nădejde, Georgiana Dinu

Abstract: Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label trai… ▽ More Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and we identify distribution simplification as a mechanism behind the observed results. △ Less

Submitted 19 May, 2023; originally announced May 2023.

Comments: accepted ICLR 2023

arXiv:2305.10029 [pdf, other]

TextSLAM: Visual SLAM with Semantic Planar Text Features

Authors: Boying Li, Danping Zou, Yuan Huang, Xinghan Niu, Ling Pei, Wenxian Yu

Abstract: We propose a novel visual SLAM method that integrates text objects tightly by treating them as semantic features via fully exploring their geometric and semantic prior. The text object is modeled as a texture-rich planar patch whose semantic meaning is extracted and updated on the fly for better data association. With the full exploration of locally planar characteristics and semantic meaning of t… ▽ More We propose a novel visual SLAM method that integrates text objects tightly by treating them as semantic features via fully exploring their geometric and semantic prior. The text object is modeled as a texture-rich planar patch whose semantic meaning is extracted and updated on the fly for better data association. With the full exploration of locally planar characteristics and semantic meaning of text objects, the SLAM system becomes more accurate and robust even under challenging conditions such as image blurring, large viewpoint changes, and significant illumination variations (day and night). We tested our method in various scenes with the ground truth data. The results show that integrating texture features leads to a more superior SLAM system that can match images across day and night. The reconstructed semantic 3D text map could be useful for navigation and scene understanding in robotic and mixed reality applications. Our project page: https://github.com/SJTU-ViSYS/TextSLAM . △ Less

Submitted 3 July, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: 19 pages, 23 figures. Whole project page: https://leeby68.github.io/TextSLAM/

Showing 1–50 of 137 results for author: Niu, X