-
Rényi-infinity constrained sampling with $d^3$ membership queries
Authors:
Yunbum Kook,
Matthew S. Zhang
Abstract:
Uniform sampling over a convex body is a fundamental algorithmic problem, yet the convergence in KL or Rényi divergence of most samplers remains poorly understood. In this work, we propose a constrained proximal sampler, a principled and simple algorithm that possesses elegant convergence guarantees. Leveraging the uniform ergodicity of this sampler, we show that it converges in the Rényi-infinity…
▽ More
Uniform sampling over a convex body is a fundamental algorithmic problem, yet the convergence in KL or Rényi divergence of most samplers remains poorly understood. In this work, we propose a constrained proximal sampler, a principled and simple algorithm that possesses elegant convergence guarantees. Leveraging the uniform ergodicity of this sampler, we show that it converges in the Rényi-infinity divergence ($\mathcal R_\infty$) with no query complexity overhead when starting from a warm start. This is the strongest of commonly considered performance metrics, implying rates in $\{\mathcal R_q, \mathsf{KL}\}$ convergence as special cases.
By applying this sampler within an annealing scheme, we propose an algorithm which can approximately sample $\varepsilon$-close to the uniform distribution on convex bodies in $\mathcal R_\infty$-divergence with $\widetilde{\mathcal{O}}(d^3\, \text{polylog} \frac{1}{\varepsilon})$ query complexity. This improves on all prior results in $\{\mathcal R_q, \mathsf{KL}\}$-divergences, without resorting to any algorithmic modifications or post-processing of the sample. It also matches the prior best known complexity in total variation distance.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Authors:
Zhong-Zhi Li,
Ming-Liang Zhang,
Fei Yin,
Zhi-Long Ji,
Jin-Feng Bai,
Zhen-Ru Pan,
Fan-Hu Zeng,
Jian Xu,
Jia-Xin Zhang,
Cheng-Lin Liu
Abstract:
Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Despite the datasets like MathVista proposed benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 ed…
▽ More
Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Despite the datasets like MathVista proposed benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, contraining 23k multimodal K12 math related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions from elementary to high school levels, provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.
△ Less
Submitted 27 June, 2024;
originally announced July 2024.
-
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
Authors:
Hongrong Cheng,
Miao Zhang,
Javen Qinfeng Shi
Abstract:
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede…
▽ More
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove no-critical channels and multi-attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to gradient-free methods.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
TEXasGAN: Tactile Texture Exploration and Synthesis System Using Generative Adversarial Network
Authors:
Mingxin Zhang,
Shun Terui,
Yasutoshi Makino,
Hiroyuki Shinoda
Abstract:
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop…
▽ More
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop vibration generation model based on user preferences. To enable users to easily control the generation of vibration samples with large parameter spaces, we introduce an optimization model based on Differential Subspace Search (DSS) and Generative Adversarial Network (GAN). With DSS, users can use a one-dimensional slider to easily modify the high-dimensional latent space so that the GAN can generate desired vibrations. We trained the generative model using a open dataset of tactile vibration data and selected five types of vibrations as target samples for the generation experiment. Extensive user experiments were conducted using the generated and real samples. The results indicate that our system can generate distinguishable samples that match the target characteristics. Moreover, the results also reveal a correlation between subjects' ability to distinguish real samples and their ability to distinguish generated samples.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
TLRN: Temporal Latent Residual Networks For Large Deformation Image Registration
Authors:
Nian Wu,
Jiarui Xing,
Miaomiao Zhang
Abstract:
This paper presents a novel approach, termed {\em Temporal Latent Residual Network (TLRN)}, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase…
▽ More
This paper presents a novel approach, termed {\em Temporal Latent Residual Network (TLRN)}, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase). To achieve accurate and robust registration results, we leverage the nature of motion continuity and exploit the temporal smoothness in consecutive image frames. Our proposed TLRN highlights a temporal residual network with residual blocks carefully designed in latent deformation spaces, which are parameterized by time-sequential initial velocity fields. We treat a sequence of residual blocks over time as a dynamic training system, where each block is designed to learn the residual function between desired deformation features and current input accumulated from previous time frames. We validate the effectivenss of TLRN on both synthetic data and real-world cine cardiac magnetic resonance (CMR) image videos. Our experimental results shows that TLRN is able to achieve substantially improved registration accuracy compared to the state-of-the-art. Our code is publicly available at https://github.com/nellie689/TLRN.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Protecting Data Buyer Privacy in Data Markets
Authors:
Minxing Zhang,
Jian Pei
Abstract:
Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third parties, neglecting the challenges associated with protecting the privacy of data buyers. In this article, we address this gap by modeling the intri…
▽ More
Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third parties, neglecting the challenges associated with protecting the privacy of data buyers. In this article, we address this gap by modeling the intricacies of data buyer privacy protection and investigating the delicate balance between privacy and purchase cost. Through comprehensive experimentation, our results yield valuable insights, shedding light on the efficacy and efficiency of our proposed approaches.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
Authors:
Lecheng Kong,
Jiarui Feng,
Hao Liu,
Chengsong Huang,
Jiaxin Huang,
Yixin Chen,
Muhan Zhang
Abstract:
Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph…
▽ More
Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph data into a language format for LLM-based prediction or still train a GNN model with LLM as an assistant. The former can handle unlimited tasks, while the latter captures graph structure much better -- yet, no existing work can achieve both simultaneously. In this paper, we identify three key desirable properties of a GFM: self-supervised pretraining, fluidity in tasks, and graph awareness. To account for these properties, we extend the conventional language modeling to the graph domain and propose a novel generative graph language model GOFA to solve the problem. The model interleaves randomly initialized GNN layers into a frozen pre-trained LLM so that the semantic and structural modeling abilities are organically combined. GOFA is pre-trained on newly proposed graph-level next-word prediction, question-answering, and structural tasks to obtain the above GFM properties. The pre-trained model is further fine-tuned on downstream tasks to obtain task-solving ability. The fine-tuned model is evaluated on various downstream tasks, demonstrating a strong ability to solve structural and contextual problems in zero-shot scenarios. The code is available at https://github.com/JiaruiFeng/GOFA.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
SLoRD: Structural Low-Rank Descriptors for Shape Consistency in Vertebrae Segmentation
Authors:
Xin You,
Yixin Lou,
Minghui Zhang,
Chuyan Zhang,
Jie Yang,
Yun Gu
Abstract:
Automatic and precise segmentation of vertebrae from CT images is crucial for various clinical applications. However, due to a lack of explicit and strict constraints, existing methods especially for single-stage methods, still suffer from the challenge of intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. For multi-stage methods, ver…
▽ More
Automatic and precise segmentation of vertebrae from CT images is crucial for various clinical applications. However, due to a lack of explicit and strict constraints, existing methods especially for single-stage methods, still suffer from the challenge of intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. For multi-stage methods, vertebrae detection serving as the first step, is affected by the pathology and mental implants. Thus, incorrect detections cause biased patches before segmentation, then lead to inconsistent labeling and segmentation. In our work, motivated by the perspective of instance segmentation, we try to label individual and complete binary masks to address this limitation. Specifically, a contour-based network is proposed based on Structural Low-Rank Descriptors for shape consistency, termed SLoRD. These contour descriptors are acquired in a data-driven manner in advance. For a more precise representation of contour descriptors, we adopt the spherical coordinate system and devise the spherical centroid. Besides, the contour loss is designed to impose explicit consistency constraints, facilitating regressed contour points close to vertebral boundaries. Quantitative and qualitative evaluations on VerSe 2019 demonstrate the superior performance of our framework over other single-stage and multi-stage state-of-the-art (SOTA) methods.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
Authors:
Zheng Wang,
Boxiao Jin,
Zhongzhi Yu,
Minjia Zhang
Abstract:
How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-c…
▽ More
How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios due to the information loss introduced by eviction. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward merging set identification algorithm to identify suitable KV states for merging. Our merging set identification algorithm stimulates the second observation that KV cache sparsity, from similarity perspective, is independent of the dataset and remains persistent at the model level. Subsequently, we propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroScroll benchmarks, we compare our method with other KV cache compression techniques, including H2O and CaM, showing that our method achieves superior performance across tasks with both 50% and 35% KV cache budgets.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
Authors:
Mingjin Zhang,
Yuchun Wang,
Jie Guo,
Yunsong Li,
Xinbo Gao,
Jing Zhang
Abstract:
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared i…
▽ More
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared images. Unlike a visible light camera, a thermal imager reveals an object's temperature distribution by capturing infrared radiation. Small targets often show a subtle temperature transition at the object's boundaries. To address this issue, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects. Specifically, we design a Perona-Malik diffusion (PMD)-based block and incorporate it into multiple levels of SAM's encoder to help it capture essential structural features while suppressing noise. Additionally, we devise a Granularity-Aware Decoder (GAD) to fuse the multi-granularity feature from the encoder to capture structural information that may be lost in long-distance modeling. Extensive experiments on the public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, validate the design choice of IRSAM and its significant superiority over representative state-of-the-art methods. The source code are available at: github.com/IPIC-Lab/IRSAM.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
A deep graph model for the signed interaction prediction in biological network
Authors:
Shuyi Jin,
Mengji Zhang,
Meijie Wang,
Lun Yu
Abstract:
In pharmaceutical research, the strategy of drug repurposing accelerates the development of new therapies while reducing R&D costs. Network pharmacology lays the theoretical groundwork for identifying new drug indications, and deep graph models have become essential for their precision in mapping complex biological networks. Our study introduces an advanced graph model that utilizes graph convolut…
▽ More
In pharmaceutical research, the strategy of drug repurposing accelerates the development of new therapies while reducing R&D costs. Network pharmacology lays the theoretical groundwork for identifying new drug indications, and deep graph models have become essential for their precision in mapping complex biological networks. Our study introduces an advanced graph model that utilizes graph convolutional networks and tensor decomposition to effectively predict signed chemical-gene interactions. This model demonstrates superior predictive performance, especially in handling the polar relations in biological networks. Our research opens new avenues for drug discovery and repurposing, especially in understanding the mechanism of actions of drugs.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram
Authors:
Ming-Liang Zhang,
Zhong-Zhi Li,
Fei Yin,
Liang Lin,
Cheng-Lin Liu
Abstract:
Geometry problem solving (GPS) requires capacities of multi-modal understanding, multi-hop reasoning and theorem knowledge application. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express fine-grained st…
▽ More
Geometry problem solving (GPS) requires capacities of multi-modal understanding, multi-hop reasoning and theorem knowledge application. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express fine-grained structural and semantic content of geometry diagram, and fuse diagram with textual problem efficiently through structural-semantic pre-training. For reasoning, we design an explicable solution program to describe the geometric reasoning process, and employ a self-limited decoder to generate solution program autoregressively. To reduce solution errors, a multi-level theorem verifier is proposed to eliminate solutions that do not match geometric principles, alleviating the hallucination of the neural model. We also construct a large-scale geometry problem dataset called PGPS9K, containing fine-grained annotations of textual clauses, solution program and involved knowledge tuples. Extensive experiments on datasets Geometry3K and PGPS9K show that our PGPSNet solver outperforms existing symbolic and neural solvers in GPS performance, while maintaining good explainability and reliability, and the solver components (fusion, reasoning, verification) are all justified effective.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Authors:
Mingfang Zhang,
Yifei Huang,
Ruicong Liu,
Yoichi Sato
Abstract:
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motio…
▽ More
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multi-modal representations via modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics in multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling are further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usages in the real world.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
Authors:
Zhangyang Qi,
Yunhan Yang,
Mengchen Zhang,
Long Xing,
Xiaoyang Wu,
Tong Wu,
Dahua Lin,
Xihui Liu,
Jiaqi Wang,
Hengshuang Zhao
Abstract:
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed edit and customization of 3D assets remains a long-standing challenge. Specifically, 3D Generation methods lack the ability to follow finely detailed instructions as precisely as their 2D image creation counterparts…
▽ More
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed edit and customization of 3D assets remains a long-standing challenge. Specifically, 3D Generation methods lack the ability to follow finely detailed instructions as precisely as their 2D image creation counterparts. Imagine you can get a toy through 3D AIGC but with undesired accessories and dressing. To tackle this challenge, we propose a novel pipeline called Tailor3D, which swiftly creates customized 3D assets from editable dual-side images. We aim to emulate a tailor's ability to locally change objects or perform overall style transfer. Unlike creating 3D assets from multiple views, using dual-side images eliminates conflicts on overlapping areas that occur when editing individual views. Specifically, it begins by editing the front view, then generates the back view of the object through multi-view diffusion. Afterward, it proceeds to edit the back views. Finally, a Dual-sided LRM is proposed to seamlessly stitch together the front and back 3D features, akin to a tailor sewing together the front and back of a garment. The Dual-sided LRM rectifies imperfect consistencies between the front and back views, enhancing editing capabilities and reducing memory burdens while seamlessly integrating them into a unified 3D representation with the LoRA Triplane Transformer. Experimental results demonstrate Tailor3D's effectiveness across various 3D generation and editing tasks, including 3D generative fill and style transfer. It provides a user-friendly, efficient solution for editing 3D assets, with each editing step taking only seconds to complete.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation
Authors:
Xinying Guo,
Mingyuan Zhang,
Haozhe Xie,
Chenyang Gu,
Ziwei Liu
Abstract:
Crowd Motion Generation is essential in entertainment industries such as animation and games as well as in strategic fields like urban simulation and planning. This new task requires an intricate integration of control and generation to realistically synthesize crowd dynamics under specific spatial and semantic constraints, whose challenges are yet to be fully explored. On the one hand, existing h…
▽ More
Crowd Motion Generation is essential in entertainment industries such as animation and games as well as in strategic fields like urban simulation and planning. This new task requires an intricate integration of control and generation to realistically synthesize crowd dynamics under specific spatial and semantic constraints, whose challenges are yet to be fully explored. On the one hand, existing human motion generation models typically focus on individual behaviors, neglecting the complexities of collective behaviors. On the other hand, recent methods for multi-person motion generation depend heavily on pre-defined scenarios and are limited to a fixed, small number of inter-person interactions, thus hampering their practicality. To overcome these challenges, we introduce CrowdMoGen, a zero-shot text-driven framework that harnesses the power of Large Language Model (LLM) to incorporate the collective intelligence into the motion generation framework as guidance, thereby enabling generalizable planning and generation of crowd motions without paired training data. Our framework consists of two key components: 1) Crowd Scene Planner that learns to coordinate motions and dynamics according to specific scene contexts or introduced perturbations, and 2) Collective Motion Generator that efficiently synthesizes the required collective motions based on the holistic plans. Extensive quantitative and qualitative experiments have validated the effectiveness of our framework, which not only fills a critical gap by providing scalable and generalizable solutions for Crowd Motion Generation task but also achieves high levels of realism and flexibility.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Authors:
Shihan Dou,
Haoxiang Jia,
Shenxi Wu,
Huiyuan Zheng,
Weikang Zhou,
Muling Wu,
Mingxu Chai,
Jessica Fan,
Caishuang Huang,
Yunbo Tao,
Yan Liu,
Enyu Zhou,
Ming Zhang,
Yuhao Zhou,
Yueming Wu,
Rui Zheng,
Ming Wen,
Rongxiang Weng,
Jingang Wang,
Xunliang Cai,
Tao Gui,
Xipeng Qiu,
Qi Zhang,
Xuanjing Huang
Abstract:
The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundar…
▽ More
The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of these existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. Furthermore, to better understand the performance of LLMs in real-world projects, we manually created a real-world benchmark comprising 140 code generation tasks. Our analysis highlights distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex problems.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning
Authors:
Xiaojie Li,
Yibo Yang,
Jianlong Wu,
Bernard Ghanem,
Liqiang Nie,
Min Zhang
Abstract:
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples while preserving the knowledge of previously learned classes. Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially, prone to overfitting to the current session. Existing dynamic strateg…
▽ More
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples while preserving the knowledge of previously learned classes. Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially, prone to overfitting to the current session. Existing dynamic strategies require the expansion of the parameter space continually, leading to increased complexity. To address these challenges, we integrate the recently proposed selective state space model (SSM) into FSCIL. Concretely, we propose a dual selective SSM projector that dynamically adjusts the projection parameters based on the intermediate features for dynamic adaptation. The dual design enables the model to maintain the robust features of base classes, while adaptively learning distinctive feature shifts for novel classes. Additionally, we develop a class-sensitive selective scan mechanism to guide dynamic adaptation. It minimizes the disruption to base-class representations caused by training on novel data, and meanwhile, forces the selective scan to perform in distinct patterns between base and novel classes. Experiments on miniImageNet, CUB-200, and CIFAR-100 demonstrate that our framework outperforms the existing state-of-the-art methods. The code is available at https://github.com/xiaojieli0903/Mamba-FSCIL.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation
Authors:
Alan Wu,
Ye Yuan,
Ming Zhang
Abstract:
Visually impaired people are a large group who can only use braille for reading and writing. However, the lack of special educational resources is the bottleneck for educating them. Educational equity is a reflection of the level of social civilization, cultural equality, and individual dignity. Facilitating and improving lifelong learning channels for the visually impaired is of great significanc…
▽ More
Visually impaired people are a large group who can only use braille for reading and writing. However, the lack of special educational resources is the bottleneck for educating them. Educational equity is a reflection of the level of social civilization, cultural equality, and individual dignity. Facilitating and improving lifelong learning channels for the visually impaired is of great significance. Their written braille homework or exam papers cannot be understood by sighted teachers, because of the lack of a highly accurate braille translation system, especially in Chinese which has tone marks. braille writers often omit tone marks to save space, leading to confusion when braille with the same consonants and vowels is translated into Chinese. Previous algorithms were insufficient in extracting contextual information, resulting in low accuracy of braille translations into Chinese. This project informatively fine-tuned the mT5 model with an Encoder-decoder architecture for braille to Chinese character conversion. This research created a training set of braille and corresponding Chinese text from the Leipzig Corpora. This project significantly reduced the confusion in braille, achieving $62.4$ and $62.3$ BLEU scores in the validation and test sets, with a curriculum learning fine-tuning method. By incorporating the braille recognition algorithm, this project is the first publicly available braille translation system and can benefit lots of visually impaired students and families who are preparing for the Chinese College Test and help to propel their college dreams in the future. There is a demo on our homepage\footnote{\url{https://vision-braille.com/}}.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
SCATTER: Algorithm-Circuit Co-Sparse Photonic Accelerator with Thermal-Tolerant, Power-Efficient In-situ Light Redistribution
Authors:
Ziang Yin,
Nicholas Gangi,
Meng Zhang,
Jeff Zhang,
Rena Huang,
Jiaqi Gu
Abstract:
Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads. However, limited reconfigurability, high electrical-optical conversion cost, and thermal sensitivity limit the deployment of current optical analog computing engines to support power-restricted, performance-sensitive AI workloads at scale. Sparsity provides a great…
▽ More
Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads. However, limited reconfigurability, high electrical-optical conversion cost, and thermal sensitivity limit the deployment of current optical analog computing engines to support power-restricted, performance-sensitive AI workloads at scale. Sparsity provides a great opportunity for hardware-efficient AI accelerators. However, current dense photonic accelerators fail to fully exploit the power-saving potential of algorithmic sparsity. It requires sparsity-aware hardware specialization with a fundamental re-design of photonic tensor core topology and cross-layer device-circuit-architecture-algorithm co-optimization aware of hardware non-ideality and power bottleneck. To trim down the redundant power consumption while maximizing robustness to thermal variations, we propose SCATTER, a novel algorithm-circuit co-sparse photonic accelerator featuring dynamically reconfigurable signal path via thermal-tolerant, power-efficient in-situ light redistribution and power gating. A power-optimized, crosstalk-aware dynamic sparse training framework is introduced to explore row-column structured sparsity and ensure marginal accuracy loss and maximum power efficiency. The extensive evaluation shows that our cross-stacked optimized accelerator SCATTER achieves a 511X area reduction and 12.4X power saving with superior crosstalk tolerance that enables unprecedented circuit layout compactness and on-chip power efficiency.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Ternary Spike-based Neuromorphic Signal Processing System
Authors:
Shuai Wang,
Dehao Zhang,
Ammar Belatreche,
Yichen Xiao,
Hongyu Qing,
Wenjie We,
Malu Zhang,
Yang Yang
Abstract:
Deep Neural Networks (DNNs) have been successfully implemented across various signal processing fields, resulting in significant enhancements in performance. However, DNNs generally require substantial computational resources, leading to significant economic costs and posing challenges for their deployment on resource-constrained edge devices. In this study, we take advantage of spiking neural net…
▽ More
Deep Neural Networks (DNNs) have been successfully implemented across various signal processing fields, resulting in significant enhancements in performance. However, DNNs generally require substantial computational resources, leading to significant economic costs and posing challenges for their deployment on resource-constrained edge devices. In this study, we take advantage of spiking neural networks (SNNs) and quantization technologies to develop an energy-efficient and lightweight neuromorphic signal processing system. Our system is characterized by two principal innovations: a threshold-adaptive encoding (TAE) method and a quantized ternary SNN (QT-SNN). The TAE method can efficiently encode time-varying analog signals into sparse ternary spike trains, thereby reducing energy and memory demands for signal processing. QT-SNN, compatible with ternary spike trains from the TAE method, quantifies both membrane potentials and synaptic weights to reduce memory requirements while maintaining performance. Extensive experiments are conducted on two typical signal-processing tasks: speech and electroencephalogram recognition. The results demonstrate that our neuromorphic signal processing system achieves state-of-the-art (SOTA) performance with a 94% reduced memory requirement. Furthermore, through theoretical energy consumption analysis, our system shows 7.5x energy saving compared to other SNN works. The efficiency and efficacy of the proposed system highlight its potential as a promising avenue for energy-efficient signal processing.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Authors:
Haozhe Zhao,
Xiaojian Ma,
Liang Chen,
Shuzheng Si,
Rujie Wu,
Kaikai An,
Peiyu Yu,
Minjia Zhang,
Qing Li,
Baobao Chang
Abstract:
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct a…
▽ More
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found in https://ultra-editing.github.io.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
Authors:
Min Zhang,
Jianye Hao,
Xian Fu,
Peilong Han,
Hao Zhang,
Lei Shi,
Hongyao Tang,
Yan Zheng
Abstract:
In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFM s on embodied task planning, aiming to shed lig…
▽ More
In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFM s on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework, which encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulties, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables the automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. The MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
Incremental Multiview Point Cloud Registration
Authors:
Xiaoya Cheng,
Yu Liu,
Maojun Zhang,
Shen Yan
Abstract:
In this paper, we present a novel approach for multiview point cloud registration. Different from previous researches that typically employ a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse sc…
▽ More
In this paper, we present a novel approach for multiview point cloud registration. Different from previous researches that typically employ a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse scan graph with scan retrieval and geometric verification. Then, we perform incremental registration via initialization, next scan selection and registration, Track create and continue, and Bundle Adjustment. Additionally, for detector-free matchers, we incorporate a Track refinement process. This process primarily constructs a coarse multiview registration and refines the model by adjusting the positions of the keypoints on the Track. Experiments demonstrate that the proposed framework outperforms existing multiview registration methods on three benchmark datasets. The code is available at https://github.com/Choyaa/IncreMVR.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
Machine Learning for Complex Systems with Abnormal Pattern by Exception Maximization Outlier Detection Method
Authors:
Zhikun Zhang,
Yiting Duan,
Xiangjun Wang,
Mingyuan Zhang
Abstract:
This paper proposes a novel fast online methodology for outlier detection called the exception maximization outlier detection method(EMODM), which employs probabilistic models and statistical algorithms to detect abnormal patterns from the outputs of complex systems. The EMODM is based on a two-state Gaussian mixture model and demonstrates strong performance in probability anomaly detection workin…
▽ More
This paper proposes a novel fast online methodology for outlier detection called the exception maximization outlier detection method(EMODM), which employs probabilistic models and statistical algorithms to detect abnormal patterns from the outputs of complex systems. The EMODM is based on a two-state Gaussian mixture model and demonstrates strong performance in probability anomaly detection working on real-time raw data rather than using special prior distribution information. We confirm this using the synthetic data from two numerical cases. For the real-world data, we have detected the short circuit pattern of the circuit system using EMODM by the current and voltage output of a three-phase inverter. The EMODM also found an abnormal period due to COVID-19 in the insured unemployment data of 53 regions in the United States from 2000 to 2024. The application of EMODM to these two real-life datasets demonstrated the effectiveness and accuracy of our algorithm.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Planning with Large Language Models for Conversational Agents
Authors:
Zhigen Li,
Jianxiang Peng,
Yanmeng Wang,
Tianhao Shen,
Minghui Zhang,
Linxi Su,
Shang Wu,
Yihang Wu,
Yuqian Wang,
Ye Wang,
Wei Hu,
Jianfeng Li,
Shaojun Wang,
Jing Xiao,
Deyi Xiong
Abstract:
Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be un…
▽ More
Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be unified with controllability, proactivity, and low manual annotation. To bridge this gap, we propose a new framework for planning-based conversational agents (PCA) powered by large language models (LLMs), which only requires humans to define tasks and goals for the LLMs. Before conversation, LLM plans the core and necessary SOP for dialogue offline. During the conversation, LLM plans the best action path online referring to the SOP, and generates responses to achieve process controllability. Subsequently, we propose a semi-automatic dialogue data creation framework and curate a high-quality dialogue dataset (PCA-D). Meanwhile, we develop multiple variants and evaluation metrics for PCA, e.g., planning with Monte Carlo Tree Search (PCA-M), which searches for the optimal dialogue action while satisfying SOP constraints and achieving the proactive of the dialogue. Experiment results show that LLMs finetuned on PCA-D can significantly improve the performance and generalize to unseen domains. PCA-M outperforms other CoT and ToT baselines in terms of conversation controllability, proactivity, task success rate, and overall logical coherence, and is applicable in industry dialogue scenarios. The dataset and codes are available at XXXX.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Feature-Specific Coefficients of Determination in Tree Ensembles
Authors:
Zhongli Jiang,
Dabao Zhang,
Min Zhang
Abstract:
Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses,…
▽ More
Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses, although these coefficients allow us to comparatively assess single feature's contribution to tree ensembles. Here we propose an efficient algorithm, Q-SHAP, that reduces the computational complexity to polynomial time when calculating Shapley values related to quadratic losses. Our extensive simulation studies demonstrate that this approach not only enhances computational efficiency but also improves estimation accuracy of feature-specific coefficients of determination.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Foundations and Frontiers of Graph Learning Theory
Authors:
Yu Huang,
Min Zhou,
Menglin Yang,
Zhen Wang,
Muhan Zhang,
Jie Wang,
Hong Xie,
Hao Wang,
Defu Lian,
Enhong Chen
Abstract:
Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the…
▽ More
Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the theoretical analysis framework to distill the core concepts, helps understand the key principles that drive the functionality better and guide further development. Given this surge in interest, this article provides a comprehensive summary of the theoretical foundations and breakthroughs concerning the approximation and learning behaviors intrinsic to prevalent graph learning models. Encompassing discussions on fundamental aspects such as expressiveness power, generalization, optimization, and unique phenomena such as over-smoothing and over-squashing, this piece delves into the theoretical foundations and frontier driving the evolution of graph learning. In addition, this article also presents several challenges and further initiates discussions on possible solutions.
△ Less
Submitted 7 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation
Authors:
Zhibin Lan,
Liqiang Niu,
Fandong Meng,
Jie Zhou,
Min Zhang,
Jinsong Su
Abstract:
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has be…
▽ More
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose \textit{Translatotron-V(ision)}, an end-to-end IIMT model consisting of four modules. In addition to an image encoder, and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9\% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
AcuVR: Enhancing Acupuncture Training Workflow with Virtual Reality
Authors:
Menghe Zhang,
Chen Chen,
Matin Yarmand,
Anish Rajeshkumar,
Nadir Weibel
Abstract:
Acupuncture is a widely adopted medical practice that involves inserting thin needles into specific points on the body to alleviate pain and treat various health conditions. Current learning practices heavily rely on 2D atlases and practice on peers, which are notably less intuitive and pose risks, particularly in sensitive areas such as the eyes. To address these challenges, we introduce AcuVR, a…
▽ More
Acupuncture is a widely adopted medical practice that involves inserting thin needles into specific points on the body to alleviate pain and treat various health conditions. Current learning practices heavily rely on 2D atlases and practice on peers, which are notably less intuitive and pose risks, particularly in sensitive areas such as the eyes. To address these challenges, we introduce AcuVR, a Virtual Reality (VR) based system designed to add a layer of interactivity and realism. This innovation aims to reduce the risks associated with practicing acupuncture techniques while offering more effective learning strategies. Furthermore, AcuVR incorporates medical imaging and standardized anatomy models, enabling the simulation of customized acupuncture scenarios. This feature represents a significant advancement beyond the limitations of conventional resources such as atlases and textbooks, facilitating a more immersive and personalized learning experience. The evaluation study with eight acupuncture students and practitioners revealed high participant satisfaction and pointed to the effectiveness and potential of AcuVR as a valuable addition to acupuncture training.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
LaMoD: Latent Motion Diffusion Model For Myocardial Strain Generation
Authors:
Jiarui Xing,
Nivetha Jayakumar,
Nian Wu,
Yu Wang,
Frederick H. Epstein,
Miaomiao Zhang
Abstract:
Motion and deformation analysis of cardiac magnetic resonance (CMR) imaging videos is crucial for assessing myocardial strain of patients with abnormal heart functions. Recent advances in deep learning-based image registration algorithms have shown promising results in predicting motion fields from routinely acquired CMR sequences. However, their accuracy often diminishes in regions with subtle ap…
▽ More
Motion and deformation analysis of cardiac magnetic resonance (CMR) imaging videos is crucial for assessing myocardial strain of patients with abnormal heart functions. Recent advances in deep learning-based image registration algorithms have shown promising results in predicting motion fields from routinely acquired CMR sequences. However, their accuracy often diminishes in regions with subtle appearance change, with errors propagating over time. Advanced imaging techniques, such as displacement encoding with stimulated echoes (DENSE) CMR, offer highly accurate and reproducible motion data but require additional image acquisition, which poses challenges in busy clinical flows. In this paper, we introduce a novel Latent Motion Diffusion model (LaMoD) to predict highly accurate DENSE motions from standard CMR videos. More specifically, our method first employs an encoder from a pre-trained registration network that learns latent motion features (also considered as deformation-based shape features) from image sequences. Supervised by the ground-truth motion provided by DENSE, LaMoD then leverages a probabilistic latent diffusion model to reconstruct accurate motion from these extracted features. Experimental results demonstrate that our proposed method, LaMoD, significantly improves the accuracy of motion analysis in standard CMR images; hence improving myocardial strain analysis in clinical settings for cardiac patients. Our code will be publicly available on upon acceptance.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Concise and Precise Context Compression for Tool-Using Language Models
Authors:
Yang Xu,
Yunlong Feng,
Honglin Mu,
Yutai Hou,
Yitong Li,
Xinghao Wang,
Wanjun Zhong,
Zhongyang Li,
Dandan Tu,
Qingfu Zhu,
Min Zhang,
Wanxiang Che
Abstract:
Through reading the documentation in the context, tool-using language models can dynamically extend their capability using external tools. The cost is that we have to input lengthy documentation every time the model needs to use the tool, occupying the input window as well as slowing down the decoding process.
Given the progress in general-purpose compression, soft context compression is a suita…
▽ More
Through reading the documentation in the context, tool-using language models can dynamically extend their capability using external tools. The cost is that we have to input lengthy documentation every time the model needs to use the tool, occupying the input window as well as slowing down the decoding process.
Given the progress in general-purpose compression, soft context compression is a suitable approach to alleviate the problem. However, when compressing tool documentation, existing methods suffer from the weaknesses of key information loss (specifically, tool/parameter name errors) and difficulty in adjusting the length of compressed sequences based on documentation lengths.
To address these problems, we propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models. 1) Selective compression strategy mitigates key information loss by deliberately retaining key information as raw text tokens. 2) Block compression strategy involves dividing tool documentation into short chunks and then employing a fixed-length compression model to achieve variable-length compression. This strategy facilitates the flexible adjustment of the compression ratio.
Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Authors:
Runqi Qiao,
Qiuna Tan,
Guanting Dong,
Minhui Wu,
Chong Sun,
Xiaoshuai Song,
Zhuoma GongQue,
Shanglin Lei,
Zhe Wei,
Miaoxuan Zhang,
Runfeng Qiao,
Yifan Zhang,
Xiao Zong,
Yida Xu,
Muxi Diao,
Zhimin Bao,
Chen Li,
Honggang Zhang
Abstract:
Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduc…
▽ More
Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Leveraging Large Language Models for Actionable Course Evaluation Student Feedback to Lecturers
Authors:
Mike Zhang,
Euan D Lindsay,
Frederik Bode Thorbensen,
Danny Bøgsted Poulsen,
Johannes Bjerva
Abstract:
End of semester student evaluations of teaching are the dominant mechanism for providing feedback to academics on their teaching practice. For large classes, however, the volume of feedback makes these tools impractical for this purpose. This paper explores the use of open-source generative AI to synthesise factual, actionable and appropriate summaries of student feedback from these survey respons…
▽ More
End of semester student evaluations of teaching are the dominant mechanism for providing feedback to academics on their teaching practice. For large classes, however, the volume of feedback makes these tools impractical for this purpose. This paper explores the use of open-source generative AI to synthesise factual, actionable and appropriate summaries of student feedback from these survey responses. In our setup, we have 742 student responses ranging over 75 courses in a Computer Science department. For each course, we synthesise a summary of the course evaluations and actionable items for the instructor. Our results reveal a promising avenue for enhancing teaching practices in the classroom setting. Our contribution lies in demonstrating the feasibility of using generative AI to produce insightful feedback for teachers, thus providing a cost-effective means to support educators' development. Overall, our work highlights the possibility of using generative AI to produce factual, actionable, and appropriate feedback for teachers in the classroom setting.
△ Less
Submitted 2 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
EMIF: Evidence-aware Multi-source Information Fusion Network for Explainable Fake News Detection
Authors:
Qingxing Dong,
Mengyi Zhang,
Shiyuan Wu,
Xiaozhen Wu
Abstract:
Extensive research on automatic fake news detection has been conducted due to the significant detrimental effects of fake news proliferation. Most existing approaches rely on a single source of evidence, such as comments or relevant news, to derive explanatory evidence for decision-making, demonstrating exceptional performance. However, their single evidence source suffers from two critical drawba…
▽ More
Extensive research on automatic fake news detection has been conducted due to the significant detrimental effects of fake news proliferation. Most existing approaches rely on a single source of evidence, such as comments or relevant news, to derive explanatory evidence for decision-making, demonstrating exceptional performance. However, their single evidence source suffers from two critical drawbacks: (i) noise abundance, and (ii) resilience deficiency. Inspired by the natural process of fake news identification, we propose an Evidence-aware Multi-source Information Fusion (EMIF) network that jointly leverages user comments and relevant news to make precise decision and excavate reliable evidence. To accomplish this, we initially construct a co-attention network to capture general semantic conflicts between comments and original news. Meanwhile, a divergence selection module is employed to identify the top-K relevant news articles with content that deviates the most from the original news, which ensures the acquisition of multiple evidence with higher objectivity. Finally, we utilize an inconsistency loss function within the evidence fusion layer to strengthen the consistency of two types of evidence, both negating the authenticity of the same news. Extensive experiments and ablation studies on real-world dataset FibVID show the effectiveness of our proposed model. Notably, EMIF shows remarkable robustness even in scenarios where a particular source of information is inadequate.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Authors:
Jinsheng Huang,
Liang Chen,
Taian Guo,
Fu Zeng,
Yusheng Zhao,
Bohan Wu,
Ye Yuan,
Haozhe Zhao,
Zhihui Guo,
Yichi Zhang,
Jingyang Yuan,
Wei Ju,
Luchen Liu,
Tianyu Liu,
Baobao Chang,
Ming Zhang
Abstract:
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p…
▽ More
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Towards Universal Mesh Movement Networks
Authors:
Mingrui Zhang,
Chunyang Wang,
Stephan Kramer,
Joseph G. Wallwork,
Siyi Li,
Jiancheng Liu,
Xiang Chen,
Matthew D. Piggott
Abstract:
Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and…
▽ More
Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. However, existing learning-based methods require re-training from scratch given a different PDE type or boundary geometry, which limits their applicability, and also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which -- once trained -- can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting features and a Graph Attention Network (GAT) based decoder for moving the mesh. We evaluate our method on advection and Navier-Stokes based examples, as well as a real-world tsunami simulation case. Our method outperforms existing learning-based mesh movement methods in terms of the benchmarks described above. In comparison to the conventional sophisticated Monge-Ampère PDE-solver based method, our approach not only significantly accelerates mesh movement, but also proves effective in scenarios where the conventional method fails. Our project page is at https://erizmr.github.io/UM2N/.
△ Less
Submitted 1 July, 2024; v1 submitted 29 June, 2024;
originally announced July 2024.
-
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Authors:
Ruoyu Qin,
Zheming Li,
Weiran He,
Mingxing Zhang,
Yongwei Wu,
Weimin Zheng,
Xinran Xu
Abstract:
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances ma…
▽ More
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
△ Less
Submitted 9 July, 2024; v1 submitted 23 June, 2024;
originally announced July 2024.
-
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Authors:
Hao Fei,
Shengqiong Wu,
Meishan Zhang,
Min Zhang,
Tat-Seng Chua,
Shuicheng Yan
Abstract:
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning , under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-tempo…
▽ More
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning , under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug&play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves the existing 13 strong-performing VLMs persistently, and refreshes the current state-of-the-art end task performance significantly in both the fine-tuning and zero-shot settings.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
AFBench: A Large-scale Benchmark for Airfoil Design
Authors:
Jian Liu,
Jianyu Wu,
Hairun Xie,
Guoqing Zhang,
Jing Wang,
Wei Liu,
Wanli Ouyang,
Junjun Jiang,
Xianming Liu,
Shixiang Tang,
Miao Zhang
Abstract:
Data-driven generative models have emerged as promising approaches towards achieving efficient mechanical inverse design. However, due to prohibitively high cost in time and money, there is still lack of open-source and large-scale benchmarks in this field. It is mainly the case for airfoil inverse design, which requires to generate and edit diverse geometric-qualified and aerodynamic-qualified ai…
▽ More
Data-driven generative models have emerged as promising approaches towards achieving efficient mechanical inverse design. However, due to prohibitively high cost in time and money, there is still lack of open-source and large-scale benchmarks in this field. It is mainly the case for airfoil inverse design, which requires to generate and edit diverse geometric-qualified and aerodynamic-qualified airfoils following the multimodal instructions, \emph{i.e.,} dragging points and physical parameters. This paper presents the open-source endeavors in airfoil inverse design, \emph{AFBench}, including a large-scale dataset with 200 thousand airfoils and high-quality aerodynamic and geometric labels, two novel and practical airfoil inverse design tasks, \emph{i.e.,} conditional generation on multimodal physical parameters, controllable editing, and comprehensive metrics to evaluate various existing airfoil inverse design methods. Our aim is to establish \emph{AFBench} as an ecosystem for training and evaluating airfoil inverse design methods, with a specific focus on data-driven controllable inverse design models by multimodal instructions capable of bridging the gap between ideas and execution, the academic research and industrial applications. We have provided baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model is trained on an RTX 3090 GPU within 16 hours. The codebase, datasets and benchmarks will be available at \url{https://hitcslj.github.io/afbench/}.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
Authors:
Xinyu Lian,
Sam Ade Jacobs,
Lev Kurilenko,
Masahiro Tanaka,
Stas Bekman,
Olatunji Ruwase,
Minjia Zhang
Abstract:
Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are t…
▽ More
Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity.
The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
△ Less
Submitted 27 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data
Authors:
Kemiao Huang,
Meiying Zhang,
Qi Hao
Abstract:
Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes "BiTrack", a 3D OMOT framework that includes modules of 2D-3D detec…
▽ More
Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes "BiTrack", a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper includes threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilizes a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experiment results on the KITTI dataset demonstrate that BiTrack achieves the state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection
Authors:
Meiying Zhang,
Weiyuan Peng,
Guangyao Ding,
Chenyang Lei,
Chunlin Ji,
Qi Hao
Abstract:
Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been d…
▽ More
Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1) developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-label; 2) developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3) developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to migrate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
LLM-Driven Multimodal Opinion Expression Identification
Authors:
Bonian Jia,
Huiyao Chen,
Yueheng Sun,
Meishan Zhang,
Min Zhang
Abstract:
Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world s…
▽ More
Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method STOEI, which combines speech and text modal to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves the performance while our method outperforms existing methods by 9.20\% and obtains SOTA results.
△ Less
Submitted 29 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Retrieval-style In-Context Learning for Few-shot Hierarchical Text Classification
Authors:
Huiyao Chen,
Yu Zhao,
Zulong Chen,
Mengjia Wang,
Liangyue Li,
Meishan Zhang,
Min Zhang
Abstract:
Hierarchical text classification (HTC) is an important task with broad applications, while few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely-ambiguous labels. In this work, we intro…
▽ More
Hierarchical text classification (HTC) is an important task with broad applications, while few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely-ambiguous labels. In this work, we introduce the first ICL-based framework with LLM for few-shot HTC. We exploit a retrieval database to identify relevant demonstrations, and an iterative policy to manage multi-layer hierarchical labels. Particularly, we equip the retrieval database with HTC label-aware representations for the input texts, which is achieved by continual training on a pretrained language model with masked language modeling (MLM), layer-wise classification (CLS, specifically for HTC), and a novel divergent contrastive learning (DCL, mainly for adjacent semantically-similar labels) objective. Experimental results on three benchmark datasets demonstrate superior performance of our method, and we can achieve state-of-the-art results in few-shot HTC.
△ Less
Submitted 29 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Authors:
Jikai Wang,
Yi Su,
Juntao Li,
Qingrong Xia,
Zi Ye,
Xinyu Duan,
Zhefeng Wang,
Min Zhang
Abstract:
Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become increasingly larger. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless a…
▽ More
Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become increasingly larger. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we proposed OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It searches the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.
△ Less
Submitted 16 July, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Attention Instruction: Amplifying Attention in the Middle via Prompting
Authors:
Meiru Zhang,
Zaiqiao Meng,
Nigel Collier
Abstract:
The context window of large language models has been extended to 128k tokens or more. However, language models still suffer from position bias and have difficulty in accessing and using the middle part of the context due to the lack of attention. We examine the relative position awareness of LLMs and the feasibility of mitigating disproportional attention through prompting. We augment the original…
▽ More
The context window of large language models has been extended to 128k tokens or more. However, language models still suffer from position bias and have difficulty in accessing and using the middle part of the context due to the lack of attention. We examine the relative position awareness of LLMs and the feasibility of mitigating disproportional attention through prompting. We augment the original task instruction with $\texttt{attention instructions}$ that direct language models to allocate more attention towards a selected segment of the context. We conduct a comprehensive investigation on multi-document question answering task with both position-based and index-based instructions. We find that language models do not have relative position awareness of the context. Nevertheless, they demonstrate the capacity to adapt attention to a specific segment using matching indexes. Our analysis contributes to a deeper understanding of position bias in LLMs and provides a pathway to mitigate this bias by instruction, thus benefiting LLMs in locating and utilizing relevant information from retrieved documents in RAG applications.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
Authors:
Roy Xie,
Junlin Wang,
Ruomin Huang,
Minxing Zhang,
Rong Ge,
Jian Pei,
Neil Zhenqiang Gong,
Bhuwan Dhingra
Abstract:
The rapid scaling of large language models (LLMs) has raised concerns about the transparency and fair use of the pretraining data used for training them. Detecting such content is challenging due to the scale of the data and limited exposure of each instance during training. We propose ReCaLL (Relative Conditional Log-Likelihood), a novel membership inference attack (MIA) to detect LLMs' pretraini…
▽ More
The rapid scaling of large language models (LLMs) has raised concerns about the transparency and fair use of the pretraining data used for training them. Detecting such content is challenging due to the scale of the data and limited exposure of each instance during training. We propose ReCaLL (Relative Conditional Log-Likelihood), a novel membership inference attack (MIA) to detect LLMs' pretraining data by leveraging their conditional language modeling capabilities. ReCaLL examines the relative change in conditional log-likelihoods when prefixing target data points with non-member context. Our empirical findings show that conditioning member data on non-member prefixes induces a larger decrease in log-likelihood compared to non-member data. We conduct comprehensive experiments and show that ReCaLL achieves state-of-the-art performance on the WikiMIA dataset, even with random and synthetic prefixes, and can be further improved using an ensemble approach. Moreover, we conduct an in-depth analysis of LLMs' behavior with different membership contexts, providing insights into how LLMs leverage membership information for effective inference at both the sequence and token level.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models
Authors:
Jiarui Feng,
Hao Liu,
Lecheng Kong,
Yixin Chen,
Muhan Zhang
Abstract:
In this report, we present TAGLAS, an atlas of text-attributed graph (TAG) datasets and benchmarks. TAGs are graphs with node and edge features represented in text, which have recently gained wide applicability in training graph-language or graph foundation models. In TAGLAS, we collect and integrate more than 23 TAG datasets with domains ranging from citation graphs to molecule graphs and tasks f…
▽ More
In this report, we present TAGLAS, an atlas of text-attributed graph (TAG) datasets and benchmarks. TAGs are graphs with node and edge features represented in text, which have recently gained wide applicability in training graph-language or graph foundation models. In TAGLAS, we collect and integrate more than 23 TAG datasets with domains ranging from citation graphs to molecule graphs and tasks from node classification to graph question-answering. Unlike previous graph datasets and benchmarks, all datasets in TAGLAS have a unified node and edge text feature format, which allows a graph model to be simultaneously trained and evaluated on multiple datasets from various domains. Further, we provide a standardized, efficient, and simplified way to load all datasets and tasks. We also provide useful utils like text-to-embedding conversion, and graph-to-text conversion, which can facilitate different evaluation scenarios. Finally, we also provide standard and easy-to-use evaluation utils. The project is open-sourced at https://github.com/JiaruiFeng/TAGLAS and is still under construction. Please expect more datasets/features in the future.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
Authors:
Allen Roush,
Yusuf Shabazz,
Arvind Balaji,
Peter Zhang,
Stefano Mezza,
Markus Zhang,
Sanjay Basu,
Sriram Vishwanath,
Mehdi Fatemi,
Ravid Shwartz-Ziv
Abstract:
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable r…
▽ More
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist
△ Less
Submitted 5 July, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Timo: Towards Better Temporal Reasoning for Language Models
Authors:
Zhaochen Su,
Jun Zhang,
Tong Zhu,
Xiaoye Qu,
Juntao Li,
Min Zhang,
Yu Cheng
Abstract:
Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety…
▽ More
Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at https://github.com/zhaochen0110/Timo.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.