Skip to main content

Showing 1–50 of 440 results for author: Zheng, C

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.13541  [pdf, other

    cs.CV

    On the Discriminability of Self-Supervised Representation Learning

    Authors: Zeen Song, Wenwen Qiang, Changwen Zheng, Fuchun Sun, Hui Xiong

    Abstract: Self-supervised learning (SSL) has recently achieved significant success in downstream visual tasks. However, a notable gap still exists between SSL and supervised learning (SL), especially in complex downstream tasks. In this paper, we show that the features learned by SSL methods suffer from the crowding problem, where features of different classes are not distinctly separated, and features with… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.12415  [pdf, other

    cs.LG

    Not All Frequencies Are Created Equal:Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting

    Authors: Xingyu Zhang, Siyu Zhao, Zeen Song, Huijie Guo, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

    Abstract: Long-term time series forecasting is a long-standing challenge in various applications. A central issue in time series forecasting is that methods should expressively capture long-term dependency. Furthermore, time series forecasting methods should be flexible when applied to different scenarios. Although Fourier analysis offers an alternative to effectively capture reusable and periodic patterns… ▽ More

    Submitted 18 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Accpeted by ACMMM2024

  3. arXiv:2407.12322  [pdf, other

    cs.CV

    Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer

    Authors: Wenhan Wu, Ce Zheng, Zihao Yang, Chen Chen, Srijan Das, Aidong Lu

    Abstract: Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and thereby gained ever-increasing attention in skeleton action recognition. However, the existing transformer-based approaches heavily rely on the naive attention mechanism for capturing the spatiotemporal features, which falls short in learning discriminative representations that… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted to ACM Multimedia 2024

  4. arXiv:2407.07554  [pdf, other

    cs.GR cs.SD eess.AS

    Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

    Authors: Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, Shengfeng He

    Abstract: Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  5. arXiv:2407.07078  [pdf, other

    cs.CV

    MoSt-DSA: Modeling Motion and Structural Interactions for Direct Multi-Frame Interpolation in DSA Images

    Authors: Ziyang Xu, Huangxuan Zhao, Ziwei Cui, Wenyu Liu, Chuansheng Zheng, Xinggang Wang

    Abstract: Artificial intelligence has become a crucial tool for medical image analysis. As an advanced cerebral angiography technique, Digital Subtraction Angiography (DSA) poses a challenge where the radiation dose to humans is proportional to the image count. By reducing images and using AI interpolation instead, the radiation can be cut significantly. However, DSA images present more complex motion and s… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Accepted to ECAI2024

  6. arXiv:2407.02855  [pdf, other

    cs.CR cs.CL cs.LG

    Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

    Authors: Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

    Abstract: LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearn the harmful knowledge… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 15 pages

  7. arXiv:2406.18075  [pdf, other

    cs.SE

    A Context-Driven Approach for Co-Auditing Smart Contracts with The Support of GPT-4 code interpreter

    Authors: Mohamed Salah Bouafif, Chen Zheng, Ilham Ahmed Qasse, Ed Zulkoski, Mohammad Hamdaqa, Foutse Khomh

    Abstract: The surge in the adoption of smart contracts necessitates rigorous auditing to ensure their security and reliability. Manual auditing, although comprehensive, is time-consuming and heavily reliant on the auditor's expertise. With the rise of Large Language Models (LLMs), there is growing interest in leveraging them to assist auditors in the auditing process (co-auditing). However, the effectivenes… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  8. arXiv:2406.17182  [pdf, other

    cs.IR cs.LG

    Debiased Recommendation with Noisy Feedback

    Authors: Haoxuan Li, Chunyuan Zheng, Wenjie Wang, Hao Wang, Fuli Feng, Xiao-Hua Zhou

    Abstract: Ratings of a user to most items in recommender systems are usually missing not at random (MNAR), largely because users are free to choose which items to rate. To achieve unbiased learning of the prediction model under MNAR data, three typical solutions have been proposed, including error-imputation-based (EIB), inverse-propensity-scoring (IPS), and doubly robust (DR) methods. However, these method… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: KDD 24 Research Track Paper

  9. arXiv:2406.17098  [pdf, other

    cs.LG cs.AI

    Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

    Authors: Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, Benjamin Eysenbach

    Abstract: Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not me… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

  10. arXiv:2406.16815  [pdf, other

    cs.CV

    ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians

    Authors: Yufei Liu, Junshu Tang, Chu Zheng, Shijie Zhang, Jinkun Hao, Junwei Zhu, Dongjin Huang

    Abstract: High-fidelity 3D garment synthesis from text is desirable yet challenging for digital avatar creation. Recent diffusion-based approaches via Score Distillation Sampling (SDS) have enabled new possibilities but either intricately couple with human body or struggle to reuse. We introduce ClotheDreamer, a 3D Gaussian-based method for generating wearable, production-ready 3D garment assets from text p… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Project Page: https://ggxxii.github.io/clothedreamer

  11. arXiv:2406.16495  [pdf, other

    cs.CL cs.AI

    OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

    Authors: Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

    Abstract: Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the seq… ▽ More

    Submitted 24 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  12. arXiv:2406.15050  [pdf, other

    cs.LG cs.AI cs.CL cs.CV

    Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

    Authors: Lin Fan, Xun Gong, Cenyang Zheng, Yafei Ou

    Abstract: The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answ… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    ACM Class: I.2.7; I.2.10; J.3

  13. arXiv:2406.14210  [pdf, other

    eess.IV cs.CV cs.LG

    Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging Dataset

    Authors: Chen Zheng

    Abstract: Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised meth… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  14. arXiv:2406.14024  [pdf, other

    cs.CL

    LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback

    Authors: Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, Baobao Chang

    Abstract: Mathematical verfier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale la… ▽ More

    Submitted 8 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 9 pages

  15. arXiv:2406.11490  [pdf, other

    cs.LG stat.ME

    Interventional Imbalanced Multi-Modal Representation Learning via $β$-Generalization Front-Door Criterion

    Authors: Yi Li, Jiangmeng Li, Fei Song, Qingmeng Zhu, Changwen Zheng, Wenwen Qiang

    Abstract: Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. Benchmark methods… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  16. arXiv:2406.10569  [pdf, other

    cs.LG cs.CV

    MDA: An Interpretable Multi-Modal Fusion with Missing Modalities and Intrinsic Noise

    Authors: Lin Fan, Yafei Ou, Cenyang Zheng, Pengyu Dai, Tamotsu Kamishima, Masayuki Ikebe, Kenji Suzuki, Xun Gong

    Abstract: Multi-modal fusion is crucial in medical data research, enabling a comprehensive understanding of diseases and improving diagnostic performance by combining diverse modalities. However, multi-modal fusion faces challenges, including capturing interactions between modalities, addressing missing modalities, handling erroneous modal information, and ensuring interpretability. Many existing researcher… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

    ACM Class: I.5.2; I.2.7; I.2.10; J.3

  17. arXiv:2406.09317  [pdf, other

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham , et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources… ▽ More

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  18. arXiv:2406.08709  [pdf, other

    cs.LG stat.ME

    Introducing Diminutive Causal Structure into Graph Representation Learning

    Authors: Hang Gao, Peng Qiao, Yifan Jin, Fengge Wu, Jiangmeng Li, Changwen Zheng

    Abstract: When engaging in end-to-end graph representation learning with Graph Neural Networks (GNNs), the intricate causal relationships and rules inherent in graph data pose a formidable challenge for the model in accurately capturing authentic data relationships. A proposed mitigating strategy involves the direct integration of rules or relationships corresponding to the graph data into the model. Howeve… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  19. arXiv:2406.08659  [pdf, other

    cs.CV

    Vivid-ZOO: Multi-View Video Generation with Diffusion Model

    Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

    Abstract: While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline tha… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Our project page is at https://hi-zhengcheng.github.io/vividzoo/

  20. arXiv:2406.08657  [pdf, other

    cs.CL

    Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

    Authors: Chen Zheng, Ke Sun, Xun Zhou

    Abstract: Despite the advances in Large Language Models (LLMs), exemplified by models like GPT-4 and Claude, smaller-scale LLMs such as Llama and Mistral often struggle with generating in-depth and coherent dialogues. This paper presents a novel two-step Coarse-to-Fine Actor model to address the inherent limitations in conversational and analytical capabilities of small-sized LLMs. Our approach begins with… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  21. arXiv:2406.05928  [pdf, other

    cs.GR math.NA

    Stabler Neo-Hookean Simulation: Absolute Eigenvalue Filtering for Projected Newton

    Authors: Honglin Chen, Hsueh-Ti Derek Liu, David I. W. Levin, Changxi Zheng, Alec Jacobson

    Abstract: Volume-preserving hyperelastic materials are widely used to model near-incompressible materials such as rubber and soft tissues. However, the numerical simulation of volume-preserving hyperelastic materials is notoriously challenging within this regime due to the non-convexity of the energy function. In this work, we identify the pitfalls of the popular eigenvalue clamping strategy for projecting… ▽ More

    Submitted 21 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: SIGGRAPH 2024 (Conference track). Project page: https://www.cs.columbia.edu/cg/abs-psd/

  22. arXiv:2406.04367  [pdf, other

    physics.flu-dyn cs.LG physics.comp-ph

    Physics-enhanced Neural Operator for Simulating Turbulent Transport

    Authors: Shengyu Chen, Peyman Givi, Can Zheng, Xiaowei Jia

    Abstract: The precise simulation of turbulent flows is of immense importance in a variety of scientific and engineering fields, including climate science, freshwater science, and the development of energy-efficient manufacturing processes. Within the realm of turbulent flow simulation, direct numerical simulation (DNS) is widely considered to be the most reliable approach, but it is prohibitively expensive… ▽ More

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: 13 pages

  23. arXiv:2406.04343  [pdf, other

    cs.CV

    Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

    Authors: Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi

    Abstract: In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.robots.ox.ac.uk/~vgg/research/flash3d/

  24. arXiv:2406.03757  [pdf, other

    cs.RO cs.LG

    RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models

    Authors: Jingyao Li, Pengguang Chen, Sitong Wu, Chuanyang Zheng, Hong Xu, Jiaya Jia

    Abstract: The emergence of Large Language Models (LLMs) has improved the prospects for robotic tasks. However, existing benchmarks are still limited to single tasks with limited generalization capabilities. In this work, we introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder aimed at enhancing the generalization capabilities of robots in complex environments. Unlike tradition… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  25. arXiv:2406.02887  [pdf, other

    eess.AS cs.SD

    USM RNN-T model weights binarization

    Authors: Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

    Abstract: Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. Serving cost of large models is dominated by model size that is why model size reduction is an important research topic. In this work we are focused on model size reduction using weights only quantization. We present the weights binarization of USM Recurrent Neura… ▽ More

    Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  26. arXiv:2406.00587  [pdf, other

    cs.CV

    Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction,because the real-world is actually video-based rather th… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Champion Solution for CVPR 2024 PVUW VSS Track. arXiv admin note: text overlap with arXiv:2306.02894

  27. arXiv:2406.00500  [pdf, other

    cs.CV

    2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Video Panoptic Segmentation (VPS) is a challenging task that is extends from image panoptic segmentation.VPS aims to simultaneously classify, track, segment all objects in a video, including both things and stuff. Due to its wide application in many downstream tasks such as video understanding, video editing, and autonomous driving. In order to deal with the task of video panoptic segmentation in… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 2nd Place Solution for CVPR 2024 PVUW VPS Track

  28. arXiv:2405.17436  [pdf, other

    cs.NI cs.AI cs.LG

    Intelligent Hybrid Resource Allocation in MEC-assisted RAN Slicing Network

    Authors: Chong Zheng, Yongming Huang, Cheng Zhang, Tony Q. S. Quek

    Abstract: In this paper, we aim to maximize the SSR for heterogeneous service demands in the cooperative MEC-assisted RAN slicing system by jointly considering the multi-node computing resources cooperation and allocation, the transmission resource blocks (RBs) allocation, and the time-varying dynamicity of the system. To this end, we abstract the system into a weighted undirected topology graph and, then p… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

  29. arXiv:2405.16845  [pdf, other

    cs.LG cs.CL stat.ML

    On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

    Authors: Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan Li

    Abstract: Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objecti… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 37pages

  30. arXiv:2405.15289  [pdf, other

    cs.CV

    Learning Invariant Causal Mechanism from Vision-Language Models

    Authors: Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang

    Abstract: Pre-trained large-scale models have become a major research focus, but their effectiveness is limited in real-world applications due to diverse data distributions. In contrast, humans excel at decision-making across various domains by learning reusable knowledge that remains invariant despite environmental changes in a complex world. Although CLIP, as a successful vision-language pre-trained model… ▽ More

    Submitted 24 May, 2024; originally announced May 2024.

  31. arXiv:2405.14868  [pdf, other

    cs.CV cs.AI cs.LG cs.RO

    Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

    Authors: Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick

    Abstract: Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this pape… ▽ More

    Submitted 5 July, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: Accepted to ECCV 2024. Project webpage is available at: https://gcd.cs.columbia.edu/

  32. arXiv:2405.14722  [pdf, other

    cs.CL

    CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

    Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

    Abstract: Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of input data, limiting their adaptability and… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Technical Report

  33. arXiv:2405.12200  [pdf, other

    cs.CV

    Multi-View Attentive Contextualization for Multi-View 3D Object Detection

    Authors: Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu

    Abstract: We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in the field of query-based MV3D object detection, prior art often suffers from either the lack of exploiting high-resolution 2D features in dense attention-based lifting, due to… ▽ More

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR2024

  34. arXiv:2405.10718  [pdf, other

    cs.CV cs.CL

    SignLLM: Sign Languages Production Large Language Models

    Authors: Sen Fang, Lei Wang, Ce Zheng, Yapeng Tian, Chen Chen

    Abstract: In this paper, we introduce the first comprehensive multilingual sign language dataset named Prompt2Sign, which builds from public data including American Sign Language (ASL) and seven others. Our dataset transforms a vast array of videos into a streamlined, model-friendly format, optimized for training with translation models like seq2seq and text2text. Building on this new dataset, we propose Si… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 33 pages, website at https://signllm.github.io/

  35. arXiv:2405.10705  [pdf, other

    eess.IV cs.CV

    3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

    Authors: Zhentao Liu, Huangxuan Zhao, Wenhui Qin, Zhenghong Zhou, Xinggang Wang, Wenping Wang, Xiaochun Lai, Chuansheng Zheng, Dinggang Shen, Zhiming Cui

    Abstract: Digital Subtraction Angiography (DSA) is one of the gold standards in vascular disease diagnosing. With the help of contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow information and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substanti… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 12 pages, 13 figures, 5 tables

  36. arXiv:2405.04032  [pdf, other

    cs.CR cs.AI

    Locally Differentially Private In-Context Learning

    Authors: Chunyan Zheng, Keke Sun, Wenhao Zhao, Haibo Zhou, Lixin Jiang, Shaoyang Song, Chunlai Zhou

    Abstract: Large pretrained language models (LLMs) have shown surprising In-Context Learning (ICL) ability. An important application in deploying large language models is to augment LLMs with a private database for some specific task. The main problem with this promising commercial use is that LLMs have been shown to memorize their training data and their prompt data are vulnerable to membership inference at… ▽ More

    Submitted 8 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

    Comments: This paper was published at LREC-Coling 2024

  37. arXiv:2405.02354  [pdf

    cs.LG cs.AI q-bio.QM

    Heterogeneous network and graph attention auto-encoder for LncRNA-disease association prediction

    Authors: Jin-Xing Liu, Wen-Yu Xi, Ling-Yun Dai, Chun-Hou Zheng, Ying-Lian Gao

    Abstract: The emerging research shows that lncRNAs are associated with a series of complex human diseases. However, most of the existing methods have limitations in identifying nonlinear lncRNA-disease associations (LDAs), and it remains a huge challenge to predict new LDAs. Therefore, the accurate identification of LDAs is very important for the warning and treatment of diseases. In this work, multiple sou… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 10 pages, 8 figures

    ACM Class: I.2.4; I.2.6; I.2.m

  38. arXiv:2405.02287  [pdf, other

    cs.CL cs.AI cs.CV

    Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

    Authors: Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay

    Abstract: We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing a… ▽ More

    Submitted 3 May, 2024; originally announced May 2024.

  39. arXiv:2405.01053  [pdf, other

    cs.LG cs.AI

    Explicitly Modeling Universality into Self-Supervised Learning

    Authors: Jingyao Wang, Wenwen Qiang, Zeen Song, Lingyu Si, Jiangmeng Li, Changwen Zheng, Bing Su

    Abstract: The goal of universality in self-supervised learning (SSL) is to learn universal representations from unlabeled data and achieve excellent performance on all samples and tasks. However, these methods lack explicit modeling of the universality in the learning objective, and the related theoretical understanding remains limited. This may cause models to overfit in data-scarce situations and generali… ▽ More

    Submitted 23 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: 28 pages, submitted to ICML24 with 7766

  40. arXiv:2404.19620  [pdf, other

    cs.LG cs.IR stat.ML

    Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference

    Authors: Haoxuan Li, Chunyuan Zheng, Sihao Ding, Peng Wu, Zhi Geng, Fuli Feng, Xiangnan He

    Abstract: Selection bias in recommender system arises from the recommendation process of system filtering and the interactive process of user selection. Many previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model, but ignore the fact that potential outcomes for a given user-item pair may vary with the treatments assigned to other user-item pairs, name… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: ICLR 24

  41. arXiv:2404.19596  [pdf, other

    cs.IR cs.LG

    Debiased Collaborative Filtering with Kernel-Based Causal Balancing

    Authors: Haoxuan Li, Chunyuan Zheng, Yanghao Xiao, Peng Wu, Zhi Geng, Xu Chen, Peng Cui

    Abstract: Debiased collaborative filtering aims to learn an unbiased prediction model by removing different biases in observational datasets. To solve this problem, one of the simple and effective methods is based on the propensity score, which adjusts the observational sample distribution to the target one by reweighting observed instances. Ideally, propensity scores should be learned with causal balancing… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: ICLR 24 Spotlight

  42. arXiv:2404.16792  [pdf, other

    cs.LG cs.AI cs.CL

    Weak-to-Strong Extrapolation Expedites Alignment

    Authors: Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng

    Abstract: The open-source community is experiencing a surge in the release of large language models (LLMs) that are trained to follow instructions and align with human preference. However, further training to improve them still requires expensive computational resources and data annotations. Is it possible to bypass additional training and cost-effectively acquire better-aligned models? Inspired by the lite… ▽ More

    Submitted 22 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Add theoretical explanation and more evaluation results

  43. arXiv:2404.16484  [pdf, other

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  44. arXiv:2404.13026  [pdf, other

    cs.CV cs.AI

    PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

    Authors: Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y. Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, William T. Freeman

    Abstract: Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these… ▽ More

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Project website at: https://physdreamer.github.io/

  45. arXiv:2404.12387  [pdf, other

    cs.CL cs.CV

    Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

    Authors: Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu , et al. (1 additional authors not shown)

    Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but al… ▽ More

    Submitted 18 April, 2024; originally announced April 2024.

  46. arXiv:2404.12024  [pdf, other

    cs.CV

    Meta-Auxiliary Learning for Micro-Expression Recognition

    Authors: Jingyao Wang, Yunhan Tian, Yuxuan Yang, Xiaoxin Chen, Changwen Zheng, Wenwen Qiang

    Abstract: Micro-expressions (MEs) are involuntary movements revealing people's hidden feelings, which has attracted numerous interests for its objectivity in emotion detection. However, despite its wide applications in various scenarios, micro-expression recognition (MER) remains a challenging problem in real life due to three reasons, including (i) data-level: lack of data and imbalanced classes, (ii) feat… ▽ More

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: 10 pages, 7 figures, 3 tables

  47. arXiv:2404.10337  [pdf, other

    cs.AI

    Intriguing Properties of Positional Encoding in Time Series Forecasting

    Authors: Jianqi Zhang, Jingyao Wang, Wenwen Qiang, Fanjiang Xu, Changwen Zheng, Fuchun Sun, Hui Xiong

    Abstract: Transformer-based methods have made significant progress in time series forecasting (TSF). They primarily handle two types of tokens, i.e., temporal tokens that contain all variables of the same timestamp, and variable tokens that contain all input time points for a specific variable. Transformer-based methods rely on positional encoding (PE) to mark tokens' positions, facilitating the model to pe… ▽ More

    Submitted 16 April, 2024; originally announced April 2024.

  48. arXiv:2404.05242  [pdf, other

    cs.RO

    Collision-Free Trajectory Optimization in Cluttered Environments with Sums-of-Squares Programming

    Authors: Yulin Li, Chunxin Zheng, Kai Chen, Yusen Xie, Xindong Tang, Michael Yu Wang, Jun Ma

    Abstract: In this work, we propose a trajectory optimization approach for robot navigation in cluttered 3D environments. We represent the robot's geometry as a semialgebraic set defined by polynomial inequalities such that robots with general shapes can be suitably characterized. To address the robot navigation task in obstacle-dense environments, we exploit the free space directly to construct a sequence o… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

  49. arXiv:2404.04922  [pdf, other

    cs.CV cs.AI

    Efficient Learnable Collaborative Attention for Single Image Super-Resolution

    Authors: Yigang Zhao Chaowei Zheng, Jiannan Su, GuangyongChen, MinGan

    Abstract: Non-Local Attention (NLA) is a powerful technique for capturing long-range feature correlations in deep single image super-resolution (SR). However, NLA suffers from high computational complexity and memory consumption, as it requires aggregating all non-local feature information for each query response and recalculating the similarity weight distribution for different abstraction levels of featur… ▽ More

    Submitted 7 April, 2024; originally announced April 2024.

  50. arXiv:2404.02145  [pdf, other

    cs.CV

    Iterated Learning Improves Compositionality in Large Vision-Language Models

    Authors: Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most-if not all-our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of " a girl in white facing a man… ▽ More

    Submitted 16 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: CVPR 2024