Showing 1–50 of 289 results for author: Shi, B

  1. arXiv:2407.13335  [pdf, other]

    cs.CV

    OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

    Authors: Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, Bertram Shi

    Abstract: Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. Thi…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted in ECCV 2024

  2. arXiv:2407.09352  [pdf, other]

    cs.CV eess.IV

    Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems

    Authors: Ziyuan Luo, Boxin Shi, Haoliang Li, Renjie Wan

    Abstract: Electromagnetic Inverse Scattering Problems (EISP) have gained wide applications in computational imaging. By solving EISP, the internal relative permittivity of the scatterer can be non-invasively determined based on the scattered electromagnetic fields. Despite previous efforts to address EISP, achieving better solutions to this problem has remained elusive, due to the challenges posed by invers…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: 33 pages, accepted by ECCV 2024 non-camera-ready version

  3. arXiv:2407.08231  [pdf, other]

    cs.CV

    E2VIDiff: Perceptual Events-to-Video Reconstruction using Diffusion Priors

    Authors: Jinxiu Liang, Bohan Yu, Yixin Yang, Yiming Han, Boxin Shi

    Abstract: Event cameras, mimicking the human retina, capture brightness changes with unparalleled temporal resolution and dynamic range. Integrating events into intensities poses a highly ill-posed challenge, marred by initial condition ambiguities. Traditional regression-based deep learning methods fall short in perceptual quality, offering deterministic and often unrealistic reconstructions. In this paper…

    Submitted 11 July, 2024; originally announced July 2024.

  4. arXiv:2407.03648  [pdf, other]

    eess.AS cs.SD

    High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

    Authors: Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

    Abstract: We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low frame rate 48 kHz stereo variational auto encoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective the model…

    Submitted 4 July, 2024; originally announced July 2024.

  5. arXiv:2407.01710  [pdf]

    cs.SE

    Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis

    Authors: Shenglin Zhang, Sibo Xia, Wenzhao Fan, Binpeng Shi, Xiao Xiong, Zhenyu Zhong, Minghua Ma, Yongqian Sun, Dan Pei

    Abstract: Modern microservice systems have gained widespread adoption due to their high scalability, flexibility, and extensibility. However, the characteristics of independent deployment, decentralization, and frequent dynamic interactions also introduce the risk of cascading failures, making it challenging to achieve accurate failure diagnosis and rapid system recovery. These issues severely impact operat…

    Submitted 27 June, 2024; originally announced July 2024.

  6. arXiv:2406.11815  [pdf, other]

    cs.RO cs.CV cs.LG

    LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

    Authors: Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig

    Abstract: In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less…

    Submitted 17 June, 2024; originally announced June 2024.

  7. arXiv:2406.11633  [pdf, other]

    cs.CV

    DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

    Authors: Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao

    Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extract…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page 22 pages, 11 figures

  8. arXiv:2406.10744  [pdf, other]

    cs.CV

    Technique Report of CVPR 2024 PBDL Challenges

    Authors: Ying Fu, Yu Li, Shaodi You, Boxin Shi, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li, Yuze Han, Yingkai Zhang, Jianan Wang, Qinglin Liu, Wei Yu, Xiaoqian Lv, Jianing Li, Shengping Zhang, Xiangyang Ji, Yuanpei Chen, Yuhan Zhang, Weihang Peng, Liwen Zhang, Zhe Xu, Dingyong Gou, Cong Li, Senyan Xu , et al. (75 additional authors not shown)

    Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a…

    Submitted 12 July, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: CVPR 2024 PBDL Challenges: https://pbdl-ws.github.io/pbdl2024/challenge/index.html

  9. arXiv:2406.08418  [pdf, other]

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang , et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an…

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  10. arXiv:2406.07111  [pdf, other]

    cs.CV

    NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

    Authors: Yufei Han, Heng Guo, Koki Fukai, Hiroaki Santo, Boxin Shi, Fumio Okura, Zhanyu Ma, Yunpeng Jia

    Abstract: We present NeRSP, a Neural 3D reconstruction technique for Reflective surfaces with Sparse Polarized images. Reflective surface reconstruction is extremely challenging as specular reflections are view-dependent and thus violate the multiview consistency for multiview stereo. On the other hand, sparse image inputs, as a practical capture setting, commonly cause incomplete or distorted results due t…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 10 pages

  11. arXiv:2406.06251  [pdf, other]

    eess.AS cs.CL

    Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

    Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

    Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained one…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  12. arXiv:2405.15324  [pdf, other]

    cs.RO cs.AI cs.CV

    Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving

    Authors: Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yong Liu, Yu Qiao

    Abstract: Autonomous driving has advanced significantly due to sensors, machine learning, and artificial intelligence improvements. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitiv…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 23 pages, 16 figures

  13. arXiv:2405.09556  [pdf, other]

    eess.SP cs.AI cs.IT

    Co-learning-aided Multi-modal-deep-learning Framework of Passive DOA Estimators for a Heterogeneous Hybrid Massive MIMO Receiver

    Authors: Jiatong Bai, Feng Shu, Qinghe Zheng, Bo Xu, Baihua Shi, Yiwen Chen, Weibin Zhang, Xianpeng Wang

    Abstract: Due to its excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays has been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But it confronts with two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by hybrid analog-digital (HAD) structu…

    Submitted 12 June, 2024; v1 submitted 27 April, 2024; originally announced May 2024.

  14. arXiv:2405.05714  [pdf, other]

    cs.CV cs.LG

    Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning

    Authors: Rui Zhao, Bin Shi, Jianfei Ruan, Tianze Pan, Bo Dong

    Abstract: In noisy label learning, estimating noisy class posteriors plays a fundamental role for developing consistent classifiers, as it forms the basis for estimating clean class posteriors and the transition matrix. Existing methods typically learn noisy class posteriors by training a classification model with noisy labels. However, when labels are incorrect, these models may be misled to overemphasize…

    Submitted 2 July, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  15. arXiv:2405.03520  [pdf, other]

    cs.CV

    Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

    Authors: Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, Guan Huang

    Abstract: General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attained significant attention due to its remarkable simulation capabilities, which exhibits an incipient comprehension of physical law…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey

  16. arXiv:2404.18394  [pdf, other]

    cs.CV

    Reconstructing Satellites in 3D from Amateur Telescope Images

    Authors: Zhiming Chang, Boyang Liu, Yifei Xia, Youming Guo, Boxin Shi, He Sun

    Abstract: This paper proposes a framework for the 3D reconstruction of satellites in low-Earth orbit, utilizing videos captured by small amateur telescopes. The video data obtained from these telescopes differ significantly from data for standard 3D reconstruction tasks, characterized by intense motion blur, atmospheric turbulence, pervasive background light pollution, extended focal length and constrained…

    Submitted 28 April, 2024; originally announced April 2024.

  17. arXiv:2404.16821  [pdf, other]

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai , et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual…

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  18. arXiv:2404.15254  [pdf, other]

    cs.CV

    UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

    Authors: Bin Wang, Zhuangcheng Gu, Chao Xu, Bo Zhang, Botian Shi, Conghui He

    Abstract: This paper presents the UniMER dataset to provide the first study on Mathematical Expression Recognition (MER) towards complex real-world scenarios. The UniMER dataset consists of a large-scale training set UniMER-1M offering an unprecedented scale and diversity with one million training instances and a meticulously designed test set UniMER-Test that reflects a diverse range of formula distributio…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 17 pages, 5 figures

  19. arXiv:2404.14832  [pdf, other]

    cs.IT

    GLDPC-PC Codes for MIMO Systems with Iterative Detection and Decoding

    Authors: Binghui Shi, Yongpeng Wu, Yin Xu, Xiqi Gao, Xiaohu You, Wenjun Zhang

    Abstract: In this work, we propose the integration of GLDPC codes with short polar-like component codes, termed GLDPC codes with polar component codes (GLDPC-PC). This approach leverages the good distance properties of polar-like codes and mitigates their high decoding latency in long block lengths. A recently proposed soft-input soft-output decoder for polar-like codes enables effective iterative belief pr…

    Submitted 9 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: submitted to globecom 2024

  20. arXiv:2404.06710  [pdf, other]

    cs.CV cs.AI

    SpikeNVS: Enhancing Novel View Synthesis from Blurry Images via Spike Camera

    Authors: Gaole Dai, Zhenyu Wang, Qinwen Xu, Ming Lu, Wen Chen, Boxin Shi, Shanghang Zhang, Tiejun Huang

    Abstract: One of the most critical factors in achieving sharp Novel View Synthesis (NVS) using neural field methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) is the quality of the training images. However, Conventional RGB cameras are susceptible to motion blur. In contrast, neuromorphic cameras like event and spike cameras inherently capture more comprehensive temporal information…

    Submitted 12 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  21. arXiv:2404.01612  [pdf, other]

    cs.CV

    Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo

    Authors: Zongrui Li, Zhan Lu, Haojie Yan, Boxin Shi, Gang Pan, Qian Zheng, Xudong Jiang

    Abstract: Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict environment and light assumptions in classical Uncalibrated Photometric Stereo (UPS) methods. However, due to the intrinsic ill-posedness and high-dimensional ambiguities, addressing NaUPS is still an open question. Existing works impose strong assumptions on the environment lights and objects' material, restricting the effe…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Paper accepted by CVPR2024

  22. arXiv:2403.14402  [pdf, other]

    cs.SD cs.CL eess.AS

    XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

    Authors: HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

    Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-v…

    Submitted 21 March, 2024; originally announced March 2024.

  23. arXiv:2403.13043  [pdf, other]

    cs.CV

    When Do We Not Need Larger Vision Models?

    Authors: Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell

    Abstract: Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger mo…

    Submitted 17 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/bfshi/scaling_on_scales

  24. arXiv:2403.07346  [pdf, other]

    cs.CV

    Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction

    Authors: Jianping Jiang, Xinyu Zhou, Bingxuan Wang, Xiaoming Deng, Chao Xu, Boxin Shi

    Abstract: Reliable hand mesh reconstruction (HMR) from commonly-used color and depth sensors is challenging especially under scenarios with varied illuminations and fast motions. Event camera is a highly promising alternative for its high dynamic range and dense temporal resolution properties, but it lacks key texture appearance for hand mesh reconstruction. In this paper, we propose EvRGBHand -- the first…

    Submitted 12 March, 2024; originally announced March 2024.

  25. arXiv:2403.01079  [pdf, other]

    cs.LG cs.AI

    Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation Framework

    Authors: Junxian Li, Bin Shi, Erfei Cui, Hua Wei, Qinghua Zheng

    Abstract: We study the challenging problem for inference tasks on large-scale graph datasets of Graph Neural Networks: huge time and memory consumption, and try to overcome it by reducing reliance on graph structure. Even though distilling graph knowledge to student MLP is an excellent idea, it faces two major problems of positional information loss and low generalization. To solve the problems, we propose…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: 20 pages, with Appendix

  26. arXiv:2403.00030  [pdf, other]

    cs.SI cs.AI cs.CR cs.LG

    GraphPub: Generation of Differential Privacy Graph with High Availability

    Authors: Wanghan Xu, Bin Shi, Ao Liu, Jiqiang Zhang, Bo Dong

    Abstract: In recent years, with the rapid development of graph neural networks (GNN), more and more graph datasets have been published for GNN tasks. However, when an upstream data owner publishes graph data, there are often many privacy concerns, because many real-world graph data contain sensitive information like person's friend list. Differential privacy (DP) is a common method to protect privacy, but d…

    Submitted 5 March, 2024; v1 submitted 28 February, 2024; originally announced March 2024.

  27. arXiv:2402.19469  [pdf, other]

    cs.RO cs.CV cs.LG

    Humanoid Locomotion as Next Token Prediction

    Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik

    Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This gen…

    Submitted 29 February, 2024; originally announced February 2024.

  28. arXiv:2402.12317  [pdf, other]

    cs.CL cs.AI

    ARKS: Active Retrieval in Knowledge Soup for Code Generation

    Authors: Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, Tao Yu

    Abstract: Recently the retrieval-augmented generation (RAG) paradigm has raised much attention for its potential in incorporating external knowledge into large language models (LLMs) without further training. While widely explored in natural language applications, its utilization in code generation remains under-explored. In this paper, we introduce Active Retrieval in Knowledge Soup (ARKS), an advanced str…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: Retrieval-augmented code generation

  29. arXiv:2402.12185  [pdf, other]

    cs.CV

    ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

    Authors: Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao

    Abstract: Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged continuously. However, their capacity to query information depicted in visual charts and engage in reasoning based on the queried contents remains under-explored. In this paper, to comprehensively and rigorously benchmark the ability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a multi-modal eva…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: Code and dataset are available for downloading at: https://github.com/UniModal4Reasoning/ChartVLM 22 pages, 15 figures

  30. arXiv:2402.12184  [pdf, other]

    cs.CV

    Colorizing Monochromatic Radiance Fields

    Authors: Yean Cheng, Renjie Wan, Shuchen Weng, Chengxuan Zhu, Yakun Chang, Boxin Shi

    Abstract: Though Neural Radiance Fields (NeRF) can produce colorful 3D representations of the world by using a set of 2D images, such ability becomes non-existent when only monochromatic images are provided. Since color is necessary in representing the world, reproducing color from monochromatic radiance fields becomes crucial. To achieve this goal, instead of manipulating the monochromatic radiance fields…

    Submitted 19 February, 2024; originally announced February 2024.

  31. arXiv:2402.11874  [pdf, other]

    cs.CV

    Language-guided Image Reflection Separation

    Authors: Haofeng Zhong, Yuchen Hong, Shuchen Weng, Jinxiu Liang, Boxin Shi

    Abstract: This paper studies the problem of language-guided reflection separation, which aims at addressing the ill-posed reflection separation problem by introducing language descriptions to provide layer content. We propose a unified framework to solve this problem, which leverages the cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descripti…

    Submitted 4 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

  32. arXiv:2402.09611  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Towards Privacy-Aware Sign Language Translation at Scale

    Authors: Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, Jean Maillard

    Abstract: A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development…

    Submitted 14 February, 2024; originally announced February 2024.

  33. arXiv:2402.03830  [pdf, other]

    cs.CV

    OASim: an Open and Adaptive Simulator based on Neural Rendering for Autonomous Driving

    Authors: Guohang Yan, Jiahao Pi, Jianfei Guo, Zhaotong Luo, Min Dou, Nianchen Deng, Qiusheng Huang, Daocheng Fu, Licheng Wen, Pinlong Cai, Xing Gao, Xinyu Cai, Bo Zhang, Xuemeng Yang, Yeqi Bai, Hongbin Zhou, Botian Shi

    Abstract: With deep learning and computer vision technology development, autonomous driving provides new solutions to improve traffic safety and efficiency. The importance of building high-quality datasets is self-evident, especially with the rise of end-to-end autonomous driving algorithms in recent years. Data plays a core role in the algorithm closed-loop system. However, collecting real-world data is ex…

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: 10 pages, 9 figures

  34. arXiv:2402.01246  [pdf, other]

    cs.RO eess.SY

    LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving

    Authors: Daocheng Fu, Wenjie Lei, Licheng Wen, Pinlong Cai, Song Mao, Min Dou, Botian Shi, Yu Qiao

    Abstract: The emergence of Multimodal Large Language Models ((M)LLMs) has ushered in new avenues in artificial intelligence, particularly for autonomous driving by offering enhanced understanding and reasoning capabilities. This paper introduces LimSim++, an extended version of LimSim designed for the application of (M)LLMs in autonomous driving. Acknowledging the limitations of existing simulation platform…

    Submitted 12 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted by 35th IEEE Intelligent Vehicles Symposium (IV 2024)

  35. arXiv:2402.00904  [pdf, ps, other]

    cs.LG cs.AI

    Graph Domain Adaptation: Challenges, Progress and Prospects

    Authors: Boshen Shi, Yongqing Wang, Fangda Guo, Bingbing Xu, Huawei Shen, Xueqi Cheng

    Abstract: As graph representation learning often suffers from label scarcity problems in real-world applications, researchers have proposed graph domain adaptation (GDA) as an effective knowledge-transfer paradigm across graphs. In particular, to enhance model performance on target graphs with specific tasks, GDA introduces a bunch of task-related graphs as source graphs and adapts the knowledge learnt from…

    Submitted 31 January, 2024; originally announced February 2024.

  36. arXiv:2401.16792  [pdf, other]

    cs.AR

    WideSA: A High Array Utilization Mapping Scheme for Uniform Recurrences on the Versal ACAP Architecture

    Authors: Tuo Dai, Bizhao Shi, Guojie Luo

    Abstract: The Versal Adaptive Compute Acceleration Platform (ACAP) is a new architecture that combines AI Engines (AIEs) with reconfigurable fabric. This architecture offers significant acceleration potential for uniform recurrences in various domains, such as deep learning, high-performance computation, and signal processing. However, efficiently mapping these computations onto the Versal ACAP architecture…

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: DATE24 (To appear)

  37. arXiv:2401.14391  [pdf, other]

    cs.CV

    Rethinking Patch Dependence for Masked Autoencoders

    Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

    Abstract: In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework:…

    Submitted 25 January, 2024; originally announced January 2024.

  38. arXiv:2401.06397   

    cs.CV

    UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

    Authors: Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang

    Abstract: Vision-language foundation models, represented by Contrastive language-image pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding…

    Submitted 18 January, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: The paper is undergoing internal legal review and will be resubmitted once it passes the review

  39. arXiv:2401.01572  [pdf, other]

    cs.CL cs.SD eess.AS

    Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

    Authors: Rita Frieske, Bertram E. Shi

    Abstract: Hallucinations are a type of output error produced by deep neural networks. While this has been studied in natural language processing, they have not been researched previously in automatic speech recognition. Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of halluci…

    Submitted 3 January, 2024; originally announced January 2024.

  40. arXiv:2312.16933  [pdf, other]

    cs.CV cs.AI

    EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion

    Authors: Jianping Jiang, Xinyu Zhou, Peiqi Duan, Boxin Shi

    Abstract: Event cameras and RGB cameras exhibit complementary characteristics in imaging: the former possesses high dynamic range (HDR) and high temporal resolution, while the latter provides rich texture and color information. This makes the integration of event cameras into middle- and high-level RGB-based vision tasks highly promising. However, challenges arise in multi-modal fusion, data annotation, and…

    Submitted 28 December, 2023; originally announced December 2023.

  41. arXiv:2312.15942  [pdf, other]

    cs.CV eess.IV

    Pano-NeRF: Synthesizing High Dynamic Range Novel Views with Geometry from Sparse Low Dynamic Range Panoramic Images

    Authors: Zhan Lu, Qian Zheng, Boxin Shi, Xudong Jiang

    Abstract: Panoramic imaging research on geometry recovery and High Dynamic Range (HDR) reconstruction becomes a trend with the development of Extended Reality (XR). Neural Radiance Fields (NeRF) provide a promising scene representation for both tasks without requiring extensive prior data. However, in the case of inputting sparse Low Dynamic Range (LDR) panoramic images, NeRF often degrades with under-const…

    Submitted 23 February, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

  42. arXiv:2312.15821  [pdf, other]

    cs.SD cs.LG eess.AS

    Audiobox: Unified Audio Generation with Natural Language Prompts

    Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

    Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in sever…

    Submitted 25 December, 2023; originally announced December 2023.

  43. arXiv:2312.12772  [pdf, other]

    cs.RO cs.AI

    Realistic Rainy Weather Simulation for LiDARs in CARLA Simulator

    Authors: Donglin Yang, Zhenfeng Liu, Wentao Jiang, Guohang Yan, Xing Gao, Botian Shi, Si Liu, Xinyu Cai

    Abstract: Employing data augmentation methods to enhance perception performance in adverse weather has attracted considerable attention recently. Most of the LiDAR augmentation methods post-process the existing dataset by physics-based models or machine-learning methods. However, due to the limited environmental annotations and the fixed vehicle trajectories in the existing dataset, it is challenging to edi…

    Submitted 20 December, 2023; originally announced December 2023.

  44. arXiv:2312.08220  [pdf, other]

    cs.CV

    EventAid: Benchmarking Event-aided Image/Video Enhancement Algorithms with Real-captured Hybrid Dataset

    Authors: Peiqi Duan, Boyu Li, Yixin Yang, Hanyue Lou, Minggui Teng, Yi Ma, Boxin Shi

    Abstract: Event cameras are an emerging imaging technology that offers advantages over conventional frame-based imaging sensors in dynamic range and sensing speed. Complementing the rich texture and color perception of traditional image frames, the hybrid camera system of event and frame-based cameras enables high-performance imaging. With the assistance of event cameras, high-quality image/video enhancement m…

    Submitted 13 December, 2023; originally announced December 2023.

  45. arXiv:2312.06343  [pdf, other]

    cs.LG

    RankMatch: A Novel Approach to Semi-Supervised Label Distribution Learning Leveraging Inter-label Correlations

    Authors: Kouzhiqiang Yucheng Xie, Jing Wang, Yuheng Jia, Boyu Shi, Xin Geng

    Abstract: This paper introduces RankMatch, an innovative approach for Semi-Supervised Label Distribution Learning (SSLDL). Addressing the challenge of limited labeled data, RankMatch effectively utilizes a small number of labeled examples in conjunction with a larger quantity of unlabeled data, reducing the need for extensive manual labeling in Deep Neural Network (DNN) applications. Specifically, RankMatch…

    Submitted 11 December, 2023; originally announced December 2023.

  46. arXiv:2312.05743  [pdf, other]

    cs.LG cs.CV

    Building Variable-sized Models via Learngene Pool

    Authors: Boyu Shi, Shiyu Xia, Xu Yang, Haokun Chen, Zhiqiang Kou, Xin Geng

    Abstract: Stitchable Neural Networks (SN-Net) were recently proposed to stitch pre-trained networks together, quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training the variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challe…

    Submitted 11 December, 2023; v1 submitted 9 December, 2023; originally announced December 2023.

  47. arXiv:2312.04316  [pdf, other]

    cs.RO cs.AI cs.CV

    Towards Knowledge-driven Autonomous Driving

    Authors: Xin Li, Yeqi Bai, Pinlong Cai, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Yikang Li, Botian Shi, Yong Liu, Liang He, Yu Qiao

    Abstract: This paper explores the emerging knowledge-driven autonomous driving technologies. Our investigation highlights the limitations of current autonomous driving systems, in particular their sensitivity to data bias, difficulty in handling long-tail scenarios, and lack of interpretability. Conversely, knowledge-driven methods with the abilities of cognition, generalization and life-long learning emerg…

    Submitted 27 December, 2023; v1 submitted 7 December, 2023; originally announced December 2023.

  48. arXiv:2312.03526  [pdf, other]

    cs.CV cs.AI cs.LG

    On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

    Authors: Peng Sun, Bei Shi, Daiwei Yu, Tao Lin

    Abstract: Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenges of high computational demands. Dataset distillation, as a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibi…

    Submitted 19 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: 17 pages, 20 figures

  49. arXiv:2312.02249  [pdf, other]

    cs.CV cs.CL

    Recursive Visual Programming

    Authors: Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

    Abstract: Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms o…

    Submitted 10 July, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

  50. arXiv:2311.15736  [pdf, other]

    cs.RO cs.AI

    SceneDM: Scene-level Multi-agent Trajectory Generation with Consistent Diffusion Models

    Authors: Zhiming Guo, Xing Gao, Jianlan Zhou, Xinyu Cai, Botian Shi

    Abstract: Realistic scene-level multi-agent motion simulations are crucial for developing and evaluating self-driving algorithms. However, most existing works focus on generating trajectories for a certain single agent type, and typically ignore the consistency of generated trajectories. In this paper, we propose a novel framework based on diffusion models, called SceneDM, to generate joint and consistent f…

    Submitted 27 November, 2023; originally announced November 2023.