Skip to main content

Showing 1–50 of 1,291 results for author: Zhang, B

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.13545  [pdf, other

    eess.IV cs.CV

    DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

    Authors: Xuhui Liu, Zhi Qiao, Runkun Liu, Hong Li, Juan Zhang, Xiantong Zhen, Zhen Qian, Baochang Zhang

    Abstract: Computed tomography (CT) is widely utilized in clinical settings because it delivers detailed 3D images of the human body. However, performing CT scans is not always feasible due to radiation exposure and limitations in certain surgical environments. As an alternative, reconstructing CT images from ultra-sparse X-rays offers a valuable solution and has gained significant interest in scientific res… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.13161  [pdf, other

    physics.soc-ph cs.CY physics.ed-ph

    How to quantify an examination? Evidence from physics examinations via complex networks

    Authors: Min Xia, Zhu Su, Weibing Deng, Xiumei Feng, Benwei Zhang

    Abstract: Given the untapped potential for continuous improvement of examinations, quantitative investigations of examinations could guide efforts to considerably improve learning efficiency and evaluation and thus greatly help both learners and educators. However, there is a general lack of quantitative methods for investigating examinations. To address this gap, we propose a new metric via complex network… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  3. arXiv:2407.11901  [pdf, other

    stat.ML cs.LG stat.CO stat.ME

    Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows

    Authors: Hyemin Gu, Markos A. Katsoulakis, Luc Rey-Bellet, Benjamin J. Zhang

    Abstract: We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by add… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  4. arXiv:2407.11855  [pdf, other

    cs.CL cs.CV cs.LG

    Scaling Sign Language Translation

    Authors: Biao Zhang, Garrett Tanzer, Orhan Firat

    Abstract: Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  5. arXiv:2407.11522  [pdf, other

    cs.CV

    FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

    Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li

    Abstract: Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data c… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  6. arXiv:2407.11502  [pdf, other

    cs.CV

    How Control Information Influences Multilingual Text Image Generation and Editing?

    Authors: Boqiang Zhang, Zuan Gao, Yadong Qu, Hongtao Xie

    Abstract: Visual text generation has significantly advanced through diffusion models aimed at producing images with readable and realistic text. Recent works primarily use a ControlNet-based framework, employing standard font text images to control diffusion models. Recognizing the critical role of control information in generating high-quality text, we investigate its influence from three perspectives: inp… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  7. arXiv:2407.11144  [pdf, other

    cs.CL

    YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

    Authors: Garrett Tanzer, Biao Zhang

    Abstract: Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions dra… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Access YouTube-SL-25 at https://github.com/google-research/google-research/tree/master/youtube_sl_25

  8. arXiv:2407.09048  [pdf, other

    cs.AI

    KUNPENG: An Embodied Large Model for Intelligent Maritime

    Authors: Naiyao Wang, Tongbang Jiang, Ye Wang, Shaoyang Qiu, Bo Zhang, Xinqiang Xie, Munan Li, Chunliu Wang, Yiyang Wang, Hongxiang Ren, Ruili Wang, Hongjun Shan, Hongbo Liu

    Abstract: Intelligent maritime, as an essential component of smart ocean construction, deeply integrates advanced artificial intelligence technology and data analysis methods, which covers multiple aspects such as smart vessels, route optimization, safe navigation, aiming to enhance the efficiency of ocean resource utilization and the intelligence of transportation networks. However, the complex and dynamic… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: 9 pages, 3 figures

  9. arXiv:2407.06938  [pdf, other

    cs.CV

    RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

    Authors: Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, Baining Guo

    Abstract: We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. Existing methods fail to capture intricate details such as hairstyles which we tackle in this paper. We first identify an overlooked problem of catastrophic forgetting that arises when fitting triplanes sequentially on many avatars, caused by the MLP decoder sharing scheme. To overcome this issue, we raise a nov… ▽ More

    Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

    Comments: ECCV 2024; project page: https://rodinhd.github.io/

  10. arXiv:2407.05621  [pdf, other

    cs.AR

    EA4RCA:Efficient AIE accelerator design framework for Regular Communication-Avoiding Algorithm

    Authors: W. B. Zhang, Y. Q. Liu, T. H. Zang, Z. S. Bao

    Abstract: With the introduction of the Adaptive Intelligence Engine (AIE), the Versal Adaptive Compute Acceleration Platform (Versal ACAP) has garnered great attention. However, the current focus of Vitis Libraries and limited research has mainly been on how to invoke AIE modules, without delving into a thorough discussion on effectively utilizing AIE in its typical use cases. As a result, the widespread ad… ▽ More

    Submitted 8 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  11. arXiv:2407.05562  [pdf, other

    cs.CV

    Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

    Authors: Bangbang Zhou, Yadong Qu, Zixiao Wang, Zicheng Li, Boqiang Zhang, Hongtao Xie

    Abstract: Recently, scene text recognition (STR) models have shown significant performance improvements. However, existing models still encounter difficulties in recognizing challenging texts that involve factors such as severely distorted and perspective characters. These challenging texts mainly cause two problems: (1) Large Intra-Class Variance. (2) Small Inter-Class Variance. An extremely distorted char… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Accepted to IJCAI2024

  12. arXiv:2407.04845  [pdf, other

    cs.NI

    Poster: Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks

    Authors: Ruikun Wang, Jiawei Zhang, Qiaolun Zhang, Bojun Zhang, Zhiqun Gu, Aryanaz Attarpour, Yuefeng Ji, Massimo Tornatore

    Abstract: Many emerging Artificial Intelligence (AI) applications require on-demand provisioning of large-scale computing, which can only be enabled by leveraging distributed computing services interconnected through networking. To address such increasing demand for networking to serve AI tasks, we investigate new scheduling strategies to improve communication efficiency and test them on a programmable test… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  13. arXiv:2407.04675  [pdf, other

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li , et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor… ▽ More

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  14. arXiv:2407.04272  [pdf, other

    cs.LG cs.DC

    Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

    Authors: Hao Feng, Boyuan Zhang, Fanjiang Ye, Min Si, Ching-Hsiang Chu, Jiannan Tian, Chunxing Yin, Summer Deng, Yuchen Hao, Pavan Balaji, Tong Geng, Dingwen Tao

    Abstract: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we… ▽ More

    Submitted 11 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: accepted by SC '24

  15. arXiv:2407.04051  [pdf, other

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  16. arXiv:2407.02773  [pdf, other

    cs.MM

    OpenVNA: A Framework for Analyzing the Behavior of Multimodal Language Understanding System under Noisy Scenarios

    Authors: Ziqi Yuan, Baozheng Zhang, Hua Xu, Zhiyun Liang, Kai Gao

    Abstract: We present OpenVNA, an open-source framework designed for analyzing the behavior of multimodal language understanding systems under noisy conditions. OpenVNA serves as an intuitive toolkit tailored for researchers, facilitating convenience batch-level robustness evaluation and on-the-fly instance-level demonstration. It primarily features a benchmark Python library for assessing global model robus… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 10 pages, 4 figures, to be published in ACL 2024 System Demonstration Track

  17. SUPER: Seated Upper Body Pose Estimation using mmWave Radars

    Authors: Bo Zhang, Zimeng Zhou, Boyu Jiang, Rong Zheng

    Abstract: In industrial countries, adults spend a considerable amount of time sedentary each day at work, driving and during activities of daily living. Characterizing the seated upper body human poses using mmWave radars is an important, yet under-studied topic with many applications in human-machine interaction, transportation and road safety. In this work, we devise SUPER, a framework for seated upper bo… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  18. arXiv:2407.01886  [pdf, other

    cs.LG cs.AI

    Core Knowledge Learning Framework for Graph Adaptation and Scalability Learning

    Authors: Bowen Zhang, Zhichao Huang, Genan Dai, Guangning Xu, Xiaomao Fan, Hu Huang

    Abstract: Graph classification is a pivotal challenge in machine learning, especially within the realm of graph-based data, given its importance in numerous real-world applications such as social network analysis, recommendation systems, and bioinformatics. Despite its significance, graph classification faces several hurdles, including adapting to diverse prediction tasks, training across multiple target do… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  19. arXiv:2407.01521  [pdf, other

    cs.LG cs.AI cs.CV

    Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

    Authors: Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song

    Abstract: Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inver… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  20. arXiv:2406.19853  [pdf, other

    cs.CL cs.AI

    YuLan: An Open-source Large Language Model

    Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou , et al. (13 additional authors not shown)

    Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billi… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  21. arXiv:2406.18967  [pdf, other

    cs.CV

    Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis

    Authors: Vu Minh Hieu Phan, Yutong Xie, Bowen Zhang, Yuankai Qi, Zhibin Liao, Antonios Perperidis, Son Lam Phung, Johan W. Verjans, Minh-Son To

    Abstract: Unpaired medical image synthesis aims to provide complementary information for an accurate clinical diagnostics, and address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in imaging translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: MICCAI2024 - Early Accept Top 11%

  22. arXiv:2406.18547  [pdf

    eess.IV cs.CV

    Enhancing Medical Imaging with GANs Synthesizing Realistic Images from Limited Data

    Authors: Yinqiu Feng, Bo Zhang, Lingxi Xiao, Yutian Yang, Tana Gegen, Zexi Chen

    Abstract: In this research, we introduce an innovative method for synthesizing medical images using generative adversarial networks (GANs). Our proposed GANs method demonstrates the capability to produce realistic synthetic images even when trained on a limited quantity of real medical image data, showcasing commendable generalization prowess. To achieve this, we devised a generator and discriminator networ… ▽ More

    Submitted 22 May, 2024; originally announced June 2024.

  23. arXiv:2406.18327  [pdf, other

    eess.IV cs.CV cs.LG

    Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation

    Authors: Yuxuan Qi, Li Lin, Jiajun Wang, Jingya Zhang, Bin Zhang

    Abstract: Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by network… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  24. arXiv:2406.16601  [pdf, other

    cs.CV

    Do As I Do: Pose Guided Human Motion Copy

    Authors: Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

    Abstract: Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  25. arXiv:2406.15970  [pdf, ps, other

    cs.GT cs.AI cs.CC

    Imperfect-Recall Games: Equilibrium Concepts and Their Complexity

    Authors: Emanuel Tewolde, Brian Hu Zhang, Caspar Oesterheld, Manolis Zampetakis, Tuomas Sandholm, Paul W. Goldberg, Vincent Conitzer

    Abstract: We investigate optimal decision making under imperfect recall, that is, when an agent forgets information it once held before. An example is the absentminded driver game, as well as team games in which the members have limited communication capabilities. In the framework of extensive-form games with imperfect recall, we analyze the computational complexities of finding equilibria in multiplayer se… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: Long version of the paper that got accepted to the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024). 35 pages, 10 figures, 1 table

    MSC Class: 91A05; 91A06; 91A10; 91A11; 91A18; 91A35; 91A68; 68T37; 68Q17; 68Q25 ACM Class: I.2; J.4; F.2

  26. arXiv:2406.15513  [pdf, other

    cs.AI cs.CL

    PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

    Authors: Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang

    Abstract: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer p… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: a sibling project to SafeRLHF and BeaverTails

  27. arXiv:2406.14991  [pdf, other

    cs.CL cs.SE

    SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

    Authors: Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang

    Abstract: We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Homepage: https://spreadsheetbench.github.io/

  28. arXiv:2406.13116  [pdf, ps, other

    cs.GT

    A Lower Bound on Swap Regret in Extensive-Form Games

    Authors: Constantinos Daskalakis, Gabriele Farina, Noah Golowich, Tuomas Sandholm, Brian Hu Zhang

    Abstract: Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $ε$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/ε)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/ε)$-round a… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

  29. arXiv:2406.11633  [pdf, other

    cs.CV

    DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

    Authors: Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao

    Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extract… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page 22 pages, 11 figures

  30. arXiv:2406.11189  [pdf, other

    cs.CV

    Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

    Authors: Bingfeng Zhang, Siyue Yu, Yunchao Wei, Yao Zhao, Jimin Xiao

    Abstract: Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model, while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper, we propose WeCLIP, a CLIP-based single-stage pipel… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: CVPR 2024 Highlight

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3796-3806) 2024

  31. arXiv:2406.10553  [pdf, other

    cs.CV

    A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

    Authors: Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan

    Abstract: Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new conten… ▽ More

    Submitted 18 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

  32. arXiv:2406.10508  [pdf, other

    cs.CV

    Learning to Adapt Foundation Model DINOv2 for Capsule Endoscopy Diagnosis

    Authors: Bowen Zhang, Ying Chen, Long Bai, Yan Zhao, Yuxiang Sun, Yixuan Yuan, Jianhua Zhang, Hongliang Ren

    Abstract: Foundation models have become prominent in computer vision, achieving notable success in various tasks. However, their effectiveness largely depends on pre-training with extensive datasets. Applying foundation models directly to small datasets of capsule endoscopy images from scratch is challenging. Pre-training on broad, general vision datasets is crucial for successfully fine-tuning our model fo… ▽ More

    Submitted 30 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: To appear in ICBIR 2024

  33. arXiv:2406.09790  [pdf, other

    cs.CL

    Pcc-tuning: Breaking the Contrastive Learning Ceiling in Semantic Textual Similarity

    Authors: Bowen Zhang, Chunping Li

    Abstract: Semantic Textual Similarity (STS) constitutes a critical research direction in computational linguistics and serves as a key indicator of the encoding capabilities of embedding models. Driven by advances in pre-trained language models and contrastive learning techniques, leading sentence representation methods can already achieved average Spearman's correlation scores of approximately 86 across se… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Work in Progress

  34. arXiv:2406.09700  [pdf, other

    cs.RO physics.bio-ph

    Jointed Tails Enhance Control of Three-dimensional Body Rotation

    Authors: Xun Fu, Bohao Zhang, Ceri J. Weber, Kimberly L. Cooper, Ram Vasudevan, Talia Y. Moore

    Abstract: Tails used as inertial appendages induce body rotations of animals and robots, a phenomenon that is governed largely by the ratio of the body and tail moments of inertia. However, vertebrate tails have more degrees of freedom (e.g., number of joints, rotational axes) than most current theoretical models and robotic tails. To understand how morphology affects inertial appendage function, we develop… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  35. arXiv:2406.08864  [pdf

    cs.LG cs.AI

    Research on Early Warning Model of Cardiovascular Disease Based on Computer Deep Learning

    Authors: Yuxiang Hu, Jinxin Hu, Ting Xu, Bo Zhang, Jiajie Yuan, Haozhang Deng

    Abstract: This project intends to study a cardiovascular disease risk early warning model based on one-dimensional convolutional neural networks. First, the missing values of 13 physiological and symptom indicators such as patient age, blood glucose, cholesterol, and chest pain were filled and Z-score was standardized. The convolutional neural network is converted into a 2D matrix, the convolution function… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 6 pages

  36. arXiv:2406.08659  [pdf, other

    cs.CV

    Vivid-ZOO: Multi-View Video Generation with Diffusion Model

    Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

    Abstract: While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline tha… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Our project page is at https://hi-zhengcheng.github.io/vividzoo/

  37. arXiv:2406.08418  [pdf, other

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang , et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an… ▽ More

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  38. arXiv:2406.08001  [pdf, other

    cs.CV cs.LG

    Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

    Authors: Jiaxin Deng, Junbiao Pang, Baochang Zhang

    Abstract: Sharpness-Aware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to base optimizer (e.g., SGD). We propose Asymptotic Unbiased Sampling with respect to iterations to accelerate SAM (AUSAM), which maintains the model's generalization capacity while significantly enhancing computational… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  39. arXiv:2406.07558  [pdf, other

    cs.CY cs.AI cs.CV

    A Large Medical Model based on Visual Physiological Monitoring for Public Health

    Authors: Bin Huang, Changchen Zhao, Zimeng Liu, Shenda Hong, Baochang Zhang, Wenjin Wang, Hui Liu

    Abstract: The widespread outbreak of the COVID-19 pandemic has sounded a warning about the globalization challenges in public health. In this context, the establishment of large-scale public health datasets, of medical models, and of decision-making systems with a human-centric approach holds strategic significance. Recently, groundbreaking advancements have emerged in AI methods for physiological signal mo… ▽ More

    Submitted 21 April, 2024; originally announced June 2024.

    Comments: 17 pages, 7 figures

  40. arXiv:2406.07256  [pdf, ps, other

    cs.SD cs.AI eess.AS

    AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection

    Authors: Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li

    Abstract: The rapid advancements in speech technologies over the past two decades have led to human-level performance in tasks like automatic speech recognition (ASR) for fluent speech. However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the large… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  41. arXiv:2406.05326  [pdf, other

    cs.CL

    Advancing Semantic Textual Similarity Modeling: A Regression Framework with Translated ReLU and Smooth K2 Loss

    Authors: Bowen Zhang, Chunping Li

    Abstract: Since the introduction of BERT and RoBERTa, research on Semantic Textual Similarity (STS) has made groundbreaking progress. Particularly, the adoption of contrastive learning has substantially elevated state-of-the-art performance across various STS benchmarks. However, contrastive learning categorizes text pairs as either semantically similar or dissimilar, failing to leverage fine-grained annota… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Work in Progress

  42. arXiv:2406.04336  [pdf, other

    cs.LG cs.DM cs.DS math.CO math.SP

    On the Expressive Power of Spectral Invariant Graph Neural Networks

    Authors: Bohang Zhang, Lingxiao Zhao, Haggai Maron

    Abstract: Incorporating spectral information to enhance Graph Neural Networks (GNNs) has shown promising results but raises a fundamental challenge due to the inherent ambiguity of eigenvectors. Various architectures have been proposed to address this ambiguity, referred to as spectral invariant architectures. Notable examples include GNNs and Graph Transformers that use spectral distances, spectral project… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 31 pages; 3 figures; to appear in ICML 2024

  43. arXiv:2406.04264  [pdf, other

    cs.CV cs.AI cs.CL

    MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

    Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu

    Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To addres… ▽ More

    Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  44. arXiv:2406.03234  [pdf, other

    cs.LG cs.AI

    Fine-Grained Causal Dynamics Learning with Quantization for Improving Robustness in Reinforcement Learning

    Authors: Inwoo Hwang, Yunhyeok Kwak, Suhyung Choi, Byoung-Tak Zhang, Sanghack Lee

    Abstract: Causal dynamics learning has recently emerged as a promising approach to enhancing robustness in reinforcement learning (RL). Typically, the goal is to build a dynamics model that makes predictions based on the causal relationships among the entities. Despite the fact that causal connections often manifest only under certain contexts, existing approaches overlook such fine-grained relationships an… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: ICML 2024

  45. arXiv:2406.01467  [pdf, other

    cs.GR cs.CV

    RaDe-GS: Rasterizing Depth in Gaussian Splatting

    Authors: Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, Ping Tan

    Abstract: Gaussian Splatting (GS) has proven to be highly effective in novel view synthesis, achieving high-quality and real-time rendering. However, its potential for reconstructing detailed 3D shapes has not been fully explored. Existing methods often suffer from limited shape accuracy due to the discrete and unstructured nature of Gaussian splats, which complicates the shape extraction. While recent tech… ▽ More

    Submitted 24 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  46. arXiv:2406.00614  [pdf, other

    cs.LG cs.AI

    Efficient Monte Carlo Tree Search via On-the-Fly State-Conditioned Action Abstraction

    Authors: Yunhyeok Kwak, Inwoo Hwang, Dooyoung Kim, Sanghack Lee, Byoung-Tak Zhang

    Abstract: Monte Carlo Tree Search (MCTS) has showcased its efficacy across a broad spectrum of decision-making problems. However, its performance often degrades under vast combinatorial action space, especially where an action is composed of multiple sub-actions. In this work, we propose an action abstraction based on the compositional structure between a state and sub-actions for improving the efficiency o… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: UAI 2024 (Oral). The first two authors contributed equally

  47. arXiv:2406.00383  [pdf, other

    cs.CV

    SpikeMM: Flexi-Magnification of High-Speed Micro-Motions

    Authors: Baoyue Zhang, Yajing Zheng, Shiyan Chen, Jiyuan Zhang, Kang Chen, Zhaofei Yu, Tiejun Huang

    Abstract: The amplification of high-speed micro-motions holds significant promise, with applications spanning fault detection in fast-paced industrial environments to refining precision in medical procedures. However, conventional motion magnification algorithms often encounter challenges in high-speed scenarios due to low sampling rates or motion blur. In recent years, spike cameras have emerged as a super… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

  48. arXiv:2405.20772  [pdf

    cs.LG cs.CY

    Reinforcement Learning for Sociohydrology

    Authors: Tirthankar Roy, Shivendra Srivastava, Beichen Zhang

    Abstract: In this study, we discuss how reinforcement learning (RL) provides an effective and efficient framework for solving sociohydrology problems. The efficacy of RL for these types of problems is evident because of its ability to update policies in an iterative manner - something that is also foundational to sociohydrology, where we are interested in representing the co-evolution of human-water interac… ▽ More

    Submitted 31 May, 2024; originally announced May 2024.

  49. arXiv:2405.19119  [pdf, other

    cs.LG

    Can Graph Learning Improve Task Planning?

    Authors: Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, Dongsheng Li

    Abstract: Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, t… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

  50. arXiv:2405.18882  [pdf, other

    cs.CV

    DecomCAM: Advancing Beyond Saliency Maps through Decomposition and Integration

    Authors: Yuguang Yang, Runtang Guo, Sheng Wu, Yimi Wang, Linlin Yang, Bo Fan, Jilong Zhong, Juan Zhang, Baochang Zhang

    Abstract: Interpreting complex deep networks, notably pre-trained vision-language models (VLMs), is a formidable challenge. Current Class Activation Map (CAM) methods highlight regions revealing the model's decision-making basis but lack clear saliency maps and detailed interpretability. To bridge this gap, we propose DecomCAM, a novel decomposition-and-integration method that distills shared patterns from… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted by Neurocomputing journal