Showing 1–50 of 156 results for author: Yao, C

  1. arXiv:2407.12358  [pdf, other]

    cs.CV cs.CL

    ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data

    Authors: Yufan Shen, Chuwei Luo, Zhaoqing Zhu, Yang Chen, Qi Zheng, Zhi Yu, Jiajun Bu, Cong Yao

    Abstract: Recently, large language models (LLMs) and multimodal large language models (MLLMs) have demonstrated promising results on the document visual question answering (VQA) task, particularly after training on document instruction datasets. An effective evaluation method for document instruction data is crucial in constructing instruction data with high efficacy, which, in turn, facilitates the training of…

    Submitted 17 July, 2024; originally announced July 2024.

  2. arXiv:2407.11950  [pdf, other]

    cs.CV

    Temporally Consistent Stereo Matching

    Authors: Jiaxi Zeng, Chengtang Yao, Yuwei Wu, Yunde Jia

    Abstract: Stereo matching provides depth estimation from binocular images for downstream applications. These applications mostly take video streams as input and require temporally consistent depth maps. However, existing methods mainly focus on the estimation at the single-frame level. This commonly leads to temporally inconsistent results, especially in ill-posed regions. In this paper, we aim to leverage…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  3. arXiv:2406.17255  [pdf, other]

    cs.CL

    MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning

    Authors: Zhenlong Dai, Chang Yao, WenKang Han, Ying Yuan, Zhipeng Gao, Jingyuan Chen

    Abstract: Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code; how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we proposed MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn co…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024, Main Conference

  4. arXiv:2406.06062  [pdf, other]

    cs.CV cs.AI

    ProcessPainter: Learn Painting Process from Sequence Data

    Authors: Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, Mike Zheng Shou

    Abstract: The painting process of artists is inherently stepwise and varies significantly among different painters and styles. Generating detailed, step-by-step painting processes is essential for art education and research, yet remains largely underexplored. Traditional stroke-based rendering methods break down images into sequences of brushstrokes, yet they fall short of replicating the authentic processe…

    Submitted 10 June, 2024; originally announced June 2024.

  5. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  6. arXiv:2405.18458  [pdf]

    cs.LG physics.optics

    Asymmetrical estimator for training grey-box deep photonic neural networks

    Authors: Yizhi Wang, Minjia Chen, Chunhui Yao, Jie Ma, Ting Yan, Richard Penty, Qixiang Cheng

    Abstract: Physical neural networks (PNNs) are emerging paradigms for neural network acceleration due to their high-bandwidth, in-propagation analogue processing. Despite the advantages of PNN for inference, training remains a challenge. The imperfect information of the physical transformation means the failure of conventional gradient-based updates from backpropagation (BP). Here, we present the asymmetrica…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: 17 pages, 5 figures

    MSC Class: 78-05

  7. arXiv:2404.13600  [pdf, other]

    cs.RO

    Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments

    Authors: Zirui Wang, Chen Yao, Yangtao Ge, Guowei Shi, Ningbo Yang, Zheng Zhu, Kewei Dong, Hexiang Wei, Zhenzhong Jia, Jing Wu

    Abstract: So far, planetary surface exploration depends on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and mapping capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots,…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

  8. arXiv:2404.05225  [pdf, other]

    cs.CV cs.CL

    LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

    Authors: Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, Cong Yao

    Abstract: Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However, previous works that employ LLMs/MLLMs for document understanding have not fully explored and utilized the document layout information, which is vital for precise document understanding. In this paper, we propose LayoutLLM, an LLM/MLLM bas…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  9. arXiv:2403.19128  [pdf, other]

    cs.CV

    OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

    Authors: Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang

    Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  10. arXiv:2403.16875  [pdf, other]

    cs.RO

    TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments

    Authors: Chen Yao, Yangtao Ge, Guowei Shi, Zirui Wang, Ningbo Yang, Zheng Zhu, Hexiang Wei, Yuntian Zhao, Jing Wu, Zhenzhong Jia

    Abstract: Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wilds, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders the solutions of Simultaneous Localization And Mapping (SLAM), especially when confronting non-geometric hazards in demanding landscapes…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Robotics and Automation Letters

  11. arXiv:2403.16662  [pdf, other]

    cs.CL

    RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict

    Authors: Yirong Zeng, Xiao Ding, Yi Zhao, Xiangyu Li, Jie Zhang, Chao Yao, Ting Liu, Bing Qin

    Abstract: Fact-checking is the task of verifying the factuality of a given claim by examining the available evidence. High-quality evidence plays a vital role in enhancing fact-checking systems and facilitating the generation of explanations that are understandable to humans. However, the provision of both sufficient and relevant evidence for explainable fact-checking systems poses a challenge. To tackle th…

    Submitted 26 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: 12 pages, 3 figures, accepted by lrec-coling2024

  12. arXiv:2403.14023  [pdf]

    cs.CR

    A system capable of verifiably and privately screening global DNA synthesis

    Authors: Carsten Baum, Jens Berlips, Walther Chen, Hongrui Cui, Ivan Damgard, Jiangbin Dong, Kevin M. Esvelt, Mingyu Gao, Dana Gretton, Leonard Foner, Martin Kysel, Kaiyi Zhang, Juanru Li, Xiang Li, Omer Paneth, Ronald L. Rivest, Francesca Sage-Ling, Adi Shamir, Yue Shen, Meicen Sun, Vinod Vaikuntanathan, Lynn Van Hauwe, Theia Vogel, Benjamin Weinstein-Raun, Yun Wang, et al. (5 additional authors not shown)

    Abstract: Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't n…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Main text 10 pages, 4 figures. 5 supplementary figures. Total 21 pages. Direct correspondence to: Ivan B. Damgard, Andrew C. Yao, Kevin M. Esvelt

  13. arXiv:2403.13761  [pdf, other]

    cs.CV

    HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition

    Authors: Yuyi Zhang, Yuanzhi Zhu, Dezhi Peng, Peirong Zhang, Zhenhua Yang, Zhibo Yang, Cong Yao, Lianwen Jin

    Abstract: Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose…

    Submitted 20 March, 2024; originally announced March 2024.

  14. arXiv:2403.12008  [pdf, other]

    cs.CV

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Authors: Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani

    Abstract: We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affec…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://sv3d.github.io/

  15. arXiv:2403.11221  [pdf, other]

    cs.DC cs.DB

    Lion: Minimizing Distributed Transactions through Adaptive Replica Provision (Extended Version)

    Authors: Qiushi Zheng, Zhanhao Zhao, Wei Lu, Chang Yao, Yuxing Chen, Anqun Pan, Xiaoyong Du

    Abstract: Distributed transaction processing often involves multiple rounds of cross-node communications, and therefore tends to be slow. To improve performance, existing approaches convert distributed transactions into single-node transactions by either migrating co-accessed partitions onto the same nodes or establishing a super node housing replicas of the entire database. However, migration-based methods…

    Submitted 17 March, 2024; originally announced March 2024.

  16. arXiv:2403.10357  [pdf, other]

    cs.CV cs.GR

    ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image

    Authors: Marco Pesavento, Yuanlu Xu, Nikolaos Sarafianos, Robert Maier, Ziyan Wang, Chun-Han Yao, Marco Volino, Edmond Boyer, Adrian Hilton, Tony Tung

    Abstract: Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries alon…

    Submitted 18 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR24; Project page: https://marcopesavento.github.io/ANIM/

  17. arXiv:2402.17232  [pdf, other]

    math.NA cs.LG physics.comp-ph

    Two-scale Neural Networks for Partial Differential Equations with Small Parameters

    Authors: Qiao Zhuang, Chris Ziyi Yao, Zhongqiang Zhang, George Em Karniadakis

    Abstract: We propose a two-scale neural network method for solving partial differential equations (PDEs) with small parameters using physics-informed neural networks (PINNs). We directly incorporate the small parameters into the architecture of neural networks. The proposed method enables solving PDEs with small parameters in a simple fashion, without adding Fourier features or other computationally taxing…

    Submitted 27 February, 2024; originally announced February 2024.

    MSC Class: 65N35; 35B25 ACM Class: I.2.6

  18. arXiv:2402.09152  [pdf, other]

    cs.LG

    Improved Regret for Bandit Convex Optimization with Delayed Feedback

    Authors: Yuanyu Wan, Chang Yao, Mingli Song, Lijun Zhang

    Abstract: We investigate bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under an arbitrary delay. Let $n,T,\bar{d}$ denote the dimensionality, time horizon, and average delay, respectively. Previous studies have achieved an $O(\sqrt{n}T^{3/4}+(n\bar{d})^{1/3}T^{2/3})$ regret bound for this problem, whose delay-independent part matches the regret o…

    Submitted 23 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  19. arXiv:2402.07625  [pdf, other]

    cs.CL cs.AI cs.LG

    Autonomous Data Selection with Language Models for Mathematical Texts

    Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or trained classifiers with human-annotated data, our approach Autonomous Data Selection (AutoDS) utilizes meta-prompted language models as zero-shot verifiers…

    Submitted 2 April, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  20. arXiv:2401.09003  [pdf, other]

    cs.CL cs.AI cs.LG

    Augmenting Math Word Problems via Iterative Question Composing

    Authors: Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

    Abstract: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base langu…

    Submitted 10 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

  21. arXiv:2401.05638  [pdf, other]

    cs.CV

    MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model

    Authors: Changtai Li, Xu Han, Chao Yao, Xiaojuan Ban

    Abstract: Efficient and accurate extraction of microstructures in micrographs of materials is essential in process optimization and the exploration of structure-property relationships. Deep learning-based image segmentation techniques that rely on manual annotation are laborious and time-consuming and hardly meet the demand for model transferability and generalization on various source images. Segment Anyth…

    Submitted 2 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: 18 pages, 8 figures, and 5 tables. Updated with revision and code repository

  22. arXiv:2401.05412  [pdf, other]

    cs.CV cs.AI eess.SP

    Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

    Authors: Xueyuan Yang, Chao Yao, Xiaojuan Ban

    Abstract: Leveraging wearable devices for motion reconstruction has emerged as an economical and viable technique. Certain methodologies employ sparse Inertial Measurement Units (IMUs) on the human body and harness data-driven strategies to model human poses. However, the reconstruction of motion based solely on sparse IMUs data is inherently fraught with ambiguity, a consequence of numerous identical IMU r…

    Submitted 26 December, 2023; originally announced January 2024.

    Comments: Accepted by AAAI 2024

  23. arXiv:2401.01522  [pdf, other]

    cs.CV

    LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training

    Authors: Rujiao Long, Hangdi Xing, Zhibo Yang, Qi Zheng, Zhi Yu, Cong Yao, Fei Huang

    Abstract: Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or learning to directly generate the corresponding markup sequences from the table images. However, existing approaches either count on additional heuristic rules to recover the table structures, or…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2303.03730

  24. arXiv:2312.12142  [pdf, other]

    cs.CV cs.AI

    FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning

    Authors: Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, Lianwen Jin

    Abstract: Automatic font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images. Although existing font generation methods have achieved satisfactory performance, they still struggle with complex characters and large style variations. To address these issues, we propose FontDiffuser, a diffusion-based ima…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to AAAI 2024; Github Page: https://github.com/yeungchenwa/FontDiffuser

    Journal ref: 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 2024

  25. arXiv:2312.09613  [pdf, other]

    cs.LG cs.AI stat.ML

    Rethinking Causal Relationships Learning in Graph Neural Networks

    Authors: Hang Gao, Chengyu Yao, Jiangmeng Li, Lingyu Si, Yifan Jin, Fengge Wu, Changwen Zheng, Huaping Liu

    Abstract: Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conductin…

    Submitted 15 December, 2023; originally announced December 2023.

  26. arXiv:2312.07823  [pdf, other]

    cs.CV

    Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution

    Authors: Qi Tang, Yao Zhao, Meiqin Liu, Jian Jin, Chao Yao

    Abstract: As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is…

    Submitted 19 January, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted to AAAI 2024

  27. arXiv:2311.11482  [pdf, other]

    cs.AI cs.CL

    Meta Prompting for AI Systems

    Authors: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: In this work, we present a comprehensive study of Meta Prompting (MP), an innovative technique reshaping the utilization of language models (LMs) and AI systems in problem-solving and data interaction. Grounded in type theory and category theory, Meta Prompting emphasizes the structure and syntax of information over traditional content-centric methods. The paper explores the formal definitions of…

    Submitted 15 June, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

  28. arXiv:2310.16070  [pdf, other]

    cs.LG

    Spatial-Temporal Hypergraph Neural Network for Traffic Forecasting

    Authors: Chengzhi Yao, Zhi Li, Junbo Wang

    Abstract: Traffic forecasting, which benefits from mobile Internet development and position technologies, plays a critical role in Intelligent Transportation Systems. It helps to implement rich and varied transportation applications and bring convenient transportation services to people based on collected traffic data. Most existing methods usually leverage graph-based deep learning networks to model the co…

    Submitted 24 October, 2023; originally announced October 2023.

  29. arXiv:2310.12430  [pdf, other]

    cs.CV cs.CL

    DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond

    Authors: Cong Yao

    Abstract: In this report, we introduce DocXChain, a powerful open-source toolchain for document parsing, which is designed and developed to automatically convert the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations that are readable and manipulable by machines. Specifically, basic capabilities, including text detection, text recognition, t…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: 4 pages, 4 figures, 2 tables

  30. arXiv:2310.10362  [pdf, other]

    cs.LG cs.AI

    Self-Pro: A Self-Prompt and Tuning Framework for Graph Neural Networks

    Authors: Chenghua Gong, Xiang Li, Jianxiang Yu, Cheng Yao, Jiaqi Tan, Chengcheng Yu

    Abstract: Graphs have become an important modeling tool for web applications, and Graph Neural Networks (GNNs) have achieved great success in graph representation learning. However, the performance of traditional GNNs heavily relies on a large amount of supervision. Recently, ``pre-train, fine-tune'' has become the paradigm to address the issues of label dependency and poor generalization. However, the pre-…

    Submitted 4 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Accepted at ECML-PKDD 2024

  31. arXiv:2310.08064  [pdf]

    cs.CV

    Age Estimation Based on Graph Convolutional Networks and Multi-head Attention Mechanisms

    Authors: Miaomiao Yang, Changwei Yao, Shijin Yan

    Abstract: Age estimation technology is a part of facial recognition and has been applied to identity authentication. This technology achieves the development and application of a juvenile anti-addiction system by authenticating users in the game. Convolutional Neural Network (CNN) and Transformer algorithms are widely used in this application scenario. However, these two models cannot flexibly extract and m…

    Submitted 12 October, 2023; originally announced October 2023.

  32. arXiv:2310.04975  [pdf, ps, other]

    cs.CR cs.DC

    A Trustworthy and Consistent Blockchain Oracle Scheme for Industrial Internet of Things

    Authors: Peng Liu, Youquan Xian, Chuanjian Yao, Peng Wang, Li-e Wang, Xianxian Li

    Abstract: Blockchain provides decentralization and trustlessness features for the Industrial Internet of Things (IIoT), which expands the application scenarios of IIoT. To address the problem that the blockchain cannot actively obtain off-chain data, the blockchain oracle is proposed as a bridge between the blockchain and external data. However, the existing oracle schemes are difficult to solve the problem…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: Rejected after the third round of review of IEEE Internet of Things Journal

  33. IBVC: Interpolation-driven B-frame Video Compression

    Authors: Chenming Xu, Meiqin Liu, Chao Yao, Weisi Lin, Yao Zhao

    Abstract: Learned B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction. However, previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation or video frame interpolation. They suffer from inaccurate quantized motions and inefficient motion compens…

    Submitted 14 March, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: Submitted to Pattern Recognition

  34. arXiv:2309.13596  [pdf, other]

    cs.CV

    Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model Development

    Authors: Runkai Zhao, Yuwen Heng, Heng Wang, Yuanda Gao, Shilei Liu, Changhao Yao, Jiawen Chen, Weidong Cai

    Abstract: Advanced Driver-Assistance Systems (ADAS) have successfully integrated learning-based techniques into vehicle perception and decision-making. However, their application in 3D lane detection for effective driving environment perception is hindered by the lack of comprehensive LiDAR datasets. The sparse nature of LiDAR point cloud data prevents an efficient manual annotation process. To solve this p…

    Submitted 15 March, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

    Comments: Accepted by ICRA2024

  35. arXiv:2308.14978  [pdf, other]

    cs.CV

    Vision Grid Transformer for Document Layout Analysis

    Authors: Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao

    Abstract: Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modality but largely neglect the effect of pre-…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV2023

  36. arXiv:2308.12774  [pdf, other]

    cs.CV

    LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition

    Authors: Changxu Cheng, Peng Wang, Cheng Da, Qi Zheng, Cong Yao

    Abstract: The diversity in length constitutes a significant characteristic of text. Due to the long-tail distribution of text lengths, most existing methods for scene text recognition (STR) only work well on short or seen-length text, lacking the capability of recognizing longer text or performing length extrapolation. This is a crucial issue, since the lengths of the text to be recognized are usually not g…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  37. arXiv:2308.04371  [pdf, other]

    cs.AI

    Cumulative Reasoning with Large Language Models

    Authors: Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: Despite the recent advancements in language models (LMs), their ability to solve complex problems remains limited. This paper introduces Cumulative Reasoning (CR), a novel approach that utilizes LMs cumulatively and iteratively, mirroring human thought processes for problem-solving. CR decomposes tasks into smaller, manageable components and leverages previous propositions for effective compositio…

    Submitted 1 April, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

  38. arXiv:2307.13244  [pdf, other]

    cs.CV

    Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition

    Authors: Cheng Da, Peng Wang, Cong Yao

    Abstract: Due to the enormous technical challenges and wide range of applications, scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this tough problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: submitted to TPAMI; an extension to our previous ECCV 2022 paper arXiv:2209.03592

  39. arXiv:2307.04420  [pdf, ps, other]

    cs.DC cs.AI

    FedDCT: A Dynamic Cross-Tier Federated Learning Scheme in Wireless Communication Networks

    Authors: Peng Liu, Youquan Xian, Chuanjian Yao, Xiaoyun Gan, Lianghaojie Zhou, Jianyong Jiang, Dongcheng Li

    Abstract: With the rapid proliferation of Internet of Things (IoT) devices and the growing concern for data privacy among the public, Federated Learning (FL) has gained significant attention as a privacy-preserving machine learning paradigm. FL enables the training of a global model among clients without exposing local data. However, when a federated learning system runs on wireless communication networks,…

    Submitted 10 July, 2023; originally announced July 2023.

  40. arXiv:2307.02828  [pdf, other]

    cs.CV cs.CR cs.LG

    Sampling-based Fast Gradient Rescaling Method for Highly Transferable Adversarial Attacks

    Authors: Xu Han, Anmin Liu, Chenxuan Yao, Yanbo Fan, Kun He

    Abstract: Deep neural networks are known to be vulnerable to adversarial examples crafted by adding human-imperceptible perturbations to the benign input. After achieving nearly 100% attack success rates in white-box setting, more focus is shifted to black-box attacks, of which the transferability of adversarial examples has gained significant attention. In either case, the common gradient-based methods gen…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: 10 pages, 6 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2204.02887

  41. arXiv:2306.10804  [pdf, other]

    cs.CV

    Conditional Text Image Generation with Diffusion Models

    Authors: Yuanzhi Zhu, Zhaohai Li, Tianwei Wang, Mengchao He, Cong Yao

    Abstract: Current text recognition systems, including those for handwritten scripts and scene text, have relied heavily on image synthesis and augmentation, since it is difficult to realize real-world complexity and diversity through collecting and annotating enough real text images. In this paper, we explore the problem of text image generation, by taking advantage of the powerful abilities of Diffusion Mo…

    Submitted 19 June, 2023; originally announced June 2023.

  42. arXiv:2306.04619  [pdf, other]

    cs.CV

    ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections

    Authors: Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani

    Abstract: Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging due to the ambiguities of camera viewpoint, pose, texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. Specifically, ARTIC3D is built upon a skeleton-based surface representation and is further guide…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Project page: https://chhankyao.github.io/artic3d/

  43. arXiv:2305.18548  [pdf]

    cs.ET physics.optics

    I/O-efficient iterative matrix inversion with photonic integrated circuits

    Authors: Minjia Chen, Yizhi Wang, Chunhui Yao, Adrian Wonfor, Shuai Yang, Richard Penty, Qixiang Cheng

    Abstract: Photonic integrated circuits have been extensively explored for optical processing with the aim of breaking the speed bottleneck of digital electronics. However, the input/output (IO) bottleneck remains one of the key barriers. Here we report a novel photonic iterative processor (PIP) for matrix-inversion-intensive applications. The direct reuse of inputted data in the optical domain unlocks the p…

    Submitted 22 May, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

  44. arXiv:2305.18442  [pdf, other]

    cs.LG math.OC

    Improved Projection-free Online Continuous Submodular Maximization

    Authors: Yucheng Liao, Yuanyu Wan, Chang Yao, Mingli Song

    Abstract: We investigate the problem of online learning with monotone and continuous DR-submodular reward functions, which has received great attention recently. To efficiently handle this problem, especially in the case with complicated decision sets, previous studies have proposed an efficient projection-free algorithm called Mono-Frank-Wolfe (Mono-FW) using $O(T)$ gradient evaluations and linear optimiza…

    Submitted 28 May, 2023; originally announced May 2023.

  45. arXiv:2305.15940  [pdf, other]

    cs.CV

    Mask Attack Detection Using Vascular-weighted Motion-robust rPPG Signals

    Authors: Chenglin Yao, Jianfeng Ren, Ruibin Bai, Heshan Du, Jiang Liu, Xudong Jiang

    Abstract: Detecting 3D mask attacks to a face recognition system is challenging. Although genuine faces and 3D face masks show significantly different remote photoplethysmography (rPPG) signals, rPPG-based face anti-spoofing methods often suffer from performance degradation due to unstable face alignment in the video sequence and weak rPPG signals. To enhance the rPPG signal in a motion-robust way, a landma…

    Submitted 25 May, 2023; originally announced May 2023.

  46. arXiv:2305.12131  [pdf, other]

    cs.LG

    Non-stationary Online Convex Optimization with Arbitrary Delays

    Authors: Yuanyu Wan, Chang Yao, Mingli Song, Lijun Zhang

    Abstract: Online convex optimization (OCO) with arbitrary delays, in which gradients or other information of functions could be arbitrarily delayed, has received increasing attention recently. Different from previous studies that focus on stationary environments, this paper investigates the delayed OCO in non-stationary environments, and aims to minimize the dynamic regret with respect to any sequence of co…

    Submitted 23 June, 2024; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: Camera-ready Version for ICML2024

  47. arXiv:2305.08325  [pdf, other]

    cs.CV eess.IV

    Screentone-Aware Manga Super-Resolution Using DeepLearning

    Authors: Chih-Yuan Yao, Husan-Ting Chou, Yu-Sheng Lin, Kuo-wei Chen

    Abstract: Manga, as a widely beloved form of entertainment around the world, have shifted from paper to electronic screens with the proliferation of handheld devices. However, as the demand for image quality increases with screen development, high-quality images can hinder transmission and affect the viewing experience. Traditional vectorization methods require a significant amount of manual parameter adjus…

    Submitted 14 May, 2023; originally announced May 2023.

  48. arXiv:2305.06737  [pdf, other]

    cs.IT

    A Diagonal Splitting Algorithm for Adaptive Group Testing

    Authors: Chaorui Yao, Pavlos Nikolopoulos, Christina Fragouli

    Abstract: Group testing makes it possible to identify infected individuals in a population using a smaller number of tests than individual testing. To achieve this, group testing algorithms commonly assume knowledge of the number of infected individuals; nonadaptive and several adaptive algorithms fall in this category. Some adaptive algorithms, like binary splitting, operate without this assumption, but require a nu…

    Submitted 14 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

  49. arXiv:2304.10759  [pdf, other]

    cs.CV cs.CL

    GeoLayoutLM: Geometric Pre-training for Visual Information Extraction

    Authors: Chuwei Luo, Changxu Cheng, Qi Zheng, Cong Yao

    Abstract: Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most of the existing models learn the geometric representation in an implicit way, which has…

    Submitted 21 April, 2023; originally announced April 2023.

    Comments: CVPR 2023 Highlight

  50. arXiv:2303.13095  [pdf, other]

    cs.CV

    Modeling Entities as Semantic Points for Visual Information Extraction in the Wild

    Authors: Zhibo Yang, Rujiao Long, Pengfei Wang, Sibo Song, Humen Zhong, Wenqing Cheng, Xiang Bai, Cong Yao

    Abstract: Recently, Visual Information Extraction (VIE) has become increasingly important in both academia and industry, due to the wide range of real-world applications. Previously, numerous works have been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in these bench…

    Submitted 28 March, 2023; v1 submitted 23 March, 2023; originally announced March 2023.