Showing 1–50 of 202 results for author: Tai, Y

Searching in archive cs. Search in all archives.
  1. arXiv:2408.09126  [pdf, other]

    cs.CV

    Barbie: Text to Barbie-Style 3D Avatars

    Authors: Xiaokun Sun, Zhenyu Zhang, Ying Tai, Qian Wang, Hao Tang, Zili Yi, Jian Yang

    Abstract: Recent advances in text-guided 3D avatar generation have made substantial progress by distilling knowledge from diffusion models. Despite the plausible generated appearance, existing methods cannot achieve fine-grained disentanglement or high-fidelity modeling between inner body and outfit. In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse…

    Submitted 27 August, 2024; v1 submitted 17 August, 2024; originally announced August 2024.

    Comments: 9 pages, 7 figures

  2. arXiv:2408.03934  [pdf, other]

    cs.CL

    From Words to Worth: Newborn Article Impact Prediction with LLM

    Authors: Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai, Jian Yang, Ming-Ming Cheng, Xiang Li

    Abstract: As the academic landscape expands, the challenge of efficiently identifying potentially high-impact articles among the vast number of newly published works becomes critical. This paper introduces a promising approach, leveraging the capabilities of fine-tuned LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Moving beyond traditional methods heavily relian…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: 7 pages for main sections, plus 3 additional pages for appendices. Code, dataset are released at https://sway.cloud.microsoft/KOH09sPR21Ubojbc

  3. arXiv:2407.16014  [pdf, other]

    cs.SI

    Political Elites in the Attention Economy: Visibility Over Civility and Credibility?

    Authors: Ahana Biswas, Yu-Ru Lin, Yuehong Cassandra Tai, Bruce A. Desmarais

    Abstract: Elected officials have privileged roles in public communication. In contrast to national politicians, whose posting content is more likely to be closely scrutinized by a robust ecosystem of nationally focused media outlets, sub-national politicians are more likely to openly disseminate harmful content with limited media scrutiny. In this paper, we analyze the factors that explain the online visibi…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted for publication in the International AAAI Conference on Web and Social Media (ICWSM) 2025

  4. arXiv:2407.05420  [pdf, ps, other]

    cs.IR

    Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation

    Authors: Xinglong Wu, Anfeng Huang, Hongwei Yang, Hui He, Yu Tai, Weizhe Zhang

    Abstract: Multi-modal recommendation greatly enhances the performance of recommender systems by modeling the auxiliary information from multi-modality contents. Most existing multi-modal recommendation models primarily exploit multimedia information propagation processes to enrich item representations and directly utilize modal-specific embedding vectors independently obtained from upstream pre-trained mode…

    Submitted 7 July, 2024; originally announced July 2024.

  5. arXiv:2407.02371  [pdf, other]

    cs.CV

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai

    Abstract: Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either with low quality or too large for most research institutions. Therefore, it is ch…

    Submitted 2 August, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: 15 pages, 9 figures

  6. arXiv:2406.18284  [pdf, other]

    cs.CV

    RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

    Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang

    Abstract: Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-qual…

    Submitted 8 August, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  7. arXiv:2406.05543  [pdf, other]

    cs.CV cs.AI

    VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification

    Authors: Jianmeng Liu, Yichen Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang

    Abstract: Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instruction. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by the recent advancements of LLM, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D comple…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: 27 pages, 16 figures

  8. arXiv:2406.03723  [pdf, other]

    cs.CV cs.GR cs.MM

    Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

    Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee

    Abstract: Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have enabled their near photo-realistic, free-viewpoint rendering. Although these methods have shown some potential in creating immersive experiences, two drawbacks limit their ubiquity: (i) a significant reduction in reconstruction quality when the computing budget is limited, and (ii) a lack of semantic understanding of the und…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Paper accepted to IEEE/CVF CVPR 2024 (Spotlight). Work done when XL was an intern at MERL. Project Page Link: https://merl.com/research/highlights/gear-nerf

    ACM Class: I.2.10

  9. arXiv:2406.01224  [pdf, other]

    cs.CL

    Demonstration Augmentation for Zero-shot In-context Learning

    Authors: Yi Su, Yunpeng Tai, Yixin Ji, Juntao Li, Bowen Yan, Min Zhang

    Abstract: Large Language Models (LLMs) have demonstrated an impressive capability known as In-context Learning (ICL), which enables them to acquire knowledge from textual demonstrations without the need for parameter updates. However, many studies have highlighted that the model's performance is sensitive to the choice of demonstrations, presenting a significant challenge for practical applications where we…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings

  10. arXiv:2405.17013  [pdf, other]

    cs.CV

    MotionLLM: Multimodal Motion-Language Learning with Large Language Models

    Authors: Qi Wu, Yubo Zhao, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang

    Abstract: Recent advancements in Multimodal Large Language Models (MM-LLMs) have demonstrated promising potential in terms of generalization and robustness when applied to different modalities. While previous works have already achieved 3D human motion generation using various approaches including language modeling, they mostly use specialized architecture and are restricted…

    Submitted 27 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: Project page: https://knoxzhao.github.io/MotionLLM

  11. arXiv:2405.16136  [pdf, other]

    cs.AI cs.CL cs.LG cs.SD eess.AS

    C3LLM: Conditional Multimodal Content Generation Using Large Language Models

    Authors: Zixuan Wang, Qinkai Duan, Yu-Wing Tai, Chi-Keung Tang

    Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models), a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities, synthesizing the given conditional information, and making multimodal generation in a discrete manner. Our contributions…

    Submitted 25 May, 2024; originally announced May 2024.

  12. arXiv:2405.16105  [pdf, other]

    cs.CV cs.AI

    MambaLLIE: Implicit Retinex-Aware Low Light Enhancement with Global-then-Local State Space

    Authors: Jiangwei Weng, Zhiqiang Yan, Ying Tai, Jianjun Qian, Jian Yang, Jun Li

    Abstract: Recent advances in low light image enhancement have been dominated by Retinex-based learning framework, leveraging convolutional neural networks (CNNs) and Transformers. However, the vanilla Retinex theory primarily addresses global illumination degradation and neglects local issues such as noise and blur in dark conditions. Moreover, CNNs and Transformers struggle to capture global degradation du…

    Submitted 25 May, 2024; originally announced May 2024.

  13. arXiv:2405.02824  [pdf, other]

    cs.CV

    Adaptive Guidance Learning for Camouflaged Object Detection

    Authors: Zhennan Chen, Xuying Zhang, Tian-Zhu Xiang, Ying Tai

    Abstract: Camouflaged object detection (COD) aims to segment objects visually embedded in their surroundings, which is a very challenging task due to the high similarity between the objects and the background. To address it, most methods often incorporate additional information (e.g., boundary, texture, and frequency clues) to guide feature learning for better detecting camouflaged objects from the backgrou…

    Submitted 6 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

  14. arXiv:2405.01872  [pdf, other]

    cs.CV

    Defect Image Sample Generation With Diffusion Prior for Steel Surface Defect Recognition

    Authors: Yichun Tai, Kun Yang, Tao Peng, Zhenzhen Huang, Zhijiang Zhang

    Abstract: The task of steel surface defect recognition is an industrial problem with great industry values. The data insufficiency is the major challenge in training a robust defect recognition network. Existing methods have investigated to enlarge the dataset by generating samples with generative models. However, their generation quality is still limited by the insufficiency of defect image samples. To thi…

    Submitted 3 May, 2024; originally announced May 2024.

  15. arXiv:2404.18598  [pdf, other]

    cs.CV cs.GR

    Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting

    Authors: Tianyidan Xie, Rui Ma, Qian Wang, Xiaoqian Ye, Feixuan Liu, Ying Tai, Zhenyu Zhang, Zili Yi

    Abstract: Recent advancements in image inpainting, particularly through diffusion modeling, have yielded promising outcomes. However, when tested in scenarios involving the completion of images based on the foreground objects, current methods that aim to inpaint an image in an end-to-end manner encounter challenges such as "over-imagination", inconsistency between foreground and background, and limited dive…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 16 pages, 9 figures, project page: https://anywheremultiagent.github.io

  16. arXiv:2404.01717  [pdf, other]

    cs.CV eess.IV

    AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation

    Authors: Rui Xie, Ying Tai, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Xiaoqian Ye, Qian Wang, Jian Yang

    Abstract: Blind super-resolution methods based on stable diffusion showcase formidable generative capabilities in reconstructing clear high-resolution images with intricate details from low-resolution inputs. However, their practical applicability is often hampered by poor efficiency, stemming from the requirement of thousands or hundreds of sampling steps. Inspired by the efficient adversarial diffusion di…

    Submitted 23 May, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  17. arXiv:2403.17664  [pdf, other]

    cs.CV

    DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

    Authors: Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

    Abstract: Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient infer…

    Submitted 26 March, 2024; originally announced March 2024.

  18. arXiv:2403.13663  [pdf, other]

    cs.CV

    T-Pixel2Mesh: Combining Global and Local Transformer for 3D Mesh Generation from a Single Image

    Authors: Shijie Zhang, Boyan Jiang, Keke He, Junwei Zhu, Ying Tai, Chengjie Wang, Yinda Zhang, Yanwei Fu

    Abstract: Pixel2Mesh (P2M) is a classical approach for reconstructing 3D shapes from a single color image through coarse-to-fine mesh deformation. Although P2M is capable of generating plausible global shapes, its Graph Convolution Network (GCN) often produces overly smooth results, causing the loss of fine-grained geometry details. Moreover, P2M generates non-credible features for occluded regions and stru…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Received by ICASSP 2024

  19. arXiv:2403.01901  [pdf, other]

    cs.CV

    FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

    Authors: Chao Xu, Yang Liu, Jiazheng Xing, Weida Wang, Mingze Sun, Jun Dan, Tianxin Huang, Siyuan Li, Zhi-Qi Cheng, Ying Tai, Baigui Sun

    Abstract: In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity diverse talking faces generation from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangl…

    Submitted 31 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  20. arXiv:2402.13876  [pdf, other]

    cs.CV

    Scene Prior Filtering for Depth Map Super-Resolution

    Authors: Zhengxue Wang, Zhiqiang Yan, Ming-Hsuan Yang, Jinshan Pan, Jian Yang, Ying Tai, Guangwei Gao

    Abstract: Multi-modal fusion is vital to the success of super-resolution of depth maps. However, commonly used fusion strategies, such as addition and concatenation, fall short of effectively bridging the modal gap. As a result, guided image filtering methods have been introduced to mitigate this issue. Nevertheless, it is observed that their filter kernels usually encounter significant texture interference…

    Submitted 23 February, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: 14 pages

  21. arXiv:2402.12225  [pdf, other]

    cs.CV

    Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

    Authors: Xuelin Qian, Yu Wang, Simian Luo, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue, Bo Zhao, Tiejun Huang, Yunsheng Wu, Yanwei Fu

    Abstract: Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate…

    Submitted 26 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Project page: https://argus-3d.github.io/ . Datasets: https://huggingface.co/datasets/BAAI/Objaverse-MIX. arXiv admin note: substantial text overlap with arXiv:2303.14700

  22. arXiv:2401.14895  [pdf, other]

    cs.CV

    MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

    Authors: Yu-Shan Tai, An-Yeu Wu

    Abstract: While vision transformers (ViTs) have shown great potential in computer vision tasks, their intense computation and memory requirements pose challenges for practical applications. Existing post-training quantization methods leverage value redistribution or specialized quantizers to address the non-normal distribution in ViTs. However, without considering the asymmetry in activations and relying on…

    Submitted 31 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

  23. arXiv:2401.03321  [pdf, other]

    cs.CL

    PIXAR: Auto-Regressive Language Modeling in Pixel Space

    Authors: Yintao Tai, Xiyang Liao, Alessandro Suglia, Antonio Vergari

    Abstract: Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they can…

    Submitted 23 February, 2024; v1 submitted 6 January, 2024; originally announced January 2024.

  24. arXiv:2401.02616  [pdf, other]

    cs.CV

    FED-NeRF: Achieve High 3D Consistency and Temporal Coherence for Face Video Editing on Dynamic NeRF

    Authors: Hao Zhang, Yu-Wing Tai, Chi-Keung Tang

    Abstract: The success of the GAN-NeRF structure has enabled face editing on NeRF to maintain 3D view consistency. However, achieving simultaneously multi-view consistency and temporal coherence while editing video sequences remains a formidable challenge. This paper proposes a novel face video editing architecture built upon the dynamic face GAN-NeRF structure, which effectively utilizes video sequences to…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Our code will be available at: https://github.com/ZHANG1023/FED-NeRF

  25. arXiv:2401.00551  [pdf, other]

    cs.CV

    A Generalist FaceX via Learning Unified Facial Representation

    Authors: Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai

    Abstract: This work presents FaceX framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we initially formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Faci…

    Submitted 31 December, 2023; originally announced January 2024.

    Comments: Project page: https://diffusion-facex.github.io/

  26. arXiv:2401.00208  [pdf, other]

    cs.CV

    Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models

    Authors: Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai

    Abstract: Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. For editing 3D scenes represented by NeRF, with the advent of generative models, this paper proposes Inpaint4DNeRF to capitalize on state-of-the-art stable diffusion models (e.g., ControlNet) for direct generation of the underlying completed background content, regardless of static or dynamic. The key advantages of this…

    Submitted 30 December, 2023; originally announced January 2024.

  27. arXiv:2312.06354  [pdf, other]

    cs.CV

    PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

    Authors: Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, Rongrong Ji

    Abstract: Recent advancements in personalized image generation using diffusion models have been noteworthy. However, existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive process hinders efficient deployment, limiting practical usability. Moreover, these methods often grapple with identity distortion and limited expression diversi…

    Submitted 11 December, 2023; originally announced December 2023.

  28. arXiv:2312.02568  [pdf, other]

    cs.CV

    Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent

    Authors: Jianmeng Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang

    Abstract: This paper explores promptable NeRF generation (e.g., text prompt or single image prompt) for direct conditioning and fast generation of NeRF parameters for the underlying 3D scenes, thus undoing complex intermediate steps while providing full 3D generation with conditional control. Unlike previous diffusion-CLIP-based pipelines that involve tedious per-prompt optimizations, Prompt2NeRF-PIL is cap…

    Submitted 5 December, 2023; originally announced December 2023.

  29. arXiv:2312.02216  [pdf, other]

    cs.GR cs.CV

    DragVideo: Interactive Drag-style Video Editing

    Authors: Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang

    Abstract: Video generation models have shown their superior ability to generate photo-realistic video. However, how to accurately control (or edit) the video remains a formidable challenge. The main issues are: 1) how to perform direct and accurate user control in editing; 2) how to execute editings like changing shape, expression, and layout without unsightly distortion and artifacts to the edited content;…

    Submitted 22 July, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

  30. arXiv:2312.01531  [pdf, other]

    cs.CV

    SANeRF-HQ: Segment Anything for NeRF in High Quality

    Authors: Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai

    Abstract: Recently, the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there exist initial attempts to incorporate these two methods into 3D segmentation, they face the challenge of accurately and consistently segmenting objects in c…

    Submitted 6 April, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  31. arXiv:2311.17951  [pdf, other]

    cs.LG

    C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

    Authors: Juntao Zhang, Yuehuai Liu, Yu-Wing Tai, Chi-Keung Tang

    Abstract: We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditio…

    Submitted 29 November, 2023; originally announced November 2023.

  32. arXiv:2311.16499  [pdf, other]

    cs.CV

    InceptionHuman: Controllable Prompt-to-NeRF for Photorealistic 3D Human Generation

    Authors: Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

    Abstract: This paper presents InceptionHuman, a prompt-to-NeRF framework that allows easy control via a combination of prompts in different modalities (e.g., text, poses, edge, segmentation map, etc) as inputs to generate photorealistic 3D humans. While many works have focused on generating 3D human models, they suffer one or more of the following: lack of distinctive features, unnatural shading/shadows, un…

    Submitted 6 August, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  33. arXiv:2311.15776  [pdf, other]

    cs.CV

    Stable Segment Anything Model

    Authors: Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu-Wing Tai, Chi-Keung Tang

    Abstract: The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key findin…

    Submitted 5 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Smaller file size for the easy access. Codes will be released upon acceptance. https://github.com/fanq15/Stable-SAM

  34. arXiv:2311.07395  [pdf]

    cs.RO cs.AI

    Predicting Continuous Locomotion Modes via Multidimensional Feature Learning from sEMG

    Authors: Peiwen Fu, Wenjuan Zhong, Yuyang Zhang, Wenxuan Xiong, Yuzhou Lin, Yanlong Tai, Lin Meng, Mingming Zhang

    Abstract: Walking-assistive devices require adaptive control methods to ensure smooth transitions between various modes of locomotion. For this purpose, detecting human locomotion modes (e.g., level walking or stair ascent) in advance is crucial for improving the intelligence and transparency of such robotic systems. This study proposes Deep-STF, a unified end-to-end deep learning model designed for integra…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: 10 pages, 7 figures

  35. Dynamic Frame Interpolation in Wavelet Domain

    Authors: Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Ying Tai, Chengjie Wang, Jie Yang

    Abstract: Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand…

    Submitted 20 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: Accepted by IEEE TIP

  36. arXiv:2309.02423  [pdf, other]

    cs.CV

    EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding

    Authors: Yue Xu, Yong-Lu Li, Zhemin Huang, Michael Xu Liu, Cewu Lu, Yu-Wing Tai, Chi-Keung Tang

    Abstract: With the surge in attention to Egocentric Hand-Object Interaction (Ego-HOI), large-scale datasets such as Ego4D and EPIC-KITCHENS have been proposed. However, most current research is built on resources derived from third-person video action recognition. This inherent domain gap between first- and third-person action videos, which have not been adequately addressed before, makes current Ego-HOI su…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  37. arXiv:2308.07891  [pdf, other]

    cs.CV cs.CL

    Link-Context Learning for Multimodal LLMs

    Authors: Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu

    Abstract: The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores train…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: 10 pages, 8 figures

  38. arXiv:2308.05104  [pdf, other]

    cs.CV

    Scene-Generalizable Interactive Segmentation of Radiance Fields

    Authors: Songlin Tang, Wenjie Pei, Xin Tao, Tanghui Jia, Guangming Lu, Yu-Wing Tai

    Abstract: Existing methods for interactive segmentation in radiance fields entail scene-specific optimization and thus cannot generalize across different scenes, which greatly limits their applicability. In this work we make the first attempt at Scene-Generalizable Interactive Segmentation in Radiance Fields (SGISRF) and propose a novel SGISRF method, which can perform 3D object segmentation for novel (unse…

    Submitted 9 August, 2023; originally announced August 2023.

  39. arXiv:2308.03529  [pdf, other]

    cs.CV

    Feature Decoupling-Recycling Network for Fast Interactive Segmentation

    Authors: Huimin Zeng, Weinong Wang, Xin Tao, Zhiwei Xiong, Yu-Wing Tai, Wenjie Pei

    Abstract: Recent interactive segmentation methods iteratively take source image, user guidance and previously predicted mask as the input without considering the invariant nature of the source image. As a result, extracting features from the source image is repeated in each interaction, resulting in substantial computational redundancy. In this work, we propose the Feature Decoupling-Recycling Network (FDRN…

    Submitted 8 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: Accepted to ACM MM 2023

  40. arXiv:2307.11035  [pdf, other]

    cs.CV cs.AI

    Cascade-DETR: Delving into High-Quality Universal Object Detection

    Authors: Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu

    Abstract: Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments. We introduce Cascade-DETR for high-quality universal object detection. W…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted in ICCV 2023. Our code and models will be released at https://github.com/SysCV/cascade-detr

  41. arXiv:2307.01197  [pdf, other]

    cs.CV

    Segment Anything Meets Point Tracking

    Authors: Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu

    Abstract: The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-cent…

    Submitted 3 December, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

  42. arXiv:2306.04715  [pdf, other]

    cs.CV

    UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

    Authors: Yanan Sun, Zihan Zhong, Qi Fan, Chi-Keung Tang, Yu-Wing Tai

    Abstract: Large-scale joint training of multimodal models, e.g., CLIP, have demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting their ability to cover a large distribution of real-world data, where noise can also be introduced as misaligned pairs during pre-processing. Conversely, unimodal m…

    Submitted 7 June, 2023; originally announced June 2023.

  43. arXiv:2306.01567  [pdf, other]

    cs.CV

    Segment Anything in High Quality

    Authors: Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

    Abstract: The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurat…

    Submitted 23 October, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023. We propose HQ-SAM to upgrade SAM for high-quality zero-shot segmentation. Github: https://github.com/SysCV/SAM-HQ

  44. arXiv:2306.00783  [pdf, other]

    cs.CV

    FaceDNeRF: Semantics-Driven Face Reconstruction, Prompt Editing and Relighting with Diffusion Models

    Authors: Hao Zhang, Yanbo Xu, Tianyuan Dai, Yu-Wing Tai, Chi-Keung Tang

    Abstract: The ability to create high-quality 3D faces from a single image has become increasingly important with wide applications in video conferencing, AR/VR, and advanced video editing in movie industries. In this paper, we propose Face Diffusion NeRF (FaceDNeRF), a new generative method to reconstruct high-quality Face NeRFs from single images, complete with semantic editing and relighting capabilities.…

    Submitted 4 December, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

  45. arXiv:2305.18381  [pdf, other]

    cs.LG cs.AI cs.CV

    Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation

    Authors: Yue Xu, Yong-Lu Li, Kaitong Cui, Ziyu Wang, Cewu Lu, Yu-Wing Tai, Chi-Keung Tang

    Abstract: Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation has become an effective approach by synthesizing data samples that are essential for network training. However, it remains to be explored which samples are essential for the dataset distillation process itself. In this work, we study the data ef…

    Submitted 7 August, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: ECCV 2024

  46. arXiv:2305.15171  [pdf, other]

    cs.CV

    Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction

    Authors: Xinhang Liu, Jiaben Chen, Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang

    Abstract: Novel view synthesis via Neural Radiance Fields (NeRFs) or 3D Gaussian Splatting (3DGS) typically necessitates dense observations with hundreds of input images to circumvent artifacts. We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images, by leveraging a diffusion model pre-trained from multiview datasets. Different from using diffusion pri…

    Submitted 14 July, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Paper accepted to ECCV 2024. Project page: https://xinhangliu.com/deceptive-nerf-3dgs

  47. arXiv:2305.12843  [pdf, other]

    cs.CV

    Registering Neural Radiance Fields as 3D Density Images

    Authors: Han Jiang, Ruoxuan Li, Haosen Sun, Yu-Wing Tai, Chi-Keung Tang

    Abstract: No significant work has been done to directly merge two partially overlapping scenes using NeRF representations. Given pre-trained NeRF models of a 3D scene with partial overlapping, this paper aligns them with a rigid transform, by generalizing the traditional registration pipeline, that is, key point detection and point set registration, to operate on 3D density fields. To describe corner points…

    Submitted 22 May, 2023; originally announced May 2023.

  48. arXiv:2305.08195  [pdf, other]

    cs.CL

    Learning to Simulate Natural Language Feedback for Interactive Semantic Parsing

    Authors: Hao Yan, Saurabh Srivastava, Yintao Tai, Sida I. Wang, Wen-tau Yih, Ziyu Yao

    Abstract: Interactive semantic parsing based on natural language (NL) feedback, where users provide feedback to correct the parser mistakes, has emerged as a more practical scenario than the traditional one-shot semantic parsing. However, prior work has heavily relied on human-annotated feedback data to train the interactive semantic parser, which is prohibitively expensive and not scalable. In this work, w…

    Submitted 4 June, 2023; v1 submitted 14 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023. 18 pages, 6 figures

  49. arXiv:2305.02594  [pdf, other]

    cs.CV

    Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

    Authors: Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, Yong Liu

    Abstract: Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio. However, existing methods ignore the potential of text modal, and their generators mainly follow the source-oriented feature rearrange paradigm coupled with unstable GAN frameworks. In this work, we fi…

    Submitted 9 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

  50. arXiv:2305.02572  [pdf, other]

    cs.CV

    High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

    Authors: Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, Yong Liu

    Abstract: Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose…

    Submitted 30 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.