Showing 1–50 of 265 results for author: Shan, Y

  1. arXiv:2407.13139  [pdf, other]

    cs.CV

    Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

    Authors: Xuan Ju, Junhao Zhuang, Zhaoyang Zhang, Yuxuan Bian, Qiang Xu, Ying Shan

    Abstract: This is the technical report for the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track. Instruction-guided image editing has been largely studied in recent years. The most advanced methods, such as SmartEdit and MGIE, usually combine large language models with diffusion models through joint training, where the former provides text u…

    Submitted 17 July, 2024; originally announced July 2024.

  2. arXiv:2407.10285  [pdf, other]

    cs.CV

    Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

    Authors: Qinyu Yang, Haoxin Chen, Yong Zhang, Menghan Xia, Xiaodong Cun, Zhixun Su, Ying Shan

    Abstract: In order to improve the quality of synthesized videos, currently, one predominant method involves retraining an expert diffusion model and then implementing a noising-denoising process for refinement. Despite the significant training costs, maintaining consistency of content between the original and enhanced videos remains a major challenge. To tackle this challenge, we propose a novel formulation…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: ECCV 2024, Project Page: https://yangqy1110.github.io/NC-SDEdit/, Code Repo: https://github.com/yangqy1110/NC-SDEdit/

    ACM Class: I.2; I.4.3

  3. arXiv:2407.08683  [pdf, other]

    cs.CV

    SEED-Story: Multimodal Long Story Generation with Large Language Model

    Authors: Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen

    Abstract: With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant ch…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Our models, codes and datasets are released in https://github.com/TencentARC/SEED-Story

  4. arXiv:2407.07479  [pdf, other]

    cs.CV

    How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

    Authors: Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu

    Abstract: Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy while the cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from cross-encoder to dual-encoder provides a natural approach to harness their strengths. Thus we investigate the following valuable question: how to make cross-encoder a…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted by CVPR 2024

  5. arXiv:2407.07478  [pdf, other]

    cs.CV

    EA-VTR: Event-Aware Video-Text Retrieval

    Authors: Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu

    Abstract: Understanding the content of events occurring in the video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improv…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  6. arXiv:2407.06358  [pdf, other]

    cs.CV

    MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

    Authors: Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

    Abstract: Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video datase…

    Submitted 8 July, 2024; originally announced July 2024.

  7. arXiv:2406.17565  [pdf, other]

    cs.DC

    MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

    Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

    Abstract: Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemP…

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  8. arXiv:2406.15339  [pdf, other]

    cs.CV cs.AI cs.MM

    Image Conductor: Precision Control for Interactive Video Synthesis

    Authors: Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan

    Abstract: Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/

  9. arXiv:2406.12275  [pdf, other]

    cs.CV

    VoCo-LLaMA: Towards Vision Compression with Large Language Models

    Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

    Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 18 pages, 5 figures

  10. arXiv:2406.02884  [pdf, other]

    cs.CV

    PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

    Authors: Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

    Abstract: Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic…

    Submitted 1 July, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: 10 pages; typos corrected, appendix added

  11. arXiv:2406.02395  [pdf, other]

    cs.LG cs.CV

    GrootVL: Tree Topology is All You Need in State Space Model

    Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

    Abstract: The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, it still falls short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree t…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: The code is available at https://github.com/EasonXiao-888/GrootVL

  12. arXiv:2406.01238  [pdf, other]

    cs.CL

    EffiQA: Efficient Question-Answering with Strategic Multi-Model Collaboration on Knowledge Graphs

    Authors: Zixuan Dong, Baoyun Peng, Yufei Wang, Jia Fu, Xiaodong Wang, Yongxue Shan, Xin Zhou

    Abstract: While large language models (LLMs) have shown remarkable capabilities in natural language processing, they struggle with complex, multi-step reasoning tasks involving knowledge graphs (KGs). Existing approaches that integrate LLMs and KGs either underutilize the reasoning abilities of LLMs or suffer from prohibitive computational costs due to tight coupling. To address these limitations, we propos…

    Submitted 7 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: 10 pages, 4 figures, 3 tables

  13. arXiv:2406.00908  [pdf, other]

    cs.CV

    ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

    Authors: Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He

    Abstract: Video generation has made remarkable progress in recent years, especially since the advent of the video diffusion models. Many video generation models can produce plausible synthetic videos, e.g., Stable Video Diffusion (SVD). However, most video models can only generate low frame rate videos due to the limited GPU memory as well as the difficulty of modeling a large set of frames. The training vi…

    Submitted 2 June, 2024; originally announced June 2024.

  14. arXiv:2405.20279  [pdf, other]

    cs.CV cs.AI eess.IV

    CV-VAE: A Compatible Video VAE for Latent Generative Video Models

    Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan

    Abstract: Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latent ex…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Project Page: https://ailab-cvc.github.io/cvvae/index.html

  15. arXiv:2405.20222  [pdf, other]

    cs.CV cs.AI

    MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

    Authors: Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng

    Abstract: We present MOFA-Video, an advanced controllable image animation method that generates video from the given image using various additional controllable signals (such as human landmarks reference, manual trajectories, and another even provided video) or their combinations. This is different from previous methods which only can work on a specific motion domain or show weak control abilities with diff…

    Submitted 11 July, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

    Comments: ECCV 2024 ; Project Page: https://myniuuu.github.io/MOFA_Video/ ; Codes: https://github.com/MyNiuuu/MOFA-Video

  16. arXiv:2405.19283  [pdf, other]

    cs.CV

    Programmable Motion Generation for Open-Set Motion Control Tasks

    Authors: Hanchao Liu, Xiaohang Zhan, Shaoli Huang, Tai-Jiang Mu, Ying Shan

    Abstract: Character animation in real-world scenarios necessitates a variety of constraints, such as trajectories, key-frames, interactions, etc. Existing methodologies typically treat single or a finite set of these constraint(s) as separate control tasks. They are often specialized, and the tasks they address are rarely extendable or customizable. We categorize these as solutions to the close-set motion c…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR 2024

  17. arXiv:2405.17933  [pdf, other]

    cs.CV

    ToonCrafter: Generative Cartoon Interpolation

    Authors: Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong

    Abstract: We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation. Traditional methods, that implicitly assume linear motion and the absence of complicated phenomena like dis-occlusion, often struggle with the exaggerated non-linear and large motions with occlusion commonly found in cartoons, resulti…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Project page: https://doubiiu.github.io/projects/ToonCrafter/

  18. arXiv:2405.17811  [pdf, other]

    cs.GR cs.CV

    Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh

    Authors: Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, Long Quan

    Abstract: Neural 3D representations such as Neural Radiance Fields (NeRF), excel at producing photo-realistic rendering results but lack the flexibility for manipulation and editing which is crucial for content creation. Previous works have attempted to address this issue by deforming a NeRF in canonical space or manipulating the radiance field based on an explicit mesh. However, manipulating NeRF is not hi…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Project page here: https://gaoxiangjun.github.io/mani_gs/

  19. arXiv:2405.13865  [pdf, other]

    cs.CV

    ReVideo: Remake a Video with Motion and Content Control

    Authors: Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang

    Abstract: Despite significant advancements in video generation and editing using diffusion models, achieving accurate and localized video editing remains a substantial challenge. Additionally, most existing video editing methods primarily focus on altering visual content, with limited research dedicated to motion editing. In this paper, we present a novel attempt to Remake a Video (ReVideo) which stands out…

    Submitted 22 May, 2024; originally announced May 2024.

  20. arXiv:2405.13672  [pdf, other]

    cs.CV

    Advancing Spiking Neural Networks towards Multiscale Spatiotemporal Interaction Learning

    Authors: Yimeng Shan, Malu Zhang, Rui-jie Zhu, Xuerui Qiu, Jason K. Eshraghian, Haicheng Qu

    Abstract: Recent advancements in neuroscience research have propelled the development of Spiking Neural Networks (SNNs), which not only have the potential to further advance neuroscience research but also serve as an energy-efficient alternative to Artificial Neural Networks (ANNs) due to their spike-driven characteristics. However, previous studies often neglected the multiscale information and its spatiot…

    Submitted 27 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  21. arXiv:2405.11299  [pdf, other]

    cs.DB cs.LG

    The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving

    Authors: Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan

    Abstract: We survey the large language model (LLM) serving area to understand the intricate dynamics between cost-efficiency and accuracy, which is magnified by the growing need for longer contextual understanding when deploying models at a massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving…

    Submitted 26 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

  22. arXiv:2405.07990  [pdf, other]

    cs.CL cs.CV

    Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

    Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

    Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLM…

    Submitted 13 May, 2024; originally announced May 2024.

  23. arXiv:2405.06696  [pdf, other]

    cs.CL cs.AI

    Multi-level Shared Knowledge Guided Learning for Knowledge Graph Completion

    Authors: Yongxue Shan, Jie Zhou, Jie Peng, Xin Zhou, Jiaqian Yin, Xiaodong Wang

    Abstract: In the task of Knowledge Graph Completion (KGC), the existing datasets and their inherent subtasks carry a wealth of shared knowledge that can be utilized to enhance the representation of knowledge triplets and overall performance. However, no current studies specifically address the shared knowledge within KGC. To bridge this gap, we introduce a multi-level Shared Knowledge Guided learning method…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted for publication at TACL; the arXiv version is the pre-MIT Press publication version

  24. arXiv:2405.04007  [pdf, other]

    cs.CV

    SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Authors: Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

    Abstract: In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario da…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Technical Report; Dataset released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit

  25. arXiv:2405.00351  [pdf, other]

    cs.HC cs.AI cs.CV cs.MM

    Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

    Authors: Zidong Cao, Zhan Wang, Yexin Liu, Yan-Pei Cao, Ying Shan, Wei Zeng, Lin Wang

    Abstract: Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user's ability to engage with objects of interest. In this paper, we present a novel system, cal…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 11 pages

  26. arXiv:2404.18392  [pdf, other]

    cs.DC

    Dflow, a Python framework for constructing cloud-native AI-for-Science workflows

    Authors: Xinzijian Liu, Yanbo Han, Zhuoyuan Li, Jiahao Fan, Chengqian Zhang, Jinzhe Zeng, Yifan Shan, Yannan Yuan, Wei-Hong Xu, Yun-Pei Liu, Yuzhi Zhang, Tongqi Wen, Darrin M. York, Zhicheng Zhong, Hang Zheng, Jun Cheng, Linfeng Zhang, Han Wang

    Abstract: In the AI-for-science era, scientific computing scenarios such as concurrent learning and high-throughput computing demand a new generation of infrastructure that supports scalable computing resources and automated workflow management on both cloud and high-performance supercomputers. Here we introduce Dflow, an open-source Python toolkit designed for scientists to construct workflows with simple…

    Submitted 28 April, 2024; originally announced April 2024.

  27. arXiv:2404.16790  [pdf, other]

    cs.CV

    SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

    Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

    Abstract: Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their pro…

    Submitted 25 April, 2024; originally announced April 2024.

  28. arXiv:2404.14396  [pdf, other]

    cs.CV

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

    Abstract: The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. I…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Project released at: https://github.com/AILab-CVC/SEED-X

  29. arXiv:2404.07191  [pdf, other]

    cs.CV

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Authors: Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan

    Abstract: We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To…

    Submitted 14 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: Technical report. Project: https://github.com/TencentARC/InstantMesh

  30. arXiv:2404.00308  [pdf, other]

    cs.CV

    ST-LLM: Large Language Models Are Effective Temporal Learners

    Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

    Abstract: Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we fe…

    Submitted 30 March, 2024; originally announced April 2024.

  31. arXiv:2403.19098  [pdf, other]

    cs.CV

    GraphAD: Interaction Scene Graph for End-to-end Autonomous Driving

    Authors: Yunpeng Zhang, Deheng Qian, Ding Li, Yifeng Pan, Yong Chen, Zhenbao Liang, Zhiyao Zhang, Shurui Zhang, Hongxu Li, Maolei Fu, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du

    Abstract: Modeling complicated interactions among the ego-vehicle, road agents, and map elements has been a crucial part for safety-critical autonomous driving. Previous works on end-to-end autonomous driving rely on the attention mechanism for handling heterogeneous interactions, which fails to capture the geometric priors and is also computationally intensive. In this paper, we propose the Interaction Sce…

    Submitted 6 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: project page: https://github.com/zhangyp15/GraphAD

  32. arXiv:2403.11589  [pdf, other]

    cs.CV

    UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling

    Authors: Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan

    Abstract: Reconstructing photo-realistic drivable human avatars from multi-view image sequences has been a popular and challenging topic in the field of computer vision and graphics. While existing NeRF-based methods can achieve high-quality novel view rendering of human models, both training and inference processes are time-consuming. Recent approaches have utilized 3D Gaussians to represent the human body…

    Submitted 18 March, 2024; originally announced March 2024.

  33. arXiv:2403.10050  [pdf, other]

    cs.CV

    Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

    Authors: Tian-Xing Xu, Wenbo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang

    Abstract: 3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations, such as texture swapping. To address this issue, we propose a novel approach, namel…

    Submitted 15 March, 2024; originally announced March 2024.

  34. arXiv:2403.10044  [pdf, other]

    cs.CV

    SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

    Authors: Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

    Abstract: Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by AAAI2024

  35. arXiv:2403.08309  [pdf, other]

    cs.LG cs.AI

    HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback

    Authors: Ang Li, Qiugen Xiao, Peng Cao, Jian Tang, Yi Yuan, Zijie Zhao, Xiaoyuan Chen, Liang Zhang, Xiangyang Li, Kaitong Yang, Weidong Guo, Yukang Gan, Xu Yu, Daniell Wang, Ying Shan

    Abstract: Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter annotation cycles and lower costs over Reinforcement Learning from Human Feedback (RLHF), making it highly efficient during the rapid strategy iteration periods of large language model (LLM) training. Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human…

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: 18 pages, 7 figures

  36. arXiv:2403.06976  [pdf, other]

    cs.CV

    BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

    Authors: Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu

    Abstract: Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these cha…

    Submitted 11 March, 2024; originally announced March 2024.

  37. arXiv:2403.05895  [pdf, other]

    cs.CV

    DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

    Authors: Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

    Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and dep…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: 24 pages, 14 figures, Tech Report

  38. arXiv:2402.18146  [pdf, ps, other]

    cs.CV

    3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

    Authors: Chaokang Jiang, Guangming Wang, Jiuming Liu, Hesheng Wang, Zhuang Ma, Zhenqiang Liu, Zhujin Liang, Yi Shan, Dalong Du

    Abstract: Learning 3D scene flow from LiDAR point clouds presents significant difficulties, including poor generalization from synthetic datasets to real scenes, scarcity of real-world 3D labels, and poor performance on real sparse LiDAR point clouds. We present a novel approach from the perspective of auto-labelling, aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR poin…

    Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR2024! 10 pages, 6 figures

  39. arXiv:2402.10491  [pdf, other]

    cs.CV

    Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

    Authors: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

    Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: Project Page: https://guolanqing.github.io/Self-Cascade/

  40. arXiv:2402.02772  [pdf, other]

    cs.LG

    Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning

    Authors: Yixiang Shan, Zhengbang Zhu, Ting Long, Qifan Liang, Yi Chang, Weinan Zhang, Liang Yin

    Abstract: The performance of offline reinforcement learning (RL) is sensitive to the proportion of high-return trajectories in the offline dataset. However, in many simulation environments and real-world scenarios, there are large ratios of low-return trajectories rather than high-return trajectories, which makes learning an efficient policy challenging. In this paper, we propose a method called Contrastive…

    Submitted 15 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: 18 pages with appendix and references, 10 figures, 4 tables

  41. arXiv:2402.02583  [pdf, other]

    cs.CV cs.LG

    DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

    Authors: Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

    Abstract: Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years. Although owning diverse and high-quality generation capabilities, translating these abilities to fine-grained image editing remains challenging. In this paper, we propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing: (1) in complex scenarios, editing resu…

    Submitted 4 February, 2024; originally announced February 2024.

  42. arXiv:2402.02439  [pdf, other]

    cs.LG cs.AI

    DiffStitch: Boosting Offline Reinforcement Learning with Diffusion-based Trajectory Stitching

    Authors: Guanghe Li, Yixiang Shan, Zhengbang Zhu, Ting Long, Weinan Zhang

    Abstract: In offline reinforcement learning (RL), the performance of the learned policy highly depends on the quality of offline datasets. However, in many cases, the offline dataset contains very limited optimal trajectories, which poses a challenge for offline RL algorithms as agents must acquire the ability to transit to high-reward regions. To address this issue, we introduce Diffusion-based Trajectory…

    Submitted 21 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  43. arXiv:2401.17807  [pdf, other]

    cs.CV cs.GR

    Advances in 3D Generation: A Survey

    Authors: Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan

    Abstract: Generating 3D models lies at the core of computer graphics and has been the focus of decades of research. With the emergence of advanced neural representations and generative models, the field of 3D content generation is developing rapidly, enabling the creation of increasingly high-quality and diverse 3D models. The rapid growth of this field makes it difficult to stay abreast of all recent devel…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 33 pages, 12 figures

  44. arXiv:2401.17270  [pdf, other]

    cs.CV

    YOLO-World: Real-Time Open-Vocabulary Object Detection

    Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

    Abstract: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling an…

    Submitted 22 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Work still in progress. Code & models are available at: https://github.com/AILab-CVC/YOLO-World

  45. RecDCL: Dual Contrastive Learning for Recommendation

    Authors: Dan Zhang, Yangliao Geng, Wenwen Gong, Zhongang Qi, Zhiyu Chen, Xing Tang, Ying Shan, Yuxiao Dong, Jie Tang

    Abstract: Self-supervised learning (SSL) has recently achieved great success in mining the user-item interactions for collaborative filtering. As a major paradigm, contrastive learning (CL) based SSL helps address data sparsity in Web platforms by contrasting the embeddings between raw and augmented data. However, existing CL-based methods mostly focus on contrasting in a batch-wise way, failing to exploit…

    Submitted 18 February, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted to WWW 2024

    Journal ref: Proceedings of TheWebConf 2024 (WWW '24), May 13--17, 2024, Singapore

  46. arXiv:2401.14828  [pdf, other]

    cs.CV

    TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

    Authors: Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

    Abstract: Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIPEditor, that accepts both text and image prompts and a 3D b…

    Submitted 25 April, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: Accepted by Siggraph 2024 & ACM Transactions on Graphics

  47. arXiv:2401.14405  [pdf, other]

    cs.CV cs.AI cs.LG

    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

    Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

    Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalit…

    Submitted 18 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Code and models are available at https://github.com/AILab-CVC/M2PT

  48. arXiv:2401.11240  [pdf, other]

    cs.DC

    CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

    Authors: Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang

    Abstract: Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dy…

    Submitted 20 January, 2024; originally announced January 2024.

  49. arXiv:2401.11181  [pdf, other]

    cs.DC

    Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

    Authors: Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

    Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference request…

    Submitted 20 January, 2024; originally announced January 2024.

  50. arXiv:2401.10222  [pdf, other]

    cs.CV cs.AI

    Supervised Fine-tuning in turn Improves Visual Foundation Models

    Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

    Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuni…

    Submitted 11 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 23 pages, 3 figures, Project page: https://github.com/TencentARC/ViSFT/tree/main