
Showing 1–50 of 90 results for author: Qian, S

Searching in archive cs.
  1. arXiv:2407.12248  [pdf, other]

    cs.DC

    Mitigating Interference of Microservices with a Scoring Mechanism in Large-scale Clusters

    Authors: Dingyu Yang, Kangpeng Zheng, Shiyou Qian, Jian Cao, Guangtao Xue

    Abstract: Co-locating latency-critical services (LCSs) and best-effort jobs (BEJs) constitutes the principal approach for enhancing resource utilization in production. Nevertheless, the co-location practice hurts the performance of LCSs due to resource competition, even when employing isolation technology. Through an extensive analysis of voluminous real trace data derived from two production clusters, we ob…

    Submitted 16 July, 2024; originally announced July 2024.

  2. arXiv:2407.07364  [pdf, other]

    cs.LG cs.AI eess.SY

    Real-time system optimal traffic routing under uncertainties -- Can physics models boost reinforcement learning?

    Authors: Zemian Ke, Qiling Zou, Jiachao Liu, Sean Qian

    Abstract: System optimal traffic routing can mitigate congestion by assigning routes for a portion of vehicles so that the total travel time of all vehicles in the transportation system can be reduced. However, achieving real-time optimal routing poses challenges due to uncertain demands and unknown system dynamics, particularly in expansive transportation networks. While physics model-based methods are sen…

    Submitted 10 July, 2024; originally announced July 2024.

  3. arXiv:2407.06192  [pdf, other]

    cs.CV cs.AI cs.CL

    Multi-Object Hallucination in Vision-Language Models

    Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai

    Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent o…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted to ALVR @ ACL 2024 | Project page: https://multi-object-hallucination.github.io/

  4. arXiv:2406.18158  [pdf, other]

    cs.RO cs.CV

    3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

    Authors: Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal

    Abstract: Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage R…

    Submitted 26 June, 2024; originally announced June 2024.

  5. arXiv:2406.17777  [pdf, other]

    cs.CV

    Text-Animator: Controllable Visual Text Video Generation

    Authors: Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

    Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video (T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summar…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Project Page: https://laulampaul.github.io/text-animator.html

  6. arXiv:2406.16321  [pdf, other]

    cs.LG cs.AI

    Multimodal Graph Benchmark

    Authors: Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, Danai Koutra

    Abstract: Associating unstructured data with structured information is crucial for real-world tasks that require relevance search. However, existing graph learning benchmarks often overlook the rich semantic information associated with each node. To bridge this gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), the first comprehensive multi-modal graph benchmark that incorporates both textual and v…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: https://mm-graph-benchmark.github.io/

  7. arXiv:2406.15781  [pdf, other]

    cs.CL

    DABL: Detecting Semantic Anomalies in Business Processes Using Large Language Models

    Authors: Wei Guan, Jian Cao, Jianqi Gao, Haiyan Zhao, Shiyou Qian

    Abstract: Detecting anomalies in business processes is crucial for ensuring operational success. While many existing methods rely on statistical frequency to detect anomalies, it's important to note that infrequent behavior doesn't necessarily imply undesirability. To address this challenge, detecting anomalies from a semantic viewpoint proves to be a more effective approach. However, current semantic anoma…

    Submitted 22 June, 2024; originally announced June 2024.

  8. arXiv:2406.15769  [pdf, other]

    cs.DC

    Humas: A Heterogeneity- and Upgrade-aware Microservice Auto-scaling Framework in Large-scale Data Centers

    Authors: Qin Hua, Dingyu Yang, Shiyou Qian, Jian Cao, Guangtao Xue, Minglu Li

    Abstract: An effective auto-scaling framework is essential for microservices to ensure performance stability and resource efficiency under dynamic workloads. As revealed by many prior studies, the key to efficient auto-scaling lies in accurately learning performance patterns, i.e., the relationship between performance metrics and workloads in data-driven schemes. However, we notice that there are two signif…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 14 pages; 27 figures

  9. arXiv:2406.05132  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

    Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

    Abstract: The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datase…

    Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Project website: https://3d-grand.github.io

  10. arXiv:2406.04640  [pdf, other]

    cs.LG

    LinkGPT: Teaching Large Language Models To Predict Missing Links

    Authors: Zhongmou He, Jing Zhu, Shengyi Qian, Joyce Chai, Danai Koutra

    Abstract: Large Language Models (LLMs) have shown promising results on various language and vision tasks. Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs). However, most studies have focused on node classification, while the use of LLMs for link prediction (LP) remains understudied. In this work, we propose a new task on LLMs, whe…

    Submitted 7 June, 2024; originally announced June 2024.

  11. arXiv:2406.03007  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

    Authors: Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian

    Abstract: With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024

  12. arXiv:2404.19026  [pdf, other]

    cs.CV

    MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

    Authors: Cong Wang, Di Kang, He-Yi Sun, Shen-Han Qian, Zi-Xuan Wang, Linchao Bao, Song-Hai Zhang

    Abstract: Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gau…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Project page: https://conallwang.github.io/MeGA_Pages/

  13. arXiv:2404.18219  [pdf, other]

    physics.ins-det cs.LG hep-ex hep-ph physics.data-an

    BUFF: Boosted Decision Tree based Ultra-Fast Flow matching

    Authors: Cheng Jiang, Sitian Qian, Huilin Qu

    Abstract: Tabular data stands out as one of the most frequently encountered types in high energy physics. Unlike commonly homogeneous data such as pixelated images, simulating high-dimensional tabular data and accurately capturing their correlations are often quite challenging, even with the most advanced architectures. Based on the findings that tree-based models surpass the performance of deep learning mo…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: 9 pages, 10 figures, 1 additional figure in appendix

  14. arXiv:2404.15275  [pdf, other]

    cs.CV

    ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

    Authors: Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Jie Zhang

    Abstract: Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present \textb…

    Submitted 25 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Project Page: https://id-animator.github.io/

  15. arXiv:2404.02445  [pdf, other]

    cs.DC

    MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms

    Authors: Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue

    Abstract: With its elastic power and a pay-as-you-go cost model, the deployment of deep learning inference services (DLISs) on serverless platforms is emerging as a prevalent trend. However, the varying resource requirements of different layers in DL models hinder resource utilization and increase costs when DLISs are deployed as a single function on serverless platforms. To tackle this problem, we propose…

    Submitted 3 April, 2024; originally announced April 2024.

  16. arXiv:2403.20041  [pdf]

    cs.CL

    Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

    Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

    Abstract: The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow inference speed, which causes poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbol…

    Submitted 5 July, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: 21 pages, 6 figures, fix "E0M4" spell mistake, fix FLOPS to TFLOPS

  17. arXiv:2403.19622  [pdf, other]

    cs.RO cs.CV

    RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

    Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

    Abstract: The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans, mak…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 24 pages, 12 figures, 6 tables

  18. arXiv:2403.16999  [pdf, other]

    cs.CV

    Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

    Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li

    Abstract: Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. However, they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the region of interest that could provide key information for answering the question is small. To address these challenges, we collect and introduc…

    Submitted 7 July, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/deepcs233/Visual-CoT

  19. arXiv:2403.12580  [pdf, other]

    cs.CV

    Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

    Authors: Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jianning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, Lizhuang Ma

    Abstract: Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approaches has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: It is accepted by CVPR2024

  20. arXiv:2401.17095  [pdf, other]

    cs.LG cs.AI

    Traffic estimation in unobserved network locations using data-driven macroscopic models

    Authors: Pablo Guarda, Sean Qian

    Abstract: This paper leverages macroscopic models and multi-source spatiotemporal data collected from automatic traffic counters and probe vehicles to accurately estimate traffic flow and travel time in links where these measurements are unavailable. This problem is critical in transportation planning applications where the sensor coverage is low and the planned interventions have network-wide impacts. The…

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: 34 pages, 28 figures, 6 tables

  21. arXiv:2401.06341  [pdf, other]

    cs.CV cs.RO

    AffordanceLLM: Grounding Affordance from Vision Language Models

    Authors: Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li

    Abstract: Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires the comprehensive understanding of a scene in multiple aspects including detection, localization, and recognition of objects with their parts, of geo-spatial configuration/layout of the scene, of 3D shapes and physics, as…

    Submitted 17 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  22. arXiv:2312.07955  [pdf, other]

    cs.CV cs.AI cs.CR cs.LG

    Erasing Self-Supervised Learning Backdoor by Cluster Activation Masking

    Authors: Shengsheng Qian, Yifei Wang, Dizhan Xue, Shengjie Zhang, Huaiwen Zhang, Changsheng Xu

    Abstract: Researchers have recently found that Self-Supervised Learning (SSL) is vulnerable to backdoor attacks. The attacker can embed hidden SSL backdoors via a few poisoned examples in the training dataset and maliciously manipulate the behavior of downstream models. To defend against SSL backdoor attacks, a feasible route is to detect and remove the poisonous samples in the training set. However, the ex…

    Submitted 13 December, 2023; originally announced December 2023.

  23. arXiv:2312.04302  [pdf, other]

    cs.CV cs.CL

    Prompt Highlighter: Interactive Control for Multi-Modal LLMs

    Authors: Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia

    Abstract: This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing…

    Submitted 20 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: CVPR 2024; Project Page: https://julianjuaner.github.io/projects/PromptHighlighter

  24. arXiv:2312.02069  [pdf, other]

    cs.CV

    GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

    Authors: Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, Matthias Nießner

    Abstract: We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying p…

    Submitted 28 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page: https://shenhanqian.github.io/gaussian-avatars

  25. arXiv:2309.16189  [pdf, other]

    cs.CV

    Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing

    Authors: Lu Dai, Liqian Ma, Shenhan Qian, Hao Liu, Ziwei Liu, Hui Xiong

    Abstract: In this paper, we define and study a new Cloth2Body problem which has a goal of generating 3D human body meshes from a 2D clothing image. Unlike the existing human mesh recovery problem, Cloth2Body needs to address new and emerging challenges raised by the partial observation of the input and the high diversity of the output. Indeed, there are three specific challenges. First, how to locate and po…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: ICCV 2023 Poster

  26. arXiv:2309.12311  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

    Authors: Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai

    Abstract: 3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipe…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Project website: https://chat-with-nerf.github.io/

  27. arXiv:2309.12307  [pdf, other]

    cs.CL cs.AI cs.LG

    LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

    Authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia

    Abstract: We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layer…

    Submitted 8 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: Code, models, dataset, and demo are available at https://github.com/dvlab-research/LongLoRA

  28. arXiv:2309.09180  [pdf, other]

    eess.AS cs.AI cs.SD

    Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture

    Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee

    Abstract: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both efficiency and performance. Next, we further decrease the memory occupation of decoding by in…

    Submitted 26 December, 2023; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  29. arXiv:2309.01955  [pdf, other]

    cs.AI cs.MM

    A Survey on Interpretable Cross-modal Reasoning

    Authors: Dizhan Xue, Shengsheng Qian, Zuyi Zhou, Changsheng Xu

    Abstract: In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics. As the deployment of AI systems becomes more ubiquitous, the demand for transparency and comprehensibility in these systems' decision-making processes has intensified. This…

    Submitted 14 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    ACM Class: A.1

  30. arXiv:2308.15990  [pdf, other]

    cs.SD eess.AS

    Dual-path Transformer Based Neural Beamformer for Target Speech Extraction

    Authors: Aoqi Guo, Sichong Qian, Baoxiang Li, Dazhi Gao

    Abstract: Neural beamformers, which integrate both pre-separation and beamforming modules, have demonstrated impressive effectiveness in target speech extraction. Nevertheless, the performance of these beamformers is inherently limited by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer supported by a dual-path transformer. Initially, we employ the cross-…

    Submitted 7 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

  31. arXiv:2308.15802  [pdf, other]

    cs.AI

    Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO

    Authors: Yangkun Chen, Joseph Suarez, Junjie Zhang, Chenghui Yu, Bo Wu, Hanmo Chen, Hengman Zhu, Rui Du, Shanliang Qian, Shuai Liu, Weijun Hong, Jinke He, Yibing Zhang, Liang Zhao, Clare Zhu, Julian Togelius, Sharada Mohanty, Jiaxin Chen, Xiu Li, Xiaolong Zhu, Phillip Isola

    Abstract: We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions. This competition targets robustness and generalization in multi-agent systems: participants train teams of agents to complete a multi-task objective against opponents not seen during training. The competition combines relatively complex environment design with large numbers of agents…

    Submitted 30 August, 2023; originally announced August 2023.

  32. arXiv:2308.14638  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

    Authors: Ruoyu Wang, Maokui He, Jun Du, Hengshun Zhou, Shutong Niu, Hang Chen, Yanyan Yue, Gaobin Yang, Shilong Wu, Lei Sun, Yanhui Tu, Haitao Tang, Shuangqing Qian, Tian Gao, Mengzhi Wang, Genshun Wan, Jia Pan, Jianqing Gao, Chin-Hui Lee

    Abstract: This technical report details our submission system to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition under complex multi-speaker scenarios. It also evaluates the efficiency of systems in handling diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy base…

    Submitted 10 October, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by 2023 CHiME Workshop, Oral

  33. arXiv:2308.07741  [pdf, other]

    cs.RO cs.LG

    Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World

    Authors: Nico Gürtler, Felix Widmaier, Cansu Sancaktar, Sebastian Blaes, Pavel Kolev, Stefan Bauer, Manuel Wüthrich, Markus Wulfmeier, Martin Riedmiller, Arthur Allshire, Qiang Wang, Robert McCarthy, Hangyeol Kim, Jongchan Baek, Wookyong Kwon, Shanliang Qian, Yasunori Toshimitsu, Mike Yan Michelis, Amirhossein Kazemipour, Arman Raayatsanati, Hehui Zheng, Barnabas Gavin Cangan, Bernhard Schölkopf, Georg Martius

    Abstract: Experimentation on real robots is demanding in terms of time and costs. For this reason, a large part of the reinforcement learning (RL) community uses simulators to develop and benchmark algorithms. However, insights gained in simulation do not necessarily translate to real robots, in particular for tasks involving complex interactions with the environment. The Real Robot Challenge 2022 therefore…

    Submitted 24 November, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Typo in author list fixed

  34. arXiv:2307.16614  [pdf, other]

    cs.LG

    LaplaceConfidence: a Graph-based Approach for Learning with Noisy Labels

    Authors: Mingcai Chen, Yuntao Du, Wei Tang, Baoming Zhang, Hao Cheng, Shuwei Qian, Chongjun Wang

    Abstract: In real-world applications, perfect labels are rarely available, making it challenging to develop robust machine learning algorithms that can handle noisy labels. Recent methods have focused on filtering noise based on the discrepancy between model predictions and given noisy labels, assuming that samples with small classification losses are clean. This work takes a different approach by leveragin…

    Submitted 31 July, 2023; originally announced July 2023.

  35. arXiv:2307.12558  [pdf, other]

    cs.CV cs.RO

    Revisiting Event-based Video Frame Interpolation

    Authors: Jiaben Chen, Yichen Zhu, Dongze Lian, Jiaqi Yang, Yifu Wang, Renrui Zhang, Xinhang Liu, Shenhan Qian, Laurent Kneip, Shenghua Gao

    Abstract: Dynamic vision sensors or event cameras provide rich complementary information for video frame interpolation. Existing state-of-the-art methods follow the paradigm of combining both synthesis-based and warping networks. However, few of those methods fully respect the intrinsic characteristics of event streams. Given that event cameras only encode intensity changes and polarity rather than color i…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Accepted by IROS2023 Project Site: https://jiabenchen.github.io/revisit_event

  36. arXiv:2306.11900  [pdf, other]

    cs.CL

    Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation

    Authors: Shenbin Qian, Constantin Orasan, Felix do Carmo, Qiuliang Li, Diptesh Kanojia

    Abstract: In this paper, we focus on how current Machine Translation (MT) tools perform on the translation of emotion-loaded texts by evaluating outputs from Google Translate according to a framework proposed in this paper. We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform a detailed error analysis of the MT outputs. From our analysis, we observe that about…

    Submitted 20 June, 2023; originally announced June 2023.

  37. arXiv:2306.00899  [pdf, other]

    cs.LG cs.IR cs.SI

    Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices

    Authors: Jing Zhu, Yuhang Zhou, Vassilis N. Ioannidis, Shengyi Qian, Wei Ai, Xiang Song, Danai Koutra

    Abstract: While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test have an outsized impact on the performance of low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance acr…

    Submitted 17 December, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Extended Version of our WSDM'24 paper. 8 pages, 2 page appendix

  38. arXiv:2305.09664  [pdf, other]

    cs.CV

    Understanding 3D Object Interaction from a Single Image

    Authors: Shengyi Qian, David F. Fouhey

    Abstract: Humans can easily understand a single image as depicting multiple potential objects permitting interaction. We use this skill to plan our interactions with the world and accelerate understanding new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore the 3D scene or manipulate objects. Our a…

    Submitted 4 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

    Comments: ICCV 2023

  39. NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads

    Authors: Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, Matthias Nießner

    Abstract: We focus on reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps. To this end, we propose a new multi-view capture setup composed of 16 calibrated machine vision cameras that record time-synchronized images at 7.1 MP resolution and 73 frames per second. With our setup, we coll…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: Siggraph 2023, Project Page: https://tobias-kirschstein.github.io/nersemble/ , Video: https://youtu.be/a-OAWqBzldU

    Journal ref: ACM Transactions on Graphics, Volume 42, Issue 4, Article No. 161 (2023) 1-14

  40. arXiv:2304.07919  [pdf, other]

    cs.CV cs.AI

    Chain of Thought Prompt Tuning in Vision Language Models

    Authors: Jiaxin Ge, Hongyin Luo, Siyuan Qian, Yulu Gan, Jie Fu, Shanghang Zhang

    Abstract: Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies only use a single prompt for tuning, neglecting the inherent step-to-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains…

    Submitted 17 June, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

  41. arXiv:2304.07547  [pdf, other]

    cs.CV

    TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

    Authors: Jingyao Li, Pengguang Chen, Shengju Qian, Jiaya Jia

    Abstract: Recent success of Contrastive Language-Image Pre-training (CLIP) has shown great promise in pixel-level open-vocabulary learning tasks. A general paradigm utilizes CLIP's text and patch embeddings to generate semantic masks. However, existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones. In our work, we disentangle the ill-…

    Submitted 15 April, 2023; originally announced April 2023.

  42. arXiv:2304.07051  [pdf, other]

    cs.CV cs.AI

    The Second Monocular Depth Estimation Challenge

    Authors: Jaime Spencer, C. Stella Qian, Michaela Trescakova, Chris Russell, Simon Hadfield, Erich W. Graf, Wendy J. Adams, Andrew J. Schofield, James Elder, Richard Bowden, Ali Anwar, Hao Chen, Xiaozhi Chen, Kai Cheng, Yuchao Dai, Huynh Thai Hoa, Sadat Hossain, Jianmian Huang, Mohan Jing, Bo Li, Chao Li, Baojun Li, Zhiwen Liu, Stefano Mattoccia, Siegfried Mercelis, et al. (18 additional authors not shown)

    Abstract: This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes…

    Submitted 26 April, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

    Comments: Published at CVPRW2023

  43. arXiv:2303.11329  [pdf, other]

    cs.CV cs.SD eess.AS

    Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

    Authors: Ziyang Chen, Shengyi Qian, Andrew Owens

    Abstract: The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of ima…

    Submitted 21 August, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Project site: https://ificl.github.io/SLfM/

  44. arXiv:2303.00750  [pdf, other]

    cs.CV

    StraIT: Non-autoregressive Generation with Stratified Image Transformer

    Authors: Shengju Qian, Huiwen Chang, Yuanzhen Li, Zizhao Zhang, Jiaya Jia, Han Zhang

    Abstract: We propose Stratified Image Transformer (StraIT), a pure non-autoregressive (NAR) generative model that demonstrates superiority in high-quality image synthesis over existing autoregressive (AR) and diffusion models (DMs). In contrast to the under-exploitation of visual characteristics in existing vision tokenizer, we leverage the hierarchical nature of images to encode visual tokens into stratified l…

    Submitted 1 March, 2023; originally announced March 2023.

  45. GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning

    Authors: Xin Dong, Ruize Wu, Chao Xiong, Hai Li, Lei Cheng, Yong He, Shiyou Qian, Jian Cao, Linjian Mo

    Abstract: Multi-task learning (MTL) aims at solving multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degeneration with negative transfer due to learning several tasks simultaneously. Some related work attributes the source of the problem to conflicting gradients. In this case, it is needed to select useful gradient…

    Submitted 31 January, 2023; originally announced January 2023.

    Journal ref: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022: 386-395

  46. arXiv:2212.11115  [pdf, other]

    cs.CV

    What Makes for Good Tokenizers in Vision Transformer?

    Authors: Shengju Qian, Yi Zhu, Wenbo Li, Mu Li, Jiaya Jia

    Abstract: The architecture of transformers, which recently witness booming applications in vision tasks, has pivoted against the widespread convolutional paradigm. Relying on the tokenization process that splits inputs into multiple tokens, transformers are capable of extracting their pairwise relationships using self-attention. While being the stemming building block of transformers, what makes for a good…

    Submitted 21 December, 2022; originally announced December 2022.

    Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence

  47. arXiv:2211.12174  [pdf, other]

    cs.CV

    The Monocular Depth Estimation Challenge

    Authors: Jaime Spencer, C. Stella Qian, Chris Russell, Simon Hadfield, Erich Graf, Wendy Adams, Andrew J. Schofield, James Elder, Richard Bowden, Heng Cong, Stefano Mattoccia, Matteo Poggi, Zeeshan Khan Suri, Yang Tang, Fabio Tosi, Hao Wang, Youmin Zhang, Yusheng Zhang, Chaoqiang Zhao

    Abstract: This paper summarizes the results of the first Monocular Depth Estimation Challenge (MDEC) organized at WACV2023. This challenge evaluated the progress of self-supervised monocular depth estimation on the challenging SYNS-Patches dataset. The challenge was organized on CodaLab and received submissions from 4 valid teams. Participants were provided a devkit containing updated reference implementati…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: WACV-Workshops 2023

  48. arXiv:2211.06614  [pdf, other]

    cs.LG cs.AI

    Robust Training of Graph Neural Networks via Noise Governance

    Authors: Siyi Qian, Haochao Ying, Renjun Hu, Jingbo Zhou, Jintai Chen, Danny Z. Chen, Jian Wu

    Abstract: Graph Neural Networks (GNNs) have become widely-used models for semi-supervised learning. However, the robustness of GNNs in the presence of label noise remains a largely under-explored problem. In this paper, we consider an important yet challenging scenario where labels on nodes of graphs are not only noisy but also scarce. In this scenario, the performance of GNNs is prone to degrade due to lab…

    Submitted 25 February, 2023; v1 submitted 12 November, 2022; originally announced November 2022.

    Comments: 9 pages, accepted to WSDM 2023 Research Track

  49. arXiv:2208.14851  [pdf, other]

    cs.CV

    Dual-Space NeRF: Learning Animatable Avatars and Scene Lighting in Separate Spaces

    Authors: Yihao Zhi, Shenhan Qian, Xinhao Yan, Shenghua Gao

    Abstract: Modeling the human body in a canonical space is a common practice for capturing and animation. But when involving the neural radiance field (NeRF), learning a static NeRF in the canonical space is not enough because the lighting of the body changes when the person moves even though the scene lighting is constant. Previous methods alleviate the inconsistency of lighting by learning a per-frame embe…

    Submitted 31 August, 2022; originally announced August 2022.

    Comments: Accepted by 3DV 2022

  50. arXiv:2207.09835  [pdf, other]

    cs.CV

    UNIF: United Neural Implicit Functions for Clothed Human Reconstruction and Animation

    Authors: Shenhan Qian, Jiale Xu, Ziwei Liu, Liqian Ma, Shenghua Gao

    Abstract: We propose united implicit functions (UNIF), a part-based method for clothed human reconstruction and animation with raw scans and skeletons as the input. Previous part-based methods for human reconstruction rely on ground-truth part labels from SMPL and thus are limited to minimal-clothed humans. In contrast, our method learns to separate parts from body motions instead of part supervision, thus…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted to ECCV 2022