Showing 1–50 of 149 results for author: Nie, Y

  1. arXiv:2407.05840  [pdf, other]

    cs.ET physics.optics

    A 103-TOPS/mm$^2$ Integrated Photonic Computing Engine Enabling Next-Generation Reservoir Computing

    Authors: Dongliang Wang, Yikun Nie, Gaolei Hu, Hon Ki Tsang, Chaoran Huang

    Abstract: Reservoir computing (RC) is a leading machine learning algorithm for information processing due to its rich expressiveness. A new RC paradigm has recently emerged, showcasing superior performance and delivering more interpretable results with shorter training data sets and training times, representing the next generation of RC computing. This work presents the first realization of a high-speed nex…

    Submitted 31 May, 2024; originally announced July 2024.

  2. arXiv:2407.04277  [pdf, other]

    cs.CV

    Research, Applications and Prospects of Event-Based Pedestrian Detection: A Survey

    Authors: Han Wang, Yuman Nie, Yun Li, Hongjie Liu, Min Liu, Wen Cheng, Yaoxiong Wang

    Abstract: Event-based cameras, inspired by the biological retina, have evolved into cutting-edge sensors distinguished by their minimal power requirements, negligible latency, superior temporal resolution, and expansive dynamic range. At present, cameras used for pedestrian detection are mainly frame-based imaging sensors, which have suffered from lethargic response times and hefty data redundancy. In contr…

    Submitted 5 July, 2024; originally announced July 2024.

  3. arXiv:2407.02301  [pdf, other]

    cs.CL

    CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

    Authors: Ying Nie, Binwei Yan, Tianyu Guo, Hao Liu, Haoyu Wang, Wei He, Binfan Zheng, Weihao Wang, Qiang Li, Weijian Sun, Yunhe Wang, Dacheng Tao

    Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted, the most comprehensive evaluation benchmark to date, for assessing the financial knowledge of LLMs under Chinese context. In practice, to b…

    Submitted 2 July, 2024; originally announced July 2024.

  4. arXiv:2406.11903  [pdf, other]

    q-fin.GN cs.AI q-fin.CP

    A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges

    Authors: Yuqi Nie, Yaxuan Kong, Xiaowen Dong, John M. Mulvey, H. Vincent Poor, Qingsong Wen, Stefan Zohren

    Abstract: Recent advances in large language models (LLMs) have unlocked novel opportunities for machine learning applications in the financial domain. These models have demonstrated remarkable capabilities in understanding context, processing vast amounts of data, and generating human-preferred contents. In this survey, we explore the application of LLMs on various financial tasks, focusing on their potenti…

    Submitted 15 June, 2024; originally announced June 2024.

  5. arXiv:2406.08725  [pdf, other]

    cs.CR

    RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

    Authors: Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang

    Abstract: Modern large language model (LLM) developers typically conduct a safety alignment to prevent an LLM from generating unethical or harmful content. Recent studies have discovered that the safety alignment of LLMs can be bypassed by jailbreaking prompts. These prompts are designed to create specific conversation scenarios with a harmful question embedded. Querying an LLM with such prompts can mislead…

    Submitted 12 June, 2024; originally announced June 2024.

  6. arXiv:2406.08705  [pdf, other]

    cs.CR

    When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

    Authors: Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang

    Abstract: Recent studies developed jailbreaking attacks, which construct jailbreaking prompts to "fool" LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human efforts. More advanced attacks utilize genetic algorithms for automatic and black-box attacks. However, the random nature of genetic algorithms significantly limits the effe…

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.00256  [pdf, other]

    cs.IT cs.CR

    Over-the-Air Collaborative Inference with Feature Differential Privacy

    Authors: Mohamed Seif, Yuqi Nie, Andrea Goldsmith, Vincent Poor

    Abstract: Collaborative inference in next-generation networks can enhance Artificial Intelligence (AI) applications, including autonomous driving, personal identification, and activity classification. This method involves a three-stage process: a) data acquisition through sensing, b) feature extraction, and c) feature encoding for transmission. Transmission of the extracted features entails the potential ri…

    Submitted 31 May, 2024; originally announced June 2024.

  8. Correctable Landmark Discovery via Large Models for Vision-Language Navigation

    Authors: Bingqian Lin, Yunshuang Nie, Ziming Wei, Yi Zhu, Hang Xu, Shikui Ma, Jianzhuang Liu, Xiaodan Liang

    Abstract: Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack s…

    Submitted 5 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted by TPAMI 2024

  9. arXiv:2405.16783  [pdf, other]

    cs.CR cs.AI cs.LG

    TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models

    Authors: Yuzhou Nie, Yanting Wang, Jinyuan Jia, Michael J. De Lucia, Nathaniel D. Bastian, Wenbo Guo, Dawn Song

    Abstract: One key challenge in backdoor attacks against large foundation models is the resource limits. Backdoor attacks usually require retraining the target model, which is impractical for very large foundation models. Existing backdoor attacks are mainly designed for supervised classifiers or small foundation models (e.g., BERT). None of these attacks has successfully compromised a very large foundation…

    Submitted 26 May, 2024; originally announced May 2024.

  10. arXiv:2405.13581  [pdf, other]

    cs.CV cs.AI

    Safety Alignment for Vision Language Models

    Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

    Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the e…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 23 pages, 15 figures

  11. arXiv:2405.04390  [pdf, other]

    cs.CV

    DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

    Authors: Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai

    Abstract: Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by i…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR2024

  12. arXiv:2405.02357  [pdf, other]

    cs.LG

    Large Language Models for Mobility in Transportation Systems: A Survey on Forecasting Tasks

    Authors: Zijian Zhang, Yujie Sun, Zepu Wang, Yuqi Nie, Xiaobo Ma, Peng Sun, Ruolin Li

    Abstract: Mobility analysis is a crucial element in the research area of transportation systems. Forecasting traffic information offers a viable solution to address the conflict between increasing transportation demands and the limitations of transportation infrastructure. Predicting human travel is significant in aiding various transportation and urban management tasks, such as taxi dispatch and urban plan…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 9 pages

  13. arXiv:2404.15538  [pdf, other]

    cs.GR cs.AI cs.CL cs.LG

    DreamCraft: Text-Guided Generation of Functional 3D Environments in Minecraft

    Authors: Sam Earle, Filippos Kokkinos, Yuhe Nie, Julian Togelius, Roberta Raileanu

    Abstract: Procedural Content Generation (PCG) algorithms enable the automatic generation of complex and diverse artifacts. However, they don't provide high-level control over the generated content and typically require domain expertise. In contrast, text-to-3D methods allow users to specify desired characteristics in natural language, offering a high amount of flexibility and expressivity. But unlike PCG, s…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 16 pages, 9 figures, accepted to Foundation of Digital Games 2024

  14. arXiv:2404.15127  [pdf, other]

    cs.CV cs.CL

    MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning

    Authors: Sunan He, Yuxiang Nie, Zhixuan Chen, Zhiyuan Cai, Hongmei Wang, Shu Yang, Hao Chen

    Abstract: The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to con…

    Submitted 23 April, 2024; originally announced April 2024.

  15. arXiv:2404.03264  [pdf, other]

    cs.CY cs.AI

    Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions

    Authors: Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, Hao Chen

    Abstract: Foundation model, which is pre-trained on broad data and is able to adapt to a wide range of tasks, is advancing healthcare. It promotes the development of healthcare artificial intelligence (AI) models, breaking the contradiction between limited AI models and diverse healthcare practices. Much more widespread healthcare scenarios will benefit from the development of a healthcare foundation model…

    Submitted 4 April, 2024; originally announced April 2024.

  16. arXiv:2403.19632  [pdf, other]

    cs.CV

    GauStudio: A Modular Framework for 3D Gaussian Splatting and Beyond

    Authors: Chongjie Ye, Yinyu Nie, Jiahao Chang, Yuantao Chen, Yihao Zhi, Xiaoguang Han

    Abstract: We present GauStudio, a novel modular framework for modeling 3D Gaussian Splatting (3DGS) to provide standardized, plug-and-play components for users to easily customize and implement a 3DGS pipeline. Supported by our framework, we propose a hybrid Gaussian representation with foreground and skyball background models. Experiments demonstrate this representation reduces artifacts in unbounded outdo…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/GAP-LAB-CUHK-SZ/gaustudio

  17. arXiv:2403.19319  [pdf, other]

    cs.CV

    Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation

    Authors: Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Müller, Matthias Nießner

    Abstract: We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. Many 3D generative approaches represent 3D scenes as radiance fields for training. Their ground-truth radiance fields are usually fitted from multi-view renderings from a large-scale synthetic 3D dataset, which often results in artifacts due to occlusions or under-fitting issues.…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: Project page: https://terencecyj.github.io/projects/Mesh2NeRF/ Video: https://youtu.be/oufv1N3f7iY

  18. arXiv:2403.17636  [pdf, other]

    cs.CL

    Mix-Initiative Response Generation with Dynamic Prefix Tuning

    Authors: Yuxiang Nie, Heyan Huang, Xian-Ling Mao, Lizi Liao

    Abstract: Mixed initiative serves as one of the key factors in controlling conversation directions. For a speaker, responding passively or leading proactively would result in rather different responses. However, most dialogue systems focus on training a holistic response generation model without any distinction among different initiatives. It leads to the cross-contamination problem, where the model confuse…

    Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted to the main conference of NAACL 2024

  19. arXiv:2403.16558  [pdf, other]

    cs.CV

    Elysium: Exploring Object-level Perception in Videos via MLLM

    Authors: Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, Can Huang

    Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects acr…

    Submitted 29 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

  20. Foundation Models for Time Series Analysis: A Tutorial and Survey

    Authors: Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, Qingsong Wen

    Abstract: Time series analysis stands as a focal point within the data mining community, serving as a cornerstone for extracting valuable insights crucial to a myriad of real-world applications. Recent advances in Foundation Models (FMs) have fundamentally reshaped the paradigm of model design for time series analysis, boosting various downstream tasks in practice. These innovative approaches often leverage…

    Submitted 18 June, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'24)

  21. arXiv:2403.11401  [pdf, other]

    cs.CV cs.AI

    Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

    Authors: Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong

    Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently…

    Submitted 22 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

  22. arXiv:2403.07376  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

    Authors: Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, Xiaodan Liang

    Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offlin…

    Submitted 12 March, 2024; originally announced March 2024.

  23. arXiv:2402.02074  [pdf, other]

    cs.CV

    Multiple-Crop Human Mesh Recovery with Contrastive Learning and Camera Consistency in A Single Image

    Authors: Yongwei Nie, Changzhen Liu, Chengjiang Long, Qing Zhang, Guiqing Li, Hongmin Cai

    Abstract: We tackle the problem of single-image Human Mesh Recovery (HMR). Previous approaches are mostly based on a single crop. In this paper, we shift the single-crop HMR to a novel multiple-crop HMR paradigm. Cropping a human from image multiple times by shifting and scaling the original bounding box is feasible in practice, easy to implement, and incurs neglectable cost, but immediately enriches availa…

    Submitted 3 February, 2024; originally announced February 2024.

  24. arXiv:2402.01253  [pdf, other]

    cs.IR

    RimiRec: Modeling Refined Multi-interest in Hierarchical Structure for Recommendation

    Authors: Haolei Pei, Yuanyuan Xu, Yangping Zhu, Yuan Nie

    Abstract: Industrial recommender systems usually consist of the retrieval stage and the ranking stage, to handle the billion-scale of users and items. The retrieval stage retrieves candidate items relevant to user interests for recommendations and has attracted much attention. Frequently, a user shows refined multi-interests in a hierarchical structure. For example, a user likes Conan and Kuroba Kaito, whic…

    Submitted 5 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: 4 pages, 4 figures

  25. arXiv:2401.14121  [pdf, other]

    cs.CV

    Incorporating Exemplar Optimization into Training with Dual Networks for Human Mesh Recovery

    Authors: Yongwei Nie, Mingxian Fan, Chengjiang Long, Qing Zhang, Jian Zhu, Xuemiao Xu

    Abstract: We propose a novel optimization-based human mesh recovery method from a single image. Given a test exemplar, previous approaches optimize the pre-trained regression network to minimize the 2D re-projection loss, which however suffer from over-/under-fitting problems. This is because the "exemplar optimization" at testing time has too weak relation to the pre-training process, and the exemplar op…

    Submitted 25 January, 2024; originally announced January 2024.

  26. arXiv:2401.13551  [pdf, other]

    cs.CV

    Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection

    Authors: Yongwei Nie, Hao Huang, Chengjiang Long, Qing Zhang, Pradipta Maji, Hongmin Cai

    Abstract: Without human annotations, a typical Unsupervised Video Anomaly Detection (UVAD) method needs to train two models that generate pseudo labels for each other. In previous work, the two models are closely entangled with each other, and it is not known how to upgrade their method without modifying their training framework significantly. Second, previous work usually adopts fixed thresholding to obtai…

    Submitted 24 January, 2024; originally announced January 2024.

  27. arXiv:2401.12681  [pdf, other]

    cs.LG cs.AI

    Non-Neighbors Also Matter to Kriging: A New Contrastive-Prototypical Learning

    Authors: Zhishuai Li, Yunhao Nie, Ziyue Li, Lei Bai, Yisheng Lv, Rui Zhao

    Abstract: Kriging aims at estimating the attributes of unsampled geo-locations from observations in the spatial vicinity or physical connections, which helps mitigate skewed monitoring caused by under-deployed sensors. Existing works assume that neighbors' information offers the basis for estimating the attributes of the unobserved target while ignoring non-neighbors. However, non-neighbors could also offer…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: Accepted in AISTATS 2024

  28. arXiv:2401.08013  [pdf, other]

    cs.GT cs.MA econ.GN

    A Day-to-Day Dynamical Approach to the Most Likely User Equilibrium Problem

    Authors: Jiayang Li, Qianni Wang, Liyang Feng, Jun Xie, Yu Marco Nie

    Abstract: The lack of a unique user equilibrium (UE) route flow in traffic assignment has posed a significant challenge to many transportation applications. The maximum-entropy principle, which advocates for the consistent selection of the most likely solution as a representative, is often used to address the challenge. Built on a recently proposed day-to-day (DTD) discrete-time dynamical model called cumul…

    Submitted 15 January, 2024; originally announced January 2024.

  29. arXiv:2312.17276  [pdf, other]

    cs.CL cs.LG

    PanGu-$π$: Enhancing Language Model Architectures via Nonlinearity Compensation

    Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao

    Abstract: The recent trend of large language models (LLMs) is to increase the scale of both model size (a.k.a. the number of parameters) and dataset to achieve better generative ability, which is definitely proved by a lot of work such as the famous GPT and Llama. However, large models often involve massive computational costs, and practical applications cannot afford such high prices. However, the method of…

    Submitted 27 December, 2023; originally announced December 2023.

  30. arXiv:2312.12418  [pdf, other]

    cs.CV

    LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset

    Authors: Haolin Liu, Chongjie Ye, Yinyu Nie, Yingfan He, Xiaoguang Han

    Abstract: Instance shape reconstruction from a 3D scene involves recovering the full geometries of multiple objects at the semantic instance level. Many methods leverage data-driven learning due to the intricacies of scene complexity and significant indoor occlusions. Training these methods often requires a large-scale, high-quality dataset with aligned and paired shape annotations with real-world scans. Ex…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: homepage: https://gap-lab-cuhk-sz.github.io/LASA/

  31. arXiv:2312.06428  [pdf, other]

    cs.CV cs.AI cs.IR cs.LG

    VisionTraj: A Noise-Robust Trajectory Recovery Framework based on Large-scale Camera Network

    Authors: Zhishuai Li, Ziyue Li, Xiaoru Hu, Guoqing Du, Yunhao Nie, Feng Zhu, Lei Bai, Rui Zhao

    Abstract: Trajectory recovery based on the snapshots from the city-wide multi-camera network facilitates urban mobility sensing and driveway optimization. The state-of-the-art solutions devoted to such a vision-based scheme typically incorporate predefined rules or unsupervised iterative feedback, struggling with multi-fold challenges such as lack of open-source datasets for training the whole pipeline, and…

    Submitted 11 December, 2023; originally announced December 2023.

  32. arXiv:2312.05798  [pdf, other]

    cs.CV

    Disentangled Representation Learning for Controllable Person Image Generation

    Authors: Wenju Xu, Chengjiang Long, Yongwei Nie, Guanghui Wang

    Abstract: In this paper, we propose a novel framework named DRL-CPG to learn disentangled latent representation for controllable person image generation, which can produce realistic person images with desired poses and human attributes (e.g., pose, head, upper clothes, and pants) provided by various source persons. Unlike the existing works leveraging the semantic masks to obtain the representation of each…

    Submitted 10 December, 2023; originally announced December 2023.

  33. arXiv:2312.03441  [pdf, other]

    cs.CV

    UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

    Authors: Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, Changxin Gao

    Abstract: Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model to comprehend the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new dataset named UFine6…

    Submitted 6 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

  34. arXiv:2312.02902  [pdf, other]

    cs.CV

    HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

    Authors: Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero

    Abstract: 3D head animation has seen major quality and runtime improvements over the last few years, particularly empowered by the advances in differentiable rendering and neural radiance fields. Real-time rendering is a highly desirable goal for real-world applications. We propose HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper we introduce…

    Submitted 5 December, 2023; originally announced December 2023.

  35. arXiv:2312.01068  [pdf, other]

    cs.CV

    DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

    Authors: Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner

    Abstract: We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains very challenging, as the fittin…

    Submitted 8 April, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

    Comments: CVPR 2024; homepage: https://tangjiapeng.github.io/projects/DPHMs/

  36. arXiv:2312.00674  [pdf, other]

    cs.CV

    LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

    Authors: Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang

    Abstract: Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for trainin…

    Submitted 1 December, 2023; originally announced December 2023.

  37. arXiv:2311.09193  [pdf, other]

    cs.CL cs.AI cs.CV

    The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

    Authors: Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

    Abstract: The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improve…

    Submitted 15 November, 2023; originally announced November 2023.

  38. arXiv:2310.15081  [pdf, other]

    cs.CV

    E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion

    Authors: Maomao Li, Ge Yuan, Cairong Wang, Zhian Liu, Yong Zhang, Yongwei Nie, Jue Wang, Dong Xu

    Abstract: This paper proposes a novel approach to face swapping from the perspective of fine-grained facial editing, dubbed "editing for swapping" (E4S). The traditional face swapping methods rely on global feature extraction and fail to preserve the detailed source identity. In contrast, we propose a Regional GAN Inversion (RGI) method, which allows the explicit disentanglement of shape and texture. Specif…

    Submitted 27 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Project Page: https://e4s2024.github.io/ ; arXiv admin note: text overlap with arXiv:2211.14068

  39. arXiv:2310.13617  [pdf, other]

    cs.HC

    3D-Mirrorcle: Bridging the Virtual and Real through Depth Alignment in AR Mirror Systems

    Authors: Yujia Liu, Qi Xin, Chenzhuo Xiang, Yu Zhang, Lun Yiu Nie, Yingqing Xu

    Abstract: Smart mirrors have emerged as a new form of augmented reality (AR) interface for home environments. However, due to the parallax in human vision, one major challenge hindering their development is the depth misalignment between the 3D mirror reflection and the 2D screen display. This misalignment causes the display content to appear as if it is floating above the mirror, thereby disrupting the sea…

    Submitted 24 April, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

  40. arXiv:2309.15564  [pdf, other]

    cs.LG cs.CL cs.CV

    Jointly Training Large Autoregressive Multimodal Models

    Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz

    Abstract: In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that…

    Submitted 28 September, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

  41. arXiv:2309.14183  [pdf, other]

    cs.CV cs.AI

    Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition

    Authors: Wei He, Kai Han, Ying Nie, Chengcheng Wang, Yunhe Wang

    Abstract: The development of foundation vision models has pushed the general visual recognition to a high level, but cannot well address the fine-grained recognition in specialized domain such as invasive species classification. Identifying and managing invasive species has strong social and ecological value. Currently, most invasive species datasets are limited in scale and cover a narrow range of species,…

    Submitted 28 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted by NeurIPS 2023 Track Datasets and Benchmarks

  42. arXiv:2309.11331  [pdf, other]

    cs.CV cs.AI

    Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism

    Authors: Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Kai Han, Yunhe Wang

    Abstract: In the past years, YOLO-series models have emerged as the leading approaches in the area of real-time object detection. Many studies pushed up the baseline to a higher level by modifying the architecture, augmenting data and designing new losses. However, we find previous models still suffer from information fusion problem, although Feature Pyramid Network (FPN) and Path Aggregation Network (PANet…

    Submitted 23 October, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted by NeurIPS 2023

  43. arXiv:2308.09831  [pdf, other]

    eess.IV cs.CV

    Cross-modality Attention-based Multimodal Fusion for Non-small Cell Lung Cancer (NSCLC) Patient Survival Prediction

    Authors: Ruining Deng, Nazim Shaikh, Gareth Shannon, Yao Nie

    Abstract: Cancer prognosis and survival outcome predictions are crucial for therapeutic response estimation and for stratifying patients into various treatment groups. Medical domains concerned with cancer prognosis are abundant with multiple modalities, including pathological image data and non-image data such as genomic information. To date, multimodal learning has shown potential to enhance clinical pred…

    Submitted 27 February, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

  44. arXiv:2308.07496  [pdf, other]

    cs.LG cs.AI

    ST-MLP: A Cascaded Spatio-Temporal Linear Framework with Channel-Independence Strategy for Traffic Forecasting

    Authors: Zepu Wang, Yuqi Nie, Peng Sun, Nam H. Nguyen, John Mulvey, H. Vincent Poor

    Abstract: The criticality of prompt and precise traffic forecasting in optimizing traffic flow management in Intelligent Transportation Systems (ITS) has drawn substantial scholarly focus. Spatio-Temporal Graph Neural Networks (STGNNs) have been lauded for their adaptability to road graph structures. Yet, current research on STGNNs architectures often prioritizes complex designs, leading to elevated computa…

    Submitted 14 August, 2023; originally announced August 2023.

  45. arXiv:2308.07307  [pdf, other]

    cs.AI

    Extend Wave Function Collapse to Large-Scale Content Generation

    Authors: Yuhe Nie, Shaoming Zheng, Zhan Zhuang, Xuan Song

    Abstract: Wave Function Collapse (WFC) is a widely used tile-based algorithm in procedural content generation, including textures, objects, and scenes. However, the current WFC algorithm and related research lack the ability to generate commercialized large-scale or infinite content due to constraint conflict and time complexity costs. This paper proposes a Nested WFC (N-WFC) algorithm framework to reduce t…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE Conference on Games 2023 (nominated for the Best Paper Award)

  46. arXiv:2308.07234  [pdf, other]

    cs.CV cs.RO

    UniWorld: Autonomous Driving Pre-training via World Models

    Authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai

    Abstract: In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as World Models for robots. We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants. UniWorld involves initially predicting 4D geometric occupancy as the World Models…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: 8 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2305.18829

  47. arXiv:2308.04468  [pdf, other]

    cs.CV

    3D Scene Diffusion Guidance using Scene Graphs

    Authors: Mohammad Naanaa, Katharina Schmid, Yinyu Nie

    Abstract: Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion models have shown promise in generating diverse data, including 3D scenes. However, current methods rely directly on text embeddings for controlling the generation, limiting the incorporation of complex spatial relationships between objects. We propose a novel approach for 3D scene diffusion guidance using scene graphs. To…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 5 figures

  48. arXiv:2307.13716  [pdf, other]

    cs.LG cs.AI

    FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement Learning

    Authors: Leiming Chen, Weishan Zhang, Cihao Dong, Sibo Qiao, Ziling Huang, Yuming Nie, Zhaoxiang Hou, Chee Wei Tan

    Abstract: Traditional federated learning uses the number of samples to calculate the weights of each client model and uses this fixed weight value to fusion the global model. However, in practical scenarios, each client's device and data heterogeneity leads to differences in the quality of each client's model. Thus the contribution to the global model is not wholly determined by the sample size. In addition…

    Submitted 19 March, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

  49. arXiv:2307.09288  [pdf, other]

    cs.CL cs.AI

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Authors: Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, et al. (43 additional authors not shown)

    Abstract: In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be…

    Submitted 19 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

  50. arXiv:2307.02871  [pdf, other]

    cs.RO

    Contrastive Label Disambiguation for Self-Supervised Terrain Traversability Learning in Off-Road Environments

    Authors: Hanzhang Xue, Xiaochang Hu, Rui Xie, Hao Fu, Liang Xiao, Yiming Nie, Bin Dai

    Abstract: Discriminating the traversability of terrains is a crucial task for autonomous driving in off-road environments. However, it is challenging due to the diverse, ambiguous, and platform-specific nature of off-road traversability. In this paper, we propose a novel self-supervised terrain traversability learning framework, utilizing a contrastive label disambiguation mechanism. Firstly, weakly labeled…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: 9 pages, 11 figures