Showing 1–11 of 11 results for author: Fei, N

Searching in archive cs.
  1. arXiv:2403.04343

    cs.AI

    CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning

    Authors: Yanqi Dai, Dong Jing, Nanyi Fei, Zhiwu Lu

    Abstract: Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance due to different instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm…

    Submitted 7 March, 2024; originally announced March 2024.

  2. arXiv:2307.15429

    cs.LG cs.AI cs.CV

    Improvable Gap Balancing for Multi-Task Learning

    Authors: Yanqi Dai, Nanyi Fei, Zhiwu Lu

    Abstract: In multi-task learning (MTL), gradient balancing has recently attracted more research interest than loss balancing since it often leads to better performance. However, loss balancing is much more efficient than gradient balancing, and thus it is still worth further exploration in MTL. Note that prior studies typically ignore that there exist varying improvable gaps across multiple tasks, where the…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023)

  3. arXiv:2305.13311

    cs.CV

    VDT: General-purpose Video Diffusion Transformers via Mask Modeling

    Authors: Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding

    Abstract: This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherited in transformers. We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the mo…

    Submitted 11 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  4. arXiv:2209.11388

    cs.CV cs.AI cs.MM

    LGDN: Language-Guided Denoising Network for Video-Language Modeling

    Authors: Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu

    Abstract: Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and text description are semantically correlated, and focus on video-language modeling at video level. However, this hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is difficult to cover all frames with a single video…

    Submitted 5 December, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

    Comments: Accepted by NeurIPS2022

  5. arXiv:2208.08263

    cs.NE cs.AI cs.MM

    Multimodal foundation models are better simulators of the human brain

    Authors: Haoyu Lu, Qiongyi Zhou, Nanyi Fei, Zhiwu Lu, Mingyu Ding, Jingyuan Wen, Changde Du, Xin Zhao, Hao Sun, Huiguang He, Ji-Rong Wen

    Abstract: Multimodal learning, especially large-scale multimodal pre-training, has developed rapidly over the past few years and led to the greatest advances in artificial intelligence (AI). Despite its effectiveness, understanding the underlying mechanism of multimodal pre-training models still remains a grand challenge. Revealing the explainability of such models is likely to enable breakthroughs of novel…

    Submitted 17 August, 2022; originally announced August 2022.

  6. arXiv:2204.07441

    cs.CV cs.CL cs.IR

    COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

    Authors: Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen

    Abstract: Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance, however, they only consider instance-level alignment between the two streams (thus there is still room for i…

    Submitted 20 May, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: Accepted by CVPR2022

  7. arXiv:2203.14101   

    cs.LG cs.AI cs.CL

    A Roadmap for Big Model

    Authors: Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, et al. (75 additional authors not shown)

    Abstract: With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the BM application in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. In this paper, we cover not only the BM…

    Submitted 20 April, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

    Comments: This report has been withdrawn by the authors due to critical issues in Section 2.3.1 of Article 2

  8. Towards artificial general intelligence via a multimodal foundation model

    Authors: Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Hao Sun, Ji-Rong Wen

    Abstract: The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of human. Despite tremendous success in the AI research, most of existing methods have only single-cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly…

    Submitted 8 June, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

    Comments: Published by Nature Communications, see https://www.nature.com/articles/s41467-022-30761-2

  9. arXiv:2101.09499

    cs.CV

    Contrastive Prototype Learning with Augmented Embeddings for Few-Shot Learning

    Authors: Yizhao Gao, Nanyi Fei, Guangzhen Liu, Zhiwu Lu, Tao Xiang, Songfang Huang

    Abstract: Most recent few-shot learning (FSL) methods are based on meta-learning with episodic training. In each meta-training episode, a discriminative feature embedding and/or classifier are first constructed from a support set in an inner loop, and then evaluated in an outer loop using a query set for model updating. This query set sample centered learning objective is however intrinsically limited in ad…

    Submitted 23 January, 2021; originally announced January 2021.

  10. arXiv:2002.04274   

    cs.LG stat.ML

    Meta-Learning across Meta-Tasks for Few-Shot Learning

    Authors: Nanyi Fei, Zhiwu Lu, Yizhao Gao, Jia Tian, Tao Xiang, Ji-Rong Wen

    Abstract: Existing meta-learning based few-shot learning (FSL) methods typically adopt an episodic training strategy whereby each episode contains a meta-task. Across episodes, these tasks are sampled randomly and their relationships are ignored. In this paper, we argue that the inter-meta-task relationships should be exploited and those tasks are sampled strategically to assist in meta-learning. Specifical…

    Submitted 26 September, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

    Comments: There are some mistakes in the experiments. We thus choose to withdraw this paper

  11. arXiv:1812.04427

    cs.CV

    Zero-Shot Learning with Sparse Attribute Propagation

    Authors: Nanyi Fei, Jiechao Guan, Zhiwu Lu, Tao Xiang, Ji-Rong Wen

    Abstract: Zero-shot learning (ZSL) aims to recognize a set of unseen classes without any training images. The standard approach to ZSL requires a set of training images annotated with seen class labels and a semantic descriptor for seen/unseen classes (attribute vector is the most widely used). Class label/attribute annotation is expensive; it thus severely limits the scalability of ZSL. In this paper, we d…

    Submitted 18 March, 2019; v1 submitted 11 December, 2018; originally announced December 2018.