Showing 1–13 of 13 results for author: Rao, F

Searching in archive cs.
  1. arXiv:2408.11795  [pdf, other]

    cs.CV

    EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

    Authors: Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

    Abstract: In the realm of multimodal research, numerous studies leverage substantial image-text pairs to conduct modal alignment learning, transforming Large Language Models (LLMs) into Multimodal LLMs and excelling in a variety of visual-language tasks. The prevailing methodologies primarily fall into two categories: self-attention-based and cross-attention-based methods. While self-attention-based methods…

    Submitted 21 August, 2024; originally announced August 2024.

  2. arXiv:2405.20339  [pdf, other]

    cs.CV

    Visual Perception by Large Language Model's Weights

    Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

    Abstract: Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational eff…

    Submitted 30 May, 2024; originally announced May 2024.

  3. arXiv:2405.19333  [pdf, other]

    cs.CV

    Multi-Modal Generative Embedding Model

    Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

    Abstract: Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to achieve only one model per modality in this work. We propose a Multi-Modal Generativ…

    Submitted 29 May, 2024; originally announced May 2024.

  4. arXiv:2403.11882  [pdf, other]

    cs.CV cs.AI

    ReGenNet: Towards Human Action-Reaction Synthesis

    Authors: Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng

    Abstract: Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic i…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024, Project Page: https://liangxuy.github.io/ReGenNet/

  5. arXiv:2401.08086  [pdf, other]

    cs.CV

    Spatial-Semantic Collaborative Cropping for User Generated Content

    Authors: Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, Qingyao Wu

    Abstract: A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people worldwide through the client side (e.g., mobile and PC). This requires cropping algorithms to produce an aesthetic thumbnail within a specific aspect ratio on different devices. However, existing image cropping works mainly focus on landmark or landscape images, which fail to model the…

    Submitted 15 January, 2024; originally announced January 2024.

  6. arXiv:2312.16051  [pdf, other]

    cs.CV

    Inter-X: Towards Versatile Human-Human Interaction Analysis

    Authors: Liang Xu, Xintao Lv, Yichao Yan, Xin Jin, Shuwen Wu, Congsheng Xu, Yifan Liu, Yizhou Zhou, Fengyun Rao, Xingdong Sheng, Yunhui Liu, Wenjun Zeng, Xiaokang Yang

    Abstract: The analysis of the ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and a lack of hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset with accur…

    Submitted 26 December, 2023; originally announced December 2023.

    Comments: Project page: https://liangxuy.github.io/inter-x/

  7. arXiv:2305.18072  [pdf, other]

    cs.CV

    Image Captioning with Multi-Context Synthetic Data

    Authors: Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

    Abstract: Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data col…

    Submitted 19 December, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted by AAAI 2024

  8. arXiv:2305.15679  [pdf, other]

    cs.CV

    A Similarity Alignment Model for Video Copy Segment Matching

    Authors: Zhenhua Liu, Feipeng Ma, Tianyi Wang, Fengyun Rao

    Abstract: With the development of multimedia technology, Video Copy Detection has become a crucial problem for social media platforms. Meta AI held the Video Similarity Challenge at CVPR 2023 to push the technology forward. In this report, we share our winning solution for the Matching Track. We propose a Similarity Alignment Model (SAM) for video copy segment matching. Our SAM exhibits superior performance compared to…

    Submitted 24 May, 2023; originally announced May 2023.

  9. arXiv:2305.12361  [pdf, other]

    cs.CV

    A Dual-level Detection Method for Video Copy Detection

    Authors: Tianyi Wang, Feipeng Ma, Zhenhua Liu, Fengyun Rao

    Abstract: With the development of multimedia technology, Video Copy Detection has become a crucial problem for social media platforms. Meta AI held the Video Similarity Challenge at CVPR 2023 to push the technology forward. In this paper, we share our winning solutions on both tracks to help progress in this area. For the Descriptor Track, we propose a dual-level detection method with Video Editing Detection (VED) and…

    Submitted 21 May, 2023; originally announced May 2023.

  10. arXiv:2112.04966  [pdf, other]

    cs.CV

    CA-SSL: Class-Agnostic Semi-Supervised Learning for Detection and Segmentation

    Authors: Lu Qi, Jason Kuen, Zhe Lin, Jiuxiang Gu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen, Ming-Hsuan Yang, Jiaya Jia

    Abstract: To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data. We show that these two approaches, at the two extreme ends of the task-specificity spectrum, are suboptimal for the task performance. Utilizing too few task-specific training signals causes underfi…

    Submitted 19 July, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: Appeared in ECCV2022

  11. arXiv:2110.06615  [pdf, other]

    cs.CV

    CLIP4Caption: CLIP for Video Caption

    Authors: Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

    Abstract: Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps between videos and texts. To bridge this gap, in this paper, we propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-tex…

    Submitted 13 October, 2021; originally announced October 2021.

  12. arXiv:2110.05204  [pdf, other]

    cs.CV cs.LG

    CLIP4Caption ++: Multi-CLIP for Video Caption

    Authors: Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li

    Abstract: This report describes our solution to the VALUE Challenge 2021 in the captioning task. Our solution, named CLIP4Caption++, is built on X-Linear/X-Transformer, which is an advanced model with encoder-decoder architecture. We make the following improvements on the proposed CLIP4Caption++: We employ an advanced encoder-decoder model architecture X-Transformer as our main framework and make the follow…

    Submitted 14 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: 4 pages, VALUE Challenge 2021 captioning task championship solution

  13. arXiv:1412.4378  [pdf, ps, other]

    cs.CR cs.DB

    Privacy-Preserving and Outsourced Multi-User k-Means Clustering

    Authors: Bharath K. Samanthula, Fang-Yu Rao, Elisa Bertino, Xun Yi, Dongxi Liu

    Abstract: Many techniques for privacy-preserving data mining (PPDM) have been investigated over the past decade. Often, the entities involved in the data mining process are end-users or organizations with limited computing and storage resources. As a result, such entities may want to refrain from participating in the PPDM process. To overcome this issue and to take advantage of many other benefits of cloud computing, ou…

    Submitted 14 December, 2014; originally announced December 2014.

    Comments: 16 pages, 2 figures, 5 tables

    ACM Class: D.4.6; E.3; H.3.3