
Showing 1–50 of 179 results for author: Mei, T

Searching in archive cs.
  1. arXiv:2408.06357  [pdf]

    cs.CV cs.AI

    Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description

    Authors: Xiaohan Cheng, Taiyuan Mei, Yun Zi, Qi Wang, Zijun Gao, Haowei Yang

    Abstract: Zero sample learning is an effective method for data deficiency. The existing embedded zero sample learning methods only use the known classes to construct the embedded space, so there is an overfitting of the known classes in the testing process. This project uses category semantic similarity measures to classify multiple tags. This enables it to incorporate unknown classes that have the same mea…

    Submitted 25 July, 2024; originally announced August 2024.

  2. arXiv:2407.16341  [pdf, other]

    cs.CV

    Motion Capture from Inertial and Vision Sensors

    Authors: Xiaodong Chen, Wu Liu, Qian Bao, Xinchen Liu, Quanwei Yang, Ruoli Dai, Tao Mei

    Abstract: Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: 17 pages, 9 figures

  3. arXiv:2407.00247  [pdf, other]

    cs.CV

    Prompt Refinement with Image Pivot for Text-to-Image Generation

    Authors: Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei

    Abstract: For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages". However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement mod…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: Accepted by ACL 2024

  4. arXiv:2406.01605  [pdf, other]

    eess.IV cs.CV

    An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

    Authors: Zijun Gao, Qi Wang, Taiyuan Mei, Xiaohan Cheng, Yun Zi, Haowei Yang

    Abstract: The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the…

    Submitted 26 May, 2024; originally announced June 2024.

  5. arXiv:2405.11704  [pdf]

    cs.LG cs.AI

    Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

    Authors: Taiyuan Mei, Yun Zi, Xiaohan Cheng, Zijun Gao, Qi Wang, Haowei Yang

    Abstract: The internal structure and operation mechanism of large-scale language models are analyzed theoretically, especially how Transformer and its derivative architectures can restrict computing efficiency while capturing long-term dependencies. Further, we dig deep into the efficiency bottleneck of the training phase, and evaluate in detail the contribution of adaptive optimization algorithms (such as…

    Submitted 19 May, 2024; originally announced May 2024.

  6. arXiv:2403.17870  [pdf, other]

    cs.CV cs.MM

    Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

    Authors: Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei

    Abstract: Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prio…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  7. arXiv:2403.17005  [pdf, other]

    cs.CV cs.MM

    TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

    Authors: Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei

    Abstract: Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate static image (i.e., image-to-video generation). The difficulty originates from the aspect that the diffusion process of subsequent animated frames should not only preserve the faithful alignment with the given imag…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR 2024; Project page: https://trip-i2v.github.io/TRIP/

  8. arXiv:2403.17004  [pdf, other]

    cs.CV cs.MM

    SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

    Authors: Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen

    Abstract: Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning. Despite this progress, mask strategy still suffers from two inherent limit…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  9. arXiv:2403.17001  [pdf, other]

    cs.CV cs.MM

    VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation

    Authors: Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, Tao Mei

    Abstract: Recent innovations on text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly result in distorted 3D models with unrealistic textures or cross-view inconsistency is…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR 2024; Project page: https://vp3d-cvpr24.github.io

  10. arXiv:2403.17000  [pdf, other]

    cs.CV cs.MM

    Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

    Authors: Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei

    Abstract: Diffusion models are just at a tipping point for image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  11. arXiv:2403.11999  [pdf, other]

    cs.CV cs.MM

    HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

    Authors: Ting Yao, Yehao Li, Yingwei Pan, Tao Mei

    Abstract: The hybrid deep models of Vision Transformer (ViT) and Convolution Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably suffers from heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (nam…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

  12. arXiv:2401.01256  [pdf, other]

    cs.CV cs.CL

    VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM

    Authors: Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei

    Abstract: The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: Project website: https://videodrafter.github.io

  13. arXiv:2311.05464  [pdf, other]

    cs.CV cs.MM

    3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

    Authors: Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Tao Mei

    Abstract: 3D content creation via text-driven stylization has posed a fundamental challenge to the multimedia and graphics community. Recent advances of cross-modal foundation models (e.g., CLIP) have made this problem feasible. Those approaches commonly leverage CLIP to align the holistic semantics of stylized mesh with the given text prompt. Nevertheless, it is not trivial to enable more controllable styliza…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: ACM Multimedia 2023

  14. arXiv:2311.05463  [pdf, other]

    cs.CV cs.MM

    ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

    Authors: Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei

    Abstract: Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in the field of text-to-image generation. In this paper, we propose a new task for "stylizing" text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. Given input text pro…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: ACM Multimedia 2023

  15. arXiv:2311.05461  [pdf, other]

    cs.CV cs.MM

    Control3D: Towards Controllable Text-to-3D Generation

    Authors: Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei

    Abstract: Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively control and shape the synthetic 3D contents according to users' desired specifications (e.g., sk…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: ACM Multimedia 2023

  16. Bidirectional Knowledge Reconfiguration for Lightweight Point Cloud Analysis

    Authors: Peipei Li, Xing Cui, Yibo Hu, Man Zhang, Ting Yao, Tao Mei

    Abstract: Point cloud analysis faces computational system overhead, limiting its application on mobile or edge devices. Directly employing small models may result in a significant drop in performance since it is difficult for a small model to adequately capture local structure and global shape information simultaneously, which are essential clues for point cloud analysis. This paper explores feature distill…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted by IEEE Transactions on Multimedia (TMM)

    Journal ref: IEEE Transactions on Multimedia (Early Access), 02 October 2023

  17. arXiv:2309.09534  [pdf, other]

    cs.CV

    Selective Volume Mixup for Video Action Recognition

    Authors: Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Xiangnan He, Tao Mei

    Abstract: The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the overfitting effect on small-scale datasets with a limited number of training videos. A common solution is to exploit the existing image augmentation strategies for…

    Submitted 18 September, 2023; originally announced September 2023.

  18. Learning and Evaluating Human Preferences for Conversational Head Generation

    Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

    Abstract: A reliable and comprehensive evaluation metric that aligns with manual preference assessments is crucial for conversational head video synthesis methods development. Existing quantitative evaluations often fail to capture the full complexity of human preference, as they only consider limited evaluation dimensions. Qualitative evaluations and user studies offer a solution but are time-consuming and…

    Submitted 2 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted by ACM Multimedia 2023

  19. arXiv:2306.16645  [pdf, other]

    cs.CV cs.MM

    Deep Equilibrium Multimodal Fusion

    Authors: Jinhong Ni, Yalong Bai, Wei Zhang, Ting Yao, Tao Mei

    Abstract: Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently. Most existing fusion approaches either learn a fixed fusion strategy during training and inference, or are only capable of fusing the information to a certain extent. Such solutions may fail to fully capture the dynamics of interactions across modalities especially when…

    Submitted 28 June, 2023; originally announced June 2023.

  20. Visual-Aware Text-to-Speech

    Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

    Abstract: Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and s…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: accepted as oral and top 3% paper by ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 1-5

  21. arXiv:2306.02850  [pdf, other]

    cs.CV

    TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments

    Authors: Yu Sun, Qian Bao, Wu Liu, Tao Mei, Michael J. Black

    Abstract: Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that…

    Submitted 20 November, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: Project page: https://www.yusun.work/TRACE/TRACE.html

  22. arXiv:2303.07123  [pdf, other]

    cs.CV cs.AI cs.LG

    Modality-Agnostic Debiasing for Single Domain Generalization

    Authors: Sanqing Qu, Yingwei Pan, Guang Chen, Ting Yao, Changjun Jiang, Tao Mei

    Abstract: Deep neural networks (DNNs) usually fail to generalize well to outside of distribution (OOD) data, especially in the extreme case of single domain generalization (single-DG) that transfers DNNs from single domain to multiple unseen domains. Existing single-DG techniques commonly devise various data-augmentation algorithms, and remould the multi-source domain generalization methodology to learn dom…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: To appear in CVPR-2023

  23. arXiv:2212.04744  [pdf, other]

    cs.CV cs.AI

    Weakly Supervised Semantic Segmentation for Large-Scale Point Cloud

    Authors: Yachao Zhang, Zonghao Li, Yuan Xie, Yanyun Qu, Cuihua Li, Tao Mei

    Abstract: Existing methods for large-scale point cloud semantic segmentation require expensive, tedious and error-prone manual point-wise annotations. Intuitively, weakly supervised training is a direct solution to reduce the cost of labeling. However, for weakly supervised large-scale point cloud semantic segmentation, too few annotations will inevitably lead to ineffective learning of network. We propose…

    Submitted 9 December, 2022; originally announced December 2022.

  24. arXiv:2212.03099  [pdf, other]

    cs.CV cs.CL cs.MM

    Semantic-Conditional Diffusion Networks for Image Captioning

    Authors: Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei

    Abstract: Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet

  25. arXiv:2211.08252  [pdf, other]

    cs.CV

    Dynamic Temporal Filtering in Video Models

    Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

    Abstract: Video temporal dynamics is conventionally modeled with 3D spatial-temporal kernel or its factorized version comprised of 2D spatial kernel and 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of a kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive fields and the fixed weights treat e…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: ECCV 2022. Source code is available at https://github.com/FuchenUSTC/DTF

  26. arXiv:2211.08250  [pdf, other]

    cs.CV

    SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement

    Authors: Zhaofan Qiu, Yehao Li, Yu Wang, Yingwei Pan, Ting Yao, Tao Mei

    Abstract: In this paper, we propose a novel deep architecture tailored for 3D point cloud applications, named as SPE-Net. The embedded "Selective Position Encoding (SPE)" procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input. Such encoded rotation condition then determines which part of the network parameters to be focused on, and is shown…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: ECCV 2022. Source code is available at https://github.com/ZhaofanQiu/SPE-Net

  27. arXiv:2211.08249  [pdf, other]

    cs.CV

    Explaining Cross-Domain Recognition with Interpretable Deep Classifier

    Authors: Yiheng Zhang, Ting Yao, Zhaofan Qiu, Tao Mei

    Abstract: The recent advances in deep learning predominantly construct models in their internal representations, and it is opaque to explain the rationale behind and decisions to human users. Such explainability is especially essential for domain adaptation, whose challenges require developing more adaptive models across different domains. In this paper, we ask the question: how much each sample in source d…

    Submitted 15 November, 2022; originally announced November 2022.

  28. arXiv:2211.08248  [pdf, other]

    cs.CV

    3D Cascade RCNN: High Quality Object Detection in Point Clouds

    Authors: Qi Cai, Yingwei Pan, Ting Yao, Tao Mei

    Abstract: Recent progress on 2D object detection has featured Cascade RCNN, which capitalizes on a sequence of cascade detectors to progressively improve proposal quality, towards high-quality object detection. However, there has not been evidence in support of building such cascade structures for 3D object detection, a challenging detection scenario with highly sparse LiDAR point clouds. In this work, we p…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: IEEE Transactions on Image Processing (TIP) 2022. The source code is publicly available at https://github.com/caiqi/Cascasde-3D

  29. arXiv:2209.12807  [pdf, other]

    cs.LG cs.CV

    Out-of-Distribution Detection with Hilbert-Schmidt Independence Optimization

    Authors: Jingyang Lin, Yu Wang, Qi Cai, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei

    Abstract: Outlier detection tasks have been playing a critical role in AI safety. There has been a great challenge to deal with this task. Observations show that deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence. Existing works attempt to solve the problem by explicitly imposing uncertainty on classifiers w…

    Submitted 26 September, 2022; originally announced September 2022.

    Comments: Source code is available at https://github.com/jylins/hood

  30. arXiv:2209.03665  [pdf, other]

    cs.CV cs.AI

    Generalized One-shot Domain Adaptation of Generative Adversarial Networks

    Authors: Zicheng Zhang, Yinglu Liu, Congying Han, Tiande Guo, Ting Yao, Tao Mei

    Abstract: The adaptation of a Generative Adversarial Network (GAN) aims to transfer a pre-trained GAN to a target domain with limited training data. In this paper, we focus on the one-shot case, which is more challenging and rarely explored in previous works. We consider that the adaptation from a source domain to a target domain can be decoupled into two parts: the transfer of global style like texture and…

    Submitted 13 October, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

    Comments: NeurIPS 2022

  31. WOC: A Handy Webcam-based 3D Online Chatroom

    Authors: Chuanhang Yan, Yu Sun, Qian Bao, Jinhui Pang, Wu Liu, Tao Mei

    Abstract: We develop WOC, a webcam-based 3D virtual online chatroom for multi-person interaction, which captures the 3D motion of users and drives their individual 3D virtual avatars in real-time. Compared to the existing wearable equipment-based solution, WOC offers convenient and low-cost 3D motion capture with a single camera. To promote the immersive chat experience, WOC provides high-fidelity virtual a…

    Submitted 17 March, 2023; v1 submitted 1 September, 2022; originally announced September 2022.

  32. arXiv:2209.00407  [pdf, other]

    cs.CV

    MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition

    Authors: Xiaodong Chen, Wu Liu, Xinchen Liu, Yongdong Zhang, Jungong Han, Tao Mei

    Abstract: Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to its wide applications like automatic driving, robotics, and so on. However, current methods for point cloud action recognition usually require a huge amount of data with manual annotations and a complex backbone network with high computation costs, which makes it impractical…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: 11 pages, 7 figures

  33. arXiv:2207.13600  [pdf, other]

    cs.CV

    Lightweight and Progressively-Scalable Networks for Semantic Segmentation

    Authors: Yiheng Zhang, Ting Yao, Zhaofan Qiu, Tao Mei

    Abstract: Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation. The problem nevertheless is not trivial especially for the real-world deployments, which often demand high efficiency in inference latency. In this paper, we thoroughly analyze the design of convolutional blocks (the type of convolutions and the number of channels in convolutions), and t…

    Submitted 27 July, 2022; originally announced July 2022.

  34. arXiv:2207.04978  [pdf, other]

    cs.CV cs.LG

    Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

    Authors: Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

    Abstract: Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such over-aggre…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: ECCV 2022. Source code is available at https://github.com/YehLi/ImageNetModel

  35. arXiv:2207.04976  [pdf, other]

    cs.CV cs.AI

    Dual Vision Transformer

    Authors: Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei

    Abstract: Prior works have proposed several strategies to reduce the computational cost of self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures that each incurs a much smaller computational complexity. However, regional information is typically only achieved at the expense of undesirable information lost owing t…

    Submitted 12 July, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: Source code is available at https://github.com/YehLi/ImageNetModel

  36. arXiv:2206.13078  [pdf, other]

    cs.CV

    Video2StyleGAN: Encoding Video in Latent Space for Manipulation

    Authors: Jiyang Yu, Jingen Liu, Jing Huang, Wei Zhang, Tao Mei

    Abstract: Many recent works have been proposed for face image editing by leveraging the latent space of pretrained GANs. However, few attempts have been made to directly apply them to videos, because 1) they do not guarantee temporal consistency, 2) their application is limited by their processing speed on videos, and 3) they cannot accurately encode details of face motion and expression. To this end, we pr…

    Submitted 27 June, 2022; originally announced June 2022.

  37. arXiv:2206.10491  [pdf, other]

    cs.CV cs.MM

    Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

    Authors: Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, Tao Mei

    Abstract: The leverage of large volumes of web videos paired with the searched queries or surrounding texts (e.g., title) offers an economic and extensible alternative to supervised video representation learning. Nevertheless, modeling such weakly visual-textual connection is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., same syntactic structure of…

    Submitted 21 June, 2022; originally announced June 2022.

  38. arXiv:2206.06931  [pdf, other]

    cs.CV cs.AI cs.MM

    Stand-Alone Inter-Frame Attention in Video Models

    Authors: Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei

    Abstract: Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spatial and temporal convolutions separately, or computing self-attention along temporal dimension. The implicit assumption behind such successes is that the featur…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: CVPR 2022; Code is publicly available at: https://github.com/FuchenUSTC/SIFA

  39. arXiv:2206.06930  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Comprehending and Ordering Semantics for Image Captioning

    Authors: Yehao Li, Yingwei Pan, Ting Yao, Tao Mei

    Abstract: Comprehending the rich semantics in an image and ordering them in linguistic order are essential to compose a visually-grounded and linguistically coherent description for image captioning. Modern techniques commonly capitalize on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of semantics under-exploited. In this paper, w…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: CVPR 2022; Code is publicly available at: https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet

  40. arXiv:2206.06292  [pdf, other]

    cs.CV cs.AI cs.MM

    MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

    Authors: Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

    Abstract: Convolutional Neural Networks (CNNs) have been regarded as the go-to models for visual recognition. More recently, convolution-free networks, based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs), become more and more popular. Nevertheless, it is not trivial when utilizing these newly-minted networks for video recognition due to the large variations and complexities in video d…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: CVPR 2022; Code is publicly available at: https://github.com/ZhaofanQiu/MLP-3D

  41. arXiv:2206.06291  [pdf, other]

    cs.CV cs.AI cs.MM

    Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection

    Authors: Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, Chang-Wen Chen

    Abstract: Recent high-performing Human-Object Interaction (HOI) detection techniques have been highly influenced by Transformer-based object detector (i.e., DETR). Nevertheless, most of them directly map parametric interaction queries into a set of HOI predictions through vanilla Transformer in a one-stage manner. This leaves rich inter- or intra-interaction structure under-exploited. In this work, we desig…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: CVPR 2022; Code is publicly available at: https://github.com/zyong812/STIP

  42. arXiv:2206.06289  [pdf, other]

    cs.CV cs.LG cs.MM cs.RO

    Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation

    Authors: Yingwei Pan, Yehao Li, Yiheng Zhang, Qi Cai, Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei

    Abstract: This paper presents an overview and comparative analysis of our systems designed for the following two tracks in SAPIEN ManiSkill Challenge 2021: No Interaction Track: The No Interaction track targets for learning policies from pre-collected demonstration trajectories. We investigate both imitation learning-based approach, i.e., imitating the observed behavior using classical supervised learning…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Accepted by ICLR 2022 Workshop on Generalizable Policy Learning in Physical World. Top-performing systems for both no interaction and no restriction tracks in SAPIEN ManiSkill Challenge 2021. The source code and model are publicly available at: https://github.com/caiqi/Silver-Bullet-3D/

  43. arXiv:2206.01017  [pdf, other]

    cs.CV

    Structured Two-stream Attention Network for Video Question Answering

    Authors: Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, Heng Tao Shen

    Abstract: To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA that focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures o…

    Submitted 2 June, 2022; originally announced June 2022.

  44. arXiv:2204.02569  [pdf, other]

    cs.CV

    Gait Recognition in the Wild with Dense 3D Representations and A Benchmark

    Authors: Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei

    Abstract: Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes. However, humans live and walk in the unconstrained 3D space, so projecting the 3D human body onto the 2D plane will discard a lot of crucial information like the viewpoint, shape, and dynamics for gait recognition. Therefore, this paper aims to explore…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: 16 pages, 11 figures, CVPR 2022 accepted, project page: https://gait3d.github.io/

  45. arXiv:2204.00942  [pdf, other]

    cs.CV

    A-ACT: Action Anticipation through Cycle Transformations

    Authors: Akash Gupta, Jingen Liu, Liefeng Bo, Amit K. Roy-Chowdhury, Tao Mei

    Abstract: While action anticipation has garnered a lot of research interest recently, most of the works focus on anticipating future action directly through observed visual cues only. In this work, we take a step back to analyze how the human capability to anticipate the future can be transferred to machine learning algorithms. To incorporate this ability in intelligent systems a question worth pondering up…

    Submitted 2 April, 2022; originally announced April 2022.

  46. arXiv:2203.05922  [pdf, other]

    cs.CV

    Visualizing and Understanding Patch Interactions in Vision Transformer

    Authors: Jie Ma, Yalong Bai, Bineng Zhong, Wei Zhang, Ting Yao, Tao Mei

    Abstract: Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite having good success, the literature seldom explores the explainability of vision transformer, and there is no clear picture of how the attention mechanism with respect to…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: 15 pages, 14 figures

  47. arXiv:2203.04476  [pdf, other]

    cs.CV cs.AI

    Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework

    Authors: Xiaodong Chen, Xinchen Liu, Wu Liu, Kun Liu, Dong Wu, Yongdong Zhang, Tao Mei

    Abstract: Action recognition from videos, i.e., classifying a video into one of the pre-defined action types, has been a popular topic in the communities of artificial intelligence, multimedia, and signal processing. However, existing methods usually consider an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only o…

    Submitted 1 September, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Comments: Accepted by IEEE ISCAS 2022, 5 pages, 2 figures. arXiv admin note: text overlap with arXiv:2110.03368

  48. arXiv:2203.02291  [pdf, other]

    cs.CV cs.SD eess.AS

    Freeform Body Motion Generation from Speech

    Authors: Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei

    Abstract: People naturally conduct spontaneous body motions to enhance their speeches while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompos…

    Submitted 4 March, 2022; originally announced March 2022.

  49. arXiv:2201.09753  [pdf]

    cs.LG stat.AP stat.CO

    Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project

    Authors: A. Llera, M. Brammer, B. Oakley, J. Tillmann, M. Zabihi, T. Mei, T. Charman, C. Ecker, F. Dell Acqua, T. Banaschewski, C. Moessnang, S. Baron-Cohen, R. Holt, S. Durston, D. Murphy, E. Loth, J. K. Buitelaar, D. L. Floris, C. F. Beckmann

    Abstract: An increasing number of large-scale multi-modal research initiatives has been conducted in the typically developing population, as well as in psychiatric cohorts. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to explore relationships between m…

    Submitted 20 January, 2022; originally announced January 2022.

    Comments: 22 pages, 3 figures, 3 tables

  50. arXiv:2201.06734  [pdf, other]

    cs.CV

    Cross-modal Contrastive Distillation for Instructional Activity Anticipation

    Authors: Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo

    Abstract: In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation. Unlike previous anticipation tasks that aim at action label prediction, our work targets at generating natural language outputs that provide interpretable and accurate descriptions of future action steps. It is a challenging task due to the…

    Submitted 17 January, 2022; originally announced January 2022.