Showing 1–31 of 31 results for author: Zhuo, L

  1. arXiv:2406.18583  [pdf, other]

    cs.CV cs.LG

    Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

    Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lu…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  2. arXiv:2405.05945  [pdf, other]

    cs.CV

    Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

    Authors: Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

    Abstract: Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f…

    Submitted 13 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

    Comments: Technical Report; Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  3. arXiv:2403.07920  [pdf, other]

    q-bio.BM cs.AI cs.CL cs.LG

    ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

    Authors: Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang

    Abstract: We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By dev…

    Submitted 27 February, 2024; originally announced March 2024.

    Comments: https://protllm.github.io/project/

  4. arXiv:2403.00307  [pdf, other]

    cs.CV cs.AI

    Embedded Multi-label Feature Selection via Orthogonal Regression

    Authors: Xueyuan Xu, Fulin Wei, Tianyuan Jia, Li Zhuo, Feiping Nie, Xia Wu

    Abstract: In the last decade, embedded multi-label feature selection methods, incorporating the search for feature subsets into model optimization, have attracted considerable attention in accurately evaluating the importance of features in multi-label classification tasks. Nevertheless, the state-of-the-art embedded multi-label feature selection algorithms based on least square regression usually cannot pr…

    Submitted 1 March, 2024; originally announced March 2024.

  5. arXiv:2311.11904  [pdf, other]

    cs.CV cs.CL cs.LG

    LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

    Authors: Songhao Han, Le Zhuo, Yue Liao, Si Liu

    Abstract: Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and…

    Submitted 19 February, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

  6. arXiv:2310.10036  [pdf, other]

    cs.CV cs.MM

    Evading Detection Actively: Toward Anti-Forensics against Forgery Localization

    Authors: Long Zhuo, Shenghai Luo, Shunquan Tan, Han Chen, Bin Li, Jiwu Huang

    Abstract: Anti-forensics seeks to eliminate or conceal traces of tampering artifacts. Typically, anti-forensic methods are designed to deceive binary detectors and persuade them to misjudge the authenticity of an image. However, to the best of our knowledge, no attempts have been made to deceive forgery detectors at the pixel level and mis-locate forged regions. Traditional adversarial attack methods cannot…

    Submitted 15 October, 2023; originally announced October 2023.

  7. arXiv:2310.01089  [pdf, other]

    cs.CL cs.LG

    GraphText: Graph Reasoning in Text Space

    Authors: Jianan Zhao, Le Zhuo, Yikang Shen, Meng Qu, Kai Liu, Michael Bronstein, Zhaocheng Zhu, Jian Tang

    Abstract: Large Language Models (LLMs) have gained the ability to assimilate human knowledge and facilitate natural language interactions with both humans and other LLMs. However, despite their impressive achievements, LLMs have not made significant advancements in the realm of graph machine learning. This limitation arises because graphs encapsulate distinct relational data, making it challenging to transf…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: Preprint. Work in progress

  8. arXiv:2308.02915  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

    Authors: Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan

    Abstract: When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we…

    Submitted 5 August, 2023; originally announced August 2023.

    Comments: Accepted at ACM MM 2023

  9. arXiv:2306.17103  [pdf, other]

    cs.CL cs.SD eess.AS

    LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

    Authors: Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi Li, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenhu Chen, Wei Xue, Yike Guo

    Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language mo…

    Submitted 21 November, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

    Comments: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023

  10. arXiv:2306.15390  [pdf, other]

    cs.CV cs.AI

    DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-bit CNNs

    Authors: Yanjing Li, Sheng Xu, Xianbin Cao, Li'an Zhuo, Baochang Zhang, Tian Wang, Guodong Guo

    Abstract: Neural architecture search (NAS) proves to be among the effective approaches for many tasks by generating an application-adaptive neural architecture, which is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binary weights and activations show their potential for resource-limited embedded devices. One natural app…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted by International Journal of Computer Vision

  11. arXiv:2306.10548  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation

    Authors: Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

    Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue…

    Submitted 23 November, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: camera-ready version for NeurIPS 2023

  12. arXiv:2305.14836  [pdf, other]

    cs.CV

    NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

    Authors: Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang

    Abstract: We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, th…

    Submitted 20 February, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to AAAI 2024

  13. arXiv:2305.13705  [pdf, other]

    cs.CV

    DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models

    Authors: Lijun Li, Li'an Zhuo, Bang Zhang, Liefeng Bo, Chen Chen

    Abstract: Hand mesh reconstruction from the monocular image is a challenging task due to its depth ambiguity and severe occlusion, there remains a non-unique mapping between the monocular image and hand mesh. To address this, we develop DiffHand, the first diffusion-based framework that approaches hand mesh reconstruction as a denoising diffusion process. Our one-stage pipeline utilizes noise to model the u…

    Submitted 23 May, 2023; originally announced May 2023.

  14. arXiv:2305.13353  [pdf, other]

    cs.CV

    RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

    Authors: Dongwei Pan, Long Zhuo, Jingtan Piao, Huiwen Luo, Wei Cheng, Yuxin Wang, Siming Fan, Shengqi Liu, Lei Yang, Bo Dai, Ziwei Liu, Chen Change Loy, Chen Qian, Wayne Wu, Dahua Lin, Kwan-Yee Lin

    Abstract: Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2)…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Technical Report; Project Page: 36; Github Link: https://github.com/RenderMe-360/RenderMe-360

  15. arXiv:2211.11248  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Video Background Music Generation: Dataset, Method and Evaluation

    Authors: Le Zhuo, Zhaokai Wang, Baisen Wang, Yue Liao, Chenxi Bao, Stanley Peng, Songhao Han, Aixi Zhang, Fei Fang, Si Liu

    Abstract: Music is essential when editing videos, but selecting music manually is difficult and time-consuming. Thus, we seek to automatically generate background music tracks given video input. This is a challenging task since it requires music-video datasets, efficient architectures for video-to-music generation, and reasonable metrics, none of which currently exist. To close this gap, we introduce a comp…

    Submitted 4 August, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: Accepted by ICCV2023

  16. TGDM: Target Guided Dynamic Mixup for Cross-Domain Few-Shot Learning

    Authors: Linhai Zhuo, Yuqian Fu, Jingjing Chen, Yixin Cao, Yu-Gang Jiang

    Abstract: Given sufficient training data on the source domain, cross-domain few-shot learning (CD-FSL) aims at recognizing new classes with a small number of labeled examples on the target domain. The key to addressing CD-FSL is to narrow the domain gap and transferring knowledge of a network trained on the source domain to the target domain. To help knowledge transfer, this paper introduces an intermediate…

    Submitted 30 November, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: accepted by ACM MM 2022

  17. arXiv:2207.05049  [pdf, other]

    cs.CV eess.IV

    Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

    Authors: Long Zhuo, Guangcong Wang, Shikai Li, Wayne Wu, Ziwei Liu

    Abstract: Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depends on two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: ECCV 2022, Project Page: https://fast-vid2vid.github.io/ , Code: https://github.com/fast-vid2vid/fast-vid2vid

  18. arXiv:2206.10080  [pdf, other]

    cs.CV cs.AI

    One-stage Action Detection Transformer

    Authors: Lijun Li, Li'an Zhuo, Bang Zhang

    Abstract: In this work, we introduce our solution to the EPIC-KITCHENS-100 2022 Action Detection challenge. One-stage Action Detection Transformer (OADT) is proposed to model the temporal connection of video segments. With the help of OADT, both the category and time boundary can be recognized simultaneously. After ensembling multiple OADT models trained from different features, our model can reach 21.28%…

    Submitted 20 June, 2022; originally announced June 2022.

  19. Self-Adversarial Training incorporating Forgery Attention for Image Forgery Localization

    Authors: Long Zhuo, Shunquan Tan, Bin Li, Jiwu Huang

    Abstract: Image editing techniques enable people to modify the content of an image without leaving visual traces and thus may cause serious security risks. Hence the detection and localization of these forgeries become quite necessary and challenging. Furthermore, unlike other tasks with extensive data, there is usually a lack of annotated forged images for training due to annotation difficulties. In this p…

    Submitted 2 February, 2022; v1 submitted 6 July, 2021; originally announced July 2021.

    Comments: accepted by TIFS

  20. arXiv:2106.10617  [pdf, other]

    cs.LG

    Cogradient Descent for Dependable Learning

    Authors: Runqi Wang, Baochang Zhang, Li'an Zhuo, Qixiang Ye, David Doermann

    Abstract: Conventional gradient descent methods compute the gradients for multiple variables through the partial derivative. Treating the coupled variables independently while ignoring the interaction, however, leads to an insufficient optimization for bilinear models. In this paper, we propose a dependable learning based on Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem, p…

    Submitted 20 June, 2021; originally announced June 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2006.09142

  21. arXiv:2012.04109  [pdf, other]

    cs.CV

    Deformable Gabor Feature Networks for Biomedical Image Classification

    Authors: Xuan Gong, Xin Xia, Wentao Zhu, Baochang Zhang, David Doermann, Lian Zhuo

    Abstract: In recent years, deep learning has dominated progress in the field of medical image analysis. We find however, that the ability of current deep learning approaches to represent the complex geometric structures of many medical images is insufficient. One limitation is that deep learning models require a tremendous amount of data, and it is very difficult to obtain a sufficient amount with the neces…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: 9 pages, 6 figures

  22. arXiv:2009.04247  [pdf, other]

    cs.CV

    Binarized Neural Architecture Search for Efficient Object Recognition

    Authors: Hanlin Chen, Li'an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, Rongrong Ji, David Doermann, Guodong Guo

    Abstract: Traditional neural architecture search (NAS) has a significant impact in computer vision by automatically designing network architectures for various tasks. In this paper, binarized neural architecture search (BNAS), with a search space of binarized convolutions, is introduced to produce extremely compressed models to reduce huge computational cost on embedded devices for edge computing. The BNAS…

    Submitted 8 September, 2020; originally announced September 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:1911.10862

  23. arXiv:2008.08526  [pdf, other]

    eess.IV cs.CV

    Blur-Attention: A boosting mechanism for non-uniform blurred image restoration

    Authors: Xiaoguang Li, Feifan Yang, Kin Man Lam, Li Zhuo, Jiafeng Li

    Abstract: Dynamic scene deblurring is a challenging problem in computer vision. It is difficult to accurately estimate the spatially varying blur kernel by traditional methods. Data-driven-based methods usually employ kernel-free end-to-end mapping schemes, which are apt to overlook the kernel estimation. To address this issue, we propose a blur-attention module to dynamically capture the spatially varying…

    Submitted 19 August, 2020; originally announced August 2020.

  24. arXiv:2006.15588  [pdf, other]

    eess.IV cs.CV cs.LG

    A lateral semicircular canal segmentation based geometric calibration for human temporal bone CT Image

    Authors: Xiaoguang Li, Peng Fu, Hongxia Yin, ZhenChang Wang, Li Zhuo, Hui Zhang

    Abstract: Computed Tomography (CT) of the temporal bone has become an important method for diagnosing ear diseases. Due to the different posture of the subject and the settings of CT scanners, the CT image of the human temporal bone should be geometrically calibrated to ensure the symmetry of the bilateral anatomical structure. Manual calibration is a time-consuming task for radiologists and an important pr…

    Submitted 28 June, 2020; originally announced June 2020.

  25. arXiv:2006.09142  [pdf, other]

    cs.CV

    Cogradient Descent for Bilinear Optimization

    Authors: Li'an Zhuo, Baochang Zhang, Linlin Yang, Hanlin Chen, Qixiang Ye, David Doermann, Guodong Guo, Rongrong Ji

    Abstract: Conventional learning methods simplify the bilinear model by regarding two intrinsically coupled factors independently, which degrades the optimization procedure. One reason lies in the insufficient training due to the asynchronous gradient descent, which results in vanishing gradients for the coupled variables. In this paper, we introduce a Cogradient Descent algorithm (CoGD) to address the bilin…

    Submitted 16 June, 2020; originally announced June 2020.

    Comments: 9 pages, 6 figures

  26. arXiv:2005.00057  [pdf, other]

    cs.CV

    CP-NAS: Child-Parent Neural Architecture Search for Binary Neural Networks

    Authors: Li'an Zhuo, Baochang Zhang, Hanlin Chen, Linlin Yang, Chen Chen, Yanjun Zhu, David Doermann

    Abstract: Neural architecture search (NAS) proves to be among the best approaches for many tasks by generating an application-adaptive neural architecture, which is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binarized weights and activations show their potential for resource-limited embedded devices. One natural appro…

    Submitted 17 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: 7 pages, 6 figures

  27. arXiv:1911.10862  [pdf, other]

    cs.CV

    Binarized Neural Architecture Search

    Authors: Hanlin Chen, Li'an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, David Doermann, Rongrong Ji

    Abstract: Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due…

    Submitted 11 February, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

  28. arXiv:1903.09294  [pdf, other]

    eess.SP cs.NI

    Hybrid Precoder and Combiner for Imperfect Beam Alignment in mmWave MIMO Systems

    Authors: Chandan Pradhan, Ang Li, Li Zhuo, Yonghui Li, Branka Vucetic

    Abstract: In this letter, we aim to design a robust hybrid precoder and combiner against beam misalignment in millimeter-wave (mmWave) communication systems. We consider the inclusion of the 'error statistics' into the precoder and combiner design, where the array response that incorporates the distribution of the misalignment error is first derived. An iterative algorithm is then proposed to design the rob…

    Submitted 21 March, 2019; originally announced March 2019.

    Comments: 4 pages

  29. arXiv:1903.09293  [pdf, other]

    cs.NI cs.IT

    Robust Hybrid Precoding for Beam Misalignment in Millimeter-Wave Communications

    Authors: Chandan Pradhan, Ang Li, Li Zhuo, Yonghui Li, Branka Vucetic

    Abstract: In this paper, we focus on the phenomenon of beam misalignment in Millimeter-wave (mmWave) multi-receiver communication systems, and propose robust hybrid precoding designs that alleviate the performance loss caused by this effect. We consider two distinct design methodologies: I) the synthesis of a 'flat mainlobe' beam model which maximizes the minimum effective array gain over the beam misalignm…

    Submitted 21 March, 2019; originally announced March 2019.

    Comments: 30 Pages, Initial Version of Submitted IEEE Journal

  30. arXiv:1804.00243  [pdf, other]

    cs.LG stat.ML

    The Structure Transfer Machine Theory and Applications

    Authors: Baochang Zhang, Lian Zhuo, Ze Wang, Jungong Han, Xiantong Zhen

    Abstract: Representation learning is a fundamental but challenging problem, especially when the distribution of data is unknown. We propose a new representation learning method, termed Structure Transfer Machine (STM), which enables feature learning process to converge at the representation expectation in a probabilistic way. We theoretically show that such an expected value of the representation (mean) is…

    Submitted 4 August, 2019; v1 submitted 31 March, 2018; originally announced April 2018.

  31. arXiv:1709.01629  [pdf, ps, other]

    cs.IT

    Antenna Selection in MIMO Cognitive Radio-Inspired NOMA Systems

    Authors: Yuehua Yu, He Chen, Yonghui Li, Zhiguo Ding, Li Zhuo

    Abstract: This letter investigates a joint antenna selection (AS) problem for a MIMO cognitive radio-inspired non-orthogonal multiple access (CR-NOMA) network. In particular, a new computationally efficient joint AS algorithm, namely subset-based joint AS (SJ-AS), is proposed to maximize the signal-to-noise ratio of the secondary user under the condition that the quality of service (QoS) of the primary user…

    Submitted 5 September, 2017; originally announced September 2017.

    Comments: Accepted to appear in IEEE Communication Letters