-
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce
Authors:
Hao Jiang,
Haoxiang Zhang,
Qingshan Hou,
Chaofeng Chen,
Weisi Lin,
Jingchang Zhang,
Annan Wang
Abstract:
Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.
Submitted 27 August, 2024;
originally announced August 2024.
-
Progressive Radiance Distillation for Inverse Rendering with Gaussian Splatting
Authors:
Keyang Ye,
Qiming Hou,
Kun Zhou
Abstract:
We propose progressive radiance distillation, an inverse rendering method that combines physically-based rendering with Gaussian-based radiance field rendering using a distillation progress map. Taking multi-view images as input, our method starts from a pre-trained radiance field guidance, and distills physically-based light and material parameters from the radiance field using an image-fitting process. The distillation progress map is initialized to a small value, which favors radiance field rendering. During early iterations when fitted light and material parameters are far from convergence, the radiance field fallback ensures the sanity of image loss gradients and avoids local minima that attracts under-fit states. As fitted parameters converge, the physical model gradually takes over and the distillation progress increases correspondingly. In presence of light paths unmodeled by the physical model, the distillation progress never finishes on affected pixels and the learned radiance field stays in the final rendering. With this designed tolerance for physical model limitations, we prevent unmodeled color components from leaking into light and material parameters, alleviating relighting artifacts. Meanwhile, the remaining radiance field compensates for the limitations of the physical model, guaranteeing high-quality novel views synthesis. Experimental results demonstrate that our method significantly outperforms state-of-the-art techniques quality-wise in both novel view synthesis and relighting. The idea of progressive radiance distillation is not limited to Gaussian splatting. We show that it also has positive effects for prominently specular scenes when adapted to a mesh-based inverse rendering method.
Submitted 14 August, 2024;
originally announced August 2024.
-
Streaming Torque in Dust-Gas Coupled Protoplanetary Disks
Authors:
Qiang Hou,
Cong Yu
Abstract:
We investigate the migration of low-mass protoplanets embedded in dust-gas coupled protoplanetary disks. Linear calculations are performed with respect to the NSH (Nakagawa-Sekiya-Hayashi 1986) equilibrium within a shearing sheet. We find that the dusty quasi-drift mode dominates the dynamical behaviors in close proximity to the protoplanet. This mode exhibits an extremely short radial wavelength, characterized by a dispersion relation of $\tilde{\omega} = \left( 1 + \mu \right) \boldsymbol{W}_s \cdot \boldsymbol{k}$. The emergence of this mode leads to a wake with a short radial length-scale ahead of protoplanets, contributing to a positive torque, termed the "Streaming Torque (ST)". Furthermore, both Lindblad torque and corotation torque are affected by the NSH velocity. The total torque and planetary migration are contingent upon the coupling strength between dust and gas. In most scenarios, ST predominates, inducing outward migration for planets, thereby addressing the issue of rapid inward migration in their formation paradigm.
Submitted 3 August, 2024;
originally announced August 2024.
-
Segmentation-Free Guidance for Text-to-Image Diffusion Models
Authors:
Kambiz Azarian,
Debasmit Das,
Qiqi Hou,
Fatih Porikli
Abstract:
We introduce segmentation-free guidance, a novel method designed for text-to-image diffusion models like Stable Diffusion. Our method does not require retraining of the diffusion model. At no additional compute cost, it uses the diffusion model itself as an implied segmentation network, hence named segmentation-free guidance, to dynamically adjust the negative prompt for each patch of the generated image, based on the patch's relevance to concepts in the prompt. We evaluate segmentation-free guidance both objectively, using FID, CLIP, IS, and PickScore, and subjectively, through human evaluators. For the subjective evaluation, we also propose a methodology for subsampling the prompts in a dataset like MS COCO-30K to keep the number of human evaluations manageable while ensuring that the selected subset is both representative in terms of content and fair in terms of model performance. The results demonstrate the superiority of our segmentation-free guidance to the widely used classifier-free method. Human evaluators preferred segmentation-free guidance over classifier-free 60% to 19%, with 18% of occasions showing a strong preference. Additionally, PickScore win-rate, a recently proposed metric mimicking human preference, also indicates a preference for our method over classifier-free.
Submitted 3 June, 2024;
originally announced July 2024.
-
Towards Stable 3D Object Detection
Authors:
Jiabao Wang,
Qiang Meng,
Guochao Liu,
Liujiang Yan,
Ke Wang,
Ming-Ming Cheng,
Qibin Hou
Abstract:
In autonomous driving, the temporal stability of 3D object detection greatly impacts driving safety. However, detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the community. To bridge this gap, this work proposes Stability Index (SI), a new metric that can comprehensively evaluate the stability of 3D detectors in terms of confidence, box localization, extent, and heading. By benchmarking state-of-the-art object detectors on the Waymo Open Dataset, SI reveals interesting properties of object stability that have not been previously discovered by other metrics. To help models improve their stability, we further introduce a general and effective training strategy, called Prediction Consistency Learning (PCL). PCL essentially encourages the prediction consistency of the same objects under different timestamps and augmentations, leading to enhanced detection stability. Furthermore, we examine the effectiveness of PCL with the widely-used CenterPoint, and achieve a remarkable SI of 86.00 for vehicle class, surpassing the baseline by 5.48. We hope our work could serve as a reliable baseline and draw the community's attention to this crucial issue in 3D object detection. Code will be made publicly available.
Submitted 5 July, 2024;
originally announced July 2024.
-
Neural Graphics Texture Compression Supporting Random Access
Authors:
Farzad Farhadzadeh,
Qiqi Hou,
Hoang Le,
Amir Said,
Randall Rauwendaal,
Alex Bourd,
Fatih Porikli
Abstract:
Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel textures components, but this growth in data volume has not been matched by advances in its compression. Meanwhile Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression. First, texture compression requires on-demand and real-time decoding with random access during parallel rendering (e.g. block texture decompression on GPUs). Additionally, NIC does not support multi-resolution reconstruction (mip-levels), nor does it have the ability to efficiently jointly compress different sets of texture channels. In this work, we introduce a novel approach to texture set compression that integrates traditional GPU texture representation and NIC techniques, designed to enable random access and support many-channel texture sets. To achieve this goal, we propose an asymmetric auto-encoder framework that employs a convolutional encoder to capture detailed information in a bottleneck-latent space, and at decoder side we utilize a fully connected network, whose inputs are sampled latent features plus positional information, for a given texture coordinate and mip level. This latent data is defined to enable simplified access to multi-resolution data by simply changing the scanning strides. Experimental results demonstrate that this approach provides much better results than conventional texture compression, and significant improvement over the latest method using neural networks.
Submitted 6 May, 2024;
originally announced July 2024.
-
Automatic AI Model Selection for Wireless Systems: Online Learning via Digital Twinning
Authors:
Qiushuo Hou,
Matteo Zecchin,
Sangwoo Park,
Yunlong Cai,
Guanding Yu,
Kaushik Chowdhury,
Osvaldo Simeone
Abstract:
In modern wireless network architectures, such as O-RAN, artificial intelligence (AI)-based applications are deployed at intelligent controllers to carry out functionalities like scheduling or power control. The AI "apps" are selected on the basis of contextual information such as network conditions, topology, traffic statistics, and design goals. The mapping between context and AI model parameters is ideally done in a zero-shot fashion via an automatic model selection (AMS) mapping that leverages only contextual information without requiring any current data. This paper introduces a general methodology for the online optimization of AMS mappings. Optimizing an AMS mapping is challenging, as it requires exposure to data collected from many different contexts. Therefore, if carried out online, this initial optimization phase would be extremely time consuming. A possible solution is to leverage a digital twin of the physical system to generate synthetic data from multiple simulated contexts. However, given that the simulator at the digital twin is imperfect, a direct use of simulated data for the optimization of the AMS mapping would yield poor performance when tested in the real system. This paper proposes a novel method for the online optimization of AMS mapping that corrects for the bias of the simulator by means of limited real data collected from the physical system. Experimental results for a graph neural network-based power control app demonstrate the significant advantages of the proposed approach.
Submitted 22 June, 2024;
originally announced June 2024.
-
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Authors:
Li-Wen Chang,
Wenlei Bao,
Qi Hou,
Chengquan Jiang,
Ningxin Zheng,
Yinmin Zhong,
Xuanrun Zhang,
Zuquan Song,
Ziheng Jiang,
Haibin Lin,
Xin Jin,
Xin Liu
Abstract:
Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. This limits the scalability of the technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.
Submitted 18 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation
Authors:
Yunheng Li,
ZhongYu Li,
Quansheng Zeng,
Qibin Hou,
Ming-Ming Cheng
Abstract:
Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP
Submitted 6 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion
Authors:
Zhao Ren,
Kevin Scheck,
Qinhan Hou,
Stefano van Gogh,
Michael Wand,
Tanja Schultz
Abstract:
Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.
Submitted 11 May, 2024;
originally announced May 2024.
-
Phase coding semi-quantum key distribution system based on the Single-state protocol
Authors:
Qincheng Hou,
Siying Huang,
Naida Mo,
Jindong Wang,
Zhengjun Wei,
Yafei Yu,
Tianming Zhao,
Zhiming Zhang
Abstract:
Semi-quantum key distribution (SQKD) allows sharing random keys between a quantum user and a classical user. However, implementing classical user operations is challenging, posing a hurdle to achieving the Single-state protocol. By using the "selective modulation" method, the feasibility of SQKD is verified in principle. The proposal of the selective modulation method enables the realization of other protocols for SQKD. To advance experimental progress in SQKD, we propose and implement a phase-encoded semi-quantum key distribution system based on the Single-state protocol and the "selective modulation" method. The system operates at a frequency of 100MHz and an average photon number of 0.1. The interference contrast achieved 96.52%, the average quantum bit error rate was 1.19%, and the raw key rate reached 88Kbps. Our experimental results demonstrate the feasibility and stability of the proposed phase-encoded semi-quantum key distribution system. Furthermore, by leveraging the "selective modulation" scheme proposed in this paper, we develop a comprehensive theoretical description of selective modulation. Through an analysis of quantum state evolution, we assess the security of our system, ultimately demonstrating its resilience against attacks targeting quantum states. The classical user of our system requires only two optical devices, significantly reducing the equipment requirements and enhancing its application potential. This work validates the feasibility of semi-quantum key distribution experiments and provides ideas for future research on semi-quantum key distribution experiments and security studies.
Submitted 13 May, 2024;
originally announced May 2024.
-
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Authors:
Yupeng Zhou,
Daquan Zhou,
Ming-Ming Cheng,
Jiashi Feng,
Qibin Hou
Abstract:
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
Submitted 2 May, 2024;
originally announced May 2024.
-
3D Gaussian Splatting with Deferred Reflection
Authors:
Keyang Ye,
Qiming Hou,
Kun Zhou
Abstract:
The advent of neural and Gaussian-based radiance field methods have achieved great success in the field of novel view synthesis. However, specular reflection remains non-trivial, as the high frequency radiance field is notoriously difficult to fit stably and accurately. We present a deferred shading method to effectively render specular reflection with Gaussian splatting. The key challenge comes from the environment map reflection model, which requires accurate surface normal while simultaneously bottlenecks normal estimation with discontinuous gradients. We leverage the per-pixel reflection gradients generated by deferred shading to bridge the optimization process of neighboring Gaussians, allowing nearly correct normal estimations to gradually propagate and eventually spread over all reflective objects. Our method significantly outperforms state-of-the-art techniques and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent improvement of peak signal-to-noise ratio (PSNR) for both synthetic and real-world scenes, while running at a frame rate almost identical to vanilla Gaussian splatting.
Submitted 4 June, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
Active robustness against the detuning-error for Rydberg quantum gates
Authors:
Qing-Ling Hou,
Han Wang,
Jing Qian
Abstract:
Error suppression to the experimental imperfections is a central challenge for useful quantum computing. Recent studies have shown the advantages of using single-modulated pulses based on optimal control which can realize high-fidelity two-qubit gates in neutral-atom arrays. However, typical optimization only minimizes the ideal gate error in the absence of any decay, which allows the gate to be passively influenced by all error sources leading to an exponential increase of sensitivity when the error becomes larger. In the present work, we propose the realization of two-qubit CZ gates with active robustness against two-photon detuning errors. Our method depends on a modified cost function in numerical optimization for shaping gate pulses, which can minimize, not only the ideal gate error but also the fluctuations of gate infidelity over a wide error range. We introduce a family of Rydberg blockade gates with active robustness towards the impacts of versatile noise sources such as Doppler dephasing and ac Stark shifts. The resulting gates with robust pulses can significantly increase the insensitivity to any type of errors acting on the two-photon detuning, benefiting from a relaxed requirement of colder atomic temperatures or more stable lasers for current experimental technology.
Submitted 17 April, 2024;
originally announced April 2024.
-
Synthesizing Realistic Data for Table Recognition
Authors:
Qiyu Hou,
Jun Wang,
Meixuan Qiao,
Lujun Tian
Abstract:
To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.
Submitted 9 July, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
A Clinical-oriented Multi-level Contrastive Learning Method for Disease Diagnosis in Low-quality Medical Images
Authors:
Qingshan Hou,
Shuai Cheng,
Peng Cao,
Jinzhu Yang,
Xiaoli Liu,
Osmar R. Zaiane,
Yih Chung Tham
Abstract:
Representation learning offers a conduit to elucidate distinctive features within the latent space and interpret the deep models. However, the randomness of lesion distribution and the complexity of low-quality factors in medical images pose great challenges for models to extract key lesion features. Disease diagnosis methods guided by contrastive learning (CL) have shown significant advantages in lesion feature representation. Nevertheless, the effectiveness of CL is highly dependent on the quality of the positive and negative sample pairs. In this work, we propose a clinical-oriented multi-level CL framework that aims to enhance the model's capacity to extract lesion features and discriminate between lesion and low-quality factors, thereby enabling more accurate disease diagnosis from low-quality medical images. Specifically, we first construct multi-level positive and negative pairs to enhance the model's comprehensive recognition capability of lesion features by integrating information from different levels and qualities of medical images. Moreover, to improve the quality of the learned lesion embeddings, we introduce a dynamic hard sample mining method based on self-paced learning. The proposed CL framework is validated on two public medical image datasets, EyeQ and Chest X-ray, demonstrating superior performance compared to other state-of-the-art disease diagnostic methods.
Submitted 7 April, 2024;
originally announced April 2024.
-
Low-Latency Neural Stereo Streaming
Authors:
Qiqi Hou,
Farzad Farhadzadeh,
Amir Said,
Guillaume Sautiere,
Hoang Le
Abstract:
The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress left and right views sequentially, leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of using a sequential cross-view motion compensation like existing methods, LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encode them effectively with a joint cross-view prior model for entropy coding. Thanks to this design, LLSS processes left and right views in parallel, minimizing latency; all while substantially improving R-D performance compared to both existing neural and conventional codecs.
Submitted 26 March, 2024;
originally announced March 2024.
-
Multi-Task Dense Prediction via Mixture of Low-Rank Experts
Authors:
Yuqi Yang,
Peng-Tao Jiang,
Qibin Hou,
Hao Zhang,
Jinwei Chen,
Bo Li
Abstract:
Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have achieved strong performance, but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much as the number of experts increases. Benefiting from this design, we increase the number of experts and their receptive field to enlarge the representation capacity, facilitating the learning of multiple dense tasks in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.
Submitted 27 May, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Authors:
Yuxuan Li,
Xiang Li,
Yimian Dai,
Qibin Hou,
Li Liu,
Yongxiang Liu,
Ming-Ming Cheng,
Jian Yang
Abstract:
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
Submitted 23 June, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection
Authors:
Yuxuan Li,
Xiang Li,
Weijie Li,
Qibin Hou,
Li Liu,
Ming-Ming Cheng,
Jian Yang
Abstract:
Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at https://github.com/zcablii/SARDet_100K.
Submitted 11 March, 2024;
originally announced March 2024.
-
Sora Generates Videos with Stunning Geometrical Consistency
Authors:
Xuanyi Li,
Daquan Zhou,
Chenxu Zhang,
Shaodong Wei,
Qibin Hou,
Ming-Ming Cheng
Abstract:
The recently developed Sora model [1] has exhibited remarkable capabilities in video generation, sparking intense discussions regarding its ability to simulate real-world phenomena. Despite its growing popularity, there is a lack of established metrics to evaluate its fidelity to real-world physics quantitatively. In this paper, we introduce a new benchmark that assesses the quality of the generated videos based on their adherence to real-world physics principles. We employ a method that transforms the generated videos into 3D models, leveraging the premise that the accuracy of 3D reconstruction is heavily contingent on the video quality. From the perspective of 3D reconstruction, we use the fidelity of the geometric constraints satisfied by the constructed 3D models as a proxy to gauge the extent to which the generated videos conform to real-world physics rules. Project page: https://sora-geometrical-consistency.github.io/
Submitted 27 February, 2024;
originally announced February 2024.
-
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Authors:
Ziheng Jiang,
Haibin Lin,
Yinmin Zhong,
Qi Huang,
Yangrui Chen,
Zhi Zhang,
Yanghua Peng,
Xiang Li,
Cong Xie,
Shibiao Nong,
Yulu Jia,
Sun He,
Hongmin Chen,
Zhihao Bai,
Qi Hou,
Shipeng Yan,
Ding Zhou,
Yiyao Sheng,
Zhuo Jiang,
Haohan Xu,
Haoran Wei,
Zhang Zhang,
Pengfei Nie,
Leqi Zou,
Sida Zhao
, et al. (7 additional authors not shown)
Abstract:
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
Submitted 23 February, 2024;
originally announced February 2024.
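As a rough sanity check on the MFU figures quoted in the MegaScale abstract above, the following sketch back-computes the baseline MFU implied by the 1.34x ratio. The implied Megatron-LM value is derived here from the two quoted numbers and is not stated in the abstract; the helper name `implied_baseline_mfu` is hypothetical.

```python
def implied_baseline_mfu(mfu_percent: float, speedup: float) -> float:
    """Return the baseline MFU (in percent) implied by a relative improvement."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return mfu_percent / speedup


# Figures quoted in the abstract: 55.2% MFU, a 1.34x improvement over Megatron-LM.
baseline = implied_baseline_mfu(55.2, 1.34)
print(f"implied baseline MFU: {baseline:.1f}%")
```

Under these assumptions the implied baseline is a little over 41%, which is only as precise as the rounded figures in the abstract allow.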
-
Fast Window-Based Event Denoising with Spatiotemporal Correlation Enhancement
Authors:
Huachen Fang,
Jinjian Wu,
Qibin Hou,
Weisheng Dong,
Guangming Shi
Abstract:
Previous deep learning-based event denoising methods mostly suffer from poor interpretability and difficulty in real-time processing due to their complex architecture designs. In this paper, we propose window-based event denoising, which processes a stack of events simultaneously, whereas existing element-based denoising handles one event at a time. In addition, we give a theoretical analysis based on probability distributions in both the temporal and spatial domains to improve interpretability. In the temporal domain, we use timestamp deviations between the events being processed and the central event to judge temporal correlation and filter out temporally irrelevant events. In the spatial domain, we use maximum a posteriori (MAP) estimation to discriminate real-world events from noise, and use learned convolutional sparse coding to optimize the objective function. Based on the theoretical analysis, we build a Temporal Window (TW) module and a Soft Spatial Feature Embedding (SSFE) module to process temporal and spatial information separately, and construct a novel multi-scale window-based event denoising network, named MSDNet. The high denoising accuracy and fast running speed of MSDNet enable real-time denoising in complex scenes. Extensive experimental results verify the effectiveness and robustness of MSDNet. Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.
Submitted 14 February, 2024;
originally announced February 2024.
-
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
Authors:
Senmao Li,
Joost van de Weijer,
Taihang Hu,
Fahad Shahbaz Khan,
Qibin Hou,
Yaxing Wang,
Jian Yang
Abstract:
The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as $\textit{soft-weighted regularization}$ and $\textit{inference-time text embedding optimization}$. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively in extensive experiments, validating its effectiveness. Furthermore, our method generalizes to both pixel-space diffusion models (e.g., DeepFloyd-IF) and latent-space diffusion models (e.g., Stable Diffusion).
Submitted 7 February, 2024;
originally announced February 2024.
-
Rational Solutions to the First Order Difference Equations in the Bivariate Difference Field
Authors:
Qing-Hu Hou,
Yarong Wei
Abstract:
Inspired by Karr's algorithm, we consider the summations involving a sequence satisfying a recurrence of order two. The structure of such summations provides an algebraic framework for solving the difference equations of form $aσ(g)+bg=f$ in the bivariate difference field $(\mathbb{F}(α, β), σ)$, where $a, b,f\in\mathbb{F}(α,β)\setminus\{0\}$ are known binary functions of $α$, $β$, and $α$, $β$ are two algebraically independent transcendental elements, $σ$ is a transformation that satisfies $σ(α)=β$, $σ(β)=uα+vβ$, where $u,v\neq 0\in\mathbb{F}$. Based on it, we then describe algorithms for finding the universal denominator for those equations in the bivariate difference field under certain assumptions. This reduces the general problem of finding the rational solutions of such equations to the problem of finding the polynomial solutions of such equations.
Submitted 20 January, 2024;
originally announced January 2024.
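A concrete instance may help in reading the equation $aσ(g)+bg=f$ from the abstract above. Evaluating $α = F_n$ and $β = F_{n+1}$ at the Fibonacci numbers, the shift $σ$ satisfies $σ(α)=β$ and $σ(β)=α+β$, i.e. $u=v=1$. With $a=1$, $b=-1$ and $f=α$, the element $g=β$ solves $σ(g)-g=f$, which telescopes to $\sum_{k=0}^{n-1} F_k = F_{n+1}-1$. The sketch below only checks this identity numerically; the choice of Fibonacci numbers and the function names are illustrative assumptions, not the paper's algorithm.

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def check_telescoping(n_max: int) -> bool:
    """Check sigma(g) - g = f and the resulting sum identity for n = 0..n_max."""
    for n in range(n_max + 1):
        # g = beta = F_{n+1}, so sigma(g) = F_{n+2}; f = alpha = F_n.
        if fib(n + 2) - fib(n + 1) != fib(n):
            return False
        # Telescoping: sum_{k=0}^{n-1} F_k = g(n) - g(0) = F_{n+1} - 1.
        if sum(fib(k) for k in range(n)) != fib(n + 1) - 1:
            return False
    return True


print(check_telescoping(50))
```

Integer arithmetic keeps the check exact, so no floating-point tolerance is needed.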
-
Reduction on the congruences of partial sums of P-recursive sequences
Authors:
Qing-Hu Hou,
Na Li
Abstract:
Hou and Liu developed a telescoping method to prove the congruence of partial sums of P-recursive sequences. We release the requirement on the telescoper and utilize the congruence of the sequence. With this approach, we are able to confirm a conjecture of Sun and find a new congruence on the central trinomial coefficient.
Submitted 21 December, 2023;
originally announced December 2023.
-
MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention
Authors:
Hao Shao,
Quansheng Zeng,
Qibin Hou,
Jufeng Yang
Abstract:
Efficiently capturing multi-scale information and building long-range dependencies among pixels are essential for medical image segmentation because of the various sizes and shapes of the lesion regions or organs. In this paper, we present Multi-scale Cross-axis Attention (MCA) to solve the above challenging issues based on the efficient axial attention. Instead of simply connecting axial attention along the horizontal and vertical directions sequentially, we propose to calculate dual cross attentions between two parallel axial attentions to capture global information better. To process the significant variations of lesion regions or organs in individual sizes and shapes, we also use multiple convolutions of strip-shape kernels with different kernel sizes in each axial attention path to improve the efficiency of the proposed MCA in encoding spatial information. We build the proposed MCA upon the MSCAN backbone, yielding our network, termed MCANet. Our MCANet with only 4M+ parameters performs even better than most previous works with heavy backbones (e.g., Swin Transformer) on four challenging tasks, including skin lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation, and polyp segmentation. Code is available at https://github.com/haoshao-nku/medical_seg.
Submitted 19 December, 2023; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Polyper: Boundary Sensitive Polyp Segmentation
Authors:
Hao Shao,
Yang Zhang,
Qibin Hou
Abstract:
We present a new boundary sensitive framework for polyp segmentation, called Polyper. Our method is motivated by a clinical practice: seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose explicitly leveraging polyp regions to bolster the model's boundary discrimination capability while minimizing computation. Our approach first extracts boundary and polyp regions from the initial segmentation map through morphological operators. Then, we design a boundary sensitive attention that concentrates on augmenting the features near the boundary regions using the characteristics of the interior polyp regions to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git.
Submitted 14 December, 2023;
originally announced December 2023.
-
A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation
Authors:
Yunheng Li,
Zhongyu Li,
Shanghua Gao,
Qilong Wang,
Qibin Hou,
Ming-Ming Cheng
Abstract:
Effectively modeling discriminative spatio-temporal information is essential for segmenting activities in long action sequences. However, we observe that existing methods are limited by weak spatio-temporal modeling capability due to two forms of decoupled modeling: (i) cascaded interaction couples spatial and temporal modeling, which over-smooths motion modeling over the long sequence, and (ii) joint-shared temporal modeling adopts shared weights to model each joint, ignoring the distinct motion patterns of different joints. We propose a Decoupled Spatio-Temporal Framework (DeST) to address the above issues. Firstly, we decouple the cascaded spatio-temporal interaction to avoid stacking multiple spatio-temporal blocks, while achieving sufficient spatio-temporal interaction. Specifically, DeST performs unified spatial modeling once and divides the spatial features into different groups of sub-features, which then adaptively interact with temporal features from different layers. Since the different sub-features contain distinct spatial semantics, the model can learn the optimal interaction pattern at each layer. Meanwhile, inspired by the fact that different joints move at different speeds, we propose joint-decoupled temporal modeling, which employs independent trainable weights to capture distinctive temporal features of each joint. On four large-scale benchmarks of different scenes, DeST significantly outperforms current state-of-the-art methods with less computational complexity.
Submitted 10 December, 2023;
originally announced December 2023.
-
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
Authors:
Xuying Zhang,
Bo-Wen Yin,
Yuming Chen,
Zheng Lin,
Yunheng Li,
Qibin Hou,
Ming-Ming Cheng
Abstract:
Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly contain a single object. Meanwhile, the local details of multiple objects may be susceptible to omission because existing supervision relies primarily on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under the contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. Particularly, a cross-modal graph is constructed to accurately align the object points and noun phrases decoupled from the 3D mesh and textual description, respectively. Then, we develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.
Submitted 7 December, 2023;
originally announced December 2023.
-
ChatAnything: Facetime Chat with LLM-Enhanced Personas
Authors:
Yilin Zhao,
Xinbin Yuan,
Shanghua Gao,
Zhijie Lin,
Qibin Hou,
Jiashi Feng,
Daquan Zhou
Abstract:
In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality and tones, with only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation. For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the best-matching one based on the user-provided text description. For MoD, we combine recent popular text-to-image generation techniques and talking head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with any anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, leading to failure of the face motion generation, even if these faces possess human-like appearances, because such images are rarely seen during training (i.e., they are OOD samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase. To benchmark these metrics, we have built an evaluation dataset. Based on it, we verify that the detection rate of the face landmark is significantly increased from 57.0% to 92.5%, thus allowing automatic face animation based on generated speech content. The code and more results can be found at https://chatanything.github.io/.
Submitted 12 November, 2023;
originally announced November 2023.
-
Log-behavior of the root sequences of P-recursive sequences
Authors:
Qing-hu Hou,
Zhongjie Li
Abstract:
In recent years, Sun has proposed numerous conjectures regarding the log-concavity of root sequences $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$. We establish criteria for the asymptotic log-concavity of $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$ and the asymptotic ratio log-convexity of $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$ for $P$-recursive sequences $\{\sqrt[n]{a_n}\}_{n\geqslant{0}}$. Additionally, by the aid of symbolic computation, we present a systematic approach to determine the explicit integer $N$ such that the sequence $\{\sqrt[n]{a_n}\}_{n\geqslant{N}}$ is log-concave and the sequence $\{\sqrt[n]{a_n}\}_{n\geqslant N}$ is ratio log-convex.
Submitted 29 October, 2023;
originally announced October 2023.
-
Auxiliary Features-Guided Super Resolution for Monte Carlo Rendering
Authors:
Qiqi Hou,
Feng Liu
Abstract:
This paper investigates super resolution to reduce the number of pixels to render and thus speed up Monte Carlo rendering algorithms. While great progress has been made in super resolution technologies, super resolution is essentially an ill-posed problem and cannot recover high-frequency details in renderings. To address this problem, we exploit high-resolution auxiliary features to guide super resolution of low-resolution renderings. These high-resolution auxiliary features can be quickly rendered by a rendering engine and at the same time provide valuable high-frequency details to assist super resolution. To this end, we develop a cross-modality Transformer network that consists of an auxiliary feature branch and a low-resolution rendering branch. These two branches are designed to fuse high-resolution auxiliary features with the corresponding low-resolution rendering. Furthermore, we design residual densely-connected Swin Transformer groups to learn to extract representative features to enable high-quality super-resolution. Our experiments show that our auxiliary features-guided super-resolution method outperforms both super-resolution methods and Monte Carlo denoising methods in producing high-quality renderings.
Submitted 19 October, 2023;
originally announced October 2023.
-
Zone Evaluation: Revealing Spatial Bias in Object Detection
Authors:
Zhaohui Zheng,
Yuming Chen,
Qibin Hou,
Xiang Li,
Ping Wang,
Ming-Ming Cheng
Abstract:
A fundamental limitation of object detectors is that they suffer from "spatial bias", and in particular perform less satisfactorily when detecting objects near image borders. For a long time, there has been a lack of effective ways to measure and identify spatial bias, and little is known about where it comes from and how severe it is. To this end, we present a new zone evaluation protocol, which generalizes the traditional evaluation by measuring the detection performance over zones, yielding a series of Zone Precisions (ZPs). For the first time, we provide numerical results showing that object detectors perform quite unevenly across the zones. Surprisingly, the detector's performance in the 96% border zone of the image does not reach the AP value (Average Precision, commonly regarded as the average detection performance in the entire image zone). To better understand spatial bias, a series of heuristic experiments are conducted. Our investigation rules out two intuitive conjectures about spatial bias: the object scale and the absolute positions of objects barely influence it. We find that the key lies in the human-imperceptible divergence in data patterns between objects in different zones, which eventually forms a visible performance gap between the zones. With these findings, we finally discuss a future direction for object detection, namely the spatial disequilibrium problem, which aims at a balanced detection ability over the entire image zone. By broadly evaluating 10 popular object detectors and 5 detection datasets, we shed light on the spatial bias of object detectors. We hope this work could draw attention to detection robustness. The source codes, evaluation protocols, and tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval.
Submitted 1 June, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Taylor coefficients and series involving harmonic numbers
Authors:
Qing-Hu Hou,
Zhi-Wei Sun
Abstract:
During 2022--2023 Z.-W. Sun posed many conjectures on infinite series with summands involving generalized harmonic numbers. Motivated by this, we deduce $31$ series identities involving harmonic numbers, three of which were previously conjectured by the second author. For example, we obtain that \[ \sum_{k=1}^{\infty} \frac{(-1)^k}{k^2{2k \choose k}{3k \choose k}} \big( \frac{7 k-2}{2 k-1} H_{k-1}^{(2)}-\frac{3}{4 k^2} \big)=\frac{π^4}{720}. \] and \[ \sum_{k=1}^\infty \frac{1}{k^2 {2k \choose k}^2} \left( \frac{30k-11}{k(2k-1)} (H_{2k-1}^{(3)} + 2 H_{k-1}^{(3)}) + \frac{27}{8k^4} \right) = 4 ζ(3)^2, \] where $H_n^{(m)}$ denotes $\sum_{0<j \le n}j^{-m}$.
Submitted 26 October, 2023; v1 submitted 5 October, 2023;
originally announced October 2023.
-
DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation
Authors:
Bowen Yin,
Xuying Zhang,
Zhongyu Li,
Li Liu,
Ming-Ming Cheng,
Qibin Hou
Abstract:
We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that encode RGB-D information with RGB-pretrained backbones, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.
Submitted 7 February, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
Authors:
Yupeng Zhou,
Daquan Zhou,
Zuo-Liang Zhu,
Yaxing Wang,
Qibin Hou,
Jiashi Feng
Abstract:
Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in semantic information embedding from the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can significantly improve the text-to-image consistency with negligible computation overhead compared to the original diffusion models.
Submitted 8 September, 2023;
originally announced September 2023.
-
Current percolation model for the special resistivity behavior observed in Cu-doped Apatite
Authors:
Qiang Hou,
Wei Wei,
Xin Zhou,
Xinyue Wang,
Yue Sun,
ZhiXiang Shi
Abstract:
Since the initial report of the potential occurrence of room-temperature superconductivity under normal pressure [arXiv: 2307.12008], there has been significant interest in the field of condensed matter physics regarding Cu-doped Apatite (Pb10-xCux(PO4)6O). In this study, we performed temperature-dependent resistivity measurements on the synthesized Pb10-xCux(PO4)6O samples. The structure of the sample was confirmed to match the reference literature through X-ray diffraction analysis. Remarkably, we observed four distinct types of resistivity behaviors within samples from the same pellet: (1) A semiconductor-like behavior characterized by a decrease in resistivity as the temperature is lowered. (2) A gradual reduction in resistivity, reaching an exceptionally small value that falls below the resolution limits of our measurement equipment. (3) An abrupt drop in resistivity to a low value at ~ 250 K. (4) An almost linear reduction in resistivity exhibiting a transition at approximately 7 K (possibly associated with Pb). Following a thorough compositional analysis, we proposed a current percolation model, based on the formation of a Cu/Pb current channel, to elucidate the observed special resistivity behaviors. It is important to note that the Meissner effect was not observed in our magnetization measurements. Consequently, we reached the conclusion that the presence of superconductivity in Cu-doped Apatite has yet to be substantiated.
Submitted 10 August, 2023;
originally announced August 2023.
-
YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection
Authors:
Yuming Chen,
Xinbin Yuan,
Ruiqi Wu,
Jiabao Wang,
Qibin Hou,
Ming-Ming Cheng
Abstract:
We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can strongly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our strategy, we build a network architecture, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet, or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, when using a comparable number of parameters and FLOPs. Taking the XS version of YOLO-MS as an example, with only 4.5M learnable parameters and 8.7G FLOPs, it can achieve an AP score of 43%+ on MS COCO, which is about 2%+ higher than RTMDet with the same model size. Moreover, our work can also be used as a plug-and-play module for other YOLO models. Typically, our method significantly improves the AP of YOLOv8 from 37%+ to 40%+ with even fewer parameters and FLOPs. Code is available at https://github.com/FishAndWasabi/YOLO-MS.
Submitted 10 August, 2023;
originally announced August 2023.
-
Observation of zero resistance above 100$^\circ$ K in Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O
Authors:
Qiang Hou,
Wei Wei,
Xin Zhou,
Yue Sun,
Zhixiang Shi
Abstract:
Room-temperature superconductivity has always been regarded as the ultimate goal in the fields of solid-state physics and materials science, with its realization holding revolutionary significance, capable of triggering significant changes in energy transmission and storage. However, achieving it poses various challenges. Recent research revealed that material Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O displays room-temperature superconductivity under atmospheric pressure, sparking global interest in further exploration. Here, we utilized solid-phase synthesis to obtain a polycrystalline sample of Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O. X-ray diffraction confirmed its structural consistency with referenced literature. Zero resistance, which is important evidence for superconductivity, was observed above 100$^\circ$ K under ambient pressure in our experiment. Our finding indicates that Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O is a possible candidate for searching high-temperature superconductors.
Submitted 2 August, 2023;
originally announced August 2023.
-
Molecular Dynamics
Authors:
Halima Mouhib,
Juami H. M. van Gils,
Jose Gavaldá-Garciá,
Qingzhen Hou,
Ali May,
Arriën Symon Rauh,
Jocelyne Vreede,
Sanne Abeln,
K. Anton Feenstra
Abstract:
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction into Structural Bioinformatics, which is where the previous topics meet to explore three dimensional protein structures through computational analysis. We provide an overview of existing computational techniques, to validate, simulate, predict and analyse protein structures. More importantly, it will aim to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics.
We know that many proteins have functional motions, and in Chapter "Structure Determination" we already introduced the famous example of the allosteric cooperative binding of oxygen to the haem group in hemoglobin. However, experimentally, such motions are hard to observe. Here, we will introduce MD simulations to investigate the dynamic behaviour of proteins. In a simulation the forces and interactions between particles are used to numerically derive the resulting three-dimensional movement of these particles over a certain time-scale. We will also highlight some applications, and will see how simulation results may be interpreted.
Submitted 6 July, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Function Prediction
Authors:
Bas Stringer,
Annika Jacobsen,
Qingzhen Hou,
Hans de Ferrante,
Olga Ivanova,
Katharina Waury,
Jose Gavaldá-Garciá,
Sanne Abeln,
K. Anton Feenstra
Abstract:
While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction into Structural Bioinformatics, which is where the previous topics meet to explore three dimensional protein structures through computational analysis. We provide an overview of existing computational techniques, to validate, simulate, predict and analyse protein structures. More importantly, it will aim to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics.
There are still huge gaps in understanding the molecular function of proteins. This raises the question of how we may predict protein function when little to no knowledge from direct experiments is available. Protein function is a broad concept which spans different scales: from quantum scale effects for catalyzing enzymatic reactions, to phenotypes that manifest at the organism level. In fact, many of these functional scales are entirely different research areas. Here, we will consider prediction of a smaller range of functions, roughly spanning the protein residue-level up to the pathway level. We will give a conceptual overview of which functional aspects of proteins we can predict, which methods are currently available, and how well they work in practice.
Submitted 6 July, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Meta-Gating Framework for Fast and Continuous Resource Optimization in Dynamic Wireless Environments
Authors:
Qiushuo Hou,
Mengyuan Lee,
Guanding Yu,
Yunlong Cai
Abstract:
With the great success of deep learning (DL) in image classification, speech recognition, and other fields, more and more studies have applied various neural networks (NNs) to wireless resource allocation. Generally speaking, these artificial intelligence (AI) models are trained under certain learning hypotheses, in particular that the statistics of the training data are static during the training stage. However, the distribution of channel state information (CSI) is constantly changing in the real-world wireless communication environment. Therefore, it is essential to study effective dynamic DL technologies to solve wireless resource allocation problems. In this paper, we propose a novel framework, named meta-gating, for solving resource allocation problems in an episodically dynamic wireless environment, where the CSI distribution changes over periods and remains constant within each period. The proposed framework, consisting of an inner network and an outer network, aims to adapt to the dynamic wireless environment by achieving three important goals, i.e., seamlessness, quickness and continuity. Specifically, for the former two goals, we propose a training method that combines a model-agnostic meta-learning (MAML) algorithm with an unsupervised learning mechanism. With this training method, the inner network is able to adapt quickly to different channel distributions because of the good initialization. As for the goal of continuity, the outer network can learn to evaluate the importance of the inner network's parameters under different CSI distributions, and then decide which subset of the inner network should be activated through the gating operation. Additionally, we theoretically analyze the performance of the proposed meta-gating framework.
Submitted 22 June, 2023;
originally announced June 2023.
-
CrossKD: Cross-Head Knowledge Distillation for Object Detection
Authors:
Jiabao Wang,
Yuming Chen,
Zhaohui Zheng,
Xiang Li,
Ming-Ming Cheng,
Qibin Hou
Abstract:
Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this paper, we present a general and effective prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This approach relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions, greatly improving the student's detection performance. Moreover, as mimicking the teacher's predictions is the target of KD, CrossKD offers more task-oriented information than feature imitation. On MS COCO, with only prediction mimicking losses applied, our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods. In addition, our method also works well when distilling detectors with heterogeneous backbones. Code is available at https://github.com/jbwang1997/CrossKD.
Submitted 15 April, 2024; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Referring Camouflaged Object Detection
Authors:
Xuying Zhang,
Bowen Yin,
Zheng Lin,
Qibin Hou,
Deng-Ping Fan,
Ming-Ming Cheng
Abstract:
We consider the problem of referring camouflaged object detection (Ref-COD), a new task that aims to segment specified camouflaged objects based on a small set of referring images with salient target objects. We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios. Then, we develop a simple but strong dual-branch framework, dubbed R2CNet, with a reference branch embedding the common representations of target objects from referring images and a segmentation branch identifying and segmenting camouflaged objects under the guidance of the common representations. In particular, we design a Referring Mask Generation module to generate pixel-level prior masks and a Referring Feature Enrichment module to enhance the capability of identifying specified camouflaged objects. Extensive experiments show the superiority of our Ref-COD methods over their COD counterparts in segmenting specified camouflaged objects and identifying the main body of target objects. Our code and dataset are publicly available at https://github.com/zhangxuying1004/RefCOD.
Submitted 11 July, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
Authors:
Boyuan Sun,
Yuqi Yang,
Le Zhang,
Ming-Ming Cheng,
Qibin Hou
Abstract:
This paper presents a simple but performant semi-supervised semantic segmentation approach, called CorrMatch. Previous approaches mostly employ complicated training strategies to leverage unlabeled data but overlook the role of correlation maps in modeling the relationships between pairs of locations. We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information, which previous works have omitted. Motivated by these observations, we aim to use unlabeled data more efficiently by designing two novel label propagation strategies. First, we propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and uncover more of them. Then, we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps. CorrMatch achieves great performance on popular segmentation benchmarks. Taking DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. Code is available at https://github.com/BBBBchan/CorrMatch.
Submitted 10 December, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Delving Deeper into Data Scaling in Masked Image Modeling
Authors:
Cheng-Ze Lu,
Xiaojie Jin,
Qibin Hou,
Jun Hao Liew,
Ming-Ming Cheng,
Jiashi Feng
Abstract:
Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.
Submitted 24 May, 2023;
originally announced May 2023.
-
Ramanujan-inspired series for $1/π$ involving harmonic numbers
Authors:
Qinghu Hou,
Haihong He,
Xiaoxia Wang
Abstract:
By applying the derivative operator to the known identities from hypergeometric series or WZ pairs, we obtain seven series associated with harmonic numbers. Specifically, six of them are Ramanujan-like formulas for $1/π$ and the remaining one contains harmonic numbers of order $2$. As a consequence, five conjectural series of Sun are proved.
Submitted 8 July, 2023; v1 submitted 30 April, 2023;
originally announced May 2023.
-
New 26P(p,γ)27S thermonuclear reaction rate and its astrophysical implication in rp-process
Authors:
S. Q. Hou,
J. B. Liu,
T. C. L. Trueman,
J. G. Li,
M. Pignatari,
C. Bertulani,
X. X. Xu
Abstract:
Accurate nuclear reaction rates for 26P(p,γ)27S are pivotal for a comprehensive understanding of rp-process nucleosynthesis path in the region of proton-rich sulfur and phosphorus isotopes. However, large uncertainties still exist in the current rate of 26P(p,γ)27S because of the lack of the nuclear mass and the energy level structure information of 27S. We reevaluate this reaction rate using the experimentally constrained 27S mass, together with the shell-model predicted level structure. It is found that the 26P(p,γ)27S reaction rate is dominated by a direct-capture (DC) reaction mechanism despite the presence of three resonances at E = 1.104, 1.597, 1.777 MeV above the proton threshold in 27S. The new rate is overall smaller than the other previous rates from Hauser-Feshbach statistical model by at least one order of magnitude in the temperature range of X-ray burst interest. In addition, we consistently update the photodisintegration rate using the new 27S mass. The influence of new rates of forward and reverse reaction in the abundances of isotopes produced in rp-process is explored by post-processing nucleosynthesis calculations. The final abundance ratio of 27S/26P obtained using the new rates is only 10% of that from the old rate. The abundance flow calculations show the reaction path 26P(p,γ)27S(β+,ν)27P is not as important as thought previously for producing 27P. The adoption of the new reaction rates for 26P(p,γ)27S only reduces the final production of aluminum by 7.1%, and has no discernible impact on the yield of other elements.
Submitted 29 April, 2023;
originally announced May 2023.
-
Structure Diagram Recognition in Financial Announcements
Authors:
Meixuan Qiao,
Jun Wang,
Junfu Xiang,
Qiyu Hou,
Ruixuan Li
Abstract:
Accurately extracting structured data from structure diagrams in financial announcements is of great practical importance for building financial knowledge graphs and further improving the efficiency of various financial applications. First, we propose a new method for recognizing structure diagrams in financial announcements, which can better detect and extract different types of connecting lines, including straight lines, curves, and polylines of different orientations and angles. Second, we develop a two-stage method to efficiently generate the industry's first benchmark of structure diagrams from Chinese financial announcements. In the first stage, a large number of diagrams are synthesized and annotated using an automated tool to train a preliminary recognition model with fairly good performance. In the second stage, a high-quality benchmark is obtained by automatically annotating real-world structure diagrams with the preliminary model and then making a few manual corrections. Finally, we experimentally verify the significant performance advantage of our structure diagram recognition method over previous methods.
Submitted 1 May, 2023; v1 submitted 25 April, 2023;
originally announced April 2023.