Search | arXiv e-print repository

arXiv:2408.16921 [pdf, other]

Rapid, in-situ neutralization of nitrogen- and silicon-vacancy centers in diamond using above-band-gap optical excitation

Authors: Christian Pederson, Nicholas S. Yama, Lane Beale, Matthew L. Markham, Kai-Mei C. Fu

Abstract: The charge state of a quantum point defect in a solid state host strongly determines its optical and spin characteristics. Consequently, techniques for controlling the charge state are required to realize technologies such as quantum networking and sensing. In this work we demonstrate the use of deep-ultraviolet (DUV) radiation to dynamically neutralize nitrogen- (NV) and silicon-vacancy (SiV) centers. We first examine the conversion between the neutral and negatively charged NV states by correlating the variation of their respective spectra, indicating that more than 99% of the population of NV centers can be initialized into the neutral charge state. We then examine the time dynamics of bleaching and recharging of negatively charged SiV$^-$ centers and observe an 80% reduction in SiV$^-$ photoluminescence within a single 100-$μ$s DUV pulse. Finally we demonstrate that the bleaching of SiV$^-$ induced by the DUV is accompanied by a dramatic increase in the neutral SiV$^0$ population; SiV$^0$ remains robust to extended periods of near-infrared excitation despite being a non-equilibrium state. DUV excitation thus presents a reliable method of generating SiV$^0$, a desirable charge state for quantum network applications that is challenging to obtain by equilibrium Fermi engineering alone. Our results on two separate color centers at technologically relevant temperatures indicate a potential for above-band-gap excitation as a universal means of generating the neutral charge states of quantum point defects on demand.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.15511 [pdf, other]

AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models

Authors: Fanglong Yao, Yuanchang Yue, Youzhi Liu, Xian Sun, Kun Fu

Abstract: Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while research on UAV intelligent agents remains unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodiment world model. Simultaneously, we develop SkyAgentEval, the downstream task evaluation metrics based on GPT-4, to comprehensively, flexibly, and objectively assess the results, revealing the potential and limitations of 2D/3D visual language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D visual-language models, 2 pre-training datasets, 5 finetuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, i.e., AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.07305 [pdf, ps, other]

Learning Decisions Offline from Censored Observations with ε-insensitive Operational Costs

Authors: Minxia Chen, Ke Fu, Teng Huang, Miao Bai

Abstract: Many important managerial decisions are made based on censored observations. Making decisions without adequately handling the censoring leads to inferior outcomes. We investigate the data-driven decision-making problem with an offline dataset containing the feature data and the censored historical data of the variable of interest without the censoring indicators. Without assuming the underlying distribution, we design and leverage ε-insensitive operational costs to deal with the unobserved censoring in an offline data-driven fashion. We demonstrate the customization of the ε-insensitive operational costs for a newsvendor problem and use such costs to train two representative ML models, including linear regression (LR) models and neural networks (NNs). We derive tight generalization bounds for the custom LR model without regularization (LR-εNVC) and with regularization (LR-εNVC-R), and a high-probability generalization bound for the custom NN (NN-εNVC) trained by stochastic gradient descent. The theoretical results reveal the stability and learnability of LR-εNVC, LR-εNVC-R and NN-εNVC. We conduct extensive numerical experiments to compare LR-εNVC-R and NN-εNVC with two existing approaches, estimate-as-solution (EAS) and integrated estimation and optimization (IEO). The results show that LR-εNVC-R and NN-εNVC outperform both EAS and IEO, with maximum cost savings up to 14.40% and 12.21% compared to the lowest cost generated by the two existing approaches. In addition, LR-εNVC-R's and NN-εNVC's order quantities are statistically significantly closer to the optimal solutions should the underlying distribution be known.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.04213 [pdf, other]

Hypothesis testing for general network models

Authors: Kang Fu, Jianwei Hu, Seydou Keita

Abstract: Network data has attracted considerable attention in modern statistics. In research on complex network data, one key issue is finding its underlying connection structure given a network sample. The methods that have been proposed in the literature usually assume that the underlying structure is a known model. In practice, however, the true model is usually unknown, and network learning procedures based on these methods may suffer from model misspecification. To handle this issue, based on random matrix theory, we first give a spectral property of the normalized adjacency matrix under a mild condition. Further, we establish a general goodness-of-fit test procedure for unweighted and undirected networks. We prove that the null distribution of the proposed statistic converges in distribution to the standard normal distribution. Theoretically, this testing procedure is suitable for nearly all popular network models, such as stochastic block models and latent space models. Further, we apply the proposed method to the degree-corrected mixed membership model and give a sequential estimator of the number of communities. Both simulation studies and real-world data examples indicate that the proposed method works well.

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2408.00525 [pdf, other]

Identifying the Hierarchical Emotional Areas in the Human Brain Through Information Fusion

Authors: Zhongyu Huang, Changde Du, Chaozhuo Li, Kaicheng Fu, Huiguang He

Abstract: The brain basis of emotion has consistently received widespread attention, attracting a large number of studies to explore this cutting-edge topic. However, the methods employed in these studies typically only model the pairwise relationship between two brain regions, while neglecting the interactions and information fusion among multiple brain regions, one of the key ideas of the psychological constructionist hypothesis. To overcome the limitations of traditional methods, this study provides an in-depth theoretical analysis of how to maximize interactions and information fusion among brain regions. Building on the results of this analysis, we propose to identify the hierarchical emotional areas in the human brain through multi-source information fusion and graph machine learning methods. Comprehensive experiments reveal that the identified hierarchical emotional areas, from lower to higher levels, primarily facilitate the fundamental process of emotion perception, the construction of basic psychological operations, and the coordination and integration of these operations. Overall, our findings provide unique insights into the brain mechanisms underlying specific emotions based on the psychological constructionist hypothesis.

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.09209 [pdf, other]

Pronunciation Assessment with Multi-modal Large Language Models

Authors: Kaiqi Fu, Linkai Peng, Nan Yang, Shuran Zhou

Abstract: Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLMs to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring systems achieve competitive results compared to the baselines on the Speechocean762 datasets. Moreover, we also conducted an ablation study to better understand the contributions of the prompt text and training strategy in the proposed scoring system.

Submitted 18 July, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.02682 [pdf]

Epitaxial Growth of Rutile GeO$_2$ via MOCVD

Authors: Imteaz Rahaman, Bobby Duersch, Hunter D. Ellis, Michael A. Scarpulla, Kai Fu

Abstract: Rutile Germanium Dioxide (r-GeO$_2$) has been identified as an ultrawide bandgap (UWBG) semiconductor recently, featuring a bandgap of 4.68 eV, comparable to Ga$_2$O$_3$ but offering bipolar dopability, higher electron mobility, higher thermal conductivity, and higher Baliga's figure of merit (BFOM). These superior properties position GeO$_2$ as a promising material for various semiconductor applications. However, the epitaxial growth of r-GeO$_2$, particularly in its most advantageous rutile polymorph, is still at an early stage. This work explores the growth of r-GeO$_2$ using metal-organic chemical vapor deposition (MOCVD) on an r-TiO$_2$ (001) substrate, utilizing tetraethyl germane (TEGe) as the precursor. Our investigations reveal that higher growth temperatures significantly enhance crystalline quality, achieving a full width at half maximum (FWHM) of 0.181° at 925 °C, compared to 0.54° at 840 °C and amorphous structures at 725 °C. Additionally, we found that longer growth durations increase surface roughness due to the formation of faceted crystals. Meanwhile, adjusting the susceptor rotation speed from 300 RPM to 170 RPM plays a crucial role in optimizing crystalline quality, effectively reducing surface roughness by approximately 15 times. This study offers a foundational guide for optimizing MOCVD growth conditions of r-GeO$_2$ films, emphasizing the crucial need for precise control over deposition temperature and rotation speed to enhance adatom mobility and effectively minimize the boundary layer thickness.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 20 pages, 5 figures, 3 tables

arXiv:2407.01067 [pdf, other]

Human-like object concept representations emerge naturally in multimodal large language models

Authors: Changde Du, Kaicheng Fu, Bincheng Wen, Yi Sun, Jie Peng, Wei Wei, Ying Gao, Shengpei Wang, Chuncheng Zhang, Jinpeng Li, Shuang Qiu, Le Chang, Huiguang He

Abstract: The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.15848 [pdf, other]

Quality-guided Skin Tone Enhancement for Portrait Photography

Authors: Shiqi Gao, Huiyu Duan, Xinyue Li, Kang Fu, Yicong Peng, Qihang Xu, Yuanyuan Chang, Jia Wang, Xiongkuo Min, Guangtao Zhai

Abstract: In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image continuously, since in many cases we may want to get a slighter or stronger enhancement effect rather than one fixed adjusted result. In this paper, we propose a quality-guided image enhancement paradigm that enables image enhancement models to learn the distribution of images with various quality ratings. By learning this distribution, image enhancement models can associate image features with their corresponding perceptual qualities, which can be used to adjust images continuously according to different quality scores. To validate the effectiveness of our proposed method, a subjective quality assessment experiment is first conducted, focusing on skin tone adjustment in portrait photography. Guided by the subjective quality ratings obtained from this experiment, our method can adjust the skin tone corresponding to different quality requirements. Furthermore, an experiment conducted on 10 natural raw images corroborates the effectiveness of our model in situations with fewer subjects and fewer shots, and also demonstrates its general applicability to natural images. Our project page is https://github.com/IntMeGroup/quality-guided-enhancement .

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.08804 [pdf, other]

DIET: Customized Slimming for Incompatible Networks in Sequential Recommendation

Authors: Kairui Fu, Shengyu Zhang, Zheqi Lv, Jingyuan Chen, Jiwei Li

Abstract: Due to the continuously improving capabilities of mobile edges, recommender systems start to deploy models on edges to alleviate network congestion caused by frequent mobile requests. Several studies have leveraged the proximity of edge-side to real-time data, fine-tuning them to create edge-specific models. Despite their significant progress, these methods require substantial on-edge computational resources and frequent network transfers to keep the model up to date. The former may disrupt other processes on the edge to acquire computational resources, while the latter consumes network bandwidth, leading to a decrease in user satisfaction. In response to these challenges, we propose a customizeD slImming framework for incompatiblE neTworks (DIET). DIET deploys the same generic backbone (potentially incompatible for a specific edge) to all devices. To minimize frequent bandwidth usage and storage consumption in personalization, DIET tailors specific subnets for each edge based on its past interactions, learning to generate slimming subnets (diets) within incompatible networks for efficient transfer. It also takes the inter-layer relationships into account, empirically reducing inference time while obtaining more suitable diets. We further explore the repeated modules within networks and propose a more storage-efficient framework, DIETING, which utilizes a single layer of parameters to represent the entire network, achieving comparably excellent performance. The experiments across four state-of-the-art datasets and two widely used models demonstrate the superior accuracy in recommendation and efficiency in transmission and storage of our framework.

Submitted 15 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by KDD 2024

arXiv:2406.02017 [pdf, other]

On the Mode-Seeking Properties of Langevin Dynamics

Authors: Xiwei Cheng, Kexin Fu, Farzan Farnia

Abstract: The Langevin Dynamics framework, which aims to generate samples from the score function of a probability distribution, is widely used for analyzing and interpreting score-based generative modeling. While the convergence behavior of Langevin Dynamics under unimodal distributions has been extensively studied in the literature, in practice the data distribution could consist of multiple distinct modes. In this work, we investigate Langevin Dynamics in producing samples from multimodal distributions and theoretically study its mode-seeking properties. We prove that under a variety of sub-Gaussian mixtures, Langevin Dynamics is unlikely to find all mixture components within a sub-exponential number of steps in the data dimension. To reduce the mode-seeking tendencies of Langevin Dynamics, we propose Chained Langevin Dynamics, which divides the data vector into patches of constant size and generates every patch sequentially conditioned on the previous patches. We perform a theoretical analysis of Chained Langevin Dynamics by reducing it to sampling from a constant-dimensional distribution. We present the results of several numerical experiments on synthetic and real image datasets, supporting our theoretical results on the iteration complexities of sample generation from mixture distributions using the chained and vanilla Langevin Dynamics. The code is available at https://github.com/Xiwei-Cheng/Chained_LD.

Submitted 4 June, 2024; originally announced June 2024.
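
The mode-seeking behavior described in the abstract above can be illustrated with a small sketch. This is not the authors' code and not Chained Langevin Dynamics; it is the standard unadjusted Langevin update, x ← x + η·∇log p(x) + √(2η)·ξ, applied to a hypothetical one-dimensional mixture 0.5·N(−m, 1) + 0.5·N(+m, 1). The function names, the step size, and the mode location are all assumptions chosen for illustration.

```python
import math
import random


def mixture_score(x, mode):
    # Score (d/dx of log density) of the assumed toy mixture
    # 0.5*N(-mode, 1) + 0.5*N(+mode, 1), which simplifies to
    # -x + mode * tanh(mode * x).
    return -x + mode * math.tanh(mode * x)


def langevin_samples(x0, mode, step, n_steps, rng):
    # Unadjusted Langevin iteration:
    # x <- x + step * score(x) + sqrt(2 * step) * standard normal noise.
    x = x0
    path = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * mixture_score(x, mode) + math.sqrt(2.0 * step) * noise
        path.append(x)
    return path


# Start in the left mode of a mixture whose modes sit at -10 and +10.
path = langevin_samples(x0=-10.0, mode=10.0, step=0.01, n_steps=2000,
                        rng=random.Random(0))
visited_right_mode = any(x > 0.0 for x in path)
```

In this toy setting the two modes are separated by a deep low-density region, so a chain started in one mode would be expected to stay there for the whole run, which is the qualitative effect the abstract refers to.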

arXiv:2406.01864 [pdf, other]

doi 10.1080/03610918.2024.2391897

Variance-reduced sampling importance resampling

Authors: Yao Xiao, Kang Fu, Kun Li

Abstract: The sampling importance resampling method is widely utilized in various fields, such as numerical integration and statistical simulation. In this paper, two modified methods are presented by incorporating two variance reduction techniques commonly used in Monte Carlo simulation, namely antithetic sampling and Latin hypercube sampling, into the sampling importance resampling procedure, respectively. Theoretical evidence is provided to demonstrate that the proposed methods significantly reduce estimation errors compared to the original approach. Furthermore, the effectiveness and advantages of the proposed methods are validated through both numerical studies and real data analysis.

Submitted 3 June, 2024; originally announced June 2024.
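
For readers unfamiliar with the baseline that the abstract above modifies, here is a minimal sketch of plain sampling importance resampling (SIR). It does not implement the antithetic or Latin hypercube variants from the paper. The target (an unnormalized standard normal), the uniform proposal on [-5, 5], and all function names are assumptions made for illustration.

```python
import math
import random


def sir(target_pdf, proposal_pdf, proposal_sampler, n_draws, n_resample, rng):
    # Step 1: draw candidate points from the proposal distribution.
    draws = [proposal_sampler(rng) for _ in range(n_draws)]
    # Step 2: importance weights = target density / proposal density.
    # The target may be unnormalized because only weight ratios matter.
    weights = [target_pdf(x) / proposal_pdf(x) for x in draws]
    # Step 3: resample with replacement, proportionally to the weights.
    return rng.choices(draws, weights=weights, k=n_resample)


def target_pdf(x):
    # Assumed target: unnormalized standard normal density.
    return math.exp(-0.5 * x * x)


def proposal_pdf(x):
    # Assumed proposal: Uniform(-5, 5), density 1/10 on its support.
    return 0.1 if -5.0 <= x <= 5.0 else 0.0


def proposal_sampler(rng):
    return rng.uniform(-5.0, 5.0)


samples = sir(target_pdf, proposal_pdf, proposal_sampler,
              n_draws=20000, n_resample=2000, rng=random.Random(0))
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
```

The resampled points should be approximately distributed as the target, so `sample_mean` and `sample_var` should land near 0 and 1. Variance-reduction schemes such as those named in the abstract would change how the candidates in step 1 are generated.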

arXiv:2405.20600 [pdf, other]

Multi-label Class Incremental Emotion Decoding with Augmented Emotional Semantics Learning

Authors: Kaicheng Fu, Changde Du, Xiaoyu Chen, Jie Peng, Huiguang He

Abstract: Emotion decoding plays an important role in affective human-computer interaction. However, previous studies ignored the dynamic real-world scenario, where humans experience a blend of multiple emotions which are incrementally integrated into the model, leading to the multi-label class incremental learning (MLCIL) problem. Existing methods have difficulty in solving the MLCIL issue due to notorious catastrophic forgetting caused by the partial label problem and inadequate label semantics mining. In this paper, we propose an augmented emotional semantics learning framework for multi-label class incremental emotion decoding. Specifically, we design an augmented emotional relation graph module with label disambiguation to handle the past-missing partial label problem. Then, we leverage domain knowledge from affective dimension space to alleviate the future-missing partial label problem by knowledge distillation. Besides, an emotional semantics learning module is constructed with a graph autoencoder to obtain emotion embeddings in order to guide the semantic-specific feature decoupling for better multi-label learning. Extensive experiments on three datasets show the superiority of our method for improving emotion decoding performance and mitigating forgetting on the MLCIL problem.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19735 [pdf, other]

Twin Deformable Point Convolutions for Point Cloud Semantic Segmentation in Remote Sensing Scenes

Authors: Yong-Qiang Mao, Hanbo Bi, Xuexue Li, Kaiqiang Chen, Zhirui Wang, Xian Sun, Kun Fu

Abstract: Thanks to the application of deep learning technology in point cloud processing of the remote sensing field, point cloud segmentation has become a research hotspot in recent years, which can be applied to real-world 3D, smart cities, and other fields. Although existing solutions have made unprecedented progress, they ignore the inherent characteristics of point clouds in remote sensing fields that are strictly arranged according to latitude, longitude, and altitude, which brings great convenience to the segmentation of point clouds in remote sensing fields. To consider this property cleverly, we propose novel convolution operators, termed Twin Deformable point Convolutions (TDConvs), which aim to achieve adaptive feature learning by learning deformable sampling points in the latitude-longitude plane and altitude direction, respectively. First, to model the characteristics of the latitude-longitude plane, we propose a Cylinder-wise Deformable point Convolution (CyDConv) operator, which generates a two-dimensional cylinder map by constructing a cylinder-like grid in the latitude-longitude direction. Furthermore, to better integrate the features of the latitude-longitude plane and the spatial geometric features, we perform a multi-scale fusion of the extracted latitude-longitude features and spatial geometric features, and realize it through the aggregation of adjacent point features of different scales. In addition, a Sphere-wise Deformable point Convolution (SpDConv) operator is introduced to adaptively offset the sampling points in three-dimensional space by constructing a sphere grid structure, aiming at modeling the characteristics in the altitude direction. Experiments on existing popular benchmarks conclude that our TDConvs achieve the best segmentation performance, surpassing the existing state-of-the-art methods.

Submitted 4 August, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19689 [pdf, other]

Uncertainty-aware sign language video retrieval with probability distribution modeling

Authors: Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu

Abstract: Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet), that conceptualizes the mapping process of sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.17140 [pdf, other]

SDL-MVS: View Space and Depth Deformable Learning Paradigm for Multi-View Stereo Reconstruction in Remote Sensing

Authors: Yong-Qiang Mao, Hanbo Bi, Liangyu Xu, Kaiqiang Chen, Zhirui Wang, Xian Sun, Kun Fu

Abstract: Research on multi-view stereo based on remote sensing images has promoted the development of large-scale urban 3D reconstruction. However, remote sensing multi-view image data suffers from the problems of occlusion and uneven brightness between views during acquisition, which leads to the problem of blurred details in depth estimation. To solve the above problem, we re-examine the deformable learning method in the Multi-View Stereo task and propose a novel paradigm based on view Space and Depth deformable Learning (SDL-MVS), aiming to learn deformable interactions of features in different view spaces and deformably model the depth ranges and intervals to enable highly accurate depth estimation. Specifically, to solve the problem of view noise caused by occlusion and uneven brightness, we propose a Progressive Space deformable Sampling (PSS) mechanism, which performs deformable learning of sampling points in the 3D frustum space and the 2D image space in a progressive manner to embed source features to the reference feature adaptively. To further optimize the depth, we introduce Depth Hypothesis deformable Discretization (DHD), which achieves precise positioning of the depth prior by adaptively adjusting the depth range hypothesis and performing deformable discretization of the depth interval hypothesis. Finally, our SDL-MVS achieves explicit modeling of occlusion and uneven brightness faced in multi-view stereo through the deformable learning paradigm of view space and depth, achieving accurate multi-view depth estimation.
Extensive experiments on LuoJia-MVS and WHU datasets show that our SDL-MVS reaches state-of-the-art performance. It is worth noting that our SDL-MVS achieves an MAE of 0.086, an accuracy of 98.9% for <0.6m, and 98.9% for <3-interval on the LuoJia-MVS dataset under the premise of three views as input.

Submitted 27 May, 2024; originally announced May 2024.
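
The LuoJia-MVS figures quoted in the abstract above (MAE and the share of estimates with error under 0.6 m) are standard depth-map metrics. A minimal sketch of how such numbers are typically computed, assuming flat lists of predicted and ground-truth depths in metres; the function names and the sample values are illustrative and not taken from the paper:

```python
def depth_mae(pred, truth):
    """Mean absolute error between predicted and ground-truth depths."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def threshold_accuracy(pred, truth, threshold=0.6):
    """Fraction of depths whose absolute error is below `threshold` metres."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(pred, truth) if abs(p - t) < threshold)
    return hits / len(pred)


# Illustrative values only (not data from the paper).
pred = [10.0, 20.5, 31.0]
truth = [10.0, 20.0, 30.0]
mae = depth_mae(pred, truth)
acc = threshold_accuracy(pred, truth, threshold=0.6)
```

The "<3-interval" accuracy mentioned in the abstract would use the same thresholding idea with a threshold expressed in depth-hypothesis intervals rather than metres; that conversion depends on the dataset and is not sketched here.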

arXiv:2405.07564 [pdf]

Growth of GeO2 on R-plane and C-plane Sapphires by MOCVD

Authors: Imteaz Rahaman, Hunter D. Ellis, Kathy Anderson, Michael A. Scarpulla, Kai Fu

Abstract: Rutile Germanium Dioxide (GeO2) has been recently theoretically identified as an ultrawide bandgap (UWBG) semiconductor with bandgap 4.68 eV similar to Ga2O3 but having bipolar dopability and ~2x higher electron mobility, Baliga figure of merit (BFOM) and thermal conductivity than Ga2O3. Bulk crystal growth is rapidly moving towards making large sized native substrates available. These outstanding material properties position GeO2 as a highly attractive UWBG semiconductor for various applications. However, the epitaxial growth in the most advantageous polymorph (rutile), ensuring controlled phase, pristine surface/interface quality, precise microstructure, and optimal functional properties, is still in its infancy. In this work, we explored growth of GeO2 by metal-organic chemical vapor deposition (MOCVD) on both C- and R-plane sapphire. Utilizing tetramethylgermane (TMGe) as a precursor, we have investigated the influences of different parameters on the film properties, including growth temperature, chamber pressure, TMGe flow rate, oxygen flow rate, shroud gas flow rate, and rotation speed. The total pressure emerged as a crucial parameter while growth attempts at low total pressure resulted in no films for a wide range of temperatures, precursor flow rate, argon flow rates, and susceptor rotation rate. A phase diagram, derived from our experimental findings, delineates the growth windows for GeO2 films on sapphire substrates. This study serves as a pioneering guide for the MOCVD growth of GeO2 films.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 22 pages, 14 Figures

arXiv:2404.13322 [pdf, other]

MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities

Authors: Kunxi Li, Tianyu Zhan, Kairui Fu, Shengyu Zhang, Kun Kuang, Jiwei Li, Zhou Zhao, Fei Wu

Abstract: In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.

Submitted 17 June, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

arXiv:2404.08980 [pdf, other]

Stability and Generalization in Free Adversarial Training

Authors: Xiwei Cheng, Kexin Fu, Farzan Farnia

Abstract: While adversarial training methods have resulted in significant improvements in the deep neural nets' robustness against norm-bounded adversarial perturbations, their generalization performance from training samples to test data has been shown to be considerably worse than standard empirical risk minimization methods. Several recent studies seek to connect the generalization behavior of adversarially trained classifiers to various gradient-based min-max optimization algorithms used for their training. In this work, we study the generalization performance of adversarial training methods using the algorithmic stability framework. Specifically, our goal is to compare the generalization performance of the vanilla adversarial training scheme fully optimizing the perturbations at every iteration vs. the free adversarial training simultaneously optimizing the norm-bounded perturbations and classifier parameters. Our proven generalization bounds indicate that the free adversarial training method could enjoy a lower generalization gap between training and test samples due to the simultaneous nature of its min-max optimization algorithm. We perform several numerical experiments to evaluate the generalization performance of vanilla, fast, and free adversarial training methods. Our empirical findings also show the improved generalization performance of the free adversarial training method and further demonstrate that the better generalization result could translate to greater robustness against black-box attack schemes.
The code is available at https://github.com/Xiwei-Cheng/Stability_FreeAT.

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.08195 [pdf, other]

Tackling Ambiguity from Perspective of Uncertainty Inference and Affinity Diversification for Weakly Supervised Semantic Segmentation

Authors: Zhiwei Yang, Yucong Meng, Kexue Fu, Shuo Wang, Zhijian Song

Abstract: Weakly supervised semantic segmentation (WSSS) with image-level labels intends to achieve dense tasks without laborious annotations. However, due to the ambiguous contexts and fuzzy regions, the performance of WSSS, especially the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, widely suffers from ambiguity while being barely noticed by previous literature. In this work, we propose UniA, a unified single-staged WSSS framework, to efficiently tackle this issue from the perspective of uncertainty inference and affinity diversification, respectively. When activating class objects, we argue that the false activation stems from the bias to the ambiguous regions during the feature extraction. Therefore, we design a more robust feature representation with a probabilistic Gaussian distribution and introduce the uncertainty estimation to avoid the bias. A distribution loss is particularly proposed to supervise the process, which effectively captures the ambiguity and models the complex dependencies among features. When refining pseudo labels, we observe that the affinity from the prevailing refinement methods intends to be similar among ambiguities. To this end, an affinity diversification module is proposed to promote diversity among semantics. A mutual complementing refinement is proposed to initially rectify the ambiguous affinity with multiple inferred pseudo labels.
More importantly, a contrastive affinity loss is further designed to diversify the relations among unrelated semantics, which reliably propagates the diversity into the whole feature representations and helps generate better pseudo masks. Extensive experiments are conducted on PASCAL VOC, MS COCO, and medical ACDC datasets, which validate the efficiency of UniA tackling ambiguity and the superiority over recent single-staged or even most multi-staged competitors.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.18458 [pdf, other]

Dust Extinction Measures for $z\sim 8$ Galaxies using Machine Learning on JWST Imaging

Authors: Kwan Lin Kristy Fu, Christopher J. Conselice, Leonardo Ferreira, Thomas Harvey, Qiao Duan, Nathan Adams, Duncan Austin

Abstract: We present the results of a machine learning study to measure the dust content of galaxies observed with JWST at z > 6 through the use of trained neural networks based on high-resolution IllustrisTNG simulations. Dust is an important unknown in the evolution and observability of distant galaxies and is degenerate with other stellar population features through spectral energy fitting. As such, we develop and test a new SED-independent machine learning method to predict dust attenuation and sSFR of high redshift (z > 6) galaxies. Simulated galaxies were constructed using the IllustrisTNG model, with a variety of dust contents parameterized by E(B-V) and A(V) values, then used to train Convolutional Neural Network (CNN) models using supervised learning through a regression model. We demonstrate that within the context of these simulations, our single and multi-band models are able to predict dust content of distant galaxies to within a 1$σ$ dispersion of A(V) $\sim 0.1$. Applied to spectroscopically confirmed z > 6 galaxies from the JADES and CEERS programs, our models predicted attenuation values of A(V) < 0.7 for all systems, with a low average (A(V) = 0.28). Our CNN predictions show larger dust attenuation but lower amounts of star formation compared to SED fitted values. Both results show that distant galaxies with confirmed spectroscopy are not extremely dusty, although this sample is potentially significantly biased.
We discuss these issues and present ideas on how to accurately measure dust features at the highest redshifts using a combination of machine learning and SED fitting.

Submitted 27 March, 2024; originally announced March 2024.

Comments: submitted to MNRAS

arXiv:2403.18238 [pdf, other]

TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes

Authors: Liangyu Xu, Wanxuan Lu, Hongfeng Yu, Yongqiang Mao, Hanbo Bi, Chenglong Liu, Xian Sun, Kun Fu

Abstract: As drone technology advances, using unmanned aerial vehicles for aerial surveys has become the dominant trend in modern low-altitude remote sensing. The surge in aerial video data necessitates accurate prediction for future scenarios and motion states of the interested target, particularly in applications like traffic management and disaster response. Existing video prediction methods focus solely on predicting future scenes (video frames), suffering from the neglect of explicitly modeling target's motion states, which is crucial for aerial video interpretation. To address this issue, we introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target. Further, we design a model specifically for this task, named TAFormer, which provides a unified modeling approach for both video and target motion states. Specifically, we introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion. Additionally, we design an Information Sharing Mechanism (ISM), which elegantly unifies the modeling of video and target motion by facilitating information interaction through two sets of messenger tokens. Moreover, to alleviate the difficulty of distinguishing targets in blurry predictions, we introduce Target-Sensitive Gaussian Loss (TSGL), enhancing the model's sensitivity to both target's position and content.
Extensive experiments on UAV123VP and VisDroneVP (derived from single-object tracking datasets) demonstrate the exceptional performance of TAFormer in target-aware video prediction, showcasing its adaptability to the additional requirements of aerial video interpretation for target awareness.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 17 pages, 9 figures

arXiv:2403.09675 [pdf, other]

Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases

Authors: Rio Aguina-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R. Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, Daniel Ritchie

Abstract: We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object categories -- we call this setting open-universe indoor scene generation. Unlike most prior work on indoor scene generation, our system does not require a large training dataset of existing 3D scenes. Instead, it leverages the world knowledge encoded in pre-trained large language models (LLMs) to synthesize programs in a domain-specific layout language that describe objects and spatial relations between them. Executing such a program produces a specification of a constraint satisfaction problem, which the system solves using a gradient-based optimization scheme to produce object positions and orientations. To produce object geometry, the system retrieves 3D meshes from a database. Unlike prior work which uses databases of category-annotated, mutually-aligned meshes, we develop a pipeline using vision-language models (VLMs) to retrieve meshes from massive databases of un-annotated, inconsistently-aligned meshes. Experimental evaluations show that our system outperforms generative models trained on 3D data for traditional, closed-universe scene generation tasks; it also outperforms a recent LLM-based layout generation method on open-universe scene generation.

Submitted 4 February, 2024; originally announced March 2024.

Comments: See ancillary files for link to supplemental material

arXiv:2403.08973 [pdf, other]

Measurements and modeling of induced flow in collective vertical migration

Authors: Nina Mohebbi, Joonha Hwang, Matthew K. Fu, John O. Dabiri

Abstract: Hydrodynamic interactions among swimming or flying organisms can lead to complex flows on the scale of the group. These emergent fluid dynamics are often more complex than a linear superposition of individual organism flows, especially at intermediate Reynolds numbers. This paper presents an approach to estimate the flow induced by multiple swimmer wakes in proximity using an analytical model that conserves mass and momentum in the aggregation. This analytical model was informed by and validated with empirical measurements of induced vertical migrations of brine shrimp, $\textit{Artemia salina}$. The response of individual swimmers to ambient background flow and light intensity was evaluated. In addition, the time-resolved three-dimensional spatial configuration of the swimmers was measured using a recently developed laser scanning system. Computational experiments using the analytical model found that the induced flow at the front of the aggregation was insensitive to the presence of downstream swimmers, with the induced flow reaching an asymptote beyond a threshold aggregation length. Closer swimmer spacing led to higher induced flow, in some cases leading to model predictions of induced flow exceeding swimmer speeds required to maintain a stable spatial configuration. This result was reconciled by comparing two different models for the near-wake of each swimmer.
Our results demonstrate that aggregation-scale flows result from a complex, yet predictable interplay amongst organism-scale wake structure, swimmer spacing and configuration, and aggregation size.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.06068 [pdf, other]

Hypothesis testing for homogenous of nodes in $β$-models

Authors: Kang Fu, Jianwei Hu, Meng Sun

Abstract: The $β$-model has been extensively utilized to model degree heterogeneity in networks, wherein each node is assigned a unique parameter. In this article, we consider the hypothesis testing problem that two nodes $i$ and $j$ of a $β$-model have the same node parameter. We prove that the null distribution of the proposed statistic converges in distribution to the standard normal distribution. Further, we investigate the homogeneous test for $β$-model by combining individual $p$-values to aggregate small effects of multiple tests. Both simulation studies and real-world data examples indicate that the proposed method works well.

Submitted 9 March, 2024; originally announced March 2024.
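
For readers unfamiliar with the setting in the abstract above, a minimal sketch follows. The edge probability is the standard $β$-model form, in which nodes $i$ and $j$ are linked with probability $e^{β_i+β_j}/(1+e^{β_i+β_j})$. The $p$-value combination uses Fisher's method purely as an illustrative assumption: the abstract does not say which combination rule the authors use, and Fisher's method additionally assumes independent tests. All function names are hypothetical.

```python
import math


def edge_prob(beta_i, beta_j):
    """Standard beta-model: P(i ~ j) = exp(b_i + b_j) / (1 + exp(b_i + b_j))."""
    return 1.0 / (1.0 + math.exp(-(beta_i + beta_j)))


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    Under the null, -2 * sum(log p) is chi-square with 2k degrees of freedom.
    For even degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no external library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    half = -sum(math.log(p) for p in p_values)  # this is x / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i  # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Two nodes with equal parameters link with the same probability to any third node.
p_same = edge_prob(0.3, -0.1) == edge_prob(0.3, -0.1)
combined = fisher_combine([0.04, 0.20, 0.11])
```

With a single $p$-value the combination returns that value unchanged, which is a quick sanity check on the closed form.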

arXiv:2403.04306 [pdf, other]

Effectiveness Assessment of Recent Large Vision-Language Models

Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

Abstract: The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the model's effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems.
We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.

Submitted 11 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted by Visual Intelligence

arXiv:2403.01968 [pdf, other]

Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection

Authors: Xin Zhang, Tao Xiao, Gepeng Ji, Xuan Wu, Keren Fu, Qijun Zhao

Abstract: Camouflage poses challenges in distinguishing a static target, whereas any movement of the target can break this disguise. Existing video camouflaged object detection (VCOD) approaches take noisy motion estimation as input or model motion implicitly, restricting detection performance in complex dynamic scenes. In this paper, we propose a novel Explicit Motion handling and Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP is characterized by a two-stream architecture for simultaneously conducting camouflaged segmentation and optical flow estimation. Interactions across the dual streams are realized in an interactive prompting way that is inspired by emerging visual prompt learning. Two learnable modules, i.e., the camouflaged feeder and motion collector, are designed to incorporate segmentation-to-motion and motion-to-segmentation prompts, respectively, and enhance the outputs of both streams. The prompt fed to the motion stream is learned by supervising optical flow in a self-supervised manner. Furthermore, we show that long-term historical information can also be incorporated as a prompt into EMIP and achieve more robust results with temporal consistency. Experimental results demonstrate that our EMIP achieves new state-of-the-art records on popular VCOD benchmarks. The code will be publicly available.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 9 pages, 6 figures

arXiv:2402.18467 [pdf, other]

Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation

Authors: Zhiwei Yang, Kexue Fu, Minghong Duan, Linhao Qu, Shuo Wang, Zhijian Song

Abstract: Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However, attributed to the frequent coupling of co-occurring objects and the limited supervision from image-level labels, the challenging co-occurrence problem is widely present and leads to false activation of objects in WSSS. In this work, we devise a 'Separate and Conquer' scheme SeCo to tackle this issue from dimensions of image space and feature space. In the image space, we propose to 'separate' the co-occurring objects with image decomposition by subdividing images into patches. Importantly, we assign each patch a category tag from Class Activation Maps (CAMs), which spatially helps remove the co-context bias and guide the subsequent representation. In the feature space, we propose to 'conquer' the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end, a dual-teacher-single-student architecture is designed and tag-guided contrast is conducted, which guarantee the correctness of knowledge and further facilitate the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments are conducted, validating the efficiency of our method and the superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available at https://github.com/zwyang6/SeCo.git.

Submitted 21 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted by CVPR 2024

arXiv:2402.18044 [pdf, other]

doi 10.1109/TGRS.2024.3367857

SFTformer: A Spatial-Frequency-Temporal Correlation-Decoupling Transformer for Radar Echo Extrapolation

Authors: Liangyu Xu, Wanxuan Lu, Hongfeng Yu, Fanglong Yao, Xian Sun, Kun Fu

Abstract: Extrapolating future weather radar echoes from past observations is a complex task vital for precipitation nowcasting. The spatial morphology and temporal evolution of radar echoes exhibit a certain degree of correlation, yet they also possess independent characteristics. Existing methods learn unified spatial and temporal representations in a highly coupled feature space, emphasizing the correlation between spatial and temporal features but neglecting the explicit modeling of their independent characteristics, which may result in mutual interference between them. To effectively model the spatiotemporal dynamics of radar echoes, we propose a Spatial-Frequency-Temporal correlation-decoupling Transformer (SFTformer). The model leverages multiple stacked SFT-Blocks to not only mine the correlation of the spatiotemporal dynamics of echo cells but also avoid the mutual interference between the temporal modeling and the spatial morphology refinement by decoupling them. Furthermore, inspired by the practice that weather forecast experts effectively review historical echo evolution to make accurate predictions, SFTformer incorporates a joint training paradigm for historical echo sequence reconstruction and future echo sequence prediction. Experimental results on the HKO-7 dataset and ChinaNorth-2021 dataset demonstrate the superior performance of SFTformer in short- (1 h), mid- (2 h), and long-term (3 h) precipitation nowcasting.

Submitted 27 February, 2024; originally announced February 2024.

Comments: 16 pages, 11 figures, TGRS

arXiv:2402.17964 [pdf, other]

Direct measure of DNA bending by quantum magnetic imaging of a nano-mechanical torque-balance

Authors: Zeeshawn Kazi, Isaac M. Shelby, Ruhee Nirodi, Joseph Turnbull, Hideyuki Watanabe, Kohei M. Itoh, Paul A. Wiggins, Kai-Mei C. Fu

Abstract: DNA flexibility is a key determinant of biological function, from nucleosome positioning to transcriptional regulation, motivating a direct measurement of the bend-torque response of individual DNA molecules. In this work, DNA bending is detected using a nano-mechanical torque balance formed by tethering a ferromagnetic nanoparticle probe by an individual DNA molecule to a diamond magnetic field imager. The torque exerted by the DNA in response to bending caused by an applied magnetic torque is measured using wide-field imaging of quantum defects near the surface of the diamond. Qualitative measurements of differences in DNA bio-mechanical binding configuration are demonstrated, and as a proof-of-principle, a quantitative measurement of the bend response is made for individual DNA molecules. This quantum-enabled measurement approach could be applied to characterize the bend response of biophysically relevant short DNA molecules as well as the sequence dependence of DNA bending energy.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.11450 [pdf, other]

Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore , et al. (25 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs to remember their in-context interactions and improve their teachability, i.e., how efficiently they adapt to human inputs (measured by average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are viewed as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions is training a transition dynamics model -- that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.

Submitted 31 May, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

arXiv:2402.10435 [pdf, other]

Dynamic Patch-aware Enrichment Transformer for Occluded Person Re-Identification

Authors: Xin Zhang, Keren Fu, Qijun Zhao

Abstract: Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.

Submitted 15 February, 2024; originally announced February 2024.

Comments: 12 pages, 6 figures

arXiv:2402.09446 [pdf, other]

MeshAC: A 3D Mesh Generation and Adaptation Package for Multiscale Coupling Methods

Authors: Kejie Fu, Mingjie Liao, Yangshuai Wang, Jianjun Chen, Lei Zhang

Abstract: This paper introduces the MeshAC package, which generates three-dimensional adaptive meshes tailored for the efficient and robust implementation of multiscale coupling methods. While Delaunay triangulation is commonly used for mesh generation across the entire computational domain, generating meshes for multiscale coupling methods is more challenging due to intrinsic discrete structures such as defects, and the need to match these structures to the continuum domain at the interface. The MeshAC package tackles these challenges by generating meshes that align with fine-level discrete structures. It also incorporates localized modification and reconstruction operations specifically designed for interfaces. These enhancements improve both the implementation efficiency and the quality of the coupled mesh. Furthermore, MeshAC introduces a novel adaptive feature that utilizes gradient-based a posteriori error estimation, which automatically adjusts the atomistic region and continuum mesh, ensuring an optimal balance between accuracy and efficiency. This package can be directly applied to the geometry optimization problems of a/c coupling in static mechanics, with potential extensions to many other scenarios. Its capabilities are demonstrated for complex material defects, including straight edge dislocation in BCC W and double voids in FCC Cu. These results suggest that MeshAC can be a valuable tool for researchers and practitioners in computational mechanics.

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.14579 [pdf]

Recognizing Multiple Ingredients in Food Images Using a Single-Ingredient Classification Model

Authors: Kun Fu, Ying Dai

Abstract: Recognizing food images presents unique challenges due to the variable spatial layout and shape changes of ingredients with different cooking and cutting methods. This study introduces an advanced approach for recognizing ingredients segmented from food images. The method localizes the candidate regions of the ingredients using the locating and sliding window techniques. Then, these regions are assigned into ingredient classes using a CNN (Convolutional Neural Network)-based single-ingredient classification model trained on a dataset of single-ingredient images. To address the challenge of processing speed in multi-ingredient recognition, a novel model pruning method is proposed that enhances the efficiency of the classification model. Subsequently, the multi-ingredient identification is achieved through a decision-making scheme, incorporating two novel algorithms. The single-ingredient image dataset, designed in accordance with the book entitled "New Food Ingredients List FOODS 2021", encompasses 9982 images across 110 diverse categories, emphasizing variety in ingredient shapes. In addition, a multi-ingredient image dataset is developed to rigorously evaluate the performance of our approach. Experimental results validate the effectiveness of our method, particularly highlighting its improved capability in recognizing multiple ingredients. This marks a significant advancement in the field of food image analysis.

Submitted 18 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 9 pages, 21 figures, 6 tables

arXiv:2401.13127 [pdf, other]

Generalization of Heterogeneous Multi-Robot Policies via Awareness and Communication of Capabilities

Authors: Pierce Howell, Max Rudolph, Reza Torbati, Kevin Fu, Harish Ravichandar

Abstract: Recent advances in multi-agent reinforcement learning (MARL) are enabling impressive coordination in heterogeneous multi-robot teams. However, existing approaches often overlook the challenge of generalizing learned policies to teams of new compositions, sizes, and robots. While such generalization might not be important in teams of virtual agents that can retrain policies on demand, it is pivotal in multi-robot systems that are deployed in the real world and must readily adapt to inevitable changes. As such, multi-robot policies must remain robust to team changes -- an ability we call adaptive teaming. In this work, we investigate whether awareness and communication of robot capabilities can provide such generalization by conducting detailed experiments involving an established multi-robot test bed. We demonstrate that shared decentralized policies that enable robots to be both aware of and communicate their capabilities can achieve adaptive teaming by implicitly capturing the fundamental relationship between collective capabilities and effective coordination. Videos of trained policies can be viewed at: https://sites.google.com/view/cap-comm

Submitted 23 January, 2024; originally announced January 2024.

Comments: Presented at the 7th Conference on Robot Learning (CoRL 2023), Atlanta, USA

arXiv:2401.03331 [pdf, other]

Walnut Detection Through Deep Learning Enhanced by Multispectral Synthetic Images

Authors: Kaiming Fu, Tong Lei, Maryia Halubok, Brian N. Bailey

Abstract: The accurate identification of walnuts within orchards brings forth a plethora of advantages, profoundly amplifying the efficiency and productivity of walnut orchard management. Nevertheless, the unique characteristics of walnut trees, characterized by their closely resembling shapes, colors, and textures between the walnuts and leaves, present a formidable challenge in precisely distinguishing between them during the annotation process. In this study, we present a novel approach to improve walnut detection efficiency, utilizing YOLOv5 trained on an enriched image set that incorporates both real and synthetic RGB and NIR images. Our analysis comparing results from our original and augmented datasets shows clear improvements in detection when using the synthetic images.

Submitted 31 October, 2023; originally announced January 2024.

Comments: This work was presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop

arXiv:2401.01569 [pdf, other]

AttentionLut: Attention Fusion-based Canonical Polyadic LUT for Real-time Image Enhancement

Authors: Kang Fu, Yicong Peng, Zicheng Zhang, Qihang Xu, Xiaohong Liu, Jia Wang, Guangtao Zhai

Abstract: Recently, many algorithms have employed image-adaptive lookup tables (LUTs) to achieve real-time image enhancement. Nonetheless, a prevailing trend among existing methods has been the employment of linear combinations of basic LUTs to formulate image-adaptive LUTs, which limits the generalization ability of these methods. To address this limitation, we propose a novel framework named AttentionLut for real-time image enhancement, which utilizes the attention mechanism to generate image-adaptive LUTs. Our proposed framework consists of three lightweight modules. We begin by employing the global image context feature module to extract image-adaptive features. Subsequently, the attention fusion module integrates the image feature with the priori attention feature obtained during training to generate image-adaptive canonical polyadic tensors. Finally, the canonical polyadic reconstruction module is deployed to reconstruct image-adaptive residual 3DLUT, which is subsequently utilized for enhancing input images. Experiments on the benchmark MIT-Adobe FiveK dataset demonstrate that the proposed method achieves better enhancement performance quantitatively and qualitatively than the state-of-the-art methods.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.00496 [pdf, other]

SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge

Authors: Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazuyuki Hayashi , et al. (25 additional authors not shown)

Abstract: Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Nowadays, learning-based action recognition and segmentation approaches outperform classical methods, relying, however, on large, annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained and make predictions in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold. First, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. Second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage against their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation. The complete SAR-RARP50 dataset is available at: https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091

Submitted 23 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

arXiv:2401.00248 [pdf, other]

Promoting Segment Anything Model towards Highly Accurate Dichotomous Image Segmentation

Authors: Xianjie Liu, Keren Fu, Qijun Zhao

Abstract: The Segment Anything Model (SAM) represents a significant breakthrough into foundation models for computer vision, providing a large-scale image segmentation model. However, despite SAM's zero-shot performance, its segmentation masks lack fine-grained details, particularly in accurately delineating object boundaries. We have high expectations regarding whether SAM, as a foundation model, can be improved towards highly accurate object segmentation, which is known as dichotomous image segmentation (DIS). To address this issue, we propose DIS-SAM, which advances SAM towards DIS with extremely accurate details. DIS-SAM is a framework specifically tailored for highly accurate segmentation, maintaining SAM's promptable design. DIS-SAM employs a two-stage approach, integrating SAM with a modified IS-Net dedicated to DIS. Despite its simplicity, DIS-SAM demonstrates significantly enhanced segmentation accuracy compared to SAM and HQ-SAM.

Submitted 22 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

arXiv:2312.04831 [pdf, other]

Towards Context-Stable and Visual-Consistent Image Inpainting

Authors: Yikai Wang, Chenjie Cao, Ke Fan, Xiangyang Xue, Yanwei Fu

Abstract: Recent progress in inpainting increasingly relies on generative models, leveraging their strong generation capabilities for addressing large irregular masks. However, this enhanced generation often introduces context-instability, leading to arbitrary object generation within masked regions. This paper proposes a balanced solution, emphasizing the importance of unmasked regions in guiding inpainting while preserving generation capacity. Our approach, Aligned Stable Inpainting with UnKnown Areas Prior (ASUKA), employs a Masked Auto-Encoder (MAE) to produce a reconstruction-based prior. Aligned with the powerful Stable Diffusion inpainting model (SD), ASUKA significantly improves context stability. ASUKA further adopts an inpainting-specialized decoder, greatly reducing the color inconsistency issue of SD and thus ensuring more visually consistent inpainting. We validate the effectiveness of inpainting algorithms on the benchmark dataset Places 2 and a collection of several existing datasets, dubbed MISATO, across diverse domains and masking scenarios. Results on these benchmark datasets confirm ASUKA's efficacy in both context-stability and visual-consistency compared to SD and other inpainting algorithms.

Submitted 17 March, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: Project page: https://yikai-wang.github.io/asuka/ where full-size PDF with appendix is available. Dataset: https://github.com/Yikai-Wang/asuka-misato. Yikai Wang and Chenjie Cao contribute equally

arXiv:2312.03758 [pdf, other]

Stock Movement and Volatility Prediction from Tweets, Macroeconomic Factors and Historical Prices

Authors: Shengkun Wang, YangXiao Bai, Taoran Ji, Kaiqun Fu, Linhan Wang, Chang-Tien Lu

Abstract: Predicting the stock market is vital for investors and policymakers, acting as a barometer of economic health. We leverage social media data, a potent source of public sentiment, in tandem with macroeconomic indicators as government-compiled statistics, to refine stock market predictions. However, prior research using tweet data for stock market prediction faces three challenges. First, the quality of tweets varies widely. While many are filled with noise and irrelevant details, only a few genuinely mirror the actual market scenario. Second, solely focusing on the historical data of a particular stock without considering its sector can lead to oversight. Stocks within the same industry often exhibit correlated price behaviors. Lastly, simply forecasting the direction of price movement without assessing its magnitude is of limited value, as the extent of the rise or fall truly determines profitability. In this paper, diverging from the conventional methods, we pioneer an ECON. The framework has the following advantages: First, ECON has an adept tweets filter that efficiently extracts and decodes the vast array of tweet data. Second, ECON discerns multi-level relationships among stocks, sectors, and macroeconomic factors through a self-aware mechanism in semantic space. Third, ECON offers enhanced accuracy in predicting substantial stock price fluctuations by capitalizing on stock price movement. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.16323 [pdf]

Robust Diamond/β-Ga2O3 Hetero-p-n-junction Via Mechanically Integrating Their Building Blocks

Authors: Imteaz Rahaman, Hunter D. Ellis, Kai Fu

Abstract: We report a novel approach for crafting robust diamond/β-Ga2O3 hetero-p-n-junctions through the mechanical integration of their bulk materials. The resulting heterojunction, with a turn-on voltage of ~2.7 V at room temperature, exhibits resilient electrical performance across a temperature spectrum up to 125°C, displaying minimal hysteresis, measuring as low as 0.2 V at room temperature and below 0.7 V at 125°C. Remarkably, the ideality factor achieves a record low value of 1.28, setting a new benchmark for diamond/β-Ga2O3 heterojunctions. The rectification ratio reaches over 10^8 at different temperatures. This effortlessly fabricated and remarkably resilient diamond/Ga2O3 hetero-p-n-junction pioneers a novel pathway for the exploration and fabrication of heterojunctions for ultra-wide bandgap semiconductors with substantial lattice mismatch and different thermal expansion coefficients.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 13 pages, 4 figures, 1 table; this draft has been submitted to 'Applied Physics Letters'

arXiv:2311.15606 [pdf, other]

Selective active resonance tuning for multi-mode nonlinear photonic cavities

Authors: Alan D. Logan, Nicholas S. Yama, Kai-Mei C. Fu

Abstract: Resonant enhancement of nonlinear photonic processes is critical for the scalability of applications such as long-distance entanglement generation. To implement nonlinear resonant enhancement, multiple resonator modes must be individually tuned onto a precise set of process wavelengths, which requires multiple linearly-independent tuning methods. Using coupled auxiliary resonators to indirectly tune modes in a multi-resonant nonlinear cavity is particularly attractive because it allows the extension of a single physical tuning mechanism, such as thermal tuning, to provide the required independent controls. Here we model and simulate the performance and tradeoffs of a coupled-resonator tuning scheme which uses auxiliary resonators to tune specific modes of a multi-resonant nonlinear process. Our analysis determines that the tuning bandwidth for steady-state mode field intensity can significantly exceed the inter-cavity coupling rate if the total quality factor of the auxiliary resonator is higher than that of the multi-mode main resonator. Consequently, over-coupling a nonlinear resonator mode to improve the maximum efficiency of a frequency conversion process will simultaneously expand the auxiliary resonator tuning bandwidth for that mode, indicating a natural compatibility with this tuning scheme. We apply the model to an existing small-diameter triply-resonant ring resonator design and find that a tuning bandwidth of 136 GHz ~ 1.1 nm can be attained for a mode in the telecom band while limiting excess scattering losses to a quality factor of 10^6. Such a range would span the distribution of inhomogeneously broadened quantum emitter ensembles as well as resonator fabrication variations, indicating the potential for the auxiliary resonators to enable not only low-loss telecom conversion but also the generation of indistinguishable photons in a quantum network.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 16 pages, 7 figures

arXiv:2311.06435 [pdf, other]

doi 10.1038/s41467-023-43393-x

Optomechanical ring resonator for efficient microwave-optical frequency conversion

Authors: I-Tung Chen, Bingzhao Li, Seokhyeong Lee, Srivatsa Chakravarthi, Kai-Mei Fu, Mo Li

Abstract: Phonons traveling in solid-state devices are emerging as a universal excitation that can couple to different physical systems through mechanical interaction. At microwave frequencies and in solid-state materials, phonons have a similar wavelength to optical photons, enabling them to interact efficiently with light and produce strong optomechanical effects that are highly desirable for classical and quantum signal transduction between optical and microwave. It becomes conceivable to build optomechanical integrated circuits (OMIC) that guide both photons and phonons and interconnect discrete photonic and phononic devices. Here, we demonstrate an OMIC including an optomechanical ring resonator (OMR), in which infrared photons and GHz phonons co-resonate to induce significantly enhanced interconversion. The OMIC is built on a hybrid platform where wide bandgap semiconductor gallium phosphide (GaP) is used as the waveguiding material and piezoelectric zinc oxide (ZnO) is used for phonon generation. The OMR features photonic and phononic quality factors of $>1\times10^5$ and $3.2\times10^3$, respectively, and resonantly enhances the optomechanical conversion between photonic modes to achieve an internal conversion efficiency $η_i=(2.1\pm0.1)%$ and a total device efficiency $η_{tot}=0.57\times10^{-6}$ at a low acoustic pump power of 1.6 mW. The efficient conversion in OMICs enables microwave-optical transduction for many applications in quantum information processing and microwave photonics.

Submitted 16 November, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

Comments: 11 pages, 4 figures

arXiv:2310.18706 [pdf, other]

doi 10.1145/3625007.3627488

ALERTA-Net: A Temporal Distance-Aware Recurrent Networks for Stock Movement and Volatility Prediction

Authors: Shengkun Wang, YangXiao Bai, Kaiqun Fu, Linhan Wang, Chang-Tien Lu, Taoran Ji

Abstract: For both investors and policymakers, forecasting the stock market is essential as it serves as an indicator of economic well-being. To this end, we harness the power of social media data, a rich source of public sentiment, to enhance the accuracy of stock market predictions. Diverging from conventional methods, we pioneer an approach that integrates sentiment analysis, macroeconomic indicators, search engine data, and historical prices within a multi-attention deep learning model, masterfully decoding the complex patterns inherent in the data. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.15482 [pdf, other]

Salient Object Detection in RGB-D Videos

Authors: Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

Abstract: Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs.
Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.

Submitted 21 May, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: IEEE TIP (under major revision)

arXiv:2310.15138 [pdf, other]

Fusion-Driven Tree Reconstruction and Fruit Localization: Advancing Precision in Agriculture

Authors: Kaiming Fu, Peng Wei, Juan Villacres, Zhaodan Kong, Stavros G. Vougioukas, Brian N. Bailey

Abstract: Fruit distribution is pivotal in shaping the future of both agriculture and agricultural robotics, paving the way for a streamlined supply chain. This study introduces an innovative methodology that harnesses the synergy of RGB imagery, LiDAR, and IMU data, to achieve intricate tree reconstructions and the pinpoint localization of fruits. Such integration not only offers insights into the fruit distribution, which enhances the precision of guidance for agricultural robotics and automation systems, but also sets the stage for simulating synthetic fruit patterns across varied tree architectures. To validate this approach, experiments have been carried out in both a controlled environment and an actual peach orchard. The results underscore the robustness and efficacy of this fusion-driven methodology, highlighting its potential as a transformative tool for future agricultural robotics and precision farming.

Submitted 23 October, 2023; originally announced October 2023.

Comments: This work was presented at IEEE/RSI International Conference on Intelligent Robots and Systems (IROS) Workshop

arXiv:2310.12484 [pdf, ps, other]

Creation of color centers in diamond by recoil implantation through dielectric films

Authors: Yuyang Han, Christian Pederson, Bethany E. Matthews, Nicholas S. Yama, Maxwell F. Parsons, Kai-Mei C. Fu

Abstract: The need for near-surface color centers in diamond for quantum technologies motivates the controlled doping of specific extrinsic impurities into the crystal lattice. Recent experiments have shown that this can be achieved by momentum transfer from a surface precursor via ion implantation, an approach known as "recoil implantation." Here, we extend this technique to incorporate dielectric precursors for creating nitrogen-vacancy (NV) and silicon-vacancy (SiV) centers in diamond. Specifically, we demonstrate that gallium focused-ion-beam exposure to a thin layer of silicon nitride or silicon dioxide on the diamond surface results in the introduction of both extrinsic impurities and carbon vacancies. These defects subsequently give rise to near-surface NV and SiV centers with desirable optical properties after annealing.

Submitted 28 December, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.05806 [pdf, other]

Isolation of Single Donors in ZnO

Authors: Ethan R. Hansen, Vasileios Niaouris, Bethany E. Matthews, Christian Zimmermann, Xingyi Wang, Roman Kolodka, Lasse Vines, Steven R. Spurgeon, Kai-Mei C. Fu

Abstract: The shallow donor in zinc oxide (ZnO) is a promising semiconductor spin qubit with optical access. Single indium donors are isolated in a commercial ZnO substrate using plasma focused ion beam (PFIB) milling. Quantum emitters are identified optically by spatial and frequency filtering. The indium donor assignment is based on the optical bound exciton transition energy and magnetic dependence. The single donor emission is intensity and frequency stable with a transition linewidth less than twice the lifetime limit. The isolation of optically stable single donors post-FIB fabrication is promising for optical device integration required for scalable quantum technologies based on single donors in direct band gap semiconductors.

Submitted 17 January, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: E. R. Hansen and V. Niaouris contributed equally to this work. 15 pages, 13 figures

arXiv:2310.03941 [pdf, other]

LaTeX: Language Pattern-aware Triggering Event Detection for Adverse Experience during Pandemics

Authors: Kaiqun Fu, Yangxiao Bai, Weiwei Zhang, Deepthi Kolady

Abstract: The COVID-19 pandemic has accentuated socioeconomic disparities across various racial and ethnic groups in the United States. While previous studies have utilized traditional survey methods like the Household Pulse Survey (HPS) to elucidate these disparities, this paper explores the role of social media platforms in both highlighting and addressing these challenges. Drawing from real-time data sourced from Twitter, we analyzed language patterns related to four major types of adverse experiences: loss of employment income (LI), food scarcity (FS), housing insecurity (HI), and unmet needs for mental health services (UM). We first formulate a sparsity optimization problem that extracts low-level language features from social media data sources. Second, we propose novel constraints on feature similarity exploiting prior knowledge about the similarity of the language patterns among the adverse experiences. The proposed problem is challenging to solve due to the non-convex objective and non-smooth penalties. We develop an algorithm based on the alternating direction method of multipliers (ADMM) framework to solve the proposed formulation. Extensive experiments and comparisons to other models on real-world social media and the detection of adverse experiences justify the efficacy of our model.

Submitted 5 October, 2023; originally announced October 2023.

Comments: arXiv admin note: text overlap with arXiv:1911.08684

Showing 1–50 of 241 results for author: Fu, K