Search | arXiv e-print repository

Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach

Authors: Yifei Chen, Shenghao Zhu, Zhaojie Fang, Chang Liu, Binfeng Zou, Yuhe Wang, Shuo Chang, Fan Jia, Feiwei Qin, Jin Fan, Yong Peng, Changmiao Wang

Abstract: Alzheimer's Disease (AD) is a complex neurodegenerative disorder marked by memory loss, executive dysfunction, and personality changes. Early diagnosis is challenging due to subtle symptoms and varied presentations, often leading to misdiagnosis with traditional unimodal diagnostic methods due to their limited scope. This study introduces an advanced multimodal classification model that integrates clinical, cognitive, neuroimaging, and EEG data to enhance diagnostic accuracy. The model incorporates a feature tagger with a tabular data coding architecture and utilizes the TimesBlock module to capture intricate temporal patterns in Electroencephalogram (EEG) data. By employing a Cross-modal Attention Aggregation module, the model effectively fuses Magnetic Resonance Imaging (MRI) spatial information with EEG temporal data, significantly improving the distinction between AD, Mild Cognitive Impairment, and Normal Cognition. Simultaneously, we have constructed the first AD classification dataset that includes three modalities: EEG, MRI, and tabular data. Our innovative approach aims to facilitate early diagnosis and intervention, potentially slowing the progression of AD. The source code and our private ADMC dataset are available at https://github.com/JustlfC03/MSTNet.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 5 pages, 2 figures

arXiv:2408.07605 [pdf, other]

Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving

Authors: Yuqing Wen, Yucheng Zhao, Yingfei Liu, Binyuan Huang, Fan Jia, Yanhui Wang, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang

Abstract: The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 datasets. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.

Submitted 14 August, 2024; originally announced August 2024.

Comments: Project page: https://panacea-ad.github.io/. arXiv admin note: text overlap with arXiv:2311.16813

arXiv:2408.05705 [pdf, other]

TC-KANRecon: High-Quality and Accelerated MRI Reconstruction via Adaptive KAN Mechanisms and Intelligent Feature Scaling

Authors: Ruiquan Ge, Xiao Yu, Yifei Chen, Fan Jia, Shenghao Zhu, Guanyu Zhou, Yiyu Huang, Chenyan Zhang, Dong Zeng, Changmiao Wang, Qiegen Liu, Shanzhou Niu

Abstract: Magnetic Resonance Imaging (MRI) has become essential in clinical diagnosis due to its high resolution and multiple contrast mechanisms. However, the relatively long acquisition time limits its broader application. To address this issue, this study presents an innovative conditional guided diffusion model, named TC-KANRecon, which incorporates the Multi-Free U-KAN (MF-UKAN) module and a dynamic clipping strategy. The TC-KANRecon model aims to accelerate the MRI reconstruction process through deep learning methods while maintaining the quality of the reconstructed images. The MF-UKAN module can effectively balance the tradeoff between image denoising and structure preservation. Specifically, it introduces multi-head attention mechanisms and scalar modulation factors, which significantly enhance the model's robustness and structure preservation capabilities in complex noise environments. Moreover, the dynamic clipping strategy in TC-KANRecon adjusts the cropping interval according to the sampling steps, thereby mitigating image detail loss typically caused by traditional cropping methods and enriching the visual features of the images. Furthermore, the MC-Model module incorporates full-sampling k-space information, realizing efficient fusion of conditional information, enhancing the model's ability to process complex data, and improving the realism and detail richness of reconstructed images. Experimental results demonstrate that the proposed method outperforms other MRI reconstruction methods in both qualitative and quantitative evaluations. Notably, the TC-KANRecon method exhibits excellent reconstruction results when processing high-noise, low-sampling-rate MRI data. Our source code is available at https://github.com/lcbkmm/TC-KANRecon.

Submitted 11 August, 2024; originally announced August 2024.

Comments: 10 pages, 3 figures

arXiv:2408.05208 [pdf, other]

Holographic thermal correlators and quasinormal modes from semiclassical Virasoro blocks

Authors: Hewei Frederic Jia, Mukund Rangamani

Abstract: Motivated by its relevance for thermal correlators in strongly coupled holographic CFTs, we refine and further develop a recent exact analytic approach to the black hole perturbation problem, based on the semiclassical Virasoro blocks, or equivalently via the AGT relation, the Nekrasov partition functions in the Nekrasov-Shatashvili limit. Focusing on asymptotically $\text{AdS}_5$ black hole backgrounds, we derive new universal exact expressions for holographic thermal two-point functions, both for scalar operators and conserved currents. Relatedly, we also obtain exact quantization conditions of the associated quasinormal modes (QNMs). Our expressions for the holographic $\text{CFT}_4$ closely resemble the well-known results for 2d thermal CFTs on $\mathbb{R}^{1,1}$. This structural similarity stems from the locality of fusion transformation for Virasoro blocks. We provide numerical checks of our quantization conditions for QNMs. Additionally, we discuss the application of our results to understand specific physical properties of QNMs, including their near-extremal and asymptotic limits. The latter is related to a certain large-momentum regime of semiclassical Virasoro blocks dual to Seiberg-Witten prepotentials.

Submitted 9 August, 2024; originally announced August 2024.

Comments: 69 pages, 3 figures

arXiv:2408.04545 [pdf, other]

Balancing Efficiency with Equality: Auction Design with Group Fairness Concerns

Authors: Fengjuan Jia, Mengxiao Zhang, Jiamou Liu, Bakh Khoussainov

Abstract: The issue of fairness in AI arises from discriminatory practices in applications like job recommendations and risk assessments, emphasising the need for algorithms that do not discriminate based on group characteristics. This concern is also pertinent to auctions, commonly used for resource allocation, which necessitate fairness considerations. Our study examines auctions with groups distinguished by specific attributes, seeking to (1) define a fairness notion that ensures equitable treatment for all, (2) identify mechanisms that adhere to this fairness while preserving incentive compatibility, and (3) explore the balance between fairness and seller's revenue. We introduce two fairness notions, group fairness and individual fairness, and propose two corresponding auction mechanisms: the Group Probability Mechanism, which meets group fairness and incentive criteria, and the Group Score Mechanism, which also encompasses individual fairness. Through experiments, we validate these mechanisms' effectiveness in promoting fairness and examine their implications for seller revenue.

Submitted 9 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

arXiv:2408.00988 [pdf, other]

Measurement of microwave polarization using two polarization orthogonal local microwave electric fields in a Rydberg atom-based mixer

Authors: Weibo Yin, Jianan Zhang, Fengdong Jia, Yuhan Wang, Yuxiang Wang, Jianhai Hao, Yue Cui, Ya Liu, Zhiping Zhong

Abstract: We propose and demonstrate a novel method for measuring the polarization direction of a microwave electric field in a single measurement using a Rydberg atom-based mixer with two orthogonally polarized local microwave electric fields. Furthermore, introducing a weak static magnetic field enables the utilization of the Zeeman effect and exploitation of polarization asymmetry. This distinction allows for determining whether the polarization direction of the microwave field is θ or 180° - θ within the 0 to 180 degree range. This is the first real-time measurement of microwave polarization within 0 to 180 degrees, crucial for microwave sensing and information transmission.

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.19219 [pdf, other]

doi 10.1093/mnras/stae1851

Primeval very low-mass stars and brown dwarfs -- VIII. The first age benchmark L subdwarf, a wide companion to a halo white dwarf

Authors: Z. H. Zhang, R. Raddi, A. J. Burgasser, S. L. Casewell, R. L. Smart, M. C. Galvez-Ortiz, H. R. A. Jones, S. Baig, N. Lodieu, B. Gauza, Ya. V. Pavlenko, Y. F. Jiao, Z. K. Zhao, S. Y. Zhou, D. J. Pinfield

Abstract: We report the discovery of five white dwarf + ultracool dwarf systems identified as common proper motion wide binaries in the Gaia Catalogue of Nearby Stars. The discoveries include a white dwarf + L subdwarf binary, VVV 1256-62AB, a gravitationally bound system located 75.6(+1.9/-1.8) pc away with a projected separation of 1375(+35/-33) au. The primary is a cool DC white dwarf with a hydrogen dominated atmosphere, and has a total age of 10.5(+3.3/-2.1) Gyr, based on white dwarf model fitting. The secondary is an L subdwarf with a metallicity of [M/H] = -0.72(+0.08/-0.10) (i.e. [Fe/H] = -0.81+/-0.10) and Teff = 2298(+45/-43) K based on atmospheric model fitting of its optical to near infrared spectrum, and likely has a mass just above the stellar/substellar boundary. The sub-solar metallicity of the L subdwarf and the system's total space velocity of 406 km/s indicate membership in the Galactic halo, and it has a flat eccentric Galactic orbit passing within 1 kpc of the centre of the Milky Way every ~0.4 Gyr and extending to 15-31 kpc at apogal. VVV 1256-62B is the first L subdwarf to have a well-constrained age, making it an ideal benchmark of metal-poor ultracool dwarf atmospheres and evolution.

Submitted 17 August, 2024; v1 submitted 27 July, 2024; originally announced July 2024.

Comments: 15 pages, 12 figures

arXiv:2407.17337 [pdf, ps, other]

Raman Spectroscopic Study on Bi2Rh3Se2: Two-dimensional-Ising Charge Density Wave and Quantum Fluctuations

Authors: Fei Jiao, Yonghui Zhou, Shuyang Wang, Chao An, Xuliang Chen, Ying Zhou, Min Zhang, Liang Cao, Xigang Luo, Yimin Xiong, Zhaorong Yang

Abstract: The ternary chalcogenide Bi2Rh3Se2 was found to be a charge density wave (CDW) superconductor with a 2×2 periodicity. The key questions regarding the underlying mechanism of the CDW state and its interplay with lattice and electronic properties remain to be explored. Here, based on systematic Raman scattering investigations on single crystalline Bi2Rh3Se2, we observed the fingerprinting feature of the CDW state, a collective amplitude mode at 39 cm$^{-1}$. The temperature evolution of the Raman shift and line width for this amplitude mode can be well described by the critical behavior of the two-dimensional (2D) Ising model, suggesting that the interlayer interactions of Bi2Rh3Se2 are negligible when the CDW state is formed; as a consequence, quantum fluctuations play an important role at low temperature. Moreover, the temperature dependence of the Raman shift for the Ag9 mode deviates significantly from the expected anharmonic behavior when approaching the CDW transition temperature of 240 K, demonstrating that strong electron-phonon coupling plays a key role in the formation of the CDW. Our results reveal that Bi2Rh3Se2 is an intriguing quasi-2D system to explore electronic quantum phase transition and modulate the correlations between CDW and superconductivity.

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.15719 [pdf, other]

GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

Authors: Zhaojie Fang, Shenghao Zhu, Yifei Chen, Binfeng Zou, Fan Jia, Linwei Qiu, Chang Liu, Yiyu Huang, Xiang Feng, Feiwei Qin, Changmiao Wang, Yeru Wang, Jin Fan, Changbiao Chu, Wan-Zhen Wu, Hu Zhao

Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classification requires the inclusion of extensive multimodal data, such as assessment scales and various neuroimaging techniques like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). However, consistently tracking the diagnosis of the same individual over time and simultaneously collecting multimodal data pose significant challenges. To address this issue, we introduce GFE-Mamba, a classifier based on Generative Feature Extraction (GFE). This classifier effectively integrates data from assessment scales, MRI, and PET, enabling deeper multimodal fusion. It efficiently extracts both long and short sequence information and incorporates additional information beyond the pixel space. This approach not only improves classification accuracy but also enhances the interpretability and stability of the model. We constructed datasets of over 3000 samples based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) for a two-step training process. Our experimental results demonstrate that the GFE-Mamba model is effective in predicting the conversion from MCI to AD and outperforms several state-of-the-art methods. Our source code and ADNI dataset processing code are available at https://github.com/Tinysqua/GFE-Mamba.

Submitted 22 July, 2024; originally announced July 2024.

Comments: 35 pages, 4 figures

arXiv:2407.04368 [pdf, other]

Romanization Encoding For Multilingual ASR

Authors: Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

Abstract: We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.15185 [pdf, other]

Percolation transition of k-frequent destinations network for urban mobility

Authors: Weiyu Zhang, Furong Jia, Jianying Wang, Yu Liu, Gezhi Xiu

Abstract: Urban spatial interactions are a complex aggregation of routine visits and random explorations by individuals. The inherent uncertainty of these random visitations poses significant challenges to understanding urban structures and socioeconomic developments. To capture the core dynamics of urban interaction networks, we analyze the percolation structure of the $k$-most frequented destinations of intracity place-to-place flows from mobile phone data of eight major U.S. cities at a Census Block Group (CBG) level. Our study reveals a consistent percolation transition at $k^* = 130$, a critical threshold for the number of frequently visited destinations necessary to maintain a cohesive urban network. This percolation threshold proves remarkably consistent across diverse urban configurations, sizes, and geographical settings over a 48-month study period, and can largely be interpreted as the joint effect of the emergence of hubness and the level of mixing of residents. Furthermore, we examine the socioeconomic profiles of residents from different origin areas categorized by the fulfillment level of $k^*=130$ principal destinations, revealing a pronounced distinction in the origins' socioeconomic advantages. These insights offer a nuanced understanding of how urban spaces are interconnected and the determinants of travel behavior. Our findings contribute to a deeper comprehension of the structural dynamics that govern urban spatial interactions.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.10907 [pdf, other]

SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

Authors: Lin Liu, Ziying Song, Qiming Xia, Feiyang Jia, Caiyan Jia, Lei Yang, Hongyu Pan

Abstract: LiDAR-based sparse 3D object detection plays a crucial role in autonomous driving applications due to its computational efficiency advantages. Existing methods either use the features of a single central voxel as an object proxy, or treat an aggregated cluster of foreground points as an object proxy. However, the former lacks the ability to aggregate contextual information, resulting in insufficient information expression in object proxies. The latter relies on multi-stage pipelines and auxiliary tasks, which reduce the inference speed. To maintain the efficiency of the sparse framework while fully aggregating contextual information, in this work, we propose SparseDet, which designs sparse queries as object proxies. It introduces two key modules, the Local Multi-scale Feature Aggregation (LMFA) module and the Global Feature Aggregation (GFA) module, aiming to fully capture the contextual information, thereby enhancing the ability of the proxies to represent objects. The LMFA sub-module achieves feature fusion across different scales for sparse key voxels via coordinate transformations and nearest neighbor relationships to capture object-level details and local contextual information, while the GFA sub-module uses self-attention mechanisms to selectively aggregate the features of the key voxels across the entire scene for capturing scene-level contextual information. Experiments on nuScenes and KITTI demonstrate the effectiveness of our method. Specifically, on nuScenes, SparseDet surpasses the previous best sparse detector VoxelNeXt by 2.2% mAP with 13.5 FPS, and on KITTI, it surpasses VoxelNeXt by 1.12% $\mathbf{AP_{3D}}$ on hard level tasks with 17.9 FPS.

Submitted 16 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2401.02702

arXiv:2406.09931 [pdf, other]

SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms

Authors: Yifei Chen, Zhu Zhu, Shenghao Zhu, Linwei Qiu, Binfeng Zou, Fan Jia, Yunpeng Zhu, Chenyan Zhang, Zhaojie Fang, Feiwei Qin, Jin Fan, Changmiao Wang, Yu Gao, Gang Yu

Abstract: The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at https://github.com/JustlfC03/SCKansformer.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 15 pages, 6 figures

arXiv:2405.18361 [pdf, other]

Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Authors: Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

Abstract: Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-to-end approaches, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, and thus lack the 3D geometric priors that are a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental caption suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.16873 [pdf, other]

ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

Authors: Ziying Song, Feiyang Jia, Hongyu Pan, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lin Liu, Yang Ji, Lei Yang, Li Wang

Abstract: In the field of 3D object detection tasks, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is a widely adopted paradigm. However, existing methods are often compromised by imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a novel ContrastAlign approach that utilizes contrastive learning to enhance the alignment of heterogeneous modalities, thereby improving the robustness of the fusion process. Specifically, our approach includes the L-Instance module, which directly outputs LiDAR instance features within LiDAR BEV features. Then, we introduce the C-Instance module, which predicts camera instance features through RoI (Region of Interest) pooling on the camera BEV features. We propose the InstanceFusion module, which utilizes contrastive learning to generate similar instance features across heterogeneous modalities. We then use graph matching to calculate the similarity between the neighboring camera instance features and the similarity instance features to complete the alignment of instance features. Our method achieves state-of-the-art performance, with an mAP of 70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set. Importantly, our method outperforms BEVFusion by 7.3% under conditions with misalignment noise.

Submitted 5 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.08816 [pdf, other]

The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

Authors: Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Yaru Niu, Wei Tsang Ooi, Benoit R. Cottereau, Lai Xing Ng, Yuexin Ma, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Weichao Qiu, Wei Zhang, Xu Cao, Hao Lu, Ying-Cong Chen, Caixin Kang, Xinning Zhou, Chengyang Ying, Wentao Shang, Xingxing Wei, Yinpeng Dong, Bo Yang, Shengyin Jiang, et al. (66 additional authors not shown)

Abstract: In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition laid down a gauntlet to innovate and enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research.

Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

Comments: ICRA 2024; 32 pages, 24 figures, 5 tables; Code at https://robodrive-24.github.io/

arXiv:2405.07120 [pdf]

doi 10.1103/PhysRevApplied.21.054019

Quasiparticle and Excitonic Structures of Few-layer and Bulk GaSe: Interlayer Coupling, Self-energy, and Electron-hole Interaction

Authors: Fanhao Jia, Zhao Tang, Greis J. Cruz, Weiwei Gao, Shaowen Xu, Wei Ren, Peihong Zhang

Abstract: Metal monochalcogenide GaSe is a classic layered semiconductor that has received increasing research interest due to its highly tunable electronic and optical properties for ultrathin electronics applications. Despite intense research efforts, a systematic understanding of the layer-dependent electronic and optical properties of GaSe remains to be established, and there appear significant discrepancies between different experiments. We have performed GW plus Bethe-Salpeter equation (BSE) calculations for few-layer and bulk GaSe, aiming at understanding the effects of interlayer coupling and dielectric screening on excited state properties of GaSe, and how the electronic and optical properties evolve from strongly two-dimensional (2D) like to intermediate thick layers, and to three-dimensional (3D) bulk character. Using a new definition of the exciton binding energy, we are able to calculate the binding energies of all excitonic states. Our results reveal an interesting correlation between the binding energy of an exciton and the spread of its wave function in the real and momentum spaces. We find that the existence of (nearly) parallel valence and conduction bands facilitates the formation of excitonic states that spread out in the momentum space. Thus, these excitons tend to be more localized in real space and have large exciton binding energies.
The interlayer coupling substantially suppresses the Mexican-hat-like dispersion of the top valence band seen in the monolayer system, explaining the greatly enhanced photoluminescence (PL) as layer thickness increases. Our results also help resolve apparent discrepancies between different experiments. After including the quasiparticle and excitonic effects as well as the optical activities of excitons, our results compare well with available experimental results.

Submitted 11 May, 2024; originally announced May 2024.

Journal ref: Phys. Rev. Applied 21, 054019 (2024)

arXiv:2404.14604 [pdf, other]

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

Authors: Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang

Abstract: Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline VCAR, which emphasizes the Visual Comprehension training in Addition to mathematical Reasoning learning. It first improves the visual comprehension ability of MLLMs through the visual description generation task, followed by another training step on generating rationales with the assistance of descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods solely relying on rationale supervision, especially on problems with high visual demands.

Submitted 25 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.12728 [pdf, other]

Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

Authors: Chengwei Qin, Wenhan Xia, Tan Wang, Fangkai Jiao, Yuchen Hu, Bosheng Ding, Ruirui Chen, Shafiq Joty

Abstract: Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks. Coincidentally, the NLP community has also recently found that self-generating relevant examples in the context can help large language models (LLMs) better solve a given problem than hand-crafted prompts. However, it is not yet clear whether relevance is the key factor eliciting such capability, i.e., can LLMs benefit more from self-generated relevant examples than irrelevant ones? In this work, we systematically explore whether LLMs can truly perform analogical reasoning on a diverse set of reasoning tasks. With extensive experiments and analysis, we show that self-generated random examples can surprisingly achieve comparable or even better performance, e.g., 4% performance boost on GSM8K with random biological examples. We find that the accuracy of self-generated examples is the key factor and subsequently design two improved methods with significantly reduced inference costs. Overall, we aim to advance a deeper understanding of LLM analogical reasoning and hope this work stimulates further research in the design of self-generated contexts.

Submitted 23 June, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.06654 [pdf, other]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Authors: Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg

Abstract: The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.

Submitted 6 August, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: COLM 2024; Code is available at https://github.com/hsiehjackson/RULER

arXiv:2404.04295 [pdf, other]

Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

Authors: Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg

Abstract: This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.

Submitted 4 April, 2024; originally announced April 2024.

Comments: accepted at the ICASSP 2024 conference

arXiv:2404.00699 [pdf, other]

How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library

Authors: Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

Abstract: With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder to keep track of the data that LLMs have seen; if not impossible with closed-source models like GPT-4 and Claude-3 not divulging any information on the training set. As a result, contamination becomes a major issue: LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes the entire progress in the field of NLP, yet, there remains a lack of methods on how to efficiently detect contamination. In this paper, we survey all recent work on contamination detection with LLMs, and help the community track contamination levels of LLMs by releasing an open-source Python library named LLMSanitize implementing major contamination detection algorithms.

Submitted 20 August, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

Comments: 8 pages, 1 figure, 1 table

arXiv:2403.19438 [pdf, other]

SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control

Authors: Binyuan Huang, Yuqing Wen, Yucheng Zhao, Yaosi Hu, Yingfei Liu, Fan Jia, Weixin Mao, Tiancai Wang, Chi Zhang, Chang Wen Chen, Zhenzhong Chen, Xiangyu Zhang

Abstract: Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field.

Submitted 28 March, 2024; originally announced March 2024.

Comments: Project page: https://subjectdrive.github.io/

arXiv:2403.11848 [pdf, other]

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Authors: Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang

Abstract: Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called Graph BEV. Addressing errors caused by inaccurate point cloud projection, we introduce a Local Align module that employs neighbor-aware depth features via Graph matching. Additionally, we propose a Global Align module to rectify the misalignment between LiDAR and camera BEV features. Our Graph BEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEV Fusion by 1.6% on the nuScenes validation set. Importantly, our Graph BEV outperforms BEV Fusion by 8.3% under conditions with misalignment noise.

Submitted 2 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.16810 [pdf]

OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA)

Authors: Fujian Jia, Xin Liu, Lixi Deng, Jiwen Gu, Chunchao Pu, Tunan Bai, Mengjiang Huang, Yuanzhi Lu, Kang Liu

Abstract: In the past year, there has been a growing trend in applying Large Language Models (LLMs) to the field of medicine, particularly with the advent of advanced language models such as ChatGPT developed by OpenAI. However, there is limited research on LLMs specifically addressing oncology-related queries. The primary aim of this research was to develop a specialized language model that demonstrates improved accuracy in providing advice related to oncology. We performed an extensive data collection of online question-answer interactions centered around oncology, sourced from reputable doctor-patient platforms. Following data cleaning and anonymization, a dataset comprising over 180K oncology-related conversations was established. The conversations were categorized and meticulously reviewed by field specialists and clinicians to ensure precision. Employing the LLaMA model and other selected open-source datasets, we conducted iterative fine-tuning to enhance the model's proficiency in basic medical conversation and specialized oncology knowledge. We observed a substantial enhancement in the model's understanding of genuine patient inquiries and its reliability in offering oncology-related advice through the utilization of real online question-answer interactions in the fine-tuning process. We release the database and models to the research community (https://github.com/OncoGPT1).

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.10176 [pdf, other]

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Authors: Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman

Abstract: Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best gpt-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.

Submitted 15 February, 2024; originally announced February 2024.

Comments: Data and models are available at https://huggingface.co/collections/nvidia/openmath-65c5619de2ba059be0775014

arXiv:2402.04559 [pdf, other]

Can Large Language Model Agents Simulate Human Trust Behaviors?

Authors: Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip Torr, Bernard Ghanem, Guohao Li

Abstract: Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in applications such as social science. However, one fundamental question remains: can LLM agents really simulate human behaviors? In this paper, we focus on one of the most critical behaviors in human interactions, trust, and aim to investigate whether or not LLM agents can simulate human trust behaviors. We first find that LLM agents generally exhibit trust behaviors, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that LLM agents can have high behavioral alignment with humans regarding trust behaviors, particularly for GPT-4, indicating the feasibility to simulate human trust behaviors with LLM agents. In addition, we probe into the biases in agent trust and the differences in agent trust towards agents and humans. We also explore the intrinsic properties of agent trust under conditions including advanced reasoning strategies and external manipulations. We further offer important implications of our discoveries for various scenarios where trust is paramount. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans.

Submitted 10 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: The first two authors contributed equally. Project website: https://www.camel-ai.org/research/agent-trust

arXiv:2402.00658 [pdf, other]

Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

Authors: Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty

Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding the hallucination and flaws in their reasoning process. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotating for process supervision. Nevertheless, the planning-based search process often results in high latency due to the frequent assessment of intermediate reasoning states and the extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and challenging to scale for LLM training. To address these issues, in this paper, we propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass the strong counterparts like GPT-3.5-Turbo.

Submitted 15 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: 17 pages, 9 figures

arXiv:2401.14754 [pdf, other]

VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising

Authors: Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond Chan, Tieyong Zeng

Abstract: Video restoration task aims to recover high-quality videos from low-quality observations. This contains various important sub-tasks, such as video denoising, deblurring and low-light enhancement, since video often faces different types of degradation, such as blur, low light, and noise. Even worse, these kinds of degradation could happen simultaneously when taking videos in extreme environments. This poses significant challenges if one wants to remove these artifacts at the same time. In this paper, to the best of our knowledge, we are the first to propose an efficient end-to-end video transformer approach for the joint task of video deblurring, low-light enhancement, and denoising. This work builds a novel multi-tier transformer where each tier uses a different level of degraded video as a target to learn the features of video effectively. Moreover, we carefully design a new tier-to-tier feature fusion scheme to learn video features incrementally and accelerate the training process with a suitable adaptive weighting scheme. We also provide a new Multiscene-Lowlight-Blur-Noise (MLBN) dataset, which is generated according to the characteristics of the joint task based on the RealBlur dataset and YouTube videos to simulate realistic scenes as far as possible. We have conducted extensive experiments, compared with many previous state-of-the-art methods, to show the effectiveness of our approach clearly.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 12 pages, 8 figures

arXiv:2401.09112 [pdf, other]

Stream Query Denoising for Vectorized HD Map Construction

Authors: Shuo Wang, Fan Jia, Yingfei Liu, Yucheng Zhao, Zehui Chen, Tiancai Wang, Chi Zhang, Xiangyu Zhang, Feng Zhao

Abstract: To enhance perception performance in complex and extensive scenarios within the realm of autonomous driving, there has been a noteworthy focus on temporal modeling, with a particular emphasis on streaming methods. The prevailing trend in streaming models involves the utilization of stream queries for the propagation of temporal information. Despite the prevalence of this approach, the direct application of the streaming paradigm to the construction of vectorized high-definition maps (HD-maps) fails to fully harness the inherent potential of temporal information. This paper introduces the Stream Query Denoising (SQD) strategy as a novel approach for temporal modeling in high-definition map (HD-map) construction. SQD is designed to facilitate the learning of temporal consistency among map elements within the streaming model. The methodology involves denoising the queries that have been perturbed by the addition of noise to the ground-truth information from the preceding frame. This denoising process aims to reconstruct the ground-truth information for the current frame, thereby simulating the prediction process inherent in stream queries. The SQD strategy can be applied to those streaming methods (e.g., StreamMapNet) to enhance the temporal modeling. The proposed SQD-MapNet is the StreamMapNet equipped with SQD. Extensive experiments on nuScenes and Argoverse2 show that our method is remarkably superior to other existing methods across all settings of close range and long range. The code will be available soon.

Submitted 17 January, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.06542 [pdf, other]

Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

Authors: Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Guoxin Zhang, Lei Yang, Li Wang, Caiyan Jia

Abstract: In the realm of modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. The key step to this system is related to 3D object detection that utilizes vehicle-mounted sensors such as LiDAR and cameras to identify the size, the category, and the location of nearby objects. Despite the surge in 3D object detection methods aimed at enhancing detection precision and efficiency, there is a gap in the literature that systematically examines their resilience against environmental variations, noise, and weather changes. This study emphasizes the importance of robustness, alongside accuracy and latency, in evaluating perception systems under practical scenarios. Our work presents an extensive survey of camera-only, LiDAR-only, and multi-modal 3D object detection algorithms, thoroughly evaluating their trade-off between accuracy, latency, and robustness, particularly on datasets like KITTI-C and nuScenes-C to ensure fair comparisons. Among these, multi-modal 3D detection approaches exhibit superior robustness, and a novel taxonomy is introduced to reorganize the literature for enhanced clarity. This survey aims to offer a more practical perspective on the current capabilities and the constraints of 3D object detection algorithms in real-world applications, thus steering future research towards robustness-centric advancements.

Submitted 15 August, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.03907 [pdf, other]

RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Authors: Ziying Song, Guoxing Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang

Abstract: Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM for AD scenarios named SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images for further noise reduction and weather interference. At last, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.

Submitted 23 April, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2401.00496 [pdf, other]

SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge

Authors: Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazuyuki Hayashi , et al. (25 additional authors not shown)

Abstract: Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Nowadays, learning-based action recognition and segmentation approaches outperform classical methods, relying, however, on large, annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained and make predictions in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold. First, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. Second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage against their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation. The complete SAR-RARP50 dataset is available at: https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091

Submitted 23 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.17055 [pdf, other]

Improving In-context Learning via Bidirectional Alignment

Authors: Chengwei Qin, Wenhan Xia, Fangkai Jiao, Chen Chen, Yuchen Hu, Bosheng Ding, Shafiq Joty

Abstract: Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). Despite their success in showing such emergent abilities, the scale and complexity of larger models also lead to unprecedentedly high computational demands and deployment challenges. In reaction, researchers explore transferring the powerful capabilities of larger models to more efficient and compact models by typically aligning the output of smaller (student) models with that of larger (teacher) models. Existing methods either train student models on the generated outputs of teacher models or imitate their token-level probability distributions. However, these distillation methods pay little to no attention to the input, which also plays a crucial role in ICL. Based on the finding that the performance of ICL is highly sensitive to the selection of demonstration examples, we propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models. Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss, in addition to aligning the token-level output distribution. With extensive experiments and analysis, we demonstrate that BiAlign can consistently outperform existing baselines on a variety of tasks involving language understanding, reasoning, and coding.

Submitted 24 June, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2311.16989 [pdf, other]

ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?

Authors: Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty

Abstract: Upon its release in late 2022, ChatGPT brought a seismic shift in the entire landscape of AI, both in research and commerce. Through instruction-tuning a large language model (LLM) with supervised fine-tuning and reinforcement learning from human feedback, it showed that a model could answer human questions and follow instructions on a broad panel of tasks. Following this success, interest in LLMs has intensified, with new LLMs flourishing at frequent intervals across academia and industry, including many start-ups focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's Claude) generally outperform their open-source counterparts, the progress on the latter has been rapid with claims of achieving parity or even better on certain tasks. This has crucial implications not only on research but also on business. In this work, on the first anniversary of ChatGPT, we provide an exhaustive overview of this success, surveying all tasks where an open-source LLM has claimed to be on par or better than ChatGPT.

Submitted 15 January, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: version v4, included latest top-performing open-sourced LLMs

arXiv:2311.16813 [pdf, other]

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

Authors: Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang

Abstract: The field of autonomous driving increasingly demands high-quality annotated training data. In this paper, we propose Panacea, an innovative approach to generate panoramic and controllable videos in driving scenarios, capable of yielding an unlimited number of diverse, annotated samples pivotal for autonomous driving advancements. Panacea addresses two critical challenges: 'Consistency' and 'Controllability.' Consistency ensures temporal and cross-view coherence, while Controllability ensures the alignment of generated content with corresponding annotations. Our approach integrates a novel 4D attention and a two-stage generation pipeline to maintain coherence, supplemented by the ControlNet framework for meticulous control by the Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative evaluations of Panacea on the nuScenes dataset prove its effectiveness in generating high-quality multi-view driving-scene videos. This work notably propels the field of autonomous driving by effectively augmenting the training dataset used for advanced BEV perception techniques.

Submitted 28 November, 2023; originally announced November 2023.

Comments: Project page: https://panacea-ad.github.io/

arXiv:2311.13549 [pdf, other]

ADriver-I: A General World Model for Autonomous Driving

Authors: Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, Tiancai Wang

Abstract: Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated indefinitely, allowing ADriver-I to achieve autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence.

Submitted 22 November, 2023; originally announced November 2023.

Comments: Tech Report

arXiv:2311.11865 [pdf, other]

VLM-Eval: A General Evaluation on Video Large Language Models

Authors: Shuailin Li, Yuang Zhang, Yucheng Zhao, Qiuyue Wang, Fan Jia, Yingfei Liu, Tiancai Wang

Abstract: Despite the rapid development of video Large Language Models (LLMs), a comprehensive evaluation is still absent. In this paper, we introduce a unified evaluation that encompasses multiple video tasks, including captioning, question answering, retrieval, and action recognition. In addition to conventional metrics, we showcase how GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects. We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. Finally, we evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning. We hope our work can serve as a unified evaluation for video LLMs, and help expand more practical scenarios. The evaluation code will be available soon.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.00447 [pdf, other]

On the Opportunities of Green Computing: A Survey

Authors: You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo , et al. (16 additional authors not shown)

Abstract: Artificial Intelligence (AI) has achieved significant advancements in technology and research over several decades of development, and is widely used in many areas including computer vision, natural language processing, time-series analysis, speech synthesis, etc. During the age of deep learning, especially with the rise of Large Language Models, a large majority of researchers' attention is paid to pursuing new state-of-the-art (SOTA) results, resulting in ever-increasing model size and computational complexity. The need for high computing power brings higher carbon emissions and undermines research fairness by preventing small or medium-sized research institutions and companies with limited funding from participating in research. To tackle the challenges of computing resources and environmental impact of AI, Green Computing has become a hot research topic. In this survey, we give a systematic overview of the technologies used in Green Computing. We propose the framework of Green Computing and divide it into four key components: (1) Measures of Greenness, (2) Energy-Efficient AI, (3) Energy-Efficient Computing Systems and (4) AI Use Cases for Sustainability. For each component, we discuss the research progress made and the commonly used techniques to optimize the AI efficiency. We conclude that this new research direction has the potential to address the conflicts between resource constraints and AI development. We encourage more researchers to pay attention to this direction and make AI more environmentally friendly.

Submitted 8 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: 113 pages, 18 figures

arXiv:2310.16591 [pdf]

Intrinsic Piezoelectric Anisotropy of Tetragonal ABO3 Perovskites: A High-Throughput Study

Authors: Fanhao Jia, Shaowen Xu, Shunbo Hu, Jianguo Chen, Yongchen Wang, Yuan Li, Wei Ren, Jinrong Cheng

Abstract: A comprehensive understanding of the intrinsic piezoelectric anisotropy stemming from diverse chemical and physical factors is a key step for the rational design of highly anisotropic materials. We performed high-throughput calculations on tetragonal ABO3 perovskites to investigate the piezoelectricity and the interplay between lattice, displacement, polarization and elasticity. Among the 123 types of perovskites, the structural tetragonality is naturally divided into two categories: normal tetragonal (c/a ratio < 1.1) and super-tetragonal (c/a ratio > 1.17), exhibiting distinct ferroelectric, elastic, and piezoelectric properties. Charge analysis revealed the mechanisms underlying polarization saturation and piezoelectricity suppression in the super-tetragonal region, which also produces an inherent contradiction between high d33 and large piezoelectric anisotropy ratio |d33/d31|. The polarization axis and elastic softness direction jointly determine the maximum longitudinal piezoelectric response d33 direction. The validity and deficiencies of the widely utilized |d33/d31| ratio for representing piezoelectric anisotropy were reevaluated.

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.10942 [pdf, other]

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models

Authors: Yangyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli

Abstract: Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies, though they have explored various aspects of VQA, have somewhat ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. We also propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the dataset (https://github.com/guoyang9/UNK-VQA) available to facilitate further exploration in this area.

Submitted 21 August, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Accepted by TPAMI

arXiv:2310.06753 [pdf, other]

TopoMLP: A Simple yet Strong Pipeline for Driving Topology Reasoning

Authors: Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, Jianbing Shen

Abstract: Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lane) and traffic elements, further reasoning their topology relationship, i.e., lane-lane topology, and lane-traffic topology. In this work, we first show that the topology score relies heavily on detection performance on lane and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on the OpenLane-V2 benchmark, i.e., 41.2% OLS with a ResNet-50 backbone. It is also the 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. We hope such a simple and strong pipeline can provide some new insights to the community. Code is at https://github.com/wudongming97/TopoMLP.

Submitted 1 November, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: The 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. Code is at https://github.com/wudongming97/TopoMLP

arXiv:2310.04948 [pdf, other]

TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting

Authors: Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, Yan Liu

Abstract: The past decade has witnessed significant advances in time series modeling with deep learning. While achieving state-of-the-art results, the best-performing architectures vary highly across applications and domains. Meanwhile, for natural language processing, the Generative Pre-trained Transformer (GPT) has demonstrated impressive performance via training one general-purpose model across various textual datasets. It is intriguing to explore whether GPT-type architectures can be effective for time series, capturing the intrinsic dynamic attributes and leading to significant accuracy improvements. In this paper, we propose a novel framework, TEMPO, that can effectively learn time series representations. We focus on utilizing two essential inductive biases of the time series task for pre-trained models: (i) decomposition of the complex interaction between trend, seasonal and residual components; and (ii) introducing the design of prompts to facilitate distribution adaptation in different types of time series. TEMPO expands the capability for dynamically modeling real-world temporal phenomena from data within diverse domains. Our experiments demonstrate the superior performance of TEMPO over state-of-the-art methods in the zero-shot setting for a number of time series benchmark datasets. This performance gain is observed not only in scenarios involving previously unseen datasets but also in scenarios with multi-modal inputs. This compelling finding highlights TEMPO's potential to constitute a foundational model-building framework.

Submitted 2 April, 2024; v1 submitted 7 October, 2023; originally announced October 2023.

Comments: Accepted by ICLR 2024. Camera Ready Version

arXiv:2309.08978 [pdf, other]

Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Authors: Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

Abstract: The Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, hinders current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while improving performance firmly: Tensor-Web Compiling Co-Design lowers compiling costs by around 100X through eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated for modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT can achieve up to 8.2X speedup within 30 seconds compared to the existing baselines.

Submitted 5 July, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

Comments: Accepted by MobiSys'24

arXiv:2309.06296 [pdf, other]

Holographic Entropy Inequalities and Multipartite Entanglement

Authors: Sergio Hernández-Cuenca, Veronika E. Hubeny, Frederic Jia

Abstract: We study holographic entropy inequalities and their structural properties by making use of a judicious grouping of terms into certain multipartite information quantities. This allows us to recast cumbersome entropic expressions into much simpler ones which share interestingly rigid structures. By performing a systematic search over some of these structures, we are able to discover more than 300 novel entropy inequalities for six parties, thereby demonstrating that these recastings provide a fruitful generating technique for uncovering new holographic entropy inequalities. In attempting to interpret the corresponding sign-definite quantities as correlation measures, we also obtain a no-go result: the superbalance property of holographic entropy inequalities turns out to preclude them from being monotonic under partial tracing. In the process, we also comment on the geometrical significance of multipartite information quantities and present various structural relations amongst them.

Submitted 10 July, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: v1: 40 pages, 384 inequalities. v2: 1877 inequalities (w/ numbering distinct from v1); new tables in App. C; link to living repository added

Report number: MIT-CTP/5610

arXiv:2309.04766 [pdf, other]

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

Authors: Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, Nancy F. Chen

Abstract: We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios.

Submitted 11 July, 2024; v1 submitted 9 September, 2023; originally announced September 2023.

Comments: Published at NAACL 2024. Code: https://seaeval.github.io/

arXiv:2308.16839 [pdf, other]

Twist operator correlators and isomonodromic tau functions from modular Hamiltonians

Authors: Hewei Frederic Jia

Abstract: We introduce a novel approach for computing the twist operator correlators (TOC) in two-dimensional conformal field theories (2d CFT) and the closely related isomonodromic tau functions. The method stems from the formal path integral representation of the ground state reduced density matrix in 2d CFT, and exploits properties of the associated modular Hamiltonians. For a class of genus-zero TOC/tau functions associated with branched covers with non-abelian monodromy group, we present: i) a determinantal representation derived from the correlation matrix method for free fermions, and ii) a formal integral representation derived from the universal single-interval modular Hamiltonians. For the class of genus-zero TOC/tau functions, we also argue an approximate factorization property, utilizing the known ground state correlation structure of large-$c$ holographic CFT and the universality of genus-zero TOCs. We provide explicit examples for verifying the determinantal representation and the approximate factorization property.

Submitted 14 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: 24 pages, 4 figures; comments welcome. v2: improved discussion on extending to more generic monodromy data, fixed minor typos

arXiv:2308.10919 [pdf]

doi 10.1039/d3ce00987d

Effect of Grain Coalescence on Dislocation and Stress Evolution of GaN Films Grown on Nanoscale Patterned Sapphire Substrates

Authors: Zuojian Pan, Zhizhong Chen, Yiyong Chen, Haodong Zhang, Han Yang, Jingxin Nie, Chuhan Deng, Boyan Dong, Daqi Wang, Yuchen Li, Weihua Chen, Fei Jiao, Xiangning Kang, Chuanyu Jia, Zhiwen Liang, Qi Wang, Guoyi Zhang, Bo Shen

Abstract: Two types of nucleation layers (NLs), including in-situ low-temperature grown GaN (LT-GaN) and ex-situ sputtered physical vapor deposition AlN (PVD-AlN), are applied on cone-shaped nanoscale patterned sapphire substrate (NPSS). The initial growth process of GaN on these two NLs is comparatively investigated by a series of growth interruptions. The coalescence process of GaN grains is modulated by adjusting the three-dimensional (3D) temperatures. The results indicate that higher 3D temperatures reduce the edge dislocation density while increasing the residual compressive stress in GaN films. Compared to the LT-GaN NLs, the PVD-AlN NLs effectively resist Ostwald ripening and facilitate the uniform growth of GaN grains on NPSS. Furthermore, GaN films grown on NPSS with PVD-AlN NLs exhibit a reduction of over 50% in both screw and edge dislocation densities compared to those grown on LT-GaN NLs. Additionally, PVD-AlN NLs result in an increase of about 0.5 GPa in the residual compressive stress observed in GaN films.

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2308.09616 [pdf, other]

Far3D: Expanding the Horizon for Surround-view 3D Object Detection

Authors: Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, Xiangyu Zhang

Abstract: Recently 3D object detection from surround-view images has made notable advancements with its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches. Meanwhile, Far3D exhibits superior performance compared to previous methods on the nuScenes dataset. The code is available at https://github.com/megvii-research/Far3D.

Submitted 17 December, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted by AAAI-2024

arXiv:2307.16267 [pdf]

doi 10.1002/adfm.202315781

Efficient InGaN-based Red Light-Emitting Diodes by Modulating Trench Defects

Authors: Z. Pan, Z. Chen, H. Zhang, H. Yang, Y. Chen, J. Nie, C. Deng, B. Dong, D. Wang, Y. Li, H. Lin, W. Chen, F. Jiao, X. Kang, C. Jia, Z. Liang, Q. Wang, G. Zhang, B. Shen

Abstract: Trench defects in multi-quantum wells (MQWs) have been considered as flawed structures that severely degrade the internal quantum efficiency of light-emitting diodes (LEDs) in the past. In this research, trench defects are innovatively modulated into the structure to enhance the efficiency of red InGaN LEDs. Specifically, dual-color MQWs structures are grown with green MQWs at the bottom and red MQWs at the top. When high-density trench defects are introduced into the green MQWs, the upper red MQWs exhibit a significant wavelength redshift of 68 nm and approximately 6-fold luminescence enhancement compared to those without trench defects. The wavelength redshift is attributed to the increased indium incorporation due to the strain relaxation effect of trench defects. Moreover, the luminescence enhancement originates from the strong emission of the red MQWs inside trench defects. The mechanisms behind the superior luminescent properties of red MQWs within trench defects are explored in detail. Red InGaN LEDs with an internal quantum efficiency of 16.4% are achieved by modulating the trench defects. The method of achieving InGaN-based red emission by introducing trench defects is simple and reproducible, requiring no additional substrate designs. This research provides a novel pathway toward achieving high-efficiency red InGaN LEDs.

Submitted 28 December, 2023; v1 submitted 30 July, 2023; originally announced July 2023.

Showing 1–50 of 148 results for author: Jia, F