Search | arXiv e-print repository

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Authors: Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan

Abstract: Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, pre-trained models struggle to preserve 3D view consistency accurately in directly generated video frames. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.

Submitted 29 August, 2024; originally announced August 2024.

Comments: Project page: https://liuff19.github.io/ReconX

arXiv:2408.16760 [pdf, other]

OmniRe: Omni Urban Scene Reconstruction

Authors: Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, Li Song, Yue Wang

Abstract: We introduce OmniRe, a holistic approach for efficiently reconstructing high-fidelity dynamic urban scenes from on-device logs. Recent methods for modeling driving sequences using neural radiance fields or Gaussian Splatting have demonstrated the potential of reconstructing challenging dynamic scenes, but often overlook pedestrians and other non-vehicle dynamic actors, hindering a complete pipeline for dynamic urban scene reconstruction. To that end, we propose a comprehensive 3DGS framework for driving scenes, named OmniRe, that allows for accurate, full-length reconstruction of diverse dynamic objects in a driving log. OmniRe builds dynamic neural scene graphs based on Gaussian representations and constructs multiple local canonical spaces that model various dynamic actors, including vehicles, pedestrians, and cyclists, among many others. This capability is unmatched by existing methods. OmniRe allows us to holistically reconstruct different objects present in the scene, subsequently enabling the simulation of reconstructed scenarios with all actors participating in real-time (~60 Hz). Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We believe our work fills a critical gap in driving reconstruction.

Submitted 29 August, 2024; originally announced August 2024.

Comments: See the project page for code, video results and demos: https://ziyc.github.io/omnire/

arXiv:2408.16751 [pdf, other]

A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models

Authors: Yi-Lin Tuan, William Yang Wang

Abstract: Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM) that optimizes the probabilities of good examples, many studies have explored ways to also penalize bad examples in order to enhance the quality of the output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and further provide a unified recipe for LM optimization, in this paper, we present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
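For orientation, three of the per-example objectives named in this abstract can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly cited forms of MLE, unlikelihood, and DPO; it is not the paper's code. ExMATE is omitted because the abstract does not define it, and the function names, the single-probability form of the unlikelihood term, and the `beta` default are assumptions made for the example.

```python
import math

def mle_loss(logp_good):
    # Negative log-likelihood: rewards the good example only.
    return -logp_good

def unlikelihood_loss(logp_bad):
    # -log(1 - p_bad): penalizes probability mass placed on the bad example.
    # Requires p_bad < 1, i.e. logp_bad < 0.
    return -math.log(1.0 - math.exp(logp_bad))

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    # -log sigmoid(beta * margin), where the margin compares how much the
    # policy prefers the good example over the bad one relative to a
    # frozen reference model. beta=0.1 is an illustrative default.
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    return math.log1p(math.exp(-beta * margin))
```

With a zero margin the DPO term equals log 2, and it decreases as the policy raises the good example's log-probability relative to the bad one, which is the "reward good, penalize bad" behavior the abstract analyzes through gradients.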

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16646 [pdf, other]

Study of the rare decay $J/ψ\to μ^+μ^-μ^+μ^-$

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis, et al. (1096 additional authors not shown)

Abstract: The rare electromagnetic $J/ψ\to μ^+μ^-μ^+μ^-$ decay is observed with a significance greatly exceeding the discovery threshold, using proton-proton collision data collected by the LHCb experiment during 2016-2018 at a center-of-mass energy of 13 TeV, corresponding to an integrated luminosity of $5.4\,\text{fb}^{-1}$. The rate of this decay is measured relative to that of the $J/ψ\to μ^+μ^-$ mode. Using the QED model for the four-muon decay in the efficiency estimation, its branching fraction is determined to be \begin{equation*} {\mathcal{B}}(J/ψ\to μ^+μ^-μ^+μ^-) = (1.13\pm0.10\pm0.05\pm0.01)\times 10^{-6}, \end{equation*} where the uncertainties are statistical, systematic and due to the uncertainty on the branching fraction of the $J/ψ\to μ^+μ^-$ decay.
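As a rough reading aid, the three quoted uncertainties can be folded into a single figure. The sketch below assumes the statistical, systematic, and normalization uncertainties are uncorrelated, so that they add in quadrature; the abstract does not state this, and the variable and function names are illustrative only.

```python
import math

def combine_in_quadrature(*uncertainties):
    # Root-sum-of-squares, valid only for uncorrelated uncertainties.
    return math.sqrt(sum(u * u for u in uncertainties))

# Values from the abstract, in units of 1e-6.
central = 1.13
stat, syst, norm = 0.10, 0.05, 0.01

total = combine_in_quadrature(stat, syst, norm)
relative = total / central
print(f"total uncertainty: {total:.3f}e-6 ({100 * relative:.1f}% of the central value)")
```

Under that assumption the total is dominated by the statistical term, since the other two contribute only about a fifth of the summed variance.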

Submitted 29 August, 2024; originally announced August 2024.

Comments: All figures and tables, along with machine-readable versions and any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/3453 (LHCb public pages)

Report number: LHCb-PAPER-2024-016, CERN-EP-2024-201

arXiv:2408.16615 [pdf, ps, other]

Topological flat bands in hyperbolic lattices

Authors: Dong-Hao Guan, Lu Qi, Yuan Zhou, Ai-Lei He, Yi-Fei Wang

Abstract: Topological flat bands (TFBs) provide a promising platform to investigate intriguing fractionalization phenomena, such as the fractional Chern insulators (FCIs). Most TFB models are established in two-dimensional Euclidean lattices with zero curvature. In this work, we systematically explore TFBs in a class of two-dimensional non-Euclidean lattices with constant negative curvature, i.e., the hyperbolic analogs of the kagome lattice. Based on the Abelian hyperbolic band theory, TFBs have been found in the heptagon-kagome, the octagon-kagome, the nonagon-kagome and the decagon-kagome lattices by introducing staggered magnetic fluxes and next-nearest-neighbor hoppings. The flatness ratios of all hyperbolic TFB models are more than 15, which suggests that hyperbolic FCIs can be realized in these TFB models. We further demonstrate the existence of a $ν=1/2$ FCI state with open boundary conditions when hard-core bosons fill these hyperbolic TFB models.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 7 pages, 5 figures, comments are welcome

arXiv:2408.16564 [pdf, other]

Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Authors: Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

Abstract: Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data.
We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.
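The temporal masking mentioned in this abstract (using only past and current information) can be illustrated with a generic causal attention mask. The sketch below is a plain-Python illustration of that general idea, not the HI-AVSNN implementation; the function names and the use of softmax attention weights are assumptions.

```python
import math

def causal_mask(num_steps):
    # mask[i][j] is True when step i may attend to step j (j <= i),
    # so no time step sees information from the future.
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]

def masked_softmax(scores, mask):
    # Row-wise softmax over the allowed positions; masked positions get weight 0.
    weights = []
    for row, allowed in zip(scores, mask):
        row_max = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - row_max) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three time steps with uniform scores: each step spreads its attention
# evenly over itself and the steps before it.
attention = masked_softmax([[0.0] * 3 for _ in range(3)], causal_mask(3))
```

Because every row is normalized only over positions at or before the current step, the output for step t can be computed as soon as step t arrives, which is what makes low-latency streaming recognition possible.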

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16530 [pdf, other]

A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions

Authors: Yu Wang, Shaohua Wang, Yicheng Li, Mingchun Liu

Abstract: In recent years, 3D object perception has become a crucial component in the development of autonomous driving systems, providing essential environmental awareness. However, as perception tasks in autonomous driving evolve, their variants have increased, leading to diverse insights from industry and academia. Currently, there is a lack of comprehensive surveys that collect and summarize these perception tasks and their developments from a broader perspective. This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques. We provide a comprehensive analysis of the strengths and limitations of each approach, highlighting advancements in accuracy and robustness. Furthermore, we discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks. We also explore cooperative perception methods that extend the perception range through collaborative communication. By providing a holistic view of the current state and future developments in 3D object perception, we aim to offer a more comprehensive understanding of perception tasks for autonomous driving. Additionally, we have established an active repository to provide continuous updates on the latest advancements in this field, accessible at: https://github.com/Fishsoup0/Autonomous-Driving-Perception.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.16500 [pdf, other]

CogVLM2: Visual Language Models for Image and Video Understanding

Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16498 [pdf, other]

A Survey on Evaluating Large Language Models in Code Generation Tasks

Authors: Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

Abstract: This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation. The paper begins by reviewing the historical development of LLMs and their applications in code generation. Next, it details various methods and metrics for assessing the code generation capabilities of LLMs, including code correctness, efficiency, readability, and evaluation methods based on expert review and user experience. The paper also evaluates the widely used benchmark datasets, identifying their limitations and proposing directions for future improvements. Specifically, the paper analyzes the performance of code generation models across different tasks by combining multiple evaluation metrics, such as code compilation/interpretation success rates, unit test pass rates, and performance and efficiency metrics, to comprehensively assess the practical application of LLMs in code generation. Finally, the paper discusses the challenges faced in evaluating LLMs in code generation, particularly how to ensure the comprehensiveness and accuracy of evaluation methods and how to adapt to the evolving practices of software development. These analyses and discussions provide valuable insights for further optimizing and improving the application of LLMs in code generation tasks.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16448 [pdf, other]

doi 10.1109/TPAMI.2024.3444029

Enhancing Sound Source Localization via False Negative Elimination

Authors: Zengjie Song, Jiangshe Zhang, Yuxi Wang, Junsong Fan, Zhaoxiang Zhang

Abstract: Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior work can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing reliable visual anchors and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art.
Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.

Submitted 29 August, 2024; originally announced August 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2203.13412

arXiv:2408.16434 [pdf]

Physical Similarity of Fluid Flow in Bimodal Porous Media: Part 1 -- Basic Model and Solution Characteristics

Authors: Yuhe Wang, Yating Wang

Abstract: Fluid flow through bimodal porous media, characterized by a distinct separation in pore size distribution, is critical in various scientific and engineering applications, including groundwater management, oil and gas production, and carbon sequestration. This note delves into the physical similarity of fluid flow within such media, bridging the gap between microscale phenomena and macroscale observations. We present a representative mathematical model that conceptualizes bimodal porous media as a double-continuum system, distinguishing between macroporous and microporous regions. The model captures the complex interactions between these regions, particularly focusing on the challenges of modeling fluid flow when there is significant disparity in pore sizes. By employing a heuristic approach grounded in pore-scale tomography, we derive governing equations that describe fluid flow and analyze the solution characteristics. The results reveal unique features of the fluid flow in bimodal systems, such as the occurrence of boundary discontinuities and the delayed transient response, which are not observed in conventional porous media. This work provides ground for further studies in bimodal porous media, offering insights that could enhance predictive modeling and optimization in various applications concerning porous media with similar bimodal pore size distributions.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 7 pages, 2 figures

arXiv:2408.16431 [pdf, other]

Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

Authors: Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang

Abstract: Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions. To address these challenges, the MOSE dataset aims to enhance object recognition and differentiation in complex environments, while the LVOS dataset focuses on segmenting objects exhibiting long-term, intricate movements. This report introduces a discriminative spatial-temporal VOS model that utilizes discriminative object features as query representations. The semantic understanding of spatial-semantic modules enables it to recognize object parts, while salient features highlight more distinctive object characteristics. Our model, trained on extensive VOS datasets, achieved first place (80.90% $\mathcal{J}\&\mathcal{F}$) on the test set of the 6th LSVOS challenge in the VOS Track, demonstrating its effectiveness in tackling the aforementioned challenges. The code will be available at https://github.com/yahooo-m/VOS-Solution.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 1st Place Solution for 6th LSVOS VOS Track. arXiv admin note: substantial text overlap with arXiv:2406.04600

arXiv:2408.16343 [pdf, other]

Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach

Authors: Yifei Chen, Shenghao Zhu, Zhaojie Fang, Chang Liu, Binfeng Zou, Yuhe Wang, Shuo Chang, Fan Jia, Feiwei Qin, Jin Fan, Yong Peng, Changmiao Wang

Abstract: Alzheimer's Disease (AD) is a complex neurodegenerative disorder marked by memory loss, executive dysfunction, and personality changes. Early diagnosis is challenging due to subtle symptoms and varied presentations, often leading to misdiagnosis with traditional unimodal diagnostic methods due to their limited scope. This study introduces an advanced multimodal classification model that integrates clinical, cognitive, neuroimaging, and EEG data to enhance diagnostic accuracy. The model incorporates a feature tagger with a tabular data coding architecture and utilizes the TimesBlock module to capture intricate temporal patterns in electroencephalogram (EEG) data. By employing a Cross-modal Attention Aggregation module, the model effectively fuses Magnetic Resonance Imaging (MRI) spatial information with EEG temporal data, significantly improving the distinction between AD, Mild Cognitive Impairment, and Normal Cognition. In addition, we have constructed the first AD classification dataset that includes three modalities: EEG, MRI, and tabular data. Our innovative approach aims to facilitate early diagnosis and intervention, potentially slowing the progression of AD. The source code and our private ADMC dataset are available at https://github.com/JustlfC03/MSTNet.

Submitted 29 August, 2024; originally announced August 2024.

Comments: 5 pages, 2 figures

arXiv:2408.16308 [pdf, other]

AdaMotif: Graph Simplification via Adaptive Motif Design

Authors: Hong Zhou, Peifeng Lai, Zhida Sun, Xiangyuan Chen, Yang Chen, Huisi Wu, Yong Wang

Abstract: With the increase of graph size, it becomes difficult or even impossible to visualize graph structures clearly within the limited screen space. Consequently, it is crucial to design effective visual representations for large graphs. In this paper, we propose AdaMotif, a novel approach that can capture the essential structure patterns of large graphs and effectively reveal the overall structures via adaptive motif designs. Specifically, our approach involves partitioning a given large graph into multiple subgraphs, then clustering similar subgraphs and extracting similar structural information within each cluster. Subsequently, adaptive motifs representing each cluster are generated and utilized to replace the corresponding subgraphs, leading to a simplified visualization. Our approach aims to preserve as much information as possible from the subgraphs while simplifying the graph efficiently. Notably, our approach successfully visualizes crucial community information within a large graph. We conduct case studies and a user study using real-world graphs to validate the effectiveness of our proposed approach. The results demonstrate the capability of our approach in simplifying graphs while retaining important structural and community information.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16279 [pdf, ps, other]

Model-independent determination of the strong-phase difference between $D^0$ and $\bar{D}^0 \to π^+π^-π^+π^-$ decays

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (647 additional authors not shown)

Abstract: Measurements of the strong-phase difference between $D^0$ and $\bar{D}^0\toπ^+π^-π^+π^-$ are performed in bins of phase space. The study exploits a sample of quantum-correlated $D\bar{D}$ mesons collected by the BESIII experiment in $e^+e^-$ collisions at a center-of-mass energy of 3.773 GeV, corresponding to an integrated luminosity of 2.93 fb$^{-1}$. Here, $D$ denotes a neutral charm meson in a superposition of flavor eigenstates. The reported results are valuable for measurements of the $C\!P$-violating phase $γ$ (also denoted $φ_3$) in $B^\pm \to DK^\pm$, $D \to π^+π^-π^+π^-$ decays, and the binning schemes are designed to provide good statistical sensitivity to this parameter. The expected uncertainty on $γ$ arising from the precision of the strong-phase measurements, when applied to very large samples of $B$-meson decays, is around $1.5^\circ$ or $2^\circ$, depending on the binning scheme. The binned strong-phase parameters are combined to give a value of $F_+^{4π} = 0.746 \pm 0.010 \pm 0.004$ for the $C\!P$-even fraction of $D^0 \to π^+π^-π^+π^-$ decays, which is around 30% more precise than the previous best measurement of this quantity.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16266 [pdf, other]

Improving Diffusion-based Data Augmentation with Inversion Spherical Interpolation

Authors: Yanghao Wang, Long Chen

Abstract: Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve various visual recognition tasks. With its powerful image generation ability, diffusion-based DA has shown strong performance gains on different benchmarks. In this paper, we analyze today's diffusion-based DA methods, and argue that they cannot take account of both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting final classification performance. To this end, we propose a novel Diffusion-based Inversion Interpolation DA method: Diff-II. Specifically, Diff-II consists of three main steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting spherical interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on multiple image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.
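The spherical interpolation named in step 2 can be sketched with the standard slerp formula. The code below is a generic plain-Python illustration that treats each inversion as a flat list of floats; it is not the Diff-II implementation, and the function name, the flat-list representation, and the fallback threshold are assumptions.

```python
import math

def slerp(z0, z1, t):
    # Spherical linear interpolation between two vectors for t in [0, 1].
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    # Angle between the vectors; clamp to guard against rounding error.
    omega = math.acos(max(-1.0, min(1.0, dot / (norm0 * norm1))))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Halfway between two orthogonal unit vectors, the result stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a straight linear blend, slerp preserves the norm when both endpoints have the same norm, which is the usual reason for preferring it when interpolating Gaussian-like diffusion latents.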

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16258 [pdf, other]

Advancing Architectural Floorplan Design with Geometry-enhanced Graph Diffusion

Authors: Sizhe Hu, Wenming Wu, Yuntao Wang, Benzhu Xu, Liping Zheng

Abstract: Automating architectural floorplan design is vital for housing and interior design, offering a faster, cost-effective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges in design complexity and constrained generation with extensive post-processing, and tend to produce obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model's perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16244 [pdf, other]

Quantum Advantage via Efficient Post-processing on Qudit Shadow tomography

Authors: Yu Wang

Abstract: Efficiently computing the trace of the product of exponential-scale matrices $A$ and $B$ presents a significant challenge in classical computation, particularly when $A$ is a $d$-dimensional positive Hermitian matrix with trace 1, and $B$ is a Hermitian matrix with a bounded norm. This computation traditionally requires $O(d^2)$ time complexity. We explore leveraging quantum advantage to perform this computation in $O(\text{poly}(\log d))$ time for a broad class of matrices $A$, offering potential applications in high-dimensional data analysis and complex systems. We propose a shadow tomography scheme for arbitrary $d$-dimensional systems that utilizes random projective measurements onto Dense Dual Bases for efficient sampling and post-processing. Unlike random Clifford or mutually unbiased bases (MUB) measurements, our method is experimentally feasible on optical platforms. It requires exponentially fewer computations to determine all coefficients of the randomly projected states, with a constant post-processing time per measurement, as opposed to the exponential worst-case scenario seen with random Clifford (MUB) measurements. For general dimensions $d$, the existence of $d+1$ MUBs is still an open question, and the processing of randomized Clifford measurements is not fully understood. While the applicability of matrix $A$ may be more limited compared to random Clifford measurements, our approach remains efficient in several cases and is particularly efficient on average. For all $A$, the computational complexity is $O(d)$, and in the approximately average case, it is $O(\text{poly}(\log d))$. This scheme holds significant potential beyond quantum information science; it could be instrumental in fields such as artificial intelligence, enabling efficient computation of $\text{tr}(AB)$.
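
The $O(d^2)$ classical cost quoted in the abstract above follows from the identity $\text{tr}(AB) = \sum_{i,j} A_{ij} B_{ji}$, which avoids forming the full matrix product. The following is a minimal sketch of that classical baseline only, not of the paper's shadow tomography scheme; the function name `trace_of_product` and the nested-list matrix representation are assumptions made for illustration.

```python
def trace_of_product(A, B):
    """Return tr(AB) for two d x d matrices given as nested lists.

    Uses tr(AB) = sum_{i,j} A[i][j] * B[j][i], i.e. d^2 multiplications,
    instead of the O(d^3) cost of computing AB and then taking its trace.
    """
    d = len(A)
    return sum(A[i][j] * B[j][i] for i in range(d) for j in range(d))


# Hypothetical 2 x 2 example: A is a trace-1 positive Hermitian matrix
# (a projector onto an eigenstate of Pauli Y), B is the Pauli Y observable.
A = [[0.5, 0.5j], [-0.5j, 0.5]]
B = [[0, -1j], [1j, 0]]
expectation = trace_of_product(A, B)
```

When both matrices are Hermitian, the result is real up to rounding, so the imaginary part of `expectation` can serve as a sanity check.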

Submitted 28 August, 2024; originally announced August 2024.

Comments: The initial version, open to any comments

arXiv:2408.16201 [pdf, other]

Uni-3DAD: GAN-Inversion Aided Universal 3D Anomaly Detection on Model-free Products

Authors: Jiayu Liu, Shancong Mou, Nathan Gaw, Yinan Wang

Abstract: Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (e.g., "Cookie", "Potato"), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we propose a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.16197 [pdf, other]

Economic Optimal Power Management of Second-Life Battery Energy Storage Systems

Authors: Amir Farakhor, Di Wu, Pingen Chen, Junmin Wang, Yebin Wang, Huazhen Fang

Abstract: Second-life battery energy storage systems (SL-BESS) are an economical means of long-duration grid energy storage. They utilize retired battery packs from electric vehicles to store and provide electrical energy at the utility scale. However, they pose critical challenges in achieving optimal utilization and extending their remaining useful life. These complications primarily result from the constituent battery packs' inherent heterogeneities in terms of their size, chemistry, and degradation. This paper proposes an economic optimal power management approach to ensure the cost-minimized operation of SL-BESS while adhering to safety regulations and maintaining a balance between the power supply and demand. The proposed approach takes into account the costs associated with the degradation, energy loss, and decommissioning of the battery packs. In particular, we capture the degradation costs of the retired battery packs through a weighted average Ah-throughput aging model. The presented model allows us to quantify the capacity fading for second-life battery packs for different operating temperatures and C-rates. To evaluate the performance of the proposed approach, we conduct extensive simulations on a SL-BESS consisting of various heterogeneous retired battery packs in the context of grid operation. The results offer novel insights into SL-BESS operation and highlight the importance of prudent power management to ensure economically optimal utilization.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.16192 [pdf]

Molecular-Scale Insights into the Heterogeneous Interactions Between an m-Terphenyl Isocyanide Ligand and Noble Metal Nanoparticles

Authors: Liya Bi, Yufei Wang, Zhe Wang, Alexandria Do, Alexander Fuqua, Krista P. Balto, Yanning Zhang, Joshua S. Figueroa, Tod A. Pascal, Andrea R. Tao, Shaowei Li

Abstract: The structural and chemical properties of metal nanoparticles are often dictated by their interactions with molecular ligand shells. These interactions are highly material-specific and can vary significantly even among elements within the same group or materials with similar crystal structure. Precise characterization of ligand-metal interactions is crucial for the rational design of ligands and the functionalization of nanoparticles. In this study, we found that the ligation behavior with m-terphenyl isocyanide molecule differs significantly between Au and Ag nanoparticles, with distinct ligand extraction efficiencies and size dependencies. Surface-enhanced Raman spectroscopy measurements revealed unique enhancement factors for two molecular vibrational modes, indicating different ligand binding geometries on these metal surfaces. Molecular-level characterization using scanning tunneling microscopy allowed us to directly visualize these variations between Ag and Au surfaces, which we assign as two distinct binding mechanisms. This molecular-scale visualization provides clear insights into the different ligand-metal interactions, as well as the chemical behavior and spectroscopic characterization of isocyanide-functionalized nanoparticles.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.16170 [pdf, other]

CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Authors: Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan

Abstract: Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches or to systematically develop new ones. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: 1) instance-based, 2) zero-shot, and 3) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single-table queries, the accuracy drops as soon as we add joins. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We are open sourcing our scripts to collect statistics and to generate queries and training datasets, in order to foster more extensive research, including from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15902 [pdf, ps, other]

Growth of (Cu,C)Ba$_{2}$Ca$_{2}$Cu$_3$O$_{9+δ}$ thin films on flexible Hastelloy tapes

Authors: Meng-Jun Ou, Yuecong Liu, Yi Wang, Hai-Hu Wen

Abstract: The applications of superconducting cables or magnets require that the superconductors are made into wires or tapes. For cuprate superconductors, this is a big challenge because of the strong flux motion induced by high anisotropy, very short coherence length and strong thermal fluctuation, etc. One approach is to fabricate superconducting films on flexible metallic tapes with oxide buffer layers. The most successful one so far is the REBa$_2$Cu$_3$O$_7$ (REBCO, RE=rare earth elements) films in tape form, the so-called coated conductors. However, the superconducting transition temperature of the REBCO system is limited to about 90 K. Here we report the successful fabrication of another new non-toxic superconducting film, namely (Cu,C)Ba$_{2}$Ca$_{2}$Cu$_3$O$_{9+δ}$ on these flexible metallic tapes with LaMnO$_3$ and CeO$_2$ as the top layers. The onset superconducting transition occurs at 112 K and 110 K, and the zero-resistance transition temperatures are about 96 K and 98 K, respectively. The temperature-dependent resistivity under magnetic fields in different directions reveals a relatively small anisotropy. Further optimization of the films will improve the zero-resistance transition temperature, and thus can also improve the characteristic properties for applications. Our results show that (Cu,C)Ba$_{2}$Ca$_{2}$Cu$_3$O$_{9+δ}$ is a promising candidate material for high-power applications in the liquid nitrogen temperature region.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 11 pages, 5 figures

arXiv:2408.15777 [pdf, other]

A Survey on Facial Expression Recognition of Static and Dynamic Emotions

Authors: Yan Wang, Shaoqi Yan, Yang Liu, Wei Song, Jing Liu, Yang Chang, Xinji Mai, Xiping Hu, Wenqiang Zhang, Zhongxue Gan

Abstract: Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed and new challenges and approaches are encountered, which are not well addressed in existing reviews of FER. This paper offers a comprehensive survey of both image-based static FER (SFER) and video-based dynamic FER (DFER) methods, analyzing from model-oriented development to challenge-focused categorization. We begin with a critical comparison of recent reviews, an introduction to common datasets and evaluation criteria, and an in-depth workflow on FER to establish a robust research foundation. We then systematically review representative approaches addressing eight main challenges in SFER (such as expression disturbance, uncertainties, compound emotions, and cross-domain inconsistency) as well as seven main challenges in DFER (such as key frame sampling, expression intensity variations, and cross-modal alignment). Additionally, we analyze recent advancements, benchmark performances, major applications, and ethical considerations. Finally, we propose five promising future directions and development trends to guide ongoing research. The project page for this paper can be found at https://github.com/wangyanckxx/SurveyFER.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15772 [pdf, other]

220 GHz Urban Microcell Channel Measurement and Characterization on a University Campus

Authors: Yuanbo Li, Yiqin Wang, Yejian Lyu, Ziming Yu, Chong Han

Abstract: With abundant bandwidth resources, the Terahertz (THz) band (0.1-10 THz) is envisioned as a key technology to realize ultra-high-speed communications in 6G and beyond wireless networks. However, propagation analysis and channel characterization are still insufficient to realize reliable THz communications in urban microcell (UMi) environments. In this paper, channel measurement campaigns are conducted in a UMi scenario at 220 GHz, using a correlation-based time domain channel sounder. 24 positions are measured along a road on the university campus, with distances ranging from 34 m to 410 m. Based on the measurement results, the spatial consistency and the interaction of THz waves with the surrounding environment are analyzed. Moreover, the additional loss due to foliage blockage is calculated and an average value of 16.7 dB is observed. Furthermore, a full portrait of channel characteristics, including path loss, shadow fading, K-factor, delay and angular spreads, as well as cluster parameters, is calculated and analyzed. Specifically, an average K-factor value of 17.5 dB is measured in the line-of-sight (LoS) case, which is nearly two times larger than the extrapolated values from the 3GPP standard, revealing weak multipath effects in the THz band. Additionally, 2.5 clusters on average are observed in the LoS case, around one fifth of what is defined in the 3GPP model, which uncovers the strong sparsity in THz UMi. The results and analysis in this work can offer guidance for system design for future THz UMi networks.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 5 pages, 4 figures, 1 table

arXiv:2408.15588 [pdf, other]

Opposition control applied to turbulent wings

Authors: Yuning Wang, Marco Atzori, Ricardo Vinuesa

Abstract: We conducted high-resolution large-eddy simulations (LESs) to explore the effects of opposition control (OC) on turbulent boundary layers (TBLs) over a wing at a chord-based Reynolds number (${Re}_c$) of 200,000. Two scenarios were studied: flow over the suction sides of the NACA0012 wing section at a $0^{\circ}$ angle of attack, and the NACA4412 wing section at a $5^{\circ}$ angle of attack, representing TBLs under mild and strong nonuniform adverse pressure gradients (APGs), respectively. Our results show that the effectiveness of OC in reducing friction drag decreases significantly with increasing APG intensity. This reduction is linked to intensified wall-normal convection caused by the stronger APG. OC, designed to reduce near-wall fluctuations, attenuates the outer peak of streamwise velocity fluctuations and the production term of the turbulent kinetic energy budget. We also confirmed the formation of a "virtual wall," where the balance between viscous diffusion and dissipation at the virtual wall plane mirrors that at the physical wall. Spectral analyses reveal that the wall-normal transport of small-scale structures to the outer region due to the APG negatively impacts OC performance. We also examined uniform blowing and body-force damping as control strategies. Uniform blowing mimics the effects of a stronger APG, while body-force damping shares similarities with OC in the streamwise development of the TBL, despite differences in turbulent statistics. This study is the first detailed analysis of OC applied to TBLs under nonuniform APGs with complex geometries.

Submitted 28 August, 2024; originally announced August 2024.

MSC Class: 76-00

arXiv:2408.15576 [pdf, other]

Quantum Assemblage Tomography

Authors: Luis Villegas-Aguilar, Yuanlong Wang, Alex Pepper, Travis J. Baker, Geoff J. Pryde, Sergei Slussarenko, Nora Tischler, Howard M. Wiseman

Abstract: A central requirement in asymmetric quantum nonlocality protocols, such as quantum steering, is the precise reconstruction of state assemblages -- statistical ensembles of quantum states correlated with remote classical signals. Here we introduce a generalized loss model for assemblage tomography that uses conical optimization techniques combined with maximum likelihood estimation. Using an evidence-based framework based on Akaike's Information Criterion, we demonstrate that our approach excels in the accuracy of reconstructions while accounting for model complexity. In comparison, standard tomographic methods fall short when applied to experimentally relevant data.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 5 + 4 pages, 3 + 3 figures (Main + Supplemental Material)

arXiv:2408.15568 [pdf, other]

Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing

Authors: Ruilong Wu, Yisu Wang, Dirk Kutscher

Abstract: This study explores strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. It delves into the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed insights into the selection and assembly of hardware components such as CPUs, GPUs, and motherboards tailored to specific research needs. It introduces innovative methods to mitigate the performance issues caused by PCIe switch bandwidth limitations in order to enhance GPU task scheduling. Furthermore, a Graph Neural Network (GNN) framework is proposed to analyze and optimize parallelism in computing networks.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15542 [pdf, other]

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Authors: Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu

Abstract: Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. To address the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters as well as proprietary models.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15518 [pdf, other]

Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

Authors: Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang

Abstract: This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods without losing quality of the response. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at https://huggingface.co/NexaAIDev/Dolphin.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15484 [pdf, other]

NAS-BNN: Neural Architecture Search for Binary Neural Networks

Authors: Zhihao Lin, Yongtao Wang, Jinhe Zhang, Xiaojie Chu, Haibin Ling

Abstract: Binary Neural Networks (BNNs) have gained extensive attention for their superior inferencing efficiency and compression ratio compared to traditional full-precision networks. However, due to the unique characteristics of BNNs, designing a powerful binary architecture is challenging and often requires significant manpower. A promising solution is to utilize Neural Architecture Search (NAS) to assist in designing BNNs, but current NAS methods for BNNs are relatively straightforward and leave a performance gap between the searched models and manually designed ones. To address this gap, we propose a novel neural architecture search scheme for binary neural networks, named NAS-BNN. We first carefully design a search space based on the unique characteristics of BNNs. Then, we present three training strategies, which significantly enhance the training of supernet and boost the performance of all subnets. Our discovered binary model family outperforms previous BNNs for a wide range of operations (OPs) from 20M to 200M. For instance, we achieve 68.20% top-1 accuracy on ImageNet with only 57M OPs. In addition, we validate the transferability of these searched BNNs on the object detection task, and our binary detectors with the searched BNNs achieve a novel state-of-the-art result, e.g., 31.6% mAP with 370M OPs, on MS COCO dataset. The source code and models will be released at https://github.com/VDIGPKU/NAS-BNN.

Submitted 27 August, 2024; originally announced August 2024.

Comments: 23 pages

arXiv:2408.15431 [pdf, other]

Integer Topological Defects Reveal Effective Forces in Active Nematics

Authors: Zihui Zhao, Yisong Yao, He Li, Yongfeng Zhao, Yujia Wang, Hepeng Zhang, Hugues Chaté, Masaki Sano

Abstract: Cell layers are often categorized as contractile or extensile active nematics but recent experiments on neural progenitor cells with induced $+1$ topological defects challenge this classification. In a bottom-up approach, we first study a relevant particle-level model and then analyze a continuous theory derived from it. We show that both model and theory account qualitatively for the main experimental result, i.e. accumulation of cells at the core of any type of +1 defect. We argue that cell accumulation is essentially due to two generally ignored 'effective active forces'. We finally discuss the relevance and consequences of our findings in the context of other cellular active nematics experiments and previously proposed theories.

Submitted 27 August, 2024; originally announced August 2024.

Comments: 7 pages, 5 figures, plus Supplemental Information

arXiv:2408.15299 [pdf, other]

TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

Authors: Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang

Abstract: The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have achieved much progress in the domain of natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequence for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B outperforms GPT-4 on the ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, with 62.18% accuracy. Leveraging TourSynbio-7B's enhanced protein sequence understanding capability, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet lab case studies on vanilla key enzyme modification and steroid compound catalysis.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.15270 [pdf, other]

SkillMimic: Learning Reusable Basketball Skills from Demonstrations

Authors: Yinhuai Wang, Qihan Zhao, Runyi Yu, Ailing Zeng, Jing Lin, Zhengyi Luo, Hok Wai Tsui, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang, Lei Zhang, Ping Tan

Abstract: Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-driven approach that mimics both human and ball motions to learn a wide variety of basketball skills. SkillMimic employs a unified configuration to learn diverse skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows training a single policy to learn multiple skills, enabling smooth skill switching even if these switches are not present in the reference dataset. The skills acquired by SkillMimic can be easily reused by a high-level controller to accomplish complex basketball tasks. To evaluate our approach, we introduce two basketball datasets: one estimated through monocular RGB videos and the other using advanced motion capture equipment, collectively containing about 35 minutes of diverse basketball skills. Experiments show that our method can effectively learn various basketball skills included in the dataset with a unified configuration, including various styles of dribbling, layups, and shooting. Furthermore, by training a high-level controller to reuse the acquired skills, we can achieve complex basketball tasks such as layup scoring, which involves dribbling toward the basket, timing the dribble and layup to score, retrieving the rebound, and repeating the process. The project page and video demonstrations are available at https://ingrid789.github.io/SkillMimic/

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.15176 [pdf, other]

Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement

Authors: Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Ye Wang

Abstract: Large language models have shown significant capabilities across various domains, including symbolic music generation. However, leveraging these pre-trained models for controllable music arrangement tasks, each requiring different forms of musical information as control, remains a novel challenge. In this paper, we propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model for multiple multi-track arrangement tasks, including band arrangement, piano reduction, drum arrangement, and voice separation. Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines across all four tasks. Furthermore, through additional experiments on probing analysis, we show the pre-training phase equips the model with essential knowledge to understand musical conditions, which is hard to acquire solely through task-specific fine-tuning.

Submitted 27 August, 2024; originally announced August 2024.

Comments: Submitted to AAAI 2025

arXiv:2408.15115 [pdf, other]

A novel numerical framework for three-dimensional fully resolved simulation of freely falling particles of arbitrary shape

Authors: Taraprasad Bhowmick, Jonas Latt, Yong Wang, Gholamhossein Bagheri

Abstract: This article introduces a novel numerical framework designed to model the interplay between free-falling particles and their surrounding fluid in situations of high particle to fluid density ratio, typically exhibited by atmospheric particles. This method is designed to complement experimental studies in vertical wind tunnels to improve the understanding of the aerodynamic behavior of small atmospheric particles, such as the transport and sedimentation of volcanic particles, cloud ice crystals and other application areas. The solver is based on the lattice Boltzmann method and it addresses the numerical challenges, including the high density ratio and moderate to high Reynolds number, by using an immersed-boundary approach and a recursive-regularized collision model. A predictor-corrector scheme is applied for the robust time integration of the six-degrees-of-freedom (6DOF) rigid-body motion. Finally, the multi-scale nature arising from the long free-fall distances of a particle is addressed through a dynamic memory allocation scheme allowing for a virtually infinite falling distance. This tool allows for the simulation of particles of arbitrary shape represented by a triangularized surface. The framework is validated against the analytical and experimental data for falling spheres and ellipsoids, and is then applied to the case of an actual volcanic particle geometry, the shape of which is obtained from a 3D surface-contour scanning process. The physics of the free-fall of this particle is investigated and described, and its terminal velocity is compared against the experimental data measured with the 3D printed exemplars of the same particle.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.14961 [pdf, other]

CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

Authors: Lingyun Huang, Jianxu Mao, Yaonan Wang, Junfei Yi, Ziming Tao

Abstract: In recent years, the rapid expansion of model sizes has led to large-scale pre-trained models demonstrating remarkable capabilities. Consequently, there has been a trend towards increasing the scale of models. However, this trend introduces significant challenges, including substantial computational costs of training and transfer to downstream tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter's relatively weaker performance and efficiency. Under the circumstances, we refine the widely-used Visual Prompt Tuning (VPT) method, proposing Cross Visual Prompt Tuning (CVPT). CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them and conduct the fine-tuning of models exactly to adapt visual tasks better. Furthermore, we introduce the weight-sharing mechanism to initialize the parameters of cross-attention, which avoids massive learnable parameters from cross-attention and enhances the representative capability of cross-attention. We conduct comprehensive testing across 25 datasets and the result indicates that CVPT significantly improves VPT's performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT by over 4% in average accuracy, rivaling the advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.14866 [pdf, other]

Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

Authors: Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh

Abstract: Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users. Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG). However, GCG struggles with computational inefficiency, limiting further investigations regarding suffix transferability and scalability across models and data. In this work, we bridge the connection between search efficiency and suffix transferability. We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching. Specifically, we employ direct first target token optimization in pre-searching to facilitate the search process. We apply our approach to cross-model, cross-data, and self-transfer scenarios. Furthermore, we introduce an interleaved variant of our approach, i-DeGCG, which iteratively leverages self-transferability to accelerate the search process. Experiments on HarmBench demonstrate the efficiency of our approach across various models and domains. Notably, our i-DeGCG outperforms the baseline on Llama2-chat-7b with ASRs of $43.9$ ($+22.2$) and $39.0$ ($+19.5$) on valid and test sets, respectively. Further analysis on cross-model transfer indicates the pivotal role of first target token optimization in leveraging suffix transferability for efficient searching.

Submitted 27 August, 2024; originally announced August 2024.

Comments: 11 pages, 4 figures

arXiv:2408.14812 [pdf, other]

HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

Authors: Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao

Abstract: Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack explicit structured information necessary to represent the interconnections among key elements like entities or attributes with relation to a particular category. Since existing prompt tuning methods give little consideration to managing structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description to prioritize such structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Finally, by enhancing multi-granularity knowledge generation, redesigning the relationship-driven attention re-weighting module, and incorporating consistent constraints on the hierarchical text encoder, we propose HPT++, which further improves the performance of HPT. Our experiments are conducted across a wide range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Extensive results and ablation studies demonstrate the effectiveness of our methods, which consistently outperform existing SOTA methods.

Submitted 27 August, 2024; originally announced August 2024.

Comments: 19 pages, 7 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2312.06323

arXiv:2408.14736 [pdf, other]

doi 10.1145/3673038.3673142

Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning

Authors: Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu

Abstract: Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (Independently and Identically Distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a close pace, thus exploiting the otherwise wasted time to transmit more data. Second, we identify the non-overlapped pattern of retained parameters after compression, which results in diminished client update signals due to uniformly averaged weights. Based on this finding, we propose a parameter mask to adjust the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates, and improving the training convergence under heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over the uncompressed FedAvg. Moreover, it achieves a $3.37\times$ speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence with compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.

Submitted 26 August, 2024; originally announced August 2024.
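
The overlap-weighting idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (top_k_mask, uniform_average, overlap_weighted_average) and the toy client updates are hypothetical, and the sketch assumes that each client keeps its k largest-magnitude entries and that the server divides each coordinate by the number of clients that actually retained it, rather than by the total number of clients.

```python
def top_k_mask(update, k):
    """Return a 0/1 mask that keeps the k largest-magnitude entries."""
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(update))]


def uniform_average(sparse_updates):
    """Plain FedAvg-style mean: every coordinate is divided by the client count."""
    n = len(sparse_updates)
    return [sum(column) / n for column in zip(*sparse_updates)]


def overlap_weighted_average(sparse_updates, masks):
    """Divide each coordinate by how many clients retained it (hypothetical rule)."""
    result = []
    for values, flags in zip(zip(*sparse_updates), zip(*masks)):
        count = sum(flags)
        result.append(sum(values) / count if count else 0.0)
    return result


# Two toy client updates, each compressed to its top-2 entries.
updates = [[4.0, 0.5, 0.0, -3.0], [0.2, 2.0, 0.1, -1.0]]
masks = [top_k_mask(u, 2) for u in updates]
sparse = [[v * m for v, m in zip(u, mk)] for u, mk in zip(updates, masks)]

print(uniform_average(sparse))                  # coordinates kept by one client are halved
print(overlap_weighted_average(sparse, masks))  # those coordinates keep their magnitude
```

In this toy case the first two coordinates are each retained by only one client, so the uniform mean shrinks them while the mask-based mean does not; the last coordinate, retained by both clients, is the same under either rule.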

arXiv:2408.14472 [pdf, other]

Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning

Authors: Xinyang Gu, Yen-Jen Wang, Xiang Zhu, Chengming Shi, Yanjiang Guo, Yichen Liu, Jianyu Chen

Abstract: Humanoid robots, with their human-like skeletal structure, are especially suited for tasks in human-centric environments. However, this structure is accompanied by additional challenges in locomotion controller design, especially in complex real-world environments. As a result, existing humanoid robots are limited to relatively simple terrains, either with model-based control or model-free reinforcement learning. In this work, we introduce Denoising World Model Learning (DWL), an end-to-end reinforcement learning framework for humanoid locomotion control, which demonstrates the world's first humanoid robot to master real-world challenging terrains such as snowy and inclined land in the wild, up and down stairs, and extremely uneven terrains. All scenarios run the same learned neural network with zero-shot sim-to-real transfer, indicating the superior robustness and generalization capability of the proposed method.

Submitted 26 August, 2024; originally announced August 2024.

Comments: Robotics: Science and Systems (RSS), 2024. (Best Paper Award Finalist)

arXiv:2408.14354 [pdf, other]

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

Abstract: GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large language models (LLMs), but has so far only focused on Python version. However, supporting more programming languages is also important, as there is a strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method SWE-agent and test several powerful LLMs on it. As is well known, developing a high-quality multi-lingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.

Submitted 26 August, 2024; originally announced August 2024.

Comments: This work is in progress

arXiv:2408.14254 [pdf, other]

Integrated Brain Connectivity Analysis with fMRI, DTI, and sMRI Powered by Interpretable Graph Neural Networks

Authors: Gang Qu, Ziyu Zhou, Vince D. Calhoun, Aiying Zhang, Yu-Ping Wang

Abstract: Multimodal neuroimaging modeling has become a widely used approach but confronts considerable challenges due to heterogeneity, which encompasses variability in data types, scales, and formats across modalities. This variability necessitates the deployment of advanced computational methods to integrate and interpret these diverse datasets within a cohesive analytical framework. In our research, we amalgamate functional magnetic resonance imaging, diffusion tensor imaging, and structural MRI into a cohesive framework. This integration capitalizes on the unique strengths of each modality and their inherent interconnections, aiming for a comprehensive understanding of the brain's connectivity and anatomical characteristics. Utilizing the Glasser atlas for parcellation, we integrate imaging derived features from various modalities: functional connectivity from fMRI, structural connectivity from DTI, and anatomical features from sMRI within consistent regions. Our approach incorporates a masking strategy to differentially weight neural connections, thereby facilitating a holistic amalgamation of multimodal imaging data. This technique enhances interpretability at connectivity level, transcending traditional analyses centered on singular regional attributes. The model is applied to the Human Connectome Project's Development study to elucidate the associations between multimodal imaging and cognitive functions throughout youth. The analysis demonstrates improved predictive accuracy and uncovers crucial anatomical features and essential neural connections, deepening our understanding of brain structure and function.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.14158 [pdf, other]

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

Authors: Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, et al. (27 additional authors not shown)

Abstract: The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.

Submitted 26 August, 2024; originally announced August 2024.

Comments: This is the preprint version of the paper accepted for presentation at the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24). © 2024 IEEE. Personal use of this material is permitted. For other uses, permission from IEEE must be obtained. Please refer to IEEE Xplore for the final published version

arXiv:2408.14071 [pdf, other]

Benchmarking the design of the cryogenics system for the underground argon in DarkSide-20k

Authors: DarkSide-20k Collaboration, F. Acerbi, P. Adhikari, P. Agnes, I. Ahmad, S. Albergo, I. F. M. Albuquerque, T. Alexander, A. K. Alton, P. Amaudruz, M. Angiolilli, E. Aprile, R. Ardito, M. Atzori Corona, D. J. Auty, M. Ave, I. C. Avetisov, O. Azzolini, H. O. Back, Z. Balmforth, A. Barrado Olmedo, P. Barrillon, G. Batignani, P. Bhowmick, et al. (294 additional authors not shown)

Abstract: DarkSide-20k (DS-20k) is a dark matter detection experiment under construction at the Laboratori Nazionali del Gran Sasso (LNGS) in Italy. It utilises ~100 t of low radioactivity argon from an underground source (UAr) in its inner detector, with half serving as target in a dual-phase time projection chamber (TPC). The UAr cryogenics system must maintain stable thermodynamic conditions throughout the experiment's lifetime of >10 years. Continuous removal of impurities and radon from the UAr is essential for maximising signal yield and mitigating background. We are developing an efficient and powerful cryogenics system with a gas purification loop with a target circulation rate of 1000 slpm. Central to its design is a condenser operated with liquid nitrogen which is paired with a gas heat exchanger cascade, delivering a combined cooling power of >8 kW. Here we present the design choices in view of the DS-20k requirements, in particular the condenser's working principle and the cooling control, and we show test results obtained with a dedicated benchmarking platform at CERN and LNGS. We find that the thermal efficiency of the recirculation loop, defined in terms of nitrogen consumption per argon flow rate, is 95 % and the pressure in the test cryostat can be maintained within $\pm$(0.1-0.2) mbar. We further detail a 5-day cool-down procedure of the test cryostat, maintaining a cooling rate typically within -2 K/h, as required for the DS-20k inner detector. Additionally, we assess the circuit's flow resistance, and the heat transfer capabilities of two heat exchanger geometries for argon phase change, used to provide gas for recirculation. We conclude by discussing how our findings influence the finalisation of the system design, including necessary modifications to meet requirements and ongoing testing activities.

Submitted 26 August, 2024; originally announced August 2024.

Comments: 45 pages, 24 figures

arXiv:2408.14070 [pdf]

RiD-kit: Software package designed to do enhanced sampling using reinforced dynamics

Authors: Jiahao Fan, Yanze Wang, Dongdong Wang, Linfeng Zhang

Abstract: Developing an efficient method to accelerate molecular dynamics is a central theme in the field of molecular simulation. One category of such methods comprises collective-variable-based methods, which rely on predefined collective variables (CVs). The difficulty of selecting a few important CVs hinders the application of these methods to large systems. Here we present a CV-based enhanced sampling method, RiD-kit, which can handle a large number of CVs and perform efficient sampling. The method can be applied to various kinds of systems, including biomolecules, chemical reactions and materials. In this protocol, we guide the users through all phases of the RiD-kit workflow, from preparing the input files and setting the simulation parameters to analyzing the results. The RiD-kit workflow provides an efficient and user-friendly command line tool which can submit jobs to various kinds of platforms, including high-performance computers (HPC), cloud servers and local machines.

Submitted 26 August, 2024; originally announced August 2024.

Comments: 43 pages, 4 figures

arXiv:2408.14047 [pdf]

Alleviating Class Imbalance in Semi-supervised Multi-organ Segmentation via Balanced Subclass Regularization

Authors: Zhenghao Feng, Lu Wen, Binyu Yan, Jiaqi Cui, Yan Wang

Abstract: Semi-supervised learning (SSL) has shown notable potential in relieving the heavy demand of dense prediction tasks on large-scale well-annotated datasets, especially for the challenging multi-organ segmentation (MoS). However, the prevailing class-imbalance problem in MoS, caused by the substantial variations in organ size, exacerbates the learning difficulty of the SSL network. To alleviate this issue, we present a two-phase semi-supervised network (BSR-Net) with balanced subclass regularization for MoS. Concretely, in Phase I, we introduce a class-balanced subclass generation strategy based on balanced clustering to effectively generate multiple balanced subclasses from original biased ones according to their pixel proportions. Then, in Phase II, we design an auxiliary subclass segmentation (SCS) task within the multi-task framework of the main MoS task. The SCS task contributes a balanced subclass regularization to the main MoS task and transfers unbiased knowledge to the MoS network, thus alleviating the influence of the class-imbalance problem. Extensive experiments conducted on two publicly available datasets, i.e., the MICCAI FLARE 2022 dataset and the WORD dataset, verify the superior performance of our method compared with other methods.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.14022 [pdf, other]

An Efficient and Exact Algorithm for Locally h-Clique Densest Subgraph Discovery

Authors: Xiaojia Xu, Haoyu Liu, Xiaowei Lv, Yongcai Wang, Deying Li

Abstract: Detecting locally, non-overlapping, near-clique densest subgraphs is a crucial problem for community search in social networks. As a vertex may be involved in multiple overlapped local cliques, detecting locally densest sub-structures considering h-clique density, i.e., locally h-clique densest subgraph (LhCDS), attracts great interest. This paper investigates the LhCDS detection problem and proposes an efficient and exact algorithm to list the top-k non-overlapping, locally h-clique dense, and compact subgraphs. We in particular jointly consider h-clique compact number and LhCDS and design a new "Iterative Propose-Prune-and-Verify" pipeline (IPPV) for top-k LhCDS detection. (1) In the proposal part, we derive initial bounds for h-clique compact numbers, prove their validity, and extend a convex programming method to tighten the bounds for proposing LhCDS candidates without missing any. (2) Then a tentative graph decomposition method is proposed to solve the challenging case where a clique spans multiple subgraphs in graph decomposition. (3) To deal with the verification difficulty, both a basic and a fast verification method are proposed, where the fast method constructs a smaller-scale flow network to improve efficiency while preserving the verification correctness. The verified LhCDSes are returned, while the candidates that remain unsure reenter the IPPV pipeline. (4) We further extend the proposed methods to locally more general pattern densest subgraph detection problems. We prove the exactness and low complexity of the proposed algorithm.
Extensive experiments on real datasets show the effectiveness and high efficiency of IPPV.

Submitted 26 August, 2024; originally announced August 2024.

Comments: This paper has been accepted by SIGMOD 2025

arXiv:2408.13981 [pdf]

ARANet: Attention-based Residual Adversarial Network with Deep Supervision for Radiotherapy Dose Prediction of Cervical Cancer

Authors: Lu Wen, Wenxia Yin, Zhenghao Feng, Xi Wu, Deng Xiong, Yan Wang

Abstract: Radiation therapy is the mainstay treatment for cervical cancer, and its ultimate goal is to ensure the planning target volume (PTV) reaches the prescribed dose while reducing dose deposition of organs-at-risk (OARs) as much as possible. To achieve these clinical requirements, the medical physicist needs to manually tweak the radiotherapy plan repeatedly in a trial-and-error manner until finding the optimal one in the clinic. However, such trial-and-error processes are quite time-consuming, and the quality of plans highly depends on the experience of the medical physicist. In this paper, we propose an end-to-end Attention-based Residual Adversarial Network with deep supervision, namely ARANet, to automatically predict the 3D dose distribution of cervical cancer. Specifically, given the computed tomography (CT) images and their corresponding segmentation masks of PTV and OARs, ARANet employs a prediction network to generate the dose maps. We also utilize a multi-scale residual attention module and deep supervision mechanism to enforce the prediction network to extract more valuable dose features while suppressing irrelevant information. Our proposed method is validated on an in-house dataset including 54 cervical cancer patients, and experimental results have demonstrated its obvious superiority compared to other state-of-the-art methods.

Submitted 25 August, 2024; originally announced August 2024.

Comments: Accepted by 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM)

arXiv:2408.13980 [pdf, other]

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Authors: Daixun Li, Weiying Xie, Mingxiang Cao, Yunke Wang, Jiaqing Zhang, Yunsong Li, Leyuan Fang, Chang Xu

Abstract: Multimodal image fusion and segmentation enhance scene understanding in autonomous driving by integrating data from various sensors. However, current models struggle to efficiently segment densely packed elements in such scenes, due to the absence of comprehensive fusion features that can guide mid-process fine-tuning and focus attention on relevant areas. The Segment Anything Model (SAM) has emerged as a transformative segmentation method. It provides more effective prompts through its flexible prompt encoder, compared to transformers lacking fine-tuned control. Nevertheless, SAM has not been extensively studied in the domain of multimodal fusion for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules to enhance SAM's multimodal fusion and segmentation capabilities. Specifically, we first obtain latent space features of the two modalities through vector quantization and embed them into a cross-attention-based inter-domain fusion module to establish long-range dependencies between modalities. Then, we use these comprehensive fusion features as prompts to guide precise pixel-level segmentation. Extensive experiments on several public datasets demonstrate that the proposed method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios, achieving at least 3.9$\%$ higher segmentation mIoU than the state-of-the-art approaches.

Submitted 25 August, 2024; originally announced August 2024.

Showing 1–50 of 21,041 results for author: Wang, Y