MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning

Authors: Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, Lichao Sun

Abstract: Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, these tools, in the wrong hands, can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services like Replicate, in a black-box manner, demonstrating the effectiveness of MetaCloak in real-world scenarios. Our code is available at https://github.com/liuyixin-louis/MetaCloak.

Submitted 26 April, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Accepted to CVPR 2024 (Oral)

arXiv:2311.11863 [pdf, other]

GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

Authors: Hao Li, Dingwen Zhang, Yalun Dai, Nian Liu, Lechao Cheng, Jingfeng Li, Jingdong Wang, Junwei Han

Abstract: Applying NeRF to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task, i.e., the "label rendering" task, to build semantic NeRFs. However, by rendering semantic/instance labels per pixel without considering the contextual information of the rendered image, these methods usually suffer from unclear boundary segmentation and abnormal segmentation of pixels within an object. To solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework, for facilitating context-aware 3D scene perception. To accomplish this goal, we introduce transformers to aggregate radiance as well as semantic embedding fields jointly for novel views and facilitate the joint volumetric rendering of both fields. In addition, we propose two self-distillation mechanisms, i.e., the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field and the maintenance of geometric consistency. In evaluation, we conduct experimental comparisons under two perception tasks (i.e., semantic and instance segmentation) using both synthetic and real-world datasets. Notably, our method outperforms SOTA approaches by 6.94%, 11.76%, and 8.47% on generalized semantic segmentation, finetuning semantic segmentation, and instance segmentation, respectively.

Submitted 7 April, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: CVPR 2024 (Highlight). Project Page: https://lifuguan.github.io/gpnerf-pages/

arXiv:2311.09861 [pdf, other]

ConceptPsy:A Benchmark Suite with Conceptual Comprehensiveness in Psychology

Authors: Junlei Zhang, Hongliang He, Nirui Song, Zhanchao Zhou, Shuyuan He, Shuai Zhang, Huachuan Qiu, Anqi Li, Yong Dai, Lizhi Ma, Zhenzhong Lan

Abstract: The critical field of psychology necessitates a comprehensive benchmark to enhance the evaluation and development of domain-specific Large Language Models (LLMs). Existing MMLU-type benchmarks, such as C-EVAL and CMMLU, include psychology-related subjects, but their limited number of questions and lack of systematic concept sampling strategies mean they cannot cover the concepts required in psychology. Consequently, despite their broad subject coverage, these benchmarks lack the necessary depth in the psychology domain, making them inadequate as a psychology-specific evaluation suite. To address this issue, this paper presents ConceptPsy, designed to evaluate Chinese complex reasoning and knowledge abilities in psychology. ConceptPsy includes 12 core subjects and 1383 manually collected concepts. Specifically, we prompt GPT-4 to generate questions for each concept using carefully designed diverse prompts and hire professional psychologists to review these questions. To help understand fine-grained performance and address weaknesses, we annotate each question with a chapter label and provide chapter-wise accuracy. Based on ConceptPsy, we evaluate a broad range of LLMs. We observe that, although some LLMs achieve similar overall accuracy, they exhibit significant performance variations across different psychology concepts, even when they are models from the same series. We hope our work can facilitate the development of LLMs in the field of psychology.

Submitted 16 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Under Review

arXiv:2311.08045 [pdf, other]

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Authors: Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

Abstract: Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However, continuously updating LLMs for alignment raises a distribution gap between model-generated samples and human-annotated responses, hindering training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. With comprehensive experiments, we find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.

Submitted 3 June, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Accepted by ACL2024 findings

arXiv:2311.07869 [pdf]

Hybrid GRU-CNN Bilinear Parameters Initialization for Quantum Approximate Optimization Algorithm

Authors: Zuyu Xu, Pengnian Cai, Kang Sheng, Tao Yang, Yuanming Hu, Yunlai Zhu, Zuheng Wu, Yuehua Dai, Fei Yang

Abstract: The Quantum Approximate Optimization Algorithm (QAOA), a pivotal paradigm in the realm of variational quantum algorithms (VQAs), offers promising computational advantages for tackling combinatorial optimization problems. Well-defined initial circuit parameters, responsible for preparing a parameterized quantum state encoding the solution, play a key role in optimizing QAOA. However, classical optimization techniques encounter challenges in discerning optimal parameters that align with the optimal solution. In this work, we propose a hybrid optimization approach that integrates Gated Recurrent Units (GRU), Convolutional Neural Networks (CNN), and a bilinear strategy as an innovative alternative to conventional optimizers for predicting optimal parameters of QAOA circuits. GRU serves to stochastically initialize favorable parameters for depth-1 circuits, while CNN predicts initial parameters for depth-2 circuits based on the optimized parameters of depth-1 circuits. To assess the efficacy of our approach, we conducted a comparative analysis with traditional initialization methods using QAOA on Erdős-Rényi graph instances, revealing superior optimal approximation ratios. We employ the bilinear strategy to initialize QAOA circuit parameters at greater depths, with reference parameters obtained from GRU-CNN optimization. This approach allows us to forecast parameters for a depth-12 QAOA circuit, yielding a remarkable approximation ratio of 0.998 across 10 qubits, which surpasses that of the random initialization strategy and the PPN2 method at a depth of 10. The proposed hybrid GRU-CNN bilinear optimization method significantly improves the effectiveness and accuracy of parameter initialization, offering a promising iterative framework for QAOA that elevates its performance.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.05374 [pdf, other]

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Authors: Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu

Abstract: Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, TencentLLMEval dataset, and evaluation methodology which have been demonstrated as effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2311.03706 [pdf, other]

Parallelized Conflict Graph Cut Generation

Authors: Yongzheng Dai, Chen Chen

Abstract: A conflict graph represents logical relations between binary variables, and effective use of the graph can significantly accelerate branch-and-cut solvers for mixed-integer programming (MIP). In this paper we develop efficient parallel conflict graph management: conflict detection; maximal clique generation; clique extension; and clique merging. We leverage parallel computing in order to intensify computational effort on the conflict graph, thereby generating a much larger pool of cutting planes than what can be practically achieved in serial. Computational experiments demonstrate that the expanded pool of cuts enabled by parallel computing leads to substantial reductions in total MIP solve time, especially for more challenging cases.

Submitted 27 May, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: 19 pages, 2 figures

MSC Class: 90C10
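
As a hedged illustration of the "conflict detection" step named in the abstract above, the sketch below applies a standard textbook rule to a single knapsack-type constraint: with non-negative coefficients, two binary variables conflict when their coefficients alone exceed the right-hand side. This is not the authors' parallel implementation; the function name `knapsack_conflicts`, the dictionary input layout, and the example constraint are all assumptions made for illustration.

```python
from itertools import combinations


def knapsack_conflicts(coeffs, rhs):
    """Return pairs of binary variables that cannot both equal 1.

    Assumes one constraint sum(coeffs[v] * x[v]) <= rhs with
    non-negative coefficients and binary x[v]. Two variables conflict
    when their two coefficients together already exceed rhs.
    """
    if any(a < 0 for a in coeffs.values()):
        raise ValueError("this sketch only handles non-negative coefficients")
    edges = set()
    # Check every unordered pair once, in a fixed (sorted) order.
    for u, v in combinations(sorted(coeffs), 2):
        if coeffs[u] + coeffs[v] > rhs:
            edges.add((u, v))
    return edges


# Hypothetical constraint: 3*x1 + 4*x2 + 2*x3 <= 5
conflicts = knapsack_conflicts({"x1": 3, "x2": 4, "x3": 2}, rhs=5)
```

Each returned pair is an edge of the conflict graph; cliques in that graph correspond to sets of variables of which at most one can be 1.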

arXiv:2311.00669 [pdf, ps, other]

The ground-state phase diagram for an alternative anisotropic extension of quantum spin-1 ferromagnetic biquadratic model

Authors: Yan-Wei Dai, Qian-Qian Shi, Xi-Hao Chen, Huan-Qiang Zhou

Abstract: The ground-state phase diagram is mapped out for an alternative anisotropic extension of quantum spin-1 ferromagnetic biquadratic model, which accommodates twelve distinct phases: three degenerate fractal phases, six Luttinger liquid phases and three symmetry-protected trivial phases. It is found that distinct types of quantum phase transitions are involved between them. In particular, one type arises from an instability of a Luttinger liquid towards a degenerate fractal phase, and the other type describes spontaneous symmetry breaking with type-B Goldstone modes from one degenerate fractal phase to another degenerate fractal phase, with the fractal dimension $d_f$ being identical to the number of the type-B Goldstone modes, both of which turn out to be one. In addition, quantum phase transitions from the Luttinger liquid phases to the symmetry-protected trivial phases are identified to be in the Kosterlitz-Thouless universality class, with central charge being one.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 11 pages, 8 figures, 2 tables

arXiv:2311.00574 [pdf, ps, other]

An alternative spontaneous symmetry breaking pattern for $\rm{U}(1)$ with no gapless Goldstone mode

Authors: Huan-Qiang Zhou, Qian-Qian Shi, Yan-Wei Dai

Abstract: An emergent gapless Goldstone mode originates from continuous spontaneous symmetry breaking, which has become a doctrine since the pioneering work by Goldstone [J. Goldstone, Nuovo Cimento 19, 154 (1961)]. However, we argue that it is possible for a continuous symmetry group $\rm{U}(1)$ to make an exceptional case, simply due to the well-known mathematical result that a continuous symmetry group $\rm{U}(1)$ may be regarded as a limit of a discrete symmetry group $Z_q$ when $q$ tends to infinity. As a consequence, spontaneous symmetry breaking for such a continuous symmetry group $\rm{U}(1)$ does not necessarily lead to any gapless Goldstone mode. This is explicitly explained for an anisotropic extension of the ferromagnetic spin-1 biquadratic model. In a sense, this model provides an illustrative example regarding the dichotomy between continuity and discreteness.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 13 pages, 9 figures, 9 tables

arXiv:2310.20155 [pdf]

doi 10.1021/acs.jctc.3c01203

MLatom 3: Platform for machine learning-enhanced computational chemistry simulations and workflows

Authors: Pavlo O. Dral, Fuchun Ge, Yi-Fan Hou, Peikun Zheng, Yuxinxin Chen, Mario Barbatti, Olexandr Isayev, Cheng Wang, Bao-Xin Xue, Max Pinheiro Jr, Yuming Su, Yiheng Dai, Yangtao Chen, Lina Zhang, Shuang Zhang, Arif Ullah, Quanhao Zhang, Yanchi Ou

Abstract: Machine learning (ML) is increasingly becoming a common tool in computational chemistry. At the same time, the rapid development of ML methods requires a flexible software framework for designing custom workflows. MLatom 3 is a program package designed to leverage the power of ML to enhance typical computational chemistry simulations and to create complex workflows. This open-source package provides plenty of choice to the users who can run simulations with the command line options, input files, or with scripts using MLatom as a Python package, both on their computers and on the online XACS cloud computing at XACScloud.com. Computational chemists can calculate energies and thermochemical properties, optimize geometries, run molecular and quantum dynamics, and simulate (ro)vibrational, one-photon UV/vis absorption, and two-photon absorption spectra with ML, quantum mechanical, and combined models. The users can choose from an extensive library of methods containing pre-trained ML models and quantum mechanical approximations such as AIQM1 approaching coupled-cluster accuracy. The developers can build their own models using various ML algorithms. The great flexibility of MLatom is largely due to the extensive use of the interfaces to many state-of-the-art software packages and libraries.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.18635 [pdf, other]

T-PickSeer: Visual Analysis of Taxi Pick-up Point Selection Behavior

Authors: Shuxian Gu, Yemo Dai, Zezheng Feng, Yong Wang, Haipeng Zeng

Abstract: Taxi drivers often take much time to navigate the streets to look for passengers, which leads to high vacancy rates and wasted resources. Empty taxi cruising remains a big concern for taxi companies. Analyzing the pick-up point selection behavior can solve this problem effectively, providing suggestions for taxi management and dispatch. Many studies have been devoted to analyzing and recommending hot-spot regions of pick-up points, which can make it easier for drivers to pick up passengers. However, the selection of pick-up points is complex and affected by multiple factors, such as convenience and traffic management. Most existing approaches cannot produce satisfactory results in real-world applications because of the changing travel demands and the lack of interpretability. In this paper, we introduce a visual analytics system, T-PickSeer, for taxi company analysts to better explore and understand the pick-up point selection behavior of passengers. We explore massive taxi GPS data and employ an overview-to-detail approach to enable effective analysis of pick-up point selection. Our system provides coordinated views to compare different regularities and characteristics in different regions. Also, our system assists in identifying potential pick-up points and checking the performance of each pick-up point. Three case studies based on a real-world dataset and interviews with experts have demonstrated the effectiveness of our system.

Submitted 28 October, 2023; originally announced October 2023.

Comments: 10 pages, 10 figures; The 10th China Visualization and Visual Analytics Conference

arXiv:2310.16850 [pdf, other]

doi 10.1016/j.jebo.2023.11.004

The impact of the Russia-Ukraine conflict on the extreme risk spillovers between agricultural futures and spots

Authors: Wei-Xing Zhou, Yun-Shi Dai, Kiet Tuan Duong, Peng-Fei Dai

Abstract: The ongoing Russia-Ukraine conflict between two major agricultural powers has posed significant threats and challenges to the global food system and world food security. Focusing on the impact of the conflict on the global agricultural market, we propose a new analytical framework for tail dependence, and combine the Copula-CoVaR method with the ARMA-GARCH-skewed Student-t model to examine the tail dependence structure and extreme risk spillover between agricultural futures and spots over the pre- and post-outbreak periods. Our results indicate that the tail dependence structures in the futures-spot markets of soybean, maize, wheat, and rice have all reacted to the Russia-Ukraine conflict. Furthermore, the outbreak of the conflict has intensified risks of the four agricultural markets in varying degrees, with the wheat market being affected the most. Additionally, all the agricultural futures markets exhibit significant downside and upside risk spillovers to their corresponding spot markets before and after the outbreak of the conflict, whereas the strengths of these extreme risk spillover effects demonstrate significant asymmetries at the directional (downside versus upside) and temporal (pre-outbreak versus post-outbreak) levels.

Submitted 24 October, 2023; originally announced October 2023.

Comments: 35 pages, 2 figures

Journal ref: Journal of Economic Behavior & Organization 217, 91-111 (2024)

arXiv:2310.16849 [pdf, other]

doi 10.1016/j.ribaf.2022.101677

Correlation structure analysis of the global agricultural futures market

Authors: Yun-Shi Dai, Ngoc Quang Anh Huynh, Qing-Huan Zheng, Wei-Xing Zhou

Abstract: This paper adopts the random matrix theory (RMT) to analyze the correlation structure of the global agricultural futures market from 2000 to 2020. It is found that the distribution of correlation coefficients is asymmetric and right skewed, and many eigenvalues of the correlation matrix deviate from the RMT prediction. The largest eigenvalue reflects a collective market effect common to all agricultural futures, the other largest deviating eigenvalues can be implemented to identify futures groups, and there are modular structures based on regional properties or agricultural commodities among the significant participants of their corresponding eigenvectors. Except for the smallest eigenvalue, other smallest deviating eigenvalues represent the agricultural futures pairs with highest correlations. This paper can be of reference and significance for using agricultural futures to manage risk and optimize asset allocation.

Submitted 24 October, 2023; originally announced October 2023.

Comments: 19 pages, 7 figures

Journal ref: Research in International Business and Finance 61, 101677 (2022)
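
The "RMT prediction" mentioned in the abstract above is commonly taken to be the Marchenko-Pastur eigenvalue range for the correlation matrix of uncorrelated, unit-variance series. The sketch below assumes that standard benchmark rather than reproducing the paper's procedure: for N series of length T (with T ≥ N), the bounds are (1 ± sqrt(N/T))². The function names and the example sizes are hypothetical.

```python
import math


def marchenko_pastur_bounds(n_series, n_obs):
    """Eigenvalue range expected for a purely random correlation matrix.

    Assumes n_series uncorrelated, unit-variance time series, each with
    n_obs observations, and n_obs >= n_series.
    """
    if n_series <= 0 or n_obs < n_series:
        raise ValueError("need 0 < n_series <= n_obs")
    ratio = math.sqrt(n_series / n_obs)
    return (1 - ratio) ** 2, (1 + ratio) ** 2


def deviating_eigenvalues(eigenvalues, n_series, n_obs):
    """Keep the eigenvalues that fall outside the random-matrix range."""
    lower, upper = marchenko_pastur_bounds(n_series, n_obs)
    return [ev for ev in eigenvalues if ev < lower or ev > upper]


# Hypothetical sizes: 25 futures series with 100 observations each.
lower, upper = marchenko_pastur_bounds(25, 100)
outliers = deviating_eigenvalues([0.1, 1.0, 2.0, 7.5], 25, 100)
```

Eigenvalues above the upper bound are the usual candidates for market-wide or group effects, while those below the lower bound are the small deviating eigenvalues the abstract associates with highly correlated pairs.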

arXiv:2310.16474 [pdf]

doi 10.21203/rs.3.rs-3439840/v1

Janus icosahedral particles: amorphization driven by three-dimensional atomic misfit and edge dislocation compensation

Authors: Zhen Sun, Yao Zhang, Zezhou Li, Xuanxuan Du, Zhiheng Xie, Yiheng Dai, Colin Ophus, Jihan Zhou

Abstract: Icosahedral nanoparticles composed of fivefold twinned tetrahedra have broad applications. The strain relief mechanism and angular deficiency in icosahedral multiply twinned particles are poorly understood in three dimensions. Here, we resolved the three-dimensional atomic structures of Janus icosahedral nanoparticles using atomic resolution electron tomography. A geometrically fivefold face consistently corresponds to a less ordered face like two hemispheres. We quantify the rich structural variety of icosahedra, including bond orientation order, bond length, and strain tensor, as well as the packing efficiency, atom number, and solid angle of each tetrahedron. These structural characteristics exhibit two-sided distribution. Edge dislocations near the axial atoms and small disordered domains fill the angular deficiency. Our findings provide new insights into how the fivefold symmetry can be compensated and the geometrically-necessary internal strains relieved in multiply twinned particles.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 30 pages, 5 figures

arXiv:2310.15925 [pdf, other]

doi 10.1051/0004-6361/202348351

Noema formIng Cluster survEy (NICE): Discovery of a starbursting galaxy group with a radio-luminous core at z=3.95

Authors: Luwenjia Zhou, Tao Wang, Emanuele Daddi, Rosemary Coogan, Hanwen Sun, Ke Xu, Vinodiran Arumugam, Shuowen Jin, Daizhong Liu, Shiying Lu, Nikolaj Sillassen, Yijun Wang, Yong Shi, Zhi-Yu Zhang, Qinghua Tan, Qiusheng Gu, David Elbaz, Aurelien Le Bail, Benjamin Magnelli, Carlos Gómez-Guijarro, Chiara d'Eugenio, Georgios E. Magdis, Francesco Valentino, Zhiyuan Ji, Raphael Gobat, et al. (12 additional authors not shown)

Abstract: The study of distant galaxy groups and clusters at the peak epoch of star formation is limited by the lack of a statistically and homogeneously selected and spectroscopically confirmed sample. Recent discoveries of concentrated starburst activities in cluster cores have opened a new window to hunt for these structures based on their integrated IR luminosities. Hereby we carry out the large NOEMA (NOrthern Extended Millimeter Array) program targeting a statistical sample of infrared-luminous sources associated with overdensities of massive galaxies at z>2, the Noema formIng Cluster survEy (NICE). We present the first result from the ongoing NICE survey, a compact group at z=3.95 in the Lockman Hole field (LH-SBC3), confirmed via four massive (M_star>10^10.5M_sun) galaxies detected in CO(4-3) and [CI](1-0) lines. The four CO-detected members of LH-SBC3 are distributed over a 180 kpc physical scale, and the entire structure has an estimated halo mass of ~10^13Msun and total star formation rate (SFR) of ~4000Msun/yr. In addition, the most massive galaxy hosts a radio-loud AGN with L_1.4GHz, rest = 3.0*10^25W/Hz. The discovery of LH-SBC3 demonstrates the feasibility of our method to efficiently identify high-z compact groups or forming cluster cores. The existence of these starbursting cluster cores up to z~4 provides critical insights into the mass assembly history of the central massive galaxies in clusters.

Submitted 29 April, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: 10 pages, 8 figures, published by A&A

Journal ref: A&A, 684, A196 (2024)

arXiv:2310.09726 [pdf, other]

FuseSR: Super Resolution for Real-time Rendering through Efficient Multi-resolution Fusion

Authors: Zhihua Zhong, Jingsen Zhu, Yuxin Dai, Chuankun Zheng, Yuchi Huo, Guanlin Chen, Hujun Bao, Rui Wang

Abstract: The workload of real-time rendering is steeply increasing as the demand for high resolution, high refresh rates, and high realism rises, overwhelming most graphics cards. To mitigate this problem, one of the most popular solutions is to render images at a low resolution to reduce rendering overhead, and then accurately upsample the low-resolution rendered image to the target resolution, a.k.a. super-resolution techniques. Most existing methods focus on exploiting information from low-resolution inputs, such as historical frames. The absence of high-frequency details in those LR inputs makes it hard for these methods to recover fine details in their high-resolution predictions. In this paper, we propose an efficient and effective super-resolution method that predicts high-quality upsampled reconstructions utilizing low-cost high-resolution auxiliary G-Buffers as additional input. With LR images and HR G-buffers as input, the network is required to align and fuse features at multiple resolution levels. We introduce an efficient and effective H-Net architecture to solve this problem and significantly reduce rendering overhead without noticeable quality deterioration. Experiments show that our method is able to produce temporally consistent reconstructions in $4 \times 4$ and even challenging $8 \times 8$ upsampling cases at 4K resolution with real-time performance, with substantially improved quality and significant performance boost compared to existing works.

Submitted 15 October, 2023; originally announced October 2023.

Comments: Accepted by SIGGRAPH Asia 2023. Project page: https://isaac-paradox.github.io/FuseSR/

arXiv:2310.08956 [pdf, other]

LRRU: Long-short Range Recurrent Updating Networks for Depth Completion

Authors: Yufei Wang, Bo Li, Ge Zhang, Qi Liu, Tao Gao, Yuchao Dai

Abstract: Existing deep learning-based depth completion methods generally employ massive stacked layers to predict the dense depth map from sparse input data. Although such approaches greatly advance this task, their accompanied huge computational complexity hinders their practical applications. To accomplish depth completion more efficiently, we propose a novel lightweight deep network framework, the Long-short Range Recurrent Updating (LRRU) network. Without learning complex feature representations, LRRU first roughly fills the sparse input to obtain an initial dense depth map, and then iteratively updates it through learned spatially-variant kernels. Our iterative update process is content-adaptive and highly flexible, where the kernel weights are learned by jointly considering the guidance RGB images and the depth map to be updated, and large-to-small kernel scopes are dynamically adjusted to capture long-to-short range dependencies. Our initial depth map has coarse but complete scene depth information, which helps relieve the burden of directly regressing the dense depth from sparse ones, while our proposed method can effectively refine it to an accurate depth map with less learnable parameters and inference time. Experimental results demonstrate that our proposed LRRU variants achieve state-of-the-art performance across different parameter regimes. In particular, the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and ranks 1st on the KITTI depth completion benchmark at the time of submission. Project page: https://npucvr.github.io/LRRU/.

Submitted 13 October, 2023; originally announced October 2023.

Comments: Published in ICCV 2023

arXiv:2310.08303 [pdf, other]

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai

Abstract: We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation. An orthogonality constraint is applied between the shared and specific representations to maintain the exclusive attribute of the factorized latent code. Further, a mutual information maximization regularizer is introduced to achieve extensive exploration of each modality. Quantitative and qualitative evaluations on the AVSBench demonstrate the effectiveness of our approach, leading to a new state-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging MS3 subset for multiple sound source segmentation.

Submitted 12 October, 2023; originally announced October 2023.

Comments: Accepted by ICCV 2023. Project page: https://npucvr.github.io/MMVAE-AVS Code: https://github.com/OpenNLPLab/MMVAE-AVS

arXiv:2310.08233 [pdf, other]

The Impact of Time Step Frequency on the Realism of Robotic Manipulation Simulation for Objects of Different Scales

Authors: Minh Q. Ta, Holly Dinkel, Hameed Abdul-Rashid, Yangfei Dai, Jessica Myers, Tan Chen, Junyi Geng, Timothy Bretl

Abstract: This work evaluates the impact of time step frequency and component scale on robotic manipulation simulation accuracy. Increasing the time step frequency for small-scale objects is shown to improve simulation accuracy. This simulation, demonstrating pre-assembly part picking for two object geometries, serves as a starting point for discussing how to improve Sim2Real transfer in robotic assembly processes.

Submitted 12 October, 2023; originally announced October 2023.

Comments: 3 pages, 3 figures, Best Poster Finalist at the 2023 Robotics and AI in Future Factory Workshop at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Video presentation [https://www.youtube.com/watch?v=JOXrBpMmI0A]. Robotics and AI in Future Factory workshop [https://sites.google.com/view/robot-ai-future-factory/]

arXiv:2310.08027 [pdf, other]

Exploring Large Language Models for Multi-Modal Out-of-Distribution Detection

Authors: Yi Dai, Hao Lang, Kaisheng Zeng, Fei Huang, Yongbin Li

Abstract: Out-of-distribution (OOD) detection is essential for reliable and trustworthy machine learning. Recent multi-modal OOD detection leverages textual information from in-distribution (ID) class names for visual OOD detection, yet it currently neglects the rich contextual information of ID classes. Large language models (LLMs) encode a wealth of world knowledge and can be prompted to generate descriptive features for each class. Indiscriminately using such knowledge causes catastrophic damage to OOD detection due to LLMs' hallucinations, as is observed by our analysis. In this paper, we propose to apply world knowledge to enhance OOD detection performance through selective generation from LLMs. Specifically, we introduce a consistency-based uncertainty calibration method to estimate the confidence score of each generation. We further extract visual objects from each image to fully capitalize on the aforementioned world knowledge. Extensive experiments demonstrate that our method consistently outperforms the state-of-the-art.

Submitted 12 October, 2023; originally announced October 2023.

Comments: EMNLP2023 Findings Long Paper

arXiv:2310.07968 [pdf, other]

Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

Authors: Yinpei Dai, Run Peng, Sikai Li, Joyce Chai

Abstract: Zero-Shot Object Navigation (ZSON) enables agents to navigate towards open-vocabulary objects in unknown environments. The existing works of ZSON mainly focus on following individual instructions to find generic object classes, neglecting the utilization of natural language interaction and the complexities of identifying user-specific objects. To address these limitations, we introduce Zero-shot Interactive Personalized Object Navigation (ZIPON), where robots need to navigate to personalized goal objects while engaging in conversations with users. To solve ZIPON, we propose a new framework termed Open-woRld Interactive persOnalized Navigation (ORION), which uses Large Language Models (LLMs) to make sequential decisions to manipulate different modules for perception, navigation and communication. Experimental results show that the performance of interactive agents that can leverage user feedback exhibits significant improvement. However, obtaining a good balance between task completion and the efficiency of navigation and interaction remains challenging for all methods. We further provide more findings on the impact of diverse user feedback forms on the agents' performance. Code is available at https://github.com/sled-group/navchat.

Submitted 29 May, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: Video URL: https://www.youtube.com/watch?v=rN5S8QIhhQc

arXiv:2310.05620 [pdf, other]

LAiW: A Chinese Legal Large Language Models Benchmark

Authors: Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, Hao Wang

Abstract: General and legal domain LLMs have demonstrated strong performance in various tasks of LegalAI. However, the current evaluations of these LLMs in LegalAI are defined by the experts of computer science, lacking consistency with the logic of legal practice, making it difficult to judge their practical capabilities. To address this challenge, we are the first to build the Chinese legal LLMs benchmark LAiW, based on the logic of legal practice. To align with the thinking process of legal experts and legal practice (syllogism), we divide the legal capabilities of LLMs from easy to difficult into three levels: basic information retrieval, legal foundation inference, and complex legal application. Each level contains multiple tasks to ensure a comprehensive evaluation. Through automated evaluation of current general and legal domain LLMs on our benchmark, we indicate that these LLMs may not align with the logic of legal practice. LLMs seem to be able to directly acquire complex legal application capabilities but perform poorly in some basic tasks, which may pose obstacles to their practical application and acceptance by legal experts. To further confirm the complex legal application capabilities of current LLMs in legal application scenarios, we also incorporate human evaluation with legal experts. The results indicate that while LLMs may demonstrate strong performance, they still require reinforcement of legal logic.

Submitted 18 February, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.03143 [pdf]

doi 10.1002/adfm.202405665

One-Dimensional Crystallographic Etching of Few-Layer WS$_2$

Authors: Shisheng Li, Yung-Chang Lin, Yiling Chiew, Yunyun Dai, Zixuan Ning, Hideaki Nakajima, Hong En Lim, Jing Wu, Yasuhisa Naito, Toshiya Okazaki, Zhipei Sun, Kazu Suenaga, Yoshiki Sakuma, Kazuhito Tsukagoshi, Takaaki Taniguchi

Abstract: Layer number-dependent band structures and symmetry are vital for the electrical and optical characteristics of two-dimensional (2D) transition metal dichalcogenides (TMDCs). Harvesting 2D TMDCs with tunable thickness and properties can be achieved through top-down etching and bottom-up growth strategies. In this study, we report a pioneering technique that utilizes the migration of in-situ generated Na-W-S-O droplets to etch out one-dimensional (1D) nanotrenches in few-layer WS$_2$. 1D WS$_2$ nanotrenches were successfully fabricated on the optically inert bilayer WS$_2$, showing pronounced photoluminescence and second harmonic generation signals. Additionally, we demonstrate the modulation of inkjet-printed Na$_2$WO$_4$-Na$_2$SO$_4$ particles to switch between the etching and growth modes by manipulating the sulfur supply. This versatile approach enables the creation of 1D nanochannels on 2D TMDCs. Our research presents exciting prospects for the top-down and bottom-up fabrication of 1D-2D mixed-dimensional TMDC nanostructures, expanding their use for photonic and optoelectronic applications.

Submitted 4 October, 2023; originally announced October 2023.

Comments: 37 pages, 16 figures

Journal ref: Advanced Functional Materials, 2024

arXiv:2310.00919 [pdf, other]

BAAF: A Benchmark Attention Adaptive Framework for Medical Ultrasound Image Segmentation Tasks

Authors: Gongping Chen, Lei Zhao, Xiaotao Yin, Liang Cui, Jianxun Zhang, Yu Dai

Abstract: The AI-based assisted diagnosis programs have been widely investigated on medical ultrasound images. Complex scenario of ultrasound image, in which the coupled interference of internal and external factors is severe, brings a unique challenge for localize the object region automatically and precisely in ultrasound images. In this study, we seek to propose a more general and robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors segment or diagnose lesions and tissues in ultrasound images more quickly and accurately. Different from existing attention schemes, the BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM). Specifically, BAAF first coarsely calibrates the input features from the channel and spatial dimensions, and then adaptively selects more robust lesion or tissue characterizations from the coarse-calibrated feature maps. The design of BAAF further optimizes the "what" and "where" focus and selection problems in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in medical ultrasound images. The method is evaluated on four medical ultrasound segmentation tasks, and the adequate experimental results demonstrate the remarkable performance improvement over existing state-of-the-art methods. In addition, the comparison with existing attention mechanisms also demonstrates the superiority of BAAF. This work provides the possibility for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00566 [pdf, other]

Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models

Authors: Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, Hao Wang

Abstract: In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as narrow knowledge scope and isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, and the novel instruction tuning data with over 45k samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM, existing state-of-art (SOTA) methods, open source and closed source LLMs on the build benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.

Submitted 17 February, 2024; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2309.17390 [pdf, other]

Forward Flow for Novel View Synthesis of Dynamic Scenes

Authors: Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang

Abstract: This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which is difficult to be fitted by commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such forward flow field is smooth and continuous within the object region, which benefits the motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. Project page: https://npucvr.github.io/ForwardFlowDNeRF

Submitted 29 September, 2023; originally announced September 2023.

Comments: Accepted by ICCV2023 as oral. Project page: https://npucvr.github.io/ForwardFlowDNeRF

Journal ref: ICCV2023

arXiv:2309.15082 [pdf, other]

RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Authors: Zhexiong Wan, Yuxin Mao, Jing Zhang, Yuchao Dai

Abstract: Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow.

Submitted 26 September, 2023; originally announced September 2023.

Comments: ICCV 2023. Project page: https://npucvr.github.io/RPEFlow Code: https://github.com/danqu130/RPEFlow

arXiv:2309.13816 [pdf, ps, other]

Exact penalty method for D-stationary point of nonlinear optimization

Authors: Xin-Wei Liu, Yu-Hong Dai

Abstract: We consider the nonlinear optimization problem with least $\ell_1$-norm measure of constraint violations and introduce the concepts of the D-stationary point, the DL-stationary point and the DZ-stationary point with the help of exact penalty function. If the stationary point is feasible, they correspond to the Fritz-John stationary point, the KKT stationary point and the singular stationary point, respectively. In order to show the usefulness of the new stationary points, we propose a new exact penalty sequential quadratic programming (SQP) method with inner and outer iterations and analyze its global and local convergence. The proposed method admits convergence to a D-stationary point and rapid infeasibility detection without driving the penalty parameter to zero, which demonstrates the commentary given in [SIAM J. Optim., 20 (2010), 2281--2299] and can be thought to be a supplement of the theory of nonlinear optimization on rapid detection of infeasibility. Some illustrative examples and preliminary numerical results demonstrate that the proposed method is robust and efficient in solving infeasible nonlinear problems and a degenerate problem without LICQ in the literature.

Submitted 24 September, 2023; originally announced September 2023.

Comments: 24 pages

MSC Class: 49M37; 65K05; 90C26; 90C30; 90C55

arXiv:2309.12300 [pdf, other]

See to Touch: Learning Tactile Dexterity through Visual Incentives

Authors: Irmak Guzey, Yinlong Dai, Ben Evans, Soumith Chintala, Lerrel Pinto

Abstract: Equipping multi-fingered robots with tactile sensing is crucial for achieving the precise, contact-rich, and dexterous manipulation that humans excel at. However, relying solely on tactile sensing fails to provide adequate cues for reasoning about objects' spatial configurations, limiting the ability to correct errors and adapt to changing situations. In this paper, we present Tactile Adaptation from Visual Incentives (TAVI), a new framework that enhances tactile-based dexterity by optimizing dexterous policies using vision-based rewards. First, we use a contrastive-based objective to learn visual representations. Next, we construct a reward function using these visual representations through optimal-transport based matching on one human demonstration. Finally, we use online reinforcement learning on our robot to optimize tactile-based policies that maximize the visual reward. On six challenging tasks, such as peg pick-and-place, unstacking bowls, and flipping slender objects, TAVI achieves a success rate of 73% using our four-fingered Allegro robot hand. The increase in performance is 108% higher than policies using tactile and vision-based rewards and 135% higher than policies without tactile observational input. Robot videos are best viewed on our project website: https://see-to-touch.github.io/.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.11559 [pdf, other]

A Close Look at Ly$α$ Emitters with JWST/NIRCam at $z\approx3.1$

Authors: Yixiao Liu, Y. Sophia Dai, Stijn Wuyts, Jia-Sheng Huang, Linhua Jiang

Abstract: We study 10 spectroscopically confirmed Ly$α$ emitters (LAEs) at $z\approx3.1$ in the UDS field, covered by JWST/NIRCam in the PRIMER program. All LAEs are detected in all NIRCam bands from F090W to F444W, corresponding to restframe 2200Å--1.2$\mathrm{μm}$. Based on morphological analysis of the F200W images, three out of the 10 targets are resolved into pair-like systems with separations of $<0.9''$, and another three show asymmetric structures. We then construct the spectral energy distributions (SEDs) of these LAEs. All sources, including the pairs, show similar SED shapes, with a prominent flux excess in the F200W band, corresponding to extremely strong [O III]+H$β$ emission lines (${\rm EW_{rest}}=740$--$6500\,$Å). The median effective radii, stellar mass, and UV slope of our sample are 0.36$\,$kpc, $3.8\times10^7\,M_\odot$, and --2.48, respectively. The average burst age, estimated by stellar mass over star formation rate, is $<40\,$Myr. These measurements reveal an intriguing starbursting dwarf galaxy population lying off the extrapolations of the $z \sim 3$ scaling relations to the low-mass end: $\sim 0.7$ dex above the star-forming main sequence, $\sim 0.35$ dex below the mass--size relation, and bluer in the UV slope than typical high-z galaxies at similar UV luminosities. We speculate that these numbers may require a larger main sequence scatter or tail in the dwarf galaxy regime towards the starburst outliers.

Submitted 2 April, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: 17 pages, 6 figures, ApJ accepted

arXiv:2309.09426 [pdf, other]

Joint Demosaicing and Denoising with Double Deep Image Priors

Authors: Taihui Li, Anish Lahiri, Yutong Dai, Owen Mayer

Abstract: Demosaicing and denoising of RAW images are crucial steps in the processing pipeline of modern digital cameras. As only a third of the color information required to produce a digital image is captured by the camera sensor, the process of demosaicing is inherently ill-posed. The presence of noise further exacerbates this problem. Performing these two steps sequentially may distort the content of the captured RAW images and accumulate errors from one step to another. Recent deep neural-network-based approaches have shown the effectiveness of joint demosaicing and denoising to mitigate such challenges. However, these methods typically require a large number of training samples and do not generalize well to different types and intensities of noise. In this paper, we propose a novel joint demosaicing and denoising method, dubbed JDD-DoubleDIP, which operates directly on a single RAW image without requiring any training data. We validate the effectiveness of our method on two popular datasets -- Kodak and McMaster -- with various noises and noise intensities. The experimental results show that our method consistently outperforms other compared methods in terms of PSNR, SSIM, and qualitative visual perception.

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2309.08622 [pdf, other]

Representation Learning in Low-rank Slate-based Recommender Systems

Authors: Yijia Dai, Wen Sun

Abstract: Reinforcement learning (RL) in recommendation systems offers the potential to optimize recommendations for long-term user engagement. However, the environment often involves large state and action spaces, which makes it hard to efficiently learn and explore. In this work, we propose a sample-efficient representation learning algorithm, using the standard slate recommendation setup, to treat this as an online RL problem with low-rank Markov decision processes (MDPs). We also construct the recommender simulation environment with the proposed setup and sampling method.

Submitted 18 September, 2023; v1 submitted 10 September, 2023; originally announced September 2023.

Comments: in MFPL, ICML 2023

arXiv:2309.08348 [pdf, other]

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

Submitted 15 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures

arXiv:2309.04953 [pdf, ps, other]

Extracting the number of type-B Goldstone modes and the dynamical critical exponent for a type of scale-invariant states

Authors: Huan-Qiang Zhou, Yan-Wei Dai, Qian-Qian Shi, Ian P. McCulloch, Murray T. Batchelor

Abstract: A generic scheme is proposed to perform a finite-entanglement scaling analysis for scale-invariant states, which appear to be highly degenerate ground states arising from spontaneous symmetry breaking with type-B Goldstone modes. This allows us to extract the number of type-B Goldstone modes and the dynamical critical exponent, in combination with a finite block-size scaling analysis, from numerical simulations of quantum many-body systems in the context of tensor network representations. The number of type-B Goldstone modes is identical to the fractal dimension, thus reflecting an abstract fractal underlying the ground state subspace. As illustrative examples, we investigate the spin-$s$ Heisenberg ferromagnetic model, the $\rm{SU}(3)$ ferromagnetic model and the $\rm{SO}(4)$ spin-orbital model.

Submitted 30 November, 2023; v1 submitted 10 September, 2023; originally announced September 2023.

Comments: 14 pages, 24 figures, 11 tables. Minor changes

arXiv:2309.03559 [pdf, other]

An Anchor Learning Approach for Citation Field Learning

Authors: Zilin Yuan, Borun Chen, Yimeng Dai, Yinghui Li, Hai-Tao Zheng, Rui Zhang

Abstract: Citation field learning segments a citation string into fields of interest such as author, title, and venue. Extracting such fields from citations is crucial for citation indexing, researcher profile analysis, etc. User-generated resources like academic homepages and Curriculum Vitae provide rich citation field information. However, extracting fields from these resources is challenging due to inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to boost the citation field learning performance. CIFAL leverages anchor learning, which is model-agnostic for any Pre-trained Language Model, to help capture citation patterns from the data of different citation styles. The experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.68% improvement in field-level F1-scores. Extensive analysis of the results further confirms the effectiveness of CIFAL quantitatively and qualitatively.

Submitted 14 December, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: accepted by ICASSP2024

arXiv:2309.03490 [pdf, other]

Lipschitz Transport Maps via the Follmer Flow

Authors: Yin Dai, Yuan Gao, Jian Huang, Yuling Jiao, Lican Kang, Jin Liu

Abstract: Inspired by the construction of the Föllmer process, we construct a unit-time flow on the Euclidean space, termed the Föllmer flow, whose flow map at time 1 pushes forward a standard Gaussian measure onto a general target measure. We study the well-posedness of the Föllmer flow and establish the Lipschitz property of the flow map at time 1. We apply the Lipschitz mapping to several rich classes of probability measures to derive dimension-free functional inequalities and concentration inequalities for the empirical measure.

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2309.03126 [pdf, other]

Everyone Deserves A Reward: Learning Customized Human Preferences

Authors: Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du

Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences with respect to different religions, politics, cultures, etc. Moreover, each individual can have their unique preferences on various topics. Neglecting the diversity of human preferences, current human feedback aligning methods only consider a general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. Besides, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve the general preferring ability while training the customized RMs, especially general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.

Submitted 15 September, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

arXiv:2309.02043 [pdf, other]

Decomposed Guided Dynamic Filters for Efficient RGB-Guided Depth Completion

Authors: Yufei Wang, Yuxin Mao, Qi Liu, Yuchao Dai

Abstract: RGB-guided depth completion aims at predicting dense depth maps from sparse depth measurements and corresponding RGB images, where how to effectively and efficiently exploit the multi-modal information is a key issue. Guided dynamic filters, which generate spatially-variant depth-wise separable convolutional filters from RGB features to guide depth features, have been proven to be effective in this task. However, the dynamically generated filters require massive model parameters, computational costs and memory footprints when the number of feature channels is large. In this paper, we propose to decompose the guided dynamic filters into a spatially-shared component multiplied by content-adaptive adaptors at each spatial location. Based on the proposed idea, we introduce two decomposition schemes A and B, which decompose the filters by splitting the filter structure and using spatial-wise attention, respectively. The decomposed filters not only maintain the favorable properties of guided dynamic filters as being content-dependent and spatially-variant, but also reduce model parameters and hardware costs, as the learned adaptors are decoupled from the number of feature channels. Extensive experimental results demonstrate that the methods using our schemes outperform state-of-the-art methods on the KITTI dataset, and rank 1st and 2nd on the KITTI benchmark at the time of submission. Meanwhile, they also achieve comparable performance on the NYUv2 dataset. In addition, our proposed methods are general and could be employed as plug-and-play feature fusion blocks in other multi-modal fusion tasks such as RGB-D salient object detection.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2308.14032 [pdf, ps, other]

$ρ$-meson longitudinal leading-twist distribution amplitude revisited and the $D\to ρ$ semileptonic decay

Authors: Tao Zhong, Ya-Hong Dai, Hai-Bing Fu

Abstract: Motivated by our previous work [Phys. Rev. D 104, no.1, 016021 (2021)] on pionic leading-twist distribution amplitude (DA), we revisit $ρ$-meson leading-twist longitudinal DA $φ_{2;ρ}^\|(x,μ)$ in this paper. A model proposed by Chang based on the Dyson-Schwinger equations (DSEs) is adopted to describe the behavior of $φ_{2;ρ}^\|(x,μ)$. On the other hand, the $ξ$-moments of $φ_{2;ρ}^\|(x,μ)$ are calculated with the QCD sum rules in the framework of the background field theory. The sum rule formula for those moments is improved. More accurate values for the first five nonzero $ξ$-moments at typical scale $μ=1, 1.4, 2, 3~{\rm GeV}$ are given, e.g., at $μ= 1~{\rm GeV}$, $\langleξ^2\rangle_{2;ρ}^\| = 0.220(6)$, $\langleξ^4\rangle_{2;ρ}^\| = 0.103(4)$, $\langleξ^6\rangle_{2;ρ}^\| = 0.066(5)$, $\langleξ^8\rangle_{2;ρ}^\| = 0.046(4)$ and $\langleξ^{10}\rangle_{2;ρ}^\| = 0.035(3)$. By fitting those values with the least squares method, the DSE model for $φ_{2;ρ}^\|(x,μ)$ is determined. By taking the left-handed current light-cone sum rule approach, we get the transition form factor at large recoil region, i.e., $A_1(0) = 0.498^{+0.014}_{-0.012}$, $A_2(0)=0.460^{+0.055}_{-0.047}$, $V(0) = 0.800^{+0.015}_{-0.014}$, and the ratio $r_2 = 0.923^{+0.133}_{-0.119}$, $r_V = 1.607^{+0.071}_{-0.071}$. After making the extrapolation with a rapidly converging series based on $z(t)$-expansion, we present the decay width for the semileptonic decays $D\toρ\ell^+ν_\ell$. Finally, the branching fractions are $\mathcal{B}(D^0\to ρ^- e^+ ν_e) = 1.889^{+0.176}_{-0.170}\pm 0.005$, $\mathcal{B}(D^+ \to ρ^0 e^+ ν_e) = 2.380^{+0.221}_{-0.214}\pm 0.012$, $\mathcal{B}(D^0\to ρ^- μ^+ ν_μ) = 1.881^{+0.174}_{-0.168}\pm 0.005$, $\mathcal{B}(D^+ \to ρ^0 μ^+ ν_μ) =2.369^{+0.219}_{-0.211}\pm 0.011$.

Submitted 27 August, 2023; originally announced August 2023.

Comments: 9 pages, 3 figures

arXiv:2308.13774 [pdf, other]

Central Similarity Multi-View Hashing for Multimedia Retrieval

Authors: Jian Zhu, Wen Cheng, Yu Cui, Chang Tang, Yuyang Dai, Yong Li, Lingfang Zeng

Abstract: Hash representation learning of multi-view heterogeneous data is the key to improving the accuracy of multimedia retrieval. However, existing methods utilize local similarity and fall short of deeply fusing the multi-view features, resulting in poor retrieval accuracy. Current methods only use local similarity to train their model and ignore global similarity. Furthermore, most recent works fuse the multi-view features via a weighted sum or concatenation. We contend that these fusion methods are insufficient for capturing the interaction between various views. We present a novel Central Similarity Multi-View Hashing (CSMVH) method to address the mentioned problems. Central similarity learning is used for solving the local similarity problem, which can utilize the global similarity between the hash center and samples. We present copious empirical data demonstrating the superiority of gate-based fusion over conventional approaches. On the MS COCO and NUS-WIDE datasets, the proposed CSMVH performs better than the state-of-the-art methods by a large margin (up to 11.41% mean Average Precision (mAP) improvement).

Submitted 26 August, 2023; originally announced August 2023.

Comments: accepted by the Asia Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data (APWeb-WAIM2023)

arXiv:2308.13191 [pdf, other]

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Authors: Jiawen Xie, Pengyu Cheng, Xiao Liang, Yong Dai, Nan Du

Abstract: Although dominant in natural language processing, transformer-based models remain challenged by the task of long-sequence processing, because the computational cost of self-attention operations in transformers swells quadratically with the input sequence length. To alleviate the complexity of long-sequence processing, we propose a simple framework to enable the off-the-shelf pre-trained transformers to process much longer sequences, while the computation and memory costs continue to grow linearly with the input sequence lengths. More specifically, our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps, and finally selects the most representative hidden states from the encoder for the decoding process. To extract inter-chunk semantic information, we align the start and end token embeddings among chunks in each encoding transformer block. To learn an effective hidden selection policy, we design a dual updating scheme inspired by reinforcement learning, which regards the decoders of transformers as environments, and the downstream performance metrics as the rewards to evaluate the hidden selection actions. Our empirical results on real-world long-text summarization and reading comprehension tasks demonstrate effective improvements compared to prior long-sequence processing baselines.
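As a rough illustration of why chunking changes the cost profile described in this abstract, the sketch below splits a token sequence into fixed-size chunks and counts pairwise attention interactions. It is not the authors' implementation: the function names, the fixed `chunk_len`, and the use of a simple token-pair count as a cost proxy are all assumptions made for illustration, and the alignment and selection steps are not modelled.

```python
from typing import List


def split_into_chunks(token_ids: List[int], chunk_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_len tokens.

    Hypothetical helper for illustration; the paper's chunking details may differ.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]


def full_attention_pairs(seq_len: int) -> int:
    """Token-pair count for self-attention over the whole sequence (quadratic)."""
    return seq_len * seq_len


def chunked_attention_pairs(seq_len: int, chunk_len: int) -> int:
    """Token-pair count when self-attention is applied within each chunk only."""
    chunks = split_into_chunks(list(range(seq_len)), chunk_len)
    return sum(len(chunk) * len(chunk) for chunk in chunks)


if __name__ == "__main__":
    for n in (4096, 8192):
        print(n, full_attention_pairs(n), chunked_attention_pairs(n, 512))
```

With a fixed chunk length, doubling the sequence length doubles the chunked pair count, whereas the full-attention count quadruples.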

Submitted 5 July, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

Comments: ACL 2024

arXiv:2308.11925 [pdf, other]

Solving Elliptic Optimal Control Problems via Neural Networks and Optimality System

Authors: Yongcheng Dai, Bangti Jin, Ramesh Sau, Zhi Zhou

Abstract: In this work, we investigate a neural network based solver for optimal control problems (without / with box constraint) for linear and semilinear second-order elliptic problems. It utilizes a coupled system derived from the first-order optimality system of the optimal control problem, and employs deep neural networks to represent the solutions to the reduced system. We present an error analysis of the scheme, and provide $L^2(Ω)$ error bounds on the state, control and adjoint in terms of neural network parameters (e.g., depth, width, and parameter bounds) and the numbers of sampling points. The main tools in the analysis include offset Rademacher complexity and boundedness and Lipschitz continuity of neural network functions. We present several numerical examples to illustrate the method and compare it with two existing ones.

Submitted 8 May, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: 26 pages

arXiv:2308.10705 [pdf, other]

Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

Authors: Haorui Ji, Hui Deng, Yuchao Dai, Hongdong Li

Abstract: Most of the previous 3D human pose estimation work relied on the powerful memory capability of the network to obtain suitable 2D-3D mappings from the training data. Few works have studied the modeling of human posture deformation in motion. In this paper, we propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton, and a frame-by-frame skeleton deformation. A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations, and then sum them to obtain the pose of each frame. Subsequently, a loss term based on the diffusion model is used to ensure that the pipeline learns the correct prior motion knowledge. Finally, we have evaluated our proposed method on mainstream datasets and obtained superior results outperforming the state-of-the-art.
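The decomposition mentioned in this abstract (one shared reference skeleton plus a per-frame deformation, summed to give each frame's pose) can be sketched as follows. This covers only the final summation step and assumes skeletons are stored as lists of `[x, y, z]` joint coordinates; the estimation of both terms by the NRSfMformer is not modelled, and all names here are hypothetical.

```python
from typing import List

Joint = List[float]        # [x, y, z] coordinates of one joint
Skeleton = List[Joint]     # one entry per joint


def compose_poses(reference: Skeleton, deformations: List[Skeleton]) -> List[Skeleton]:
    """Add each frame's deformation to the shared reference skeleton.

    Illustrative only: pose_t[j] = reference[j] + deformation_t[j] for every joint j.
    """
    poses = []
    for deformation in deformations:
        if len(deformation) != len(reference):
            raise ValueError("deformation and reference must have the same number of joints")
        pose = [
            [r + d for r, d in zip(ref_joint, def_joint)]
            for ref_joint, def_joint in zip(reference, deformation)
        ]
        poses.append(pose)
    return poses


if __name__ == "__main__":
    reference = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # two-joint toy skeleton
    deformations = [
        [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]],             # frame 0
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],             # frame 1
    ]
    print(compose_poses(reference, deformations))
```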

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.09064 [pdf, other]

The Lyman Continuum Escape Fraction of Star-forming Galaxies at $2.4\lesssim z\lesssim3.7$ from UVCANDELS

Authors: Xin Wang, Harry I. Teplitz, Brent M. Smith, Rogier A. Windhorst, Marc Rafelski, Vihang Mehta, Anahita Alavi, Gabriel Brammer, James Colbert, Norman Grogin, Nimish P. Hathi, Anton M. Koekemoer, Laura Prichard, Claudia Scarlata, Ben Sunnquist, Pablo Arrabal Haro, Christopher Conselice, Eric Gawiser, Yicheng Guo, Matthew Hayes, Rolf A. Jansen, Zhiyuan Ji, Ray A. Lucas, Robert O'Connell, Brant Robertson , et al. (52 additional authors not shown)

Abstract: The UltraViolet Imaging of the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey Fields (UVCANDELS) survey is a Hubble Space Telescope (HST) Cycle-26 Treasury Program, allocated in total 164 orbits of primary Wide-Field Camera 3 Ultraviolet and Visible light F275W imaging with coordinated parallel Advanced Camera for Surveys F435W imaging, on four of the five premier extragalactic survey fields: GOODS-N, GOODS-S, EGS, and COSMOS. We introduce this survey by presenting a thorough search for galaxies at $z\gtrsim2.4$ that leak significant Lyman continuum (LyC) radiation, as well as a stringent constraint on the LyC escape fraction ($f_{\rm esc}$) from stacking the UV images of a population of star-forming galaxies with secure redshifts. Our extensive search for LyC emission and stacking analysis benefit from the catalogs of high-quality spectroscopic redshifts compiled from archival ground-based data and HST slitless spectroscopy, carefully vetted by dedicated visual inspection efforts. We report a sample of five galaxies as individual LyC leaker candidates, showing $f_{\rm esc}^{\rm rel}\gtrsim60\%$ estimated using detailed Monte Carlo analysis of intergalactic medium attenuation. We develop a robust stacking method to apply to five samples of in total 85 non-detection galaxies in the redshift range of $z\in[2.4,3.7]$. Most stacks give tight 2-$σ$ upper limits below $f_{\rm esc}^{\rm rel}<6\%$. A stack for a subset of 32 emission-line galaxies shows tentative LyC leakage detected at 2.9-$σ$, indicating $f_{\rm esc}^{\rm rel}=5.7\%$ at $z\sim2.65$, supporting the key role of such galaxies in contributing to the cosmic reionization and maintaining the UV ionization background. These new F275W and F435W imaging mosaics from UVCANDELS have been made publicly available on the Barbara A. Mikulski Archive for Space Telescopes.

Submitted 17 August, 2023; originally announced August 2023.

Comments: 33 pages, 21 figures, and 5 tables. Resubmitted after addressing the referee report

arXiv:2308.08488 [pdf, other]

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee

Abstract: In recent research, only a slight performance improvement is observed from automatic speech recognition systems to audio-visual speech recognition systems in the end-to-end framework with low-quality videos. Mismatched convergence rates and specialized input representations between audio and visual modalities are considered to cause the problem. In this paper, we propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. This enables accurate alignment of video and audio streams during visual model pre-training and cross-modal fusion. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers to make full use of modality complementarity. Experiments on the MISP2021-AVSR data set show the effectiveness of the two proposed techniques. Together, using only a relatively small amount of training data, the final system achieves better performance than state-of-the-art systems with more complex front-ends and back-ends.

Submitted 8 March, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

Comments: 6 pages, 2 figures, published in ICME2023

arXiv:2308.08288 [pdf, other]

Improving Audio-Visual Segmentation with Bidirectional Generation

Authors: Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong

Abstract: The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.

Submitted 19 December, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

Comments: AAAI Camera Ready. Dawei Hao and Yuxin Mao contributed equally to this paper. Yiran Zhong is the corresponding author. The code will be released at https://github.com/OpenNLPLab/AVS-bidirectional

arXiv:2308.04413 [pdf, other]

Digging into Depth Priors for Outdoor Neural Radiance Fields

Authors: Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, Liangjun Zhang

Abstract: Neural Radiance Fields (NeRF) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse viewpoints setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate the issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated. Moreover, the relative merits of different ways of using the depth priors also remain unexplored. In this paper, we provide a comprehensive study and evaluation of employing depth priors in outdoor neural radiance fields, covering common depth sensing technologies and most ways of applying them. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly-used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can potentially benefit practitioners and researchers in training their NeRF models with depth priors. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth

arXiv:2308.02809 [pdf]

3D front tip fields in creeping solids under constraint effects: a higher-order asymptotic solution

Authors: Weichen Kong, Yanwei Dai, Yinghua Liu

Abstract: As one of the most important topics in creep fracture mechanics, the mechanics fields at three-dimensional (3D) sharp V-notches and crack tips have drawn tremendous attention. Despite many years of effort on constraint theory for creeping solids, it remains unclear how in-plane and out-of-plane constraint effects interact for 3D sharp V-notches and cracks in creeping solids. To shed light on this topic, a 3D higher-order term solution for sharp V-notches in creeping materials subjected to mode 1 loading is established by introducing the out-of-plane factor, which is the out-of-plane stress divided by the sum of the in-plane normal stresses. The solution naturally degenerates to the case of a 3D crack. Based on the 3D higher-order term solution, a new fracture parameter is proposed and combined with to characterize 3D constraint effect. It is found that the stress exponents and angular distribution of higher-order term for 3D notches and cracks are highly related to . The proposed higher-order term solutions show better agreement with the FEA results than the 3D leading-term and 2D two-term solutions, especially for smaller notch angles and ligament width. Moreover, the presented 3D constraint theory shows that effects of and are highly interlinked rather than simply separated. It implies that the 3D constraint level may be significantly influenced by . The 3D mathematical solutions discussed in this paper could enhance the understanding of the 3D effect and have the potential to explain the 3D constraint effect on the notches and cracks under creep conditions.

Submitted 5 August, 2023; originally announced August 2023.

Comments: 56 pages, 25 figures

arXiv:2308.00041 [pdf, other]

doi 10.3847/1538-4357/aced3e

UV-Bright Star-Forming Clumps and Their Host Galaxies in UVCANDELS at 0.5 $\leq$ z $\leq$ 1

Authors: Alec Martin, Yicheng Guo, Xin Wang, Anton M. Koekemoer, Marc Rafelski, Harry I. Teplitz, Rogier A. Windhorst, Anahita Alavi, Norman A. Grogin, Laura Prichard, Ben Sunnquist, Daniel Ceverino, Nima Chartab, Christopher J. Conselice, Y. Sophia Dai, Avishai Dekel, Johnathan P. Gardner, Eric Gawiser, Nimish P. Hathi, Matthew J. Hayes, Rolf A. Jansen, Zhiyuan Ji, David C. Koo, Ray A. Lucas, Nir Mandelker, et al. (10 additional authors not shown)

Abstract: Giant star-forming clumps are a prominent feature of star-forming galaxies (SFGs) and contain important clues on galaxy formation and evolution. However, basic demographics of clumps and their host galaxies remain uncertain. Using the HST/WFC3 F275W images from the Ultraviolet Imaging of the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (UVCANDELS), we detect and analyze giant star-forming clumps in galaxies at 0.5 $\leq$ z $\leq$ 1, connecting two epochs when clumps are common (at cosmic high-noon, z $\sim$ 2) and rare (in the local universe). We construct a clump sample whose rest-frame 1600 Å luminosity is 3 times higher than the most luminous local HII regions (M$_{UV} \leq -$16 AB). In our sample, 35 $\pm$ 3$\%$ of low-mass galaxies (log[M$_{*}$/M$_{\odot}$] $<$ 10) are clumpy (i.e., containing at least one off-center clump). This fraction changes to 22 $\pm$ 3$\%$ and 22 $\pm$ 4$\%$ for intermediate (10 $\leq$ log[M$_{*}$/M$_{\odot}$] $\leq$ 10.5) and high-mass (log[M$_{*}$/M$_{\odot}$] $>$ 10.5) galaxies in agreement with previous studies. When compared to similar-mass non-clumpy SFGs, low- and intermediate-mass clumpy SFGs tend to have higher SFRs and bluer rest-frame U-V colors, while high-mass clumpy SFGs tend to be larger than non-clumpy SFGs. However, clumpy and non-clumpy SFGs have similar Sérsic index, indicating a similar underlying density profile. Furthermore, we investigate how UV luminosity of star-forming regions correlates with the physical properties of host galaxies. On average, more luminous star-forming regions reside in more luminous, smaller, and/or higher-specific SFR galaxies and are found closer to their hosts' galactic center.

Submitted 2 October, 2023; v1 submitted 31 July, 2023; originally announced August 2023.

Comments: 21 pages, 13 figures, accepted for publication in ApJ

Journal ref: ApJ 955 106 (2023)

arXiv:2307.16579 [pdf, other]

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai

Abstract: We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.

Submitted 31 July, 2023; originally announced July 2023.
