Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai

Abstract: While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html.

Submitted 30 August, 2024; originally announced August 2024.

Comments: Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page: https://joslefaure.github.io/assets/html/hermes.html

arXiv:2408.11038

Multiple Topology Replica Exchange of Expanded Ensembles (MT-REXEE) for Multidimensional Alchemical Calculations

Authors: Anika J. Friedman, Wei-Tse Hsu, Michael R. Shirts

Abstract: Relative free energy calculations are now widely used in academia and industry, but the accuracy is often limited by poor sampling of the complex's conformational ensemble. To address this, we have developed a novel method termed Multi-Topology Replica Exchange of Expanded Ensembles (MT-REXEE). This method enables parallel expanded ensemble calculations, facilitating iterative relative free energy computations while allowing conformational exchange between parallel transformations. These iterative transformations are adaptable to any set of systems with a common backbone or central substructure. We demonstrate that the MT-REXEE method maintains thermodynamic cycle closure to the same extent as standard expanded ensemble for both solvation free energy and relative binding free energy. The transformations tested involve simple systems that incorporate diverse heavy atoms and multi-site perturbations of a small molecule core resembling multi-site $\lambda$ dynamics, without necessitating modifications to the MD code, which in our initial implementation is GROMACS. We outline a systematic approach for topology set-up and provide instructions on how to perform inter-replicate coordinate modifications. This work shows that MT-REXEE can be used to perform accurate and reproducible free energy estimates and prompts expansion to more complex test systems and other molecular dynamics simulation infrastructures.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2408.09481

PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

Authors: Meng Luo, Hao Fei, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong-Li Lee, Wynne Hsu

Abstract: While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our code and data are open at https://PanoSent.github.io/

Submitted 18 August, 2024; originally announced August 2024.

Comments: Accepted by ACM MM 2024 (Oral)

arXiv:2407.15291

Evidence-Based Temporal Fact Verification

Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee

Abstract: Automated fact verification plays an essential role in fostering trust in the digital space. Despite the growing interest, the verification of temporal facts has not received much attention in the community. Temporal fact verification brings new challenges where cues of the temporal information need to be extracted and temporal reasoning involving various temporal aspects of the text must be applied. In this work, we propose an end-to-end solution for temporal fact verification that considers the temporal information in claims to obtain relevant evidence sentences and harnesses the power of a large language model for temporal reasoning. Recognizing that temporal facts often involve events, we model these events in the claim and evidence sentences. We curate two temporal fact datasets to learn time-sensitive representations that encapsulate not only the semantic relationships among the events, but also their chronological proximity. This allows us to retrieve the top-k relevant evidence sentences and provide the context for a large language model to perform temporal reasoning and output whether a claim is supported or refuted by the retrieved evidence sentences. Experiment results demonstrate that the proposed approach significantly enhances the accuracy of temporal claim verification, thereby advancing the current state of the art in automated fact verification.

Submitted 18 August, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.12867

Swift-BAT GUANO follow-up of gravitational-wave triggers in the third LIGO-Virgo-KAGRA observing run

Authors: Gayathri Raman, Samuele Ronchini, James Delaunay, Aaron Tohuvavohu, Jamie A. Kennea, Tyler Parsotan, Elena Ambrosi, Maria Grazia Bernardini, Sergio Campana, Giancarlo Cusumano, Antonino D'Ai, Paolo D'Avanzo, Valerio D'Elia, Massimiliano De Pasquale, Simone Dichiara, Phil Evans, Dieter Hartmann, Paul Kuin, Andrea Melandri, Paul O'Brien, Julian P. Osborne, Kim Page, David M. Palmer, Boris Sbarufatti, Gianpiero Tagliaferri, et al. (1797 additional authors not shown)

Abstract: We present results from a search for X-ray/gamma-ray counterparts of gravitational-wave (GW) candidates from the third observing run (O3) of the LIGO-Virgo-KAGRA (LVK) network using the Swift Burst Alert Telescope (Swift-BAT). The search includes 636 GW candidates received in low latency, 86 of which have been confirmed by the offline analysis and included in the third cumulative Gravitational-Wave Transient Catalog (GWTC-3). Targeted searches were carried out on the entire GW sample using the maximum-likelihood NITRATES pipeline on the BAT data made available via the GUANO infrastructure. We do not detect any significant electromagnetic emission that is temporally and spatially coincident with any of the GW candidates. We report flux upper limits in the 15-350 keV band as a function of sky position for all the catalog candidates. For GW candidates where the Swift-BAT false alarm rate is less than $10^{-3}$ Hz, we compute the GW-BAT joint false alarm rate. Finally, the derived Swift-BAT upper limits are used to infer constraints on the putative electromagnetic emission associated with binary black hole mergers.

Submitted 13 July, 2024; originally announced July 2024.

Comments: 50 pages, 10 figures, 4 tables

arXiv:2407.03648

High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

Authors: Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

Abstract: We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low frame rate 48 kHz stereo variational autoencoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective, the model can generate and edit diverse high quality stereo samples of variable duration, with simple text descriptions. We also explore a new regularized latent inversion method for zero-shot test-time text-guided editing and demonstrate its superior performance over naive denoising diffusion implicit model (DDIM) inversion for a variety of music editing prompts. Evaluations are conducted on both objective and subjective metrics and demonstrate that the proposed model is not only competitive with the evaluated baselines on a standard text-to-music benchmark, in both quality and efficiency, but also outperforms the previous state of the art for music editing when combined with our proposed latent inversion. Samples are available at https://melodyflow.github.io.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.13578

Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

Authors: Han-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung Fan

Abstract: In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose retrieval augmented pretraining, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 on the MCQ dataset and from 15.92 to 16.50 on the SciQ dataset.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Findings at ACL 2024

arXiv:2406.10923

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Authors: Hung-Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu

Abstract: Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM

Submitted 16 June, 2024; originally announced June 2024.

Comments: Project page: https://ander1119.github.io/TiM

arXiv:2406.09342

Wavefront shaping simulations with augmented partial factorization

Authors: Ho-Chun Lin, Zeyu Wang, Chia Wei Hsu

Abstract: Wavefront shaping can tailor multipath interference to control multiple scattering of waves in complex optical systems. However, full-wave simulations that capture multiple scattering are computationally demanding given the large system size and the large number of input channels. Recently, an "augmented partial factorization" (APF) method was proposed to significantly speed up such full-wave simulations. In this tutorial, we illustrate how to perform wavefront shaping simulations with the APF method using the open-source frequency-domain electromagnetic scattering solver MESTI. We present the foundational concepts and then walk through four examples: computing the scattering matrix of a slab with random permittivities, open high-transmission channels through disorder, focusing inside disorder with phase conjugation, and reflection matrix computation in a spatial focused-beam basis. The goal is to lower the barrier for researchers to use simulations to explore the rich phenomena enabled by wavefront shaping.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.09272

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

Abstract: Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

Submitted 25 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound. ECCV 2024 camera-ready version

arXiv:2406.07777

Unifying Interpretability and Explainability for Alzheimer's Disease Progression Prediction

Authors: Raja Farrukh Ali, Stephanie Milani, John Woods, Emmanuel Adenij, Ayesha Farooq, Clayton Mansel, Jeffrey Burns, William Hsu

Abstract: Reinforcement learning (RL) has recently shown promise in predicting Alzheimer's disease (AD) progression due to its unique ability to model domain knowledge. However, it is not clear which RL algorithms are well-suited for this task. Furthermore, these methods are not inherently explainable, limiting their applicability in real-world clinical scenarios. Our work addresses these two important questions. Using a causal, interpretable model of AD, we first compare the performance of four contemporary RL algorithms in predicting brain cognition over 10 years using only baseline (year 0) data. We then apply SHAP (SHapley Additive exPlanations) to explain the decisions made by each algorithm in the model. Our approach combines interpretability with explainability to provide insights into the key factors influencing AD progression, offering both global and individual, patient-level analysis. Our findings show that only one of the RL methods is able to satisfactorily model disease progression, but the post-hoc explanations indicate that all methods fail to properly capture the importance of amyloid accumulation, one of the pathological hallmarks of Alzheimer's disease. Our work aims to merge predictive accuracy with transparency, assisting clinicians and researchers in enhancing disease progression modeling for informed healthcare decisions. Code is available at https://github.com/rfali/xrlad.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Previous versions accepted to NeurIPS 2023's XAIA and AAAI 2024's XAI4DRL workshops

arXiv:2406.06727

Full transmission of vectorial waves through 3D multiple-scattering media

Authors: Ho-Chun Lin, Chia Wei Hsu

Abstract: A striking prediction from the random matrix theory in mesoscopic physics is the existence of "open channels": waves that can use multipath interference to achieve perfect transmission across an opaque disordered medium even in the multiple-scattering regime. Realization of such open channels requires a coherent control of the complete incident wavefront. To date, the open channels have only been demonstrated in scalar two-dimensional (2D) structures, both experimentally and with numerical studies. Here, we utilize a recently proposed "augmented partial factorization" full-wave simulation method to compute the scattering matrix from 3D vectorial Maxwell's equations and demonstrate the existence of open channels in 3D disordered media. We examine the spatial profile of such open channels, demonstrate the existence of a bimodal transmission eigenvalue distribution with full control, and study the effects of incomplete polarization control and of a finite illumination area. This study confirms the validity of the random matrix theory in vectorial systems. The simulation framework provides full access to the complex multi-channel wave transport in 3D disordered systems, filling the gap left by experimental capabilities.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06251

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by InterSpeech 2024

arXiv:2406.04377

Combining Graph Neural Network and Mamba to Capture Local and Global Tissue Spatial Relationships in Whole Slide Images

Authors: Ruiwen Ding, Kha-Dinh Luong, Erika Rodriguez, Ana Cristina Araujo Lemos da Silva, William Hsu

Abstract: In computational pathology, extracting spatial features from gigapixel whole slide images (WSIs) is a fundamental task, but due to their large size, WSIs are typically segmented into smaller tiles. A critical aspect of this analysis is aggregating information from these tiles to make predictions at the WSI level. We introduce a model that combines a message-passing graph neural network (GNN) with a state space model (Mamba) to capture both local and global spatial relationships among the tiles in WSIs. The model's effectiveness was demonstrated in predicting progression-free survival among patients with early-stage lung adenocarcinomas (LUAD). We compared the model with other state-of-the-art methods for tile-level information aggregation in WSIs, including tile-level information summary statistics-based aggregation, multiple instance learning (MIL)-based aggregation, GNN-based aggregation, and GNN-transformer-based aggregation. Additional experiments showed the impact of different types of node features and different tile sampling strategies on the model performance. This work can be easily extended to any WSI-based analysis. Code: https://github.com/rina-ding/gat-mamba.

Submitted 5 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2406.00761

Shared-unique Features and Task-aware Prioritized Sampling on Multi-task Reinforcement Learning

Authors: Po-Shao Lin, Jia-Fong Yeh, Yi-Ting Chen, Winston H. Hsu

Abstract: We observe that current state-of-the-art (SOTA) methods suffer from the performance imbalance issue when performing multi-task reinforcement learning (MTRL) tasks. While these methods may achieve impressive performance on average, they perform extremely poorly on a few tasks. To address this, we propose a new and effective method called STARS, which consists of two novel strategies: a shared-unique feature extractor and task-aware prioritized sampling. First, the shared-unique feature extractor learns both shared and task-specific features to enable better synergy of knowledge between different tasks. Second, the task-aware sampling strategy is combined with the prioritized experience replay for efficient learning on tasks with poor performance. The effectiveness and stability of our STARS are verified through experiments on the mainstream Meta-World benchmark. From the results, our STARS statistically outperforms current SOTA methods and alleviates the performance imbalance issue. Besides, we visualize the learned features to support our claims and enhance the interpretability of STARS.

Submitted 2 June, 2024; originally announced June 2024.

Comments: The first two authors contribute equally

arXiv:2405.18357

Faithful Logical Reasoning via Symbolic Chain-of-Thought

Authors: Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, Wynne Hsu

Abstract: While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies much on symbolic expressions and rigid deducing rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based framework that integrates symbolic expressions and logic rules with CoT prompting. Technically, building upon an LLM, SymbCoT 1) first translates the natural language context into the symbolic format, and then 2) derives a step-by-step plan to solve the problem with symbolic logical rules, 3) followed by a verifier to check the translation and reasoning chain. Via thorough evaluations on 5 standard datasets with both First-Order Logic and Constraint Optimization symbolic expressions, SymbCoT shows striking improvements over the CoT method consistently, meanwhile refreshing the current state-of-the-art performances. We further demonstrate that our system advances in more faithful, flexible, and explainable logical reasoning. To our knowledge, this is the first to combine symbolic expressions and rules into CoT for logical reasoning with LLMs. Code is open at https://github.com/Aiden0526/SymbCoT.

Submitted 11 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted by ACL 2024 (main proceeding)

arXiv:2405.17507 [pdf, other]

Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

Abstract: Traditional traffic prediction, limited by the scope of sensor data, falls short in comprehensive traffic management. Mobile networks offer a promising alternative using network activity counts, but these lack crucial directionality. Thus, we present the TeltoMob dataset, featuring undirected telecom counts and corresponding directional flows, to predict directional mobility flows on roadways. To address this, we propose a two-stage spatio-temporal graph neural network (STGNN) framework. The first stage uses a pre-trained STGNN to process telecom data, while the second stage integrates directional and geographic insights for accurate prediction. Our experiments demonstrate the framework's compatibility with various STGNN models and confirm its effectiveness. We also show how to incorporate the framework into real-world transportation systems, enhancing sustainable urban mobility.

Submitted 26 May, 2024; originally announced May 2024.

Comments: 8 Figures, 5 Tables. Just accepted by IJCAI (to appear)

arXiv:2405.16545 [pdf, other]

VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

Authors: Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh, Han-Yuan Hsu, Yi-Ting Chen, Winston H. Hsu

Abstract: We study reward models for long-horizon manipulation tasks by learning from action-free videos and language instructions, which we term the visual-instruction correlation (VIC) problem. Recent advancements in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. VICtoR precisely assesses task progress at various levels through a novel stage detector and motion progress evaluator, offering insightful guidance for agents learning the task effectively. To validate the effectiveness of VICtoR, we conducted extensive experiments in both simulated and real-world environments. The results suggest that VICtoR outperformed the best existing VIC methods, achieving a 43% improvement in success rates for long-horizon tasks.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.13237 [pdf]

Spatial Matching of 2D Mammography Images and Specimen Radiographs: Towards Improved Characterization of Suspicious Microcalcifications

Authors: Noor Nakhaei, Chrysostomos Marasinou, Akinyinka Omigbodun, Nina Capiro, Bo Li, Anne Hoyt, William Hsu

Abstract: Accurate characterization of suspicious microcalcifications is critical to determine whether these calcifications are associated with invasive disease. Our overarching objective is to enable the joint characterization of microcalcifications and surrounding breast tissue using mammography images and digital histopathology images. Towards this goal, we investigate a template matching-based approach that utilizes microcalcifications as landmarks to match radiographs taken of biopsy core specimens to groups of calcifications that are visible on mammography. Our approach achieved a high negative predictive value (0.98) but modest precision (0.66) and recall (0.58) in identifying the mammographic region where microcalcifications were taken during a core needle biopsy.

Submitted 21 May, 2024; originally announced May 2024.

Journal ref: Medical Imaging 2021: Computer-Aided Diagnosis (Vol. 11597, pp. 511-516). SPIE

arXiv:2405.11478 [pdf, other]

Unsupervised Image Prior via Prompt Learning and CLIP Semantic Guidance for Low-Light Image Enhancement

Authors: Igor Morawski, Kai He, Shusil Dangi, Winston H. Hsu

Abstract: Currently, low-light conditions present a significant challenge for machine cognition. In this paper, rather than optimizing models by assuming that human and machine cognition are correlated, we use zero-reference low-light enhancement to improve the performance of downstream task models. We propose to improve the zero-reference low-light enhancement method by leveraging the rich visual-linguistic CLIP prior without any need for paired or unpaired normal-light data, which is laborious and difficult to collect. We propose a simple but effective strategy to learn prompts that help guide the enhancement method and experimentally show that the prompts learned without any need for normal-light data improve image contrast, reduce over-enhancement, and reduce noise over-amplification. Next, we propose to reuse the CLIP model for semantic guidance via zero-shot open vocabulary classification to optimize low-light enhancement for task-based performance rather than human visual perception. We present extensive experimental results showing that the proposed method leads to consistent improvements across various datasets regarding task-based performance and compare our method against state-of-the-art methods, showing favorable results across various low-light datasets.

Submitted 19 May, 2024; originally announced May 2024.

Comments: Accepted to CVPR 2024 Workshop NTIRE: New Trends in Image Restoration and Enhancement workshop and Challenges

arXiv:2405.08586 [pdf, other]

Cross-Domain Feature Augmentation for Domain Generalization

Authors: Yingnan Liu, Yingtian Zou, Rui Qiao, Fusheng Liu, Mong Li Lee, Wynne Hsu

Abstract: Domain generalization aims to develop models that are robust to distribution shifts. Existing methods focus on learning invariance across domains to enhance model robustness, and data augmentation has been widely used to learn invariant predictors, with most methods performing augmentation in the input space. However, augmentation in the input space has limited diversity, whereas augmentation in the feature space is more versatile and has shown promising results. Nonetheless, feature semantics is seldom considered and existing feature augmentation methods suffer from a limited variety of augmented features. We decompose features into class-generic, class-specific, domain-generic, and domain-specific components. We propose a cross-domain feature augmentation method named XDomainMix that enables us to increase sample diversity while emphasizing the learning of invariant representations to achieve domain generalization. Experiments on widely used benchmark datasets demonstrate that our proposed method is able to achieve state-of-the-art performance. Quantitative analysis indicates that our feature augmentation approach facilitates the learning of effective models that are invariant across different domains.

Submitted 14 May, 2024; originally announced May 2024.

Comments: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024); Code is available at https://github.com/NancyQuris/XDomainMix

arXiv:2404.11678 [pdf, other]

Corrected Correlation Estimates for Meta-Analysis

Authors: Alexander Johnson-Vázquez, Alexander W. Hsu, Peng Zheng, Aleksandr Aravkin

Abstract: Meta-analysis allows rigorous aggregation of estimates and uncertainty across multiple studies. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across exposure groups, accounting for within-study correlations improves accuracy and efficiency of meta-analytic results. Canonical approaches of Greenland-Longnecker and Hamling estimate pseudo cases and non-cases for exposure groups to obtain within-study correlations. However, currently available implementations for both methods fail on simple examples. We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations guaranteed to work for any feasible input. For RRs, we show the new GL approach is always guaranteed to succeed, but any Hamling approach may fail: we give counter-examples where no solutions exist. We derive a sufficient condition on reported RRs that guarantees success when reported variances are all equal.

Submitted 28 June, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Comments: 31 pages, 9 figures

MSC Class: 62-08; 62P10; 90C25

arXiv:2404.09956 [pdf, other]

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

Abstract: Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.

Submitted 17 July, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted at ACM MM 2024

arXiv:2404.04248 [pdf, other]

doi 10.3847/2041-8213/ad5beb

Observation of Gravitational Waves from the Coalescence of a $2.5\text{-}4.5~M_\odot$ Compact Object and a Neutron Star

Authors: The LIGO Scientific Collaboration, the Virgo Collaboration, the KAGRA Collaboration, A. G. Abac, R. Abbott, I. Abouelfettouh, F. Acernese, K. Ackley, S. Adhicary, N. Adhikari, R. X. Adhikari, V. K. Adkins, D. Agarwal, M. Agathos, M. Aghaei Abchouyeh, O. D. Aguiar, I. Aguilar, L. Aiello, A. Ain, P. Ajith, S. Akçay, T. Akutsu, S. Albanesi, R. A. Alfaidi, A. Al-Jodah, et al. (1771 additional authors not shown)

Abstract: We report the observation of a coalescing compact binary with component masses $2.5\text{-}4.5~M_\odot$ and $1.2\text{-}2.0~M_\odot$ (all measurements quoted at the 90% credible level). The gravitational-wave signal GW230529_181500 was observed during the fourth observing run of the LIGO-Virgo-KAGRA detector network on 2023 May 29 by the LIGO Livingston Observatory. The primary component of the source has a mass less than $5~M_\odot$ at 99% credibility. We cannot definitively determine from gravitational-wave data alone whether either component of the source is a neutron star or a black hole. However, given existing estimates of the maximum neutron star mass, we find the most probable interpretation of the source to be the coalescence of a neutron star with a black hole that has a mass between the most massive neutron stars and the least massive black holes observed in the Galaxy. We provisionally estimate a merger rate density of $55^{+127}_{-47}~\text{Gpc}^{-3}\,\text{yr}^{-1}$ for compact binary coalescences with properties similar to the source of GW230529_181500; assuming that the source is a neutron star-black hole merger, GW230529_181500-like sources constitute about 60% of the total merger rate inferred for neutron star-black hole coalescences. The discovery of this system implies an increase in the expected rate of neutron star-black hole mergers with electromagnetic counterparts and provides further evidence for compact objects existing within the purported lower mass gap.

Submitted 26 July, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

Comments: 45 pages (10 pages author list, 13 pages main text, 1 page acknowledgements, 13 pages appendices, 8 pages bibliography), 17 figures, 16 tables. Update to match version published in The Astrophysical Journal Letters. Data products available from https://zenodo.org/records/10845779

Report number: LIGO-P2300352

Journal ref: ApJL 970, L34 (2024)

arXiv:2403.18330 [pdf, other]

Tracking-Assisted Object Detection with Event Cameras

Authors: Ting-Kang Yen, Igor Morawski, Shusil Dangi, Kai He, Chung-Yi Lin, Jia-Fong Yeh, Hung-Ting Su, Winston Hsu

Abstract: Event-based object detection has recently garnered attention in the computer vision community due to the exceptional properties of event cameras, such as high dynamic range and no motion blur. However, feature asynchronism and sparsity cause invisible objects due to no relative motion to the camera, posing a significant challenge in the task. Prior works have studied various implicit-learned memories to retain as many temporal cues as possible. However, implicit memories still struggle to preserve long-term features effectively. In this paper, we consider those invisible objects as pseudo-occluded objects and aim to detect them by tracking through occlusions. Firstly, we introduce the visibility attribute of objects and contribute an auto-labeling algorithm to not only clean the existing event camera dataset but also append additional visibility labels to it. Secondly, we exploit tracking strategies for pseudo-occluded objects to maintain their permanence and retain their bounding boxes, even when features have not been available for a very long time. These strategies can be treated as an explicit-learned memory guided by the tracking objective to record the displacements of objects across frames. Lastly, we propose a spatio-temporal feature aggregation module to enrich the latent features and a consistency loss to increase the robustness of the overall pipeline. We conduct comprehensive experiments to verify our method's effectiveness where still objects are retained, but real occluded objects are discarded. The results demonstrate that (1) the additional visibility labels can assist in supervised training, and (2) our method outperforms state-of-the-art approaches with a significant improvement of 7.9% absolute mAP.

Submitted 11 August, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.14402 [pdf, other]

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Authors: HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.

Submitted 12 August, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: ACL2024

arXiv:2403.13493 [pdf, other]

SPDEs on narrow channels and graphs: convergence and large deviations in case of non smooth noise

Authors: Sandra Cerrai, Wen-Tai Hsu

Abstract: We investigate a class of stochastic partial differential equations of reaction-diffusion type defined on graphs, which can be derived as the limit of SPDEs on narrow planar channels. In the first part, we demonstrate that this limit can be achieved under less restrictive assumptions on the regularity of the noise, compared to [4]. In the second part, we establish the validity of a large deviation principle for the SPDEs on the narrow channels and on the graphs, as the width of the narrow channels and the intensity of the noise are jointly vanishing.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.12991 [pdf, other]

Tel2Veh: Fusion of Telecom Data and Vehicle Flow to Predict Camera-Free Traffic via a Spatio-Temporal Framework

Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

Abstract: Vehicle flow, a crucial indicator for transportation, is often limited by detector coverage. With the advent of extensive mobile network coverage, we can leverage mobile user activities, or cellular traffic, on roadways as a proxy for vehicle flow. However, as counts of cellular traffic may not directly align with vehicle flow due to data from various user types, we present a new task: predicting vehicle flow in camera-free areas using cellular traffic. To uncover correlations within multi-source data, we deployed cameras on selected roadways to establish the Tel2Veh dataset, consisting of extensive cellular traffic and sparse vehicle flows. Addressing this challenge, we propose a framework that independently extracts features and integrates them with a graph neural network (GNN)-based fusion to discern disparities, thereby enabling the prediction of unseen vehicle flows using cellular traffic. This work advances the use of telecom data in transportation and pioneers the fusion of telecom and vision-based data, offering solutions for traffic management.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 4 pages, 5 figures, 4 tables. Accepted by WWW'24, to appear

arXiv:2403.06392 [pdf, other]

Towards Robust Out-of-Distribution Generalization Bounds via Sharpness

Authors: Yingtian Zou, Kenji Kawaguchi, Yingnan Liu, Jiashuo Liu, Mong-Li Lee, Wynne Hsu

Abstract: Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measurements between source and target domains but fail to consider the optimization property of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data change in domain shift which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which gives better OOD guarantees for robust algorithms. It also provides a theoretical backing for "flat minima leads to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound by taking robustness into consideration, resulting in a tighter bound than non-robust guarantees. Our findings are supported by the experiments on a ridge regression model, as well as the experiments on deep learning classification tasks.

Submitted 10 March, 2024; originally announced March 2024.

Comments: 40 pages, 9 figures, ICLR 2024 Spotlight Presentation

arXiv:2403.03170 [pdf, other]

SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

Authors: Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee

Abstract: Misinformation is a prevalent societal issue due to its potential high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which is essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering the subtle crossmodal differences. In this paper, we introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities and the second stage leverages language-only GPT-4 generated OOC-specific instruction data to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations as validated by quantitative and human evaluations.

Submitted 5 March, 2024; originally announced March 2024.

Comments: To appear in CVPR 2024

arXiv:2403.03162 [pdf, other]

Statistical modeling of equilibrium phase transition in confined fluids

Authors: Gunjan Auti, Soumyadeep Paul, Wei-Lun Hsu, Shohei Chiashi, Shigeo Maruyama, Hirofumi Daiguji

Abstract: The phase transition of confined fluids in mesoporous materials deviates from that of bulk fluids due to the interactions with the surrounding heterogeneous structure. For example, adsorbed fluids in metal-organic-frameworks (MOFs) have atypical phase characteristics such as capillary condensation and higher-order phase transitions due to a strong heterogeneous field. Considering a many-body problem in the presence of a nonuniform external field, we model the host-guest and guest-guest interactions in MOFs. To solve the three-dimensional Ising model, we use the mean-field theory to approximate the guest-guest interactions and Mayer's f-functions to describe the host-guest interactions in a unit cell. Later, using Hill's theory of nanothermodynamics, we define differential thermodynamic functions to understand the distribution of intensive properties and integral thermodynamic functions to explain the phase transition in confined fluids. The investigation reveals a distinct behavior where fluids confined in larger pores undergo a discontinuous (first-order) phase transition, whereas those confined in smaller pores experience a continuous (higher-order) phase transition. Furthermore, the results indicate that the free-energy barrier for phase transitions is lower in confined fluids than in bulk fluids giving rise to a lower condensation pressure relative to the bulk saturation pressure. Finally, the integral thermodynamic functions are succinctly presented in the form of a phase diagram, marking an initial step toward a more practical approach for understanding the phase behavior of confined fluids.

Submitted 20 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 25 pages, 14 figures

arXiv:2403.03004 [pdf, other]

Ultralight vector dark matter search using data from the KAGRA O3GK run

Authors: The LIGO Scientific Collaboration, the Virgo Collaboration, the KAGRA Collaboration, A. G. Abac, R. Abbott, H. Abe, I. Abouelfettouh, F. Acernese, K. Ackley, C. Adamcewicz, S. Adhicary, N. Adhikari, R. X. Adhikari, V. K. Adkins, V. B. Adya, C. Affeldt, D. Agarwal, M. Agathos, O. D. Aguiar, I. Aguilar, L. Aiello, A. Ain, P. Ajith, T. Akutsu, S. Albanesi, et al. (1778 additional authors not shown)

Abstract: Among the various candidates for dark matter (DM), ultralight vector DM can be probed by laser interferometric gravitational wave detectors through the measurement of oscillating length changes in the arm cavities. In this context, KAGRA has a unique feature due to differing compositions of its mirrors, enhancing the signal of vector DM in the length change in the auxiliary channels. Here we present the result of a search for $U(1)_{B-L}$ gauge boson DM using the KAGRA data from auxiliary length channels during the first joint observation run together with GEO600. By applying our search pipeline, which takes into account the stochastic nature of ultralight DM, upper bounds on the coupling strength between the $U(1)_{B-L}$ gauge boson and ordinary matter are obtained for a range of DM masses. While our constraints are less stringent than those derived from previous experiments, this study demonstrates the applicability of our method to the lower-mass vector DM search, which is made difficult in this measurement by the short observation time compared to the auto-correlation time scale of DM.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 20 pages, 5 figures

Report number: LIGO-P2300250

arXiv:2402.03860 [pdf, other]

AED: Adaptable Error Detection for Few-shot Imitation Policy

Authors: Jia-Fong Yeh, Kuo-Han Hung, Pang-Chi Lo, Chi-Ming Chung, Tsung-Han Wu, Hung-Ting Su, Yi-Ting Chen, Winston H. Hsu

Abstract: We introduce a new task called Adaptable Error Detection (AED), which aims to identify behavior errors in few-shot imitation (FSI) policies based on visual observations in novel environments. The potential to cause serious damage to surrounding areas limits the application of FSI policies in real-world scenarios. Thus, a robust system is necessary to notify operators when FSI policies are inconsistent with the intent of demonstrations. This task introduces three challenges: (1) detecting behavior errors in novel environments, (2) identifying behavior errors that occur without revealing notable changes, and (3) lacking complete temporal information of the rollout due to the necessity of online detection. However, the existing benchmarks cannot support the development of AED because their tasks do not present all these challenges. To this end, we develop a cross-domain AED benchmark, consisting of 322 base and 153 novel environments. Additionally, we propose Pattern Observer (PrObe) to address these challenges. PrObe is equipped with a powerful pattern extractor and guided by novel learning objectives to parse discernible patterns in the policy feature representations of normal or error states. Through our comprehensive evaluation, PrObe demonstrates superior capability to detect errors arising from a wide range of FSI policies, consistently surpassing strong baselines. Moreover, we conduct detailed ablations and a pilot study on error correction to validate the effectiveness of the proposed architecture design and the practicality of the AED task, respectively.

Submitted 25 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.00747 [pdf, other]

Mott resistive switching initiated by topological defects

Authors: Alessandra Milloch, Ignacio Figueruelo-Campanero, Wei-Fan Hsu, Selene Mor, Simon Mellaerts, Francesco Maccherozzi, Larissa Ishibe Veiga, Sarnjeet S. Dhesi, Mauro Spera, Jin Won Seo, Jean-Pierre Locquet, Michele Fabrizio, Mariela Menghini, Claudio Giannetti

Abstract: Resistive switching is the fundamental process that triggers the sudden change of the electrical properties in solid-state devices under the action of intense electric fields. Despite its relevance for information processing, ultrafast electronics, neuromorphic devices, resistive memories and brain-inspired computation, the nature of the local stochastic fluctuations that drive the formation of metallic nuclei out of the insulating state has remained hidden. Here, using operando X-ray nano-imaging, we have captured the early-stages of resistive switching in a V2O3-based device under working conditions. V2O3 is a paradigmatic Mott material, which undergoes a first-order metal-to-insulator transition coupled to a lattice transformation that breaks the threefold rotational symmetry of the rhombohedral metal phase. We reveal a new class of volatile electronic switching triggered by nanoscale topological defects of the lattice order parameter of the insulating phase. Our results pave the way to the use of strain engineering approaches to manipulate topological defects and achieve the full control of the electronic Mott switching. The concept of topology-driven reversible electronic transition is of interest for a broad class of quantum materials, comprising transition metal oxides, chalcogenides and kagome metals, that exhibit first-order electronic transitions coupled to a symmetry-breaking order.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.16737 [pdf]

Formation of highly stable interfacial nitrogen gas hydrate overlayers under ambient conditions

Authors: Chung-Kai Fang, Cheng-Hao Chuang, Chih-Wen Yang, Zheng-Rong Guo, Wei-Hao Hsu, Chia-Hsin Wang, Ing-Shouh Hwang

Abstract: Surfaces (interfaces) dictate many physical and chemical properties of solid materials and adsorbates considerably affect these properties. Nitrogen molecules, which are the most abundant constituent in ambient air, are considered to be inert. Our study combining atomic force microscopy (AFM), X-ray photoemission spectroscopy (XPS), and thermal desorption spectroscopy (TDS) revealed that nitrogen and water molecules can self-assemble into two-dimensional domains, forming ordered stripe structures on graphitic surfaces in both water and ambient air. The stripe structures of this study were composed of approximately 90% and 10% water and nitrogen molecules, respectively, and survived in ultra-high vacuum (UHV) conditions at temperatures up to approximately 350 K. Because pure water molecules completely desorb from graphitic surfaces in a UHV at temperatures lower than 200 K, our results indicate that the incorporation of nitrogen molecules substantially enhanced the stability of the crystalline water hydrogen bonding network. Additional studies on interfacial gas hydrates can provide deeper insight into the mechanisms underlying formation of gas hydrates.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.07781 [pdf, other]

Towards A Better Metric for Text-to-Video Generation

Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.

Submitted 15 January, 2024; originally announced January 2024.

Comments: Project page: https://showlab.github.io/T2VScore/

arXiv:2401.03138 [pdf, other]

TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

Abstract: To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that integrates multivariate, temporal, and spatial facets for improved accuracy. Experiments reveal our model's superiority over baselines, especially in long-term predictions. We also highlight the potential for GCT flow integration into transportation systems.

Submitted 6 January, 2024; originally announced January 2024.

Comments: 7 pages, 7 figures, 4 tables. Accepted by AAAI-24-IAAI, to appear

arXiv:2312.15821 [pdf, other]

Audiobox: Unified Audio Generation with Natural Language Prompts

Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/

Submitted 25 December, 2023; originally announced December 2023.

arXiv:2312.04425 [pdf, other]

doi 10.1021/acsami.3c18807

Confinement-Induced Isosymmetric Metal-Insulator Transition in Ultrathin Epitaxial V2O3 Films

Authors: Simon Mellaerts, Claudio Bellani, Wei-Fan Hsu, Alberto Binetti, Koen Schouteden, Maria Recaman-Payo, Mariela Menghini, Juan Rubio Zuazo, Jesús López Sánchez, Jin Won Seo, Michel Houssa, Jean-Pierre Locquet

Abstract: Dimensional confinement has shown to be an effective strategy to tune competing degrees of freedom in complex oxides. Here, we achieved atomic layered growth of trigonal vanadium sesquioxide (V2O3) by means of oxygen-assisted molecular beam epitaxy. This led to a series of high-quality epitaxial ultrathin V2O3 films down to unit cell thickness, enabling the study of the intrinsic electron correlations upon confinement. By electrical and optical measurements, we demonstrate a dimensional confinement-induced metal-insulator transition in these ultrathin films. We shed light on the Mott-Hubbard nature of this transition, revealing an abrupt vanishing of the quasiparticle weight as demonstrated by photoemission spectroscopy. Furthermore, we prove that dimensional confinement acts as an effective out-of-plane stress. This highlights the structural component of correlated oxides in a confined architecture, while opening an avenue to control both in-plane and out-of-plane lattice components by epitaxial strain and confinement, respectively.

Submitted 7 December, 2023; originally announced December 2023.

Journal ref: ACS Appl. Mater. Interfaces 2021, 13, 30941-30949

arXiv:2311.13592 [pdf, other]

doi 10.1364/OPTICA.514907

High-efficiency high-NA metalens designed by maximizing the efficiency limit

Authors: Shiyu Li, Ho-Chun Lin, Chia Wei Hsu

Abstract: Theoretical bounds are commonly used to assess the limitations of photonic design. Here we introduce a more active way to use theoretical bounds, integrating them into part of the design process and identifying optimal system parameters that maximize the efficiency limit itself. As an example, we consider wide-field-of-view high-numerical-aperture metalenses, which can be used for high-resolution imaging in microscopy and endoscopy, but no existing design has achieved a high efficiency. By choosing aperture sizes to maximize an efficiency bound, setting the thickness according to a thickness bound, and then performing inverse design, we come up with high-numerical-aperture (NA = 0.9) metalens designs with record-high 98% transmission efficiency and 92% Strehl ratio across all incident angles within a 60-deg field of view, reaching the maximized bound. This maximizing-efficiency-limit approach applies to any multi-channel system and can help a wide range of optical devices reach their highest possible performance.

Submitted 1 December, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

Journal ref: Optica 11, 454-459 (2024)

arXiv:2311.05672 [pdf, other]

Conditional Optimal Transport on Function Spaces

Authors: Bamdad Hosseini, Alexander W. Hsu, Amirhossein Taghvaei

Abstract: We present a systematic study of conditional triangular transport maps in function spaces from the perspective of optimal transportation and with a view towards amortized Bayesian inference. More specifically, we develop a theory of constrained optimal transport problems that describe block-triangular Monge maps that characterize conditional measures along with their Kantorovich relaxations. This generalizes the theory of optimal triangular transport to separable infinite-dimensional function spaces with general cost functions. We further tailor our results to the case of Bayesian inference problems and obtain regularity estimates on the conditioning maps from the prior to the posterior. Finally, we present numerical experiments that demonstrate the computational applicability of our theoretical results for amortized and likelihood-free inference of functional parameters.

Submitted 6 February, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

MSC Class: 49Q22; 62G86; 62F15; 60B05

arXiv:2311.02772 [pdf, ps, other]

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Authors: Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

Abstract: In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules. They achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that we can achieve comparable efficiency with advanced self-attention solely. We demonstrate that this simpler approach is particularly beneficial with a low-bit weight quantization technique of a neural network to improve efficiency. We hypothesize that it prevents propagating the errors between different quantized modules compared to recent speech transformers mixing quantized convolution and the quantized self-attention modules.

Submitted 8 February, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

Comments: 5 pages; accepted to Self-supervision in Audio, Speech and Beyond (SASB) workshop in ICASSP24

arXiv:2311.02332 [pdf, other]

Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects

Authors: Elisa Warner, Joonsang Lee, William Hsu, Tanveer Syeda-Mahmood, Charles Kahn, Olivier Gevaert, Arvind Rao

Abstract: Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, the paper explores the transformative potential of multimodal models for clinical predictions. It also highlights the need for principled assessments and practical implementation of such models, bringing attention to the dynamics between decision support systems and healthcare providers and personnel. Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist. We conclude with a discussion on principled innovation and collaborative efforts to further the mission of seamless integration of multimodal ML models into biomedical practice.

Submitted 19 January, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.16338 [pdf, other]

Generative Pre-training for Speech with Flow Matching

Authors: Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.

Submitted 25 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2310.14643 [pdf, other]

Dynamic gain and frequency comb formation in exceptional-point lasers

Authors: Xingwei Gao, Hao He, Scott Sobolewski, Alexander Cerjan, Chia Wei Hsu

Abstract: Exceptional points (EPs)--singularities in the parameter space of non-Hermitian systems where two nearby eigenmodes coalesce--feature unique properties with applications for microcavity lasers such as sensitivity enhancement and chiral emission. Present EP lasers operate with static populations in the gain medium. Here, we show theoretically that a laser operating sufficiently close to an EP will spontaneously induce a multi-spectral multi-modal instability that creates an oscillating population inversion and generates a frequency comb. The comb formation is enhanced by the non-orthogonality of modes via the Petermann factor. Such an "EP comb" features an ultra-compact size and a widely tunable repetition rate, without requiring external modulators or a continuous-wave pump. We develop an exact ab initio dynamic solution of the space-dependent Maxwell-Bloch equations, describing all steady-state properties of the EP comb. We illustrate this phenomenon in a realistic parity-time-symmetric 5-μm-long AlGaAs cavity and validate our prediction with finite-difference time-domain simulations. This work reveals the rich physics that connect non-Hermitian degeneracies and the nonlinear dynamics of gain media to fundamentally alter the laser behavior.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.13615 [pdf, other]

Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning

Authors: An-Zi Yen, Wei-Ling Hsu

Abstract: Due to the remarkable language understanding and generation abilities of large language models (LLMs), their use in educational applications has been explored. However, little work has been done on investigating the pedagogical ability of LLMs in helping students to learn mathematics. In this position paper, we discuss the challenges associated with employing LLMs to enhance students' mathematical problem-solving skills by providing adaptive feedback. Apart from generating the wrong reasoning processes, LLMs can misinterpret the meaning of the question, and also exhibit difficulty in understanding the given questions' rationales when attempting to correct students' answers. Three research questions are formulated.

Submitted 20 October, 2023; originally announced October 2023.

Comments: Accepted by EMNLP 2023 Findings

arXiv:2310.08715 [pdf, other]

Toward Joint Language Modeling for Speech Units and Text

Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform continuous speech signals into discrete units and use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess the model's learning of shared representations. Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.

Submitted 12 October, 2023; originally announced October 2023.

Comments: EMNLP findings 2023

arXiv:2310.03821 [pdf, other]

WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection

Authors: Tsung-Lin Tsou, Tsung-Han Wu, Winston H. Hsu

Abstract: In the field of domain adaptation (DA) on 3D object detection, most of the work is dedicated to unsupervised domain adaptation (UDA). Yet, without any target annotations, the performance gap between the UDA approaches and the fully-supervised approach is still noticeable, which is impractical for real-world applications. On the other hand, weakly-supervised domain adaptation (WDA) is an underexplored yet practical task that only requires few labeling effort on the target domain. To improve the DA performance in a cost-effective way, we propose a general weak labels guided self-training framework, WLST, designed for WDA on 3D object detection. By incorporating autolabeler, which can generate 3D pseudo labels from 2D bounding boxes, into the existing self-training pipeline, our method is able to generate more robust and consistent pseudo labels that would benefit the training process on the target domain. Extensive experiments demonstrate the effectiveness, robustness, and detector-agnosticism of our WLST framework. Notably, it outperforms previous state-of-the-art methods on all evaluation tasks.

Submitted 7 February, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: Accepted to ICRA 2024. Code is available at https://github.com/jacky121298/WLST

arXiv:2309.17020 [pdf, other]

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.

Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: ASRU 2023 SPARKS Workshop

arXiv:2309.09376 [pdf, other]

doi 10.1038/s41566-024-01446-7

Delivering Broadband Light Deep Inside Diffusive Media

Authors: Rohin McIntosh, Arthur Goetschy, Nicholas Bender, Alexey Yamilov, Chia Wei Hsu, Hasan Yilmaz, Hui Cao

Abstract: Wavefront shaping enables targeted delivery of coherent light into random-scattering media, such as biological tissue, by constructive interference of scattered waves. However, broadband waves have short coherence times, weakening the interference effect. Here, we introduce a broadband deposition matrix that identifies a single input wavefront that maximizes the broadband energy delivered to an extended target deep inside a diffusive system. We experimentally demonstrate that long-range spatial and spectral correlations result in a six-fold energy enhancement for targets containing more than 1500 speckle grains and located at a depth of up to ten transport mean free paths, even when the coherence time is an order of magnitude shorter than the diffusion dwell time of light in the scattering sample. In the broadband (fast decoherence) limit, enhancement of energy delivery to extended targets becomes nearly independent of the target depth and dissipation. Our experiments, numerical simulations, and analytic theory establish the fundamental limit for broadband energy delivery deep into a diffusive system, which has important consequences for practical applications.

Submitted 17 September, 2023; originally announced September 2023.

Comments: 17 pages, 10 figures

Journal ref: Nature Photonics (2024)

Showing 1–50 of 343 results for author: Hsu, W