Showing 1–50 of 258 results for author: Huang, P

Searching in archive cs.

  1. arXiv:2407.13322  [pdf, other]

    cs.CV

    Fully Test-Time rPPG Estimation via Synthetic Signal-Guided Feature Learning

    Authors: Pei-Kai Huang, Tzu-Hsien Chen, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu

    Abstract: Many remote photoplethysmography (rPPG) estimation models have achieved promising performance on the training domain but often fail to measure the physiological signals or heart rates (HR) on test domains. Domain generalization (DG) or domain adaptation (DA) techniques are therefore adopted in the offline training stage to adapt the model to the unobserved or observed test domain by referring to a…

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.13164  [pdf, other]

    cs.CL cs.AI

    Translate-and-Revise: Boosting Large Language Models for Constrained Translation

    Authors: Pengcheng Huang, Yongyu Mu, Yuzhang Wu, Bei Li, Chunyang Xiao, Tong Xiao, Jingbo Zhu

    Abstract: Imposing constraints on machine translation systems presents a challenging issue because these systems are not trained to make use of constraints in generating adequate, fluent translations. In this paper, we leverage the capabilities of large language models (LLMs) for constrained translation, given that LLMs can easily adapt to this task by taking translation instructions and constraints as prom…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 16 pages

  3. arXiv:2407.07331  [pdf, ps, other]

    cs.CV cs.AI

    Learning with Instance-Dependent Noisy Labels by Anchor Hallucination and Hard Sample Label Correction

    Authors: Po-Hsuan Huang, Chia-Ching Lin, Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

    Abstract: Learning from noisy-labeled data is crucial for real-world applications. Traditional Noisy-Label Learning (NLL) methods categorize training data into clean and noisy sets based on the loss distribution of training samples. However, they often neglect that clean samples, especially those with intricate visual patterns, may also yield substantial losses. This oversight is particularly significant in…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: ICIP 2024

  4. arXiv:2407.00752  [pdf, other]

    cs.CV cs.AI

    Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

    Authors: Peng Huang, Xue Gao, Lihong Huang, Jing Jiao, Xiaokang Li, Yuanyuan Wang, Yi Guo

    Abstract: Text-to-image generation has important implications for generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as high computational complexity in common stable diffusion limit the authenticity and feasibility of the generated m…

    Submitted 30 June, 2024; originally announced July 2024.

  5. arXiv:2406.17338  [pdf, other]

    eess.IV cs.CV cs.LG

    Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

    Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, Xin Wang

    Abstract: Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propos…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024

  6. arXiv:2406.09870  [pdf, other]

    cs.LG cs.AI

    IGL-Bench: Establishing the Comprehensive Benchmark for Imbalanced Graph Learning

    Authors: Jiawen Qin, Haonan Yuan, Qingyun Sun, Lyujin Xu, Jiaqi Yuan, Pengfeng Huang, Zhaonan Wang, Xingcheng Fu, Hao Peng, Jianxin Li, Philip S. Yu

    Abstract: Deep graph learning has gained grand popularity over the past years due to its versatility and success in representing graph data across a wide range of domains. However, the pervasive issue of imbalanced graph data distributions, where certain parts exhibit disproportionally abundant data while others remain sparse, undermines the efficacy of conventional graph learning algorithms, leading to bia…

    Submitted 19 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Preprint, under review)

  7. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  8. arXiv:2406.01436  [pdf, other]

    cs.CL

    Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models

    Authors: Cheng-Hsun Hsueh, Paul Kuo-Ming Huang, Tzu-Han Lin, Che-Wei Liao, Hung-Chieh Fang, Chao-Wei Huang, Yun-Nung Chen

    Abstract: Knowledge editing is a rising technique for efficiently updating factual knowledge in Large Language Models (LLMs) with minimal alteration of parameters. However, recent studies have identified concerning side effects, such as knowledge distortion and the deterioration of general abilities, that have emerged after editing. This survey presents a comprehensive study of these side effects, providing…

    Submitted 3 June, 2024; originally announced June 2024.

  9. arXiv:2405.16640  [pdf, other]

    cs.AI cs.CL cs.CV cs.MM

    A Survey of Multimodal Large Language Model from A Data-centric Perspective

    Authors: Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

    Abstract: Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal role in the development and refinement of these models. In this survey, we comprehensively review the literature on MLLMs from a data-centric perspective. Spec…

    Submitted 18 July, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

  10. arXiv:2405.15655  [pdf, other]

    cs.SD cs.LG eess.AS

    HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System

    Authors: Zhisheng Zhang, Pengyang Huang

    Abstract: In recent years, the remarkable advancements in deep neural networks have brought tremendous convenience. However, the training process of a highly effective model necessitates a substantial quantity of samples, which brings huge potential threats, like unauthorized exploitation with privacy leakage. In response, we propose a framework named HiddenSpeaker, embedding imperceptible perturbations wit…

    Submitted 26 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Accepted by IJCNN 2024

  11. arXiv:2405.15199  [pdf, other]

    cs.CV

    ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

    Authors: Jingyuan Zhu, Shiyu Li, Yuxuan Liu, Ping Huang, Jiulong Shan, Huimin Ma, Jian Yuan

    Abstract: Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on b…

    Submitted 24 May, 2024; originally announced May 2024.

  12. arXiv:2405.15140  [pdf, other]

    cs.LG

    Better Membership Inference Privacy Measurement through Discrepancy

    Authors: Ruihan Wu, Pengrun Huang, Kamalika Chaudhuri

    Abstract: Membership Inference Attacks have emerged as a dominant method for empirically measuring privacy leakage from machine learning models. Here, privacy is measured by the advantage or gap between a score or a function computed on the training and the test data. A major barrier to the practical deployment of these attacks is that they do not scale to large well-generalized models -- either the…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 9 pages

  13. arXiv:2405.14855  [pdf, other]

    cs.CV cs.AI

    Synergistic Global-space Camera and Human Reconstruction from Videos

    Authors: Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang

    Abstract: Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras an…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  14. arXiv:2405.13788  [pdf, other]

    quant-ph cs.GT

    Quantum algorithm for large-scale market equilibrium computation

    Authors: Po-Wei Huang, Patrick Rebentrost

    Abstract: Classical algorithms for market equilibrium computation such as proportional response dynamics face scalability issues with Internet-based applications such as auctions, recommender systems, and fair division, despite having an almost linear runtime in terms of the product of buyers and goods. In this work, we provide the first quantum algorithm for market equilibrium computation with sub-linear p…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 22 pages, 1 figure

  15. arXiv:2405.12664  [pdf, other]

    cs.NI

    IREE Oriented Green 6G Networks: A Radial Basis Function Based Approach

    Authors: Tao Yu, Pengbo Huang, Shunqing Zhang, Xiaojing Chen, Yanzan Sun, Xin Wang

    Abstract: In order to provide design guidelines for energy efficient 6G networks, we propose a novel radial basis function (RBF) based optimization framework to maximize the integrated relative energy efficiency (IREE) metric. Different from the conventional energy efficient optimization schemes, we maximize the transformed utility for any given IREE using spectrum efficiency oriented RBF network and gradua…

    Submitted 21 May, 2024; originally announced May 2024.

  16. arXiv:2405.10554  [pdf, other]

    cs.CV

    NeRO: Neural Road Surface Reconstruction

    Authors: Ruibo Wang, Song Zhang, Ping Huang, Donghai Zhang, Haoyu Chen

    Abstract: Accurately reconstructing road surfaces is pivotal for various applications especially in autonomous driving. This paper introduces a position encoding Multi-Layer Perceptrons (MLPs) framework to reconstruct road surfaces, with input as world coordinates x and y, and output as height, color, and semantic information. The effectiveness of this method is demonstrated through its compatibility with a…

    Submitted 28 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  17. arXiv:2405.04434  [pdf, other]

    cs.CL cs.AI

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, et al. (132 additional authors not shown)

    Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference…

    Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  18. arXiv:2405.02077  [pdf, other]

    cs.CV

    MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

    Authors: Hongyu Qu, Rui Yan, Xiangbo Shu, Hailiang Gao, Peng Huang, Guo-Sen Xie

    Abstract: Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level, segment-level, etc) feature alignment, which ignores that human actions with the same semantic may appear at different velocities. To this end, we develop a novel Multi-Velocit…

    Submitted 23 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  19. arXiv:2405.01582  [pdf, other]

    cs.CL cs.AI cs.LG

    Text Quality-Based Pruning for Efficient Training of Language Models

    Authors: Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer

    Abstract: In recent times training Language Models (LMs) have relied on computationally heavy training over massive datasets which makes this training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model agnostic manner to assign the text instances a "quality score". By proposing the text quality metric, th…

    Submitted 10 May, 2024; v1 submitted 26 April, 2024; originally announced May 2024.

  20. arXiv:2404.16030  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    MoDE: CLIP Data Experts via Clustering

    Authors: Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu

    Abstract: The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inferen…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: IEEE CVPR 2024 Camera Ready. Code Link: https://github.com/facebookresearch/MetaCLIP/tree/main/mode

  21. arXiv:2404.14852  [pdf, other]

    cs.CV

    Ultrasound Nodule Segmentation Using Asymmetric Learning with Simple Clinical Annotation

    Authors: Xingyue Zhao, Zhongyu Li, Xiangde Luo, Peiqi Li, Peng Huang, Jianwei Zhu, Yang Liu, Jihua Zhu, Meng Yang, Shi Chang, Jun Dong

    Abstract: Recent advances in deep learning have greatly facilitated the automated segmentation of ultrasound images, which is essential for nodule morphological analysis. Nevertheless, most existing methods depend on extensive and precise annotations by domain experts, which are labor-intensive and time-consuming. In this study, we suggest using simple aspect ratio annotations directly from ultrasound clini…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted by TCSVT

  22. arXiv:2404.13134  [pdf, other]

    cs.MM cs.CV cs.LG

    Deep Learning-based Text-in-Image Watermarking

    Authors: Bishwa Karki, Chun-Hua Tsai, Pei-Chi Huang, Xin Zhong

    Abstract: In this work, we introduce a novel deep learning-based approach to text-in-image watermarking, a method that embeds and extracts textual information within images to enhance data security and integrity. Leveraging the capabilities of deep learning, specifically through the use of Transformer-based architectures for text processing and Vision Transformers for image feature extraction, our method se…

    Submitted 19 April, 2024; originally announced April 2024.

  23. arXiv:2404.05583  [pdf, other]

    cs.CV

    Towards More General Video-based Deepfake Detection through Facial Feature Guided Adaptation for Foundation Model

    Authors: Yue-Hua Han, Tai-Ming Huang, Shu-Tzu Lo, Po-Han Huang, Kai-Lung Hua, Jun-Cheng Chen

    Abstract: With the rise of deep learning, generative models have enabled the creation of highly realistic synthetic images, presenting challenges due to their potential misuse. While research in Deepfake detection has grown rapidly in response, many detection methods struggle with unseen Deepfakes generated by new synthesis techniques. To address this generalisation challenge, we propose a novel Deepfake de…

    Submitted 5 June, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  24. arXiv:2404.00893  [pdf, other]

    cs.RO

    An Integrating Comprehensive Trajectory Prediction with Risk Potential Field Method for Autonomous Driving

    Authors: Kailu Wu, Xing Liu, Feiyu Bian, Yizhai Zhang, Panfeng Huang

    Abstract: Due to the uncertainty of traffic participants' intentions, generating safe but not overly cautious behavior in interactive driving scenarios remains a formidable challenge for autonomous driving. In this paper, we address this issue by combining a deep learning-based trajectory prediction model with risk potential field-based motion planning. In order to comprehensively predict the possible futur…

    Submitted 31 March, 2024; originally announced April 2024.

  25. arXiv:2404.00576  [pdf]

    cs.LG cs.AI cs.CV

    Automated Bi-Fold Weighted Ensemble Algorithms and its Application to Brain Tumor Detection and Classification

    Authors: PoTsang B. Huang, Muhammad Rizwan, Mehboob Ali

    Abstract: The uncontrolled and unstructured growth of brain cells is known as brain tumor, which has one of the highest mortality rates among diseases from all types of cancers. Due to limited diagnostic and treatment capabilities, they pose significant challenges, especially in third-world countries. Early diagnosis plays a vital role in effectively managing brain tumors and reducing mortality rates. Howev…

    Submitted 31 March, 2024; originally announced April 2024.

  26. arXiv:2403.19374  [pdf, other]

    cs.ET eess.SY

    A noise-tolerant, resource-saving probabilistic binary neural network implemented by the SOT-MRAM compute-in-memory system

    Authors: Yu Gu, Puyang Huang, Tianhao Chen, Chenyi Fu, Aitian Chen, Shouzhong Peng, Xixiang Zhang, Xufeng Kou

    Abstract: We report a spin-orbit torque(SOT) magnetoresistive random-access memory(MRAM)-based probabilistic binary neural network(PBNN) for resource-saving and hardware noise-tolerant computing applications. With the presence of thermal fluctuation, the non-destructive SOT-driven magnetization switching characteristics lead to a random weight matrix with controllable probability distribution. In the meanwh…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 5 pages, 10 figures

    MSC Class: 94C60; ACM Class: B.2.4; B.3.0

  27. arXiv:2403.16973  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

    Authors: Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath

    Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an…

    Submitted 13 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: ACL 2024. Data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft

  28. arXiv:2403.16242  [pdf, other]

    cs.CV

    Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

    Authors: Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann

    Abstract: We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator…

    Submitted 24 March, 2024; originally announced March 2024.

  29. Semantic Is Enough: Only Semantic Information For NeRF Reconstruction

    Authors: Ruibo Wang, Song Zhang, Ping Huang, Donghai Zhang, Wei Yan

    Abstract: Recent research that combines implicit 3D representation with semantic information, like Semantic-NeRF, has proven that NeRF model could perform excellently in rendering 3D structures with semantic labels. This research aims to extend the Semantic Neural Radiance Fields (Semantic-NeRF) model by focusing solely on semantic output and removing the RGB output component. We reformulate the model and i…

    Submitted 24 March, 2024; originally announced March 2024.

  30. arXiv:2403.13208  [pdf, other]

    cs.RO

    CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

    Authors: Peide Huang, Wenhao Ding, Jonathan Francis, Bingqing Chen, Ding Zhao

    Abstract: Simulation is an indispensable tool in the development and testing of autonomous vehicles (AVs), offering an efficient and safe alternative to road testing by allowing the exploration of a wide range of scenarios. Despite its advantages, a significant challenge within simulation-based testing is the generation of safety-critical scenarios, which are essential to ensure that AVs can handle rare but…

    Submitted 19 March, 2024; originally announced March 2024.

  31. arXiv:2403.04481  [pdf, other]

    cs.CL cs.AI

    Do Large Language Model Understand Multi-Intent Spoken Language ?

    Authors: Shangjian Yin, Peijie Huang, Yuhong Xu, Haojing Huang, Jiatian Chen

    Abstract: This research signifies a considerable breakthrough in leveraging Large Language Models (LLMs) for multi-intent spoken language understanding (SLU). Our approach re-imagines the use of entity slots in multi-intent SLU applications, making the most of the generative potential of LLMs within the SLU landscape, leading to the development of the EN-LLM series. Furthermore, we introduce the concept of…

    Submitted 15 April, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  32. arXiv:2402.16398  [pdf, other]

    cs.RO

    Efficient Continuous-Time Ego-Motion Estimation for Asynchronous Event-based Data Associations

    Authors: Zhixiang Wang, Xudong Li, Tianle Liu, Yizhai Zhang, Panfeng Huang

    Abstract: Event cameras are bio-inspired vision sensors that asynchronously measure per-pixel brightness changes. The high temporal resolution and asynchronicity of event cameras offer great potential for estimating the robot motion state. Recent works have adopted the continuous-time ego-motion estimation methods to exploit the inherent nature of event cameras. However, most of the adopted methods have poo…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: 8 pages, 7 figures

  33. arXiv:2401.15704  [pdf, other]

    cs.CR cs.SD eess.AS

    Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege

    Authors: Peng Huang, Yao Wei, Peng Cheng, Zhongjie Ba, Li Lu, Feng Lin, Yang Wang, Kui Ren

    Abstract: The widespread smart devices raise people's concerns of being eavesdropped on. To enhance voice privacy, recent studies exploit the nonlinearity in microphone to jam audio recorders with inaudible ultrasound. However, existing solutions solely rely on energetic masking. Their simple-form noise leads to several problems, such as high energy requirements and being easily removed by speech enhancemen…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 14 pages, 28 figures; submitted to IEEE TDSC

  34. arXiv:2401.14461  [pdf, other]

    cs.AI cs.LG cs.LO

    Marabou 2.0: A Versatile Formal Analyzer of Neural Networks

    Authors: Haoze Wu, Omri Isac, Aleksandar Zeljić, Teruhiro Tagomori, Matthew Daggitt, Wen Kokke, Idan Refaeli, Guy Amir, Kyle Julian, Shahaf Bassan, Pei Huang, Ori Lahav, Min Wu, Min Zhang, Ekaterina Komendantskaya, Guy Katz, Clark Barrett

    Abstract: This paper serves as a comprehensive system description of version 2.0 of the Marabou framework for formal analysis of neural networks. We discuss the tool's architectural design and highlight the major features and components introduced since its initial release.

    Submitted 20 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Condensed version accepted at CAV'24

  35. arXiv:2401.13649  [pdf, other]

    cs.LG cs.CL cs.CV

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

    Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augmen…

    Submitted 5 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa

  36. arXiv:2401.10822  [pdf, other]

    cs.CV

    ActAnywhere: Subject-Aware Video Background Generation

    Authors: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang

    Abstract: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which tra…

    Submitted 19 January, 2024; originally announced January 2024.

  37. arXiv:2401.08422  [pdf, other]

    cs.CV

    Improving Limited Supervised Foot Ulcer Segmentation Using Cross-Domain Augmentation

    Authors: Shang-Jui Kuo, Po-Han Huang, Chia-Ching Lin, Jeng-Lin Li, Ming-Ching Chang

    Abstract: Diabetic foot ulcers pose health risks, including higher morbidity, mortality, and amputation rates. Monitoring wound areas is crucial for proper care, but manual segmentation is subjective due to complex wound features and background variation. Expert annotations are costly and time-intensive, thus hampering large dataset creation. Existing segmentation models relying on extensive annotations are…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: 5 pages, 2 figures, accepted by ICASSP 2024

  38. arXiv:2401.06066  [pdf, other]

    cs.CL

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang

    Abstract: In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the…

    Submitted 11 January, 2024; originally announced January 2024.

  39. arXiv:2401.04143  [pdf, other]

    cs.CV

    RHOBIN Challenge: Reconstruction of Human Object Interaction

    Authors: Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeongjin Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, Gerard Pons-Moll

    Abstract: Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is however a very challenging task due to heavy occlusion and complex dynamics, which requires understanding not only 3D human pose, and object pose but also the interaction between them. Reconstruction of 3D humans and objects has been two separate resear…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 14 pages, 5 tables, 7 figures. Technical report of the CVPR'23 workshop: RHOBIN challenge (https://rhobin-challenge.github.io/)

  40. arXiv:2401.02954  [pdf, other]

    cs.CL cs.AI cs.LG

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Authors: DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, et al. (63 additional authors not shown)

    Abstract: The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B…

    Submitted 5 January, 2024; originally announced January 2024.

  41. arXiv:2401.02523  [pdf, other]

    cs.CV cs.AI cs.LG eess.SY

    Image-based Deep Learning for Smart Digital Twins: a Review

    Authors: Md Ruman Islam, Mahadevan Subramaniam, Pei-Chi Huang

    Abstract: Smart Digital twins (SDTs) are being increasingly used to virtually replicate and predict the behaviors of complex physical systems through continual data assimilation enabling the optimization of the performance of these systems by controlling the actions of systems. Recently, deep learning (DL) models have significantly enhanced the capabilities of SDTs, particularly for tasks such as predictive…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: 12 pages, 2 figures, and 3 tables

  42. arXiv:2312.15127  [pdf, other]

    cs.LG

    Gradient Shaping for Multi-Constraint Safe Reinforcement Learning

    Authors: Yihang Yao, Zuxin Liu, Zhepeng Cen, Peide Huang, Tingnan Zhang, Wenhao Yu, Ding Zhao

    Abstract: Online safe reinforcement learning (RL) involves training a policy that maximizes task efficiency while satisfying constraints via interacting with the environments. In this paper, our focus lies in addressing the complex challenges associated with solving multi-constraint (MC) safe RL problems. We approach the safe RL problem from the perspective of Multi-Objective Optimization (MOO) and propose…

    Submitted 22 December, 2023; originally announced December 2023.

  43. arXiv:2312.12679  [pdf, other]

    cs.LG cs.AI cs.LO

    Towards Efficient Verification of Quantized Neural Networks

    Authors: Pei Huang, Haoze Wu, Yuting Yang, Ieva Daukantas, Min Wu, Yedi Zhang, Clark Barrett

    Abstract: Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying properties of quantized neural networks. Our baseline technique is based on integer linear programming which guarantees both soundness and completeness. We then sho…

    Submitted 27 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: This paper has been accepted by AAAI 2024

  44. Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning

    Authors: Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang

    Abstract: Imitation learning attracts much attention for its ability to allow robots to quickly learn human manipulation skills through demonstrations. However, in the real world, human demonstrations often exhibit random behavior that is not intended by humans. Collecting high-quality human datasets is both challenging and expensive. Consequently, robots need to have the ability to learn behavioral policie…

    Submitted 19 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: Our code is available at https://github.com/XizoB/CIQL

    Journal ref: IEEE Robotics and Automation Letters, vol. 9, no. 8, pp. 7150 - 7157, Aug. 2024

  45. arXiv:2312.06397  [pdf, other]

    cs.DB cs.IR

    MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality

    Authors: Mengzhao Wang, Xiangyu Ke, Xiaoliang Xu, Lu Chen, Yunjun Gao, Pinpin Huang, Runkai Zhu

    Abstract: We investigate the problem of multimodal search of target modality, where the task involves enhancing a query in a specific target modality by integrating information from auxiliary modalities. The goal is to retrieve relevant objects whose contents in the target modality match the specified multimodal query. The paper first introduces two baseline approaches that integrate techniques from the Dat…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: This paper has been accepted by ICDE 2024

  46. arXiv:2312.01409  [pdf, other]

    cs.CV cs.AI cs.GR

    Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

    Authors: Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, Gordon Wetzstein

    Abstract: Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hinderin…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: Project page: https://primecai.github.io/generative_rendering/

  47. arXiv:2311.18260  [pdf, other]

    eess.IV cs.CL cs.CV cs.LG

    Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

    Authors: Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Yossi Matias, Joelle Barral, Ali Eslami, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, et al. (1 additional author not shown)

    Abstract: Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear pote…

    Submitted 20 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

  48. arXiv:2311.11287  [pdf, other]

    cs.RO cs.AI

    Tactile Active Inference Reinforcement Learning for Efficient Robotic Manipulation Skill Acquisition

    Authors: Zihao Liu, Xing Liu, Yizhai Zhang, Zhengxiong Liu, Panfeng Huang

    Abstract: Robotic manipulation holds the potential to replace humans in the execution of tedious or dangerous tasks. However, control-based approaches are not suitable due to the difficulty of formally describing open-world manipulation in reality, and the inefficiency of existing learning methods. Thus, applying manipulation in a wide range of scenarios presents significant challenges. In this study, we pr…

    Submitted 19 November, 2023; originally announced November 2023.

  49. arXiv:2311.05298  [pdf, other]

    cs.CV

    Improving Vision-and-Language Reasoning via Spatial Relations Modeling

    Authors: Cheng Yang, Rui Xu, Ye Guo, Peixiang Huang, Yiru Chen, Wenkui Ding, Zhongyuan Wang, Hong Zhou

    Abstract: Visual commonsense reasoning (VCR) is a challenging multi-modal task, which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and have advanced the state-of-the-art performance of VCR. However, almost all existing approaches employ BERT-like objectives to learn multi-modal representations. T…

    Submitted 9 November, 2023; originally announced November 2023.

  50. arXiv:2311.05143  [pdf, other]

    cs.CV

    SCAAT: Improving Neural Network Interpretability via Saliency Constrained Adaptive Adversarial Training

    Authors: Rui Xu, Wenkang Qin, Peixiang Huang, Hao Wang, Lin Luo

    Abstract: Deep Neural Networks (DNNs) are expected to provide explanations for users to understand their black-box predictions. A saliency map is a common form of explanation, illustrating the heatmap of feature attributions, but it suffers from noise in distinguishing important features. In this paper, we propose a model-agnostic learning method called Saliency Constrained Adaptive Adversarial Training (SCAAT)…

    Submitted 10 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.