FORGE: Force-Guided Exploration for Robust Contact-Rich Manipulation under Uncertainty

Authors: Michael Noseworthy, Bingjie Tang, Bowen Wen, Ankur Handa, Nicholas Roy, Dieter Fox, Fabio Ramos, Yashraj Narang, Iretiayo Akinola

Abstract: We present FORGE, a method that enables sim-to-real transfer of contact-rich manipulation policies in the presence of significant pose uncertainty. FORGE combines a force threshold mechanism with a dynamics randomization scheme during policy learning in simulation, to enable the robust transfer of the learned policies to the real robot. At deployment, FORGE policies, conditioned on a maximum allowable force, adaptively perform contact-rich tasks while respecting the specified force threshold, regardless of the controller gains. Additionally, FORGE autonomously predicts a termination action once the task has succeeded. We demonstrate that FORGE can be used to learn a variety of robust contact-rich policies, enabling multi-stage assembly of a planetary gear system, which requires success across three assembly tasks: nut-threading, insertion, and gear meshing. Project website can be accessed at https://noseworm.github.io/forge/.

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2407.20177 [pdf, other]

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Authors: Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

Abstract: To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the optimal data mixtures when scaling up to the final model. To address this challenge, we propose *AutoScale*, an automated tool that finds a compute-optimal data composition for training at any desired target scale. AutoScale first determines the optimal composition at a small scale using a novel bilevel optimization framework, Direct Data Optimization (*DDO*), and then fits a predictor to estimate the optimal composition at larger scales. The predictor's design is inspired by our theoretical analysis of scaling laws related to data composition, which could be of independent interest. In empirical studies pre-training 774M decoder-only LMs (GPT-2 Large) on the RedPajama dataset, AutoScale decreases validation perplexity at least 25% faster than any baseline, with up to a 38% speed-up compared to training without reweighting, achieving the best overall performance across downstream tasks. When pre-training encoder-only LMs (BERT) with masked language modeling, DDO is shown to decrease loss on all domains while visibly improving average task performance on the GLUE benchmark by 8.7% and on a large-scale QA dataset (SQuAD) by 5.9% compared to training without reweighting. AutoScale speeds up training by up to 28%. Our code is open-sourced.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.18418 [pdf, other]

Know Your Limits: A Survey of Abstention in Large Language Models

Authors: Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang

Abstract: Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future work, centered around whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, while still providing opportunities to optimize abstention abilities based on context.

Submitted 8 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

Comments: preprint

arXiv:2407.17996 [pdf, other]

Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

Authors: Kailai Zhou, Lijing Cai, Yibo Wang, Mengya Zhang, Bihan Wen, Qiu Shen, Xun Cao

Abstract: The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spectral decomposition model guided enhancement framework, which consists of two steps: joint decomposition and prior-guided enhancement. Firstly, we leverage the complementarity between RGB and Low-resolution Multi-Spectral Images (Lr-MSI) to predict shading, reflectance, and material semantic priors. Subsequently, these priors are seamlessly integrated into the established HDRNet to promote dynamic range enhancement, color mapping, and grid expert learning, respectively. Additionally, we construct a high-quality Mobile-Spec dataset to support our research, and our experiments validate the effectiveness of Lr-MSI in the tone enhancement task. This work aims to establish a solid foundation for advancing spectral vision in mobile photography. The code is available at https://github.com/CalayZhou/JDM-HDRNet.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.14177 [pdf, other]

EVLM: An Efficient Vision-Language Model for Visual Understanding

Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

Abstract: In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, directly feeding it into the language models alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model to minimize computational costs while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.13761 [pdf, other]

SegPoint: Segment Any Point Cloud via Large Language Model

Authors: Shuting He, Henghui Ding, Xudong Jiang, Bihan Wen

Abstract: Despite significant progress in 3D point cloud segmentation, existing methods primarily address specific tasks and depend on explicit instructions to identify targets, lacking the capability to infer and understand implicit user intentions in a unified framework. In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark, Instruct3D, designed to evaluate segmentation performance from complex and implicit instructional texts, featuring 2,565 point cloud-instruction pairs. Our experimental results demonstrate that SegPoint achieves competitive performance on established benchmarks such as ScanRefer for referring segmentation and ScanNet for semantic segmentation, while delivering outstanding outcomes on the Instruct3D dataset. To our knowledge, SegPoint is the first model to address these varied segmentation tasks within a single framework, achieving satisfactory performance.

Submitted 18 July, 2024; originally announced July 2024.

Comments: ECCV 2024, Project Page: https://heshuting555.github.io/SegPoint

arXiv:2407.13097 [pdf, other]

AlcLaM: Arabic Dialectal Language Model

Authors: Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu

Abstract: Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on modern standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, which represents just 7.8%, 10.2%, and 21.3% of the data used by existing models such as CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available on GitHub at https://github.com/amurtadha/Alclam and on HuggingFace at https://huggingface.co/rahbi.

Submitted 17 July, 2024; originally announced July 2024.

Comments: Accepted by ArabicNLP 2024, presented in ACL 2024

arXiv:2407.08865 [pdf, other]

Single-Image Shadow Removal Using Deep Learning: A Comprehensive Survey

Authors: Lanqing Guo, Chong Wang, Yufei Wang, Siyu Huang, Wenhan Yang, Alex C. Kot, Bihan Wen

Abstract: Shadow removal aims at restoring the image content within shadow regions, pursuing a uniform distribution of illumination that is consistent between shadow and non-shadow regions. Compared to other image restoration tasks, there are two unique challenges in shadow removal: 1) The patterns of shadows are arbitrary, varied, and often have highly complex trace structures, making "trace-less" image recovery difficult. 2) The degradation caused by shadows is spatially non-uniform, resulting in inconsistencies in illumination and color between shadow and non-shadow areas. Recent developments in this field are primarily driven by deep learning-based solutions, employing a variety of learning strategies, network architectures, loss functions, and training data. Nevertheless, a thorough and insightful review of deep learning-based shadow removal techniques is still lacking. In this paper, we are the first to provide a comprehensive survey to cover various aspects ranging from technical details to applications. We highlight the major advancements in deep learning-based single-image shadow removal methods, thoroughly review previous research across various categories, and provide insights into the historical progression of these developments. Additionally, we summarize performance comparisons both quantitatively and qualitatively. Beyond the technical aspects of shadow removal methods, we also explore potential future directions for this field.

Submitted 11 July, 2024; originally announced July 2024.

Comments: url: https://github.com/GuoLanqing/Awesome-Shadow-Removal

arXiv:2407.08028 [pdf, other]

AutoMate: Specialist and Generalist Assembly Policies over Diverse Geometries

Authors: Bingjie Tang, Iretiayo Akinola, Jie Xu, Bowen Wen, Ankur Handa, Karl Van Wyk, Dieter Fox, Gaurav S. Sukhatme, Fabio Ramos, Yashraj Narang

Abstract: Robotic assembly for high-mixture settings requires adaptivity to diverse parts and poses, which is an open challenge. Meanwhile, in other areas of robotics, large models and sim-to-real have led to tremendous progress. Inspired by such work, we present AutoMate, a learning framework and system that consists of 4 parts: 1) a dataset of 100 assemblies compatible with simulation and the real world, along with parallelized simulation environments for policy learning, 2) a novel simulation-based approach for learning specialist (i.e., part-specific) policies and generalist (i.e., unified) assembly policies, 3) demonstrations of specialist policies that individually solve 80 assemblies with 80% or higher success rates in simulation, as well as a generalist policy that jointly solves 20 assemblies with an 80%+ success rate, and 4) zero-shot sim-to-real transfer that achieves similar (or better) performance than simulation, including on perception-initialized assembly. The key methodological takeaway is that a union of diverse algorithms from manufacturing engineering, character animation, and time-series analysis provides a generic and robust solution for a diverse range of robotic assembly problems. To our knowledge, AutoMate provides the first simulation-based framework for learning specialist and generalist policies over a wide range of assemblies, as well as the first system demonstrating zero-shot sim-to-real transfer over such a range. For videos and additional details, please see our project website: https://bingjietang718.github.io/automate/

Submitted 31 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.06600 [pdf, other]

Integrating Clinical Knowledge into Concept Bottleneck Models

Authors: Winnie Pang, Xueyi Ke, Satoshi Tsutsui, Bihan Wen

Abstract: Concept bottleneck models (CBMs), which predict human-interpretable concepts (e.g., nucleus shapes in cell images) before predicting the final output (e.g., cell type), provide insights into the decision-making processes of the model. However, training CBMs solely in a data-driven manner can introduce undesirable biases, which may compromise prediction performance, especially when the trained models are evaluated on out-of-domain images (e.g., those acquired using different devices). To mitigate this challenge, we propose integrating clinical knowledge to refine CBMs, better aligning them with clinicians' decision-making processes. Specifically, we guide the model to prioritize the concepts that clinicians also prioritize. We validate our approach on two datasets of medical images: white blood cell and skin images. Empirical validation demonstrates that incorporating medical guidance enhances the model's classification performance on unseen datasets with varying preparation methods, thereby increasing its real-world applicability.

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted to MICCAI2024

arXiv:2407.03978 [pdf, other]

Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Authors: Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.

Submitted 11 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: 20 pages, 7 figures

arXiv:2407.01067 [pdf, other]

Human-like object concept representations emerge naturally in multimodal large language models

Authors: Changde Du, Kaicheng Fu, Bincheng Wen, Yi Sun, Jie Peng, Wei Wei, Ying Gao, Shengpei Wang, Chuncheng Zhang, Jinpeng Li, Shuang Qiu, Le Chang, Huiguang He

Abstract: The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from an LLM and a Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that the LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in humans, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00820 [pdf]

Localization and Perception for Control of a Low Speed Autonomous Shuttle in a Campus Pilot Deployment

Authors: Bowen Wen

Abstract: Future SAE Level 4 and Level 5 autonomous vehicles will require novel applications of localization, perception, control and artificial intelligence technology in order to offer innovative and disruptive solutions to current mobility problems. Accurate localization is essential for self-driving vehicle navigation in GPS-inaccessible environments. This thesis concentrates on low-speed autonomous shuttles that are mainly utilized for university campus intelligent transportation systems and presents initial results of ongoing work on developing solutions to the localization and perception challenges of an application oriented toward a pilot deployment planned by the university. The thesis treats autonomous driving using real-time kinematic GPS (Global Positioning System) with an inertial measurement unit (IMU), combined with simultaneous localization and mapping (SLAM) using a three-dimensional light detection and ranging (LIDAR) sensor, which provides solutions to scenarios where GPS is not available or where a lower-cost and hence lower-accuracy GPS is desirable. The in-house automated low-speed electric vehicle from the Automated Driving Lab is used in experimental evaluation and verification. An improved version of Hector SLAM was implemented on ROS and compared with a high-resolution GPS-aided localization framework on the same hardware architecture. The overall configuration, which combines ROS with a DSpace controller, can serve as an easily transplantable prototype for other hardware architectures in similar future research. The real-world experiments reported here were conducted in a small test area close to the Ohio State University AV pilot test route and are used to demonstrate the feasibility and robustness of this approach to developing and evaluating low-speed autonomous shuttle localization and perception algorithms for control and decision making.

Submitted 2 April, 2024; originally announced July 2024.

Comments: Master thesis, ADL & GDA, The Ohio State University, 2014

arXiv:2406.13659 [pdf, other]

Leveraging Large Language Models for Patient Engagement: The Power of Conversational AI in Digital Health

Authors: Bo Wen, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Farhana Zulkernine, Huamin Chen

Abstract: The rapid advancements in large language models (LLMs) have opened up new opportunities for transforming patient engagement in healthcare through conversational AI. This paper presents an overview of the current landscape of LLMs in healthcare, specifically focusing on their applications in analyzing and generating conversations for improved patient engagement. We showcase the power of LLMs in handling unstructured conversational data through four case studies: (1) analyzing mental health discussions on Reddit, (2) developing a personalized chatbot for cognitive engagement in seniors, (3) summarizing medical conversation datasets, and (4) designing an AI-powered patient engagement system. These case studies demonstrate how LLMs can effectively extract insights and summarizations from unstructured dialogues and engage patients in guided, goal-oriented conversations. Leveraging LLMs for conversational analysis and generation opens new doors for many patient-centered outcomes research opportunities. However, integrating LLMs into healthcare raises important ethical considerations regarding data privacy, bias, transparency, and regulatory compliance. We discuss best practices and guidelines for the responsible development and deployment of LLMs in healthcare settings. Realizing the full potential of LLMs in digital health will require close collaboration between the AI and healthcare professional communities to address technical challenges and ensure the safety, efficacy, and equity of these powerful tools.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 10 pages, 6 figures, ICDH 2024 invited paper

arXiv:2406.10543 [pdf, other]

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows

Authors: Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, Alexander Schwing

Abstract: We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (https://github.com/nerfdeformer/nerfdeformer) contains 113 synthetic scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 8 pages of main paper, CVPR 2024. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024

arXiv:2406.10462 [pdf, other]

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Authors: Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen

Abstract: Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 22 pages

arXiv:2406.08300 [pdf, other]

From Chaos to Clarity: 3DGS in the Dark

Authors: Zhihao Li, Yufei Wang, Alex Kot, Bihan Wen

Abstract: Novel view synthesis from raw images provides superior high dynamic range (HDR) information compared to reconstructions from low dynamic range RGB images. However, the inherent noise in unprocessed raw images compromises the accuracy of 3D scene representation. Our study reveals that 3D Gaussian Splatting (3DGS) is particularly susceptible to this noise, leading to numerous elongated Gaussian shapes that overfit the noise, thereby significantly degrading reconstruction quality and reducing inference speed, especially in scenarios with limited views. To address these issues, we introduce a novel self-supervised learning framework designed to reconstruct HDR 3DGS from a limited number of noisy raw images. This framework enhances 3DGS by integrating a noise extractor and employing a noise-robust reconstruction loss that leverages a noise distribution prior. Experimental results show that our method outperforms LDR/HDR 3DGS and previous state-of-the-art (SOTA) self-supervised and supervised pre-trained models in both reconstruction quality and inference speed on the RawNeRF dataset across a broad range of training views. Code can be found at https://lizhihao6.github.io/Raw3DGS.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06843 [pdf, other]

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Authors: Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang

Abstract: We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.

Submitted 16 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.05955 [pdf, other]

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Authors: Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen

Abstract: Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer

Submitted 10 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2405.20721 [pdf, other]

ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model

Authors: Yufei Wang, Zhihao Li, Lanqing Guo, Wenhan Yang, Alex C. Kot, Bihan Wen

Abstract: Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for novel view synthesis, offering fast rendering speeds and high fidelity. However, the large number of Gaussians and their associated attributes require effective compression techniques. Existing methods primarily compress neural Gaussians individually and independently, i.e., coding all the neural Gaussians at the same time, with little design for their interactions and spatial dependence. Inspired by the effectiveness of the context model in image compression, we propose the first autoregressive model at the anchor level for 3DGS compression in this work. We divide anchors into different levels and the anchors that are not coded yet can be predicted based on the already coded ones in all the coarser levels, leading to more accurate modeling and higher coding efficiency. To further improve the efficiency of entropy coding, e.g., to code the coarsest level with no already coded anchors, we propose to introduce a low-dimensional quantized feature as the hyperprior for each anchor, which can be effectively compressed. Our work pioneers the context model in the anchor level for 3DGS representation, yielding an impressive size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work Scaffold-GS, while achieving comparable or even higher rendering quality.

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.19996 [pdf, other]

DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild

Authors: Honghao Fu, Yufei Wang, Wenhan Yang, Bihan Wen

Abstract: Blind image quality assessment (IQA) in the wild, which assesses the quality of images with complex authentic distortions and no reference images, presents significant challenges. Given the difficulty in collecting large-scale training data, leveraging limited data to develop a model with strong generalization remains an open problem. Motivated by the robust image perception capabilities of pre-trained text-to-image (T2I) diffusion models, we propose a novel IQA method, diffusion priors-based IQA (DP-IQA), to utilize the T2I model's prior for improved performance and generalization ability. Specifically, we utilize pre-trained Stable Diffusion as the backbone, extracting multi-level features from the denoising U-Net guided by prompt embeddings through a tunable text adapter. Simultaneously, an image adapter compensates for information loss introduced by the lossy pre-trained encoder. Unlike T2I models that require full image distribution modeling, our approach targets image quality assessment, which inherently requires fewer parameters. To improve applicability, we distill the knowledge into a lightweight CNN-based student model, significantly reducing parameters while maintaining or even enhancing generalization performance. Experimental results demonstrate that DP-IQA achieves state-of-the-art performance on various in-the-wild datasets, highlighting the superior generalization capability of T2I priors in blind IQA tasks. To our knowledge, DP-IQA is the first method to apply pre-trained diffusion priors in blind IQA. Codes and checkpoints are available at https://github.com/RomGai/DP-IQA.

Submitted 17 August, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.16820 [pdf, other]

doi 10.1145/3630106.3658966

Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Authors: Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber, Bill Howe

Abstract: The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty in using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance in domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero shot performance, finding this cost to be marginal under realistic settings.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024

arXiv:2405.16295 [pdf, other]

Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data

Authors: Yuhao Chen, Zhimu Wang, Bo Wen, Farhana Zulkernine

Abstract: Unstructured text in medical notes and dialogues contains rich information. Recent advancements in Large Language Models (LLMs) have demonstrated superior performance in question answering and summarization tasks on unstructured text data, outperforming traditional text analysis approaches. However, there is a lack of scientific studies in the literature that methodically evaluate and report on the performance of different LLMs, specifically for domain-specific data such as medical chart notes. We propose an evaluation approach to analyze the performance of open-source LLMs such as Llama2 and Mistral for medical summarization tasks, using GPT-4 as an assessor. Our innovative approach to quantitative evaluation of LLMs can enable quality control, support the selection of effective LLMs for specific tasks, and advance knowledge discovery in digital health.

Submitted 29 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.11852 [pdf, other]

Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models

Authors: Xiyu Wang, Yufei Wang, Satoshi Tsutsui, Weisi Lin, Bihan Wen, Alex C. Kot

Abstract: Diffusion-based models for story visualization have shown promise in generating content-coherent images for storytelling tasks. However, how to effectively integrate new characters into existing narratives while maintaining character consistency remains an open problem, particularly with limited data. Two major limitations hinder progress: (1) the absence of a suitable benchmark due to potential character leakage and inconsistent text labeling, and (2) the challenge of distinguishing between new and old characters, leading to ambiguous results. To address these challenges, we introduce the NewEpisode benchmark, comprising refined datasets designed to evaluate generative models' adaptability in generating new stories with fresh characters using just a single example story. The refined dataset involves refined text prompts and eliminates character leakage. Additionally, to mitigate the character confusion of generated results, we propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics. EpicEvo introduces a novel adversarial character alignment module to progressively align the generated images with exemplar images of new characters during the diffusion process, while applying knowledge distillation to prevent forgetting of characters and background details. Our evaluation quantitatively demonstrates that EpicEvo outperforms existing baselines on the NewEpisode benchmark, and qualitative studies confirm its superior customization of visual story generation in diffusion models. In summary, EpicEvo provides an effective way to incorporate new characters using only one example story, unlocking new possibilities for applications such as serialized cartoons.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.09364 [pdf, other]

doi 10.3390/universe10050219

Orbital Stability Study of the Taiji Space Gravitational Wave Detector

Authors: Yu-Yang Zhang, Geng Li, Bo Wen

Abstract: Space-based gravitational wave detection is extremely sensitive to disturbances. The Keplerian configuration cannot accurately reflect the variations in spacecraft configuration. Planetary gravitational disturbances are one of the main sources. Numerical simulation is an effective method to investigate the impact of perturbation on spacecraft orbits. This study shows that, in the context of the Taiji project, Earth's gravity is an essential factor in the change in heliocentric formation configuration, contributing to the relative acceleration between spacecraft on the order of $\mathcal O(10^{-6})\,{\rm m\cdot s^{-2}}$. Considering 00:00:00 on 27 October 2032 as the initial orbiting moment, under the influence of Earth's gravitational perturbation, the maximum relative changes in armlengths and variation rates of armlengths for Taiji are $1.6\times 10^{5}\,{\rm km}$ and $32\,{\rm m\cdot s^{-1}}$, respectively, compared with the unperturbed Keplerian orbit. Additionally, by considering the gravitational perturbations of Venus and Jupiter, the armlength and relative velocity for Taiji are reduced by $16.01\%$ and $17.45\%$, respectively, compared with when only considering that of Earth. The maximum amplitude of the formation motion indicator changes with the orbit entry time. Results show that the relative velocity increase between the spacecraft is minimal when the initial orbital moment occurs in July. Moreover, the numerical simulation results are inconsistent when using different ephemerides. The differences between ephemerides DE440 and DE430 are smaller than those between DE440 and DE421.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 18 pages, 11 figures

Journal ref: Universe 2024, Volume 10, Issue 5, 219

arXiv:2405.08438 [pdf, other]

doi 10.1103/PhysRevB.109.205133

Magnetic fluctuation and dominant superconducting pairing symmetry near the tunable Van Hove singularity

Authors: Xiaohan Kong, Boyang Wen, Kaiyi Guo, Ying Liang, Tianxing Ma

Abstract: We have investigated the magnetism and pairing correlations of the triangular lattice based on the Hubbard model using the determinant quantum Monte Carlo method and the constrained path Monte Carlo. The results show that the presence of the next-nearest-neighbor hopping integral $t^{\prime}$ introduces an additional energy scale to the system, and through $t^{\prime}$, one can regulate the shape of the density of states and thus the position of the van Hove singularity point. Increasing inverse temperature $β$ and on-site interaction $U$ favor the formation of ferromagnetic correlation in a rather large filling region, and the calculations for different lattice sizes show that the range of the ferromagnetic correlations is smaller than the smallest lattice simulated at the investigated temperatures. We study the different pairing correlations of the triangular lattice near several typical fillings and show that the $f$-wave pairing dominates the system in the filling region near the van Hove singularity point with a high density of states, where the ferromagnetic correlation is also enhanced. When the filling is close to half-filling, the pairing susceptibility with $f$ wave is suppressed and the pairing susceptibility of $f_n$ wave is enhanced; however, both the effective pairing interaction with $f$ wave and $f_n$ wave are negative, which indicates that neither $f$-wave nor $f_n$-wave superconductivity may exist. Finally, we find that the pairing channel of different symmetry in the system may be closely related to the magnetic properties. Ferromagnetic fluctuation favors the formation of $f$-wave pairing, while antiferromagnetic fluctuation tends to promote $f_n$-wave pairing.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 7 pages and 9 figures. Accepted for publication as a Regular Article in Physical Review B

Journal ref: Phys. Rev. B 109, 205133 (2024)

arXiv:2405.00574 [pdf, other]

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Authors: Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen

Abstract: Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.12452 [pdf, other]

Characterizing LLM Abstention Behavior in Science QA with Context Perturbations

Authors: Bingbing Wen, Bill Howe, Lucy Lu Wang

Abstract: The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided insufficient or incorrect context. We probe model sensitivity in several settings: removing gold context, replacing gold context with irrelevant context, and providing additional context beyond what is given. In experiments on four QA datasets with four LLMs, we show that performance varies greatly across models, across the type of context provided, and also by question type; in particular, many LLMs seem unable to abstain from answering boolean questions using standard QA prompts. Our analysis also highlights the unexpected impact of abstention performance on QA task accuracy. Counter-intuitively, in some settings, replacing gold context with irrelevant context or adding irrelevant context to gold context can improve abstention performance in a way that results in improvements in task performance. Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.01440 [pdf, other]

Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects

Authors: Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, Stan Birchfield

Abstract: We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects. Our method first reconstructs object-level shape at each state, then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images, 3D reconstructions, and kinematics, our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: https://github.com/NVlabs/DigitalTwinArt

Submitted 6 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.10076 [pdf, other]

Benchmarking Adversarial Robustness of Image Shadow Removal with Shadow-adaptive Attacks

Authors: Chong Wang, Yi Yu, Lanqing Guo, Bihan Wen

Abstract: Shadow removal is a task aimed at erasing regional shadows present in images and reinstating visually pleasing natural scenes with consistent illumination. While recent deep learning techniques have demonstrated impressive performance in image shadow removal, their robustness against adversarial attacks remains largely unexplored. Furthermore, many existing attack frameworks typically allocate a uniform budget for perturbations across the entire input image, which may not be suitable for attacking shadow images. This is primarily due to the unique characteristic of spatially varying illumination within shadow images. In this paper, we propose a novel approach, called shadow-adaptive adversarial attack. Different from standard adversarial attacks, our attack budget is adjusted based on the pixel intensity in different regions of shadow images. Consequently, the optimized adversarial noise in the shadowed regions becomes visually less perceptible while permitting a greater tolerance for perturbations in non-shadow regions. The proposed shadow-adaptive attacks naturally align with the varying illumination distribution in shadow images, resulting in perturbations that are less conspicuous. Building on this, we conduct a comprehensive empirical evaluation of existing shadow removal methods, subjecting them to various levels of attack on publicly available datasets.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted to ICASSP 2024

arXiv:2403.10064 [pdf, other]

Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI

Authors: Chong Wang, Lanqing Guo, Yufei Wang, Hao Cheng, Yi Yu, Bihan Wen

Abstract: Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However, conventional DUN aims to reconstruct all the missing information within the entire null space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation, usually leading to unsatisfactory reconstruction. In this work, we propose a Progressive Divide-And-Conquer (PDAC) strategy, aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI, we present a rigorous derivation of the proposed PDAC framework, which could be further unfolded into an end-to-end trainable network. Specifically, each iterative stage in PDAC focuses on recovering a distinct moderate degradation according to the decomposition. Furthermore, as part of the PDAC iteration, such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask. Following this prediction, the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.01367 [pdf]

doi 10.1145/3644523.3644657

Optimization decision model of vegetable stock and pricing based on TCN-Attention and genetic algorithm

Authors: Linhan Xia, Jinyuan Zhang, Bohan Wen

Abstract: With the expansion of the operational scale of supermarkets in China, the vegetable market has grown considerably. The decision-making related to procurement costs and allocation quantities of vegetables has become a pivotal factor in determining the profitability of supermarkets. This paper analyzes the relationship between pricing and allocation faced by supermarkets in vegetable operations. Optimization algorithms are employed to determine replenishment and pricing strategies. Linear regression is utilized to model the historical data of various products, establishing the relationship between sale prices and sales volumes for 61 products. By integrating historical data on vegetable costs with time information based on the 24 solar terms, a cost prediction model is trained using TCN-Attention. The TOPSIS evaluation model identifies the 32 most market-demanded products. A genetic algorithm is then used to search for the globally optimized vegetable product allocation-pricing decision.

Submitted 2 March, 2024; originally announced March 2024.

Comments: ICCSMT2023

arXiv:2403.00527 [pdf, other]

doi 10.1145/3613904.3642959

"There is a Job Prepared for Me Here": Understanding How Short Video and Live-streaming Platforms Empower Ageing Job Seekers in China

Authors: PiaoHong Wang, Siying Hu, Bo Wen, Zhicong Lu

Abstract: In recent years, the global unemployment rate has remained persistently high. Compounding this issue, the ageing population in China often encounters additional challenges in finding employment due to prevalent age discrimination in daily life. However, with the advent of social media, there has been a rise in the popularity of short videos and live-streams for recruiting ageing workers. To better understand the motivations of ageing job seekers to engage with these video-based recruitment methods and to explore the extent to which such platforms can empower them, we conducted an interview-based study with ageing job seekers who have had exposure to these short recruitment videos and live-streaming channels. Our findings reveal that these platforms can provide a job-seeking choice that is particularly friendly to ageing job seekers, effectively improving their disadvantaged situation.

Submitted 1 March, 2024; originally announced March 2024.

Comments: 14 pages, 3 figures; Accepted to ACM CHI 2024. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI'24)

ACM Class: H.5.m; K.4.0

arXiv:2402.15052 [pdf, other]

ToMBench: Benchmarking Theory of Mind in Large Language Models

Authors: Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang

Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10% points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.

Submitted 22 February, 2024; originally announced February 2024.

Comments: Under review

arXiv:2402.10491 [pdf, other]

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Authors: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.

Submitted 16 February, 2024; originally announced February 2024.

Comments: Project Page: https://guolanqing.github.io/Self-Cascade/

arXiv:2401.01223 [pdf]

Twinning induced by elastic anisotropy in FCC crystals

Authors: Jie Huang, Mingyu Lei, Guangpeng Sun, Guochun Yang, Bin Wen

Abstract: Dislocation slip and deformation twin are widely regarded as two important mechanisms of active competition in the process of plastic deformation. Calculating and comparing the critical resolved shear stress (CRSS) of two deformation modes are the key to discussing the mechanical properties reflected by different mechanisms in crystals. Here, the paper proposes a model to predict the CRSS of discrete twins, resembling thin layers, using the elastic anisotropy theory and a macroscopic energy perspective. In addition, the directionality of deformation twinning is also verified. We investigated twinning in FCC crystals to illustrate the methodology, and predicted the CRSS of twinning under different variables such as temperature and strain rate, both of which were in excellent agreement with experimental and other theory results. It draws the conclusion that we can promote twinning nucleation by applying shear stress along the <112> direction to reduce the interface energy as a resistance term and increase the difference in strain energy for twinning nucleation. This conclusion provides a guiding direction for exploring and accurately predicting the conditions of twinning in FCC crystals in future.

Submitted 2 January, 2024; originally announced January 2024.

Comments: 20 pages, 4 figures

arXiv:2312.13503 [pdf, other]

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

Authors: Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

Abstract: In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge the large-scale multimodal model (e.g., GIT) and the language models (e.g., GPT-3). GIT can describe the image content even with scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter the low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: 54.4% of the dialogue rounds are related to image scene texts, and 36.7% require external knowledge. Each round's answer is also long and open-ended: 87.3% of answers are unique with an average length of 8.9, compared with 27.37% and 2.9 in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tune the model on InfoVisDial. Hopefully, our work can motivate more effort on this direction.

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.08344 [pdf, other]

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Authors: Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield

Abstract: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given, or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicate our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/

Submitted 26 March, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.02459 [pdf, other]

An adaptive preconditioning scheme for the self-consistent field iteration and generalized stacking-fault energy calculations

Authors: Sitong Zhang, Xingyu Gao, Haifeng Song, Bin Wen

Abstract: The generalized stacking-fault energy (GSFE) is the fundamental but key parameter for the plastic deformation of materials. We perform first-principles calculations by full-potential linearized augmented planewave (FLAPW) method to evaluate the GSFE based on the single-shift and triple-shift supercell models. Different degrees of defects are introduced in the two models, thereby affecting the convergence of the self-consistent field (SCF) iterations. We present an adaptive preconditioning scheme which can identify the long-wavelength divergence behavior of the Jacobian during the SCF iteration and automatically switch on the Kerker preconditioning to accelerate the convergence. We implement this algorithm in Elk-7.2.42 package and calculate the GSFE curves for Al, Cu, and Si (111) plane <-1-12> direction. We found that the single-shift and triple-shift supercell models have equivalent calculation accuracy and are within the experimental data uncertainty. For computational efficiency, the triple-shift supercell model is preferable due to its better convergence, exhibiting lower degree of defect compared to the single-shift supercell model.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 10 pages, 8 figures

arXiv:2312.00583 [pdf, other]

DeformGS: Scene Flow in Highly Deformable Scenes for Deformable Object Manipulation

Authors: Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Jenny Seidenschwarz, Mike Zheng Shou, Deva Ramanan, Shuran Song, Stan Birchfield, Bowen Wen, Jeffrey Ichnowski

Abstract: Teaching robots to fold, drape, or reposition deformable objects such as cloth will unlock a variety of automation applications. While remarkable progress has been made for rigid object manipulation, manipulating deformable objects poses unique challenges, including frequent occlusions, infinite-dimensional state spaces and complex dynamics. Just as object pose estimation and tracking have aided robots for rigid manipulation, dense 3D tracking (scene flow) of highly deformable objects will enable new applications in robotics while aiding existing approaches, such as imitation learning or creating digital twins with real2sim transfer. We propose DeformGS, an approach to recover scene flow in highly deformable scenes, using simultaneous video captures of a dynamic scene from multiple cameras. DeformGS builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel-view synthesis. DeformGS learns a deformation function to project a set of Gaussians with canonical properties into world space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on conservation of momentum and isometry, which leads to trajectories with smaller trajectory errors. We also leverage existing foundation models SAM and XMEM to produce noisy masks, and learn a per-Gaussian mask for better physics-inspired regularization.
DeformGS achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. In experiments, DeformGS improves 3D tracking by an average of 55.8% compared to the state-of-the-art. With sufficient texture, DeformGS achieves a median tracking error of 3.3 mm on a cloth of 1.5 x 1.5 m in area. Website: https://deformgs.github.io

Submitted 30 August, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

arXiv:2311.18743 [pdf, other]

AlignBench: Benchmarking Chinese Alignment of Large Language Models

Authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Xiaotao Gu, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang

Abstract: Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, the effective evaluation of alignment for emerging Chinese LLMs is still largely unexplored. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references. To ensure the correctness of references, each knowledge-intensive query is accompanied with evidences collected from reliable web sources (including URLs and quotations) by our annotators. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge~\cite{zheng2023judging} approach with Chain-of-Thought to generate explanations and final ratings, ensuring high reliability and interpretability. All evaluation code, data, and LLM generations are available at https://github.com/THUDM/AlignBench. Since its release, AlignBench has been adopted by top (Chinese) LLMs for evaluating their alignment capabilities in Chinese, including ChatGLM, Qwen, DeepSeek, Yi, Baichuan, and Abab.

Submitted 25 August, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted to ACL 2024

arXiv:2311.18702 [pdf, other]

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

Authors: Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

Abstract: Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.

Submitted 26 June, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted by ACL 2024 (Main Conference)

arXiv:2311.18303 [pdf, other]

OmniMotionGPT: Animal Motion Generation with Limited Data

Authors: Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, Mitch Hill, Junjie Bai, Guo-Jun Qi, Yalin Wang

Abstract: Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates Generative Pretraining Transformer (GPT), utilizing prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding, animal motion encoding, and text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities. We hope this dataset would mediate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.

Submitted 30 November, 2023; originally announced November 2023.

Comments: The project page is at https://zshyang.github.io/omgpt-website/

arXiv:2311.16832 [pdf, other]

CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models

Authors: Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Jie Tang, Minlie Huang

Abstract: In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream closed-source large language models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations. We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research development in the direction of character-based dialogue generation.

Submitted 28 November, 2023; originally announced November 2023.

Comments: Work in progress

arXiv:2311.16551 [pdf]

Thermally-activated precipitation strengthening

Authors: Guangpeng Sun, Liqiang Zhang, Bin Wen

Abstract: Precipitation strengthening is a key strengthening method for metallic materials. However, the temperature effect on precipitation strengthening is still unclear to date. Based on dislocation theory, a thermally-activated precipitation strengthening model is built by considering the competition between shear and bypass mechanisms. For medium-sized precipitate particles, the thermally-activated shear mechanism dominates the precipitation strengthening, resulting in a plateau region. While, for large or very fine precipitate particles, the thermally-activated bypass mechanism dominates the precipitation strengthening, leading to the strengthening or weakening regions. Moreover, the effects of precipitate phase volume fraction, temperature, shear modulus, strain rate, and mobile dislocation density on precipitation strengthening are also investigated. This study not only provides new insights into precipitation strengthening from the perspective of thermal activation but also offers clear guidance for the design of new materials.

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.14760 [pdf, other]

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Authors: Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen

Abstract: While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g., 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement.
Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable up to x10 speedup for inference. Our code will be released at https://github.com/wyf0912/SinSR

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.01373 [pdf, other]

Optimization Efficient Open-World Visual Region Recognition

Authors: Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu

Abstract: Understanding the semantics of individual regions or patches of unconstrained images, such as open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module.
Extensive experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives, along with substantial computational savings (e.g., training our model with 3 million data in a single day using 8 V100 GPUs). RegionSpot outperforms GLIP-L by 2.9 in mAP on LVIS val set, with an even larger margin of 13.1 AP for more challenging and rare categories, and a 2.5 AP increase on ODinW. Furthermore, it exceeds GroundingDINO-L by 11.0 AP for rare categories on the LVIS minival set.

Submitted 13 June, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.17596 [pdf, other]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

Authors: Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, Dieter Fox

Abstract: Imitation learning from a large set of human demonstrations has proved to be an effective paradigm for building capable robot agents. However, the demonstrations can be extremely costly and time-consuming to collect. We introduce MimicGen, a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations by adapting them to new contexts. We use MimicGen to generate over 50K demonstrations across 18 tasks with diverse scene configurations, object instances, and robot arms from just ~200 human demonstrations. We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks, such as multi-part assembly and coffee preparation, across broad initial state distributions. We further demonstrate that the effectiveness and utility of MimicGen data compare favorably to collecting additional human demonstrations, making it a powerful and economical approach towards scaling up robot learning. Datasets, simulation environments, videos, and more at https://mimicgen.github.io .

Submitted 26 October, 2023; originally announced October 2023.

Comments: Conference on Robot Learning (CoRL) 2023

arXiv:2310.01733 [pdf, other]

doi 10.1109/ICDH60066.2023.00019

Health Guardian: Using Multi-modal Data to Understand Individual Health

Authors: Vince S. Siu, Kuan Yu Hsieh, Italo Buleje, Takashi Itoh, Tian Hao, Ben Civjan, Nigel Hinds, Bing Dang, Jeffrey L. Rogers, Bo Wen

Abstract: Artificial intelligence (AI) has shown great promise in revolutionizing the field of digital health by improving disease diagnosis, treatment, and prevention. This paper describes the Health Guardian platform, a non-commercial, scientific research-based platform developed by the IBM Digital Health team to rapidly translate AI research into cloud-based microservices. The platform can collect health-related data from various digital devices, including wearables and mobile applications. Its flexible architecture supports microservices that accept diverse data types such as text, audio, and video, expanding the range of digital health assessments and enabling holistic health evaluations by capturing voice, facial, and motion bio-signals. These microservices can be deployed to a clinical cohort specified through the Clinical Task Manager (CTM). The CTM then collects multi-modal, clinical data that can iteratively improve the accuracy of AI predictive models, discover new disease mechanisms, or identify novel biomarkers. This paper highlights three microservices with different input data types, including a text-based microservice for depression assessment, a video-based microservice for sit-to-stand mobility assessment, and a wearable-based microservice for functional mobility assessment. The CTM is also discussed as a tool to help design and set up clinical studies to unlock the full potential of the platform.
Today, the Health Guardian platform is being leveraged in collaboration with research partners to optimize the development of AI models by utilizing a multitude of input sources. This approach streamlines research efforts, enhances efficiency, and facilitates the development and validation of digital health applications.

Submitted 2 October, 2023; originally announced October 2023.

Comments: 10 pages, 6 figures

Journal ref: IEEE International Conference on Digital Health (ICDH), 2023, pp. 65-74

arXiv:2310.00463 [pdf, other]

Diff-DOPE: Differentiable Deep Object Pose Estimation

Authors: Jonathan Tremblay, Bowen Wen, Valts Blukis, Balakumar Sundaralingam, Stephen Tyree, Stan Birchfield

Abstract: We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a 3D textured model of an object, and an initial pose of the object. The method uses differentiable rendering to update the object pose to minimize the visual error between the image and the projection of the model. We show that this simple, yet effective, idea is able to achieve state-of-the-art results on pose estimation datasets. Our approach is a departure from recent methods in which the pose refiner is a deep neural network trained on a large synthetic dataset to map inputs to refinement steps. Rather, our use of differentiable rendering allows us to avoid training altogether. Our approach performs multiple gradient descent optimizations in parallel with different random learning rates to avoid local minima from symmetric objects, similar appearances, or wrong step size. Various modalities can be used, e.g., RGB, depth, intensity edges, and object segmentation masks. We present experiments examining the effect of various choices, showing that the best results are found when the RGB image is accompanied by an object mask and depth image to guide the optimization process.

Submitted 30 September, 2023; originally announced October 2023.

Comments: Submitted to ICRA 2023. Project page is at https://diffdope.github.io

Showing 1–50 of 207 results for author: Wen, B