Search | arXiv e-print repository

doi 10.1145/3664647.3681593

MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

Authors: Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua

Abstract: We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to text or graph modalities, the use of images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions and, furthermore, that incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.

Submitted 8 August, 2024; originally announced August 2024.

ACM Class: H.3.3

arXiv:2408.00491 [pdf, other]

doi 10.1145/3664647.3681656

GalleryGPT: Analyzing Paintings with Large Multimodal Models

Authors: Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen

Abstract: Artwork analysis is an important and fundamental skill for art appreciation, which could enrich personal aesthetic sensibility and facilitate the critical thinking ability. Understanding artworks is challenging due to their subjective nature, diverse interpretations, and complex visual elements, requiring expertise in art history, cultural background, and aesthetic theory. However, limited by the data collection and model ability, previous works for automatically analyzing artworks mainly focus on classification, retrieval, and other simple tasks, which is far from the goal of AI. To facilitate the research progress, in this paper, we step further to compose comprehensive analysis inspired by the remarkable perception and generation ability of large multimodal models. Specifically, we first propose a task of composing paragraph analysis for artworks, i.e., painting in this paper, only focusing on visual characteristics to formulate more comprehensive understanding of artworks. To support the research on formal analysis, we collect a large dataset PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a superior large multimodal model for painting analysis composing, dubbed GalleryGPT, which is slightly modified and fine-tuned based on LLaVA architecture leveraging our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its superb ability of art analysis and generalization. The codes and model are available at: https://github.com/steven640pixel/GalleryGPT.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted as Oral Presentation at ACM Multimedia 2024

arXiv:2408.00305 [pdf, other]

doi 10.1145/3664647.3681677

Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning

Authors: Yi Bin, Junrong Liao, Yujuan Ding, Haoxuan Li, Yang Yang, See-Kiong Ng, Heng Tao Shen

Abstract: Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information, thereby understanding and creating content of the physical world coherently like human beings. Previous work on cross-modal coherence modeling attempted to leverage the order information from another modality to assist the coherence recovering of the target modality. Despite its effectiveness, labeled associated coherency information is not always available and might be costly to acquire, making the cross-modal guidance hard to leverage. To tackle this challenge, this paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency, and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. More specifically, it leverages high-confidence predicted pairwise order in one modality as reference information to guide the coherence modeling in another. An iterative learning paradigm is further designed to jointly optimize the coherence modeling in two modalities with selected guidance from each other. The iterative cross-modal boosting also functions in inference to further enhance coherence prediction in each modality. Experimental results on two public datasets have demonstrated that the proposed method outperforms existing methods for cross-modal coherence modeling tasks. Major technical modules have been shown to be effective through ablation studies. Codes are available at: https://github.com/scvready123/IterWeGO.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted by ACM Multimedia 2024
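
Note: the abstract above says the model uses "high-confidence predicted pairwise order in one modality" as guidance for the other. The following is only a minimal sketch of what such a selection step might look like. The dictionary layout, the function name `select_confident_pairs`, and the 0.9 threshold are assumptions for illustration; the abstract does not state the actual selection criterion used by WeGO.

```python
from typing import Dict, List, Tuple


def select_confident_pairs(
    prob_before: Dict[Tuple[int, int], float], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Keep only pairwise orders that a model predicts with high confidence.

    prob_before[(i, j)] is a (hypothetical) predicted probability that
    element i precedes element j. Each returned pair is (first, second).
    """
    selected = []
    for (i, j), p in prob_before.items():
        if p >= threshold:
            # Confident that i comes before j.
            selected.append((i, j))
        elif 1.0 - p >= threshold:
            # Confident in the reverse order: j comes before i.
            selected.append((j, i))
        # Otherwise the prediction is too uncertain to serve as guidance.
    return selected


# Hypothetical predictions from one modality.
predictions = {(0, 1): 0.97, (0, 2): 0.55, (1, 2): 0.04}
guidance = select_confident_pairs(predictions, threshold=0.9)
```

In this sketch the uncertain pair `(0, 2)` is dropped, so only confidently ordered pairs would be passed to the other modality as weak reference information.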

arXiv:2407.12339 [pdf, other]

Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection

Authors: Zhenni Yu, Xiaoqin Zhang, Li Zhao, Yi Bin, Guobao Xiao

Abstract: This paper introduces a new Segment Anything Model with Depth Perception (DSAM) for Camouflaged Object Detection (COD). DSAM exploits the zero-shot capability of SAM to realize precise segmentation in the RGB-D domain. It consists of the Prompt-Deeper Module and the Finer Module. The Prompt-Deeper Module utilizes knowledge distillation and the Bias Correction Module to achieve the interaction between RGB features and depth features, especially using depth features to correct erroneous parts in RGB features. Then, the interacted features are combined with the box prompt in SAM to create a prompt with depth perception. The Finer Module explores the possibility of accurately segmenting highly camouflaged targets from a depth perspective. It uncovers depth cues in areas missed by SAM through mask reversion, self-filtering, and self-attention operations, compensating for its defects in the COD domain. DSAM represents the first step towards the SAM-based RGB-D COD model. It maximizes the utilization of depth features while synergizing with RGB features to achieve multimodal complementarity, thereby overcoming the segmentation limitations of SAM and improving its accuracy in COD. Experimental results on COD benchmarks demonstrate that DSAM achieves excellent segmentation performance and reaches the state-of-the-art (SOTA) on COD benchmarks with less consumption of training resources. The code will be available at https://github.com/guobaoxiao/DSAM.

Submitted 17 July, 2024; originally announced July 2024.

Comments: ACM MM 2024

arXiv:2407.03788 [pdf, other]

Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Authors: Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Abstract: Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights which enable dynamic adjustment of the model's focus throughout the training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by large vision-language model, we improve video-language representations and achieve superior performances on commonly used video question answering and text-video retrieval datasets.

Submitted 19 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2406.17294 [pdf, other]

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Authors: Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs' mathematical reasoning abilities. The code and data are available at: https://github.com/HZQ950419/Math-LLaVA.

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

Comments: 8 pages

arXiv:2406.05615 [pdf, other]

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Authors: Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Abstract: Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.

Submitted 1 July, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

Comments: Accepted at ACL 2024 (Findings)

arXiv:2404.05705 [pdf, other]

Learning 3D-Aware GANs from Unposed Images with Template Feature Field

Authors: Xinya Chen, Hanlei Guo, Yanrui Bin, Shangzhan Zhang, Yuanbo Yang, Yue Wang, Yujun Shen, Yiyi Liao

Abstract: Collecting accurate camera poses of training images has been shown to well serve the learning of 3D-aware generative adversarial networks (GANs) yet can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field from 2D semantic features while sharing the density from the radiance field. Such a framework allows us to acquire a canonical 3D feature template leveraging the dataset mean discovered by the generative model, and further efficiently estimate the pose parameters on real data. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives.

Submitted 8 April, 2024; originally announced April 2024.

Comments: https://XDimlab.github.io/TeFF

arXiv:2311.03133 [pdf, other]

Incorporating basic calibrations in existing machine-learned turbulence modeling

Authors: Jiaqi J. L. Li, Yuanwei Bin, George P. Huang, Xiang I. A. Yang

Abstract: This work aims to incorporate basic calibrations of Reynolds-averaged Navier-Stokes (RANS) models as part of machine learning (ML) frameworks. The ML frameworks considered are tensor-basis neural network (TBNN), physics-informed machine learning (PIML), and field inversion & machine learning (FIML) in J. Fluid Mech., 2016, 807, 155-166, Phys. Rev. Fluids, 2017, 2(3), 034603 and J. Comp. Phys., 2016, 305, 758-774, and the baseline RANS models are the one-equation Spalart-Allmaras model, the two-equation $k$-$ω$ model, and the seven-equation Reynolds stress transport models. ML frameworks are trained against plane channel flow and shear-layer flow data. We compare the ML frameworks and study whether the machine-learned augmentations are detrimental outside the training set. The findings are summarized as follows. The augmentations due to TBNN are detrimental. PIML leads to augmentations that are beneficial inside the training dataset but detrimental outside it. These results are not affected by the baseline RANS model. FIML's augmentations to the two eddy viscosity models, where an inner-layer treatment already exists, are largely neutral. Its augmentation to the seven-equation model, where an inner-layer treatment does not exist, improves the mean flow prediction in a channel. Furthermore, these FIML augmentations are mostly non-detrimental outside the training dataset. In addition to reporting these results, the paper offers physical explanations of the results.
Last, we note that the conclusions drawn here are confined to the ML frameworks and the flows considered in this study. More detailed comparative studies and validation & verification studies are needed to account for developments in recent years.

Submitted 14 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.01807 [pdf, other]

Cross-modal Consistency Learning with Fine-grained Fusion Network for Multimodal Fake News Detection

Authors: Jun Li, Yi Bin, Jie Zou, Jie Zou, Guoqing Wang, Yang Yang

Abstract: Previous studies on multimodal fake news detection have observed the mismatch between text and images in the fake news and attempted to explore the consistency of multimodal news based on global features of different modalities. However, they fail to investigate this relationship between fine-grained fragments in multimodal content. To gain public trust, fake news often includes relevant parts in the text and the image, making such multimodal content appear consistent. Using global features may suppress potential inconsistencies in irrelevant parts. Therefore, in this paper, we propose a novel Consistency-learning Fine-grained Fusion Network (CFFN) that separately explores the consistency and inconsistency from high-relevant and low-relevant word-region pairs. Specifically, for a multimodal post, we divide word-region pairs into high-relevant and low-relevant parts based on their relevance scores. For the high-relevant part, we follow the cross-modal attention mechanism to explore the consistency. For low-relevant part, we calculate inconsistency scores to capture inconsistent points. Finally, a selection module is used to choose the primary clue (consistency or inconsistency) for identifying the credibility of multimodal news. Extensive experiments on two public datasets demonstrate that our CFFN substantially outperforms all the baselines.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2310.12640 [pdf, other]

Non-Autoregressive Sentence Ordering

Authors: Yi Bin, Wenhao Shi, Bin Ji, Jipeng Zhang, Yujuan Ding, Yang Yang

Abstract: Existing sentence ordering approaches generally employ encoder-decoder frameworks with the pointer net to recover the coherence by recurrently predicting each sentence step-by-step. Such an autoregressive manner only leverages unilateral dependencies during decoding and cannot fully explore the semantic dependency between sentences for ordering. To overcome these limitations, in this paper, we propose a novel Non-Autoregressive Ordering Network, dubbed NAON, which explores bilateral dependencies between sentences and predicts the sentence for each position in parallel. We claim that the non-autoregressive manner is not just applicable but also particularly suitable to the sentence ordering task because of two peculiar characteristics of the task: 1) each generation target has a deterministic length, and 2) the sentences and positions should match exclusively. Furthermore, to address the repetition issue of the naive non-autoregressive Transformer, we introduce an exclusive loss to constrain the exclusiveness between positions and sentences. To verify the effectiveness of the proposed model, we conduct extensive experiments on several commonly used datasets and the experimental results show that our method outperforms all the autoregressive approaches and yields competitive performance compared with the state of the art. The codes are available at: https://github.com/steven640pixel/nonautoregressive-sentence-ordering.

Submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted at Findings of EMNLP2023

arXiv:2310.09590 [pdf, other]

Solving Math Word Problems with Reexamination

Authors: Yi Bin, Wenhao Shi, Yujuan Ding, Yang Yang, See-Kiong Ng

Abstract: Math word problem (MWP) solving aims to understand the descriptive math problem and calculate the result, for which previous efforts are mostly devoted to upgrading different technical modules. This paper brings a different perspective of a reexamination process during training by introducing a pseudo-dual task to enhance the MWP solving. We propose a pseudo-dual (PseDual) learning scheme to model such a process, which is model-agnostic and thus can be adapted to any existing MWP solver. The pseudo-dual task is specifically defined as filling the numbers in the expression back into the original word problem with numbers masked. To facilitate the effective joint learning of the two tasks, we further design a scheduled fusion strategy for the number infilling task, which smoothly switches the input from the ground-truth math expressions to the predicted ones. Our pseudo-dual learning scheme has been tested and proven effective when equipped in several representative MWP solvers through empirical studies. The codes and trained models are available at: https://github.com/steven640pixel/PsedualMWP.

Submitted 19 November, 2023; v1 submitted 14 October, 2023; originally announced October 2023.

Comments: To appear at the NeurIPS 2023 Workshop on MATH-AI
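
Note: the abstract above defines the pseudo-dual task as filling numbers back into a word problem whose numbers have been masked. The sketch below only illustrates that data format, under stated assumptions: the `[NUM]` placeholder, the regular expression, and both function names are hypothetical and are not taken from the paper or its repository. In the paper the infilling is learned by a model; here the numbers are simply put back in order.

```python
import re
from typing import List, Tuple

# Hypothetical mask token; the abstract does not specify the actual one.
MASK = "[NUM]"
# Matches integers and simple decimals such as "3" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def mask_numbers(problem: str) -> Tuple[str, List[str]]:
    """Replace every number in the text with MASK; return the masked text and the numbers in order."""
    numbers = NUMBER.findall(problem)
    return NUMBER.sub(MASK, problem), numbers


def fill_numbers(masked: str, numbers: List[str]) -> str:
    """Fill the given numbers back into the masked slots, left to right."""
    remaining = iter(numbers)
    return re.sub(re.escape(MASK), lambda _match: next(remaining), masked)


problem = "Tom has 3 apples and buys 4.5 more."
masked, numbers = mask_numbers(problem)
restored = fill_numbers(masked, numbers)
```

Here `masked` is the input a number-infilling model would see, and `restored` shows the target it would be expected to reproduce.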

arXiv:2310.09368 [pdf, other]

Constrained re-calibration of Reynolds-averaged Navier-Stokes models

Authors: Yuanwei Bin, George Huang, Robert Kunz, Xiang I A Yang

Abstract: The constants and functions in Reynolds-averaged Navier Stokes (RANS) turbulence models are coupled. Consequently, modifications of a RANS model often negatively impact its basic calibrations, which is why machine-learned augmentations are often detrimental outside the training dataset. A solution to this is to identify the degrees of freedom that do not affect the basic calibrations and only modify these identified degrees of freedom when re-calibrating the baseline model to accommodate a specific application. This approach is colloquially known as the "rubber-band" approach, which we formally call "constrained model re-calibration" in this article. To illustrate the efficacy of the approach, we identify the degrees of freedom in the Spalart-Allmaras (SA) model that do not affect the log law calibration. By subsequently interfacing data-based methods with these degrees of freedom, we train models to solve historically challenging flow scenarios, including the round-jet/plane-jet anomaly, airfoil stall, secondary flow separation, and recovery after separation. In addition to good performance inside the training dataset, the trained models yield similar performance as the baseline model outside the training dataset.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.09367 [pdf, other]

Large-eddy simulation of separated flows on unconventionally coarse grids

Authors: Yuanwei Bin, George I. Park, Yu Lv, Xiang I. A. Yang

Abstract: We examine and benchmark the emerging idea of applying the large-eddy simulation (LES) formalism to unconventionally coarse grids where RANS would be considered more appropriate at first glance. We distinguish this idea from very-large-eddy-simulation (VLES) and detached-eddy-simulation (DES), which require switching between RANS and LES formalism. LES on RANS grid is appealing because first, it requires minimal changes to a production code; second, it is more cost-effective than LES; third, it converges to LES; and most importantly, it accurately predicts flows with separation. This work quantifies the benefit of LES on RANS-like grids as compared to RANS on the same grids. Three canonical cases are considered: periodic hill, backward-facing step, and jet in cross flow. We conduct direct numerical simulation (DNS), proper LES on LES grids, LES on RANS-quality grids, and RANS. We show that while the LES solutions on the RANS-quality grids are not grid converged, they are twice as accurate as the RANS on the same grids.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.09366 [pdf, other]

A priori screening of data-enabled turbulence models

Authors: Peng E S Chen, Yuanwei Bin, Xiang I A Yang, Yipeng Shi, Mahdi Abkar, George I. Park

Abstract: Assessing the compliance of a white-box turbulence model with known turbulent knowledge is straightforward. It enables users to screen conventional turbulence models and identify apparent inadequacies, thereby allowing for a more focused and fruitful validation and verification. However, comparing a black-box machine-learning model to known empirical scalings is not straightforward. Unless one implements and tests the model, it would not be clear if a machine-learning model, trained at finite Reynolds numbers preserves the known high Reynolds number limit. This is inconvenient, particularly because model implementation involves retraining and re-interfacing. This work attempts to address this issue, allowing fast a priori screening of machine-learning models that are based on feed-forward neural networks (FNN). The method leverages the mathematical theorems we present in the paper. These theorems offer estimates of a network's limits even when the exact weights and biases are unknown. For demonstration purposes, we screen existing machine-learning wall models and RANS models for their compliance with the log layer physics and the viscous layer physics in a priori manner. In addition, the theorems serve as essential guidelines for future machine-learning models.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2309.04800 [pdf, other]

VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis

Authors: Xinya Chen, Jiaxin Huang, Yanrui Bin, Lu Yu, Yiyi Liao

Abstract: Unsupervised learning of 3D-aware generative adversarial networks has lately made much progress. Some recent work demonstrates promising results of learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not perform well when generalizing to novel pose/shape and are not part controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates for mapping it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, shape, as well as enabling part-level editing.

Submitted 9 September, 2023; originally announced September 2023.

arXiv:2308.04380 [pdf, other]

doi 10.1145/3581783.3612101

Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

Authors: Haoxuan Li, Yi Bin, Junrong Liao, Yang Yang, Heng Tao Shen

Abstract: Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet of <anchor, positive, negative> is important for effectively training the model, e.g., hard negatives make the model learn efficiently and effectively. However, we observe that existing methods mainly employ the most similar samples as hard negatives, which may not be true negatives. In other words, the samples with high similarity but not paired with the anchor may reserve positive semantic associations, and we call them false negatives. Repelling these false negatives in triplet loss would mislead the semantic representation learning and result in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which could alleviate the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately via their similarities with the anchor, based on the features extracted from image and text encoders. Then we calculate the false negative probability of a given sample based on its similarity with the anchor and the above distributions via Bayes' rule, which is employed as the sampling weight during the negative sampling process. Since there may not exist any false negative in a small batch, we design a memory module with momentum to retain a large negative buffer and implement our negative sampling strategy spanning over the buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights for the simple negatives with a cut-down strategy. Extensive experiments are conducted on Flickr30K and MS-COCO, and the results demonstrate the superiority of our proposed false negative elimination strategy. The code is available at https://github.com/LuminosityX/FNE.

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted at ACM MM 2023

arXiv:2308.04343 [pdf, other]

doi 10.1145/3581783.3612427

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Authors: Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen

Abstract: Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences of different layers between images and texts. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative score improvement of Recall@1 on MSCOCO, and 4.4% and 11.6% on Flickr30K, respectively. The code is available at https://github.com/LuminosityX/HAT.

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted at ACM Multimedia 2023

arXiv:2306.11746 [pdf, other]

Focusing on Relevant Responses for Multi-modal Rumor Detection

Authors: Jun Li, Yi Bin, Liang Peng, Yang Yang, Yangyang Li, Hao Jin, Zi Huang

Abstract: In the absence of an authoritative statement about a rumor, people may expose the truth behind such a rumor through their responses on social media. Most rumor detection methods aggregate the information of all the responses and have made great progress. However, due to the different backgrounds of users, the responses have different relevance for discovering the suspicious points hidden in a rumor claim. The methods that focus on all the responding tweets would dilute the effect of the critical ones. Moreover, for a multi-modal rumor claim, the focus of a user may be on several words in the text or an object in the image, so the different modalities should be considered to select the relevant responses and verify the claim. In this paper, we propose a novel multi-modal rumor detection model, termed Focal Reasoning Model (FoRM), to filter out the irrelevant responses and further conduct fine-grained reasoning with the multi-modal claim and corresponding responses. Concretely, there are two main components in our FoRM: the coarse-grained selection and the fine-grained reasoning. The coarse-grained selection component leverages the post-level features of the responses to verify the claim and learns a relevance score for each response. Based on the relevance scores, the most relevant responses are reserved as the critical ones for further reasoning. In the fine-grained reasoning component, we design a relation attention module to explore the fine-grained relations, i.e., token-to-token and token-to-object relations, between the reserved responses and the multi-modal claim for finding out the valuable clues. Extensive experiments have been conducted on two real-world datasets, and the results demonstrate that our proposed model outperforms all the baselines.

Submitted 18 June, 2023; originally announced June 2023.

Comments: Submitted to TKDE

arXiv:2305.04556 [pdf, other]

Non-Autoregressive Math Word Problem Solver with Unified Tree Structure

Authors: Yi Bin, Mengqun Han, Wenhao Shi, Lei Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen

Abstract: Existing MWP solvers employ a sequence or binary tree to present the solution expression and decode it from the given problem description. However, such structures fail to handle the variants that can be derived via mathematical manipulation, e.g., $(a_1+a_2) * a_3$ and $a_1 * a_3+a_2 * a_3$ can both be valid solutions for the same problem but are formulated as different expression sequences or trees. The multiple solution variants depicting different possible solving procedures for the same input problem would raise two issues: 1) making it hard for the model to learn the mapping function between the input and output spaces effectively, and 2) wrongly indicating "wrong" when evaluating a valid expression variant. To address these issues, we introduce a unified tree structure to present a solution expression, where the elements are permutable and identical for all the expression variants. We propose a novel non-autoregressive solver, named MWP-NAS, to parse the problem and deduce the solution expression based on the unified tree. For evaluating the possible expression variants, we design a path-based metric to evaluate the partial accuracy of expressions of a unified tree. The results from extensive experiments conducted on Math23K and MAWPS demonstrate the effectiveness of our proposed MWP-NAS. The codes and checkpoints are available at: https://github.com/mengqunhan/MWP-NAS.

Submitted 28 October, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted at EMNLP 2023

arXiv:2201.02062 [pdf]

Traffic Flow Modeling for UAV-Enabled Wireless Networks

Authors: A. Abada, Y. Bin, T. Taleb

Abstract: This paper investigates the traffic flow modeling issue in multi-services oriented unmanned aerial vehicle (UAV)-enabled wireless networks, which is critical for supporting various future applications of such networks. We propose a general traffic flow model for multi-services oriented UAV-enabled wireless networks. Under this model, we first classify the network services into three subsets: telemetry, Internet of Things (IoT), and streaming data. Based on the Pareto distribution, we then partition all UAVs into three subgroups with different network usage. We further determine the number of packets for different network services and the total data size according to the packet arrival rate for the nine segments, each of which represents one mapping between a subset of services and a subgroup of UAVs. Simulation results are provided to illustrate that the number of packets and the data size predicted by our traffic model match well with those under a real scenario.

Submitted 5 January, 2022; originally announced January 2022.

arXiv:2105.03072 [pdf, other]

NTIRE 2021 Challenge on Perceptual Image Quality Assessment

Authors: Jinjin Gu, Haoming Cai, Chao Dong, Jimmy S. Ren, Yu Qiao, Shuhang Gu, Radu Timofte, Manri Cheon, Sungjun Yoon, Byungyeon Kang, Junwoo Lee, Qing Zhang, Haiyang Guo, Yi Bin, Yuqing Hou, Hengliang Luo, Jingyu Guo, Zirui Wang, Hai Wang, Wenming Yang, Qingyan Bai, Shuwei Shi, Weihao Xia, Mingdeng Cao, Jiahao Wang, et al. (25 additional authors not shown)

Abstract: This paper reports on the NTIRE 2021 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2021. As a new type of image processing technology, perceptual image processing algorithms based on Generative Adversarial Networks (GAN) have produced images with more realistic textures. These output images have completely different characteristics from traditional distortions, and thus pose a new challenge for IQA methods to evaluate their visual quality. In comparison with previous IQA challenges, the training and testing datasets in this challenge include the outputs of perceptual image processing algorithms and the corresponding subjective scores. Thus they can be used to develop and evaluate IQA methods on GAN-based distortions. The challenge has 270 registered participants in total. In the final testing stage, 13 participating teams submitted their models and fact sheets. Almost all of them have achieved much better results than existing IQA methods, while the winning method can demonstrate state-of-the-art performance.

Submitted 28 June, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2011.11221 [pdf, other]

Adversarial Refinement Network for Human Motion Prediction

Authors: Xianjin Chao, Yanrui Bin, Wenqing Chu, Xuan Cao, Yanhao Ge, Chengjie Wang, Jilin Li, Feiyue Huang, Howard Leung

Abstract: Human motion prediction aims to predict future 3D skeletal sequences by giving a limited human motion as inputs. Two popular methods, recurrent neural networks and feed-forward deep networks, are able to predict rough motion trend, but motion details such as limb movement may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) following a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and coarse prediction as input of our cascaded refinement network to predict refined human motion and strengthen the refinement network with adversarial error augmentation. During training, we deliberately introduce the error distribution by learning through the adversarial mechanism among different subjects. In testing, our cascaded refinement network alleviates the prediction error from the coarse predictor resulting in a finer prediction robustly. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions in both short-term and long-term predictions.

Submitted 23 November, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: Accepted by ACCV 2020 (Oral)

arXiv:2008.00697 [pdf, other]

Adversarial Semantic Data Augmentation for Human Pose Estimation

Authors: Yanrui Bin, Xuan Cao, Xinya Chen, Yanhao Ge, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Changxin Gao, Nong Sang

Abstract: Human pose estimation is the task of localizing body keypoints from still images. The state-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and nearby person. To enlarge the number of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configuration. Given an off-the-shelf pose estimation network as discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks.

Submitted 3 August, 2020; originally announced August 2020.

arXiv:2005.09816 [pdf, other]

Relevant Region Prediction for Crowd Counting

Authors: Xinya Chen, Yanrui Bin, Changxin Gao, Nong Sang, Hao Tang

Abstract: Crowd counting is a widely studied and challenging task in computer vision. Existing density map based methods excessively focus on the individuals' localization, which harms the crowd counting performance in highly congested scenes. In addition, the dependency between the regions of different density is also ignored. In this paper, we propose Relevant Region Prediction (RRP) for crowd counting, which consists of the Count Map and the Region Relation-Aware Module (RRAM). Each pixel in the count map represents the number of heads falling into the corresponding local area in the input image, which discards the detailed spatial information and forces the network to pay more attention to counting rather than localizing individuals. Based on the Graph Convolutional Network (GCN), the Region Relation-Aware Module is proposed to capture and exploit the important region dependency. The module builds a fully connected directed graph between the regions of different density where each node (region) is represented by a weighted global pooled feature, and the GCN is learned to map this region graph to a set of relation-aware region representations. Experimental results on three datasets show that our method clearly outperforms other existing state-of-the-art methods.

Submitted 19 May, 2020; originally announced May 2020.

Comments: accepted by Neurocomputing

arXiv:1606.04631 [pdf, other]

Bidirectional Long-Short Term Memory for Video Description

Authors: Yi Bin, Yang Yang, Zi Huang, Fumin Shen, Xing Xu, Heng Tao Shen

Abstract: Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or just employ local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass, a backward LSTM pass, together with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly-used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach as compared to several state-of-the-art methods.

Submitted 14 June, 2016; originally announced June 2016.

Comments: 5 pages

arXiv:1311.4659 [pdf]

A Fast local Reconstruction algorithm by selective backprojection for Low-Dose in Dental Computed Tomography

Authors: Yan Bin, Deng Lin, Han Yu, Zhang Feng, Wang Xian Chao, Li Lei

Abstract: High radiation dose in computed tomography (CT) scans increases the lifetime risk of cancer, which has become a major clinical concern. The backprojection-filtration (BPF) algorithm could reduce radiation dose by reconstructing images from truncated data in a short scan. In dental CT, it could reduce radiation dose for the teeth by using the projection acquired in a short scan, and could avoid irradiation to other parts by using truncated projection. However, the limit of integration for backprojection varies per PI-line, resulting in low calculation efficiency and poor parallel performance. Recently, a tent BPF (T-BPF) has been proposed to improve calculation efficiency by rearranging projection. However, the memory-consuming data rebinning process is included. Accordingly, the chose-BPF (C-BPF) algorithm is proposed in this paper. In this algorithm, the derivative of projection is backprojected to the points whose x coordinate is less than that of the source focal spot to obtain the differentiated backprojection (DBP). The finite Hilbert inverse is then applied to each PI-line segment. C-BPF avoids the influence of the variable limit of integration by selective backprojection without additional time cost or memory cost. The simulation experiment and the real experiment demonstrated the higher reconstruction efficiency of C-BPF.

Submitted 19 November, 2013; originally announced November 2013.

Comments: 18 pages, 10 figures

MSC Class: 78-05

arXiv:1208.1379 [pdf, ps, other]

All-optical Switch Based on Optical Waveguide Coupling with Micro Cavity Array

Authors: Yang Bin, Li Heling, Zhao Hongsheng, Yang Tao

Abstract: This paper theoretically analyzes the optical transmission characteristics of an optical waveguide when coupling to a micro cavity array. The results showed that not only were there sharp peaks on the transmission and reflection spectra, but also that a certain system configuration can produce a backward wave to obtain a phase shift in the small detuning range between the incident wave and the micro cavities. Based on a discussion of the relationship between the transmission and the number and the dissipation coefficients of each coupled micro cavity, the paper proposes a high efficiency all-optical switch model.

Submitted 7 August, 2012; originally announced August 2012.

Showing 1–28 of 28 results for author: Bin, Y