Search | arXiv e-print repository

arXiv:2009.14167 [pdf, other]

Contrastive Distillation on Intermediate Representations for Language Model Compression

Authors: Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, Jingjing Liu

Abstract: Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.

Submitted 29 September, 2020; originally announced September 2020.

Comments: Accepted by EMNLP 2020
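
The contrastive objective mentioned in the abstract above can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the CoDIR implementation: the function names, the cosine-similarity scoring, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(student, teacher_pos, teacher_negs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact form).

    Returns -log of the softmax probability assigned to the positive
    teacher representation among the positive and all negatives.
    """
    s = normalize(student)
    candidates = [normalize(teacher_pos)] + [normalize(t) for t in teacher_negs]
    logits = [dot(s, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical toy vectors standing in for intermediate representations.
student = [1.0, 0.2, 0.0]
teacher_pos = [0.9, 0.3, 0.1]
teacher_negs = [[-1.0, 0.5, 0.2], [0.0, -1.0, 0.3], [0.1, 0.0, -1.0]]

aligned = contrastive_loss(student, teacher_pos, teacher_negs)
misaligned = contrastive_loss(
    student, teacher_negs[0], [teacher_pos] + teacher_negs[1:]
)
print(aligned, misaligned)
```

The loss is small when the student vector is closest to its matching teacher vector and large when a negative is treated as the positive, which is the behavior a contrastive distillation objective relies on.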

arXiv:2009.06097 [pdf, other]

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Authors: Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu

Abstract: Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention suffers when handling inputs with extreme long-range dependencies, as its complexity grows quadratically with respect to the sequence length. Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer to perform attention across chunked sequences. The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer, which encode local sequence information and global context jointly and iteratively. This new design allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.

Submitted 7 June, 2021; v1 submitted 13 September, 2020; originally announced September 2020.

Comments: ACL Findings 2021, 11 pages
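
The sliding-window chunking that the abstract above describes as the usual workaround for long inputs can be sketched as follows. This is a plain illustration under assumed names and parameters (`sliding_window_chunks`, `window`, `stride`), not the Cluster-Former code; it only shows how overlapping chunks are formed and how many token pairs attention would score in each setting.

```python
def sliding_window_chunks(tokens, window, stride):
    """Split a token list into overlapping chunks of at most `window` tokens.

    Consecutive chunks start `stride` tokens apart; the last chunk ends
    at the end of the sequence.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


tokens = list(range(10))  # stand-in for token ids
chunks = sliding_window_chunks(tokens, window=4, stride=2)

# Token pairs scored by full self-attention versus per-chunk attention.
full_pairs = len(tokens) ** 2
chunk_pairs = sum(len(c) ** 2 for c in chunks)
print(chunks, full_pairs, chunk_pairs)
```

Per-chunk attention never relates tokens in distant chunks, which is the gap the abstract says the Cluster-Former layer is meant to close.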

arXiv:2009.05167 [pdf, other]

Accelerating Real-Time Question Answering via Question Generation

Authors: Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu, Chenguang Zhu

Abstract: Although deep neural networks have achieved tremendous success for question answering (QA), they still suffer from heavy computational and energy costs for real product deployment. Further, existing QA systems are bottlenecked by the encoding time of real-time questions with neural networks, thus suffering from detectable latency in deployment for large-volume traffic. To reduce the computational cost and accelerate real-time question answering (RTQA) for practical usage, we propose to remove all the neural networks from online QA systems, and present Ocean-Q (an Ocean of Questions), which introduces a new question generation (QG) model to generate a large pool of QA pairs offline, then in real time matches an input question with the candidate QA pool to predict the answer without question encoding. Ocean-Q can be readily deployed in existing distributed database systems or search engines for large-scale query usage, and is much greener, with no additional cost for maintaining large neural networks. Experiments on SQuAD(-open) and HotpotQA benchmarks demonstrate that Ocean-Q is able to accelerate the fastest state-of-the-art RTQA system by 4X, with only a 3+% accuracy drop.

Submitted 1 September, 2021; v1 submitted 10 September, 2020; originally announced September 2020.

arXiv:2009.05166 [pdf, other]

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

Authors: Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu

Abstract: Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods use only single-language input for LM finetuning, without leveraging the intrinsic cross-lingual alignment between different languages that proves essential for multilingual tasks. In this paper, we propose FILTER, an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. Specifically, FILTER first encodes text input in the source language and its translation in the target language independently in the shallow layers, then performs cross-language fusion to extract multilingual knowledge in the intermediate layers, and finally performs further language-specific encoding. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. For simple tasks such as classification, translated text in the target language shares the same label as the source language. However, this shared label becomes less accurate or even unavailable for more complex tasks such as question answering, NER and POS tagging. To tackle this issue, we further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language. Extensive experiments demonstrate that FILTER achieves new state of the art on two challenging multilingual multi-task benchmarks, XTREME and XGLUE.

Submitted 15 December, 2020; v1 submitted 10 September, 2020; originally announced September 2020.

Comments: Accepted to AAAI 2021; Top-1 Performance on XTREME (https://sites.research.google/xtreme, September 8, 2020) and XGLUE (https://microsoft.github.io/XGLUE, September 14, 2020) benchmark

arXiv:2006.14744 [pdf, other]

Graph Optimal Transport for Cross-Domain Alignment

Authors: Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu

Abstract: Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lack interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities in a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.

Submitted 24 July, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

Journal ref: ICML 2020

arXiv:2006.12013 [pdf, other]

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

Authors: Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, Lawrence Carin

Abstract: Mutual information (MI) minimization has gained considerable interest in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce an MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.

Submitted 23 July, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Accepted by the 37th International Conference on Machine Learning (ICML2020)
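
To make the idea of a contrastive log-ratio upper bound concrete, the sketch below estimates such a bound on correlated Gaussian samples. It rests on stated assumptions: the bound is taken to be the mean of log p(y|x) over paired samples minus its mean over all cross pairings (our reading of the CLUB definition, which the listing itself does not spell out), and the true conditional density is plugged in where the paper would use a variational approximation. All names and the toy setup are hypothetical.

```python
import math
import random


def log_cond_density(y, x, rho):
    """log p(y | x) for a standard bivariate Gaussian with correlation rho."""
    var = 1.0 - rho ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - (y - rho * x) ** 2 / (2.0 * var)


def club_estimate(xs, ys, rho):
    """Sample estimate of the assumed bound.

    positive: mean log p(y_i | x_i) over paired samples (joint distribution).
    negative: mean log p(y_j | x_i) over all (i, j) pairs (product of marginals).
    """
    n = len(xs)
    positive = sum(log_cond_density(y, x, rho) for x, y in zip(xs, ys)) / n
    negative = sum(log_cond_density(y, x, rho) for x in xs for y in ys) / (n * n)
    return positive - negative


random.seed(0)
rho = 0.5
n = 1000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [rho * x + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0) for x in xs]

club = club_estimate(xs, ys, rho)
true_mi = -0.5 * math.log(1.0 - rho ** 2)  # closed-form MI for this Gaussian pair
print(club, true_mi)
```

For this Gaussian pair the population value of the estimated quantity is rho^2 / (1 - rho^2), which is at least the closed-form mutual information, so the estimate should sit above `true_mi` up to sampling noise.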

arXiv:2006.11918 [pdf, ps, other]

MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients

Authors: Chen Zhu, Yu Cheng, Zhe Gan, Furong Huang, Jingjing Liu, Tom Goldstein

Abstract: Adaptive gradient methods such as RMSProp and Adam use an exponential moving estimate of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives. However, Adam can have undesirable convergence behaviors due to unstable or extreme adaptive learning rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the adaptive learning rates of Adam in the later stage of training, but they do not outperform Adam in some practical tasks such as training Transformers. In this paper, we propose an adaptive learning rate principle, in which the running mean of squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in a faster adaptation to the local gradient variance, which leads to more desirable empirical convergence behaviors than Adam. We prove the proposed algorithm converges under mild assumptions for nonconvex stochastic optimization problems, and demonstrate the improved efficacy of our adaptive averaging approach on machine translation, natural language understanding and large-batch pretraining of BERT. The code is available at https://github.com/zhuchen03/MaxVA.

Submitted 4 July, 2021; v1 submitted 21 June, 2020; originally announced June 2020.

Comments: ECML PKDD 2021

arXiv:2006.06195 [pdf, other]

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu

Abstract: We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: NeurIPS 2020 Spotlight paper

arXiv:2006.03315 [pdf, other]

Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020

Authors: Ke Lin, Zhuoxin Gan, Liwei Wang

Abstract: This report describes our model for VATEX Captioning Challenge 2020. First, to gather information from multiple domains, we extract motion, appearance, semantic and audio features. Then we design a feature attention module to attend to different features when decoding. We apply two types of decoders, top-down and X-LAN, and ensemble these models to get the final result. The proposed method outperforms the official baseline by a significant margin. We achieve 76.0 CIDEr and 50.0 CIDEr on the English and Chinese private test sets. We rank 2nd on both the English and Chinese private test leaderboards.

Submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.07310 [pdf, other]

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Authors: Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu

Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that destine their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) Learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually-interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architecture and objectives for multimodal pre-training.

Submitted 18 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

Comments: Accepted by ECCV 2020 as Spotlight

arXiv:2005.04201 [pdf]

doi 10.1002/adma.202003826

Scalable functionalization of optical fibers using atomically thin semiconductors

Authors: Gia Quyet Ngo, Antony George, Robin Tristan Klaus Schock, Alessandro Tuniz, Emad Najafidehaghani, Ziyang Gan, Nils C. Geib, Tobias Bucher, Heiko Knopf, Christof Neumann, Tilman Lühder, Stephen Warren-Smith, Heike Ebendorff-Heidepriem, Thomas Pertsch, Markus A. Schmidt, Andrey Turchanin, Falk Eilenberger

Abstract: Atomically thin transition metal dichalcogenides are highly promising for integrated optoelectronic and photonic systems due to their exciton-driven linear and nonlinear interaction with light. Integrating them into optical fibers yields novel opportunities in optical communication, remote sensing, and all-fiber optoelectronics. However, scalable and reproducible deposition of high quality monolayers on optical fibers is a challenge. Here, we report the chemical vapor deposition of monolayer MoS2 and WS2 crystals on the core of microstructured exposed core optical fibers and their interaction with the fibers' guided modes. We demonstrate two distinct application possibilities of 2D-functionalized waveguides to exemplify their potential. First, we simultaneously excite and collect excitonic 2D material photoluminescence with the fiber modes, opening a novel route to remote sensing. Then we show that third harmonic generation is modified by the highly localized nonlinear polarization of the monolayers, yielding a new avenue to tailor nonlinear optical processes in fibers. We anticipate that our results may lead to significant advances in optical fiber based technologies.

Submitted 2 September, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

Journal ref: Adv. Mater. 2020, 2003826

arXiv:2005.01279 [pdf, other]

Improving Adversarial Text Generation by Modeling the Distant Future

Authors: Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Dinghan Shen, Guoyin Wang, Zheng Wen, Lawrence Carin

Abstract: Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation. Further, automatically generating words with similar semantics is challenging, and hand-crafted linguistic rules are difficult to apply. We consider a text planning scheme and present a model-based imitation-learning approach to alleviate the aforementioned issues. Specifically, we propose a novel guider network to focus on the generative process over a longer horizon, which can assist next-word prediction and provide intermediate rewards for generator optimization. Extensive experiments demonstrate that the proposed method leads to improved performance.

Submitted 4 May, 2020; originally announced May 2020.

Comments: ACL 2020. arXiv admin note: substantial text overlap with arXiv:1811.00696

arXiv:2005.00558 [pdf, other]

POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

Authors: Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan

Abstract: Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation. However, these models cannot be directly employed to generate text under specified lexical constraints. To address this challenge, we present POINTER (PrOgressive INsertion-based TransformER), a simple yet novel insertion-based approach for hard-constrained text generation. The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable. We pre-train our model with the proposed progressive insertion-based objective on a 12GB Wikipedia dataset, and fine-tune it on downstream hard-constrained generation tasks. Non-autoregressive decoding yields an empirically logarithmic time complexity during inference time. Experimental results on both News and Yelp datasets demonstrate that POINTER achieves state-of-the-art performance on constrained text generation. We released the pre-trained models and the source code to facilitate future research (https://github.com/dreasysnail/POINTER).

Submitted 26 September, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

Comments: EMNLP 2020 long paper

arXiv:2005.00200 [pdf, other]

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

Authors: Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu

Abstract: We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains. We also introduce two new challenging benchmarks How2QA and How2R for Video QA and Retrieval, collected from diverse video content over multimodalities.

Submitted 29 September, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

Comments: Accepted by EMNLP 2020

arXiv:2005.00136 [pdf, other]

Contextual Text Style Transfer

Authors: Yu Cheng, Zhe Gan, Yizhe Zhang, Oussama Elachqar, Dianqi Li, Jingjing Liu

Abstract: We introduce a new task, Contextual Text Style Transfer - translating a sentence into a desired style with its surrounding context taken into account. This brings two key challenges to existing style transfer approaches: (i) how to preserve the semantic meaning of the target sentence and its consistency with surrounding context during transfer; (ii) how to train a robust model with limited labeled data accompanied with context. To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context. A classifier is further trained to ensure contextual consistency of the generated sentence. To compensate for the lack of parallel data, additional self-reconstruction and back-translation losses are introduced to leverage non-parallel data in a semi-supervised fashion. Two new benchmarks, Enron-Context and Reddit-Context, are introduced for formality and offensiveness style transfer. Experimental results on these datasets demonstrate the effectiveness of the proposed CAST model over state-of-the-art methods across style accuracy, content preservation and contextual consistency metrics.

Submitted 30 April, 2020; originally announced May 2020.

arXiv:2005.00117 [pdf]

Discovering universal scaling laws in 3D printing of metals with genetic programming and dimensional analysis

Authors: Zhengtao Gan, Orion L. Kafka, Niranjan Parab, Cang Zhao, Olle Heinonen, Tao Sun, Wing Liu

Abstract: We leverage dimensional analysis and genetic programming (a type of machine learning) to discover two strikingly simple but universal scaling laws, which remain accurate for different materials, processing conditions, and machines in metal three-dimensional (3D) printing. The first one is extracted from high-fidelity high-speed synchrotron X-ray imaging, and defines a new dimensionless number, Keyhole number, to predict melt-pool vapor depression depth. The second predicts porosity using the Keyhole number and another dimensionless number, normalized energy density. By reducing the dimensions of these longstanding problems, the low-dimensional scaling laws will aid process optimization and defect elimination, and potentially lead to a quantitative predictive framework for the critical issues in metal 3D printing. Moreover, the method itself is broadly applicable to a range of scientific areas.

Submitted 27 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

arXiv:2005.00054 [pdf, other]

APo-VAE: Text Generation in Hyperbolic Space

Authors: Shuyang Dai, Zhe Gan, Yu Cheng, Chenyang Tao, Lawrence Carin, Jingjing Liu

Abstract: Natural language often exhibits inherent hierarchical structure ingrained with complex syntax and semantics. However, most state-of-the-art deep generative models learn embeddings only in Euclidean vector space, without accounting for this structural property of language. In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations. An Adversarial Poincare Variational Autoencoder (APo-VAE) is presented, where both the prior and variational posterior of latent variables are defined over a Poincare ball via wrapped normal distributions. By adopting the primal-dual formulation of KL divergence, an adversarial learning procedure is introduced to empower robust model training. Extensive experiments in language modeling and dialog-response generation tasks demonstrate the winning effectiveness of the proposed APo-VAE model over VAEs in Euclidean latent space, thanks to its superb capabilities in capturing latent language hierarchies in hyperbolic space.

Submitted 14 July, 2021; v1 submitted 30 April, 2020; originally announced May 2020.

arXiv:2003.11690 [pdf, other]

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

Authors: Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu

Abstract: We propose a new task towards more practical application for image generation - high-quality image synthesis from salient object layout. This new setting allows users to provide the layout of salient objects only (i.e., foreground bounding boxes and categories), and lets the model complete the drawing with an invented background and a matching foreground. Two main challenges spring from this new task: (i) how to generate fine-grained details and realistic textures without segmentation map input; and (ii) how to create a background and weave it seamlessly into standalone objects. To tackle this, we propose Background Hallucination Generative Adversarial Network (BachGAN), which first selects a set of segmentation maps from a large candidate pool via a background retrieval module, then encodes these candidate layouts via a background fusion module to hallucinate a suitable background for the given objects. By generating the hallucinated background representation dynamically, our model can synthesize high-resolution images with both photo-realistic foreground and integral background. Experiments on Cityscapes and ADE20K datasets demonstrate the advantage of BachGAN over existing methods, measured on both visual fidelity of generated images and visual alignment between output images and input layouts.

Submitted 27 March, 2020; v1 submitted 25 March, 2020; originally announced March 2020.

Comments: Accepted to CVPR 2020

arXiv:2003.11618 [pdf, other]

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

Authors: Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

Abstract: We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.

Submitted 25 March, 2020; originally announced March 2020.

Comments: Accepted to CVPR 2020

arXiv:2001.06944 [pdf, other]

Nested-Wasserstein Self-Imitation Learning for Sequence Generation

Authors: Ruiyi Zhang, Changyou Chen, Zhe Gan, Zheng Wen, Wenlin Wang, Lawrence Carin

Abstract: Reinforcement learning (RL) has been widely studied for improving sequence-generation models. However, the conventional rewards used for RL training typically cannot capture sufficient semantic information and therefore render model bias. Further, the sparse and delayed rewards make RL exploration inefficient. To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. To further exploit it, a novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences for enhanced exploration and better semantic matching. Our solution can be understood as approximately executing proximal policy optimization with Wasserstein trust-regions. Experiments on a variety of unconditional and conditional sequence-generation tasks demonstrate the proposed approach consistently leads to improved performance.

Submitted 19 January, 2020; originally announced January 2020.

Comments: Accepted by AISTATS 2020

arXiv:1912.06147 [pdf, other]

Infra-Red Emission from Cold Gas Dusty Disks in Massive Ellipticals

Authors: Zhaoming Gan, Brandon S. Hensley, Jeremiah P. Ostriker, Luca Ciotti, David Schiminovich, Silvia Pellegrini

Abstract: What is the expected infrared output of elliptical galaxies? Here we report the latest findings obtained in this high time resolution (~10 years) and high spatial resolution (2.5 parsec at center) study. We add a set of grain physics to the MACER code, including (a) dust grains made in passive stellar evolution; (b) dust grain growth due to collision and sticking; (c) grain destruction due to thermal sputtering; (d) dust cooling of hot gas via inelastic collisions; and (e) radiation pressure on dust grains. The code improvements enable us to analyze the effects of dust on metal depletion and AGN obscuration, and also to assess the infrared output of the modeled galaxies. We simulate a representative massive elliptical galaxy of a central stellar velocity dispersion ~ 260 km/s and modest rotation. We find that: (1) the circumnuclear disk (of a size <~ 1 kpc) is dusty in its outer region where most of the metals are in dust grains, while in the inner disk most of the dust grains are destroyed by the AGN irradiation; (2) the dusty disk is optically thick to both the starlight within the disk and the radiation from the central AGN. Thus the AGN will be obscured behind the disk, and the latter is of a covering factor ~ 0.2; (3) the dust infrared emission is mainly due to the AGN irradiation. The median infrared luminosity is ~ 2e44 erg/s, and it can reach >~ 1e46 erg/s during outbursts; (4) the duty cycles of the AGN activities, star formation, and the dust infrared luminosity roughly match observations, e.g., in most of its lifetime, the simulated galaxy is a stereotypical "quiescent" elliptical galaxy with L_{IR} ~ 1e11*L_{solar}, while the star formation rate can exceed 250 M_{solar}/yr during central outbursts.

Submitted 17 August, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: 20 pages, 15 figures; accepted by ApJ

arXiv:1912.05857 [pdf, other]

doi 10.1002/asna.202023776

Metal abundances in the MACER simulations of the hot interstellar medium

Authors: S. Pellegrini, Z. Gan, J. P. Ostriker, L. Ciotti

Abstract: A hot plasma is the dominant phase of the interstellar medium of early-type galaxies. Its origin can reside in stellar mass losses, residual gas from the formation epoch, and accretion from outside of the galaxies. Its evolution is linked to the dynamical structure of the host galaxy, to the supernova and AGN feedback, and to (late-epoch) star formation, in a way that has yet to be fully understood. Important clues about the origin and evolution of the hot gas come from the abundances of heavy metals, that have been studied with increasing detail with XMM-Newton and Chandra. We present recent high resolution hydrodynamical simulations of the hot gas evolution that include the above processes, and where several chemical species, originating in AGB stars and supernovae of type Ia and II, have also been considered. The high resolution, of few parsecs in the central galactic region, allows us to track the metal enrichment, transportation and dilution throughout the galaxy. The comparison of model results with observed abundances reveals a good agreement for the region enriched by the AGN wind, but also discrepancies for the diffuse hot gas; the latter indicate the need for a revision of standard assumptions, and/or the importance of neglected effects as those due to the dust, and/or residual uncertainties in deriving abundances from the X-ray spectra.

Submitted 12 December, 2019; originally announced December 2019.

Comments: 6 pages, 5 figures, Proceedings of the XMM-Newton 2019 science workshop "Astrophysics of hot plasma in extended X-ray sources", to be published in Astron. Nachr

arXiv:1912.03414 [pdf, other]

doi 10.1093/mnras/stz3474

Hot Gas Flows on Parsec Scale in the Low-Luminosity Active Galactic Nucleus NGC 3115

Authors: Zhiyuan Yao, Zhaoming Gan

Abstract: NGC 3115 is known as the low-luminosity active galactic nucleus which hosts the nearest ($z\sim0.002$) billion solar mass supermassive black hole ($\sim1.5\times10^9~M_\odot$). Its Bondi radius $r_\mathrm{B}$ ($\sim3\farcs6$) can be readily resolved with Chandra, which offers us an excellent opportunity to investigate the accretion flow onto a supermassive black hole. In this paper, we perform two-dimensional hydrodynamical numerical simulations, tailored for NGC 3115, on the mass flow across the Bondi radius. Our best fittings for the density and temperature agree well with the observations of the hot interstellar medium in the centre of NGC 3115. We find that the flow properties are solely determined by the local galaxy properties in the galaxy centre: (1) stellar winds (including supernova ejecta) supply the mass and energy sources for the accreting gas; (2) similar to the one-dimensional calculations, a stagnation radius $r_\mathrm{st}\sim0.1~r_\mathrm{B}$ is also found in the two-dimensional simulations, which divides the mass flow into an inflow-outflow structure; (3) the radiatively inefficient accretion flow theory applies well inside the stagnation radius, where the gravity is dominated by the supermassive black hole and the gas is supported by rotation; (4) beyond the stagnation radius, the stellar gravity dominates the spherical-like fluid dynamics and causes the transition from a steep density profile outside to a flat density profile inside the Bondi radius.

Submitted 6 December, 2019; originally announced December 2019.

Comments: 13 pages, 8 figures. Accepted for publication by MNRAS

arXiv:1911.08709 [pdf, other]

Graph-Driven Generative Models for Heterogeneous Multi-Task Learning

Authors: Wenlin Wang, Hongteng Xu, Zhe Gan, Bai Li, Guoyin Wang, Liqun Chen, Qian Yang, Wenqi Wang, Lawrence Carin

Abstract: We propose a novel graph-driven generative model that unifies multiple heterogeneous learning tasks into the same framework. The proposed model is based on the fact that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, thus embedding the nodes of the graph (i.e., samples for the tasks) in a uniform manner while specializing their organization and usage to different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation and admission-type prediction, we demonstrate that our method successfully leverages information across different tasks, boosting performance in all tasks and outperforming existing state-of-the-art approaches.

Submitted 20 November, 2019; originally announced November 2019.

Comments: Accepted by AAAI-2020

arXiv:1911.03829 [pdf, other]

Distilling Knowledge Learned in BERT for Text Generation

Authors: Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, Jingjing Liu

Abstract: Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT's idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on IWSLT German-English and English-Vietnamese MT datasets. Code is available at https://github.com/ChenRocks/Distill-BERT-Textgen.

Submitted 17 July, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: ACL 2020

arXiv:1911.03631 [pdf, other]

Hierarchical Graph Network for Multi-hop Question Answering

Authors: Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, Jingjing Liu

Abstract: In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes on different levels of granularity (questions, paragraphs, sentences, entities), the representations of which are initialized with pre-trained contextual encoders. Given this hierarchical graph, the initial node representations are updated through graph propagation, and multi-hop reasoning is performed via traversing through the graph edges for each subsequent sub-task (e.g., paragraph selection, supporting facts extraction, answer prediction). By weaving heterogeneous nodes into an integral unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously. Experiments on the HotpotQA benchmark demonstrate that the proposed model achieves new state of the art, outperforming existing multi-hop QA approaches.

Submitted 6 October, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: Accepted to EMNLP 2020

arXiv:1910.14142 [pdf, other]

Discourse-Aware Neural Extractive Text Summarization

Authors: Jiacheng Xu, Zhe Gan, Yu Cheng, Jingjing Liu

Abstract: Recently BERT has been adopted for document encoding in state-of-the-art text summarization models. However, sentence-based extractive models often result in redundant or uninformative phrases in the extracted summaries. Also, long-range dependencies throughout a document are not well captured by BERT, which is pre-trained on sentence pairs instead of documents. To address these issues, we present a discourse-aware neural summarization model - DiscoBert. DiscoBert extracts sub-sentential discourse units (instead of sentences) as candidates for extractive selection on a finer granularity. To capture the long-range dependencies among discourse units, structural discourse graphs are constructed based on RST trees and coreference mentions, encoded with Graph Convolutional Networks. Experiments show that the proposed model outperforms state-of-the-art methods by a significant margin on popular summarization benchmarks compared to other BERT-base models.

Submitted 24 April, 2020; v1 submitted 30 October, 2019; originally announced October 2019.

Comments: To appear at ACL 2020; Code available at https://github.com/jiacheng-xu/DiscoBERT

arXiv:1910.03230 [pdf, other]

Meta Module Network for Compositional Visual Reasoning

Authors: Wenhu Chen, Zhe Gan, Linjie Li, Yu Cheng, William Wang, Jingjing Liu

Abstract: Neural Module Network (NMN) exhibits strong interpretability and compositionality thanks to its handcrafted neural modules with explicit multi-hop reasoning capability. However, most NMNs suffer from two critical drawbacks: 1) scalability: customized module for specific function renders it impractical when scaling up to a larger set of functions in complex tasks; 2) generalizability: rigid pre-defined module inventory makes it difficult to generalize to unseen functions in new tasks/domains. To design a more powerful NMN architecture for practical use, we propose Meta Module Network (MMN) centered on a novel meta module, which can take in function recipes and morph into diverse instance modules dynamically. The instance modules are then woven into an execution graph for complex visual reasoning, inheriting the strong explainability and compositionality of NMN. With such a flexible instantiation mechanism, the parameters of instance modules are inherited from the central meta module, retaining the same model complexity as the function set grows, which promises better scalability. Meanwhile, as functions are encoded into the embedding space, unseen functions can be readily represented based on their structural similarity with previously observed ones, which ensures better generalizability. Experiments on GQA and CLEVR datasets validate the superiority of MMN over state-of-the-art NMN designs. Synthetic experiments on held-out unseen functions from GQA dataset also demonstrate the strong generalizability of MMN. Our code and model are released on GitHub at https://github.com/wenhuchen/Meta-Module-Network.

Submitted 7 November, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: Accepted to WACV 21 (Oral)

arXiv:1909.13456 [pdf, other]

Improving Textual Network Learning with Variational Homophilic Embeddings

Authors: Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, Lawrence Carin

Abstract: The performance of many network learning applications crucially hinges on the success of network embedding algorithms, which aim to encode rich network information into low-dimensional vertex-based vector representations. This paper considers a novel variational formulation of network embeddings, with special focus on textual networks. Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. Homophilic vertex embeddings encourage similar embedding vectors for related (connected) vertices. The proposed VHE promises better generalization for downstream tasks, robustness to incomplete observations, and the ability to generalize to unseen vertices. Extensive experiments on real-world networks, for multiple tasks, demonstrate that the proposed method consistently achieves superior performance relative to competing state-of-the-art approaches.

Submitted 30 September, 2019; originally announced September 2019.

Comments: Accepted to NeurIPS 2019

arXiv:1909.11764 [pdf, ps, other]

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Authors: Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, Jingjing Liu

Abstract: Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of BERT-base model from 78.3 to 79.4, and RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge. Experiments on CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of RoBERTa-large model on other tasks as well. Code is available at https://github.com/zhuchen03/FreeLB.

Submitted 23 April, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: Adding results with ALBERT

arXiv:1909.11740 [pdf, other]

UNITER: UNiversal Image-TExt Representation Learning

Authors: Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.

Submitted 17 July, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: ECCV 2020

arXiv:1909.11125 [pdf, other]

Leveraging the Template and Anchor Framework for Safe, Online Robotic Gait Design

Authors: Jinsun Liu, Pengcheng Zhao, Zhenyu Gan, Matthew Johnson-Roberson, Ram Vasudevan

Abstract: Online control design using a high-fidelity, full-order model for a bipedal robot can be challenging due to the size of the state space of the model. A commonly adopted solution to overcome this challenge is to approximate the full-order model (anchor) with a simplified, reduced-order model (template), while performing control synthesis. Unfortunately it is challenging to make formal guarantees about the safety of an anchor model using a controller designed in an online fashion using a template model. To address this problem, this paper proposes a method to generate safety-preserving controllers for anchor models by performing reachability analysis on template models while bounding the modeling error. This paper describes how this reachable set can be incorporated into a Model Predictive Control framework to select controllers that result in safe walking on the anchor model in an online fashion. The method is illustrated on a 5-link RABBIT model, and is shown to allow the robot to walk safely while utilizing controllers designed in an online fashion.

Submitted 24 September, 2019; originally announced September 2019.

arXiv:1909.05316 [pdf, other]

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

Authors: Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, Graham Neubig

Abstract: Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a realistically-natural and topically-coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a "high-quality" story to the human eye. Following this quality guideline, we propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluations demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.

Submitted 25 February, 2020; v1 submitted 11 September, 2019; originally announced September 2019.

Comments: Accepted paper in Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) 2020

arXiv:1909.05288 [pdf, other]

Contrastively Smoothed Class Alignment for Unsupervised Domain Adaptation

Authors: Shuyang Dai, Yu Cheng, Yizhe Zhang, Zhe Gan, Jingjing Liu, Lawrence Carin

Abstract: Recent unsupervised approaches to domain adaptation primarily focus on minimizing the gap between the source and the target domains through refining the feature generator, in order to learn a better alignment between the two domains. This minimization can be achieved via a domain classifier to detect target-domain features that are divergent from source-domain features. However, by optimizing via such domain classification discrepancy, ambiguous target samples that are not smoothly distributed on the low-dimensional data manifold are often missed. To solve this issue, we propose a novel Contrastively Smoothed Class Alignment (CoSCA) model, that explicitly incorporates both intra- and inter-class domain discrepancy to better align ambiguous target samples with the source domain. CoSCA estimates the underlying label hypothesis of target samples, and simultaneously adapts their feature representations by optimizing a proposed contrastive loss. In addition, Maximum Mean Discrepancy (MMD) is utilized to directly match features between source and target samples for better global alignment. Experiments on several benchmark datasets demonstrate that CoSCA can outperform state-of-the-art approaches for unsupervised domain adaptation by producing more discriminative features.

Submitted 6 October, 2020; v1 submitted 11 September, 2019; originally announced September 2019.

arXiv:1909.02050 [pdf, other]

TIGEr: Text-to-Image Grounding for Image Caption Evaluation

Authors: Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao

Abstract: This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr makes it possible to evaluate caption quality not only based on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than alternative existing metrics. We also comprehensively assess the metric's effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.

Submitted 4 September, 2019; originally announced September 2019.

arXiv:1908.09395 [pdf, other]

Domain Adaptive Text Style Transfer

Authors: Dianqi Li, Yizhe Zhang, Zhe Gan, Yu Cheng, Chris Brockett, Ming-Ting Sun, Bill Dolan

Abstract: Text style transfer without parallel data has achieved some practical success. However, in the scenario where less data is available, these methods may yield poor performance. In this paper, we examine domain adaptation for text style transfer to leverage massively available data from other domains. These data may demonstrate domain shift, which impedes the benefits of utilizing such data for training. To address this challenge, we propose simple yet effective domain adaptive text style transfer models, enabling domain-adaptive information exchange. The proposed models presumably learn from the source domain to: (i) distinguish stylized information and generic content information; (ii) maximally preserve content information; and (iii) adaptively transfer the styles in a domain-aware manner. We evaluate the proposed models on two style transfer tasks (sentiment and formality) over multiple target domains where only limited non-parallel data is available. Extensive experiments demonstrate the effectiveness of the proposed model compared to the baselines.

Submitted 25 August, 2019; originally announced August 2019.

Comments: EMNLP 2019, long paper

arXiv:1908.09355 [pdf, other]

Patient Knowledge Distillation for BERT Model Compression

Authors: Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu

Abstract: Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.

Submitted 25 August, 2019; originally announced August 2019.

Comments: Accepted to EMNLP 2019

arXiv:1908.09209 [pdf, other]

Adversarial Domain Adaptation for Machine Reading Comprehension

Authors: Huazheng Wang, Zhe Gan, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Hongning Wang

Abstract: In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain. To this end, we propose an Adversarial Domain Adaptation framework (AdaMRC), where ($i$) pseudo questions are first generated for unlabeled passages in the target domain, and then ($ii$) a domain classifier is incorporated into an MRC model to predict which domain a given passage-question pair comes from. The classifier and the passage-question encoder are jointly trained using adversarial learning to enforce domain-invariant representation learning. Comprehensive evaluations demonstrate that our approach ($i$) is generalizable to different MRC models and datasets, ($ii$) can be combined with pre-trained large-scale language models (such as ELMo and BERT), and ($iii$) can be extended to semi-supervised learning.

Submitted 24 August, 2019; originally announced August 2019.

Comments: Accepted to EMNLP 2019

arXiv:1907.01724 [pdf, other]

doi 10.4208/cicp.OA-2020-0002

Harmonic surface mapping algorithm for electrostatic potentials in an atomistic/continuum hybrid model for electrolyte solutions

Authors: Jing Fu, Zecheng Gan

Abstract: Simulating charged many-body systems has been a computationally demanding task due to the long-range nature of electrostatic interaction. For the multi-scale model of electrolytes which combines the strengths of atomistic/continuum electrolyte representations, a harmonic surface mapping algorithm is developed for fast and accurate evaluation of the electrostatic reaction potentials. Our method reformulates the reaction potential into a sum of image charges for the near-field, and a charge density on an auxiliary spherical surface for the far-field, which can be further discretized into point charges. The fast multipole method is used to accelerate the pairwise Coulomb summation. The accuracy and efficiency of our algorithm, as well as the choice of relevant numerical parameters, are demonstrated in detail. As a concrete example, for charges close to the dielectric interface, our method can improve the accuracy by two orders of magnitude compared to the Kirkwood series expansion method.

Submitted 6 January, 2020; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: 17 pages, 5 figures

Journal ref: Commun. Comput. Phys., Vol. 29, No. 2, pp. 571-587, 2021

arXiv:1904.04961 [pdf, ps, other]

doi 10.1016/j.physletb.2019.135096

Fine structure in the $α$ decay of $^{223}$U

Authors: M. D. Sun, Z. Liu, T. H. Huang, W. Q. Zhang, A. N. Andreyev, B. Ding, J. G. Wang, X. Y. Liu, H. Y. Lu, D. S. Hou, Z. G. Gan, L. Ma, H. B. Yang, Z. Y. Zhang, L. Yu, J. Jiang, K. L. Wang, Y. S. Wang, M. L. Liu, Z. H. Li, J. Li, X. Wang, A. H. Feng, C. J. Lin, L. J. Sun , et al. (7 additional authors not shown)

Abstract: Fine structure in the $α$ decay of $^{223}$U was observed in the fusion-evaporation reaction $^{187}$Re($^{40}$Ar, p3n) by using a fast digital pulse processing technique. Two $α$-decay branches of $^{223}$U feeding the ground state and 244 keV excited state of $^{219}$Th were identified by establishing the decay chain $^{223}$U $\xrightarrow{α_{1}}$ $^{219}$Th $\xrightarrow{α_{2}}$ $^{215}$Ra $\xrightarrow{α_{3}}$ $^{211}$Rn. The $α$-particle energy for the ground-state to ground-state transition of $^{223}$U was determined to be 8993(17) keV, 213 keV higher than the previous value, and the half-life was updated to be 62$^{+14}_{-10}$ $μ$s. Evolution of nuclear structure for $N$ = 131 even-$Z$ isotones from Po to U was discussed in the frameworks of nuclear mass and reduced $α$-decay width, and a weakening octupole deformation in the ground state of $^{223}$U relative to its lighter isotones $^{219}$Ra and $^{221}$Th was suggested.

Submitted 22 October, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

Comments: 6 pages, 6 figures

Journal ref: PhysicsLettersB800(2020)135096

arXiv:1904.01415 [pdf, ps, other]

doi 10.1007/s11432-018-9645-3

Synthesis of model predictive control based on data-driven learning

Authors: Yuanqiang Zhou, Dewei Li, Yugeng Xi, Zhongxue Gan

Abstract: For the application of MPC design in on-line regulation or tracking control problems, several studies have attempted to develop an accurate model, and realize adequate uncertainty description of linear or non-linear plants of the processes. In this study, we employ the data-driven learning technique to iteratively approximate the dynamical parameters, without requiring a priori knowledge of system matrices. The proposed MPC approach can predict and optimize the future behaviors using multiorder derivatives of control input as decision variables. Because the proposed algorithm can obtain a linear system model at each sampling, it can adapt to the actual dynamics of time-varying or nonlinear plants. This methodology can serve as a data-driven identification tool to study adaptive optimal control problems for unknown complex systems.

Submitted 29 March, 2019; originally announced April 2019.

Comments: 4 pages

Journal ref: SCIENCE CHINA Information Sciences, 2019

arXiv:1903.12314 [pdf, other]

Relation-Aware Graph Attention Network for Visual Question Answering

Authors: Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu

Abstract: In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures, and can be used as a generic relation encoder to boost the model performance for VQA.

Submitted 9 October, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: To appear in ICCV 2019

arXiv:1903.11959 [pdf, ps, other]

doi 10.1088/1674-1056/28/5/054702

On-node lattices construction using $\textit{partial}$ Gauss-Hermite quadrature for the lattice Boltzmann method

Authors: Huanfeng Ye, Zecheng Gan, Bo Kuang, Yanhua Yang

Abstract: A concise theoretical framework, the $\textit{partial}$ Gauss-Hermite quadrature (pGHQ), is established for constructing on-node lattices of the lattice Boltzmann (LB) method under a Cartesian coordinate system. Compared with existing approaches, the pGHQ scheme has the following advantages: $\textbf{a).}$ extremely concise algorithm, $\textbf{b).}$ unifying the constructing procedure of symmetric and asymmetric on-node lattices, $\textbf{c).}$ covering full-range quadrature degree of a given discrete velocity set. We employ it to search the local optimal and asymmetric lattices for $\left\{ {n = 3,4,5,6,7} \right\}$ moment degree equilibrium distribution discretization on range $\left[ { - 10,10} \right]$. The search reveals a surprising abundance of available lattices. Through a brief analysis, the discrete velocity set shows a significant influence on the positivity of equilibrium distributions, which is considered a major factor in the numerical stability of the LB method. Hence the results of the pGHQ scheme lay a foundation for further investigations on improving the numerical stability of the LB method by modifying the discrete velocity set. It is also worth noting that pGHQ can be extended into the entropic LB model though it was proposed for the Hermite polynomial expansion LB theory.

Submitted 25 March, 2019; originally announced March 2019.

Journal ref: Chin. Phys. B, Vol. 28, No. 5 (2019) 054702

arXiv:1903.07137 [pdf, other]

Topic-Guided Variational Autoencoders for Text Generation

Authors: Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, Lawrence Carin

Abstract: We propose a topic-guided variational autoencoder (TGVAE) model for text generation. Distinct from existing variational autoencoder (VAE) based approaches, which assume a simple Gaussian prior for the latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides guidance to generate sentences under the topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during model inference. Experimental results show that our TGVAE outperforms alternative approaches on both unconditional and conditional text generation, which can generate semantically-meaningful sentences with various topics.

Submitted 17 March, 2019; originally announced March 2019.

arXiv:1903.04504 [pdf, other]

Multi-Physics of AGN Jets in the Multi-Messenger Era

Authors: B. Rani, M. Petropoulou, H. Zhang, F. D'Ammando, J. Finke, M. Baring, M. Böttcher, S. Dimitrakoudis, Z. Gan, D. Giannios, D. H. Hartmann, T. P. Krichbaum, A. P. Marscher, A. Mastichiadis, K. Nalewajko, R. Ojha, D. Paneque, C. Shrader, L. Sironi, A. Tchekhovskoy, D. J. Thompson, N. Vlahakis, T. M. Venters

Abstract: Active galactic nuclei (AGN) with relativistic jets, powered by gas accretion onto their central supermassive black hole (SMBH), are unique laboratories for studying the physics of matter and elementary particles in extreme conditions that cannot be realized on Earth. For a long time since the discovery of AGN, photons were the only way to probe the underlying physical processes. The recent discovery of a very high energy neutrino, IceCube-170922A, coincident with a flaring blazar, TXS 0506+056, provides the first evidence that AGN jets are multi-messenger sources; they are capable of accelerating hadrons to very high energies, while producing non-thermal EM radiation and high-energy neutrinos. This new era of multi-messenger astronomy, which will mature in the next decade, offers us the unprecedented opportunity to combine more than one messenger to solve some long-standing puzzles of AGN jet physics: How do jets dissipate their energy to accelerate particles? What is the jet total kinetic power? Where and how do jets produce high-energy emission and neutrinos? What physical mechanisms drive the particle acceleration?

Submitted 11 March, 2019; originally announced March 2019.

Comments: submitted to Astro2020 (Astronomy and Astrophysics Decadal Survey)

arXiv:1903.02547 [pdf, other]

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation

Authors: Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, Siddhartha Srinivasa

Abstract: We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the Room-to-Room (R2R) Vision-and-Language navigation challenge of Anderson et al. (2018). Given a natural language instruction and photo-realistic image views of a previously unseen environment, the agent was tasked with navigating from source to target location as quickly as possible. While all current approaches make local action decisions or score entire trajectories using beam search, ours balances local and global signals when exploring an unobserved environment. Importantly, this lets us act greedily but use global signals to backtrack when necessary. Applying the FAST framework to existing state-of-the-art models achieved a 17% relative gain, an absolute 6% gain on Success rate weighted by Path Length (SPL).

Submitted 2 April, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

Comments: CVPR 2019 Oral, video demo: https://youtu.be/AD9TNohXoPA

arXiv:1902.00579 [pdf, other]

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

Authors: Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, Jianfeng Gao

Abstract: This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image. In each question-answering turn of a dialog, ReDAN infers the answer progressively through multiple reasoning steps. In each step of the reasoning process, the semantic representation of the question is updated based on the image and the previous dialog history, and the recurrently-refined representation is used for further reasoning in the subsequent step. On the VisDial v1.0 dataset, the proposed ReDAN model achieves a new state-of-the-art of 64.47% NDCG score. Visualization on the reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues via iterative refinement, which can lead to the correct answer step-by-step.

Submitted 4 June, 2019; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: Accepted to ACL 2019

arXiv:1901.06283 [pdf, other]

Improving Sequence-to-Sequence Learning via Optimal Transport

Authors: Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, Lawrence Carin

Abstract: Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE). However, standard MLE training considers a word-level objective, predicting the next word given the previous ground-truth partial sentence. This procedure focuses on modeling local syntactic patterns, and may fail to capture long-range semantic structure. We present a novel solution to alleviate these issues. Our approach imposes global sequence-level guidance via new supervision based on optimal transport, enabling the overall characterization and preservation of semantic features. We further show that this method can be understood as a Wasserstein gradient flow trying to match our model to the ground truth sequence distribution. Extensive experiments are conducted to validate the utility of the proposed approach, showing consistent improvements over a wide variety of NLP tasks, including machine translation, abstractive text summarization, and image captioning.

Submitted 18 January, 2019; originally announced January 2019.

arXiv:1812.10130 [pdf, other]

doi 10.3847/1538-4357/ab1007

Adding a Suite of Chemical Abundances to the MACER Code for the Evolution of Massive Elliptical Galaxies

Authors: Zhaoming Gan, Ena Choi, Jeremiah P. Ostriker, Luca Ciotti, Silvia Pellegrini

Abstract: We add a suite of chemical abundances to the MACER (Massive AGN Controlled Ellipticals Resolved) 2D code, by solving 12 additional continuity equations for H, He, C, N, O, Ne, Mg, Si, S, Ca, Fe and Ni respectively with sources from AGB stars and supernovae of type Ia and II with metal yields based on standard stellar physics. New stars, formed in Toomre unstable circumnuclear disks (of a size $\lesssim150$ parsec), are assumed to have a top-heavy initial mass function with a power index of 1.65. The metal dilution effects due to cosmic accretion are also included. With the high resolution of a few parsecs in central regions, resolved black hole accretion and AGN feedback, we can track the metal enrichment, transportation and dilution throughout the modeled massive elliptical galaxy of velocity dispersion $\sim280$ km/s. We retrieve the chemical composition of the BAL winds launched by the central AGN, synthesize the X-ray features of the hot ISM, and find that (1) the simulated metallicity in the BAL winds could be up to $\sim 8 Z_\odot$, while that of the hot ISM in the host galaxy is $\sim 2.3 Z_\odot$, matching well with SDSS observations of BLR gas; (2) the X-ray emitting hot gas is metal enriched with a typical value $\sim2.5 Z_\odot$; (3) the circumnuclear cold gas disk, where the metals are condensed, further enriched and recycled, plays a critical role in the metal enrichment; (4) the black hole accretion rate $\dot{M}_{\rm BH}$ linearly correlates with the star formation rate $\dot{M}_\star^+$ in the circumnuclear disk, i.e., $\dot{M}_\star^+ \sim 7.7 \dot{M}_{\rm BH}$, but lagged in time by roughly $10^6$ years.

Submitted 8 April, 2019; v1 submitted 25 December, 2018; originally announced December 2018.

Comments: 10 pages, 10 figures; accepted by ApJ

arXiv:1812.08352 [pdf, other]

Sequential Attention GAN for Interactive Image Editing

Authors: Yu Cheng, Zhe Gan, Yitong Li, Jingjing Liu, Jianfeng Gao

Abstract: Most existing text-to-image synthesis tasks are static single-turn generation, based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task - Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as the input and modifies the image generated in the previous turn to a new design, following the user description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark on the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics including visual quality, image sequence coherence, and text-image consistency.

Submitted 5 August, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: ACM MM 2020

Showing 151–200 of 270 results for author: Gan, Z