-
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Authors:
Le Xue,
Manli Shu,
Anas Awadalla,
Jun Wang,
An Yan,
Senthil Purushwalkam,
Honglu Zhou,
Viraj Prabhu,
Yutong Dai,
Michael S Ryoo,
Shrikant Kendre,
Jieyu Zhang,
Can Qin,
Shu Zhang,
Chia-Chih Chen,
Ning Yu,
Juntao Tan,
Tulika Manoj Awalgaonkar,
Shelby Heinecke,
Huan Wang,
Yejin Choi,
Ludwig Schmidt,
Zeyuan Chen,
Silvio Savarese,
Juan Carlos Niebles
, et al. (2 additional authors not shown)
Abstract:
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
Submitted 28 August, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Noncommutative nonisospectral Toda and Lotka-Volterra lattices, and matrix discrete Painlevé equations
Authors:
Anhui Yan,
Chunxia Li
Abstract:
The noncommutative analogues of the nonisospectral Toda and Lotka-Volterra lattices are proposed and studied by performing nonisospectral deformations on the matrix orthogonal polynomials and matrix symmetric orthogonal polynomials without specific weight functions, respectively. Under stationary reductions, matrix discrete Painlevé I and matrix asymmetric discrete Painlevé I equations are derived separately not only from the noncommutative nonisospectral lattices themselves, but also from their Lax pairs. The rationality of the stationary reduction has been justified in the sense that quasideterminant solutions are provided for the corresponding matrix discrete Painlevé equations.
Submitted 17 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Answering real-world clinical questions using large language model based systems
Authors:
Yen Sia Low,
Michael L. Jackson,
Rebecca J. Hyde,
Robert E. Brown,
Neil M. Sanghavi,
Julian D. Baldwin,
C. William Pike,
Jananee Muralidharan,
Gavin Hui,
Natasha Alexander,
Hadeel Hassan,
Rahul V. Nene,
Morgan Pike,
Courtney J. Pokrzywa,
Shivam Vedak,
Adam Paul Yan,
Dong-han Yao,
Amy R. Zipursky,
Christina Dinh,
Philip Ballentine,
Dan C. Derieg,
Vladimir Polony,
Rehan N. Chawdry,
Jordan Davies,
Brigham B. Hyde
, et al. (2 additional authors not shown)
Abstract:
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.
Submitted 29 June, 2024;
originally announced July 2024.
-
CRAG -- Comprehensive RAG Benchmark
Authors:
Xiao Yang,
Kai Sun,
Hao Xin,
Yushi Sun,
Nikita Bhalla,
Xiangsen Chen,
Sajal Choudhary,
Rongze Daniel Gui,
Ziran Will Jiang,
Ziyu Jiang,
Lingkun Kong,
Brian Moran,
Jiaqi Wang,
Yifan Ethan Xu,
An Yan,
Chenyu Yang,
Eting Yuan,
Hanwen Zha,
Nan Tang,
Lei Chen,
Nicolas Scheffer,
Yue Liu,
Nirav Shah,
Rakesh Wanga,
Anuj Kumar
, et al. (2 additional authors not shown)
Abstract:
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the knowledge deficiencies of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
Submitted 7 June, 2024;
originally announced June 2024.
-
Are You Copying My Prompt? Protecting the Copyright of Vision Prompt for VPaaS via Watermark
Authors:
Huali Ren,
Anli Yan,
Chong-zhi Gao,
Hongyang Yan,
Zhenxin Zhang,
Jin Li
Abstract:
Visual Prompt Learning (VPL) differs from traditional fine-tuning methods in reducing significant resource consumption by avoiding updating pre-trained model parameters. Instead, it focuses on learning an input perturbation, a visual prompt, added to downstream task data for making predictions. Since learning generalizable prompts requires expert design and creation, which is technically demanding and time-consuming in the optimization process, developers of Visual Prompts as a Service (VPaaS) have emerged. These developers profit by providing well-crafted prompts to authorized customers. However, a significant drawback is that prompts can be easily copied and redistributed, threatening the intellectual property of VPaaS developers. Hence, there is an urgent need for technology to protect the rights of VPaaS developers. To this end, we present a method named WVPrompt that employs visual prompt watermarking in a black-box way. WVPrompt consists of two parts: prompt watermarking and prompt verification. Specifically, it utilizes a poison-only backdoor attack method to embed a watermark into the prompt and then employs a hypothesis-testing approach for remote verification of prompt ownership. Extensive experiments have been conducted on three well-known benchmark datasets using three popular pre-trained models: RN50, BIT-M, and Instagram. The experimental results demonstrate that WVPrompt is efficient, harmless, and robust to various adversarial operations.
Submitted 23 May, 2024;
originally announced May 2024.
-
A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
Authors:
Zhiyu Zoey Chen,
Jing Ma,
Xinlu Zhang,
Nan Hao,
An Yan,
Armineh Nourbakhsh,
Xianjun Yang,
Julian McAuley,
Linda Petzold,
William Yang Wang
Abstract:
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: https://github.com/czyssrs/LLM_X_papers.
Submitted 2 May, 2024;
originally announced May 2024.
-
Soft X-ray prompt emission from a high-redshift gamma-ray burst EP240315a
Authors:
Y. Liu,
H. Sun,
D. Xu,
D. S. Svinkin,
J. Delaunay,
N. R. Tanvir,
H. Gao,
C. Zhang,
Y. Chen,
X. -F. Wu,
B. Zhang,
W. Yuan,
J. An,
G. Bruni,
D. D. Frederiks,
G. Ghirlanda,
J. -W. Hu,
A. Li,
C. -K. Li,
J. -D. Li,
D. B. Malesani,
L. Piro,
G. Raman,
R. Ricci,
E. Troja
, et al. (170 additional authors not shown)
Abstract:
Long gamma-ray bursts (GRBs) are believed to originate from core collapse of massive stars. High-redshift GRBs can probe the star formation and reionization history of the early universe, but their detection remains rare. Here we report the detection of a GRB triggered in the 0.5--4 keV band by the Wide-field X-ray Telescope (WXT) on board the Einstein Probe (EP) mission, designated as EP240315a, whose bright peak was also detected by the Swift Burst Alert Telescope and Konus-Wind through off-line analyses. At a redshift of $z=4.859$, EP240315a showed a much longer and more complicated light curve in the soft X-ray band than in gamma-rays. Benefiting from a large field-of-view ($\sim$3600 deg$^2$) and a high sensitivity, EP-WXT captured the earlier engine activation and extended late engine activity through a continuous detection. With a peak X-ray flux at the faint end of previously known high-$z$ GRBs, the detection of EP240315a demonstrates the great potential for EP to study the early universe via GRBs.
Submitted 25 April, 2024;
originally announced April 2024.
-
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Authors:
An Yan,
Zhengyuan Yang,
Junda Wu,
Wanrong Zhu,
Jianwei Yang,
Linjie Li,
Kevin Lin,
Jianfeng Wang,
Julian McAuley,
Jianfeng Gao,
Lijuan Wang
Abstract:
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.
Submitted 25 April, 2024;
originally announced April 2024.
-
Log-concavity in Combinatorics
Authors:
Alan Yan
Abstract:
We survey some of the mechanisms used to prove that naturally defined sequences in combinatorics are log-concave. Among these mechanisms are Alexandrov's inequality for mixed discriminants, the Alexandrov-Fenchel inequality for mixed volumes, Lorentzian polynomials, and the Hard Lefschetz theorem. We use these mechanisms to prove some new log-concavity and extremal results related to partially ordered sets and matroids. We present joint work with Ramon van Handel and Xinmeng Zeng to give a complete characterization for the extremals of the Kahn-Saks inequality. We extend Stanley's inequality for regular matroids to arbitrary matroids using the technology of Lorentzian polynomials. As a result, we provide a new proof of the weakest Mason conjecture. We also prove necessary and sufficient conditions for the Gorenstein ring associated to the basis generating polynomial of a matroid to satisfy Hodge-Riemann relations of degree one on the facets of the positive orthant.
Submitted 16 April, 2024;
originally announced April 2024.
-
Bridging Language and Items for Retrieval and Recommendation
Authors:
Yupeng Hou,
Jiacheng Li,
Zhankui He,
An Yan,
Xiusi Chen,
Julian McAuley
Abstract:
This paper introduces BLaIR, a series of pretrained sentence embedding models specialized for recommendation scenarios. BLaIR is trained to learn correlations between item metadata and potential natural language context, which is useful for retrieving and recommending items. To pretrain BLaIR, we collect Amazon Reviews 2023, a new dataset comprising over 570 million reviews and 48 million items from 33 categories, significantly expanding beyond the scope of previous versions. We evaluate the generalization ability of BLaIR across multiple domains and tasks, including a new task named complex product search, referring to retrieving relevant items given long, complex natural language contexts. Leveraging large language models like ChatGPT, we correspondingly construct a semi-synthetic evaluation set, Amazon-C4. Empirical results on the new task, as well as conventional retrieval and recommendation tasks, demonstrate that BLaIR exhibits strong text and item representation capacity. Our datasets, code, and checkpoints are available at: https://github.com/hyp1231/AmazonReviews2023.
Submitted 6 March, 2024;
originally announced March 2024.
-
GanFinger: GAN-Based Fingerprint Generation for Deep Neural Network Ownership Verification
Authors:
Huali Ren,
Anli Yan,
Xiaojun Ren,
Pei-Gen Ye,
Chong-zhi Gao,
Zhili Zhou,
Jin Li
Abstract:
Deep neural networks (DNNs) are extensively employed in a wide range of application scenarios. Generally, training a commercially viable neural network requires significant amounts of data and computing resources, and it is easy for unauthorized users to use the networks illegally. Therefore, network ownership verification has become one of the most crucial steps in safeguarding digital assets. To verify the ownership of networks, the existing network fingerprinting approaches perform poorly in the aspects of efficiency, stealthiness, and discriminability. To address these issues, we propose a network fingerprinting approach, named GanFinger, to construct the network fingerprints based on the network behavior, which is characterized by network outputs of pairs of original examples and conferrable adversarial examples. Specifically, GanFinger leverages Generative Adversarial Networks (GANs) to effectively generate conferrable adversarial examples with imperceptible perturbations. These examples can exhibit identical outputs on copyrighted and pirated networks while producing different results on irrelevant networks. Moreover, to enhance the accuracy of fingerprint ownership verification, the network similarity is computed based on the accuracy-robustness distance of fingerprint examples' outputs. To evaluate the performance of GanFinger, we construct a comprehensive benchmark consisting of 186 networks with five network structures and four popular network post-processing techniques. The benchmark experiments demonstrate that GanFinger significantly outperforms the state of the art in efficiency, stealthiness, and discriminability. It is 6.57 times faster in fingerprint generation and boosts the ARUC value by 0.175, resulting in a relative improvement of about 26%.
Submitted 25 December, 2023;
originally announced December 2023.
-
A continuous cold rubidium atomic beam with enhanced flux and tunable velocity
Authors:
Shengzhe Wang,
Zhixin Meng,
Peiqiang Yan,
Yuanxing Liu,
Yanying Feng
Abstract:
We present a cold atomic beam source based on a two-dimensional (2D)+ magneto-optical trap (MOT), capable of generating a continuous cold beam of 87Rb atoms with a flux up to 4.3 × 10^9 atoms/s, a mean velocity of 10.96(2.20) m/s, and a transverse temperature of 16.90(1.56) μK. Investigating the influence of high cooling laser intensity, we observe a significant population loss of atoms to hyperfine-level dark states. To account for this, we employ a multiple hyperfine level model to calculate the cooling efficiency associated with the population in dark states, subsequently modifying the scattering force. Simulations of beam flux at different cooling and repumping laser intensities using the modified scattering force are in agreement with experimental results. Optimizing repumping and cooling intensities enhances the flux by 50%. The influence of phase modulation on both the pushing and cooling lasers is experimentally studied, revealing that the mean velocity of cold atoms can be tuned from 9.5 m/s to 14.6 m/s with a phase-modulated pushing laser. The versatility of this continuous beam source, featuring high flux, controlled velocity, and narrow transverse temperature, renders it valuable for applications in atom interferometers and clocks, ultimately enhancing bandwidth, sensitivity, and signal contrast in these devices.
Submitted 19 December, 2023;
originally announced December 2023.
-
Ground Calibration Result of the Lobster Eye Imager for Astronomy
Authors:
Huaqing Cheng,
Zhixing Ling,
Chen Zhang,
Xiaojin Sun,
Shengli Sun,
Yuan Liu,
Yanfeng Dai,
Zhenqing Jia,
Haiwu Pan,
Wenxin Wang,
Donghua Zhao,
Yifan Chen,
Zhiwei Cheng,
Wei Fu,
Yixiao Han,
Junfei Li,
Zhengda Li,
Xiaohao Ma,
Yulong Xue,
Ailiang Yan,
Qiang Zhang,
Yusa Wang,
Xiongtao Yang,
Zijian Zhao,
Weimin Yuan
Abstract:
We report on results of the on-ground X-ray calibration of the Lobster Eye Imager for Astronomy (LEIA), an experimental space wide-field (18.6 × 18.6 square degrees) X-ray telescope built from novel lobster eye micro-pore optics. LEIA was successfully launched on July 27, 2022 onboard the SATech-01 satellite. To achieve full characterisation of its performance before launch, a series of tests and calibrations have been carried out at different levels of devices, assemblies and the complete module. In this paper, we present the results of the end-to-end calibration campaign of the complete module carried out at the 100-m X-ray Test Facility at IHEP. The PSF, effective area and energy response of the detectors were measured in a wide range of incident directions at several X-ray line energies. The distributions of the PSF and effective areas are roughly uniform across the FoV, largely in agreement with the prediction of lobster-eye optics. The mild variations and deviations from the prediction of idealized lobster-eye optics can be understood to be caused by the imperfect shapes and alignment of the micro-pores as well as the obscuration by the supporting frames, which can be well reproduced by MC simulations. The spatial resolution of LEIA, defined by the FWHM of the focal spot, ranges from 4 to 8 arcmin with a median of 5.7 arcmin. The measured effective areas are in the range of 2-3 cm$^2$ at ~1.25 keV across the entire FoV, and their dependence on photon energy is largely in agreement with simulations. The gains of the CMOS sensors are in the range of 6.5-6.9 eV/DN, and the energy resolutions are in the range of ~120-140 eV at 1.25 keV and ~170-190 eV at 4.5 keV. These results have been ingested into the calibration database and applied to the analysis of the scientific data acquired by LEIA. This work paves the way for the calibration of the Wide-field X-Ray Telescope modules of the Einstein Probe mission.
Submitted 11 December, 2023;
originally announced December 2023.
-
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
Authors:
An Yan,
Zhengyuan Yang,
Wanrong Zhu,
Kevin Lin,
Linjie Li,
Jianfeng Wang,
Jianwei Yang,
Yiwu Zhong,
Julian McAuley,
Jianfeng Gao,
Zicheng Liu,
Lijuan Wang
Abstract:
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
Submitted 13 November, 2023;
originally announced November 2023.
-
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks
Authors:
Xinlu Zhang,
Yujie Lu,
Weizhi Wang,
An Yan,
Jun Yan,
Lianke Qin,
Heng Wang,
Xifeng Yan,
William Yang Wang,
Linda Ruth Petzold
Abstract:
Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments due to limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multi-modal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment. We employ two evaluation methods, single-answer grading and pairwise comparison, using GPT-4V. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators. Despite limitations like restricted visual clarity grading and real-world complex reasoning, its ability to provide human-aligned scores enriched with detailed explanations is promising for a universal automatic evaluator.
Submitted 2 November, 2023;
originally announced November 2023.
-
Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving
Authors:
Jessica Echterhoff,
An Yan,
Kyungtae Han,
Amr Abdelraouf,
Rohit Gupta,
Julian McAuley
Abstract:
Concept bottleneck models have been successfully used for explainable machine learning by encoding information within the model with a set of human-defined concepts. In the context of human-assisted or autonomous driving, explainability models can help user acceptance and understanding of decisions made by the autonomous vehicle, which can be used to rationalize and explain driver or vehicle behavior. We propose a new approach using concept bottlenecks as visual features for control command predictions and explanations of user and vehicle behavior. We learn a human-understandable concept layer that we use to explain sequential driving scenes while learning vehicle control commands. This approach can then be used to determine whether a change in a preferred gap or steering commands from a human (or autonomous vehicle) is led by an external stimulus or change in preferences. We achieve competitive performance to latent visual features while gaining interpretability within our model setup.
Submitted 26 October, 2023; v1 submitted 25 October, 2023;
originally announced October 2023.
-
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
Authors:
Zexue He,
Yu Wang,
An Yan,
Yao Liu,
Eric Y. Chang,
Amilcare Gentili,
Julian McAuley,
Chun-Nan Hsu
Abstract:
Curated datasets for healthcare are often limited due to the need for human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.
Submitted 14 November, 2023; v1 submitted 21 October, 2023;
originally announced October 2023.
-
Topological Magnetoresistance of Magnetic Skyrmionic Bubbles
Authors:
Fei Li,
Hao Nie,
Yu Zhao,
Zhihe Zhao,
Juntao Huo,
Hongxian Shen,
Sida Jiang,
Renjie Chen,
Aru Yan,
S-W Cheong,
Weixing Xia,
Lunyong Zhang,
Jianfei Sun
Abstract:
Magnetic skyrmions offer promising prospects for constructing future energy-efficient and high-density information technology, leading to extensive explorations of new skyrmionic materials recently. The topological Hall effect has been widely adopted as a distinctive marker of skyrmion emergence. Alternatively, here we propose a novel signature of the skyrmion state by quantitatively investigating the magnetoresistance (MR) induced by skyrmionic bubbles in CeMn2Ge2. An intriguing finding was revealed: the anomalous MR measured at different temperatures can be normalized into a single curve, regardless of sample thickness. This behavior can be accurately reproduced by the recent chiral spin textures MR model. Further analysis of the MR anomaly allowed us to quantitatively examine the effective magnetic fields of various scattering channels. Remarkably, the analyses, combined with the Lorentz transmission electron microscopy results, indicate that the in-plane scattering channel with triplet exchange interactions predominantly governs the magnetotransport in the Bloch-type skyrmionic bubble state. Our results not only provide insights into the quantum correction on MR induced by the skyrmionic bubble phase, but also present an electrical probing method for studying chiral spin texture formation, evolution and their topological properties, which opens up exciting possibilities for identifying new skyrmionic materials and advancing the methodology for studying chiral spin textures.
Submitted 21 October, 2023;
originally announced October 2023.
-
Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models
Authors:
An Yan,
Yu Wang,
Yiwu Zhong,
Zexue He,
Petros Karypis,
Zihan Wang,
Chengyu Dong,
Amilcare Gentili,
Chun-Nan Hsu,
Jingbo Shang,
Julian McAuley
Abstract:
Medical image classification is a critical problem for healthcare, with the potential to alleviate the workload of doctors and facilitate diagnoses of patients. However, two challenges arise when deploying deep learning models to real-world healthcare applications. First, neural models tend to learn spurious correlations instead of desired features, which could fall short when generalizing to new domains (e.g., patients with different ages). Second, these black-box models lack interpretability. When making diagnostic predictions, it is important to understand why a model makes a decision for trustworthiness and safety considerations. In this paper, to address these two limitations, we propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model. We systematically evaluate our method on eight medical image classification datasets to verify its effectiveness. On challenging datasets with strong confounding factors, our method can mitigate spurious correlations and thus substantially outperform standard visual encoders and other baselines. Finally, we show how classification with a small number of concepts brings a level of interpretability for understanding model decisions through case studies in real medical data.
Submitted 4 October, 2023;
originally announced October 2023.
-
The extremals of the Kahn-Saks inequality
Authors:
Ramon van Handel,
Alan Yan,
Xinmeng Zeng
Abstract:
A classical result of Kahn and Saks states that given any partially ordered set with two distinguished elements, the number of linear extensions in which the ranks of the distinguished elements differ by $k$ is log-concave as a function of $k$. The log-concave sequences that can arise in this manner prove to exhibit a much richer structure, however, than is evident from log-concavity alone. The main result of this paper is a complete characterization of the extremals of the Kahn-Saks inequality: we obtain a detailed combinatorial understanding of where and what kind of geometric progressions can appear in these log-concave sequences. This settles a partial conjecture of Chan-Pak-Panova, while the analysis uncovers new extremals that were not previously conjectured. The proof relies on a much more general geometric mechanism -- a hard Lefschetz theorem for nef classes that was obtained in the setting of convex polytopes by Shenfeld and Van Handel -- which forms a model for the investigation of such structures in other combinatorial problems.
Submitted 30 June, 2024; v1 submitted 23 September, 2023;
originally announced September 2023.
-
Learning Concise and Descriptive Attributes for Visual Recognition
Authors:
An Yan,
Yu Wang,
Yiwu Zhong,
Chengyu Dong,
Zexue He,
Yujie Lu,
William Wang,
Jingbo Shang,
Julian McAuley
Abstract:
Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation on 8 datasets reveals that LLM-generated attributes in a large quantity perform almost the same as random words. This surprising finding suggests that significant noise may be present in these attributes. We hypothesize that there exist subsets of attributes that can maintain the classification performance with much smaller sizes, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet using only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task.
Submitted 7 August, 2023;
originally announced August 2023.
-
Effective Hamiltonian approach to the quantum phase transitions in the extended Jaynes-Cummings model
Authors:
H. T. Cui,
Y. A. Yan,
M. Qin,
X. X. Yi
Abstract:
The study of phase transitions in dissipative quantum systems based on the Liouvillian is often hindered by the difficulty of constructing a time-local master equation when the system-environment coupling is strong. To address this issue, the complex discretization approximation for the environment is proposed to study the quantum phase transition in the extended Jaynes-Cummings model with an infinite number of boson modes. This approach yields a non-Hermitian effective Hamiltonian that can be used to simulate the dynamics of the spin. It is found that the ground state of this effective Hamiltonian determines the spin dynamics in the single-excitation subspace. Depending on the opening of the energy gap and the maximum population of excitations on the spin degree of freedom, three distinct phases can be identified: fast decaying, localized, and stretched dynamics of the spin. This approach can be extended to multiple excitations, and similar dynamics were found in the double-excitation subspace, indicating the robustness of the single-excitation phase.
Submitted 6 April, 2024; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Comparing Apples to Apples: Generating Aspect-Aware Comparative Sentences from User Reviews
Authors:
Jessica Echterhoff,
An Yan,
Julian McAuley
Abstract:
It is time-consuming to find the best product among many similar alternatives. Comparative sentences can help to contrast one item from others in a way that highlights important features of an item that stand out. Given reviews of one or multiple items and relevant item features, we generate comparative review sentences to aid users to find the best fit. Specifically, our model consists of three successive components in a transformer: (i) an item encoding module to encode an item for comparison, (ii) a comparison generation module that generates comparative sentences in an autoregressive manner, (iii) a novel decoding method for user personalization. We show that our pipeline generates fluent and diverse comparative sentences. We run experiments on the relevance and fidelity of our generated sentences in a human evaluation study and find that our algorithm creates comparative review sentences that are relevant and truthful.
Submitted 23 July, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
The Lobster Eye Imager for Astronomy Onboard the SATech-01 Satellite
Authors:
Z. X. Ling,
X. J. Sun,
C. Zhang,
S. L. Sun,
G. Jin,
S. N. Zhang,
X. F. Zhang,
J. B. Chang,
F. S. Chen,
Y. F. Chen,
Z. W. Cheng,
W. Fu,
Y. X. Han,
H. Li,
J. F. Li,
Y. Li,
Z. D. Li,
P. R. Liu,
Y. H. Lv,
X. H. Ma,
Y. J. Tang,
C. B. Wang,
R. J. Xie,
Y. L. Xue,
A. L. Yan
, et al. (101 additional authors not shown)
Abstract:
The Lobster Eye Imager for Astronomy (LEIA), a pathfinder of the Wide-field X-ray Telescope of the Einstein Probe (EP) mission, was successfully launched onboard the SATech-01 satellite of the Chinese Academy of Sciences on 27 July 2022. In this paper, we introduce the design and on-ground test results of the LEIA instrument. Using state-of-the-art Micro-Pore Optics (MPO), a wide field-of-view (FoV) of 346 square degrees (18.6 degrees $\times$ 18.6 degrees) of the X-ray imager is realized. An optical assembly composed of 36 MPO chips is used to focus incident X-ray photons, and four large-format complementary metal-oxide semiconductor (CMOS) sensors, each of 6 cm $\times$ 6 cm, are used as the focal plane detectors. The instrument has an angular resolution of 4 - 8 arcmin (in FWHM) for the central focal spot of the point spread function, and an effective area of 2 - 3 cm$^2$ at 1 keV in essentially all the directions within the field of view. The detection passband is 0.5 - 4 keV in the soft X-rays and the sensitivity is 2 - 3 $\times$ 10$^{-11}$ erg s$^{-1}$ cm$^{-2}$ (about 1 mini-Crab) for a 1,000-second observation. The total weight of LEIA is 56 kg and the power is 85 W. The satellite, with a design lifetime of 2 years, operates in a Sun-synchronous orbit of 500 km with an orbital period of 95 minutes. LEIA is paving the way for future missions by verifying in flight the technologies of both novel focusing imaging optics and CMOS sensors for X-ray observation, and by optimizing the working setups of the instrumental parameters. In addition, LEIA is able to carry out scientific observations to find new transients and to monitor known sources in the soft X-ray band, albeit with limited useful observing time available.
Submitted 24 May, 2023;
originally announced May 2023.
-
"Nothing Abnormal": Disambiguating Medical Reports via Contrastive Knowledge Infusion
Authors:
Zexue He,
An Yan,
Amilcare Gentili,
Julian McAuley,
Chun-Nan Hsu
Abstract:
Sharing medical reports is essential for patient-centered care. A recent line of work has focused on automatically generating reports with NLP methods. However, different audiences have different purposes when writing/reading medical reports -- for example, healthcare professionals care more about pathology, whereas patients are more concerned with the diagnosis ("Is there any abnormality?"). The expectation gap results in a common situation where patients find their medical reports to be ambiguous and are therefore unsure about the next steps. In this work, we explore the audience expectation gap in healthcare and summarize common ambiguities that lead patients to be confused about their diagnosis into three categories: medical jargon, contradictory findings, and misleading grammatical errors. Based on our analysis, we define a disambiguation rewriting task to regenerate an input to be unambiguous while preserving information about the original content. We further propose a rewriting algorithm based on contrastive pretraining and perturbation-based rewriting. In addition, we create two datasets, OpenI-Annotated based on chest reports and VA-Annotated based on general medical reports, with available binary labels for ambiguity and abnormality presence annotated by radiology specialists. Experimental results on these datasets show that our proposed algorithm effectively rewrites input sentences in a less ambiguous way with high content fidelity. Our code and annotated data are released to facilitate future research.
Submitted 14 May, 2023;
originally announced May 2023.
-
Landauer-QFLPS model for mixed Schottky-Ohmic contact two-dimensional transistors
Authors:
Zhao-Yi Yan,
Zhan Hou,
Kan-Hao Xue,
Tian Lu,
Ruiting Zhao,
Junying Xue,
Fan Wu,
Minghao Shao,
Jianlan Yan,
Anzhi Yan,
Zhenze Wang,
Penghui Shen,
Mingyue Zhao,
Xiangshui Miao,
Zhaoyang Lin,
Houfang Liu,
He Tian,
Yi Yang,
Tian-Ling Ren
Abstract:
Two-dimensional material-based field effect transistors (2DM-FETs) are playing a revolutionary role in electronic devices. However, after years of development, no device model can match the Pao-Sah model for standard silicon-based transistors in terms of physical accuracy and computational efficiency to support large-scale integrated circuit design. One remaining critical obstacle is the contacts of 2DM-FETs. In order to self-consistently include the contact effect in the current model, it is necessary to perform self-consistent calculations, which is a fatal flaw for applications that prioritize efficiency. Here, we report that the Landauer-QFLPS model effectively overcomes the above contradiction, where QFLPS means quasi-Fermi-level phase space theory. By connecting the physical pictures of the contact and the intrinsic channel part, we have successfully derived a drain-source current formula including the contact effect. To verify the model, we prepared transistors based on two typical 2DMs, black phosphorus (BP) and molybdenum disulfide (MoS2), the former having ambipolar transport and the latter showing electron-dominant unipolar transport. The proposed new formula could describe both 2DM-FETs with Schottky or Ohmic contacts. Moreover, compared with traditional methods, the proposed model has the advantages of accuracy and efficiency, especially in describing non-monotonic drain conductance characteristics, because the contact effect is self-consistently and compactly packaged as an exponential term. More importantly, we also examined the model at the circuit level. Here, we fabricated a three-bit threshold inverter quantizer circuit based on ambipolar-BP process and experimentally demonstrated that the model can accurately predict the circuit performance. This industry-benign 2DM-FET model is supposed to be very useful for the development of 2DM-FET-based integrated circuits.
Submitted 20 March, 2023;
originally announced March 2023.
-
An efficient model algorithm for two-dimensional field-effect transistors
Authors:
Zhao-Yi Yan,
Zhan Hou,
Fan Wu,
Ruiting Zhao,
Jianlan Yan,
Anzhi Yan,
Zhenze Wang,
Kan-Hao Xue,
Houfang Liu,
He Tian,
Yi Yang,
Tian-Ling Ren
Abstract:
Two-dimensional materials-based field-effect transistors (2DM-FETs) exhibit both ambipolar and unipolar transport types. To physically and compactly cover both cases, we put forward a quasi-Fermi-level phase space (QFLPS) approach to model the ambipolar effect in our previous work. This work aims to further improve the QFLPS model's numerical aspect so that the model can be implanted into the standard circuit simulator. We first rigorously derive the integral-free formula for the drain-source current to achieve this goal. It is more friendly to computation than the integral form. Besides, it explicitly gives the correlation terms between the electron and hole components. Secondly, to work out the boundary values required by the new expressions, we develop a fast evaluation algorithm for the surface electrostatic potential based on the zero-temperature limit property of the 2DM-FET system. By calibrating the model with the realistic device data of black phosphorus (BP) and monolayer molybdenum disulfide (ML-MoS2) FETs, the completed algorithm is tested against practical cases. The results show a typical superiority to the benchmark algorithm by two orders of magnitude in time consumption can be achieved while keeping a high accuracy with 7 to 9 significant digits.
Submitted 13 March, 2023;
originally announced March 2023.
-
Effective Hamiltonian approach to the exact dynamics of open system by complex discretization approximation for environment
Authors:
H. T. Cui,
Y. A. Yan,
M. Qin,
X. X. Yi
Abstract:
The discretization approximation method commonly used to simulate the open dynamics of a system coupled to an environment in the continuum often suffers from recurrence. To address this issue, this paper proposes a novel generalization of the discretization approximation method in the complex plane using complex Gauss quadratures. An effective Hamiltonian can be constructed in this way; it is non-Hermitian and exhibits complex energy modes with negative imaginary parts, accurately describing the dissipative dynamics of the system. This method is applied to examine the dynamics in two exactly solvable models: the dephasing model and the single-excitation open dynamics in the Aubry-André-Harper model. This approach not only significantly reduces recurrence and improves the effectiveness of the calculation, but also provides a microscopic viewpoint on the dynamics of the system through the effective Hamiltonian. In addition, a simple relationship between the parameters in computation and the effectiveness of evaluation is also established.
Submitted 27 May, 2024; v1 submitted 12 March, 2023;
originally announced March 2023.
-
First wide field-of-view X-ray observations by a lobster eye focusing telescope in orbit
Authors:
C. Zhang,
Z. X. Ling,
X. J. Sun,
S. L. Sun,
Y. Liu,
Z. D. Li,
Y. L. Xue,
Y. F. Chen,
Y. F. Dai,
Z. Q. Jia,
H. Y. Liu,
X. F. Zhang,
Y. H. Zhang,
S. N. Zhang,
F. S. Chen,
Z. W. Cheng,
W. Fu,
Y. X. Han,
H. Li,
J. F. Li,
Y. Li,
P. R. Liu,
X. H. Ma,
Y. J. Tang,
C. B. Wang
, et al. (53 additional authors not shown)
Abstract:
As a novel X-ray focusing technology, lobster eye micro-pore optics (MPO) feature both a wide observing field of view and true imaging capability, promising sky monitoring with significantly improved sensitivity and spatial resolution in soft X-rays. Since first proposed by Angel (1979), the optics have been extensively studied, developed and trialed over the past decades. In this Letter, we report on the first-light results from a flight experiment of the Lobster Eye Imager for Astronomy ($LEIA$), a pathfinder of the wide-field X-ray telescope of the Einstein Probe mission. The piggyback imager, launched in July 2022, has a mostly un-vignetted field of view of $18.6^\circ \times 18.6^\circ $. Its spatial resolution is in the range of 4$-$7 arcmin in FWHM and the focal spot effective area is 2$-$3 cm$^2$, both showing only mild fluctuations across the field of view. We present images of the Galactic center region, Sco X-1 and the diffuse Cygnus Loop nebula taken in snapshot observations over 0.5$-$4 keV. These are truly wide-field X-ray images of celestial bodies observed, for the first time, by a focusing imaging telescope. Initial analyses of the in-flight data show excellent agreement between the observed images and the on-ground calibration and simulations. The instrument and its characterization are briefly described, as well as the flight experiment. The results provide a solid basis for the development of the present and proposed wide-field X-ray missions using lobster eye MPO.
Submitted 17 November, 2022;
originally announced November 2022.
-
CLIP also Understands Text: Prompting CLIP for Phrase Understanding
Authors:
An Yan,
Jiacheng Li,
Wanrong Zhu,
Yujie Lu,
William Yang Wang,
Julian McAuley
Abstract:
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision. CLIP and its visual encoder have been explored on various vision and language tasks and achieve strong zero-shot or transfer learning performance. However, the application of its text encoder solely for text understanding has been less explored. In this paper, we find that the text encoder of CLIP actually demonstrates strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt. Extensive experiments validate the effectiveness of our method across different datasets and domains on entity clustering and entity set expansion tasks.
Submitted 11 October, 2022;
originally announced October 2022.
-
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Authors:
Wanrong Zhu,
An Yan,
Yujie Lu,
Wenda Xu,
Xin Eric Wang,
Miguel Eckstein,
William Yang Wang
Abstract:
Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we can endow machines with the same ability to utilize visual information and construct a general picture of the context to guide text generation. In this work, we propose iNLG that uses machine-generated images to guide language models in open-ended text generation. The experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation in both few-shot and full-data scenarios. Both automatic metrics and human evaluations verify that the text snippets generated by our iNLG are coherent and informative while displaying minor degeneration.
Submitted 14 February, 2023; v1 submitted 7 October, 2022;
originally announced October 2022.
-
Group frame neural network of moving object ghost imaging combined with frame merging algorithm
Authors:
Da Chen,
Shan-Guo Feng,
Hua-Hua Wang,
Jia-Ning Cao,
Zhi-Wei Zhang,
Zhi-Xin Yang,
Ao Yan,
Lu Gao,
Ze Zhang
Abstract:
The nature of multiple samples to extract correlation information limits the applications of ghost imaging of moving objects. A novel multi-to-one neural network is proposed and the concept of "batch frame" is introduced to improve the serial imaging method. The neural network extracts more correlation information from a small number of samples, thus reducing the sampling ratio of the ghost imaging technique. We combine the correlation characteristics between images to propose a frame merging algorithm, which eliminates the dynamic blur of high-speed moving objects and further improves the reconstruction quality of moving object images at a low sampling ratio. The experimental results are consistent with the simulation results.
Submitted 31 August, 2022;
originally announced September 2022.
-
Personalized Showcases: Generating Multi-Modal Explanations for Recommendations
Authors:
An Yan,
Zhankui He,
Jiacheng Li,
Tianyang Zhang,
Julian McAuley
Abstract:
Existing explanation models generate only text for recommendations but still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is the most relevant to a user's interest toward a recommended item. Then, natural language explanations are generated from the selected images. For this new task, we collect a large-scale dataset from Google Local (i.e., maps) and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework which can generate diverse and visually-aligned explanations via contrastive learning. Experiments show that our framework benefits from different modalities as inputs, and is able to produce more diverse and expressive explanations compared to previous methods on a variety of evaluation metrics.
Submitted 6 April, 2023; v1 submitted 29 June, 2022;
originally announced July 2022.
-
AdS$_3$/AdS$_2$ degression of Fronsdal fields
Authors:
A. N. Yan
Abstract:
We analyze the Kaluza-Klein type procedure in AdS$_3$ space called the dimensional degression. The topological theory of the Fronsdal field in AdS$_3$ is reformulated in terms of the fields propagating in AdS$_2$. We find that the Fronsdal field in AdS$_3$ leads to finitely many Kaluza-Klein modes. Namely, the obtained spectrum is the massive Klein-Gordon and Proca fields in AdS$_2$. The result is derived by using the specific mode expansion, the gauge fixing, and 2-dimensional Schouten identities.
Submitted 3 July, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
An Oracle Gradient Regularized Newton Method for Quadratic Measurements Regression
Authors:
Jun Fan,
Jie Sun,
Ailing Yan,
Shenglong Zhou
Abstract:
Recovering an unknown signal from quadratic measurements has gained popularity due to its wide range of applications, including phase retrieval, fusion frame phase retrieval, and positive operator-valued measures. In this paper, we employ a least squares approach to reconstruct the signal and establish its non-asymptotic statistical properties. Our analysis shows that the estimator perfectly recovers the true signal in the noiseless case, while the error between the estimator and the true signal is bounded by $O(\sqrt{p\log(1+2n)/n})$ in the noisy case, where $n$ is the number of measurements and $p$ is the dimension of the signal. We then develop a two-phase algorithm, gradient regularized Newton method (GRNM), to solve the least squares problem. It is proven that the first phase terminates within finitely many steps, and the sequence generated in the second phase converges to a unique local minimum at a superlinear rate under certain mild conditions. Beyond these deterministic results, GRNM is capable of exactly reconstructing the true signal in the noiseless case and achieving the stated error rate with a high probability in the noisy case. Numerical experiments demonstrate that GRNM offers a high level of recovery capability and accuracy as well as fast computational speed.
Submitted 30 August, 2024; v1 submitted 19 February, 2022;
originally announced February 2022.
-
Weakly Supervised Contrastive Learning for Chest X-Ray Report Generation
Authors:
An Yan,
Zexue He,
Xing Lu,
Jiang Du,
Eric Chang,
Amilcare Gentili,
Julian McAuley,
Chun-Nan Hsu
Abstract:
Radiology report generation aims at generating descriptive text from radiology images automatically, which may present an opportunity to improve radiology reporting and interpretation. A typical setting consists of training encoder-decoder models on image-report pairs with a cross entropy loss, which struggles to generate informative sentences for clinical diagnoses since normal findings dominate the datasets. To tackle this challenge and encourage more clinically-accurate text outputs, we propose a novel weakly supervised contrastive loss for medical report generation. Experimental results demonstrate that our method benefits from contrasting target reports with incorrect but semantically-close ones. It outperforms previous work on both clinical correctness and text generation metrics for two public benchmarks.
Submitted 24 September, 2021;
originally announced September 2021.
-
Study on S2 Flow Path Design and Three-dimensional Numerical Simulation Parameter Calibration in Axial Compressor
Authors:
Aobo Yang,
An Yan,
Jiang Chen
Abstract:
The aerodynamic design of multi-stage axial flow compressors usually combines S2 flow design with three-dimensional CFD numerical simulation. Based on Wu Zhonghua's "three-dimensional flow theory", we match the S2 flow design parameters against the three-dimensional CFD simulation data: using an in-house program, we extract the S2 design parameter distribution and the corresponding CFD results in CGNS format, compare the two, and suggest modifications to the design. The method was tested by comparing and correcting the calculation of an eight-stage axial flow compressor, which improved the design performance of the compressor: the design adiabatic efficiency increases by 0.5% and the surge margin increases by more than 5%. These results verify the validity and feasibility of the method.
Submitted 15 September, 2021;
originally announced September 2021.
-
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Authors:
Wanrong Zhu,
Xin Eric Wang,
An Yan,
Miguel Eckstein,
William Yang Wang
Abstract:
Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references. This differs from human language processing, for which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of StableDiffusion, a state-of-the-art text-to-image generator, we automatically generate an image as the embodied imagination for the text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that adding machine-generated images with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation, and improves existing automatic metrics' correlations with human similarity judgments in both reference-based and reference-free evaluation scenarios.
Submitted 14 February, 2023; v1 submitted 10 June, 2021;
originally announced June 2021.
-
AdS$_3$/AdS$_2$ degression of massless particles
Authors:
K. B. Alkalaev,
A. N. Yan
Abstract:
We study a 3d/2d dimensional degression which is a Kaluza-Klein type mechanism in AdS$_3$ space foliated into AdS$_2$ hypersurfaces. It is shown that an AdS$_3$ massless particle of spin $s=1,2,...,\infty$ degresses into a couple of AdS$_2$ particles of equal energies $E=s$. Note that the Kaluza-Klein spectra in higher dimensions are always infinite. To formulate the AdS$_3$/AdS$_2$ degression we consider branching rules for AdS$_3$ isometry algebra o$(2,2)$ representations decomposed with respect to AdS$_2$ isometry algebra o$(1,2)$. We find that a given o$(2,2)$ higher-spin representation lying on the unitary bound (i.e. massless) decomposes into two equal o$(1,2)$ modules. In the field-theoretical terms, this phenomenon is demonstrated for spin-2 and spin-3 free massless fields. The truncation to a finite spectrum can be seen by using particular mode expansions, (partial) diagonalizations, and identities specific to two dimensions.
Submitted 29 September, 2021; v1 submitted 12 May, 2021;
originally announced May 2021.
-
L2C: Describing Visual Differences Needs Semantic Understanding of Individuals
Authors:
An Yan,
Xin Eric Wang,
Tsu-Jui Fu,
William Yang Wang
Abstract:
Recent advances in language and vision have pushed research from captioning a single image to describing visual differences between image pairs. Given two images, I_1 and I_2, and the task of generating a description W_{1,2} comparing them, existing methods directly model the { I_1, I_2 } -> W_{1,2} mapping without the semantic understanding of individuals. In this paper, we introduce a Learning-to-Compare (L2C) model, which learns to understand the semantic structures of these two images and compare them while learning to describe each one. We demonstrate that L2C benefits from a comparison between explicit semantic representations and single-image captions, and generalizes better to new test image pairs. It outperforms the baseline on both automatic evaluation and human evaluation for the Birds-to-Words dataset.
Submitted 2 February, 2021;
originally announced February 2021.
-
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Authors:
Wanrong Zhu,
Xin Eric Wang,
Tsu-Jui Fu,
An Yan,
Pradyumna Narayana,
Kazoo Sone,
Sugato Basu,
William Yang Wang
Abstract:
One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is one such task, in which an agent follows natural language instructions and navigates a real-life urban environment. Due to the lack of human-annotated instructions that illustrate intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach and leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, with a relative improvement of 8.7% in task completion rate on the test set.
Submitted 3 February, 2021; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos
Authors:
Ajay Kumar Tanwani,
Pierre Sermanet,
Andy Yan,
Raghav Anand,
Mariano Phielipp,
Ken Goldberg
Abstract:
Learning meaningful visual representations in an embedding space can facilitate generalization in downstream tasks such as action segmentation and imitation. In this paper, we learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options in a semi-supervised manner. We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations by minimizing a metric learning loss in a Siamese network: images from the same action segment are pulled together and pushed away from randomly sampled images of other segments, while respecting the temporal ordering of the images. The embeddings are iteratively segmented with a recurrent neural network for a given parametrization of the embedding space after pre-training the Siamese network. We only use a small set of labeled video segments to semantically align the embedding space and assign pseudo-labels to the remaining unlabeled data by inference on the learned model parameters. We demonstrate the use of this representation to imitate surgical suturing motions from publicly available videos of the JIGSAWS dataset. Results give 85.5% segmentation accuracy on average, suggesting a performance improvement over several state-of-the-art baselines, while kinematic pose imitation gives 0.94 centimeter error in position per observation on the test set. Videos, code and data are available at https://sites.google.com/view/motion2vec
Submitted 31 May, 2020;
originally announced June 2020.
-
Continuously tuning the refractive indices by restructuring materials on the 20-1000 atoms scale: improving anti-reflection coating designs
Authors:
Jacob Poole,
Aidong Yan,
Paul Ohodnicki,
Kevin Chen
Abstract:
We demonstrate the capability of block-copolymer templating to tune the refractive indices of functional oxides over a broad range by structuring materials on the 20-1000 atoms scale, with a simple one-pot synthesis. The presented method is then combined with genetic algorithm-based optimization to explore its application for anti-reflection coating design. Merging these techniques allows for the realization of a minimal two-layer anti-reflection stack for silicon with broadband reflectivity of just 3%, down from the nominal value of 38%, over a 120° angular span, validated by fabrication followed by optical measurements.
Submitted 4 April, 2020;
originally announced April 2020.
-
The ultrafast onset of exciton formation in 2D semiconductors
Authors:
Chiara Trovatello,
Florian Katsch,
Nicholas J. Borys,
Malte Selig,
Kaiyuan Yao,
Rocio Borrego-Varillas,
Francesco Scotognella,
Ilka Kriegel,
Aiming Yan,
Alex Zettl,
P. James Schuck,
Andreas Knorr,
Giulio Cerullo,
Stefano Dal Conte
Abstract:
The equilibrium and non-equilibrium optical properties of single-layer transition metal dichalcogenides (TMDs) are determined by strongly bound excitons. Exciton relaxation dynamics in TMDs have been extensively studied by time-domain optical spectroscopies. However, the formation dynamics of excitons following non-resonant photoexcitation of free electron-hole pairs have been challenging to directly probe because of their inherently fast timescales. Here we use extremely short optical pulses to non-resonantly excite an electron-hole plasma and show the formation of two-dimensional excitons in single-layer MoS2 on the timescale of 30 fs via the induced changes to photo-absorption. These formation dynamics are significantly faster than in conventional 2D quantum wells and are attributed to the intense Coulombic interactions present in 2D TMDs. A theoretical model of a coherent polarization that dephases and relaxes to an incoherent exciton population reproduces the experimental dynamics on the sub-100-fs timescale and sheds light on the underlying mechanism of how the lowest-energy excitons, which are the most important for optoelectronic applications, form from higher-energy excitations. Importantly, a phonon-mediated exciton cascade from higher energy states to the ground excitonic state is found to be the rate-limiting process. These results set an ultimate timescale for exciton formation in TMDs and elucidate the exceptionally fast physical mechanism behind this process.
Submitted 17 February, 2020;
originally announced February 2020.
-
Cross-Lingual Vision-Language Navigation
Authors:
An Yan,
Xin Eric Wang,
Jiangtao Feng,
Lei Li,
William Yang Wang
Abstract:
Commanding a robot to navigate with natural language instructions is a long-term goal for grounded language understanding and robotics. However, previous studies on vision-language navigation (VLN) have predominantly used English. To go beyond English and serve people speaking different languages, we collect a bilingual Room-to-Room (BL-R2R) dataset, extending the original benchmark with new Chinese instructions. Based on this newly introduced dataset, we study how an agent can be trained on existing English instructions but navigate effectively with another language under a zero-shot learning scenario. Without any training data of the target language, our model shows competitive results even compared to a model with full access to the target language training data. Moreover, we investigate the transfer ability of our model when given a certain amount of target language training data.
Submitted 5 December, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Analyzing and Improving Neural Networks by Generating Semantic Counterexamples through Differentiable Rendering
Authors:
Lakshya Jain,
Varun Chandrasekaran,
Uyeong Jang,
Wilson Wu,
Andrew Lee,
Andy Yan,
Steven Chen,
Somesh Jha,
Sanjit A. Seshia
Abstract:
Even as deep neural networks (DNNs) have achieved remarkable success on vision-related tasks, their performance is brittle to transformations in the input. Of particular interest are semantic transformations that model changes that have a basis in the physical world, such as rotations, translations, changes in lighting or camera pose. In this paper, we show how differentiable rendering can be utilized to generate images that are informative, yet realistic, and which can be used to analyze DNN performance and improve its robustness through data augmentation. Given a differentiable renderer and a DNN, we show how to use off-the-shelf attacks from adversarial machine learning to generate semantic counterexamples -- images where semantic features are changed so as to produce misclassifications or misdetections. We validate our approach on DNNs for image classification and object detection. For classification, we show that semantic counterexamples, when used to augment the dataset, (i) improve generalization performance, (ii) enhance robustness to semantic transformations, and (iii) transfer between models. Additionally, in comparison to sampling-based semantic augmentation, our technique generates more informative data in a sample-efficient manner.
Submitted 17 July, 2020; v1 submitted 1 October, 2019;
originally announced October 2019.
-
CosRec: 2D Convolutional Neural Networks for Sequential Recommendation
Authors:
An Yan,
Shuo Cheng,
Wang-Cheng Kang,
Mengting Wan,
Julian McAuley
Abstract:
Sequential patterns play an important role in building modern recommender systems. To this end, several recommender systems have been built on top of Markov Chains and Recurrent Models (among others). Although these sequential models have proven successful at a range of tasks, they still struggle to uncover complex relationships nested in user purchase histories. In this paper, we argue that modeling pairwise relationships directly leads to an efficient representation of sequential features and captures complex item correlations. Specifically, we propose a 2D convolutional network for sequential recommendation (CosRec). It encodes a sequence of items into a three-way tensor; learns local features using 2D convolutional filters; and aggregates high-order interactions in a feedforward manner. Quantitative results on two public datasets show that our method outperforms both conventional methods and recent sequence-based approaches, achieving state-of-the-art performance on various evaluation metrics.
Submitted 26 August, 2019;
originally announced August 2019.
-
FairST: Equitable Spatial and Temporal Demand Prediction for New Mobility Systems
Authors:
An Yan,
Bill Howe
Abstract:
Emerging transportation modes, including car-sharing, bike-sharing, and ride-hailing, are transforming urban mobility but have been shown to reinforce socioeconomic inequities. Spatiotemporal demand prediction models for these new mobility regimes must therefore consider fairness as a first-class design requirement. We present FairST, a fairness-aware model for predicting demand for new mobility systems. Our approach utilizes 1D, 2D and 3D convolutions to integrate various urban features and learn the spatial-temporal dynamics of a mobility system, but we include fairness metrics as a form of regularization to make the predictions more equitable across demographic groups. We propose two novel spatiotemporal fairness metrics, a region-based fairness gap (RFG) and an individual-based fairness gap (IFG). Both quantify equity in a spatiotemporal context, but vary by whether demographics are labeled at the region level (RFG) or whether population distribution information is available (IFG). Experimental results on real bike share and ride share datasets demonstrate the effectiveness of the proposed model: FairST not only reduces the fairness gap by more than 80%, but can surprisingly achieve better accuracy than state-of-the-art yet fairness-oblivious methods including LSTMs, ConvLSTMs, and 3D CNN.
Submitted 21 June, 2019;
originally announced July 2019.
-
Sample phase gradient and fringe phase shift in dual phase grating X-ray interferometry
Authors:
Aimin Yan,
Xizeng Wu,
Hong Liu
Abstract:
One of the key tasks in grating-based x-ray phase contrast imaging is to accurately retrieve local phase gradients of a sample from measured intensity fringe shifts. To fulfill this task in dual phase grating interferometry, one needs to know the exact mathematical relationship between the two. In this work, using intuitive analysis of the sample-generated fringe shifts based on the beat pattern formation mechanism, the authors derived the formulas relating the sample's phase gradients to fringe phase shifts. These formulas also provide a design optimization tool for dual phase grating interferometry.
Submitted 20 June, 2019;
originally announced June 2019.
-
Clarification on Generalized Lau condition for X-ray interferometers based on dual phase gratings
Authors:
Aimin Yan,
Xizeng Wu,
Hong Liu
Abstract:
To implement dual phase grating x-ray interferometry with x-ray tubes, one needs to incorporate an absorbing source grating. In order to attain good fringe visibility, the period of a source grating should be subject to a stringent condition. In the literature, some authors claim that the Lau-condition in Talbot-Lau interferometry can be literally transferred to dual phase grating interferometry. In this work we show that this claim is incorrect. Instead, through an intuitive geometrical analysis of fringe formation, we derived a new generalized Lau-condition that provides a useful design tool for implementation of dual phase grating interferometry.
Submitted 4 June, 2019;
originally announced June 2019.