-
Organ At Risk Segmentation with Multiple Modality
Authors:
Kuan-Lun Tseng,
Winston Hsu,
Chun-ting Wu,
Ya-Fang Shih,
Fan-Yun Sun
Abstract:
With the development of image segmentation in computer vision, biomedical image segmentation has achieved remarkable progress on brain tumor segmentation and Organ At Risk (OAR) segmentation. However, most research uses only a single modality, such as Computed Tomography (CT) scans, while in real-world scenarios doctors often use multiple modalities to obtain more accurate results. To better leverage different modalities, we have collected a large dataset consisting of 136 cases diagnosed with nasopharyngeal cancer, each with CT and MR images. In this paper, we propose to use a Generative Adversarial Network (GAN) to perform CT-to-MR transformation, synthesizing MR images instead of aligning the two modalities. The synthesized MR images can be trained jointly with CT to achieve better performance. In addition, we use an instance segmentation model to extend the OAR segmentation task to segment both organs and tumor regions. The collected dataset will be made public soon.
Submitted 17 October, 2019;
originally announced October 2019.
-
360-Indoor: Towards Learning Real-World Objects in 360° Indoor Equirectangular Images
Authors:
Shih-Han Chou,
Cheng Sun,
Wen-Yen Chang,
Wan-Ting Hsu,
Min Sun,
Jianlong Fu
Abstract:
While there are several widely used object detection datasets, current computer vision algorithms are still limited to conventional images, which restrict our view to a narrow region. In contrast, 360° images provide a complete view of the surroundings. In this paper, our goal is to provide a standard dataset to support the vision and machine learning communities in the 360° domain. To this end, we present a real-world 360° panoramic object detection dataset, 360-Indoor, which is a new benchmark for visual object detection and class recognition in 360° indoor images. It is built by gathering images of complex indoor scenes containing common objects and densely annotating them with bounding fields-of-view. In addition, 360-Indoor has several distinct properties: (1) the largest number of categories (37 labels in total) and (2) the most complete annotations on average (27 bounding boxes per image). The selected 37 objects are all common in indoor scenes. With around 3k images and 90k labels in total, 360-Indoor is the largest dataset for detection in 360° images. Finally, extensive experiments with state-of-the-art methods for both classification and detection are provided. We will release this dataset in the near future.
Submitted 3 October, 2019;
originally announced October 2019.
-
Real-time solar image classification: assessing spectral, pixel-based approaches
Authors:
J. Marcus Hughes,
Vicki W. Hsu,
Daniel B. Seaton,
Hazel M. Bain,
Jonathan M. Darnel,
Larisza Krista
Abstract:
In order to utilize solar imagery for real-time feature identification and large-scale data science investigations of solar structures, we need maps of the Sun where phenomena, or themes, are labeled. Since solar imagers produce observations every few minutes, it is not feasible to label all images by hand. Here, we compare three machine learning algorithms performing solar image classification using extreme ultraviolet and Hydrogen-alpha images: a maximum-likelihood model assuming a single normal probability distribution for each theme from Rigler et al. (2012), a maximum-likelihood model with an underlying Gaussian mixture distribution, and a random forest model. We create a small database of expert-labeled maps to train and test these algorithms. Due to the ambiguity between the labels created by different experts, a collaborative labeling scheme is used to incorporate all inputs. We find that the random forest algorithm performs best among the three algorithms. The advantages of this algorithm are best highlighted in three areas: comparison of outputs to hand-drawn maps, response to short-term variability, and tracking of long-term changes on the Sun. Our work indicates that the next generation of solar image classification algorithms would benefit significantly from using spatial structure recognition, compared to only using spectral, pixel-by-pixel brightness distributions.
Submitted 30 September, 2019;
originally announced October 2019.
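The first of the three models compared above labels each pixel with the theme whose normal distribution makes its brightness most likely. The Python sketch below illustrates that maximum-likelihood rule only. It assumes per-pixel feature tuples (for example, brightness in a few EUV and H-alpha channels), simplifies each theme's distribution to a diagonal covariance, and uses made-up theme names and values; it is not the Rigler et al. (2012) code.

```python
import math
from collections import defaultdict


def fit_themes(pixels, labels):
    """Fit one Gaussian per theme from labeled pixel feature tuples.

    Assumption: diagonal covariance, i.e. channels are treated as independent.
    """
    groups = defaultdict(list)
    for x, theme in zip(pixels, labels):
        groups[theme].append(x)
    models = {}
    for theme, xs in groups.items():
        n, dims = len(xs), len(xs[0])
        means = [sum(x[d] for x in xs) / n for d in range(dims)]
        # Floor the variance so a constant channel does not divide by zero.
        variances = [max(sum((x[d] - means[d]) ** 2 for x in xs) / n, 1e-6) for d in range(dims)]
        models[theme] = (means, variances)
    return models


def log_likelihood(x, means, variances):
    """Log-density of x under a diagonal-covariance normal distribution."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        for xi, mu, var in zip(x, means, variances)
    )


def classify(x, models):
    """Maximum-likelihood theme for one pixel (no class priors)."""
    return max(models, key=lambda theme: log_likelihood(x, *models[theme]))
```

Classifying every pixel of an image this way yields a theme map; the random forest that performed best in the study replaces this per-theme density with a learned decision rule.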
-
Natural Adversarial Sentence Generation with Gradient-based Perturbation
Authors:
Yu-Lun Hsieh,
Minhao Cheng,
Da-Cheng Juan,
Wei Wei,
Wen-Lian Hsu,
Cho-Jui Hsieh
Abstract:
This work proposes a novel algorithm to generate natural language adversarial inputs for text classification models in order to investigate the robustness of these models. It applies gradient-based perturbation to the sentence embeddings that serve as features for the classifier, and learns a decoder for generation. We apply this method to a sentiment analysis model and verify its effectiveness in inducing incorrect predictions. We also conduct quantitative and qualitative analyses of these examples and demonstrate that our approach generates more natural adversaries. In addition, it can be used to perform successful black-box attacks, which involve attacking other existing models whose parameters are unknown. On a public sentiment analysis API, the proposed method causes a 20% relative decrease in average accuracy and a 74% relative increase in absolute error.
Submitted 6 September, 2019;
originally announced September 2019.
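The abstract does not spell out the perturbation rule, so the following is only a hedged illustration of a gradient-based perturbation on a sentence embedding. It assumes a hypothetical logistic-regression sentiment classifier on the embedding (so the gradient has a closed form) and takes a single fast-gradient-sign step; the learned decoder that maps the perturbed embedding back to a sentence is not shown.

```python
import math


def _sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def predict_positive(embedding, weights, bias):
    """P(positive) from a hypothetical logistic-regression classifier on the embedding."""
    z = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def perturb_embedding(embedding, label, weights, bias, epsilon):
    """One fast-gradient-sign step that increases the cross-entropy loss.

    For logistic regression, d(loss)/d(embedding) = (p - label) * weights,
    so no autodiff library is needed in this toy setting.
    """
    p = predict_positive(embedding, weights, bias)
    grad = [(p - label) * w for w in weights]
    return [e + epsilon * _sign(g) for e, g in zip(embedding, grad)]
```

With a neural classifier, the same step would use the gradient from an autodiff framework; `epsilon` bounds how far each embedding coordinate moves, which is one way to keep the decoded sentence close to the original.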
-
Theory of reflectionless scattering modes
Authors:
William R. Sweeney,
Chia Wei Hsu,
A. Douglas Stone
Abstract:
We develop the theory of a special type of scattering state in which a set of asymptotic channels is chosen as inputs and the complementary set as outputs, and there is zero reflection back into the input channels. In general, an infinite number of such solutions exist at discrete complex frequencies. Our results apply to linear electromagnetic and acoustic wave scattering and also to quantum scattering, in all dimensions, for arbitrary geometries including scatterers in free space, and for any choice of the input/output sets. We refer to such a state as reflection-zero (R-zero) when it occurs off the real-frequency axis and as a reflectionless scattering mode (RSM) when it is tuned to a real frequency as a steady-state solution. Such reflectionless behavior requires a specific monochromatic input wavefront, given by the eigenvector of a filtered scattering matrix with eigenvalue zero. Steady-state RSMs may be realized either by index tuning, which does not break flux conservation, or by gain-loss tuning. RSMs of flux-conserving cavities are bidirectional, while those of non-flux-conserving cavities are generically unidirectional. Cavities with ${\cal PT}$-symmetry have unidirectional R-zeros in complex-conjugate pairs, implying that reflectionless states naturally arise at real frequencies for small values of the gain-loss parameter but move into the complex-frequency plane after a spontaneous ${\cal PT}$-breaking transition. Numerical examples of RSMs are given for one-dimensional cavities with and without gain/loss, a ${\cal PT}$ cavity, a two-dimensional multiwaveguide junction, and a two-dimensional deformed dielectric cavity in free space. We outline and implement a general technique for solving such problems, which shows promise for designing photonic structures that are perfectly impedance-matched for specific inputs or that perfectly convert inputs from one set of modes to a complementary set.
Submitted 9 September, 2019;
originally announced September 2019.
-
Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos
Authors:
Sebastian Agethen,
Winston H. Hsu
Abstract:
Action recognition greatly benefits motion understanding in video analysis. Recurrent networks such as long short-term memory (LSTM) networks are a popular choice for motion-aware sequence learning tasks. Recently, a convolutional extension of LSTM was proposed, in which input-to-hidden and hidden-to-hidden transitions are modeled through convolution with a single kernel. This implies an unavoidable trade-off between effectiveness and efficiency. Herein, we propose a new enhancement to convolutional LSTM networks that accommodates multiple convolutional kernels and layers. This resembles a Network-in-LSTM approach, which mitigates the aforementioned trade-off. In addition, we propose an attention-based mechanism that is specifically designed for our multi-kernel extension. We evaluated our proposed extensions in a supervised classification setting on the UCF-101 and Sports-1M datasets, and the results show that our enhancements improve accuracy. We also undertook a qualitative analysis to reveal the characteristics of our system and the convolutional LSTM baseline.
Submitted 30 July, 2019;
originally announced August 2019.
-
Indoor Depth Completion with Boundary Consistency and Self-Attention
Authors:
Yu-Kai Huang,
Tsung-Han Wu,
Yueh-Cheng Liu,
Winston H. Hsu
Abstract:
Depth estimation features are helpful for 3D recognition. Commodity-grade depth cameras can capture depth and color images in real time. However, glossy, transparent, or distant surfaces cannot be scanned properly by the sensor. As a result, enhancing and restoring sensed depth is an important task. Depth completion aims to fill the holes that sensors fail to detect, which is still a complex task for machines to learn. Traditional hand-tuned methods have reached their limits, while neural-network-based methods tend to copy and interpolate the output from surrounding depth values. This leads to blurred boundaries and a loss of depth-map structure. Consequently, our main work is to design an end-to-end network that improves completed depth maps while maintaining edge clarity. We utilize a self-attention mechanism, previously used in image inpainting, to extract more useful information in each convolutional layer so that the completed depth map is enhanced. In addition, we propose a boundary consistency concept to enhance depth map quality and structure. Experimental results validate the effectiveness of our self-attention and boundary consistency schema, which outperforms previous state-of-the-art depth completion work on the Matterport3D dataset. Our code is publicly available at https://github.com/tsunghan-wu/Depth-Completion.
Submitted 8 June, 2022; v1 submitted 22 August, 2019;
originally announced August 2019.
-
A Unified Point-Based Framework for 3D Segmentation
Authors:
Hung-Yueh Chiang,
Yen-Liang Lin,
Yueh-Cheng Liu,
Winston H. Hsu
Abstract:
3D point cloud segmentation remains challenging for structureless and textureless regions. We present a new unified point-based framework for 3D point cloud segmentation that effectively optimizes pixel-level features, geometrical structures, and global context priors of an entire scene. By back-projecting 2D image features into 3D coordinates, our network learns 2D textural appearance and 3D structural features in a unified framework. In addition, we investigate a global context prior to obtain a better prediction. We evaluate our framework on the ScanNet online benchmark and show that our method outperforms several state-of-the-art approaches. We also explore synthesizing camera poses in 3D reconstructed scenes to achieve higher performance. In-depth analysis of feature combinations and synthetic camera poses verifies that features from different modalities benefit each other and that dense camera pose sampling further improves the segmentation results.
Submitted 18 August, 2019; v1 submitted 1 August, 2019;
originally announced August 2019.
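Back-projecting 2D image features into 3D coordinates is commonly done with the pinhole camera model. The sketch below assumes per-pixel metric depth, pinhole intrinsics `(fx, fy, cx, cy)`, and a known camera-to-world pose given as a rotation matrix and translation; all function and parameter names are illustrative and not taken from the paper's code.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with metric depth -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply a camera-to-world pose: p_world = R p_cam + t (R as 3x3 nested lists)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i] for i in range(3)
    )


def lift_features(feature_map, depth_map, intrinsics, rotation=None, translation=(0.0, 0.0, 0.0)):
    """Attach each pixel's 2D feature vector to its back-projected 3D point.

    feature_map[v][u] is a feature vector; depth_map[v][u] is depth (0 = missing).
    """
    fx, fy, cx, cy = intrinsics
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    rotation = identity if rotation is None else rotation
    points = []
    for v, (feat_row, depth_row) in enumerate(zip(feature_map, depth_map)):
        for u, (feat, d) in enumerate(zip(feat_row, depth_row)):
            if d > 0:  # skip pixels where the sensor returned no depth
                p_cam = back_project(u, v, d, fx, fy, cx, cy)
                points.append((camera_to_world(p_cam, rotation, translation), feat))
    return points
```

Running `lift_features` for every view (real or synthesized camera pose) puts all 2D features in one world frame, where they can be attached to the nearest points of the reconstructed point cloud.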
-
Cities and space: Common power laws and spatial fractal structures
Authors:
Tomoya Mori,
Tony E. Smith,
Wen-Tai Hsu
Abstract:
City size distributions are known to be well approximated by power laws across a wide range of countries. But such distributions are also meaningful at other spatial scales, such as within certain regions of a country. Using data from China, France, Germany, India, Japan, and the US, we first document that large cities are significantly more spaced out than would be expected by chance alone. We next construct spatial hierarchies for countries by first partitioning geographic space using a given number of their largest cities as cell centers, and then continuing this partitioning procedure within each cell recursively. We find that city size distributions in different parts of these spatial hierarchies exhibit power laws that are again far more similar than would be expected by chance alone -- suggesting the existence of a spatial fractal structure.
Submitted 29 July, 2019;
originally announced July 2019.
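The recursive partitioning of geographic space described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: each city is assigned to the nearest of the `k` largest cities in its cell (a Voronoi-style partition using planar Euclidean distance on hypothetical `xy` coordinates), and recursion stops at a fixed depth or when a cell holds no more than `k` cities.

```python
import math


def build_hierarchy(cities, k, depth):
    """Recursively partition cities around the k most populous cities in each cell.

    cities: list of dicts with hypothetical keys "name", "pop", and "xy".
    Returns a list of cells: {"center": name, "cities": [...], "children": [...]}.
    """
    if depth == 0 or len(cities) <= k:
        return []
    centers = sorted(cities, key=lambda c: c["pop"], reverse=True)[:k]
    cells = {c["name"]: [] for c in centers}
    for city in cities:
        # Assumption: a city belongs to the cell of its nearest center.
        nearest = min(centers, key=lambda c: math.dist(c["xy"], city["xy"]))
        cells[nearest["name"]].append(city)
    return [
        {"center": name, "cities": members, "children": build_hierarchy(members, k, depth - 1)}
        for name, members in cells.items()
    ]
```

The `pop` values collected from each cell's `cities` list at a given level are the within-region city size distributions whose power-law behavior the abstract compares across the hierarchy.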
-
A Novel Approach for Detection and Ranking of Trendy and Emerging Cyber Threat Events in Twitter Streams
Authors:
Avishek Bose,
Vahid Behzadan,
Carlos Aguirre,
William H. Hsu
Abstract:
We present a new machine learning and text information extraction approach to detection of cyber threat events in Twitter that are novel (previously non-extant) and developing (marked by significance with respect to similarity with a previously detected event). While some existing approaches to event detection measure novelty and trendiness, typically as independent criteria and occasionally as a holistic measure, this work focuses on detecting both novel and developing events using an unsupervised machine learning approach. Furthermore, our proposed approach enables the ranking of cyber threat events based on an importance score by extracting the tweet terms that are characterized as named entities, keywords, or both. We also impute influence to users in order to assign a weighted score to noun phrases in proportion to user influence and the corresponding event scores for named entities and keywords. To evaluate the performance of our proposed approach, we measure the efficiency and detection error rate for events over a specified time interval, relative to human annotator ground truth.
Submitted 12 July, 2019;
originally announced July 2019.
-
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Authors:
Wei-Ning Hsu,
David Harwath,
James Glass
Abstract:
Transfer learning aims to reduce the amount of data required to excel at a new task by reusing the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether an image and a speech utterance are semantically correlated, without using any textual transcripts. As the semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use each layer's features separately as input to train multiple speech recognition models. Empirical results demonstrate that layers closer to the input retain more phonetic information, while subsequent layers exhibit greater invariance to domain shift. Moreover, while most previous studies include the speech recognition training data when training the feature extractor, our grounding models are not trained on any of those data, indicating broader applicability to new domains.
Submitted 9 July, 2019;
originally announced July 2019.
-
Video Question Generation via Cross-Modal Self-Attention Networks Learning
Authors:
Yu-Siang Wang,
Hung-Ting Su,
Chen-Hsi Chang,
Zhe-Yu Liu,
Winston H. Hsu
Abstract:
We introduce a novel task, Video Question Generation (Video QG). A Video QG model automatically generates questions given a video clip and its corresponding dialogues. Video QG requires a range of skills -- sentence comprehension, temporal relations, the interplay between vision and language, and the ability to ask meaningful questions. To address this, we propose a novel semantic-rich cross-modal self-attention (SRCMSA) network to aggregate the multi-modal and diverse features. More precisely, we enrich the semantics of video frames by integrating object-level information, and we jointly consider cross-modal attention for the video question generation task. Our proposed model improves the baseline BLEU-4 score on the TVQA dataset from 7.58 to 14.48. We believe this work paves a new path toward understanding challenging video input, and we provide a detailed analysis in terms of question diversity, which opens avenues for future investigation.
Submitted 16 February, 2020; v1 submitted 5 July, 2019;
originally announced July 2019.
-
Hybridized Threshold Clustering for Massive Data
Authors:
Jianmei Luo,
ChandraVyas Annakula,
Aruna Sai Kannamareddy,
Jasjeet S. Sekhon,
William Henry Hsu,
Michael Higgins
Abstract:
As the size $n$ of datasets becomes massive, many commonly used clustering algorithms (for example, $k$-means or hierarchical agglomerative clustering (HAC)) require prohibitive computational cost and memory. In this paper, we propose a solution to these clustering problems by extending threshold clustering (TC) to problems of instance selection. TC is a recently developed clustering algorithm designed to partition data into many small clusters in linearithmic time (on average). Our proposed clustering method is as follows. First, TC is performed and clusters are reduced into single "prototype" points. Then, TC is applied repeatedly to these prototype points until sufficient data reduction has been obtained. Finally, a more sophisticated clustering algorithm is applied to the reduced prototype points, thereby obtaining a clustering of all $n$ data points. This entire procedure is called iterative hybridized threshold clustering (IHTC). Through simulation results and by applying our methodology to several real datasets, we show that IHTC combined with $k$-means or HAC substantially reduces the run time and memory usage of the original clustering algorithms while still preserving their performance. Additionally, IHTC helps prevent singular data points from being overfit by clustering algorithms.
Submitted 5 July, 2019;
originally announced July 2019.
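The three-step IHTC pipeline (threshold clustering, repeated prototype reduction, then a final clustering of the prototypes) can be sketched in plain Python. The grouping step here is a naive greedy stand-in that only mimics TC's guarantee of at least `t` points per cluster; it does not reproduce TC's actual construction or its linearithmic run time, and all function names are hypothetical. Prototypes carry weights (the number of original points they represent), and the final step is a weighted k-means.

```python
import math


def greedy_threshold_groups(points, t):
    """Naive stand-in for TC: split points into groups of at least t points.

    Runs in O(n^2 log n); the real TC algorithm is designed to be much faster.
    """
    unassigned = list(range(len(points)))
    groups = []
    while len(unassigned) >= t:
        seed = unassigned[0]
        # Stable sort keeps the seed first; take it plus its t - 1 nearest peers.
        unassigned.sort(key=lambda i: math.dist(points[i], points[seed]))
        groups.append(unassigned[:t])
        unassigned = unassigned[t:]
    if not groups:  # fewer than t points overall: keep them together
        return [unassigned]
    for i in unassigned:  # leftovers join the group whose seed is nearest
        min(groups, key=lambda g: math.dist(points[g[0]], points[i])).append(i)
    return groups


def centroid(points, weights, members):
    """Weighted mean of the member points and their total weight."""
    total = sum(weights[i] for i in members)
    dims = len(points[members[0]])
    mean = tuple(sum(weights[i] * points[i][d] for i in members) / total for d in range(dims))
    return mean, total


def ihtc_reduce(points, t, max_prototypes):
    """Repeat grouping + prototype reduction until few enough prototypes remain.

    Returns (prototypes, weights, assignment), where assignment[j] is the index
    of the prototype that original point j was merged into.
    """
    if t < 2 or max_prototypes < 1:
        raise ValueError("need t >= 2 and max_prototypes >= 1")
    protos, weights = list(points), [1.0] * len(points)
    assignment = list(range(len(points)))
    while len(protos) > max_prototypes:
        groups = greedy_threshold_groups(protos, t)
        parent = {}
        new_protos, new_weights = [], []
        for gid, members in enumerate(groups):
            mean, total = centroid(protos, weights, members)
            new_protos.append(mean)
            new_weights.append(total)
            for m in members:
                parent[m] = gid
        assignment = [parent[a] for a in assignment]
        protos, weights = new_protos, new_weights
    return protos, weights, assignment


def weighted_kmeans(protos, weights, k, iters=20):
    """Lloyd's k-means on weighted prototypes, farthest-point initialisation."""
    centers = [protos[0]]
    while len(centers) < k:
        centers.append(max(protos, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign():
        return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in protos]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c], _ = centroid(protos, weights, members)
    return assign()


def ihtc_kmeans(points, t, max_prototypes, k):
    """Cluster all points by clustering the reduced prototypes, then mapping back."""
    protos, weights, assignment = ihtc_reduce(points, t, max_prototypes)
    proto_labels = weighted_kmeans(protos, weights, k)
    return [proto_labels[a] for a in assignment]
```

The expensive final algorithm only ever sees `max_prototypes` points, which is where the run-time and memory savings come from; the `assignment` list is what carries the prototype labels back to all $n$ original points.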
-
Learnable Gated Temporal Shift Module for Deep Video Inpainting
Authors:
Ya-Liang Chang,
Zhe Yu Liu,
Kuan-Ying Lee,
Winston Hsu
Abstract:
The main issue in video inpainting is how to efficiently use temporal information to recover videos consistently. Conventional 2D CNNs have achieved good performance on image inpainting but often produce temporally inconsistent results, with flickering frames, when applied to videos (see https://www.youtube.com/watch?v=87Vh1HDBjD0&list=PLPoVtv-xp_dL5uckIzz1PKwNjg1yI0I94&index=1); 3D CNNs can capture temporal information but are computationally intensive and hard to train. In this paper, we present a novel component termed Learnable Gated Temporal Shift Module (LGTSM) for video inpainting models that can effectively tackle arbitrary video masks without the additional parameters of 3D convolutions. LGTSM is designed to let 2D convolutions make use of neighboring frames more efficiently, which is crucial for video inpainting. Specifically, in each layer, LGTSM learns to shift some channels to their temporal neighbors so that 2D convolutions can be enhanced to handle temporal information. Meanwhile, a gated convolution is applied to the layer to identify the masked areas, which are harmful to conventional convolutions. On the FaceForensics and Free-form Video Inpainting (FVI) datasets, our model achieves state-of-the-art results with only 33% of the parameters and inference time.
Submitted 9 July, 2019; v1 submitted 1 July, 2019;
originally announced July 2019.
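The channel-shifting idea that LGTSM builds on can be illustrated with a fixed, non-learned shift in the style of the Temporal Shift Module (TSM). In the sketch below, a fraction of channels in each frame is replaced by the same channels from the next or previous frame, with zero padding at the clip boundaries. Spatial dimensions are dropped for brevity, `fold_div` is an illustrative parameter, and LGTSM's learnable shift kernels and gated convolution are not shown.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a slice of channels one step along time (TSM-style, zero padded).

    frames: list over time of per-frame channel lists, frames[t][c].
    The first C // fold_div channels take their value from the next frame,
    the next C // fold_div from the previous frame; the rest are unchanged.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:  # pull from the future frame
                out[t][c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
            elif c < 2 * fold:  # pull from the past frame
                out[t][c] = frames[t - 1][c] if t > 0 else 0.0
            else:  # untouched channels keep the current frame's value
                out[t][c] = frames[t][c]
    return out
```

Because the shift only moves existing activations, a following 2D convolution sees information from neighboring frames at no extra parameter cost, which is the efficiency argument the abstract makes against 3D convolutions.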
-
Angular memory effect of transmission eigenchannels
Authors:
Hasan Yılmaz,
Chia Wei Hsu,
Arthur Goetschy,
Stefan Bittner,
Stefan Rotter,
Alexey Yamilov,
Hui Cao
Abstract:
The optical memory effect has emerged as a powerful tool for imaging through multiple-scattering media; however, the finite angular range of the memory effect limits the field of view. Here, we demonstrate experimentally that selective coupling of incident light into a high-transmission channel increases the angular memory-effect range. This enhancement is attributed to the robustness of the high-transmission channels against perturbations such as sample tilt or wavefront tilt. Our work shows that the high-transmission channels provide an enhanced field of view for memory effect-based imaging through diffusive media.
Submitted 18 October, 2019; v1 submitted 14 June, 2019;
originally announced June 2019.
-
Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies
Authors:
Vahid Behzadan,
William Hsu
Abstract:
This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the Cartpole environment.
Submitted 3 June, 2019;
originally announced June 2019.
-
Adversarial Exploitation of Policy Imitation
Authors:
Vahid Behzadan,
William Hsu
Abstract:
This paper investigates a class of attacks targeting the confidentiality aspect of security in Deep Reinforcement Learning (DRL) policies. Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks. Such attacks leverage the loosely restricted ability of the attacker to iteratively query the model for labels, thereby allowing the attacker to forge a labeled dataset that can be used to train a replica of the original model. In this work, we demonstrate the feasibility of exploiting imitation learning techniques to launch model extraction attacks on DRL agents. Furthermore, we develop proof-of-concept attacks that leverage such techniques for black-box attacks against the integrity of DRL policies. We also discuss potential mitigation techniques.
Submitted 3 June, 2019;
originally announced June 2019.
-
Analysis and Improvement of Adversarial Training in DQN Agents With Adversarially-Guided Exploration (AGE)
Authors:
Vahid Behzadan,
William Hsu
Abstract:
This paper investigates the effectiveness of adversarial training in enhancing the robustness of Deep Q-Network (DQN) policies to state-space perturbations. We first present a formal analysis of adversarial training in DQN agents and its performance with respect to the proportion of adversarial perturbations to nominal observations used for training. Next, we consider the sample-inefficiency of current adversarial training techniques, and propose a novel Adversarially-Guided Exploration (AGE) mechanism based on a modified hybrid of the $ε$-greedy algorithm and Boltzmann exploration. We verify the feasibility of this exploration mechanism through experimental evaluation of its performance in comparison with the traditional decaying $ε$-greedy and parameter-space noise exploration algorithms.
Submitted 3 June, 2019;
originally announced June 2019.
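The abstract describes AGE as a modified hybrid of ε-greedy and Boltzmann exploration but does not give the modification. The sketch below shows only a plain, unmodified hybrid, which is an assumption: with probability 1 - ε the greedy action is taken, and otherwise the action is sampled from a softmax over Q-values rather than uniformly at random. The adversarial guidance itself is not modeled, and the function names are hypothetical.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; subtracting the max keeps exp() from overflowing."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_explore(q_values, epsilon, temperature, rng=random):
    """epsilon-greedy whose exploratory branch samples from a Boltzmann distribution."""
    if rng.random() >= epsilon:
        # Exploit: greedy action with respect to the current Q-values.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: favor higher-valued actions instead of picking uniformly.
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Compared with uniform ε-greedy exploration, the Boltzmann branch spends fewer exploratory steps on clearly poor actions; `temperature` controls how close that branch is to greedy (low values) or uniform (high values).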
-
RL-Based Method for Benchmarking the Adversarial Resilience and Robustness of Deep Reinforcement Learning Policies
Authors:
Vahid Behzadan,
William Hsu
Abstract:
This paper investigates the resilience and robustness of Deep Reinforcement Learning (DRL) policies to adversarial perturbations in the state space. We first present an approach for the disentanglement of vulnerabilities caused by representation learning of DRL agents from those that stem from the sensitivity of the DRL policies to distributional shifts in state transitions. Building on this approach, we propose two RL-based techniques for quantitative benchmarking of adversarial resilience and robustness in DRL policies against perturbations of state transitions. We demonstrate the feasibility of our proposals through experimental evaluation of resilience and robustness in DQN, A2C, and PPO2 policies trained in the Cartpole environment.
Submitted 3 June, 2019;
originally announced June 2019.
-
Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Authors:
Ya-Liang Chang,
Zhe Yu Liu,
Kuan-Ying Lee,
Winston Hsu
Abstract:
Free-form video inpainting is a very challenging task that could be widely used for video editing such as text removal. Existing patch-based methods cannot handle non-repetitive structures such as faces, while directly applying image-based inpainting models to videos results in temporal inconsistency (see http://bit.ly/2Fu1n6b ). In this paper, we introduce a deep-learning-based free-form video inpainting model, with proposed 3D gated convolutions to tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to enhance temporal consistency. In addition, we collect videos and design a free-form mask generation algorithm to build the free-form video inpainting (FVI) dataset for training and evaluation of video inpainting models. We demonstrate the benefits of these components, and experiments on both the FaceForensics and our FVI datasets suggest that our method is superior to existing ones. Related source code, full-resolution result videos and the FVI dataset can be found on GitHub at https://github.com/amjltc295/Free-Form-Video-Inpainting .
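To illustrate what a gated 3D convolution computes, here is a minimal single-channel NumPy sketch: one 3D convolution produces features, a second produces a soft gate, and the output is the activated features multiplied by the sigmoid of the gate, so the network can learn where the free-form mask makes input unreliable. The LeakyReLU activation, "valid" padding, and the helper names are assumptions; the paper's layers are multi-channel and their exact configuration is not stated in the abstract.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(x, w):
    """Naive single-channel 3D cross-correlation with 'valid' padding.
    x: (T, H, W) video volume; w: (kt, kh, kw) kernel."""
    windows = sliding_window_view(x, w.shape)  # (T-kt+1, H-kh+1, W-kw+1, kt, kh, kw)
    return np.einsum("thwijk,ijk->thw", windows, w)

def gated_conv3d(x, w_feat, w_gate):
    """Gated 3D convolution sketch: features modulated by a learned soft mask."""
    feat = conv3d_valid(x, w_feat)
    gate = conv3d_valid(x, w_gate)
    leaky = np.where(feat > 0, feat, 0.2 * feat)   # LeakyReLU on features (assumed)
    return leaky * (1.0 / (1.0 + np.exp(-gate)))   # sigmoid gate in (0, 1)

rng = np.random.default_rng(0)
video = rng.standard_normal((5, 8, 8))             # 5 frames of 8x8 pixels
out = gated_conv3d(video, rng.standard_normal((3, 3, 3)), rng.standard_normal((3, 3, 3)))
print(out.shape)  # (3, 6, 6)
```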
Submitted 23 July, 2019; v1 submitted 23 April, 2019;
originally announced April 2019.
-
FishNet: A Camera Localizer using Deep Recurrent Networks
Authors:
Hsin-I Chen,
Sebastian Agethen,
Chiamin Wu,
Winston Hsu,
Bing-Yu Chen
Abstract:
This paper proposes a robust localization system that employs deep learning for better scene representation and enhances the accuracy of 6-DOF camera pose estimation. Inspired by the fact that global scene structure can be revealed by a wide field of view, we leverage the large overlap between adjacent frames of a fisheye camera and the powerful high-level feature representations of deep learning. Our main contribution is a novel network architecture that extracts both temporal and spatial information using a Recurrent Neural Network. Specifically, we propose a novel pose regularization term combined with an LSTM. This leads to smoother pose estimation, especially for large outdoor scenes. Promising experimental results on three benchmark datasets demonstrate the effectiveness of the proposed approach.
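The abstract does not define its pose regularization term. As a hedged illustration of how such a term can encourage smoother trajectories, the sketch below adds a penalty on frame-to-frame changes of predicted 6-DOF pose vectors to a regression loss. The penalty form, the weight `lam`, and treating translation and rotation components on one scale are all assumptions, and no LSTM is modeled here.

```python
import numpy as np

def pose_smoothness_penalty(poses):
    """Hypothetical smoothness term: mean squared difference between consecutive
    6-DOF pose vectors (tx, ty, tz, rx, ry, rz) in a predicted sequence."""
    poses = np.asarray(poses, dtype=float)
    diffs = np.diff(poses, axis=0)                 # (N-1, 6) frame-to-frame changes
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

def total_loss(pred, target, lam=0.1):
    """Pose regression error plus a weighted smoothness regularizer (lam is assumed)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    regression = float(np.mean(np.sum((pred - target) ** 2, axis=1)))
    return regression + lam * pose_smoothness_penalty(pred)
```

In practice rotation and translation would usually be weighted separately, since their units differ.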
Submitted 22 April, 2019;
originally announced April 2019.
-
VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Authors:
Ya-Liang Chang,
Zhe Yu Liu,
Winston Hsu
Abstract:
Video object removal is a challenging task in video processing that often requires massive human effort. Given the mask of the foreground object in each frame, the goal is to complete (inpaint) the object region and generate a video without the target object. While deep-learning-based methods have recently achieved great success on the image inpainting task, they often produce inconsistent results between frames when applied to videos. In this work, we propose a novel learning-based Video Object Removal Network (VORNet) to solve the video object removal task in a spatio-temporally consistent manner by combining optical flow warping with an image-based inpainting model. Experiments are conducted on our Synthesized Video Object Removal (SVOR) dataset, built from the YouTube-VOS video segmentation dataset, and both objective and subjective evaluations demonstrate that VORNet generates more spatially and temporally consistent videos than existing methods.
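The flow-warping half of the combination can be sketched as follows: pixels under the object mask in the current frame are replaced by pixels fetched from the previous frame along an optical flow field. This is a simplified illustration under stated assumptions (grayscale frames, a given backward flow, nearest-neighbour sampling, hypothetical function names), not VORNet itself; in the described method a learned inpainting model would handle regions the warp cannot explain.

```python
import numpy as np

def backward_warp(prev_frame, flow):
    """Warp prev_frame toward the current frame with nearest-neighbour sampling.
    flow[y, x] = (dy, dx) points from a current pixel to its source in prev_frame."""
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_frame[src_y, src_x]

def fill_hole(curr_frame, mask, prev_frame, flow):
    """Replace masked (object) pixels with flow-warped pixels from the previous
    frame; unmasked pixels are kept from the current frame."""
    warped = backward_warp(prev_frame, flow)
    return np.where(mask, warped, curr_frame)
```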
Submitted 14 April, 2019;
originally announced April 2019.
-
An Unsupervised Autoregressive Model for Speech Representation Learning
Authors:
Yu-An Chung,
Wei-Ning Hsu,
Hao Tang,
James Glass
Abstract:
This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing the model to benefit from large quantities of unlabeled data. Speech representations learned by our model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches. Further analysis shows that different levels of speech information are captured by our model at different layers. In particular, the lower layers tend to be more discriminative for speakers, while the upper layers provide more phonetic content.
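The abstract does not spell out the training objective. A common autoregressive formulation, assumed here rather than taken from the abstract, predicts a feature frame several steps ahead from all past frames and penalizes the L1 error. The sketch shows only that objective; the toy predictor (a fixed linear map of the latest frame) stands in for the neural model and is hypothetical.

```python
import numpy as np

def future_frame_l1_loss(features, predict, n=3):
    """Hypothetical autoregressive objective: from frames x_1..x_t, predict x_{t+n}.
    features: (T, D) acoustic frames; predict: maps a prefix (t, D) -> (D,)."""
    T = features.shape[0]
    losses = [np.abs(predict(features[: t + 1]) - features[t + n]).sum()
              for t in range(T - n)]
    return float(np.mean(losses))

W = np.eye(4)  # toy weights for 4-dimensional frames (assumed)

def predict_last(prefix):
    """Toy causal predictor: a fixed linear map of the most recent frame only."""
    return prefix[-1] @ W
```

Because the target is the future input itself, no phonetic or word-boundary labels are needed.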
Submitted 18 June, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.
-
OpBerg: Discovering causal sentences using optimal alignments
Authors:
Justin Wood,
Nicholas J. Matiasz,
Alcino J. Silva,
William Hsu,
Alexej Abyzov,
Wei Wang
Abstract:
The biological literature is rich with sentences that describe causal relations. Methods that automatically extract such sentences can help biologists to synthesize the literature and even discover latent relations that had not been articulated explicitly. Current methods for extracting causal sentences are based on either machine learning or a predefined database of causal terms. Machine learning approaches require a large set of labeled training data and can be susceptible to noise. Methods based on predefined databases are limited by the quality of their curation and are unable to capture new concepts or mistakes in the input. We address these challenges by adapting and improving a method designed for a seemingly unrelated problem: finding alignments between genomic sequences. This paper presents a novel, better-performing method for extracting causal relations from text by aligning the part-of-speech representations of an input set with those of known causal sentences. Our experiments show that when applied to the task of finding causal sentences in biological literature, our method improves on the accuracy of other methods in a computationally efficient manner.
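To make the sequence-alignment idea concrete, the sketch below scores a candidate sentence's part-of-speech tag sequence against known causal sentences using classic Needleman-Wunsch global alignment, as used for genomic sequences. The scoring parameters, the tag set, and the "best match over references" rule are illustrative assumptions; the abstract does not give OpBerg's actual scoring scheme.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score between two POS-tag sequences.
    The match/mismatch/gap values are illustrative assumptions."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

def best_causal_match(candidate, known_causal):
    """Score a candidate by its best alignment to any known causal sentence."""
    return max(align_score(candidate, ref) for ref in known_causal)

# e.g. "X activates Y" -> NN VBZ NN; "X activates the Y" -> NN VBZ DT NN
print(align_score(["NN", "VBZ", "DT", "NN"], ["NN", "VBZ", "NN"]))  # 4
```

A threshold on this score (not specified in the abstract) would then decide whether a candidate is labeled causal.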
Submitted 3 April, 2019;
originally announced April 2019.
-
Tailoring excitonic states of van der Waals bilayers through stacking configuration, band alignment and valley-spin
Authors:
Wei-Ting Hsu,
Bo-Han Lin,
Li-Syuan Lu,
Ming-Hao Lee,
Ming-Wen Chu,
Lain-Jong Li,
Wang Yao,
Wen-Hao Chang,
Chih-Kang Shih
Abstract:
Excitons in monolayer semiconductors have a large optical transition dipole for strong coupling with the light field. Interlayer excitons in heterobilayers, with layer separation of the electron and hole components, feature a large electric dipole that enables strong coupling with an electric field and exciton-exciton interaction, at the cost that the optical dipole is substantially quenched (by several orders of magnitude). In this letter, we demonstrate the ability to create a new class of excitons in transition metal dichalcogenide (TMD) hetero- and homo-bilayers that combines the advantages of monolayer and interlayer excitons, i.e., featuring both a large optical dipole and a large electric dipole. These excitons consist of an electron that is well confined in an individual layer and a hole that is well extended in both layers, realized here through carrier-species-specific layer hybridization controlled by the interplay of rotational, translational, band offset, and valley-spin degrees of freedom. We observe different species of such layer-hybridized valley excitons in different heterobilayer and homobilayer systems, which can be utilized for realizing strongly interacting excitonic/polaritonic gases, as well as optical quantum coherent control of bidirectional interlayer carrier transfer with either up-conversion or down-conversion in energy.
Submitted 5 March, 2019;
originally announced March 2019.
-
Understanding the Mechanism of Deep Learning Framework for Lesion Detection in Pathological Images with Breast Cancer
Authors:
Wei-Wen Hsu,
Chung-Hao Chen,
Chang Hoa,
Yu-Ling Hou,
Xiang Gao,
Yun Shao,
Xueli Zhang,
Jingjing Wang,
Tao He,
Yanghong Tai
Abstract:
Computer-aided detection (CADe) systems are developed to assist pathologists in slide assessment, increasing diagnostic efficiency and reducing missed inspections. Many studies have shown that CADe systems based on deep learning outperform those using conventional methods that rely on hand-crafted features based on field knowledge. However, most developers who adopted deep learning models focused directly on the efficacy of outcomes, without providing comprehensive explanations of why their proposed frameworks work effectively. In this study, we designed four experiments to verify the consecutive concepts, showing that the deep features learned from pathological patches are interpretable by domain knowledge of pathology and enlightening for clinical diagnosis in the task of lesion detection. The experimental results show that the activation features work as morphological descriptors for specific cells or tissues, which agrees with the clinical rules in classification. That is, the deep learning framework not only detects the distribution of tumor cells but also recognizes lymphocytes, collagen fibers, and some other non-cell structural tissues. Most of the characteristics learned by the deep learning models summarize detection rules that can be recognized by experienced pathologists, whereas some features may not be intuitive to domain experts but are discriminative in classification for machines. These features merit further study to find reasonable correlations with pathological knowledge, from which pathological experts may draw inspiration for exploring new diagnostic characteristics.
Submitted 4 March, 2019;
originally announced March 2019.
-
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Authors:
Jonathan Shen,
Patrick Nguyen,
Yonghui Wu,
Zhifeng Chen,
Mia X. Chen,
Ye Jia,
Anjuli Kannan,
Tara Sainath,
Yuan Cao,
Chung-Cheng Chiu,
Yanzhang He,
Jan Chorowski,
Smit Hinsu,
Stella Laurenzo,
James Qin,
Orhan Firat,
Wolfgang Macherey,
Suyog Gupta,
Ankur Bapna,
Shuyuan Zhang,
Ruoming Pang,
Ron J. Weiss,
Rohit Prabhavalkar,
Qiao Liang,
Benoit Jacob
, et al. (66 additional authors not shown)
Abstract:
Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used collaboratively by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
Submitted 21 February, 2019;
originally announced February 2019.
-
Thermodynamics of $f(R)$ Gravity with Disformal Transformation
Authors:
Chao-Qiang Geng,
Wei-Cheng Hsu,
Jhih-Rong Lu,
Ling-Wei Luo
Abstract:
We study thermodynamics in $f(R)$ gravity with the disformal transformation. The transformation applied to the matter Lagrangian has the form $\gamma_{\mu\nu} = A(\phi,X)g_{\mu\nu} + B(\phi,X)\partial_\mu\phi\,\partial_\nu\phi$ with the assumption of the Minkowski matter metric $\gamma_{\mu\nu} = \eta_{\mu\nu}$, where $\phi$ is the disformal scalar and $X$ is the corresponding kinetic term of $\phi$. We verify the generalized first and second laws of thermodynamics in this disformal type of $f(R)$ gravity in the Friedmann-Lemaître-Robertson-Walker (FLRW) universe. In addition, we show that the Hubble parameter contains the disformally induced terms, which define the effectively varying equations of state for matter.
Submitted 13 February, 2019;
originally announced February 2019.
-
Revised JNLPBA Corpus: A Revised Version of Biomedical NER Corpus for Relation Extraction Task
Authors:
Ming-Siang Huang,
Po-Ting Lai,
Richard Tzong-Han Tsai,
Wen-Lian Hsu
Abstract:
The advancement of biomedical named entity recognition (BNER) and biomedical relation extraction (BRE) research promotes the development of text mining in biological domains. As a cornerstone of BRE, a robust BNER system is required to identify the mentioned NEs in plain text for the subsequent relation extraction stage. However, the current BNER corpora, which play important roles in these tasks, pay little attention to meeting the criteria of the BRE task. In this study, we present the Revised JNLPBA corpus, a revision of the JNLPBA corpus, to broaden the applicability of an NER corpus from the BNER task to the BRE task. We preserve the original entity types, including protein, DNA, RNA, cell line and cell type, while all the abstracts in the JNLPBA corpus are manually re-curated by domain experts on the basis of a new annotation guideline that focuses on specific NEs instead of general terms. At the same time, several shortcomings of JNLPBA are pointed out and corrected in the new corpus. To compare the adaptability of different NER systems to the Revised JNLPBA and JNLPBA corpora, the F1-measure was computed for three open-source NER systems: BANNER, Gimli and NERSuite. Under the same conditions, all the systems perform on average 10% better on Revised JNLPBA than on JNLPBA. Moreover, a cross-validation test is carried out in which we train the NER systems on the JNLPBA/Revised JNLPBA corpora and assess their performance on both protein-protein interaction extraction (PPIE) and biomedical event extraction (BEE) corpora, confirming that the newly refined Revised JNLPBA is a competent NER corpus for biomedical relation applications. The Revised JNLPBA corpus is freely available at iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip.
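For readers unfamiliar with how an NER F1-measure is computed, here is a minimal entity-level sketch. It assumes exact matching on (start, end, type) spans; the evaluation scripts actually used with BANNER, Gimli and NERSuite may apply different matching criteria, and the function name is hypothetical.

```python
def entity_f1(gold, pred):
    """Exact-match entity-level F1 (assumed matching rule).
    gold, pred: iterables of (start, end, entity_type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                  # entities with identical span and type
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "protein"), (5, 6, "DNA")}
pred = {(0, 2, "protein"), (7, 8, "RNA")}
print(entity_f1(gold, pred))  # 0.5
```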
Submitted 29 January, 2019;
originally announced January 2019.
-
Bound states in the continuum through environmental design
Authors:
Alexander Cerjan,
Chia Wei Hsu,
Mikael C. Rechtsman
Abstract:
We propose a new paradigm for realizing bound states in the continuum (BICs) by engineering the environment of a system to control the number of available radiation channels. Using this method, we demonstrate that a photonic crystal slab embedded in a photonic crystal environment can exhibit both isolated points and lines of BICs in different regions of its Brillouin zone. Finally, we demonstrate that the intersection between a line of BICs and a line of leaky resonances can yield exceptional points connected by a bulk Fermi arc. The ability to design the environment of a system opens up a broad range of experimental possibilities for realizing BICs in three-dimensional geometries, such as in 3D-printed structures and the planar grain boundaries of self-assembled systems.
Submitted 21 January, 2019;
originally announced January 2019.
-
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge
Authors:
Spyridon Bakas,
Mauricio Reyes,
Andras Jakab,
Stefan Bauer,
Markus Rempfler,
Alessandro Crimi,
Russell Takeshi Shinohara,
Christoph Berger,
Sung Min Ha,
Martin Rozycki,
Marcel Prastawa,
Esther Alberts,
Jana Lipkova,
John Freymann,
Justin Kirby,
Michel Bilello,
Hassan Fathallah-Shaykh,
Roland Wiest,
Jan Kirschke,
Benedikt Wiestler,
Rivka Colen,
Aikaterini Kotrotsou,
Pamela Lamontagne,
Daniel Marcus,
Mikhail Milchenko
, et al. (402 additional authors not shown)
Abstract:
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.
Submitted 23 April, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Spatio-temporal correlations in multimode fibers for pulse delivery
Authors:
Wen Xiong,
Chia Wei Hsu,
Hui Cao
Abstract:
Long-range speckle correlations play an essential role in wave transport through disordered media, but have rarely been studied in other complex systems. Here we discover spatio-temporal intensity correlations for an optical pulse propagating through a multimode fiber with strong random mode coupling. Positive long-range correlations arise from multiple scattering in fiber mode space and depend on the statistical distribution of arrival times. By optimizing the incident wavefront of a pulse, we maximize the power transmitted at a selected time, and such control is significantly enhanced by the long-range spatio-temporal correlations. We provide an explicit relation between the correlations and the enhancements, which closely agrees with experimental data. Our work shows that multimode fibers provide a fertile ground for studying complex wave phenomena, and the strong spatio-temporal correlations can be employed for efficient power delivery at a well-defined time.
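One standard way to see what "maximize the power transmitted at a selected time" means in a linear model: if a time-gated transmission matrix $T_t$ maps the incident wavefront $x$ to the output field at time $t$, the unit-norm $x$ maximizing $\|T_t x\|^2$ is the top right-singular vector of $T_t$, and the maximal power is the largest squared singular value. This is a textbook sketch under that linear-model assumption; the abstract does not describe the experimental optimization procedure, and the random matrix below is purely illustrative.

```python
import numpy as np

def optimal_wavefront(T_t):
    """Unit-norm input maximizing ||T_t x||^2 for a time-gated transmission
    matrix T_t (assumed linear model): the top right-singular vector."""
    _, s, vh = np.linalg.svd(T_t)
    x = vh[0].conj()          # first right-singular vector v_0 (rows of vh are v_i^H)
    return x, s[0] ** 2       # maximal transmitted power at time t

rng = np.random.default_rng(0)
T_t = rng.standard_normal((5, 8)) + 1j * rng.standard_normal((5, 8))  # toy matrix
x, p_max = optimal_wavefront(T_t)
```

Comparing `p_max` with the power delivered by random unit-norm inputs gives the kind of enhancement factor the abstract relates to the spatio-temporal correlations.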
Submitted 6 November, 2018;
originally announced November 2018.
-
Super-Identity Convolutional Neural Network for Face Hallucination
Authors:
Kaipeng Zhang,
Zhanpeng Zhang,
Chia-Wen Cheng,
Winston H. Hsu,
Yu Qiao,
Wei Liu,
Tong Zhang
Abstract:
Face hallucination is a generative task that super-resolves low-resolution facial images, and human perception of faces relies heavily on identity information. However, previous face hallucination approaches largely ignore facial identity recovery. This paper proposes the Super-Identity Convolutional Neural Network (SICNN) to recover identity information for generating faces close to the real identity. Specifically, we define a super-identity loss to measure the identity difference between a hallucinated face and its corresponding high-resolution face within the hypersphere identity metric space. However, directly using this loss leads to a Dynamic Domain Divergence problem, which is caused by the large margin between the high-resolution domain and the hallucination domain. To overcome this challenge, we present a domain-integrated training approach that constructs a robust identity metric for faces from these two domains. Extensive experimental evaluations demonstrate that the proposed SICNN achieves superior visual quality over state-of-the-art methods on the challenging task of super-resolving 12$\times$14 faces with an 8$\times$ upscaling factor. In addition, SICNN significantly improves the recognizability of ultra-low-resolution faces.
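A distance "within the hypersphere identity metric space" can be sketched by L2-normalizing both identity embeddings onto the unit sphere and taking their squared Euclidean distance, which equals $2 - 2\cos\theta$. This is a minimal sketch assuming the embeddings come from some face recognition network; the exact loss definition and network are not given in the abstract, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere (eps guards against zero norm)."""
    return v / max(np.linalg.norm(v), eps)

def super_identity_loss(emb_hallucinated, emb_hr):
    """Squared Euclidean distance between unit-normalized identity embeddings
    of the hallucinated face and its high-resolution counterpart (sketch)."""
    a = l2_normalize(np.asarray(emb_hallucinated, dtype=float))
    b = l2_normalize(np.asarray(emb_hr, dtype=float))
    return float(np.sum((a - b) ** 2))   # = 2 - 2 * cos(angle between embeddings)
```

Normalizing first makes the loss depend only on the angle between embeddings, not on their magnitudes.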
Submitted 6 November, 2018;
originally announced November 2018.
-
Essential properties of Li/Li$^+$ graphite intercalation compounds
Authors:
Shih-Yang Lin,
Wei-Bang Li,
Ngoc Thanh Thuy Tran,
Wen-Dung Hsu,
Hsin-Yi Liu,
Ming Fa-Lin
Abstract:
The essential properties of graphite-based 3D systems are thoroughly investigated by the first-principles method. Such materials include simple hexagonal graphite, Bernal graphite, and the stage-1 to stage-4 Li/Li$^+$ graphite intercalation compounds. Detailed calculations and analyses are performed for their optimal stacking configurations, bond lengths, interlayer distances, free electron and hole densities, Fermi levels, transferred charges in chemical bonds, atom- or ion-dominated energy bands, spatial charge distributions and their significant variations after intercalation, and Li-/Li$^+$- and C-orbital-decomposed DOSs. These physical quantities are sufficient to determine the critical orbital hybridizations responsible for the unusual fundamental properties. One of the main focuses is how to dramatically alter the low-lying electronic structures by modulating the guest-atom/guest-ion concentration, e.g., through drastic changes in the Fermi level, band widths, and number of energy bands. The theoretical predictions of the stage-$n$-dependent band structures could be examined by high-resolution angle-resolved photoemission spectroscopy (ARPES). Most importantly, the low-energy DOSs near the Fermi level might provide reliable data for estimating the free carrier density due to the interlayer atomic interactions or the guest-atom/guest-ion intercalation. The van Hove singularities, which mainly arise from the critical points in energy-wave-vector space, could be directly examined by scanning tunneling spectroscopy (STS) measurements. Their features should be very useful in distinguishing the important differences among the stage-$n$ graphite intercalation compounds, and the distinct effects of atom or ion decoration.
Submitted 25 October, 2018;
originally announced October 2018.
-
Hierarchical Generative Modeling for Controllable Speech Synthesis
Authors:
Wei-Ning Hsu,
Yu Zhang,
Ron J. Weiss,
Heiga Zen,
Yonghui Wu,
Yuxuan Wang,
Yuan Cao,
Ye Jia,
Zhifeng Chen,
Jonathan Shen,
Patrick Nguyen,
Ruoming Pang
Abstract:
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and using them to synthesize clean speech with controllable speaking style.
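The two-level latent described above (a categorical attribute group, then a Gaussian conditioned on it) can be illustrated by ancestral sampling from the prior, which is exactly sampling from a GMM. This sketch covers only the prior, with diagonal covariances assumed; the encoder, decoder, and the actual parameter values are not specified here and the function name is hypothetical.

```python
import numpy as np

def sample_hierarchical_latent(pi, means, stds, rng):
    """Two-level prior: y ~ Categorical(pi) selects an attribute group, then
    z | y ~ N(means[y], diag(stds[y] ** 2)). Marginally, z follows a GMM."""
    y = int(rng.choice(len(pi), p=pi))
    z = means[y] + stds[y] * rng.standard_normal(means.shape[1])
    return y, z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])                    # e.g. clean vs. noisy group (toy values)
means = np.array([[0.0, 0.0], [5.0, 5.0]])   # per-group mean configuration
stds = np.ones((2, 2))
y, z = sample_hierarchical_latent(pi, means, stds, rng)
```

Fixing `y` and moving `z` within its Gaussian corresponds to the fine-grained control described in the abstract, while switching `y` changes the coarse attribute group.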
Submitted 27 December, 2018; v1 submitted 16 October, 2018;
originally announced October 2018.
-
Scattering concentration bounds: Brightness theorems for waves
Authors:
Hanwen Zhang,
Chia Wei Hsu,
Owen D. Miller
Abstract:
The brightness theorem (brightness is nonincreasing in passive systems) is a foundational conservation law, with applications ranging from photovoltaics to displays, yet it is restricted to the field of ray optics. For general linear wave scattering, we show that power per scattering channel generalizes brightness, and we derive power-concentration bounds for systems of arbitrary coherence. The bounds motivate a concept of "wave étendue" as a measure of incoherence among the scattering-channel amplitudes, given by the rank of an appropriate density matrix. The bounds apply to nonreciprocal systems that are of increasing interest, and we demonstrate their applicability to maximal control in nanophotonics, for metasurfaces and waveguide junctions. Through inverse design, we discover metasurface elements operating near the theoretical limits.
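As a rough numerical illustration of "the rank of an appropriate density matrix", the sketch below builds the ensemble-averaged coherence matrix $\rho = \langle a a^\dagger \rangle$ from sampled scattering-channel amplitude vectors and returns its numerical rank: a fully coherent input gives rank 1, and mutually incoherent channels raise it. Taking this particular $\rho$ (without further normalization) and the tolerance value are assumptions; the paper's precise definition may differ.

```python
import numpy as np

def wave_etendue(amplitudes, tol=1e-10):
    """Numerical rank of rho = <a a^H>, averaged over an ensemble of
    scattering-channel amplitude vectors (the rows of `amplitudes`)."""
    A = np.asarray(amplitudes, dtype=complex)
    rho = A.T @ A.conj() / A.shape[0]    # (channels x channels), Hermitian
    return int(np.linalg.matrix_rank(rho, tol=tol))

print(wave_etendue([[1, 1j, 0], [2, 2j, 0]]))  # 1: both samples share one coherent mode
print(wave_etendue(np.eye(3)))                 # 3: three mutually incoherent channels
```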
Submitted 2 September, 2019; v1 submitted 5 October, 2018;
originally announced October 2018.
-
Unsupervised Representation Learning of Speech for Dialect Identification
Authors:
Suwon Shon,
Wei-Ning Hsu,
James Glass
Abstract:
In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID). An FHVAE can learn a latent space that separates the more static attributes within an utterance from the more dynamic attributes by encoding them into two different sets of latent variables. Useful factors for dialect identification, such as phonetic or linguistic content, are encoded by a segmental latent variable, while irrelevant factors that are relatively constant within a sequence, such as channel or speaker information, are encoded by a sequential latent variable. The disentanglement property makes the segmental latent variable less susceptible to channel and speaker variation, and thus reduces degradation from channel domain mismatch. We demonstrate that on fully supervised DID tasks, an end-to-end model trained on the features extracted from the FHVAE model achieves the best performance, compared to the same model trained on conventional acoustic features and an i-vector based system. Moreover, we show that the proposed approach can leverage a large amount of unlabeled data for FHVAE training to learn domain-invariant features for DID, and significantly improve performance in a low-resource condition where labels for the in-domain data are not available.
Submitted 12 September, 2018;
originally announced September 2018.
-
Fail-Stop Group Signature Scheme
Authors:
Yi-Yuan Chiang,
Wang-Hsin Hsu,
Wen-Yen Lin,
Jonathan Jen-Rong Chen
Abstract:
In this paper, we propose a Fail-Stop Group Signature Scheme (FSGSS). FSGSS combines the features of the Group Signature and the Fail-Stop Signature to enhance the security level of the original Group Signature. If the FSGSS is attacked by a hacker armed with a supercomputer, the scheme can prove that the digital signature is indeed forged. To meet these objectives, this paper proposes three lemmas, addressing the following three questions, and proves that they are feasible. First, how does a recipient of a digitally signed document verify the authenticity of the signature? Second, when a digitally signed document is under dispute, how can the group's manager identify the original group member who signed the document, if necessary for an investigation? Third, how can we prove that the signature is indeed forged following an external attack from a supercomputer? In a future paper, we will extend this work to make the scheme even more effective, so that following an attack, the signature could be proved to be forged without the need to expose the key.
Submitted 5 September, 2018;
originally announced September 2018.
-
Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis
Authors:
Yu-An Chung,
Yuxuan Wang,
Wei-Ning Hsu,
Yu Zhang,
RJ Skerry-Ryan
Abstract:
Although end-to-end text-to-speech (TTS) models such as Tacotron have shown excellent results, they typically require a sizable set of high-quality <text, audio> pairs for training, which are expensive to collect. In this paper, we propose a semi-supervised training framework to improve the data efficiency of Tacotron. The idea is to allow Tacotron to utilize textual and acoustic knowledge contained in large, publicly available text and speech corpora. Importantly, these external data are unpaired and potentially noisy. Specifically, we first embed each word in the input text into word vectors and condition the Tacotron encoder on them. We then use an unpaired speech corpus to pre-train the Tacotron decoder in the acoustic domain. Finally, we fine-tune the model using available paired data. We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
Submitted 30 August, 2018;
originally announced August 2018.
-
Efficient Uncertainty Estimation for Semantic Segmentation in Videos
Authors:
Po-Yu Huang,
Wan-Ting Hsu,
Chun-Yueh Chiu,
Ting-Fan Wu,
Min Sun
Abstract:
Uncertainty estimation in deep learning has recently become more important. A deep learning model cannot be used in real applications if we do not know whether the model is certain about its decisions. Prior work proposes Bayesian neural networks, which can estimate uncertainty via Monte Carlo dropout (MC dropout). However, MC dropout requires $N$ forward passes of the model, which makes inference $N$ times slower. Real-time applications such as self-driving car systems need the prediction and its uncertainty as fast as possible, so MC dropout is impractical for them. In this work, we propose the region-based temporal aggregation (RTA) method, which leverages the temporal information in videos to simulate the sampling procedure. Our RTA method with a Tiramisu backbone is 10x faster than MC dropout with a Tiramisu backbone ($N=5$). Furthermore, the uncertainty estimation obtained by our RTA method is comparable to MC dropout's uncertainty estimation on pixel-level and frame-level metrics.
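To make the cost argument concrete: MC dropout averages the class probabilities from $N$ stochastic forward passes and derives an uncertainty score from that average; predictive entropy, used below, is one common choice (an assumption, since the abstract does not name the score). The `FrameBufferUncertainty` class is a hypothetical, heavily simplified take on reusing outputs from consecutive video frames as samples, so each new frame needs one forward pass instead of $N$. It assumes the pixel is already aligned across frames and is not the RTA algorithm itself, which is region-based.

```python
import math
from collections import deque


def predictive_entropy(prob_samples):
    """Entropy of the mean of N class-probability vectors for a single pixel."""
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean = [sum(s[c] for s in prob_samples) / n for c in range(num_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


class FrameBufferUncertainty:
    """Hypothetical helper: treats the last n frames' outputs for one pixel as samples.

    Simplification: ignores motion between frames, which a region-based method must handle.
    """

    def __init__(self, n):
        self.buffer = deque(maxlen=n)  # oldest frame is dropped automatically

    def update(self, probs):
        # One forward pass per new frame; earlier frames are reused as extra samples.
        self.buffer.append(probs)
        return predictive_entropy(list(self.buffer))


if __name__ == "__main__":
    # MC dropout style: N = 5 stochastic passes on the SAME frame (5x the compute).
    mc_samples = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4], [0.9, 0.1]]
    print(predictive_entropy(mc_samples))
    # Frame-buffer style: one pass per frame, samples accumulate over time.
    tracker = FrameBufferUncertainty(n=5)
    for frame_probs in mc_samples:
        print(tracker.update(frame_probs))
```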
Submitted 29 July, 2018;
originally announced July 2018.
-
Perfectly absorbing exceptional points and chiral absorbers
Authors:
William R. Sweeney,
Chia Wei Hsu,
Stefan Rotter,
A. Douglas Stone
Abstract:
We identify a new kind of physically realizable exceptional point (EP) corresponding to degenerate coherent perfect absorption, in which two purely incoming solutions of the wave operator for electromagnetic or acoustic waves coalesce to a single state. Such non-Hermitian degeneracies can occur at a real-valued frequency without any associated noise or nonlinearity, in contrast to EPs in lasers. The absorption lineshape for the eigenchannel near the EP is quartic in frequency around its maximum in any dimension. In general, for the parameters at which an operator EP occurs, the associated scattering matrix does not have an EP. However, in one dimension, when the $S$-matrix does have a perfectly absorbing EP, it takes on a universal one-parameter form with degenerate values for all scattering coefficients. For absorbing disk resonators, these EPs give rise to chiral absorption: perfect absorption for only one sense of rotation of the input wave.
Submitted 25 July, 2018; v1 submitted 23 July, 2018;
originally announced July 2018.
-
Computed Tomography Image Enhancement using 3D Convolutional Neural Network
Authors:
Meng Li,
Shiwen Shen,
Wen Gao,
William Hsu,
Jason Cong
Abstract:
Computed tomography (CT) is increasingly being used for cancer screening, such as early detection of lung cancer. However, CT studies have varying pixel spacing due to differences in acquisition parameters. Thick-slice CTs have lower resolution, which hinders tasks such as nodule characterization during computer-aided detection because of the partial volume effect. In this study, we propose a novel 3D enhancement convolutional neural network (3DECNN) to improve the spatial resolution of CT studies acquired at lower resolution (thicker slices). Using a subset of the LIDC dataset consisting of 20,672 CT slices from 100 scans, we simulated lower-resolution/thick-section scans and then attempted to reconstruct the original images using our 3DECNN network. Compared to other state-of-the-art deep learning methods, we observe a significant improvement in PSNR (29.3087 dB vs. 28.8769 dB, p-value < 2.2e-16) and SSIM (0.8529 vs. 0.8449, p-value < 2.2e-16).
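For reference, PSNR is computed as 10·log10(MAX² / MSE) in decibels, whereas SSIM is a unitless similarity index. The sketch below is a plain-Python version over flattened intensity lists; the choice of `max_value` (e.g. 255 for 8-bit images, or whatever intensity range the CT data were normalized to) is an assumption, as the abstract does not state the normalization used.

```python
import math


def psnr(reference, estimate, max_value):
    """Peak signal-to-noise ratio (dB) between two equal-length intensity sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [10, 20, 30, 40]          # hypothetical flattened pixel values
    reconstructed = [11, 19, 30, 42]
    print(psnr(original, reconstructed, max_value=255))
```

Because PSNR is logarithmic in MSE, the reported 0.4318 dB gap (29.3087 vs. 28.8769 dB), assuming the same MAX for both methods, corresponds to an MSE ratio of about 10^0.04318 ≈ 1.105, i.e. roughly 9.5% lower MSE.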
Submitted 18 July, 2018;
originally announced July 2018.
-
Statistical Description of Transport in Multimode Fibers with Mode-Dependent Loss
Authors:
P. Chiarawongse,
H. Li,
W. Xiong,
C. W. Hsu,
H. Cao,
T. Kottos
Abstract:
We analyze coherent wave transport in a new physical setting associated with multimode wave systems where reflection is completely suppressed and mode-dependent losses together with mode mixing dictate the wave propagation. An additional physical constraint is that, in realistic circumstances, access to the scattering (or transmission) matrix is incomplete. We have addressed all these challenges by providing a statistical description of wave transport that fuses a free probability theory approach with a Filtered Random Matrix ensemble. Our theoretical predictions have been tested successfully against experimental data of light transport in multimode fibers.
Submitted 28 June, 2018;
originally announced June 2018.
-
Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition
Authors:
Wei-Ning Hsu,
Hao Tang,
James Glass
Abstract:
The current trend in automatic speech recognition is to leverage large amounts of labeled data to train supervised neural network models. Unfortunately, obtaining data for a wide range of domains to train robust models can be costly. However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to. In this paper, we propose a novel unsupervised adaptation method that learns to synthesize labeled data for the target domain from unlabeled in-domain data and labeled out-of-domain data. We first learn, without supervision, an interpretable latent representation of speech that encodes linguistic and nuisance factors (e.g., speaker and channel) using different latent variables. To transform a labeled out-of-domain utterance without altering its transcript, we transform the latent nuisance variables while maintaining the linguistic variables. To demonstrate our approach, we focus on a channel mismatch setting, where the domain of interest is distant conversational speech, and labels are only available for close-talking speech. Our proposed method is evaluated on the AMI dataset, outperforming all baselines and bridging the gap between unadapted and in-domain models by over 77% without using any parallel data.
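One common way to read "bridging the gap by X%" is as the fraction of the error gap between an unadapted model and an in-domain model that the adapted model closes. This definition is an assumption (the abstract does not spell it out), and the numbers in the usage example are hypothetical, not results from the paper.

```python
def gap_bridged(unadapted_wer, adapted_wer, in_domain_wer):
    """Fraction of the unadapted -> in-domain error gap closed by the adapted model."""
    gap = unadapted_wer - in_domain_wer
    if gap == 0:
        raise ValueError("unadapted and in-domain models have the same error rate")
    return (unadapted_wer - adapted_wer) / gap


if __name__ == "__main__":
    # Hypothetical WERs (%): unadapted 70.0, adapted 46.0, in-domain 40.0.
    print(gap_bridged(70.0, 46.0, 40.0))  # 24 / 30 of the gap closed
```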
Submitted 13 June, 2018;
originally announced June 2018.
-
A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition
Authors:
Hao Tang,
Wei-Ning Hsu,
Francois Grondin,
James Glass
Abstract:
Speech recognizers trained on close-talking speech do not generalize to distant speech, and the word error rate degradation can be as large as 40% absolute. Most studies tackle distant speech recognition as a separate problem, with little effort devoted to adapting close-talking speech recognizers to distant speech. In this work, we review several approaches from a domain adaptation perspective. These approaches, including speech enhancement, multi-condition training, data augmentation, and autoencoders, all involve a transformation of the data between domains. We conduct experiments on the AMI data set, where these approaches can be realized under the same controlled setting. These approaches lead to different amounts of improvement under their respective assumptions. The purpose of this paper is to quantify and characterize the performance gap between the two domains, setting up the basis for studying adaptation of speech recognizers from close-talking speech to distant speech. Our results also have implications for improving distant speech recognition.
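For context on the "40% absolute" figure: word error rate (WER) is the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and its reference transcript, divided by the number of reference words, so an absolute degradation of 40% means, for example, going from 20% to 60% WER (hypothetical numbers). A minimal sketch using whitespace tokenization, which is a simplifying assumption:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # One row of the edit-distance table: distance from the current reference
    # prefix to every hypothesis prefix (starts with the empty reference prefix).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of reference word r
                cur[j - 1] + 1,          # insertion of hypothesis word h
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions.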
Submitted 13 June, 2018;
originally announced June 2018.
-
Transverse localization of transmission eigenchannels
Authors:
Hasan Yılmaz,
Chia Wei Hsu,
Alexey Yamilov,
Hui Cao
Abstract:
Transmission eigenchannels are building blocks of coherent wave transport in diffusive media, and selective excitation of individual eigenchannels can lead to diverse transport behavior. An essential yet poorly understood property is the transverse spatial profile of each eigenchannel, which is critical for coupling into and out of it. Here, we discover that the transmission eigenchannels of a disordered slab possess localized incident and outgoing profiles, even in the diffusive regime far from Anderson localization. Such transverse localization arises from a combination of reciprocity, local coupling of spatial modes, and nonlocal correlations of scattered waves. Experimentally, we observe signatures of such localization despite finite illumination area. Our results reveal the intrinsic characteristics of transmission eigenchannels in the open slab geometry, commonly used for applications in imaging and energy transfer through turbid media.
Submitted 7 June, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.
-
Machine Learning for Predictive Analytics of Compute Cluster Jobs
Authors:
Dan Andresen,
William Hsu,
Huichen Yang,
Adedolapo Okanlawon
Abstract:
We address the problem of predicting whether sufficient memory and CPU resources have been requested for jobs at submission time. For this purpose, we examine the task of training a supervised machine learning system to predict the outcome (whether the job will fail specifically due to insufficient resources) as a classification task. Sufficiently high accuracy, precision, and recall at this task facilitate more anticipatory decision support applications in the domain of HPC resource allocation. Our preliminary results using a new test bed show that the probability of job failure is associated with information freely available at job submission time and may thus be usable by a learning system for user modeling that gives personalized feedback to users.
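The three metrics named above can be computed from a binary confusion matrix. In this sketch the positive class (label 1) is assumed to mean "the job fails due to insufficient resources", and the labels in the usage example are hypothetical; neither the encoding nor the data come from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = fails from insufficient resources)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Convention used here: an undefined ratio (no predicted or no actual positives) is 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


if __name__ == "__main__":
    y_true = [1, 0, 0, 0]  # hypothetical outcomes for four submitted jobs
    y_pred = [1, 1, 0, 0]
    print(classification_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters when failures are rare: a predictor that always answers "will not fail" can reach high accuracy while having zero recall.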
Submitted 19 May, 2018;
originally announced June 2018.
-
An Interpretable Deep Hierarchical Semantic Convolutional Neural Network for Lung Nodule Malignancy Classification
Authors:
Shiwen Shen,
Simon X. Han,
Denise R. Aberle,
Alex A. T. Bui,
William Hsu
Abstract:
While deep learning methods are increasingly being applied to tasks such as computer-aided diagnosis, these models are difficult to interpret, do not incorporate prior domain knowledge, and are often considered "black boxes." The lack of model interpretability hinders them from being fully understood by target users such as radiologists. In this paper, we present a novel interpretable deep hierarchical semantic convolutional neural network (HSCNN) to predict whether a given pulmonary nodule observed on a computed tomography (CT) scan is malignant. Our network provides two levels of output: 1) low-level radiologist semantic features, and 2) a high-level malignancy prediction score. The low-level semantic outputs quantify the diagnostic features used by radiologists and serve to explain how the model interprets the images in an expert-driven manner. The information from these low-level tasks, along with the representations learned by the convolutional layers, is then combined and used to infer the high-level task of predicting nodule malignancy. This unified architecture is trained by optimizing a global loss function including both low- and high-level tasks, thereby learning all the parameters within a joint framework. Our experimental results using the Lung Image Database Consortium (LIDC) show that the proposed method not only produces interpretable lung cancer predictions but also achieves significantly better results compared to common 3D CNN approaches.
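A global loss over low- and high-level tasks can be sketched as a weighted sum of per-task losses. This is an illustration under stated assumptions, not the paper's loss: each radiologist semantic feature is treated as a binary label scored with binary cross-entropy, and the weights `lambdas` and `alpha` are hypothetical hyperparameters.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """BCE for one predicted probability p and binary label y; p is clamped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def global_loss(semantic_probs, semantic_labels, malignancy_prob, malignancy_label,
                lambdas, alpha=1.0):
    """Weighted low-level (semantic feature) losses plus the high-level malignancy loss."""
    low = sum(lam * binary_cross_entropy(p, y)
              for lam, p, y in zip(lambdas, semantic_probs, semantic_labels))
    high = binary_cross_entropy(malignancy_prob, malignancy_label)
    return low + alpha * high


if __name__ == "__main__":
    # Hypothetical predictions for three semantic features plus the malignancy score.
    print(global_loss([0.8, 0.3, 0.6], [1, 0, 1], 0.7, 1, lambdas=[1.0, 1.0, 1.0]))
```

Minimizing a single sum like this lets gradients from both the semantic and the malignancy terms update shared layers, which is the point of training the tasks jointly.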
Submitted 2 June, 2018;
originally announced June 2018.
-
Quantum Noise Theory of Exceptional Point Sensors
Authors:
Mengzhen Zhang,
William Sweeney,
Chia Wei Hsu,
Lan Yang,
A. D. Stone,
Liang Jiang
Abstract:
Distinct from closed quantum systems, non-Hermitian systems can have exceptional points (EPs) where both eigenvalues and eigenvectors coalesce. Recently, it has been proposed and demonstrated that EPs can enhance the performance of sensors by amplifying the detected signal. However, noise might also be amplified at EPs, and it is not obvious whether exceptional points still improve sensor performance when both signal and noise are amplified. We develop a quantum noise theory to systematically calculate the signal and noise associated with EP sensors. We then compute the quantum Fisher information to extract a lower bound on the sensitivity of EP sensors. Finally, we explicitly construct an EP sensing scheme based on heterodyne detection to achieve the same scaling of the ultimate sensitivity with enhanced performance. Our results can be generalized to higher-order EPs for any bosonic non-Hermitian system with linear interactions.
Submitted 25 January, 2019; v1 submitted 30 May, 2018;
originally announced May 2018.
-
Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data
Authors:
Wei-Ning Hsu,
James Glass
Abstract:
Multimodal sensory data resembles the form of information perceived by humans for learning, and is easy to obtain in large quantities. Compared to unimodal data, synchronization of concepts between modalities in such data provides supervision for disentangling the underlying explanatory factors of each modality. Previous work leveraging multimodal data has mainly focused on retaining only the modality-invariant factors while discarding the rest. In this paper, we present a partitioned variational autoencoder (PVAE) and several training objectives to learn disentangled representations, which encode not only the shared factors, but also modality-dependent ones, into separate latent variables. Specifically, PVAE integrates a variational inference framework and a multimodal generative model that partitions the explanatory factors and conditions only on the relevant subset of them for generation. We evaluate our model on two parallel speech/image datasets, and demonstrate its ability to learn disentangled representations by qualitatively exploring within-modality and cross-modality conditional generation with semantics and styles specified by examples. For quantitative analysis, we evaluate the classification accuracy of automatically discovered semantic units. Our PVAE can achieve over 99% accuracy on both modalities.
Submitted 29 May, 2018;
originally announced May 2018.