Decomposable Sparse Tensor on Tensor Regression

Abstract: Most regularized tensor regression research focuses on tensor predictors with scalar responses or vector predictors with tensor responses. We consider sparse low-rank tensor-on-tensor regression, where predictors $\mathcal{X}$ and responses $\mathcal{Y}$ are both high-dimensional tensors. We show that the general inner product or the contracted product on a unit-rank tensor can be decomposed into standard inner products and outer products, so the problem can be transformed into a tensor-to-scalar regression followed by a tensor decomposition. We therefore propose a fast solution based on a stagewise search composed of a contraction part and a generation part, which are optimized alternately. We demonstrate that our method can outperform current methods in accuracy and predictor selection by effectively incorporating the structural information.

Submitted 14 December, 2022; v1 submitted 9 December, 2022; originally announced December 2022.

arXiv:2208.08056 [pdf, other]

doi 10.13140/RG.2.2.21905.92008

Sampling Through the Lens of Sequential Decision Making

Authors: Jason Xiaotian Dou, Alvin Qingkai Pan, Runxue Bao, Haiyi Harry Mao, Lei Luo, Zhi-Hong Mao

Abstract: Sampling is ubiquitous in machine learning methodologies. Due to the growth of large datasets and model complexity, we want to learn and adapt the sampling process while training a representation. Towards achieving this grand goal, a variety of sampling techniques have been proposed. However, most of them either use a fixed sampling scheme or adjust the sampling scheme based on simple heuristics. They cannot choose the best sample for model training in different stages. Inspired by "Thinking, Fast and Slow" (System 1 and System 2) in cognitive science, we propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR) to tackle this challenge. To the best of our knowledge, this is the first work utilizing reinforcement learning (RL) to address the sampling problem in representation learning. Our approach optimally adjusts the sampling process to achieve optimal performance. We explore geographical relationships among samples by distance-based sampling to maximize overall cumulative reward. We apply ASR to the long-standing sampling problems in similarity-based loss functions. Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets. We also discuss an engrossing phenomenon, which we name the "ASR gravity well", in experiments.

Submitted 13 December, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

arXiv:2207.14483 [pdf, other]

QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers

Authors: Lei Liu, Xinglei Dou

Abstract: Qubit mapping is essential to quantum computing's fidelity and quantum computers' resource utilization. Yet, the existing qubit mapping schemes meet some challenges (e.g., crosstalk, SWAP overheads, diverse device topologies, etc.), leading to qubit resource under-utilization, high error rate, and low fidelity in computing results. This paper presents QuCloud+, a new qubit mapping scheme capable of handling these challenges. QuCloud+ has several new designs. (1) QuCloud+ enables multi-programming quantum computing on quantum chips with 2D/3D topology. (2) It partitions physical qubits for concurrent quantum programs with the crosstalk-aware community detection technique and further allocates qubits according to qubit degree, improving fidelity and resource utilization. (3) QuCloud+ includes an X-SWAP mechanism that avoids SWAPs with high crosstalk errors and enables inter-program SWAPs to reduce the SWAP overheads. (4) QuCloud+ schedules concurrent quantum programs to be mapped and executed based on estimated fidelity for the best practice. QuCloud+ outperforms the previous multi-programming work on various devices by 6.84% on fidelity and saves 40.9% additional gates required during mapping transition.

Submitted 29 July, 2022; originally announced July 2022.

Comments: arXiv admin note: text overlap with arXiv:2004.12854

arXiv:2207.07734 [pdf, other]

COEM: Cross-Modal Embedding for MetaCell Identification

Authors: Haiyi Mao, Minxue Jia, Jason Xiaotian Dou, Haotian Zhang, Panayiotis V. Benos

Abstract: Metacells are disjoint and homogeneous groups of single-cell profiles, representing discrete and highly granular cell states. Existing metacell algorithms tend to use only one modality to infer metacells, even though single-cell multi-omics datasets profile multiple molecular modalities within the same cell. Here, we present Cross-MOdal Embedding for MetaCell Identification (COEM), which utilizes an embedded space leveraging the information of both scATAC-seq and scRNA-seq to perform aggregation, balancing the trade-off between fine resolution and sufficient sequencing coverage. COEM outperforms the state-of-the-art method SEACells by efficiently identifying accurate and well-separated metacells across datasets with continuous and discrete cell types. Furthermore, COEM significantly improves peak-to-gene association analyses, and facilitates complex gene regulatory inference tasks.

Submitted 24 July, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: 5 pages, 2 figures, ICML workshop on computational biology

arXiv:2011.09013 [pdf]

doi 10.1016/j.adapen.2021.100017

Estimates of daily ground-level NO2 concentrations in China based on big data and machine learning approaches

Authors: Xinyu Dou, Cuijuan Liao, Hengqi Wang, Ying Huang, Ying Tu, Xiaomeng Huang, Yiran Peng, Biqing Zhu, Jianguang Tan, Zhu Deng, Nana Wu, Taochun Sun, Piyu Ke, Zhu Liu

Abstract: Nitrogen dioxide (NO2) is one of the most important atmospheric pollutants. However, current ground-level NO2 concentration data lack either high-resolution coverage or full nationwide coverage, due to the poor quality of source data and the computing power of the models. To our knowledge, this study is the first to estimate the ground-level NO2 concentration in China with national coverage as well as relatively high spatiotemporal resolution (0.25 degree; daily intervals) over the most recent 6 years (2013-2018). We propose a Random Forest model integrated with K-means (RF-K) for the estimates with multi-source parameters. Besides meteorological parameters and satellite retrieval parameters, we also, for the first time, introduce socio-economic parameters to assess the impact of human activities. The results show that: (1) the RF-K model we developed shows better prediction performance than other models, with cross-validation R2 = 0.64 (MAPE = 34.78%). (2) The annual average concentration of NO2 in China showed a weak increasing trend, while in economic zones such as the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Pearl River Delta, the NO2 concentration decreased or remained unchanged, especially in spring. Our dataset has verified that pollutant control targets have been achieved in these areas. By mapping daily nationwide ground-level NO2 concentrations, this study provides timely, high-quality data for air quality management in China. We provide a universal model framework to quickly generate a timely national atmospheric pollutant concentration map with a high spatial-temporal resolution, based on improved machine learning methods.

Submitted 17 November, 2020; originally announced November 2020.

arXiv:2004.12854 [pdf]

A New Qubits Mapping Mechanism for Multi-programming Quantum Computing

Authors: Lei Liu, Xinglei Dou

Abstract: For a specific quantum chip, multi-programming helps to improve overall throughput and resource utilization. However, the previous solutions for mapping multiple programs onto a quantum chip often lead to resource under-utilization, high error rate and low fidelity. In this paper, we propose a new approach to map concurrent quantum programs. Our approach has three critical components. The first one is the Community Detection Assisted Partition (CDAP) algorithm, which partitions physical qubits for concurrent quantum programs by considering both physical topology and the error rates, avoiding the waste of robust resources. The second one is the X-SWAP scheme that enables inter-program SWAP operations to reduce the SWAP overheads. Finally, we propose a compilation task scheduling framework, which dynamically selects concurrent quantum programs to be executed based on estimated fidelity, increasing the throughput of the quantum computer. We evaluate our work on the publicly available quantum computer IBMQ16 and a simulated quantum chip, IBMQ20. Our work outperforms the previous solution on multi-programming in both fidelity and SWAP overheads by 12.0% and 11.1%, respectively.

Submitted 27 April, 2020; originally announced April 2020.

arXiv:1909.03433 [pdf, other]

Distributionally Robust Optimization with Correlated Data from Vector Autoregressive Processes

Authors: Xialiang Dou, Mihai Anitescu

Abstract: We present a distributionally robust formulation of a stochastic optimization problem for non-i.i.d. vector autoregressive data. We use the Wasserstein distance to define robustness in the space of distributions and we show, using duality theory, that the problem is equivalent to a finite convex-concave saddle point problem. The performance of the method is demonstrated on both synthetic and real data.

Submitted 8 September, 2019; originally announced September 2019.

arXiv:1905.07097 [pdf]

The Discussion on Shannon channel capacity formula from the viewpoint of signal uncertainty and Research on the Technique of Breaking through the Shannon Limit

Authors: Dequn Liang, Xinyu Dou

Abstract: This paper first briefly states the Shannon channel capacity formula and analyzes its relationship to the signal uncertainty principle, in preparation for deriving a formula that is able to break through the Shannon channel capacity. Then, as a practical example of breaking the Shannon limit, the time-shift non-orthogonal multicarrier modulation technology is introduced. After more than twenty years of development, this technique has proved to be a practical modulation technique for digital communication.

Submitted 26 April, 2020; v1 submitted 16 May, 2019; originally announced May 2019.

arXiv:1901.07114 [pdf, other]

doi 10.1080/01621459.2020.1745812

Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Authors: Xialiang Dou, Tengyuan Liang

Abstract: Consider the problem: given the data pair $(\mathbf{x}, \mathbf{y})$ drawn from a population with $f_*(x) = \mathbf{E}[\mathbf{y} | \mathbf{x} = x]$, specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does $f_t$, the function computed by the neural network at time $t$, relate to $f_*$, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for $f_*$ lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.

Submitted 23 July, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

Comments: 38 pages, 5 figures

Journal ref: Journal of the American Statistical Association (2020)

arXiv:1812.11534 [pdf, ps, other]

A New Deflation Method For Verifying the Isolated Singular Zeros of Polynomial Systems

Authors: Jin-San Cheng, Xiaojie Dou, Junyi Wen

Abstract: In this paper, we develop a new deflation technique for refining or verifying the isolated singular zeros of polynomial systems. Starting from a polynomial system with an isolated singular zero, by computing the derivatives of the input polynomials directly or the linear combinations of the related polynomials, we construct a new system, which can be used to refine or verify the isolated singular zero of the input system. In order to preserve the accuracy in numerical computation as much as possible, new variables are introduced to represent the coefficients of the linear combinations of the related polynomials. To our knowledge, this is the first time the deflation problem of polynomial systems has been considered from the perspective of linear combinations. Some acceleration strategies are proposed to reduce the scale of the final system. We also give some further analysis of the tolerances we use, which helps in better understanding our method. The experiments show that our method is effective and efficient. In particular, it works well for zeros with high multiplicities of large systems. It also works for isolated singular zeros of non-polynomial systems.

Submitted 30 December, 2018; originally announced December 2018.

arXiv:1710.00273 [pdf]

What Words Do We Use to Lie?: Word Choice in Deceptive Messages

Authors: Jason Xiaotian Dou, Michelle Liu, Haaris Muneer, Adam Schlussel

Abstract: Text messaging is the most widely used form of computer-mediated communication (CMC). Previous findings have shown that linguistic factors can reliably indicate messages as deceptive. For example, users take longer and use more words to craft deceptive messages than they do truthful messages. Existing research has also examined how factors, such as student status and gender, affect rates of deception and word choice in deceptive messages. However, this research has been limited by small sample sizes and has returned contradicting findings. This paper aims to address these issues by using a dataset of text messages collected from a large and varied set of participants using an Android messaging application. The results of this paper show significant differences in word choice and frequency of deceptive messages between male and female participants, as well as between students and non-students.

Submitted 1 August, 2022; v1 submitted 30 September, 2017; originally announced October 2017.

arXiv:1705.10450 [pdf]

RSI-CB: A Large Scale Remote Sensing Image Classification Benchmark via Crowdsource Data

Authors: Haifeng Li, Xin Dou, Chao Tao, Zhixiang Hou, Jie Chen, Jian Peng, Min Deng, Ling Zhao

Abstract: In recent years, deep convolutional neural networks (DCNNs) have seen breakthrough progress in natural image recognition because of three points: universal approximation ability via DCNN, large-scale database (such as ImageNet), and supercomputing ability powered by GPU. The remote sensing field is still lacking a large-scale benchmark compared to ImageNet and Place2. In this paper, we propose a remote sensing image classification benchmark (RSI-CB) based on massive, scalable, and diverse crowdsource data. Using crowdsource data, such as Open Street Map (OSM) data, ground objects in remote sensing images can be annotated effectively by points of interest, vector data from OSM, or other crowdsource data. The annotated images can be used in remote sensing image classification tasks. Based on this method, we construct a worldwide large-scale benchmark for remote sensing image classification. This benchmark has two sub-datasets with 256 by 256 and 128 by 128 sizes because different DCNNs require different image sizes. The former contains 6 categories with 35 subclasses of more than 24,000 images. The latter contains 6 categories with 45 subclasses of more than 36,000 images. This classification system of ground objects is defined according to the national standard of land-use classification in China and is inspired by the hierarchy mechanism of ImageNet. Finally, we conduct many experiments to compare RSI-CB with the SAT-4, SAT-6, and UC-Merced datasets on handcrafted features, such as scale-invariant feature transform, color histogram, local binary patterns, and GIST, and classical DCNN models, such as AlexNet, VGGNet, GoogLeNet, and ResNet.

Submitted 10 January, 2020; v1 submitted 29 May, 2017; originally announced May 2017.

Comments: 41 pages, 19 figures, 7 tables

arXiv:1510.03247

Impartial Redistricting: A Markov Chain Approach

Authors: Lucy Chenyun Wu, Jason Xiaotian Dou, Danny Sleator, Alan Frieze, David Miller

Abstract: The gerrymandering problem is a worldwide problem which poses a great threat to democracy and justice in district-based elections. Thanks to partisan redistricting commissions, district boundaries are often manipulated to benefit incumbents. Since an independent commission is hard to come by, the possibility of impartially generating districts with a computer is explored in this thesis. We have developed an algorithm to randomly produce legal redistricting schemes for Pennsylvania.

Submitted 13 October, 2015; v1 submitted 12 October, 2015; originally announced October 2015.

Comments: about authorship naming problem, will fix soon

Showing 1–13 of 13 results for author: Dou, X