-
Simple Transferability Estimation for Regression Tasks
Authors:
Cuong N. Nguyen,
Phong Tran,
Lam Si Tung Ho,
Vu Dinh,
Anh T. Tran,
Tal Hassner,
Cuong V. Nguyen
Abstract:
We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
Submitted 3 December, 2023; v1 submitted 1 December, 2023;
originally announced December 2023.
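A minimal sketch of the idea named in the abstract above, a score equal to the negative regularized mean squared error of a linear regression model. This is an illustration under stated assumptions, not the authors' implementation: the ridge-style L2 penalty, the function names `solve` and `neg_regularized_mse`, and the parameter `lam` are all hypothetical, and `X` stands in for features extracted from a source model on the target data.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        # Move the row with the largest pivot candidate into position c.
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def neg_regularized_mse(X, y, lam=1.0):
    """Return -(MSE + lam * ||w||^2) at the ridge minimizer w.

    Assumption: the regularizer is an L2 penalty on the weights, so w
    solves the normal equations (X^T X / n + lam * I) w = X^T y / n.
    Requires lam > 0 so the system is always solvable.
    """
    n, d = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) / n + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(X[i][a] * y[i] for i in range(n)) / n for a in range(d)]
    w = solve(A, rhs)
    mse = sum((y[i] - sum(w[a] * X[i][a] for a in range(d))) ** 2
              for i in range(n)) / n
    return -(mse + lam * sum(v * v for v in w))
```

Under this reading, a higher (less negative) score means the features fit the target labels better with a linear model; candidate source models could then be ranked by their scores.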
-
Searching for Minimal Optimal Neural Networks
Authors:
Lam Si Tung Ho,
Vu Dinh
Abstract:
Large neural network models have high predictive power but may suffer from overfitting if the training set is not large enough. Therefore, it is desirable to select an appropriate size for neural networks. The destructive approach, which starts with a large architecture and then reduces the size using a Lasso-type penalty, has been used extensively for this task. Despite its popularity, there is no theoretical guarantee for this technique. Based on the notion of minimal neural networks, we posit a rigorous mathematical framework for studying the asymptotic theory of the destructive technique. We prove that Adaptive group Lasso is consistent and can reconstruct the correct number of hidden nodes of one-hidden-layer feedforward networks with high probability. To the best of our knowledge, this is the first theoretical result established for the destructive technique.
Submitted 27 September, 2021;
originally announced September 2021.
-
Consistent Feature Selection for Analytic Deep Neural Networks
Authors:
Vu Dinh,
Lam Si Tung Ho
Abstract:
One of the most important steps toward interpretability and explainability of neural network models is feature selection, which aims to identify the subset of relevant features. Theoretical results in the field have mostly focused on the prediction aspect of the problem with virtually no work on feature selection consistency for deep neural networks due to the model's severe nonlinearity and unidentifiability. This lack of theoretical foundation casts doubt on the applicability of deep learning to contexts where correct interpretations of the features play a central role.
In this work, we investigate the problem of feature selection for analytic deep networks. We prove that for a wide class of networks, including deep feed-forward neural networks, convolutional neural networks, and a major sub-class of residual neural networks, the Adaptive Group Lasso selection procedure with Group Lasso as the base estimator is selection-consistent. The work provides further evidence that Group Lasso might be inefficient for feature selection with neural networks and advocates the use of Adaptive Group Lasso over the popular Group Lasso.
Submitted 15 October, 2020;
originally announced October 2020.
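The Adaptive Group Lasso procedure mentioned in the abstract above can be sketched generically as follows. This is a textbook-style illustration, not the paper's code: each group is assumed to hold the first-layer weights attached to one input feature, `initial_groups` is assumed to come from a base Group Lasso fit, and the names `adaptive_weights`, `penalty`, and `prox_step` as well as the parameters `gamma`, `lam`, and `step` are hypothetical.

```python
import math


def group_norm(g):
    """Euclidean norm of one weight group."""
    return math.sqrt(sum(v * v for v in g))


def adaptive_weights(initial_groups, gamma=2.0, eps=1e-12):
    """Per-group penalty weights 1 / ||w_hat_k||^gamma.

    Groups that the base estimate shrank close to zero receive a very
    large weight, so they are penalized more heavily in the second stage.
    """
    return [1.0 / max(group_norm(g), eps) ** gamma for g in initial_groups]


def penalty(groups, weights, lam):
    """Adaptive group lasso penalty: lam * sum_k weight_k * ||w_k||."""
    return lam * sum(w * group_norm(g) for g, w in zip(groups, weights))


def prox_step(groups, weights, lam, step):
    """Group soft-thresholding, the proximal operator of the penalty.

    Each group is scaled by max(0, 1 - step * lam * weight_k / ||w_k||),
    which sets whole groups (features) exactly to zero.
    """
    out = []
    for g, w in zip(groups, weights):
        nrm = group_norm(g)
        scale = max(0.0, 1.0 - step * lam * w / nrm) if nrm > 0 else 0.0
        out.append([scale * v for v in g])
    return out
```

In a proximal-gradient training loop, `prox_step` would be applied to the grouped weights after each gradient step on the data-fitting loss; features whose groups end at exactly zero are the ones dropped.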
-
Consistent feature selection for neural networks via Adaptive Group Lasso
Authors:
Vu Dinh,
Lam Si Tung Ho
Abstract:
One main obstacle for the wide use of deep learning in medical and engineering sciences is its interpretability. While neural network models are strong tools for making predictions, they often provide little information about which features play significant roles in influencing the prediction accuracy. To overcome this issue, many regularization procedures for learning with neural networks have been proposed for dropping non-significant features. Unfortunately, the lack of theoretical results casts doubt on the applicability of such pipelines. In this work, we propose and establish a theoretical guarantee for the use of the adaptive group lasso for selecting important features of neural networks. Specifically, we show that our feature selection method is consistent for single-output feed-forward neural networks with one hidden layer and hyperbolic tangent activation function. We demonstrate its applicability using both simulation and data analysis.
Submitted 2 December, 2021; v1 submitted 30 May, 2020;
originally announced June 2020.
-
Bayesian Active Learning With Abstention Feedbacks
Authors:
Cuong V. Nguyen,
Lam Si Tung Ho,
Huan Xu,
Vu Dinh,
Binh Nguyen
Abstract:
We study pool-based active learning with abstention feedbacks where a labeler can abstain from labeling a queried example with some unknown abstention rate. This is an important problem with many useful applications. We take a Bayesian approach to the problem and develop two new greedy algorithms that learn both the classification problem and the unknown abstention rate at the same time. These are achieved by simply incorporating the estimated average abstention rate into the greedy criteria. We prove that both algorithms have near-optimality guarantees: they respectively achieve a ${(1-\frac{1}{e})}$ constant factor approximation of the optimal expected or worst-case value of a useful utility function. Our experiments show the algorithms perform well in various practical scenarios.
Submitted 30 December, 2020; v1 submitted 4 June, 2019;
originally announced June 2019.
-
Non-bifurcating phylogenetic tree inference via the adaptive LASSO
Authors:
Cheng Zhang,
Vu Dinh,
Frederick A. Matsen IV
Abstract:
Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including "sampled ancestors" in which we sequence a genotype along with its direct descendants, and "polytomies" in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this paper, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators to phylogenetics, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics.
Submitted 1 June, 2020; v1 submitted 28 May, 2018;
originally announced May 2018.
-
Bayesian Pool-based Active Learning With Abstention Feedbacks
Authors:
Cuong V. Nguyen,
Lam Si Tung Ho,
Huan Xu,
Vu Dinh,
Binh Nguyen
Abstract:
We study pool-based active learning with abstention feedbacks, where a labeler can abstain from labeling a queried example with some unknown abstention rate. This is an important problem with many useful applications. We take a Bayesian approach to the problem and develop two new greedy algorithms that learn both the classification problem and the unknown abstention rate at the same time. These are achieved by simply incorporating the estimated abstention rate into the greedy criteria. We prove that both of our algorithms have near-optimality guarantees: they respectively achieve a ${(1-\frac{1}{e})}$ constant factor approximation of the optimal expected or worst-case value of a useful utility function. Our experiments show the algorithms perform well in various practical scenarios.
Submitted 2 January, 2021; v1 submitted 23 May, 2017;
originally announced May 2017.
-
Fast learning rates with heavy-tailed losses
Authors:
Vu Dinh,
Lam Si Tung Ho,
Duy Nguyen,
Binh T. Nguyen
Abstract:
We study fast learning rates when the losses are not necessarily bounded and may have a distribution with heavy tails. To enable such analyses, we introduce two new conditions: (i) the envelope function $\sup_{f \in \mathcal{F}}|\ell \circ f|$, where $\ell$ is the loss function and $\mathcal{F}$ is the hypothesis class, exists and is $L^r$-integrable, and (ii) $\ell$ satisfies the multi-scale Bernstein's condition on $\mathcal{F}$. Under these assumptions, we prove that learning rates faster than $O(n^{-1/2})$ can be obtained and, depending on $r$ and the multi-scale Bernstein's powers, can be arbitrarily close to $O(n^{-1})$. We then verify these assumptions and derive fast learning rates for the problem of vector quantization by $k$-means clustering with heavy-tailed distributions. The analyses enable us to obtain novel learning rates that extend and complement existing results in the literature from both theoretical and practical viewpoints.
Submitted 29 September, 2016;
originally announced September 2016.
-
Learning From Non-iid Data: Fast Rates for the One-vs-All Multiclass Plug-in Classifiers
Authors:
Vu Dinh,
Lam Si Tung Ho,
Nguyen Viet Cuong,
Duy Nguyen,
Binh T. Nguyen
Abstract:
We prove new fast learning rates for the one-vs-all multiclass plug-in classifiers trained either from exponentially strongly mixing data or from data generated by a converging drifting distribution. These are two typical scenarios where training data are not iid. The learning rates are obtained under a multiclass version of Tsybakov's margin assumption, a type of low-noise assumption, and do not depend on the number of classes. Our results are general and include a previous result for binary-class plug-in classifiers with iid data as a special case. In contrast to previous works for least squares SVMs under the binary-class setting, our results retain the optimal learning rate in the iid case.
Submitted 24 January, 2015; v1 submitted 12 August, 2014;
originally announced August 2014.
-
Generalization and Robustness of Batched Weighted Average Algorithm with V-geometrically Ergodic Markov Data
Authors:
Nguyen Viet Cuong,
Lam Si Tung Ho,
Vu Dinh
Abstract:
We analyze the generalization and robustness of the batched weighted average algorithm for V-geometrically ergodic Markov data. This algorithm is a good alternative to the empirical risk minimization algorithm when the latter suffers from overfitting or when optimizing the empirical risk is hard. For the generalization of the algorithm, we prove a PAC-style bound on the training sample size for the expected $L_1$-loss to converge to the optimal loss when training data are V-geometrically ergodic Markov chains. For the robustness, we show that if the training target variable's values contain bounded noise, then the generalization bound of the algorithm deviates at most by the range of the noise. Our results can be applied to the regression problem, the classification problem, and the case where there exists an unknown deterministic target hypothesis.
Submitted 12 August, 2014; v1 submitted 12 June, 2014;
originally announced June 2014.