Search | arXiv e-print repository

Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh

Abstract: This paper presents two models of neural networks and their training applicable to neural networks of arbitrary width, depth and topology, assuming only finite-energy neural activations; and a novel representor theory for neural networks in terms of a matrix-valued kernel. The first model is exact (un-approximated) and global, casting the neural network as an element in a reproducing kernel Banach space (RKBS); we use this model to provide tight bounds on Rademacher complexity. The second model is exact and local, casting the change in neural network function resulting from a bounded change in weights and biases (i.e. a training step) in reproducing kernel Hilbert space (RKHS) in terms of a local-intrinsic neural kernel (LiNK). This local model provides insight into model adaptation through tight bounds on Rademacher complexity of network adaptation. We also prove that the neural tangent kernel (NTK) is a first-order approximation of the LiNK. Finally, and noting that the LiNK does not provide a representor theory for technical reasons, we present an exact novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK). This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models. Throughout the paper (a) feedforward ReLU networks and (b) residual networks (ResNet) are used as illustrative examples.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2402.17343 [pdf, other]

Enhanced Bayesian Optimization via Preferential Modeling of Abstract Properties

Authors: Arun Kumar A V, Alistair Shilton, Sunil Gupta, Santu Rana, Stewart Greenhill, Svetha Venkatesh

Abstract: Experimental (design) optimization is a key driver in designing and discovering new products and processes. Bayesian Optimization (BO) is an effective tool for optimizing expensive and black-box experimental design processes. While Bayesian optimization is a principled data-driven approach to experimental optimization, it learns everything from scratch and could greatly benefit from the expertise of its human (domain) experts who often reason about systems at different abstraction levels using physical properties that are not necessarily directly measured (or measurable). In this paper, we propose a human-AI collaborative Bayesian framework to incorporate expert preferences about unmeasured abstract properties into the surrogate modeling to further boost the performance of BO. We provide an efficient strategy that can also handle any incorrect/misleading expert bias in preferential judgments. We discuss the convergence behavior of our proposed framework. Our experimental results involving synthetic functions and real-world datasets show the superiority of our method over the baselines.

Submitted 27 February, 2024; originally announced February 2024.

Comments: 19 Pages, 6 Figures

arXiv:2402.03243 [pdf, other]

PINN-BO: A Black-box Optimization Algorithm using Physics-Informed Neural Networks

Authors: Dat Phan-Trong, Hung The Tran, Alistair Shilton, Sunil Gupta

Abstract: Black-box optimization is a powerful approach for discovering global optima in noisy and expensive black-box functions, a problem widely encountered in real-world scenarios. Recently, there has been a growing interest in leveraging domain knowledge to enhance the efficacy of machine learning methods. Partial Differential Equations (PDEs) often provide an effective means for elucidating the fundamental principles governing the black-box functions. In this paper, we propose PINN-BO, a black-box optimization algorithm employing Physics-Informed Neural Networks that integrates the knowledge from PDEs to improve the sample efficiency of the optimization. We analyze the theoretical behavior of our algorithm in terms of its regret bound using advances in NTK theory and prove that, by using the PDE alongside the black-box function evaluations, PINN-BO achieves a tighter regret bound. We perform several experiments on a variety of optimization tasks and show that our algorithm is more sample-efficient compared to existing methods.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2303.01684 [pdf, other]

BO-Muse: A human expert and AI teaming framework for accelerated experimental design

Authors: Sunil Gupta, Alistair Shilton, Arun Kumar A V, Shannon Ryan, Majid Abdolshah, Hung Le, Santu Rana, Julian Berk, Mahad Rashid, Svetha Venkatesh

Abstract: In this paper we introduce BO-Muse, a new approach to human-AI teaming for the optimization of expensive black-box functions. Inspired by the intrinsic difficulty of extracting expert knowledge and distilling it back into AI models and by observations of human behavior in real-world experimental design, our algorithm lets the human expert take the lead in the experimental process. The human expert can use their domain expertise to its full potential, while the AI plays the role of a muse, injecting novelty and searching for areas of weakness to break the human out of over-exploitation induced by cognitive entrenchment. With mild assumptions, we show that our algorithm converges sub-linearly, at a rate faster than the AI or human alone. We validate our algorithm using synthetic data and with human experts performing real-world experiments.

Submitted 30 March, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: 34 Pages, 7 Figures and 5 Tables

arXiv:2302.00205 [pdf, other]

Gradient Descent in Neural Networks as Sequential Learning in RKBS

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh

Abstract: The study of Neural Tangent Kernels (NTKs) has provided much-needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel Banach space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present a novel bound on uniform convergence where the iteration count and learning rate play a central role, giving new theoretical insight into neural network training.

Submitted 31 January, 2023; originally announced February 2023.

arXiv:2009.03543 [pdf, other]

Sequential Subspace Search for Functional Bayesian Optimization Incorporating Experimenter Intuition

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh

Abstract: We propose an algorithm for Bayesian functional optimisation - that is, finding the function to optimise a process - guided by experimenter beliefs and intuitions regarding the expected characteristics (length-scale, smoothness, cyclicity etc.) of the optimal solution encoded into the covariance function of a Gaussian Process. Our algorithm generates a sequence of finite-dimensional random subspaces of functional space spanned by a set of draws from the experimenter's Gaussian Process. Standard Bayesian optimisation is applied on each subspace, and the best solution found is used as a starting point (origin) for the next subspace. Using the concept of effective dimensionality, we analyse the convergence of our algorithm and provide a regret bound to show that our algorithm converges in sub-linear time provided a finite effective dimension exists. We test our algorithm in simulated and real-world experiments, namely blind function matching, finding the optimal precipitation-strengthening function for an aluminium alloy, and learning rate schedule optimisation for deep networks.

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2007.07459 [pdf, other]

From deep to Shallow: Equivalent Forms of Deep Networks in Reproducing Kernel Krein Space and Indefinite Support Vector Machines

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh

Abstract: In this paper we explore a connection between deep networks and learning in reproducing kernel Krein space. Our approach is based on the concept of push-forward - that is, taking a fixed non-linear transform on a linear projection and converting it to a linear projection on the output of a fixed non-linear transform, pushing the weights forward through the non-linearity. Applying this repeatedly from the input to the output of a deep network, the weights can be progressively "pushed" to the output layer, resulting in a flat network that has the form of a fixed non-linear map (whose form is determined by the structure of the deep network) followed by a linear projection determined by the weight matrices - that is, we take a deep network and convert it to an equivalent (indefinite) kernel machine. We then investigate the implications of this transformation for capacity control and uniform convergence, and provide a Rademacher complexity bound on the deep network in terms of Rademacher complexity in reproducing kernel Krein space. Finally, we analyse the sparsity properties of the flat representation, showing that the flat weights are (effectively) Lp-"norm" regularised with 0<p<1 (bridge regression).

Submitted 8 September, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

arXiv:1911.12473 [pdf, other]

Bayesian Optimization for Categorical and Category-Specific Continuous Inputs

Authors: Dang Nguyen, Sunil Gupta, Santu Rana, Alistair Shilton, Svetha Venkatesh

Abstract: Many real-world functions are defined over both categorical and category-specific continuous variables and thus cannot be optimized by traditional Bayesian optimization (BO) methods. To optimize such functions, we propose a new method that formulates the problem as a multi-armed bandit problem, wherein each category corresponds to an arm with its reward distribution centered around the optimum of the objective function in continuous variables. Our goal is to identify the best arm and the maximizer of the corresponding continuous function simultaneously. Our algorithm uses a Thompson sampling scheme that helps connect multi-armed bandits and BO in a unified framework. We extend our method to batch BO to allow parallel optimization when multiple resources are available. We theoretically analyze our method for convergence and prove sub-linear regret bounds. We perform a variety of experiments: optimization of several benchmark functions, hyper-parameter tuning of a neural network, and automatic selection of the best machine learning model along with its optimal hyper-parameters (a.k.a. automated machine learning). Comparisons with other methods demonstrate the effectiveness of our proposed method.

Submitted 27 November, 2019; originally announced November 2019.

Comments: To appear at AAAI 2020

arXiv:1909.03600 [pdf, other]

Cost-aware Multi-objective Bayesian optimisation

Authors: Majid Abdolshah, Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh

Abstract: The notion of expense in Bayesian optimisation generally refers to the uniformly expensive cost of function evaluations over the whole search space. However, in some scenarios, the cost of evaluation for black-box objective functions is non-uniform since different inputs from the search space may incur different costs for function evaluations. We introduce a cost-aware multi-objective Bayesian optimisation with non-uniform evaluation cost over objective functions by defining cost-aware constraints over the search space. The cost-aware constraints are a sorted tuple of indexes that represents the ordering of dimensions of the search space based on the user's prior knowledge about their cost of usage. We formulate a new multi-objective Bayesian optimisation acquisition function, with detailed analysis of the convergence, that incorporates these cost-aware constraints while optimising the objective functions. We demonstrate our algorithm on synthetic and real-world problems in hyperparameter tuning of neural networks and random forests.

Submitted 8 September, 2019; originally announced September 2019.

arXiv:1902.07846 [pdf, other]

Stable Bayesian Optimisation via Direct Stability Quantification

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh, Majid Abdolshah, Dang Nguyen

Abstract: In this paper we consider the problem of finding stable maxima of expensive (to evaluate) functions. We are motivated by the optimisation of physical and industrial processes where, for some input ranges, small and unavoidable variations in inputs lead to unacceptably large variation in outputs. Our approach uses multiple gradient Gaussian Process models to estimate the probability that the worst-case output variation for a specified input perturbation exceeds the desired maximum, and these probabilities are then used to (a) guide the optimisation process toward solutions satisfying our stability criteria and (b) post-filter results to find the best stable solution. We exhibit our algorithm on synthetic and real-world problems and demonstrate that it is able to effectively find stable maxima.

Submitted 20 February, 2019; originally announced February 2019.

arXiv:1902.04228 [pdf, other]

Multi-objective Bayesian optimisation with preferences over objectives

Authors: Majid Abdolshah, Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh

Abstract: We present a multi-objective Bayesian optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type "objective A is more important than objective B". These preferences are defined based on the stability of the obtained solutions with respect to preferred objective functions. Rather than attempting to find a representative subset of the complete Pareto front, our algorithm selects those Pareto-optimal points that satisfy these constraints. We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of the Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation is weighted by the probability of a point satisfying the constraints from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.

Submitted 12 November, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

arXiv:1805.07852 [pdf, other]

Accelerated Bayesian Optimization through Weight-Prior Tuning

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Pratibha Vellanki, Laurence Park, Cheng Li, Svetha Venkatesh, Alessandra Sutti, David Rubin, Thomas Dorin, Alireza Vahid, Murray Height, Teo Slezak

Abstract: Bayesian optimization (BO) is a widely-used method for optimizing expensive (to evaluate) problems. At the core of most BO methods is the modeling of the objective function using a Gaussian Process (GP) whose covariance is selected from a set of standard covariance functions. From a weight-space view, this models the objective as a linear function in a feature space implied by the given covariance K, with an arbitrary Gaussian weight prior ${\bf w} \sim \mathcal{N} ({\bf 0}, {\bf I})$. In many practical applications there is data available that has a similar (covariance) structure to the objective, but which, having a different form, cannot be used directly in standard transfer learning. In this paper we show how such auxiliary data may be used to construct a GP covariance corresponding to a more appropriate weight prior for the objective function. Building on this, we show that we may accelerate BO by modeling the objective function using this (learned) weight prior, which we demonstrate on both test functions and a practical application to short-polymer fibre manufacture.

Submitted 6 February, 2020; v1 submitted 20 May, 2018; originally announced May 2018.

Journal ref: PMLR 108:635-645, 2020

arXiv:1802.05400 [pdf]

High Dimensional Bayesian Optimization Using Dropout

Authors: Cheng Li, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh, Alistair Shilton

Abstract: Scaling Bayesian optimization to high dimensions is a challenging task as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on limited active variables or the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how it can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.

Submitted 14 February, 2018; originally announced February 2018.

Comments: 7 pages; Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence 2017

arXiv:1802.05370 [pdf, other]

Covariance Function Pre-Training with m-Kernels for Accelerated Bayesian Optimisation

Authors: Alistair Shilton, Sunil Gupta, Santu Rana, Pratibha Vellanki, Cheng Li, Laurence Park, Svetha Venkatesh, Alessandra Sutti, David Rubin, Thomas Dorin, Alireza Vahid, Murray Height

Abstract: The paper presents a novel approach to direct covariance function learning for Bayesian optimisation, with particular emphasis on experimental design problems where an existing corpus of condensed knowledge is present. The method presented borrows techniques from reproducing kernel Banach space theory (specifically m-kernels) and leverages them to convert (or re-weight) existing covariance functions into new, problem-specific covariance functions. The key advantage of this approach is that, rather than relying on the user to manually select (with some hyperparameter tuning and experimentation) an appropriate covariance function, it constructs the covariance function to specifically match the problem at hand. The technique is demonstrated on two real-world problems - specifically alloy design and short-polymer fibre manufacturing - as well as a selected test function.

Submitted 12 March, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

arXiv:1106.4613 [pdf, ps, other]

doi 10.1016/j.cpc.2011.12.026

Fast supersymmetry phenomenology at the Large Hadron Collider using machine learning techniques

Authors: A. Buckley, A. Shilton, M. J. White

Abstract: A pressing problem for supersymmetry (SUSY) phenomenologists is how to incorporate Large Hadron Collider search results into parameter fits designed to measure or constrain the SUSY parameters. Owing to the computational expense of fully simulating lots of points in a generic SUSY space to aid the calculation of the likelihoods, the limits published by experimental collaborations are frequently interpreted in slices of reduced parameter spaces. For example, both ATLAS and CMS have presented results in the Constrained Minimal Supersymmetric Model (CMSSM) by fixing two of four parameters, and generating a coarse grid in the remaining two. We demonstrate that by generating a grid in the full space of the CMSSM, one can interpolate between the output of an LHC detector simulation using machine learning techniques, thus obtaining a superfast likelihood calculator for LHC-based SUSY parameter fits. We further investigate how much training data is required to obtain usable results, finding that approximately 2000 points are required in the CMSSM to get likelihood predictions to an accuracy of a few per cent. The techniques presented here provide a general approach for adding LHC event rate data to SUSY fitting algorithms, and can easily be used to explore other candidate physics models.

Submitted 7 July, 2011; v1 submitted 22 June, 2011; originally announced June 2011.

Comments: 20 pages, 7 figures, replaced to correct author contact details

Showing 1–15 of 15 results for author: Shilton, A