-
Transformers Learn Temporal Difference Methods for In-Context Reinforcement Learning
Authors:
Jiuqi Wang,
Ethan Blaser,
Hadi Daneshmand,
Shangtong Zhang
Abstract:
In-context learning refers to the learning ability of a model during inference time without adapting its parameters. The input (i.e., prompt) to the model (e.g., transformers) consists of both a context (i.e., instance-label pairs) and a query instance. The model is then able to output a label for the query instance according to the context during inference. A possible explanation for in-context learning is that the forward pass of (linear) transformers implements iterations of gradient descent on the instance-label pairs in the context. In this paper, we prove by construction that transformers can also implement temporal difference (TD) learning in the forward pass, a phenomenon we refer to as in-context TD. We demonstrate the emergence of in-context TD after training the transformer with a multi-task TD algorithm, accompanied by theoretical analysis. Furthermore, we prove that transformers are expressive enough to implement many other policy evaluation algorithms in the forward pass, including residual gradient, TD with eligibility trace, and average-reward TD.
Submitted 31 July, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
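For readers unfamiliar with the TD update named in the abstract above, the following is a minimal sketch of textbook TD(0) policy evaluation with linear function approximation. It is not the paper's transformer construction; the function name `td0_linear`, the step size `alpha`, and the discount `gamma` are illustrative assumptions.

```python
def td0_linear(transitions, num_features, alpha=0.1, gamma=0.9):
    """Textbook TD(0) with a linear value estimate v(s) = w . phi(s).

    `transitions` is an iterable of (phi, reward, phi_next) tuples, where
    phi and phi_next are feature lists of length `num_features`.
    All names here are hypothetical and chosen for illustration only.
    """
    w = [0.0] * num_features
    for phi, reward, phi_next in transitions:
        v = sum(wi * xi for wi, xi in zip(w, phi))
        v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
        # TD error: bootstrapped target minus current estimate.
        delta = reward + gamma * v_next - v
        # Semi-gradient step: move w along phi, scaled by the TD error.
        w = [wi + alpha * delta * xi for wi, xi in zip(w, phi)]
    return w


# Usage: a single state that loops to itself with reward 1.
# Its true discounted value is 1 / (1 - gamma).
loop = [([1.0], 1.0, [1.0])] * 500
weights = td0_linear(loop, num_features=1, alpha=0.1, gamma=0.5)
```

In this toy example the learned weight approaches the true value 1 / (1 - 0.5) = 2.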
-
Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Authors:
Alexandru Meterez,
Amir Joudaki,
Francesco Orabona,
Alexander Immer,
Gunnar Rätsch,
Hadi Daneshmand
Abstract:
Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves signal propagation by preventing the representations from becoming collinear across the layers. However, results on the mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: "Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients?" We answer this question in the affirmative by giving a particular construction of a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
Submitted 3 October, 2023;
originally announced October 2023.
-
Transformers learn to implement preconditioned gradient descent for in-context learning
Authors:
Kwangjun Ahn,
Xiang Cheng,
Hadi Daneshmand,
Suvrit Sra
Abstract:
Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.
Submitted 9 November, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
On the impact of activation and normalization in obtaining isometric embeddings at initialization
Authors:
Amir Joudaki,
Hadi Daneshmand,
Francis Bach
Abstract:
In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and cannot quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.
Submitted 17 November, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
Efficient displacement convex optimization with particle gradient descent
Authors:
Hadi Daneshmand,
Jason D. Lee,
Chi Jin
Abstract:
Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees for optimizing functions that are displacement convex in measures. Concretely, for Lipschitz displacement convex functions defined on probability measures over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures with two-dimensional inputs.
Submitted 9 February, 2023;
originally announced February 2023.
-
On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization
Authors:
Amir Joudaki,
Hadi Daneshmand,
Francis Bach
Abstract:
Mean field theory is widely used in the theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions, specifically for deep multilayer perceptrons (MLPs) with batch normalization (BN) at initialization. By scaling the network width to infinity, it is postulated that the mean-field predictions suffer from layer-wise errors that amplify with depth. We demonstrate that BN stabilizes the distribution of representations, which avoids the error propagation of mean-field predictions. This stabilization, which is characterized by a geometric mixing property, allows us to establish concentration bounds for mean-field predictions in infinitely deep neural networks with a finite width.
Submitted 20 February, 2023; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Polynomial-time Sparse Measure Recovery: From Mean Field Theory to Algorithm Design
Authors:
Hadi Daneshmand,
Francis Bach
Abstract:
Mean field theory has provided theoretical insights into various algorithms by letting the problem size tend to infinity. We argue that the applications of mean-field theory go beyond theoretical insights as it can inspire the design of practical algorithms. Leveraging mean-field analyses in physics, we propose a novel algorithm for sparse measure recovery. For sparse measures over $\mathbb{R}$, we propose a polynomial-time recovery method from Fourier moments that improves upon convex relaxation methods in a specific parameter regime; then, we demonstrate the application of our results for the optimization of particular two-dimensional, single-layer neural networks in realizable settings.
Submitted 12 February, 2023; v1 submitted 16 April, 2022;
originally announced April 2022.
-
Batch Normalization Orthogonalizes Representations in Deep Random Networks
Authors:
Hadi Daneshmand,
Amir Joudaki,
Francis Bach
Abstract:
This paper underlines a subtle property of batch-normalization (BN): Successive batch normalizations with random linear transformations make hidden representations increasingly orthogonal across layers of a deep neural network. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, under a mild assumption, we prove that the deviation of the representations from orthogonality rapidly decays with depth up to a term inversely proportional to the network width. This result has two main implications: 1) Theoretically, as the depth grows, the distribution of the representation -- after the linear layers -- contracts to a Wasserstein-2 ball around an isotropic Gaussian distribution. Furthermore, the radius of this Wasserstein ball shrinks with the width of the network. 2) In practice, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD). When representations are initially aligned, we observe SGD wastes many iterations to orthogonalize representations before the classification. Nevertheless, we experimentally show that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
Submitted 7 June, 2021;
originally announced June 2021.
-
Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization
Authors:
Peiyuan Zhang,
Antonio Orvieto,
Hadi Daneshmand,
Thomas Hofmann,
Roy Smith
Abstract:
Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation that questions this connection: both the explicit and the semi-implicit (a.k.a. symplectic) Euler discretizations of this ODE lead to an accelerated algorithm for convex programming. Although semi-implicit methods are well known in numerical analysis to enjoy many desirable features for the integration of physical systems, our findings show that these properties do not necessarily relate to acceleration.
Submitted 23 February, 2021;
originally announced February 2021.
-
Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks
Authors:
Hadi Daneshmand,
Jonas Kohler,
Francis Bach,
Thomas Hofmann,
Aurelien Lucchi
Abstract:
Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets. Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.
Submitted 11 June, 2020; v1 submitted 3 March, 2020;
originally announced March 2020.
-
Mixing of Stochastic Accelerated Gradient Descent
Authors:
Peiyuan Zhang,
Hadi Daneshmand,
Thomas Hofmann
Abstract:
We study the mixing properties of stochastic accelerated gradient descent (SAGD) on least-squares regression. First, we show that stochastic gradient descent (SGD) and SAGD simulate the same invariant distribution. Motivated by this, we then establish the mixing rate for SAGD iterates and compare it with that of SGD iterates. Theoretically, we prove that the chain of SAGD iterates is geometrically ergodic, using a proper choice of parameters and under regularity assumptions on the input distribution. More specifically, we derive an explicit mixing rate depending on the first 4 moments of the data distribution. By means of illustrative examples, we prove that the SAGD-iterate chain mixes faster than the chain of iterates obtained by SGD. Furthermore, we highlight applications of the established mixing rate in the convergence analysis of SAGD on realizable objectives. The proposed analysis is based on a novel non-asymptotic analysis of products of random matrices. This theoretical result is substantiated and validated by experiments.
Submitted 31 October, 2019;
originally announced October 2019.
-
Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization
Authors:
Jonas Kohler,
Hadi Daneshmand,
Aurelien Lucchi,
Ming Zhou,
Klaus Neymeyr,
Thomas Hofmann
Abstract:
Normalization techniques such as Batch Normalization have been applied successfully for training deep neural networks. Yet, despite its apparent empirical benefits, the reasons behind the success of Batch Normalization are mostly hypothetical. We here aim to provide a more thorough theoretical understanding from a classical optimization perspective. Our main contribution towards this goal is the identification of various problem instances in the realm of machine learning where, under certain assumptions, Batch Normalization can provably accelerate optimization. We argue that this acceleration is due to the fact that Batch Normalization splits the optimization task into optimizing length and direction of the parameters separately. This allows gradient-based methods to leverage a favourable global structure in the loss landscape that we prove to exist in Learning Halfspace problems and neural network training with Gaussian inputs. We thereby turn Batch Normalization from an effective practical heuristic into a provably converging algorithm for these settings. Furthermore, we substantiate our analysis with empirical evidence that suggests the validity of our theoretical results in a broader context.
Submitted 6 October, 2018; v1 submitted 27 May, 2018;
originally announced May 2018.
-
Local Saddle Point Optimization: A Curvature Exploitation Approach
Authors:
Leonard Adolphs,
Hadi Daneshmand,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
Gradient-based optimization methods are the most popular choice for finding local optima for classical minimization and saddle point problems. Here, we highlight a systemic issue of gradient dynamics that arises for saddle point problems, namely the presence of undesired stable stationary points that are not local optima. We propose a novel optimization approach that exploits curvature information in order to escape from these undesired stationary points. We prove that different optimization methods, including the gradient method and Adagrad, equipped with curvature exploitation can escape non-optimal stationary points. We also provide empirical results on common saddle point problems which confirm the advantage of using curvature exploitation.
Submitted 14 February, 2019; v1 submitted 15 May, 2018;
originally announced May 2018.
-
Escaping Saddles with Stochastic Gradients
Authors:
Hadi Daneshmand,
Jonas Kohler,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.
Submitted 16 September, 2018; v1 submitted 15 March, 2018;
originally announced March 2018.
-
Accelerated Dual Learning by Homotopic Initialization
Authors:
Hadi Daneshmand,
Hamed Hassani,
Thomas Hofmann
Abstract:
Gradient descent and coordinate descent are well understood in terms of their asymptotic behavior, but less so in a transient regime often used for approximations in machine learning. We investigate how proper initialization can have a profound effect on finding near-optimal solutions quickly. We show that a certain property of a data set, namely the boundedness of the correlations between eigenfeatures and the response variable, can lead to faster initial progress than expected by commonplace analysis. Convex optimization problems can tacitly benefit from that, but this automatism does not apply to their dual formulation. We analyze this phenomenon and devise provably good initialization strategies for dual optimization as well as heuristics for the non-convex case, relevant for deep learning. We find our predictions and methods to be experimentally well-supported.
Submitted 13 June, 2017;
originally announced June 2017.
-
DynaNewton - Accelerating Newton's Method for Machine Learning
Authors:
Hadi Daneshmand,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
Newton's method is a fundamental technique in optimization with quadratic convergence within a neighborhood around the optimum. However, reaching this neighborhood is often slow and dominates the computational costs. We exploit two properties specific to empirical risk minimization problems to accelerate Newton's method, namely, subsampling training data and increasing strong convexity through regularization. We propose a novel continuation method, where we define a family of objectives over increasing sample sizes and with decreasing regularization strength. Solutions on this path are tracked such that the minimizer of the previous objective is guaranteed to be within the quadratic convergence region of the next objective to be optimized. Thereby every Newton iteration is guaranteed to achieve super-linear contractions with regard to the chosen objective, which becomes a moving target. We provide a theoretical analysis that motivates our algorithm, called DynaNewton, and characterizes its speed of convergence. Experiments on a wide range of data sets and problems consistently confirm the predicted computational savings.
Submitted 20 May, 2016;
originally announced May 2016.
-
Starting Small -- Learning with Adaptive Sample Sizes
Authors:
Hadi Daneshmand,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when using iterative methods such as stochastic gradient descent. Our interest is motivated by the rise of variance-reduced methods, which achieve linear convergence rates that scale favorably for smaller sample sizes. Exploiting this feature, we show -- theoretically and empirically -- how to obtain significant speed-ups with a novel algorithm that reaches statistical accuracy on an $n$-sample in $2n$, instead of $n \log n$ steps.
Submitted 7 October, 2016; v1 submitted 9 March, 2016;
originally announced March 2016.
-
Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm
Authors:
Hadi Daneshmand,
Manuel Gomez-Rodriguez,
Le Song,
Bernhard Schoelkopf
Abstract:
Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees?
Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $l_1$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe $O(d^3 \log N)$ cascades, where $d$ is the maximum number of parents of a node and $N$ is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.
Submitted 12 May, 2014;
originally announced May 2014.
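As background for the soft-thresholding step mentioned in the abstract above, here is a minimal sketch of the standard scalar soft-thresholding operator, which is the proximal map of the $l_1$ penalty. It is a generic illustration rather than the paper's inference algorithm; the names `soft_threshold` and `tau` are assumptions.

```python
def soft_threshold(x, tau):
    """Standard soft-thresholding: shrink x toward zero by tau.

    Values with |x| <= tau are set exactly to zero, which is what
    produces sparse solutions under l1 regularization.
    The names here are hypothetical and for illustration only.
    """
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


# Usage: apply the operator entrywise to a vector of edge weights.
weights = [3.0, -3.0, 0.5]
shrunk = [soft_threshold(w, 1.0) for w in weights]
```

Applied entrywise, small entries vanish and large entries are shrunk by `tau`, so the result keeps only the stronger candidate edges.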