Showing 1–18 of 18 results for author: Daneshmand, H

Searching in archive cs.
  1. arXiv:2405.13861

    cs.LG

    Transformers Learn Temporal Difference Methods for In-Context Reinforcement Learning

    Authors: Jiuqi Wang, Ethan Blaser, Hadi Daneshmand, Shangtong Zhang

    Abstract: In-context learning refers to the learning ability of a model during inference time without adapting its parameters. The input (i.e., prompt) to the model (e.g., transformers) consists of both a context (i.e., instance-label pairs) and a query instance. The model is then able to output a label for the query instance according to the context during inference. A possible explanation for in-context l…

    Submitted 31 July, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2310.02012

    cs.LG cs.AI

    Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

    Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand

    Abstract: Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves the signal propagation, by avoiding the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth.…

    Submitted 3 October, 2023; originally announced October 2023.

  3. arXiv:2306.00297

    cs.LG cs.AI

    Transformers learn to implement preconditioned gradient descent for in-context learning

    Authors: Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, Suvrit Sra

    Abstract: Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instance…

    Submitted 9 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: Improved presentation and added new results for the nonlinear activation case; 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

    Journal ref: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  4. arXiv:2305.18399

    cs.LG cs.AI stat.ML

    On the impact of activation and normalization in obtaining isometric embeddings at initialization

    Authors: Amir Joudaki, Hadi Daneshmand, Francis Bach

    Abstract: In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, pl…

    Submitted 17 November, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

  5. arXiv:2302.04753

    cs.LG stat.ML

    Efficient displacement convex optimization with particle gradient descent

    Authors: Hadi Daneshmand, Jason D. Lee, Chi Jin

    Abstract: Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are displacement convex in measures. Conc…

    Submitted 9 February, 2023; originally announced February 2023.

  6. arXiv:2205.13076

    cs.LG cs.AI math.ST

    On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization

    Authors: Amir Joudaki, Hadi Daneshmand, Francis Bach

    Abstract: Mean field theory is widely used in the theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions, specifically for deep multilayer perceptron (MLP) with batch normalization (BN) at initialization. By scaling the network width to infinity, it is postulated that the mean-field predictions suffer from layer-wise errors that amp…

    Submitted 20 February, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

  7. arXiv:2204.07879

    cs.LG stat.ML

    Polynomial-time Sparse Measure Recovery: From Mean Field Theory to Algorithm Design

    Authors: Hadi Daneshmand, Francis Bach

    Abstract: Mean field theory has provided theoretical insights into various algorithms by letting the problem size tend to infinity. We argue that the applications of mean-field theory go beyond theoretical insights as it can inspire the design of practical algorithms. Leveraging mean-field analyses in physics, we propose a novel algorithm for sparse measure recovery. For sparse measures over $\mathbb{R}$, w…

    Submitted 12 February, 2023; v1 submitted 16 April, 2022; originally announced April 2022.

  8. arXiv:2106.03970

    stat.ML cs.AI cs.LG

    Batch Normalization Orthogonalizes Representations in Deep Random Networks

    Authors: Hadi Daneshmand, Amir Joudaki, Francis Bach

    Abstract: This paper underlines a subtle property of batch-normalization (BN): Successive batch normalizations with random linear transformations make hidden representations increasingly orthogonal across layers of a deep neural network. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, under a mild assumption…

    Submitted 7 June, 2021; originally announced June 2021.

  9. arXiv:2102.11537

    math.OC cs.LG

    Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization

    Authors: Peiyuan Zhang, Antonio Orvieto, Hadi Daneshmand, Thomas Hofmann, Roy Smith

    Abstract: Viewing optimization methods as numerical integrators for ordinary differential equations (ODEs) provides a thought-provoking modern framework for studying accelerated first-order optimizers. In this literature, acceleration is often supposed to be linked to the quality of the integrator (accuracy, energy preservation, symplecticity). In this work, we propose a novel ordinary differential equation…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 18 pages, 5 figures; Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130

  10. arXiv:2003.01652

    stat.ML cs.LG

    Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

    Authors: Hadi Daneshmand, Jonas Kohler, Francis Bach, Thomas Hofmann, Aurelien Lucchi

    Abstract: Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random mat…

    Submitted 11 June, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

  11. arXiv:1910.14616

    math.OC cs.LG

    Mixing of Stochastic Accelerated Gradient Descent

    Authors: Peiyuan Zhang, Hadi Daneshmand, Thomas Hofmann

    Abstract: We study the mixing properties for stochastic accelerated gradient descent (SAGD) on least-squares regression. First, we show that stochastic gradient descent (SGD) and SAGD are simulating the same invariant distribution. Motivated by this, we then establish mixing rate for SAGD-iterates and compare it with those of SGD-iterates. Theoretically, we prove that the chain of SAGD iterates is geometric…

    Submitted 31 October, 2019; originally announced October 2019.

  12. arXiv:1805.10694

    stat.ML cs.LG

    Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization

    Authors: Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, Thomas Hofmann

    Abstract: Normalization techniques such as Batch Normalization have been applied successfully for training deep neural networks. Yet, despite its apparent empirical benefits, the reasons behind the success of Batch Normalization are mostly hypothetical. We here aim to provide a more thorough theoretical understanding from a classical optimization perspective. Our main contribution towards this goal is the i…

    Submitted 6 October, 2018; v1 submitted 27 May, 2018; originally announced May 2018.

  13. arXiv:1805.05751

    cs.LG math.OC stat.ML

    Local Saddle Point Optimization: A Curvature Exploitation Approach

    Authors: Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann

    Abstract: Gradient-based optimization methods are the most popular choice for finding local optima for classical minimization and saddle point problems. Here, we highlight a systemic issue of gradient dynamics that arise for saddle point problems, namely the presence of undesired stable stationary points that are no local optima. We propose a novel optimization approach that exploits curvature information i…

    Submitted 14 February, 2019; v1 submitted 15 May, 2018; originally announced May 2018.

  14. arXiv:1803.05999

    cs.LG math.OC stat.ML

    Escaping Saddles with Stochastic Gradients

    Authors: Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, Thomas Hofmann

    Abstract: We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensio…

    Submitted 16 September, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

  15. arXiv:1706.03958

    cs.LG

    Accelerated Dual Learning by Homotopic Initialization

    Authors: Hadi Daneshmand, Hamed Hassani, Thomas Hofmann

    Abstract: Gradient descent and coordinate descent are well understood in terms of their asymptotic behavior, but less so in a transient regime often used for approximations in machine learning. We investigate how proper initialization can have a profound effect on finding near-optimal solutions quickly. We show that a certain property of a data set, namely the boundedness of the correlations between eigenfe…

    Submitted 13 June, 2017; originally announced June 2017.

  16. arXiv:1605.06561

    cs.LG

    DynaNewton - Accelerating Newton's Method for Machine Learning

    Authors: Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann

    Abstract: Newton's method is a fundamental technique in optimization with quadratic convergence within a neighborhood around the optimum. However reaching this neighborhood is often slow and dominates the computational costs. We exploit two properties specific to empirical risk minimization problems to accelerate Newton's method, namely, subsampling training data and increasing strong convexity through regu…

    Submitted 20 May, 2016; originally announced May 2016.

  17. arXiv:1603.02839

    cs.LG

    Starting Small -- Learning with Adaptive Sample Sizes

    Authors: Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann

    Abstract: For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when using iterative methods such as stochastic gradient descent. Our interest is motivated by the rise of variance-reduced methods, which achieve linear convergence rat…

    Submitted 7 October, 2016; v1 submitted 9 March, 2016; originally announced March 2016.

  18. arXiv:1405.2936

    cs.SI physics.soc-ph stat.ML

    Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm

    Authors: Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, Bernhard Schoelkopf

    Abstract: Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to…

    Submitted 12 May, 2014; originally announced May 2014.

    Comments: To appear in the 31st International Conference on Machine Learning (ICML), 2014