Structured Inverse-Free Natural Gradient Descent:
Memory-Efficient & Numerically-Stable KFAC

Wu Lin Felix Dangel Runa Eschenhagen Kirill Neklyudov Agustinus Kristiadi Richard E. Turner Alireza Makhzani

Abstract

Second-order methods such as KFAC can be useful for neural net training. However, they are often memory-inefficient since their preconditioning Kronecker factors are dense, and numerically unstable in low precision as they require matrix inversion or decomposition. These limitations render such methods unpopular for modern mixed-precision training. We address them by (i) formulating an inverse-free KFAC update and (ii) imposing structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent (SINGD). On modern neural networks, we show that SINGD is memory-efficient and numerically robust, in contrast to KFAC, and often outperforms AdamW even in half precision. Our work closes a gap between first- and second-order methods in modern low-precision training.

Machine Learning, ICML

1 Introduction

The continuing success of deep learning (DL) is—to a large extent—powered by scaling up computational power (Thompson et al., 2020) to increase the number of trainable neural network (NN) parameters. Contemporary natural language processing (Radford et al., 2019; Brown et al., 2020; Touvron et al., 2023) and computer vision (Dehghani et al., 2023) models often consist of billions of parameters, and will likely grow further in the future. To compensate for increasing computational demands, many training pipelines use lower precision data types (Micikevicius et al., 2018) and memory-efficient first-order optimizers like SGD (Robbins & Monro, 1951) or Adam(W) (Kingma & Ba, 2015; Loshchilov & Hutter, 2019).

Second-order methods, like natural gradient descent (NGD, Amari, 1998), leverage curvature information, which has many applications in DL: It is useful for improving training dynamics (Martens & Grosse, 2015; Osawa et al., 2023), pruning (Wang et al., 2019), understanding the influence of training examples (Bae et al., 2022), and uncertainty estimation (Zhang et al., 2018; Immer et al., 2021; Daxberger et al., 2021). One major reason why these methods are rarely used is their higher memory consumption and iteration cost.

Perhaps the most common approach to scaling second-order methods for DL is Kronecker-factored approximate curvature (KFAC, Heskes, 2000; Martens & Grosse, 2015), which approximates the Fisher’s block diagonals via Kronecker products. While the KFAC optimizer built on top of this curvature approximation, and its variants such as George et al. (2018), show promising results for medium-sized NNs (e.g. Osawa et al., 2023), their usefulness is often limited by (i) memory consumption, and (ii) the use of low-precision floating-point (FP) training that renders matrix decompositions/inversions required to pre-condition the gradient numerically unstable.

Recently, Lin et al. (2023) proposed an inverse-free Kronecker-factored natural gradient descent (INGD) algorithm that replaces matrix inversion with subtraction in a matrix logarithm space. Their update is purely based on matrix multiplications and therefore numerically stable in single-precision (FP-32); however, it is unclear whether this extends to half-precision (BFP-16). Furthermore, INGD has not been derived from the popular natural gradient approaches for DL. It is unclear if and how the method is connected to the predominant KFAC optimizer. Also, INGD does not improve over KFAC’s memory complexity since its Kronecker factors are dense matrices of the same size. And lastly, INGD has only been tested on convolution-based models and it is unclear whether it is useful for training modern transformer-based architectures (Vaswani et al., 2017).

Figure 1: CIFAR-100 experiments on VGG net. *Left/Center:* Our methods (IKFAC and SINGD) outperform AdamW and perform stably in FP-32 *and* BFP-16, unlike KFAC, as they do not require matrix inversions. IKFAC effectively performs KFAC updates and achieves similar performance in FP-32. For this task, replacing the dense Kronecker factors (INGD = SINGD-Dense) with diagonal ones (SINGD-Diag) does not harm performance while reducing cost. *Right:* Memory consumption. Removing Riemannian momentum (IKFAC) or using structured Kronecker factors (SINGD-Diag) reduces INGD’s memory in FP-32 and BFP-16. In BFP-16, SINGD-Diag achieves AdamW’s memory consumption (dashed line).

Here, we extend INGD to lower its computational cost and theoretically resolve its connection to other approximate NGD methods for DL (overview in Figure 2): First, we show that a special case of INGD recovers the KFAC method. This allows us to effectively perform KFAC updates in an inverse-free fashion. We call this modification of INGD inverse-free KFAC (IKFAC). Second, we exploit an algebraic structure in the matrix logarithm space and propose structure-preserving updates to maintain sparse structures on Kronecker factors. This significantly reduces memory and leads to a novel, scalable second-order optimization algorithm we call structured inverse-free natural gradient descent (SINGD) which contains INGD and IKFAC as special cases. We evaluate SINGD on convolution- and transformer-based models and show that it can (i) outperform SGD and AdamW while using as little memory as the latter thanks to structured Kronecker factors and (ii) yield better performance than KFAC while being stable in half-precision:

(a) We bridge the gap between INGD (Lin et al., 2023) and the original KFAC (Martens & Grosse, 2015), whose matrix inversions are unstable in low precision. Thereby, we effectively make KFAC inverse-free and amenable to low-precision training (Figure 1, left/center).

(b) We impose various structures (block-diagonal, low-rank, Toeplitz, hierarchical) on INGD’s Kronecker factors, allowing them to be sparse to lower the memory consumption and run time (Figure 1, right and Table 1). Unlike many existing second-order methods tailored to a form of structure, our proposed update rule (Figure 4) is unified, efficient, and inverse-free for a range of structures. We analyze the impact of structures on downstream performance and find that structures with considerably lower memory consumption (even lower than AdamW) can yield competitive performance.

(c) Unlike other second-order methods, we show that SINGD can stably train a range of modern architectures (transformers, CNNs, GNNs) in BFP-16. In contrast to first-order methods which are often useful in narrower scopes (SGD is best for CNNs, AdamW is best for transformers), SINGD works well and outperforms SGD and AdamW in many cases (see Section 4).

Our work closes a gap between first- and second-order methods in modern low-precision neural network training. A PyTorch implementation is available at github.com/f-dangel/singd.

Table 1: Training times and memory consumption for the optimizers shown in Figure 1 (parenthesized values are normalized relative to SGD; our methods are marked with an asterisk). INGD has 80 % time and 30 % memory overhead compared to SGD. In contrast, our SINGD-Diag only has 30 % time and 2 % memory overhead. This means that by using structures we can reduce INGD’s time overhead by more than half, and basically eliminate its memory overhead compared to first-order competitors.

Method	Peak memory [GiB]	Training time [min]
SGD (BFP-16)	2.63 (1.00 x)	18.5 (1.00 x)
AdamW (BFP-16)	2.69 (1.02 x)	19.7 (1.07 x)
SINGD-Diag* (BFP-16)	2.67 (1.02 x)	23.8 (1.29 x)
IKFAC* (BFP-16)	3.18 (1.21 x)	34.0 (1.84 x)
INGD (BFP-16)	3.39 (1.29 x)	34.1 (1.84 x)
KFAC (FP-32)	4.00 (1.52 x)	83.2 (4.49 x)

2 Preliminaries

We first introduce the necessary ingredients to establish a connection between INGD and KFAC, which are derived from different perspectives. We start by describing Newton’s method since both methods can be seen as approximate Newton methods using NGD. NN training often corresponds to an unconstrained minimization problem. Consider training a NN for image classification. Given a set of $N$ examples $\{y_{i},\mathbf{x}_{i}\}_{i=1}^{N}$ with labels $y_{i}$ and images $\mathbf{x}_{i}$, the optimization problem is

\min_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\coloneq\min_{\mu}\textstyle\sum_{i=1}^{N}c(y_{i},f(\boldsymbol{\mu};\mathbf{x}_{i}))\,,

(1)

where $\mathbf{y}\coloneq(y_{1},\dots,y_{N})$, $\mathbf{X}\coloneq(\mathbf{x}_{1},\dots,\mathbf{x}_{N})$, and $\hat{y}_{i}\coloneq f(\boldsymbol{\mu};\mathbf{x}_{i})$ is a NN that outputs a predicted label $\hat{y}_{i}$ for an image $\mathbf{x}_{i}$. Parameters $\boldsymbol{\mu}$ denote learnable weights of the NN and $c(y_{i},\hat{y}_{i})$ is a differentiable loss function to measure the difference between a true label $y_{i}$ and a predicted label $\hat{y}_{i}$. To solve Equation 1, Newton’s method follows the update

\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\mathbf{S}^{-1}\left(\nabla_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\right)\,,

(2)

where $\mathbf{S}\coloneq\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})$ is the Hessian of the loss.

2.1 KFAC: Approximate NGD for MLE

Computing the Hessian, as required by Newton’s method, is usually intractable for NNs. NGD uses a Fisher information matrix (FIM) instead of the Hessian by reformulating problem (1) as maximum likelihood estimation (MLE) of $p(\mathbf{y}\mid\boldsymbol{\mu},\mathbf{X})=\prod_{i}p(y_{i}\mid\boldsymbol{\mu},\mathbf{x}_{i})$, where $p(y_{i}\mid\boldsymbol{\mu},\mathbf{x}_{i})\coloneq\exp(-c(y_{i},f(\boldsymbol{\mu};\mathbf{x}_{i})))$. The maximization problem $\max_{\mu}p(\mathbf{y}\mid\boldsymbol{\mu},\mathbf{X})$ is equivalent to the MLE problem

\min_{\mu}-\log p(\mathbf{y}\mid\boldsymbol{\mu},\mathbf{X})=\min_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\,.

(3)

This formulation allows us to exploit additional statistical structures such as the FIM, which is defined as shown below (Kunstner et al., 2019), where we assume a label $y$ is sampled from the likelihood $p(y\mid\boldsymbol{\mu},\mathbf{x}_{i})$ given an image $\mathbf{x}_{i}$. With $\mathbf{s}_{i}(y)\coloneq\log p(y\mid\boldsymbol{\mu},\mathbf{x}_{i})$, we have

\begin{split}F(\boldsymbol{\mu})&\coloneq\sum_{i=1}^{N}\mathbb{E}_{y\sim p(y\mid\mu,x_{i})}\left[\nabla_{\mu}\mathbf{s}_{i}(y)(\nabla_{\mu}\mathbf{s}_{i}(y))^{\top}\right]\\ &=\sum_{i=1}^{N}\mathbb{E}_{y\sim p(y\mid\mu,x_{i})}\left[-\nabla_{\mu}^{2}\mathbf{s}_{i}(y)\right]\,.\end{split}

(4)

For ubiquitous loss functions like the mean-squared error and cross-entropy, and more generally, many members of the exponential family with natural parameterization, the FIM coincides with the generalized Gauss-Newton (GGN) matrix (Wang, 2010; Martens, 2014), a common approximation of the Hessian in deep learning (Schraudolph, 2002; Botev et al., 2017). This relationship connects NGD to Newton’s method. A common approximation of the FIM/GGN and Hessian is the so-called empirical Fisher $\hat{F}(\boldsymbol{\mu})$, which replaces the samples $y$ from the model’s predictive distribution in Equation 4 with the empirical data labels $y_{i}$:

\begin{split}\hat{F}(\boldsymbol{\mu})&\coloneq\sum_{i=1}^{N}\nabla_{\mu}\mathbf{s}_{i}(y_{i})(\nabla_{\mu}\mathbf{s}_{i}(y_{i}))^{\top}\\ &\approx-\sum_{i=1}^{N}\nabla_{\mu}^{2}\mathbf{s}_{i}(y_{i})=\mathbf{S}\,.\end{split}

While there is no clear theoretical justification for this Hessian approximation (Kunstner et al., 2019), it simplifies the implementation, reduces cost, and has been shown to work well in practice (Graves, 2011; Osawa et al., 2019). This approximation is also known as Fisher’s scoring with observed FIM for nonlinear models (Osborne, 1992; Smyth, 1996, 2015). With this, we can formulate an NGD update with the empirical FIM $\hat{F}(\boldsymbol{\mu})$ to approximate Newton’s method as

\begin{split}\boldsymbol{\mu}&\leftarrow\boldsymbol{\mu}-\beta\left(\hat{F}(\boldsymbol{\mu})\right)^{-1}\nabla_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\\ &\phantom{\leftarrow}\approx\boldsymbol{\mu}-\beta\mathbf{S}^{-1}\nabla_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\,.\end{split}

We call this update NGD for MLE.
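To make the NGD-for-MLE update concrete, the following is a minimal sketch on a toy problem. It assumes a logistic-regression model in place of a NN (so that per-example gradients of $\mathbf{s}_{i}(y_{i})$ have a closed form); the helper names (`per_example_score_grads`, `empirical_fisher`, `ngd_step`), the synthetic data, and the stepsize are illustrative assumptions and not part of the method described here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(mu, X, y):
    """Negative log-likelihood for Bernoulli labels y in {0, 1}."""
    p = sigmoid(X @ mu)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def per_example_score_grads(mu, X, y):
    """Row i holds grad_mu s_i(y_i), with s_i = log p(y_i | mu, x_i)."""
    return (y - sigmoid(X @ mu))[:, None] * X

def empirical_fisher(mu, X, y):
    """Sum over examples of grad s_i (grad s_i)^T, using the data labels."""
    J = per_example_score_grads(mu, X, y)  # shape (N, d)
    return J.T @ J

def ngd_step(mu, X, y, beta):
    """One NGD step: mu <- mu - beta * F_hat^{-1} grad loss."""
    grad_loss = -per_example_score_grads(mu, X, y).sum(axis=0)
    F_hat = empirical_fisher(mu, X, y)
    return mu - beta * np.linalg.solve(F_hat, grad_loss)

# Synthetic data (assumption): N = 50 examples, d = 3 features.
rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
true_mu = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=N) < sigmoid(X @ true_mu)).astype(float)

mu0 = np.zeros(d)
mu1 = ngd_step(mu0, X, y, beta=0.5)
loss_before, loss_after = loss(mu0, X, y), loss(mu1, X, y)
```

Note that the sketch solves a dense $d\times d$ linear system, which is exactly the cost that becomes prohibitive for NNs and motivates the Kronecker approximation.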

KFAC (Heskes, 2000; Martens & Grosse, 2015) is probably the most common second-order optimizer in DL. The KFAC algorithm is based on a Kronecker-factored approximation of the Fisher, which is also sometimes referred to as KFAC. Here, we refer to the algorithm as KFAC or KFAC method and to the approximation as Kronecker approximation; we will consider the empirical Fisher’s Kronecker approximation. It approximates the per-layer FIM with a Kronecker-factored block $\tilde{F}_{l}$ for each layer $l$ of the net. This approximation was first derived for linear layers, later for convolutional (Grosse & Martens, 2016) and recurrent layers (Martens et al., 2018), and has recently been generalized to all linear layers that use weight sharing (Eschenhagen et al., 2023), e.g. graph neural networks and transformers. A block is given by $\tilde{F}_{l}(\boldsymbol{\mu})\coloneq\mathbf{U}_{l}\otimes\mathbf{G}_{l}$, with $\mathbf{U}_{l}\coloneq\mathbf{u}_{l}\mathbf{u}_{l}^{\top}\in\mathbb{R}^{d_{i}\times d_{i}}$ and $\mathbf{G}_{l}\coloneq\mathbf{g}_{l}\mathbf{g}_{l}^{\top}\in\mathbb{R}^{d_{o}\times d_{o}}$, where $\mathbf{u}_{l}\in\mathbb{R}^{d_{i}}$ is the $l$-th layer’s input and $\mathbf{g}_{l}\in\mathbb{R}^{d_{o}}$ is the gradient of the loss w.r.t. the layer’s output. We suppress the dependence on the parameters $\boldsymbol{\mu}$ and the input $\mathbf{x}_{i}$ and, for simplicity, assume no weight sharing. KFAC also uses exponential moving averages ($\beta_{1}$) over $\mathbf{U}$ and $\mathbf{G}$ (yielding $\mathbf{S}_{K},\mathbf{S}_{C}$) and damping $\lambda$, see Figure 3.
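The Kronecker structure of a block can be checked numerically for a single example and a single linear layer without weight sharing. The sketch below assumes a weight matrix of shape $d_{o}\times d_{i}$ and a column-stacking $\mathrm{vec}$ convention; the dimensions and random vectors are illustrative assumptions. Under these assumptions, the outer product of the flattened weight gradient equals $\mathbf{U}_{l}\otimes\mathbf{G}_{l}$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_i, d_o = 3, 2
u = rng.normal(size=d_i)  # layer input u_l
g = rng.normal(size=d_o)  # gradient of the loss w.r.t. the layer output, g_l

# For a linear layer z = W u, the gradient w.r.t. W is the outer product g u^T.
grad_W = np.outer(g, u)               # shape (d_o, d_i)
vec_grad = grad_W.flatten(order="F")  # column-stacking vec, length d_o * d_i

U = np.outer(u, u)                    # d_i x d_i Kronecker factor
G = np.outer(g, g)                    # d_o x d_o Kronecker factor

kron_block = np.kron(U, G)                   # U ⊗ G
outer_block = np.outer(vec_grad, vec_grad)   # vec(grad) vec(grad)^T
```

Storing `U` and `G` needs $d_{i}^{2}+d_{o}^{2}$ entries, whereas the full block `kron_block` needs $(d_{i}d_{o})^{2}$.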

While the Kronecker approximation enables more efficient gradient preconditioning, KFAC needs to store the dense Kronecker factors $\mathbf{S}_{K}$ and $\mathbf{S}_{C}$ and invert them at every preconditioner update. This causes two issues. First, the run time overhead of the inversions is usually amortized by updating the preconditioner less frequently, but this can cause instabilities, especially in low-precision settings. Second, the Kronecker factors introduce significant memory overhead, which poses issues in large models. Since low-precision training is becoming the norm in fields like natural language processing, these issues will become more apparent in modern DL. There are multiple numerical concerns when using KFAC or variants thereof in low precision. In PyTorch (Paszke et al., 2019) and JAX (Bradbury et al., 2018) implementations, all tensors must be cast into FP-32 as (B)FP-16 matrix inverses/decompositions are not supported. Moreover, $\mathbf{g}_{l}$ has to be rescaled to avoid over- or under-flows when calculating $\mathbf{G}_{l}$. Memory consumption has previously been addressed through diagonal or block-diagonal versions of $\mathbf{U}_{l},\mathbf{G}_{l}$ (Zhang et al., 2018; Grosse et al., 2023). However, it is unclear if these simple structures maintain downstream performance.
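The efficiency gain from the Kronecker approximation can be illustrated with a small sketch: preconditioning with the inverse of a damped Kronecker product only requires inverting the two small factors. The factor construction from a random batch, the damping value, and the column-stacking $\mathrm{vec}$ convention below are assumptions made for illustration. The two explicit `np.linalg.inv` calls are the operations that become problematic in low precision.

```python
import numpy as np

rng = np.random.default_rng(0)
d_i, d_o, lam = 4, 3, 1e-2

# Hypothetical averaged Kronecker factors built from a small random batch.
inputs = rng.normal(size=(16, d_i))
out_grads = rng.normal(size=(16, d_o))
S_K = inputs.T @ inputs / 16        # d_i x d_i
S_C = out_grads.T @ out_grads / 16  # d_o x d_o
grad_W = rng.normal(size=(d_o, d_i))  # stand-in for the reshaped gradient

# Factor-wise preconditioning: two small inverses.
S_K_inv = np.linalg.inv(S_K + lam * np.eye(d_i))
S_C_inv = np.linalg.inv(S_C + lam * np.eye(d_o))
precond_factored = S_C_inv @ grad_W @ S_K_inv

# Reference: one large solve with the (d_i d_o) x (d_i d_o) Kronecker product.
full = np.kron(S_K + lam * np.eye(d_i), S_C + lam * np.eye(d_o))
precond_full = np.linalg.solve(full, grad_W.flatten(order="F"))
precond_full = precond_full.reshape((d_o, d_i), order="F")
```

Both routes give the same preconditioned gradient, but the factored one never forms the large matrix.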

2.2 INGD: Approximate NGD for Bayesian estimation

Derived from Bayesian principles, INGD (Lin et al., 2023) directly approximates the Hessian inverse. We first introduce two ingredients INGD builds on: the Bayesian learning rule (BLR, Khan & Lin, 2017; Zhang et al., 2018; Khan et al., 2018; Osawa et al., 2019; Lin et al., 2020; Khan & Rue, 2021; Tan, 2022) and an inverse-free second-order method from Lin et al. (2021). By the BLR, Newton’s method to solve the MLE (3) can be seen as another natural-gradient update to solve a variational inference (VI) problem with a delta approximation (Khan & Rue, 2021). This interpretation allows us to view a precision matrix in the variational problem as Hessian estimation in the MLE problem. Thus, Lin et al. (2021) suggest reparameterizing the Hessian as the precision of the Gaussian posterior in a matrix logarithm space and exploiting the parameterization invariance of natural gradients to obtain an inverse-free update.

BLR

Consider a Bayesian problem formulation, where NN weights are random variables. We denote these weights by new parameters $\mathbf{w}$ since random variables are no longer learnable and use a variational Gaussian distribution to approximate the posterior over the random variables. Its mean and precision will be treated as the learnable weights $\boldsymbol{\mu}$ and the Hessian estimation $\mathbf{S}$ in Newton’s step (2).

The VI problem considered in the learning rule is defined as $\min_{\tau}-\mathcal{L}(\boldsymbol{\tau})$ with the evidence lower bound (ELBO)

\begin{split}\mathcal{L}(\boldsymbol{\tau})&\coloneq\mathbb{E}_{w\sim q(w\mid\tau)}\left[\log p(\mathbf{w})+\log p(\mathbf{y}\mid\mathbf{w},\mathbf{X})\right]\\ &\phantom{\coloneq}+H_{q}(\boldsymbol{\tau})\,.\end{split}

(5)

Here, $\boldsymbol{\tau}=\{\boldsymbol{\mu},\mathbf{S}\}$ are the learnable parameters of the variational Gaussian distribution $q(\mathbf{w}\mid\boldsymbol{\tau})=\mathcal{N}(\mathbf{w}\mid\boldsymbol{\mu},\mathbf{S})$ with mean $\boldsymbol{\mu}$ and precision $\mathbf{S}$. The likelihood $p(\mathbf{y}\mid\mathbf{w},\mathbf{X})=\exp(-\ell(\mathbf{w};\mathbf{y},\mathbf{X}))$ takes the same form as in the MLE setting while the prior $p(\mathbf{w})\propto\exp(-R(\mathbf{w}))$ is defined by a regularizer $R(\mathbf{w})\geq 0$. To recover the MLE problem, we consider an uninformative prior $p(\mathbf{w})$ (i.e., $R(\mathbf{w})=0$). $H_{q}(\boldsymbol{\tau})\coloneq\mathbb{E}_{w\sim q}\left[-\log q\right]$ is the entropy of $q(\mathbf{w}\mid\boldsymbol{\tau})$.

Similar to the MLE case, the Bayesian formulation allows us to exploit additional statistical structures in the form of another FIM, namely that of the variational Gaussian, defined as

\begin{split}F(\boldsymbol{\tau})&\coloneq\mathbb{E}_{w\sim q(w\mid\tau)}\left[\nabla_{\tau}\log q(\mathbf{w}\mid\boldsymbol{\tau})\nabla_{\tau}^{\top}\log q(\mathbf{w}\mid\boldsymbol{\tau})\right]\\ &=-\mathbb{E}_{w\sim q}\left[\nabla_{\tau}^{2}\log q(\mathbf{w}\mid\boldsymbol{\tau})\right]\,,\end{split}

and has a closed-form expression. This FIM should not be confused with the FIM used for MLE (4).

Under the BLR, we perform NGD updates not only on $\boldsymbol{\mu}$ but also on $\mathbf{S}$. Khan & Rue (2021) formulate a step with the exact FIM $F(\boldsymbol{\tau})$ and stepsize $\beta>0$ to update $\boldsymbol{\tau}=\{\boldsymbol{\mu},\mathbf{S}\}$,

\boldsymbol{\tau}\leftarrow\boldsymbol{\tau}-\beta\Big(F(\boldsymbol{\tau})\Big)^{-1}\nabla_{\tau}\left(-\mathcal{L}(\boldsymbol{\tau})\right)\,.

This is the NGD update for the BLR, as opposed to the NGD update for MLE. Following Khan & Nielsen (2018), the update simplifies to

\begin{split}\mathbf{S}&\leftarrow(1-\beta)\mathbf{S}+\beta{\color{red}\mathbb{E}_{w\sim q(w\mid\mu,S)}\left[\nabla_{w}^{2}\ell(\mathbf{w};\mathbf{y},\mathbf{X})\right]}\,,\\ \boldsymbol{\mu}&\leftarrow\boldsymbol{\mu}-\beta\mathbf{S}^{-1}{\color{red}\mathbb{E}_{w\sim q(w\mid\mu,S)}\left[\nabla_{w}\ell(\mathbf{w};\mathbf{y},\mathbf{X})\right]}\,.\end{split}

Further simplifying expectations with a delta approximation (highlighted in red) at mean $\boldsymbol{\mu}$ , we obtain

\begin{split}\mathbf{S}&\leftarrow(1-\beta)\mathbf{S}+\beta{\color{red}\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})}\,,\\ \boldsymbol{\mu}&\leftarrow\boldsymbol{\mu}-\beta\mathbf{S}^{-1}{\color{red}\nabla_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})}\,,\end{split}

which recovers Newton’s method in (2) for $\beta=1$.
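The delta-approximated update can be sketched in a few lines. The example assumes a toy quadratic loss with a known Hessian, and it assumes that the freshly updated $\mathbf{S}$ is used in the step on $\boldsymbol{\mu}$ (which is what makes $\beta=1$ coincide with Newton’s method); the function name `blr_delta_step` and all numerical values are hypothetical.

```python
import numpy as np

def blr_delta_step(mu, S, grad_fn, hess_fn, beta):
    """Moving-average Hessian estimate, then a step preconditioned by S^{-1}."""
    S_new = (1.0 - beta) * S + beta * hess_fn(mu)
    mu_new = mu - beta * np.linalg.solve(S_new, grad_fn(mu))
    return mu_new, S_new

# Toy quadratic loss 0.5 * mu^T A mu - b^T mu with minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_fn = lambda mu: A @ mu - b
hess_fn = lambda mu: A

mu0, S0 = np.array([5.0, -4.0]), np.eye(2)

# beta = 1: S becomes the Hessian and the step is a Newton step.
mu_newton, S_newton = blr_delta_step(mu0, S0, grad_fn, hess_fn, beta=1.0)

# beta < 1: S is an exponential moving average of Hessians.
mu, S = mu0, S0
for _ in range(200):
    mu, S = blr_delta_step(mu, S, grad_fn, hess_fn, beta=0.1)
```

The sketch still calls a linear solver in every step; removing that solve is what the inverse-free reparameterization of Lin et al. (2021) is designed to achieve.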

KFAC (Martens & Grosse, 2015)

1: Each $T$ iters, update $\mathbf{S}_{K}$, $\mathbf{S}_{C}$:
   Obtain $\mathbf{U}\otimes\mathbf{G}$ to approximate $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$
   $\mathbf{S}_{K}\leftarrow(1-\beta_{1})\mathbf{S}_{K}+\beta_{1}\mathbf{U}$
   $\mathbf{S}_{C}\leftarrow(1-\beta_{1})\mathbf{S}_{C}+\beta_{1}\mathbf{G}$
   $\mathbf{S}_{K}^{-1}\leftarrow\left(\mathbf{S}_{K}+\lambda\mathbf{I}_{d_{i}}\right)^{-1}$
   $\mathbf{S}_{C}^{-1}\leftarrow\left(\mathbf{S}_{C}+\lambda\mathbf{I}_{d_{o}}\right)^{-1}$
2: $\mathbf{m}_{\mu}\leftarrow\alpha_{2}\mathbf{m}_{\mu}+\mathbf{S}_{C}^{-1}\mathrm{vec}^{-1}(\mathbf{g})\mathbf{S}_{K}^{-1}+\gamma\mathrm{vec}^{-1}(\boldsymbol{\mu})$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathrm{vec}(\mathbf{m}_{\mu})$

IKFAC (ours)

1: Each $T$ iters, update $\mathbf{m}_{K}$, $\mathbf{m}_{C}$, $\mathbf{K}$, $\mathbf{C}$:
   Obtain $\mathbf{U}\otimes\mathbf{G}$ to approximate $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$
   $\mathbf{m}_{K}\leftarrow{\color{red}0}\,\mathbf{m}_{K}+\frac{1}{2d_{o}}({\color{red}d_{o}}\mathbf{H}_{K}+{\color{red}\lambda d_{o}}\mathbf{K}^{\top}\mathbf{K}-d_{o}\mathbf{I}_{d_{i}})$
   $\mathbf{m}_{C}\leftarrow{\color{red}0}\,\mathbf{m}_{C}+\frac{1}{2d_{i}}({\color{red}d_{i}}\mathbf{H}_{C}+{\color{red}\lambda d_{i}}\mathbf{C}^{\top}\mathbf{C}-d_{i}\mathbf{I}_{d_{o}})$
   $\mathbf{K}\leftarrow\mathbf{K}(\mathbf{I}_{d_{i}}-\beta_{1}\mathbf{m}_{K})$
   $\mathbf{C}\leftarrow\mathbf{C}(\mathbf{I}_{d_{o}}-\beta_{1}\mathbf{m}_{C})$
2: $\mathbf{m}_{\mu}\leftarrow\alpha_{2}\mathbf{m}_{\mu}+\mathbf{C}\mathbf{C}^{\top}\mathrm{vec}^{-1}(\mathbf{g})\mathbf{K}\mathbf{K}^{\top}+\gamma\mathrm{vec}^{-1}(\boldsymbol{\mu})$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathrm{vec}(\mathbf{m}_{\mu})$

Figure 3: Comparison between KFAC and IKFAC update for one weight matrix $\mathrm{vec}^{-1}(\boldsymbol{\mu})\in\mathbb{R}^{d_{o}\times d_{i}}$. The flattened gradient is $\mathbf{g}\coloneq\nabla_{\mu}\ell(\boldsymbol{\mu})\in\mathbb{R}^{d_{o}d_{i}}$ and $\mathrm{vec}^{-1}(\mathbf{g})\in\mathbb{R}^{d_{o}\times d_{i}}$ is its matrix reshape. IKFAC uses $\mathbf{H}_{K}\coloneq\mathbf{K}^{\top}\mathbf{U}\mathbf{K}$ and $\mathbf{H}_{C}\coloneq\mathbf{C}^{\top}\mathbf{G}\mathbf{C}$ to incorporate the Kronecker curvature $\mathbf{U}$ and $\mathbf{G}$. Both methods use momentum buffers $\mathbf{m}_{\mu}$ for the weight-decayed update direction with momentum $\alpha_{2}$ and weight decay $\gamma$, and a learning rate $\beta_{2}$ for the parameter update. (Left) KFAC uses an exponential moving average with decay $1-\beta_{1}$ to accumulate the Kronecker factors and applies a damping term $\lambda\mathbf{I}$ before inversion to handle potential singularities in $\mathbf{S}_{K}$, $\mathbf{S}_{C}$. (Right) In contrast to KFAC, IKFAC directly approximates $(\mathbf{S}_{K}+\lambda\mathbf{I})^{-1}$ and $(\mathbf{S}_{C}+\lambda\mathbf{I})^{-1}$ by $\mathbf{K}\mathbf{K}^{\top}$ and $\mathbf{C}\mathbf{C}^{\top}$. The pre-conditioner update is a modification of INGD (Lin et al., 2023) and the changes (zero Riemannian momentum, and non-adaptive damping and curvature) are highlighted in red.
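The inverse-free preconditioner update of IKFAC can be tried out on a single factor. The sketch below is a toy check under stated assumptions: the curvature factor $\mathbf{U}$ is held fixed (so that KFAC’s moving average $\mathbf{S}_{K}$ would converge to $\mathbf{U}$), $\mathbf{K}$ is initialized to the identity, and the dimension, damping, and stepsize are illustrative. The function name `ikfac_factor_step` is hypothetical. The update uses only matrix multiplications; `np.linalg.inv` appears solely to produce the reference value for the comparison.

```python
import numpy as np

def ikfac_factor_step(K, U, lam, beta1):
    """One IKFAC update of K, so that K K^T tracks (S_K + lam I)^{-1}."""
    d = K.shape[0]
    H_K = K.T @ U @ K
    # m_K with zero Riemannian momentum; the 1/(2 d_o) and d_o factors cancel.
    m_K = 0.5 * (H_K + lam * K.T @ K - np.eye(d))
    return K @ (np.eye(d) - beta1 * m_K)

rng = np.random.default_rng(0)
d_i, lam, beta1 = 4, 0.1, 0.1
inputs = rng.normal(size=(64, d_i))
U = inputs.T @ inputs / 64  # fixed curvature factor for this toy check

K = np.eye(d_i)
for _ in range(500):
    K = ikfac_factor_step(K, U, lam, beta1)

approx_inverse = K @ K.T
target = np.linalg.inv(U + lam * np.eye(d_i))  # reference only
```

At the fixed point of the update, $\mathbf{K}^{\top}(\mathbf{U}+\lambda\mathbf{I})\mathbf{K}=\mathbf{I}$, which is equivalent to $\mathbf{K}\mathbf{K}^{\top}=(\mathbf{U}+\lambda\mathbf{I})^{-1}$ for invertible $\mathbf{K}$.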

INGD (Lin et al., 2023)

1: Each $T$ iterations, update $\mathbf{m}_{K}$, $\mathbf{m}_{C}$, $\mathbf{K}$, $\mathbf{C}$:
   Obtain $\mathbf{U}\otimes\mathbf{G}$ to approximate $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$
   $\mathbf{m}_{K}\leftarrow{\color{red}\alpha_{1}}\mathbf{m}_{K}+\frac{1}{2d_{o}}({\color{red}\mathrm{Tr}(\mathbf{H}_{C})}\mathbf{H}_{K}+{\color{red}c^{2}}\mathbf{K}^{\top}\mathbf{K}-d_{o}\mathbf{I}_{d_{i}})$
   $\mathbf{m}_{C}\leftarrow{\color{red}\alpha_{1}}\mathbf{m}_{C}+\frac{1}{2d_{i}}({\color{red}\mathrm{Tr}(\mathbf{H}_{K})}\mathbf{H}_{C}+{\color{red}\kappa^{2}}\mathbf{C}^{\top}\mathbf{C}-d_{i}\mathbf{I}_{d_{o}})$
   $\mathbf{K}\leftarrow\mathbf{K}(\mathbf{I}_{d_{i}}-\beta_{1}\mathbf{m}_{K})$
   $\mathbf{C}\leftarrow\mathbf{C}(\mathbf{I}_{d_{o}}-\beta_{1}\mathbf{m}_{C})$
2: $\mathbf{m}_{\mu}\leftarrow\alpha_{2}\mathbf{m}_{\mu}+\mathbf{C}\mathbf{C}^{\top}\mathrm{vec}^{-1}(\mathbf{g})\mathbf{K}\mathbf{K}^{\top}+\gamma\mathrm{vec}^{-1}(\boldsymbol{\mu})$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathrm{vec}(\mathbf{m}_{\mu})$

SINGD (ours)

1: Each $T$ iterations, update ${\color{blue}\hat{\mathcal{L}}}_{m_{K}}$, ${\color{blue}\hat{\mathcal{L}}}_{m_{C}}$, ${\color{blue}\hat{\mathcal{L}}}_{K}$, ${\color{blue}\hat{\mathcal{L}}}_{C}$:
   Obtain $\mathbf{U}\otimes\mathbf{G}$ to approximate $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$
   ${\color{blue}\hat{\mathcal{L}}}_{m_{K}}\leftarrow{\color{red}\alpha_{1}}{\color{blue}\hat{\mathcal{L}}}_{m_{K}}+\frac{1}{2d_{o}}{\color{blue}\hat{\Pi}}_{K}{\color{blue}(}{\color{red}\mathrm{Tr}(\mathbf{H}_{{\color{blue}\hat{\mathcal{L}}}_{C}})}\mathbf{H}_{{\color{blue}\hat{\mathcal{L}}}_{K}}+{\color{red}c^{2}}({\color{blue}\hat{\mathcal{L}}}_{K})^{\top}{\color{blue}\hat{\mathcal{L}}}_{K}-d_{o}\mathbf{I}_{d_{i}}{\color{blue})}$
   ${\color{blue}\hat{\mathcal{L}}}_{m_{C}}\leftarrow{\color{red}\alpha_{1}}{\color{blue}\hat{\mathcal{L}}}_{m_{C}}+\frac{1}{2d_{i}}{\color{blue}\hat{\Pi}}_{C}{\color{blue}(}{\color{red}\mathrm{Tr}(\mathbf{H}_{{\color{blue}\hat{\mathcal{L}}}_{K}})}\mathbf{H}_{{\color{blue}\hat{\mathcal{L}}}_{C}}+{\color{red}\kappa^{2}}({\color{blue}\hat{\mathcal{L}}}_{C})^{\top}{\color{blue}\hat{\mathcal{L}}}_{C}-d_{i}\mathbf{I}_{d_{o}}{\color{blue})}$
   ${\color{blue}\hat{\mathcal{L}}}_{K}\leftarrow{\color{blue}\hat{\mathcal{L}}}_{K}(\mathbf{I}_{d_{i}}-\beta_{1}{\color{blue}\hat{\mathcal{L}}}_{m_{K}})$
   ${\color{blue}\hat{\mathcal{L}}}_{C}\leftarrow{\color{blue}\hat{\mathcal{L}}}_{C}(\mathbf{I}_{d_{o}}-\beta_{1}{\color{blue}\hat{\mathcal{L}}}_{m_{C}})$
2: $\mathbf{m}_{\mu}\leftarrow\alpha_{2}\mathbf{m}_{\mu}+{\color{blue}\hat{\mathcal{L}}}_{C}({\color{blue}\hat{\mathcal{L}}}_{C})^{\top}\mathrm{vec}^{-1}(\mathbf{g}){\color{blue}\hat{\mathcal{L}}}_{K}({\color{blue}\hat{\mathcal{L}}}_{K})^{\top}+\gamma\mathrm{vec}^{-1}(\boldsymbol{\mu})$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathrm{vec}(\mathbf{m}_{\mu})$

Figure 4: Comparison of a single weight matrix’s update between INGD and our extension, SINGD, via structured Kronecker factors. (Left) INGD features Riemannian momentum ($\alpha_{1}$), adaptive curvature ($\mathrm{Tr}(\mathbf{H}_{C})$, $\mathrm{Tr}(\mathbf{H}_{K})$), adaptive damping ($c^{2}\coloneq\lambda\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{C})$, $\kappa^{2}\coloneq\lambda\mathrm{Tr}(\mathbf{K}^{\top}\mathbf{K})$), and correlated updates of $\mathbf{K}$ and $\mathbf{C}$ ($\mathbf{m}_{K}$, $\mathbf{m}_{C}$). The pre-conditioner matrices are updated with a learning rate $\beta_{1}$, and the optimizer keeps a momentum buffer on the weight-decayed update with momentum $\alpha_{2}$ and weight decay $\gamma$. The learning rate for the parameters is $\beta_{2}$. (Right) SINGD’s update is similar, but each Kronecker factor and its momentum ($\bullet$) is replaced by its structured version ($\hat{\mathcal{L}}_{\bullet}$, e.g. (block-)diagonal); likewise in the computation of $c^{2}$, $\kappa^{2}$, $\mathbf{H}_{K}$, and $\mathbf{H}_{C}$. When updating the momenta, their structure is preserved through a subspace projection map $\hat{\Pi}_{\bullet}(\cdot)$ that restores $\hat{\mathcal{L}}_{\bullet}$’s structure from a dense symmetric matrix $\cdot$ (e.g. taking the (block) diagonal). Importantly, we can efficiently compute the extraction map without expanding its argument in dense form, which reduces memory and run time. The extension of IKFAC to SIKFAC is analogous. A notable property of INGD and SINGD is that they are scale invariant to the choice of the Kronecker approximation (see Appendix E), as the approximation is not unique.
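The update lines of Figure 4 can be read as the following NumPy sketch for a single weight matrix. It is an illustration rather than the authors' implementation: the function names, the default hyper-parameter values, and the way the curvature matrices `U` and `G` are formed from toy data are assumptions, and the factors are stored densely for readability (an efficient version would store only the non-zero entries).

```python
import numpy as np

def singd_step(W, grad, U, G, state, proj_K, proj_C,
               alpha1=0.5, alpha2=0.9, beta1=0.01, beta2=0.1,
               lam=1e-3, gamma=0.0):
    """One update of a (d_o x d_i) weight matrix, following Figure 4 line by line.

    Hyper-parameter defaults are illustrative assumptions.
    """
    K, C, m_K, m_C, m_mu = state
    d_o, d_i = W.shape
    H_K = K.T @ U @ K                       # curvature seen through K
    H_C = C.T @ G @ C                       # curvature seen through C
    c2 = lam * np.trace(C.T @ C)            # adaptive damping for K's update
    kappa2 = lam * np.trace(K.T @ K)        # adaptive damping for C's update
    # Momentum updates; proj_* restores the chosen structure.
    m_K = alpha1 * m_K + proj_K(
        np.trace(H_C) * H_K + c2 * (K.T @ K) - d_o * np.eye(d_i)) / (2 * d_o)
    m_C = alpha1 * m_C + proj_C(
        np.trace(H_K) * H_C + kappa2 * (C.T @ C) - d_i * np.eye(d_o)) / (2 * d_i)
    # Inverse-free factor updates (matrix multiplications only).
    K = K @ (np.eye(d_i) - beta1 * m_K)
    C = C @ (np.eye(d_o) - beta1 * m_C)
    # Preconditioned gradient with momentum and weight decay.
    m_mu = alpha2 * m_mu + C @ C.T @ grad @ K @ K.T + gamma * W
    W = W - beta2 * m_mu
    return W, (K, C, m_K, m_C, m_mu)

def init_state(d_o, d_i):
    """Identity factors and zero momenta (an assumed initialization)."""
    return (np.eye(d_i), np.eye(d_o), np.zeros((d_i, d_i)),
            np.zeros((d_o, d_o)), np.zeros((d_o, d_i)))

def dense(M):
    """No structure: SINGD reduces to INGD."""
    return M

def diagonal(M):
    """Diagonal structure: keep only the diagonal entries."""
    return np.diag(np.diag(M))

# Toy data (assumed): U and G are averaged outer products.
rng = np.random.default_rng(0)
d_o, d_i, n = 3, 4, 16
acts = rng.standard_normal((n, d_i))
errs = rng.standard_normal((n, d_o))
U, G = acts.T @ acts / n, errs.T @ errs / n
grad = errs.T @ acts / n
W0 = rng.standard_normal((d_o, d_i))
W_diag, state_diag = singd_step(W0, grad, U, G, init_state(d_o, d_i), diagonal, diagonal)
W_dense, state_dense = singd_step(W0, grad, U, G, init_state(d_o, d_i), dense, dense)
```

With the `diagonal` projection the factors stay diagonal after the step, whereas the `dense` projection generally fills them in.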

Removing inversion

Lin et al. (2021) reparameterize the precision matrix $\mathbf{S}$ in a matrix logarithm space and perform natural gradient updates in this space, which transforms inversion into subtraction. One can go back directly to the original space, without explicitly inverting a matrix, via a truncated matrix exponential. The method is inverse-free and, since NGs are parameterization invariant, Newton-like.

The first step is to express the precision matrix $\mathbf{S}$ using a non-singular square matrix $\mathbf{A}$ as $\mathbf{S}=\mathbf{A}^{-\top}\mathbf{A}^{-1}$ and perform a natural gradient step using the exact FIM in a tangent space (denoted by $\mathbf{M}$) of $\mathbf{A}_{t}$ at iteration $t$. We then construct a new map as $\mathbf{A}\coloneq\boldsymbol{\phi}(\mathbf{A}_{t},\mathbf{M})\coloneq\mathbf{A}_{t}\mathrm{Expm}(\nicefrac{1}{2}\mathbf{M})$ using both the current point $\mathbf{A}_{t}$ and $\mathbf{M}$ as input, where $\mathrm{Expm}(\mathbf{N})=\mathbf{I}+\sum_{j=1}^{\infty}\nicefrac{\mathbf{N}^{j}}{j!}$ is the matrix exponential. Observe that $\mathbf{M}$ stays in a matrix logarithm space. At each iteration $t$, we use a new matrix logarithm space associated to $\mathbf{A}_{t}$ and generate a new origin $\mathbf{M}_{0}=\mathbf{0}$ in this space to represent $\mathbf{A}_{t}$, since $\mathbf{A}_{t}\equiv\boldsymbol{\phi}(\mathbf{A}_{t},\mathbf{0})=\mathbf{A}_{t}\mathrm{Expm}(\nicefrac{1}{2}\mathbf{M}_{0})$. The map $\boldsymbol{\phi}$ is a local reparameterization map that takes not only $\mathbf{M}$ but also $\mathbf{A}_{t}$ as input. Thanks to this map, the Fisher block is locally orthonormalized (Lin et al., 2023) at the origin $\mathbf{M}_{0}$. Since we use the origin to represent $\mathbf{A}_{t}$ in the local coordinate $\mathbf{M}$, a natural gradient step becomes a (Euclidean) gradient step in the space of $\mathbf{M}$, which makes it easy to add Riemannian momentum (Lin et al., 2023) into the structured positive-definite matrix $\mathbf{S}$. This allows us to perform updates in the logarithmic space of $\mathbf{M}$ and avoid matrix inversions:

\displaystyle\begin{split}\mathbf{M}&\leftarrow\mathbf{M}_{0}-\beta\mathbf{N}\,,\\ \boldsymbol{\mu}&\leftarrow\boldsymbol{\mu}-\beta\mathbf{A}_{t+1}\mathbf{A}_{t+1}^{\top}\nabla_{\mu}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\,,\end{split}

(6)

where $\mathbf{A}_{t+1}\coloneq\boldsymbol{\phi}(\mathbf{A}_{t},\mathbf{M})=\mathbf{A}_{t}\mathrm{Expm}\left(\nicefrac{1}{2}\mathbf{M}\right)$ and $\mathbf{N}\coloneq\mathbf{A}_{t}^{\top}\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})\mathbf{A}_{t}-\mathbf{I}$. Equation 6 is a Newton-like update without matrix inversion. To see this, we can re-express the update of $\mathbf{A}$ in terms of $\mathbf{S}$ and use properties of the matrix exponential function (with $\mathbf{M}=-\beta\mathbf{N}$ since $\mathbf{M}_{0}=\mathbf{0}$, and $\mathbf{N}$ symmetric),

\displaystyle\begin{aligned}\mathbf{S}_{t+1}&=\mathbf{A}_{t+1}^{-\top}\mathbf{A}_{t+1}^{-1}=\mathbf{A}_{t}^{-\top}\mathrm{Expm}\left(\beta\mathbf{N}\right)\mathbf{A}_{t}^{-1}\\ &=(1-\beta)\mathbf{S}_{t}+\beta\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})+O(\beta^{2}).\end{aligned}
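The first-order claim can be checked numerically. The sketch below is illustrative only: the helper names and the toy Hessian are assumptions, the matrix exponential is evaluated by a plain power series, and `np.linalg.inv` is used solely to form the reference quantities for the comparison, not inside the update itself. If the remainder is $O(\beta^{2})$, halving $\beta$ should shrink the gap by roughly a factor of four.

```python
import numpy as np

def expm_series(N, terms=20):
    """Matrix exponential via its power series I + sum_j N^j / j!."""
    out = np.eye(N.shape[0])
    term = np.eye(N.shape[0])
    for j in range(1, terms):
        term = term @ N / j
        out = out + term
    return out

def newton_gap(beta, A_t, hess):
    """Distance between S_{t+1} and the moving-average target (1-beta) S_t + beta H."""
    d = A_t.shape[0]
    N = A_t.T @ hess @ A_t - np.eye(d)
    A_next = A_t @ expm_series(-0.5 * beta * N)     # M = -beta * N, phi(A_t, M)
    # S = A^{-T} A^{-1} = (A A^T)^{-1}; inverses appear only in this check.
    S_t = np.linalg.inv(A_t @ A_t.T)
    S_next = np.linalg.inv(A_next @ A_next.T)
    return np.linalg.norm(S_next - ((1 - beta) * S_t + beta * hess))

rng = np.random.default_rng(0)
d = 4
A_t = np.eye(d) + 0.1 * rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
hess = B @ B.T / d + np.eye(d)      # toy symmetric positive-definite Hessian
gap_coarse = newton_gap(0.02, A_t, hess)
gap_fine = newton_gap(0.01, A_t, hess)
print(gap_coarse / gap_fine)        # close to 4 if the remainder is second order
```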

Next, we can construct a structured precision matrix $\mathbf{S}$ as a structured Hessian estimation using a sparse non-singular matrix $\mathbf{A}$. As we will discuss in Section 3.2, it is essential to update $\mathbf{M}$ to preserve sparsity in $\mathbf{A}$. The space of $\mathbf{M}$ as a tangent/logarithm space of $\mathbf{A}$ allows us to efficiently impose sparse structures on $\mathbf{A}$ without requiring the Hessian $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})$ or a Hessian approximation to be sparse or structured. This is different from another inverse-free method (Tan, 2022) that considers directly performing NGD updates of $\mathbf{A}$ instead of $\mathbf{M}$, where $\mathbf{A}$ must be restricted to a (triangular) Cholesky factor. This does not preserve sparsity in $\mathbf{A}$ unless the Hessian or its approximation admits a special structure, which is usually not the case in DL problems.

INGD

Our work is built on INGD (Figure 4) where $\mathbf{A}=\mathbf{K}\otimes\mathbf{C}$ is factorized into two Kronecker factors. The exact FIM under this parameterization is singular due to a correlation between $\mathbf{K}$ and $\mathbf{C}$: the Kronecker factorization is not unique. Lin et al. (2023) propose a (non-singular) block-diagonal approximated FIM by ignoring the correlation in the original FIM and perform NGD with this block-diagonal FIM on tangent spaces of the factors. Riemannian momentum is further introduced in the update of $\mathbf{K}$ and $\mathbf{C}$. They use the Kronecker approximation discussed in Section 2.1 to approximate the Hessian $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu};\mathbf{y},\mathbf{X})$ and truncate the matrix exponential to obtain a purely matrix-multiplication based update scheme. It is unclear how INGD is related to KFAC, which uses another Kronecker factorization $\mathbf{S}=\mathbf{S}_{K}\otimes\mathbf{S}_{C}$. INGD also remains memory-inefficient due to the use of dense Kronecker factors. The authors only consider and evaluate it on convolution-based models in single precision. It remains unclear whether INGD is useful for training transformer-based models, or for training in half precision.

3 Structured inverse-free NGD

Inspired by INGD, we propose an inverse-free KFAC update as a specific setting of INGD to address KFAC’s numerical instability in low precision. We show that this scheme effectively recovers KFAC. We then address the memory inefficiency of KFAC and INGD for training transformer-based models by extending INGD with structures.

3.1 Inverse-free KFAC Updates for Numerical Stability

Subspace of the log (Lie-algebraic) space	Matrix Lie sub-group structure in $\mathbf{K}$	Subspace projection map $\hat{\Pi}(\mbox{$\mbox{$\mathbf{M}$}$})$
$\begin{bmatrix}a_{1,1}&0&\ldots&0\\ a_{2,1}&a_{2,2}&&0\\ \vdots&\vdots&\ddots&\vdots\\ a_{d_{i},1}&a_{d_{i},2}&\ldots&a_{d_{i},d_{i}}\end{bmatrix}$	Lower-triangular (Tril.)	$\begin{bmatrix}m_{1,1}&0&\ldots&0\\ 2m_{2,1}&m_{2,2}&&0\\ \vdots&\vdots&\ddots&\vdots\\ 2m_{d_{i},1}&2m_{d_{i},2}&\ldots&m_{d_{i},d_{i}}\end{bmatrix}$

$\begin{bmatrix}\mathbf{A}_{11}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{A}_{22}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{A}_{qq}\end{bmatrix}$	(Block) Diagonal (block size $k$ )	$\begin{bmatrix}\mathbf{M}_{11}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{M}_{22}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{M}_{qq}\end{bmatrix}$

$\begin{bmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}&\mathbf{A}_{13}\\ \mathbf{0}&\mathbf{A}_{22}&\mathbf{0}\\ \mathbf{0}&\mathbf{A}_{32}&\mathbf{A}_{33}\end{bmatrix}$ , $\mathbf{A}_{22}$ is diag., $\mathbf{A}_{11}\in\mathbb{R}^{d_{2}\times d_{2}}$ , $\mathbf{A}_{33}\in\mathbb{R}^{d_{3}\times d_{3}}$	Hierarchical ( $k\coloneq d_{2}+d_{3}$ )	$\begin{bmatrix}\mathbf{M}_{11}&2\mathbf{M}_{12}&2\mathbf{M}_{13}\\ \mathbf{0}&\mathrm{Diag}(\mathbf{M}_{22})&\mathbf{0}\\ \mathbf{0}&2\mathbf{M}_{32}&\mathbf{M}_{33}\end{bmatrix}$

$\begin{bmatrix}{\mbox{$\mbox{$\mathbf{A}$}$}}_{11}&\mathbf{A}_{12}\\ \mathbf{0}&\mathbf{D}_{22}\end{bmatrix}$ , $\mathbf{D}_{22}$ is diag., $\mbox{$\mbox{$\mathbf{A}$}$}_{11}\in\mathbb{R}^{k\times k}$	Rank- $k$ upper-triangular	$\begin{bmatrix}{\mbox{$\mbox{$\mathbf{M}$}$}}_{11}&2\mathbf{M}_{12}\\ \mathbf{0}&\mathrm{Diag}(\mathbf{M}_{22})\end{bmatrix}$

$\begin{bmatrix}a_{0}&a_{1}&a_{2}&\cdots&a_{(d_{i}-1)}\\ 0&a_{0}&a_{1}&\ddots&\vdots\\ 0&0&\ddots&\ddots&a_{2}\\ \vdots&\ddots&\ddots&\ddots&a_{1}\\ 0&\cdots&\ddots&0&a_{0}\end{bmatrix}$	Upper-triangular Toeplitz (Triu-Toepl.)	$\begin{bmatrix}b_{0}&2b_{1}&2b_{2}&\cdots&2b_{(d_{i}-1)}\\ 0&b_{0}&2b_{1}&\ddots&\vdots\\ 0&0&\ddots&\ddots&2b_{2}\\ \vdots&\ddots&\ddots&\ddots&2b_{1}\\ 0&\cdots&\cdots&0&b_{0}\end{bmatrix}$ $b_{j}\coloneq\frac{1}{d_{i}-j}\sum_{k=1}^{d_{i}-j}m_{k,k+j}$

We first propose a new inverse-free update to mimic the behavior of the KFAC update; we call this update IKFAC. We then show that IKFAC corresponds to a specific setting of INGD. This bridges the gap between INGD and KFAC and sheds light on the difference between both methods.

Inspired by INGD, we replace matrix inversion with matrix subtraction in a matrix logarithm space, then go back to the original space without explicitly inverting any matrix using a truncated matrix exponential map. The IKFAC update is related to the KFAC update as we will use $\mathbf{K}\mathbf{K}^{\top}$ and $\mathbf{C}\mathbf{C}^{\top}$ to approximate the inverse Kronecker factors $\big(\mathbf{S}_{K}+\lambda\mathbf{I}\big)^{-1}$ and $\big(\mathbf{S}_{C}+\lambda\mathbf{I}\big)^{-1}$ in KFAC, respectively. We propose the following IKFAC update with learning rate $\beta_{1}$ for $\mathbf{K}$ and $\mathbf{C}$ using a truncated matrix exponential

\displaystyle\begin{split}\mathbf{K}^{\text{new}}&\leftarrow\mathbf{K}\left(\mathbf{I}-\nicefrac{\beta_{1}}{2}\,\mathbf{m}_{K}\right)\,,\\ \mathbf{C}^{\text{new}}&\leftarrow\mathbf{C}\left(\mathbf{I}-\nicefrac{\beta_{1}}{2}\,\mathbf{m}_{C}\right)\,,\end{split}

(7)

where $\mathbf{H}_{K}\coloneq\mathbf{K}^{\top}\mathbf{U}\mathbf{K}$, $\mathbf{H}_{C}\coloneq\mathbf{C}^{\top}\mathbf{G}\mathbf{C}$, $\mathbf{m}_{K}\coloneq\mathbf{H}_{K}+\lambda\mathbf{K}^{\top}\mathbf{K}-\mathbf{I}$, and $\mathbf{m}_{C}\coloneq\mathbf{H}_{C}+\lambda\mathbf{C}^{\top}\mathbf{C}-\mathbf{I}$. This update is inverse- and matrix-decomposition-free. Since we truncate the matrix exponential $\mathrm{Expm}(-\nicefrac{\beta_{1}}{2}\,\mathbf{m}_{K})\approx(\mathbf{I}-\nicefrac{\beta_{1}}{2}\,\mathbf{m}_{K})$, $\mathbf{m}_{K}$ indeed stays in a matrix logarithm space (see Appendix C). The logarithm space allows us to impose structural constraints on $\mathbf{K}$, which we discuss in Section 3.2.
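As a minimal illustration of Equation 7, the following NumPy sketch applies the update of $\mathbf{K}$ repeatedly for a fixed curvature matrix. The function name, the toy data, and the hyper-parameter values are assumptions. For fixed $\mathbf{U}$, the momentum $\mathbf{m}_{K}$ vanishes exactly when $\mathbf{K}^{\top}(\mathbf{U}+\lambda\mathbf{I})\mathbf{K}=\mathbf{I}$, i.e. when $\mathbf{K}\mathbf{K}^{\top}=(\mathbf{U}+\lambda\mathbf{I})^{-1}$, so the iteration should settle at a factor of the damped inverse without ever inverting a matrix.

```python
import numpy as np

def ikfac_step(K, U, beta1, lam):
    """One inverse-free update of K (Equation 7); uses only matrix products."""
    d = K.shape[0]
    m_K = K.T @ U @ K + lam * (K.T @ K) - np.eye(d)
    return K @ (np.eye(d) - 0.5 * beta1 * m_K)

rng = np.random.default_rng(0)
d, n, lam = 4, 50, 0.1
acts = rng.standard_normal((n, d))
U = acts.T @ acts / n                 # toy curvature matrix (assumed form)

K = np.eye(d)
for _ in range(500):
    K = ikfac_step(K, U, beta1=0.1, lam=lam)

# At the fixed point, K K^T (U + lam I) should be the identity.
residual = np.linalg.norm(K @ K.T @ (U + lam * np.eye(d)) - np.eye(d))
print(residual)
```

The update of $\mathbf{C}$ is identical with $\mathbf{G}$ in place of $\mathbf{U}$.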

The following theorem (proof in Appendix D) formally shows that $\mathbf{K}\mathbf{K}^{\top}$ used in IKFAC is an approximation of $(\mathbf{S}_{K}+\lambda\mathbf{I})^{-1}$ in KFAC at every step, even with a truncated matrix exponential. Similarly, $\mathbf{C}\mathbf{C}^{\top}$ is an approximation of $(\mathbf{S}_{C}+\lambda\mathbf{I})^{-1}$. Thus, IKFAC effectively recovers KFAC up to first-order accuracy.

Theorem 1.

If $\mathbf{K}$ is updated according to the IKFAC scheme (Figure 4) with the truncation of the matrix exponential, and the IKFAC and KFAC updates use the same initialization and the same sequence of curvature matrices $\mathbf{U}$, then the product $\mathbf{K}\mathbf{K}^{\top}$ is a first-order accurate approximation of the KFAC update of $\big(\mathbf{S}_{K}+\lambda\mathbf{I}\big)^{-1}$ at each iteration, i.e., $\mathbf{K}\mathbf{K}^{\top}=\big(\mathbf{S}_{K}+\lambda\mathbf{I}\big)^{-1}+O(\beta_{1}^{2})$.

Theorem 1 trivially extends to diagonal and block-diagonal structures. That is, KFAC with diagonal or block-diagonal Kronecker factors is equivalent to IKFAC with diagonal or block-diagonal structure up to first order in $\beta_{1}$.
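Theorem 1 can be probed numerically with a short sketch. The KFAC factor update used here, $\mathbf{S}_{K}\leftarrow(1-\beta_{1})\mathbf{S}_{K}+\beta_{1}\mathbf{U}$, is the usual exponential moving average and is an assumption of this example, as are the function names, the matched initialization, and the toy curvature sequence. If the discrepancy is $O(\beta_{1}^{2})$, halving $\beta_{1}$ over the same number of steps should reduce it by roughly a factor of four.

```python
import numpy as np

def ikfac_step(K, U, beta1, lam):
    """Inverse-free IKFAC update of K with the truncated matrix exponential."""
    d = K.shape[0]
    m_K = K.T @ U @ K + lam * (K.T @ K) - np.eye(d)
    return K @ (np.eye(d) - 0.5 * beta1 * m_K)

def kfac_gap(beta1, curvatures, lam=0.1):
    """Run KFAC and IKFAC side by side; return ||K K^T - (S_K + lam I)^{-1}||."""
    d = curvatures[0].shape[0]
    S = np.eye(d)
    K = np.eye(d) / np.sqrt(1.0 + lam)   # matched start: K K^T = (S + lam I)^{-1}
    for U in curvatures:
        S = (1 - beta1) * S + beta1 * U  # assumed KFAC moving-average update
        K = ikfac_step(K, U, beta1, lam)
    # The inverse is formed only to measure the discrepancy.
    return np.linalg.norm(K @ K.T - np.linalg.inv(S + lam * np.eye(d)))

rng = np.random.default_rng(0)
d, n, steps = 4, 32, 5
curvatures = []
for _ in range(steps):
    acts = rng.standard_normal((n, d))
    curvatures.append(acts.T @ acts / n)

gap_coarse = kfac_gap(0.01, curvatures)
gap_fine = kfac_gap(0.005, curvatures)
print(gap_coarse / gap_fine)   # roughly 4 for a second-order discrepancy
```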

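Now, we show that IKFAC is a specific case of INGD, whose update of $\mathbf{K}$ without Riemannian momentum ($\alpha_{1}=0$) is

\displaystyle\mathbf{K}^{\text{new}}\leftarrow\mathbf{K}\left[\mathbf{I}_{d_{i}}-\frac{\beta_{1}}{2d_{o}}\left(\mathrm{Tr}(\mathbf{H}_{C})\mathbf{H}_{K}+\lambda\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{C})\mathbf{K}^{\top}\mathbf{K}-d_{o}\mathbf{I}_{d_{i}}\right)\right]\,.

(8)

Since $\mathrm{Tr}(\mathbf{I}_{d_{o}})=d_{o}$, $\mathbf{H}_{C}\in\mathbb{R}^{d_{o}\times d_{o}}$, $\mathbf{C}\in\mathbb{R}^{d_{o}\times d_{o}}$, and $\mathbf{K}\in\mathbb{R}^{d_{i}\times d_{i}}$, we can obtain IKFAC from INGD by simply replacing $\mathrm{Tr}(\mathbf{H}_{C})$ and $\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{C})$ with $\mathrm{Tr}(\mathbf{I}_{d_{o}})$:

\displaystyle\mathbf{K}^{\text{new}}\leftarrow\mathbf{K}\left[\mathbf{I}_{d_{i}}-\frac{\beta_{1}}{2d_{o}}\left(\mathrm{Tr}(\mathbf{I}_{d_{o}})\mathbf{H}_{K}+\lambda\mathrm{Tr}(\mathbf{I}_{d_{o}})\mathbf{K}^{\top}\mathbf{K}-d_{o}\mathbf{I}_{d_{i}}\right)\right]\,.

(9)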

This sheds light on the difference between both methods. In IKFAC (see Appendix C for details), $\mathbf{H}_{K}$ and $\lambda\mathbf{K}^{\top}\mathbf{K}$ are used for incorporating KFAC’s curvature $\mathbf{U}$ and damping $\lambda\mathbf{I}$, respectively. In contrast, the curvature and damping are adaptively incorporated in INGD using $(\mathrm{Tr}(\mathbf{H}_{C})/d_{o})\mathbf{H}_{K}$ and $(\lambda\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{C})/d_{o})\mathbf{K}^{\top}\mathbf{K}$. The updates of $\mathbf{K}$ and $\mathbf{C}$ are correlated in INGD due to the trace terms, while $\mathbf{K}$ and $\mathbf{C}$ are updated independently in IKFAC, just like $\mathbf{S}_{K}$ and $\mathbf{S}_{C}$ in KFAC. These trace terms are needed to satisfy the orthonormalization condition of the Fisher matrix (Lin et al., 2023). They make INGD and SINGD scale-invariant to the Kronecker approximation (see Appendix E), as the approximation is not unique. In contrast, KFAC and IKFAC are not scale-invariant. The trace terms together with Riemannian momentum ($\alpha_{1}>0$) are missing in KFAC and IKFAC. Our experiments show that they can contribute to stability.
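The role of the trace terms can be made concrete with a small sketch that compares the two update directions for $\mathbf{K}$, written so that $\mathbf{K}^{\text{new}}=\mathbf{K}(\mathbf{I}-\beta_{1}\cdot\text{direction})$ in both cases. The function names and toy inputs are assumptions. When $\mathrm{Tr}(\mathbf{H}_{C})$ and $\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{C})$ both equal $d_{o}$, the INGD direction coincides with the IKFAC one; otherwise $\mathbf{C}$ rescales the curvature and damping seen by $\mathbf{K}$.

```python
import numpy as np

def ingd_direction(K, C, U, G, lam):
    """INGD direction for K (no Riemannian momentum): trace terms couple K and C."""
    d_i, d_o = K.shape[0], C.shape[0]
    H_K = K.T @ U @ K
    H_C = C.T @ G @ C
    return (np.trace(H_C) * H_K + lam * np.trace(C.T @ C) * (K.T @ K)
            - d_o * np.eye(d_i)) / (2 * d_o)

def ikfac_direction(K, U, lam):
    """IKFAC direction for K: independent of C."""
    d_i = K.shape[0]
    return (K.T @ U @ K + lam * (K.T @ K) - np.eye(d_i)) / 2

rng = np.random.default_rng(0)
d_i, d_o, n, lam = 4, 3, 32, 0.1
acts = rng.standard_normal((n, d_i))
U = acts.T @ acts / n
K = np.eye(d_i) + 0.1 * rng.standard_normal((d_i, d_i))

# Case 1: Tr(H_C) = Tr(C^T C) = d_o, so both directions agree.
matched = ingd_direction(K, np.eye(d_o), U, np.eye(d_o), lam)
# Case 2: a rescaled C changes both traces, so INGD adapts its direction.
rescaled = ingd_direction(K, 2.0 * np.eye(d_o), U, np.eye(d_o), lam)
reference = ikfac_direction(K, U, lam)
print(np.allclose(matched, reference), np.allclose(rescaled, reference))
```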

3.2 Sparse Kronecker Factors for Reducing Memory

Now, we extend INGD to reduce its memory and iteration cost. Existing sparse KFAC methods use (block-)diagonal structures for $\mathbf{S}_{K}$ and $\mathbf{S}_{C}$ (Zhang et al., 2019; Grosse et al., 2023). In contrast, we propose using sparse Kronecker factors $\mathbf{K}$ and $\mathbf{C}$ in INGD and exploiting Lie-algebraic properties in the logarithm space and algebraic sparsity of the Kronecker factors. This enables more flexible structures (Figure 5) that potentially achieve better downstream performance than (block-)diagonal structures in $\mathbf{S}_{K}$, $\mathbf{S}_{C}$.

Other related works are Lie group preconditioners (Li, 2018, 2022) originally derived from directly approximating the Hessian inverse. Some Hessian-vector-product-based versions of these methods can be expensive and unavailable in pure low-precision settings due to sampling random weights and solving linear systems that are unstable in low precision. Our approach is sampling-free and available in pure half-precision settings.

We want to construct sparse factors $\mathbf{K}$ and $\mathbf{C}$ without requiring the Kronecker/Hessian approximation ($\mathbf{U}\otimes\mathbf{G}$) to be sparse or structured as well. Imposing sparsity often leads to a complicated FIM, which makes it difficult to perform NGD due to the FIM inversion. It is essential to update $\mathbf{m}_{K}$, which lives in the logarithm space of $\mathbf{K}$, in order to impose sparsity on $\mathbf{K}$: the FIM in this (moving) coordinate $\mathbf{m}_{K}$ simplifies to an identity matrix due to the orthonormalization condition. This condition (Lin et al., 2023) makes it easy for us to impose a range of sparse structures on $\mathbf{K}$ through a unified and inverse-free update rule (Figure 4), since we can avoid inverting the Fisher block regarding the sparse structures. We also exploit the algebraic sparsity in these structures to make our rule more efficient than INGD (Table 3).

We exploit Lie-algebraic properties in the log space of $\mathbf{m}_{K}$ to construct sparse structures of $\mathbf{K}$. As a general design principle, we consider structures of $\mathbf{K}$ preserved under (i) elementwise matrix operations (subtraction and scalar multiplication) and (ii) matrix multiplication, which are needed for our updates. Concretely, we construct a new local reparameterization for $\mathbf{K}$ at iteration $t$ via

\displaystyle\mathbf{K}\coloneq\boldsymbol{\psi}(\mathbf{K}_{t},\mathbf{m}_{K})\coloneq\mathbf{K}_{t}\mathrm{Expm}\left(\frac{1}{\sqrt{2d_{i}}}\,\hat{\Pi}_{K}(\mathbf{m}_{K})\right)\,,

where $\hat{\Pi}_{K}(\mathbf{m}_{K})$ projects the dense $\mathbf{m}_{K}$ to a subspace (identically for $\mathbf{C}$, but potentially using a different structure $\hat{\Pi}_{C}$).

Many popular structures such as tri-diagonal matrices do not satisfy our requirements as they are not closed under matrix multiplication. Moreover, it can be difficult to construct the projection map to satisfy the orthonormalization condition. One subspace structure satisfying the requirements is that of upper/lower triangular matrices. The subspace projection $\hat{\Pi}_{K}$ is a weighted extraction map since projecting the logarithm space onto a subspace is like projecting a dense square matrix onto a triangular matrix. Technically, we use
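The closure requirement can be verified on small random matrices. In this sketch (helper names are hypothetical), the product of two tri-diagonal matrices picks up entries on the second off-diagonals, whereas lower-triangular matrices remain lower-triangular under both subtraction with scalar multiplication and matrix multiplication.

```python
import numpy as np

def is_lower_triangular(M):
    """True if all entries above the main diagonal vanish."""
    return np.allclose(M, np.tril(M))

def is_tridiagonal(M):
    """True if all entries outside the three central diagonals vanish."""
    return np.allclose(M, np.triu(np.tril(M, 1), -1))

rng = np.random.default_rng(0)
d = 5
# Random tri-diagonal and lower-triangular matrices.
T1 = np.triu(np.tril(rng.standard_normal((d, d)), 1), -1)
T2 = np.triu(np.tril(rng.standard_normal((d, d)), 1), -1)
L1 = np.tril(rng.standard_normal((d, d)))
L2 = np.tril(rng.standard_normal((d, d)))

print(is_tridiagonal(T1 @ T2))               # False: the product is penta-diagonal
print(is_lower_triangular(L1 @ L2))          # True: closed under multiplication
print(is_lower_triangular(L1 - 0.5 * L2))    # True: closed under elementwise ops
```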

\displaystyle\mathbf{A}\coloneq\mathbf{K}_{t}\mathrm{Expm}\left(\frac{\hat{\Pi}_{K}(\mathbf{m}_{K})}{\sqrt{2d_{i}}}\right)\otimes\mathbf{C}_{t}

to update $\mathbf{K}$ at iteration $t$, treating $\mathbf{C}_{t}$ and $\mathbf{K}_{t}$ as constants. Given a subspace $\Omega_{K}\subset\mathbb{R}^{d_{i}\times d_{i}}$ in the matrix logarithm space, the subspace projection map $\hat{\Pi}_{K}:\mathrm{Sym}^{d_{i}\times d_{i}}\mapsto\Omega_{K}$ is specified by satisfying the local orthonormalization condition of the Fisher block regarding $\mathbf{m}_{K}$:

\displaystyle F|_{m_{K}=\mathbf{0}}\coloneq-\mathbb{E}_{w\sim q}\left[\nabla_{m_{K}}^{2}\log q(\mathbf{w}\mid\boldsymbol{\mu},\mathbf{S})\right]\big|_{m_{K}=\mathbf{0}}=\mathbf{I}\,,

with the variational Gaussian $q(\mathbf{w}\mid\boldsymbol{\mu},\mathbf{S})$ with mean $\boldsymbol{\mu}$ and precision $\mathbf{S}\coloneq\mathbf{A}^{-\top}\mathbf{A}^{-1}$, and $\mathrm{Sym}^{d_{i}\times d_{i}}$ the set of symmetric square real matrices. Similarly, we can obtain $\hat{\Pi}_{C}$ for $\mathbf{C}$.

We consider several sparsities and block extensions of triangular matrices illustrated in Figure 5. E.g., the subspace projection map for a diagonal structure simply extracts diagonal entries of its input. As a non-trivial example, the subspace projection map for a lower-triangular structure extracts lower-triangular entries of its input and multiplies the entries below the main diagonal by 2. Table 2 summarizes structures and their projection maps mathematically.
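A few of the projection maps from Table 2 can be written down directly. The sketch below uses hypothetical function names and works on dense arrays purely for readability; the block size `k` is a free parameter, and a trailing block may be smaller when `k` does not divide the dimension (an assumption of this sketch). The input is taken to be a dense symmetric matrix.

```python
import numpy as np

def proj_diag(M):
    """Diagonal structure: extract the diagonal entries."""
    return np.diag(np.diag(M))

def proj_block_diag(M, k):
    """Block-diagonal structure with block size k: keep the diagonal blocks."""
    out = np.zeros_like(M)
    for s in range(0, M.shape[0], k):
        out[s:s + k, s:s + k] = M[s:s + k, s:s + k]
    return out

def proj_tril(M):
    """Lower-triangular structure: keep the diagonal, double the entries below it."""
    return 2.0 * np.tril(M, -1) + np.diag(np.diag(M))

def proj_triu_toeplitz(M):
    """Upper-triangular Toeplitz: b_j is the mean of the j-th superdiagonal;
    off-diagonal bands are weighted by 2."""
    d = M.shape[0]
    out = np.zeros_like(M)
    for j in range(d):
        b_j = np.mean(np.diagonal(M, offset=j))
        weight = 1.0 if j == 0 else 2.0
        out += weight * b_j * np.eye(d, k=j)
    return out

M_example = np.array([[1.0, 2.0, 3.0],
                      [2.0, 4.0, 5.0],
                      [3.0, 5.0, 6.0]])
print(proj_tril(M_example))
print(proj_triu_toeplitz(M_example))
```

For the symmetric example above, `proj_tril` keeps the diagonal `(1, 4, 6)` and doubles the strictly lower entries `(2, 3, 5)` to `(4, 6, 10)`.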

Using such a subspace and its projection map, we obtain a structured INGD update (Figure 4), and similarly for IKFAC. Our approach allows the use of more expressive structures than the block-diagonal structure shown in Figure 5, e.g. low-rank, flexible hierarchical, and Toeplitz structures. In contrast, existing methods mainly support low-rank structures. For an efficient implementation, we only compute and store non-zero entries of $\hat{\Pi}_{K}(\mathbf{m}_{K})$ and $\mathbf{K}$ without explicitly forming dense matrices. These structures lower not only memory consumption (Table 4), but also the iteration cost (Table 3).
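For the diagonal structure, the idea of storing only non-zero entries can be sketched as follows. The factors and momentum are kept as vectors, and the projected quantities are computed directly from per-example layer inputs and output gradients, so no $d_{i}\times d_{i}$ matrix is ever formed. The function name, argument names, and the assumption that $\mathbf{U}$ and $\mathbf{G}$ are averaged outer products of these per-example quantities are illustrative, not taken from the authors' code.

```python
import numpy as np

def diag_singd_K_step(k, c, m_k, acts, out_grads, alpha1, beta1, lam):
    """SINGD update of a diagonal K stored as the vector k (C stored as c).

    acts: (n, d_i) layer inputs; out_grads: (n, d_o) output gradients.
    Assumes U = acts^T acts / n and G = out_grads^T out_grads / n.
    """
    d_o = c.size
    diag_U = np.mean(acts ** 2, axis=0)          # diag(U) without forming U
    diag_G = np.mean(out_grads ** 2, axis=0)     # diag(G) without forming G
    diag_H_K = k ** 2 * diag_U                   # diag(K^T U K) for diagonal K
    tr_H_C = np.sum(c ** 2 * diag_G)             # Tr(C^T G C) for diagonal C
    c2 = lam * np.sum(c ** 2)                    # lam * Tr(C^T C)
    m_k = alpha1 * m_k + (tr_H_C * diag_H_K + c2 * k ** 2 - d_o) / (2 * d_o)
    return k * (1.0 - beta1 * m_k), m_k

rng = np.random.default_rng(0)
n, d_i, d_o = 16, 5, 3
acts = rng.standard_normal((n, d_i))
out_grads = rng.standard_normal((n, d_o))
k = 1.0 + 0.1 * rng.standard_normal(d_i)
c = 1.0 + 0.1 * rng.standard_normal(d_o)
k_new, m_k_new = diag_singd_K_step(k, c, np.zeros(d_i), acts, out_grads,
                                   alpha1=0.5, beta1=0.01, lam=1e-3)
```

The memory for $\mathbf{K}$ and its momentum drops from $O(d_{i}^{2})$ to $O(d_{i})$ in this case; the other structures in Table 2 admit analogous, if more involved, reductions.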

4 Experiments

We evaluate SINGD on convolutional, transformer, and graph NNs, using mixed-precision training in BFP-16 with KFAC-reduce (Eschenhagen et al., 2023) and numerical tricks (Dangel, 2023) to further reduce memory consumption and iteration cost for convolutions. The performance metric is test error. To be memory-efficient, we consider SINGD with sparse structures such as ‘diagonal’, ‘block-diagonal’, and ‘hierarchical’. We also consider IKFAC, INGD (recall that SINGD with dense structure becomes INGD), and AdamW as baselines. All methods except KFAC directly support training in BFP-16. For KFAC, we have to transform a matrix into FP-32 and then transform its inverse into BFP-16. We find that KFAC performs unstably in BFP-16. For ‘VGG’ and ‘ConvMixer’, we also consider SGD as a strong baseline. We fix momentum to 0.9 and tune other hyper-parameters of each optimizer using random search. For ‘VGG’ and ‘ConvMixer’, we decrease the learning rate $\beta_{2}$ every 40 epochs. For ‘GNN’, we use a constant learning rate; all other models use a cosine learning rate schedule. We consider KFAC as a strong baseline for the GNN as suggested by Izadi et al. (2020). We train the GNN in FP-32 so that KFAC performs stably. The search space for the random search can be found in Table 5 in Appendix B.

From Figures 6 and 7, we can observe that SINGD, including IKFAC and INGD as special cases, outperforms AdamW in many cases. SINGD works well for mixed-precision training. We do not show KFAC in the plots as it performs unstably due to numerical issues. We also observe that the hierarchical structure often performs as well as the dense structure (INGD) on all the models. In several cases, the hierarchical structure outperforms the block-diagonal and diagonal structures. However, on the models shown in Figure 7, even the diagonal structure can perform as well as the dense one. Thus, we can reduce INGD’s memory consumption and make SINGD as competitive as AdamW. We also train a ViT model on “ImageNet-100” to demonstrate the superior performance of SINGD over AdamW in large-scale settings (see Figure 9 in Appendix B).

5 Conclusion

We propose an inverse-free, memory-efficient natural gradient descent method, SINGD, which addresses the numerical instability and memory inefficiency of second-order methods like KFAC (Martens & Grosse, 2015). The algorithm is an extension of the inverse-free natural gradient (INGD) method from Lin et al. (2023), whose update relies only on matrix multiplications. We theoretically establish the algorithm’s relation to KFAC by showing that a modification of INGD effectively performs KFAC-like updates, and we further improve its memory efficiency through sparse Kronecker factors. We show that SINGD supports low-precision training and often outperforms AdamW on transformer-based models. Our work expands the scope of second-order methods to training transformer-based NNs and to low precision, making them more widely applicable.

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring Vector Institute. Runa Eschenhagen is supported by ARM and the Cambridge Trust. Richard E. Turner is supported by Google, Amazon, ARM, Improbable and EPSRC grant EP/T005386/1.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
Bae et al. (2022) Bae, J., Ng, N., Lo, A., Ghassemi, M., and Grosse, R. B. If influence functions are the answer, then what is the question? In NeurIPS, 2022.
Botev et al. (2017) Botev, A., Ritter, H., and Barber, D. Practical Gauss-Newton optimisation for deep learning. In ICML, 2017.
Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS, 2020.
Dangel (2023) Dangel, F. Convolutions through the lens of tensor networks. arXiv 2307.02275, 2023.
Daxberger et al. (2021) Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. Laplace redux—effortless Bayesian deep learning. In NeurIPS, 2021.
Dehghani et al. (2023) Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
Eschenhagen et al. (2023) Eschenhagen, R., Immer, A., Turner, R. E., Schneider, F., and Hennig, P. Kronecker-Factored Approximate Curvature for modern neural network architectures. In NeurIPS, 2023.
George et al. (2018) George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vincent, P. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. In NeurIPS, 2018.
Graves (2011) Graves, A. Practical variational inference for neural networks. In NeurIPS, 2011.
Grosse & Martens (2016) Grosse, R. and Martens, J. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016.
Grosse et al. (2023) Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., Steiner, B., Li, D., Durmus, E., Perez, E., et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296, 2023.
Hassani et al. (2021) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., and Shi, H. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.
Hatamizadeh et al. (2023) Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J., and Molchanov, P. Global context vision transformers. In International Conference on Machine Learning, pp. 12633–12646. PMLR, 2023.
Heskes (2000) Heskes, T. On “natural” learning and pruning in multilayered perceptrons. Neural Computation, 12(4), 2000.
Immer et al. (2021) Immer, A., Bauer, M., Fortuin, V., Rätsch, G., and Khan, M. E. Scalable marginal likelihood estimation for model selection in deep learning. In ICML, 2021.
Izadi et al. (2020) Izadi, M. R., Fang, Y., Stevenson, R., and Lin, L. Optimization of graph neural networks with natural gradient descent. In 2020 IEEE international conference on big data (big data), pp. 171–179. IEEE, 2020.
Khan & Lin (2017) Khan, M. and Lin, W. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics, pp. 878–887, 2017.
Khan & Nielsen (2018) Khan, M. E. and Nielsen, D. Fast yet Simple Natural-Gradient Descent for Variational Inference in Complex Models. arXiv preprint arXiv:1807.04489, 2018.
Khan & Rue (2021) Khan, M. E. and Rue, H. The bayesian learning rule. arXiv preprint arXiv:2107.04562, 2021.
Khan et al. (2018) Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In ICML, 2018.
Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Kunstner et al. (2019) Kunstner, F., Balles, L., and Hennig, P. Limitations of the empirical Fisher approximation for natural gradient descent. In NeurIPS, 2019.
Li (2022) Li, X. Black box lie group preconditioners for sgd. arXiv preprint arXiv:2211.04422, 2022.
Li (2018) Li, X.-L. Preconditioner on matrix lie group for sgd. In International Conference on Learning Representations, 2018.
Lin et al. (2020) Lin, W., Schmidt, M., and Khan, M. E. Handling the positive-definite constraint in the bayesian learning rule. In ICML, 2020.
Lin et al. (2021) Lin, W., Nielsen, F., Khan, M. E., and Schmidt, M. Tractable structured natural-gradient descent using local parameterizations. In ICML, 2021.
Lin et al. (2023) Lin, W., Duruisseaux, V., Leok, M., Nielsen, F., Khan, M. E., and Schmidt, M. Simplifying momentum-based positive-definite submanifold optimization with applications to deep learning. In ICML, 2023.
Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019.
Lu et al. (2022) Lu, Z., Xie, H., Liu, C., and Zhang, Y. Bridging the gap between vision transformers and convolutional neural networks on small datasets. Advances in Neural Information Processing Systems, 35:14663–14677, 2022.
Martens (2014) Martens, J. New insights and perspectives on the natural gradient method. JMLR, 21(146), 2014.
Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.
Martens et al. (2018) Martens, J., Ba, J., and Johnson, M. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.
Micikevicius et al. (2018) Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. In International Conference on Learning Representations (ICLR), 2018.
Osawa et al. (2019) Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen, R., Turner, R. E., and Yokota, R. Practical deep learning with Bayesian principles. In NeurIPS, 2019.
Osawa et al. (2023) Osawa, K., Li, S., and Hoefler, T. PipeFisher: Efficient training of large language models using pipelining and Fisher information matrices. In MLSys, 2023.
Osborne (1992) Osborne, M. R. Fisher’s method of scoring. International Statistical Review/Revue Internationale de Statistique, pp. 99–117, 1992.
Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Robbins & Monro (1951) Robbins, H. and Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.
Schraudolph (2002) Schraudolph, N. N. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7), 2002.
Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Smyth (1996) Smyth, G. K. Partitioned algorithms for maximum likelihood and other non-linear estimation. Statistics and Computing, 6:201–216, 1996.
Smyth (2015) Smyth, G. K. Optimization and nonlinear equations. Statistics reference online, 1:1–9, 2015.
Tan (2022) Tan, L. S. Analytic natural gradient updates for cholesky factor in gaussian variational approximation. arXiv preprint arXiv:2109.00375, 2022.
Thompson et al. (2020) Thompson, N. C., Greenewald, K., Lee, K., and Manso, G. F. The computational limits of deep learning. 2020.
Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Trockman & Kolter (2023) Trockman, A. and Kolter, J. Z. Patches are all you need? Transactions on Machine Learning Research, 2023.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NIPS, 2017.
Wang et al. (2023) Wang, A., Chen, H., Lin, Z., Pu, H., and Ding, G. Repvit: Revisiting mobile cnn from vit perspective. arXiv preprint arXiv:2307.09283, 2023.
Wang et al. (2019) Wang, C., Grosse, R., Fidler, S., and Zhang, G. Eigendamage: Structured pruning in the kronecker-factored eigenbasis. In ICML, 2019.
Wang (2010) Wang, Y. Fisher scoring: An interpolation family and its Monte Carlo implementations. Comput. Stat. Data Anal., 54(7), 2010.
Zhang et al. (2018) Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In ICML, 2018.
Zhang et al. (2019) Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G. E., Shallue, C. J., and Grosse, R. B. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In NeurIPS, 2019.

Appendix A Space and Time Complexity

Iteration Cost

Method	$\triangle\boldsymbol{\mu}$ (descent direction)	Update $\mathbf{S}_{K}$ or $\mathbf{K}$	Update $\mathbf{S}_{C}$ or $\mathbf{C}$	$\nabla_{\mu}\ell$ (BackProp)
KFAC	$O(d_{i}^{2}d_{o}+d_{o}^{2}d_{i})$	$O(\frac{1}{T}(md_{i}^{2}+d_{i}^{3}))$	$O(\frac{1}{T}(md_{o}^{2}+d_{o}^{3}))$	$O(md_{i}d_{o})$
INGD/SINGD (Dense)	$O(d_{i}^{2}d_{o}+d_{o}^{2}d_{i})$	$O(\frac{1}{T}(md_{i}^{2}+d_{i}^{3}))$	$O(\frac{1}{T}(md_{o}^{2}+d_{o}^{3}))$	$O(md_{i}d_{o})$
SINGD (Block-Diag. with block size $k$)	$O(kd_{i}d_{o})$	$O(\frac{1}{T}(kmd_{i}))$	$O(\frac{1}{T}(kmd_{o}))$	$O(md_{i}d_{o})$
SINGD (Toeplitz)	$O(d_{i}d_{o}\log(d_{o}d_{i}))$	$O(\frac{1}{T}(md_{i}\log d_{i}))$	$O(\frac{1}{T}(md_{o}\log d_{o}))$	$O(md_{i}d_{o})$
SINGD (Rank-1 Triangular)	$O(d_{i}d_{o})$	$O(\frac{1}{T}(md_{i}))$	$O(\frac{1}{T}(md_{o}))$	$O(md_{i}d_{o})$
SINGD (Hierarchical with parameter $k$)	$O(kd_{i}d_{o})$	$O(\frac{1}{T}(kmd_{i}))$	$O(\frac{1}{T}(kmd_{o}))$	$O(md_{i}d_{o})$
AdamW	$O(d_{i}d_{o})$	-	-	$O(md_{i}d_{o})$

Table 3: Iteration cost for a non-weight-sharing layer, where $m$ is the size of a mini-batch and $\boldsymbol{\mu}\in\mathbb{R}^{d_{i}\times d_{o}}$ is a learnable weight matrix. We assume factors $\mathbf{K}$ and $\mathbf{C}$ use the same structure.

Memory Usage

Method	$\nabla_{\mu}\ell\odot\nabla_{\mu}\ell$	$\mathbf{S}_{K}$ or $\mathbf{K}$	$\mathbf{S}_{C}$ or $\mathbf{C}$
KFAC	-	$O(d_{i}^{2})$	$O(d_{o}^{2})$
INGD/SINGD (Dense)	-	$O(d_{i}^{2})$	$O(d_{o}^{2})$
SINGD (Block-Diag. with block size $k$)	-	$O(kd_{i})$	$O(kd_{o})$
SINGD (Toeplitz)	-	$O(d_{i})$	$O(d_{o})$
SINGD (Rank-1 Triangular)	-	$O(d_{i})$	$O(d_{o})$
SINGD (Hierarchical with parameter $k$)	-	$O(kd_{i})$	$O(kd_{o})$
AdamW	$O(d_{i}d_{o})$	-	-

Table 4: Additional Storage
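To get a feel for the gap between the structures, the leading-order terms of Tables 3 and 4 can be evaluated for a concrete layer. The sketch below is only illustrative: the function name `leading_order_terms` and the layer sizes are hypothetical choices, constants hidden by the $O(\cdot)$ notation are dropped, and the resulting numbers are therefore order-of-magnitude indicators rather than measured costs.

```python
import math


def leading_order_terms(d_i, d_o, m, T, k):
    """Evaluate the O(.) expressions of Tables 3 and 4 with constants dropped.

    Returns {method: (descent_direction, preconditioner_update, extra_memory)}.
    The preconditioner update sums the K-side and C-side terms of Table 3.
    """
    dense = (
        d_i**2 * d_o + d_o**2 * d_i,
        ((m * d_i**2 + d_i**3) + (m * d_o**2 + d_o**3)) / T,
        d_i**2 + d_o**2,
    )
    banded = (k * d_i * d_o, (k * m * d_i + k * m * d_o) / T, k * d_i + k * d_o)
    return {
        "KFAC": dense,
        "INGD/SINGD (Dense)": dense,
        "SINGD (Block-Diag.)": banded,
        "SINGD (Toeplitz)": (
            d_i * d_o * math.log(d_o * d_i),
            (m * d_i * math.log(d_i) + m * d_o * math.log(d_o)) / T,
            d_i + d_o,
        ),
        "SINGD (Rank-1 Triangular)": (d_i * d_o, (m * d_i + m * d_o) / T, d_i + d_o),
        "SINGD (Hierarchical)": banded,
        # Table 3 lists no preconditioner-update entry for AdamW, hence None.
        "AdamW": (d_i * d_o, None, d_i * d_o),
    }


# Hypothetical layer: 1024 x 1024 weights, batch size 128, T = 5, k = 32.
terms = leading_order_terms(d_i=1024, d_o=1024, m=128, T=5, k=32)
for method, (_, _, memory) in terms.items():
    print(f"{method:28s} extra storage ~ {memory}")
```

For these assumed sizes, the dense factors need on the order of $2\cdot 1024^{2}$ extra entries, whereas the block-diagonal structure with $k=32$ needs on the order of $32\cdot 2048$, a factor of 32 fewer.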

Appendix B Details of the Experiments

To demonstrate the robustness and memory efficiency of our method, we consider image classification tasks with transformer-based models such as “Compact-ViT” (Hassani et al., 2021), “Swin-ViT” (Liu et al., 2021), “GC-ViT” (Hatamizadeh et al., 2023), and “HDVT” (Lu et al., 2022). We also consider convolution-based models such as “VGG” (Simonyan & Zisserman, 2014), “ConvMixer” (Trockman & Kolter, 2023), and “Rep-ViT” (Wang et al., 2023). We train these models on the “CIFAR-100” and “ImageWoof-10” datasets. Note that “Rep-ViT” is a CNN model inspired by transformers, while “Compact-ViT” is a data-efficient transformer using convolutional tokenization. In addition, we train a graph convolution model (Kipf & Welling, 2016), denoted by “GNN”, for node classification on the “Cora” dataset. Finally, we train a ViT model on “ImageNet-100” (https://www.kaggle.com/datasets/ambityga/imagenet100) to demonstrate the performance of SINGD in large-scale settings (see Fig. 9).

B.1 Hyper-parameter Tuning

Hyperparameter	Meaning	KFAC/IKFAC/SINGD in Figures 4 and 8	AdamW in Figure 8
$\beta_{2}$	Standard stepsize	Tuned
$\alpha_{2}$	Standard momentum weight	0.9
$\gamma$	(L2) weight decay	Tuned
$\lambda$	Damping	Tuned
$\beta_{1}$	Stepsize for preconditioner	Tuned
$\alpha_{1}$	Riemannian Momentum	(SINGD only) Tuned

Table 5: Hyperparameters used for a random search.

Table 6: Peak memory and run time of different optimizers for GCViT on ImageWoof10 (Figure 6, right). Parenthesized values are normalized relative to SGD. For this vision transformer task, we observe that backpropagation dominates both run time and memory. In this setting, all our methods as well as INGD have basically no run time and memory overhead compared to the first-order methods. INGD and our proposed methods are even able to beat AdamW and SGD in terms of test error. INGD, KFAC and SINGD update their preconditioner every $T=5$ iterations.

Method	Peak memory [GiB]	Training time [min]
SGD (BFP-16)	15.6 (1.00 x)	190 (1.00 x)
AdamW (BFP-16)	15.7 (1.00 x)	191 (1.01 x)
SINGD-Diag* (BFP-16)	15.8 (1.02 x)	200 (1.06 x)
IKFAC* (BFP-16)	16.0 (1.02 x)	197 (1.04 x)
INGD (BFP-16)	16.0 (1.02 x)	203 (1.07 x)
KFAC (FP-32)	16.0 (1.02 x)	359 (1.89 x)

INGD
1: Each $T$ iter., update $\mathbf{m}_{K}$, $\mathbf{m}_{C}$, $\mathbf{K}$, $\mathbf{C}$
   Obtain $\boldsymbol{\mu}_{AA}\otimes\boldsymbol{\mu}_{GG}$ to approximate $\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$
   $\mathbf{m}_{K}\leftarrow\alpha_{1}\mathbf{m}_{K}+\frac{1}{2d}\big(\mathrm{Tr}(\mathbf{H}_{C})\mathbf{H}_{K}+c^{2}\mathbf{K}^{\top}\mathbf{K}-d\mathbf{I}_{p}\big)$
   $\mathbf{m}_{C}\leftarrow\alpha_{1}\mathbf{m}_{C}+\frac{1}{2p}\big(\mathrm{Tr}(\mathbf{H}_{K})\mathbf{H}_{C}+\kappa^{2}\mathbf{C}^{\top}\mathbf{C}-p\mathbf{I}_{d}\big)$
   $\mathbf{K}\leftarrow\mathbf{K}\,\mathrm{Expm}(-\beta_{1}\mathbf{m}_{K})\approx\mathbf{K}(\mathbf{I}_{p}-\beta_{1}\mathbf{m}_{K})$
   $\mathbf{C}\leftarrow\mathbf{C}\,\mathrm{Expm}(-\beta_{1}\mathbf{m}_{C})\approx\mathbf{C}(\mathbf{I}_{d}-\beta_{1}\mathbf{m}_{C})$
2: $\mathbf{M}_{\mu}\leftarrow\alpha_{2}\mathbf{M}_{\mu}+\mathbf{C}\mathbf{C}^{\top}\mathrm{vec}^{-1}(\nabla_{\mu}\ell(\boldsymbol{\mu}))\mathbf{K}\mathbf{K}^{\top}+\gamma\,\mathrm{vec}^{-1}(\boldsymbol{\mu})$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathrm{vec}(\mathbf{M}_{\mu})$

AdamW Optimizer
1: At iter. $t$, update $\mathbf{m}_{s}$, $\mathbf{s}$
   Use $\left(\nabla_{\mu}\ell(\boldsymbol{\mu})\right)^{2}$ to approximate $\mathrm{diag}\left(\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})\right)$
   $\mathbf{m}_{s}\leftarrow(1-\beta_{1})\mathbf{m}_{s}+\beta_{1}\left(\nabla_{\mu}\ell(\boldsymbol{\mu})\right)^{2}$
   $\mathbf{s}^{2}\leftarrow\mathbf{m}_{s}/(1-(1-\beta_{1})^{t})$
   $\mathbf{s}\leftarrow\sqrt{\mathbf{s}^{2}}+\lambda$
2: $\mathbf{m}_{\mu}\leftarrow\alpha_{2}\mathbf{m}_{\mu}+(1-\alpha_{2})\nabla_{\mu}\ell(\boldsymbol{\mu})$
   $\mathbf{M}_{\mu}\leftarrow\mathbf{s}^{-1}\mathbf{m}_{\mu}/\big(1-\alpha_{2}^{t}\big)$
3: $\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\beta_{2}\mathbf{M}_{\mu}+\gamma\boldsymbol{\mu}$

Figure 8: Baseline methods in the same notation for a hyperparameter search.

Appendix C Connection between IKFAC and KFAC

To relate our scheme to the KFAC method, we now show that $\mathbf{K}^{\text{new}}\big(\mathbf{K}^{\text{new}}\big)^{\top}$ is an approximation of $\big(\mathbf{S}_{K}^{\text{new}}+\lambda\mathbf{I}\big)^{-1}$ at a new step of our scheme. For simplicity, we first assume that $\mathbf{K}\mathbf{K}^{\top}$ exactly equals $\left(\mathbf{S}_{K}^{\text{cur}}+\lambda\mathbf{I}\right)^{-1}$ at the current step. Later, we will relax this assumption and prove that $\mathbf{K}\mathbf{K}^{\top}$ is an approximation of $\left(\mathbf{S}_{K}+\lambda\mathbf{I}\right)^{-1}$ at every step, as stated in Theorem 1. For notational simplicity, we denote $\bar{\mathbf{S}}_{K}\coloneq\mathbf{S}_{K}+\lambda\mathbf{I}$. The update of $\mathbf{S}_{K}$ with damping $\lambda\mathbf{I}$ can be re-expressed as an update of $\bar{\mathbf{S}}_{K}$:

$\displaystyle\left(\mathbf{S}_{K}^{\text{new}}+\lambda\mathbf{I}\right)=\bar{\mathbf{S}}_{K}^{\text{new}}\leftarrow(1-\beta_{1})\bar{\mathbf{S}}_{K}^{\text{cur}}+\beta_{1}\left(\mathbf{U}+\lambda\mathbf{I}\right).$

Since $\bar{\mathbf{S}}_{K}^{\text{cur}}=\mathbf{K}^{-\top}\mathbf{K}^{-1}$ by our assumption, we can express the update of $\mathbf{S}_{K}$ in terms of $\mathbf{K}$ as follows, where $\mathbf{m}_{K}$ denotes $\mathbf{K}^{\top}\mathbf{U}\mathbf{K}+\lambda\mathbf{K}^{\top}\mathbf{K}-\mathbf{I}$.

$\displaystyle\bar{\mathbf{S}}_{K}^{\text{new}}\leftarrow(1-\beta_{1})\bar{\mathbf{S}}_{K}^{\text{cur}}+\beta_{1}\left(\mathbf{U}+\lambda\mathbf{I}\right)=\mathbf{K}^{-\top}\left(\mathbf{I}+\beta_{1}\left(\mathbf{K}^{\top}\mathbf{U}\mathbf{K}+\lambda\mathbf{K}^{\top}\mathbf{K}-\mathbf{I}\right)\right)\mathbf{K}^{-1}=\mathbf{K}^{-\top}\left(\mathbf{I}+\beta_{1}\mathbf{m}_{K}\right)\mathbf{K}^{-1}$

The factor $\bar{\mathbf{S}}_{K}^{\text{new}}$ in the KFAC update can be approximated as follows, where we treat $\mathbf{I}+\beta_{1}\mathbf{m}_{K}$ as a first-order truncation of the matrix exponential, $\mathrm{Expm}(\beta_{1}\mathbf{m}_{K})\approx\mathbf{I}+\beta_{1}\mathbf{m}_{K}$, and use the fact that $\mathbf{m}_{K}$ is symmetric.

$\displaystyle\bar{\mathbf{S}}_{K}^{\text{new}}=\mathbf{K}^{-\top}\left(\mathbf{I}+\beta_{1}\mathbf{m}_{K}\right)\mathbf{K}^{-1}\approx\mathbf{K}^{-\top}\mathrm{Expm}\left(\beta_{1}\mathbf{m}_{K}\right)\mathbf{K}^{-1}=\mathbf{K}^{-\top}\mathrm{Expm}\Big(\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)^{\top}\mathrm{Expm}\Big(\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)\mathbf{K}^{-1}.$

Informally, we can see that $\mathbf{K}^{\text{new}}\big(\mathbf{K}^{\text{new}}\big)^{\top}$ approximates $\big(\bar{\mathbf{S}}_{K}^{\text{new}}\big)^{-1}$ by using the matrix exponential. Note that $\mathbf{m}_{K}$ stays in a matrix logarithm space.

$\displaystyle\left(\bar{\mathbf{S}}_{K}^{\text{new}}\right)^{-1}\approx\mathbf{K}\,\mathrm{Expm}\Big(-\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)\mathrm{Expm}\Big(-\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)^{\top}\mathbf{K}^{\top}\approx\mathbf{K}\Big(\mathbf{I}-\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)\Big(\mathbf{I}-\frac{\beta_{1}}{2}\mathbf{m}_{K}\Big)^{\top}\mathbf{K}^{\top}=\mathbf{K}^{\text{new}}\big(\mathbf{K}^{\text{new}}\big)^{\top}$
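The two facts about the matrix exponential used in this argument can be checked numerically: the truncation $\mathbf{I}-\beta_{1}\mathbf{m}_{K}$ differs from $\mathrm{Expm}(-\beta_{1}\mathbf{m}_{K})$ at second order in $\beta_{1}$, and for a symmetric argument the exponential factors into two half-step exponentials. The following is a minimal sketch in plain Python, assuming a hypothetical $2\times 2$ symmetric matrix and a Taylor-series helper `expm_series` written for this example (it is not taken from any library).

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm_series(A, terms=25):
    """Matrix exponential via a truncated Taylor series (adequate for small ||A||)."""
    result = identity(len(A))
    term = identity(len(A))
    for j in range(1, terms):
        # term becomes A^j / j!
        term = [[x / j for x in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def truncation_error(m, beta):
    """Largest entrywise gap between Expm(-beta * m) and I - beta * m."""
    n = len(m)
    exact = expm_series([[-beta * x for x in row] for row in m])
    truncated = [[(1.0 if i == j else 0.0) - beta * m[i][j] for j in range(n)]
                 for i in range(n)]
    return max_abs_diff(exact, truncated)


m_K = [[0.5, -0.2], [-0.2, 0.3]]  # hypothetical symmetric matrix
beta = 0.01

# Halving beta should shrink the truncation error by roughly a factor of 4.
ratio = truncation_error(m_K, beta) / truncation_error(m_K, beta / 2)

# Symmetric factorization: Expm(beta m) = Expm(beta m / 2)^T Expm(beta m / 2).
half = expm_series([[0.5 * beta * x for x in row] for row in m_K])
full = expm_series([[beta * x for x in row] for row in m_K])
factor_gap = max_abs_diff(full, matmul(transpose(half), half))

print(ratio, factor_gap)
```

A ratio close to 4 is what a second-order truncation error would produce, and a factorization gap at the level of floating-point rounding is consistent with the symmetry argument.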

Theorem 1 formally shows that $\mathbf{K}\mathbf{K}^{\top}$ used in our update is an approximation of $\big(\mathbf{S}_{K}+\lambda\mathbf{I}\big)^{-1}$ in the KFAC update for every step even when the truncation of the matrix exponential is employed.
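As an illustration that is not part of the original derivation, the one-step argument can be worked out by hand in the scalar case. The symbols $k$, $u$, and $n$ are introduced only for this example: $k$ plays the role of $\mathbf{K}$, $u$ the role of $\mathbf{U}$, and $n\coloneq k^{2}(u+\lambda)-1$ the role of $\mathbf{m}_{K}$. Assuming $\bar{s}^{\text{cur}}=k^{-2}$ and the truncated update $k^{\text{new}}=k\big(1-\frac{\beta_{1}}{2}n\big)$, we obtain:

```latex
\begin{align*}
\bar{s}^{\text{new}}
  &= (1-\beta_{1})\,k^{-2}+\beta_{1}(u+\lambda)
   = k^{-2}\,(1+\beta_{1}n), \\
\big(k^{\text{new}}\big)^{2}\,\bar{s}^{\text{new}}
  &= \Big(1-\frac{\beta_{1}}{2}n\Big)^{2}(1+\beta_{1}n)
   = 1-\frac{3}{4}\beta_{1}^{2}n^{2}+\frac{1}{4}\beta_{1}^{3}n^{3}.
\end{align*}
```

The first-order terms in $\beta_{1}$ cancel, so in this scalar sketch the product deviates from $1$ only at order $\beta_{1}^{2}$.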

Appendix D Proof of Theorem 1

We first consider the following lemmas in order to prove Theorem 1.

Recall that we denote $\bar{\mathbf{S}}_{K}\coloneq\mathbf{S}_{K}+\lambda\mathbf{I}$. For notational simplicity, we will drop the subscript $K$ in this section and use $\bar{\mathbf{S}}_{t}$ to denote $\bar{\mathbf{S}}_{K}$ at iteration $t$. Notice that $\bar{\mathbf{S}}_{t}$ is non-singular at each iteration $t$, so that we can invert it in the original KFAC update (see Figure 4).

Lemma D.1.

Consider the following update in the original KFAC update at iteration $t$.

$\displaystyle\bar{\mathbf{S}}_{t}\coloneq(1-\beta_{1})\bar{\mathbf{S}}_{t-1}+\beta_{1}\big(\hat{\mathbf{U}}_{t-1}+\lambda\mathbf{I}\big)$

where $\mathbf{S}_{t}$ is the factor $\mathbf{S}_{K}$ used in the original KFAC update, $\beta_{1}$ is known as the weight of the moving average, and $\hat{\mathbf{U}}_{t-1}$ is a curvature matrix.

The initial factor $\bar{\mathbf{S}}_{0}$ can be decomposed as $\bar{\mathbf{S}}_{0}=\hat{\mathbf{K}}_{0}^{-\top}\hat{\mathbf{K}}_{0}^{-1}$ since $\bar{\mathbf{S}}_{0}$, as a preconditioning factor, is symmetric positive definite.

Define $\hat{\mathbf{N}}_{i}\coloneq\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{U}}_{i}\hat{\mathbf{K}}_{0}+\lambda\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{K}}_{0}-\mathbf{I}$.

The Kronecker factor can be re-expressed as

$\displaystyle\bar{\mathbf{S}}_{t}=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{t-1}\hat{\mathbf{N}}_{i}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$

Lemma D.2.

Consider the following update in our inverse-free KFAC at iteration $t$.

$\displaystyle\mathbf{K}_{t}\coloneq\mathbf{K}_{t-1}\left(\mathbf{I}-\frac{\beta_{1}}{2}\left(\mathbf{K}_{t-1}^{\top}\mathbf{U}_{t-1}\mathbf{K}_{t-1}+\lambda\mathbf{K}_{t-1}^{\top}\mathbf{K}_{t-1}-\mathbf{I}\right)\right)$

where $\mathbf{K}_{t-1}^{\top}\mathbf{U}_{t-1}\mathbf{K}_{t-1}$ is used in our update and $\mathbf{U}_{t-1}$ is a curvature matrix.

Define $\mathbf{N}_{i}\coloneq\mathbf{K}_{i}^{\top}\mathbf{U}_{i}\mathbf{K}_{i}+\lambda\mathbf{K}_{i}^{\top}\mathbf{K}_{i}-\mathbf{I}$.

Our update of $\mathbf{K}$ can be re-expressed as

$\displaystyle\mathbf{K}_{t}=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)+O(\beta_{1}^{2})$

Moreover, the product $\mathbf{K}_{t}\mathbf{K}_{t}^{\top}$ can be re-expressed as

$\displaystyle\mathbf{K}_{t}\mathbf{K}_{t}^{\top}=\mathbf{K}_{0}\left(\mathbf{I}-\beta_{1}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\mathbf{K}_{0}^{\top}+O(\beta_{1}^{2})$

Lemma D.3 is useful to establish a relationship between the KFAC update and our inverse-free update.

Lemma D.3.

If we use the same sequence of curvature matrices in both the original KFAC update and our update, such that $\hat{\mathbf{U}}_{i}=\mathbf{U}_{i}$ for each iteration $i$, and the same initialization $\hat{\mathbf{K}}_{0}=\mathbf{K}_{0}$, then we have the following expression.

$\displaystyle\mathbf{N}_{i}=\hat{\mathbf{N}}_{i}+O(\beta_{1})$

Similarly, we have the following result for $\mathbf{C}$ .

Theorem 2.

The product $\mathbf{C}\mathbf{C}^{\top}$ is a first-order accurate approximation of the KFAC factor $\big(\mathbf{S}_{C}+\lambda\mathbf{I}\big)^{-1}$ at each iteration if $\mathbf{C}$ is updated according to Figure 4 with the truncation of the matrix exponential, and the two updates use the same initialization and the same sequence of curvature matrices $\mathbf{G}$.

$\displaystyle\mathbf{C}\mathbf{C}^{\top}=\big(\mathbf{S}_{C}+\lambda\mathbf{I}\big)^{-1}+O(\beta_{1}^{2})$
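The first-order accuracy stated above can also be probed numerically. The sketch below runs the KFAC moving average of Lemma D.1 and the inverse-free update of Lemma D.2 side by side, with the same initialization and the same curvature sequence, and measures how far $\mathbf{K}_{t}\mathbf{K}_{t}^{\top}\bar{\mathbf{S}}_{t}$ is from the identity. Everything concrete here is an assumption made for illustration: the $2\times 2$ size, the initial factor, the rank-one curvature matrices, the damping value, and the helper names.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def lincomb(a, A, b, B):
    """Entrywise a * A + b * B."""
    return [[a * x + b * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def inv2(A):
    """Inverse of a 2 x 2 matrix via the closed-form adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


I2 = [[1.0, 0.0], [0.0, 1.0]]


def residual(beta, K0, curvatures, lam):
    """Max entry of |K_t K_t^T S_bar_t - I| after processing all curvatures."""
    K = [row[:] for row in K0]
    S_bar = inv2(matmul(K0, transpose(K0)))  # S_bar_0 = K_0^{-T} K_0^{-1}
    for U in curvatures:
        damped = lincomb(1.0, U, lam, I2)  # U + lam * I
        # KFAC: moving average of the damped factor.
        S_bar = lincomb(1.0 - beta, S_bar, beta, damped)
        # Inverse-free: N = K^T U K + lam K^T K - I, then K <- K (I - beta/2 N).
        N = lincomb(1.0, matmul(matmul(transpose(K), damped), K), -1.0, I2)
        K = matmul(K, lincomb(1.0, I2, -beta / 2.0, N))
    R = lincomb(1.0, matmul(matmul(K, transpose(K)), S_bar), -1.0, I2)
    return max(abs(x) for row in R for x in row)


# Hypothetical setup: five rank-one curvature matrices g g^T.
K0 = [[1.0, 0.0], [0.3, 0.9]]
vectors = [(1.0, 0.5), (-0.3, 1.2), (0.8, -0.7), (0.2, 0.4), (-1.0, 0.1)]
curvatures = [[[g[0] * g[0], g[0] * g[1]], [g[1] * g[0], g[1] * g[1]]] for g in vectors]

err_coarse = residual(1e-3, K0, curvatures, lam=0.1)
err_fine = residual(5e-4, K0, curvatures, lam=0.1)
print(err_coarse, err_fine, err_coarse / err_fine)
```

If the residual is indeed $O(\beta_{1}^{2})$ for a fixed number of steps, halving $\beta_{1}$ should reduce it by a factor close to 4; this is a consistency check on one toy problem, not a proof.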

D.1 Proof of Lemma D.1

We prove the lemma by induction. We first show the base case when $t=1$. By definition, we have

$\displaystyle\bar{\mathbf{S}}_{1}$	$\displaystyle=(1-\beta_{1})\bar{\mathbf{S}}_{0}+\beta_{1}\big(\hat{\mathbf{U}}_{0}+\lambda\mathbf{I}\big)$	(10)
	$\displaystyle=(1-\beta_{1})\hat{\mathbf{K}}_{0}^{-\top}\hat{\mathbf{K}}_{0}^{-1}+\beta_{1}\big(\hat{\mathbf{U}}_{0}+\lambda\mathbf{I}\big)$	(11)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\Big[\mathbf{I}+\beta_{1}\underbrace{\Big(\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{U}}_{0}\hat{\mathbf{K}}_{0}+\lambda\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{K}}_{0}-\mathbf{I}\Big)}_{=\hat{\mathbf{N}}_{0}}\Big]\hat{\mathbf{K}}_{0}^{-1}$	(12)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\left[\mathbf{I}+\beta_{1}\hat{\mathbf{N}}_{0}\right]\hat{\mathbf{K}}_{0}^{-1}$	(13)

Thus, the claim holds when $t=1$.

Suppose the claim holds when $t=n$. By the induction hypothesis, we have

$\displaystyle\bar{\mathbf{S}}_{n}=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{n-1}\hat{\mathbf{N}}_{i}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$	(14)

Now, we consider the case when $t=n+1$ . Notice that

	$\displaystyle(1-\beta_{1})\bar{\mathbf{S}}_{n}$	$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{n-1}\hat{\mathbf{N}}_{i}-\beta_{1}\mathbf{I}+O(\beta_{1}^{2})\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$
		$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{n-1}\hat{\mathbf{N}}_{i}-\beta_{1}\mathbf{I}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$

By the definition of $\bar{\mathbf{S}}_{n+1}$, we have

$\displaystyle\bar{\mathbf{S}}_{n+1}$	$\displaystyle=(1-\beta_{1})\bar{\mathbf{S}}_{n}+\beta_{1}\big(\hat{\mathbf{U}}_{n}+\lambda\mathbf{I}\big)$	(15)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{n-1}\hat{\mathbf{N}}_{i}\underbrace{-\beta_{1}\mathbf{I}+\beta_{1}\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{U}}_{n}\hat{\mathbf{K}}_{0}+\beta_{1}\lambda\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{K}}_{0}}_{=\beta_{1}\hat{\mathbf{N}}_{n}}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$	(16)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{n}\hat{\mathbf{N}}_{i}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$	(17)

which is exactly the claim when $t=n+1$ .

Thus, by induction, the claim holds.

D.2 Proof of Lemma D.2

We prove the lemma by induction. We first show the base case when $t=1$. By definition, we have

$\displaystyle\mathbf{K}_{1}=\mathbf{K}_{0}\Big(\mathbf{I}-\frac{\beta_{1}}{2}\underbrace{\left(\mathbf{K}_{0}^{\top}\mathbf{U}_{0}\mathbf{K}_{0}+\lambda\mathbf{K}_{0}^{\top}\mathbf{K}_{0}-\mathbf{I}\right)}_{=\mathbf{N}_{0}}\Big)$	(18)

Thus, the claim holds when $t=1$.

Suppose the claim holds when $t=n$. By the induction hypothesis, we have

$\displaystyle\mathbf{K}_{n}=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n-1}\mathbf{N}_{i}\right)+O(\beta_{1}^{2})$	(19)

Now, we consider the case when $t=n+1$ . Notice that

$\displaystyle\mathbf{K}_{n+1}$	$\displaystyle=\mathbf{K}_{n}\Big(\mathbf{I}-\frac{\beta_{1}}{2}\underbrace{\Big(\mathbf{K}_{n}^{\top}\mathbf{U}_{n}\mathbf{K}_{n}+\lambda\mathbf{K}_{n}^{\top}\mathbf{K}_{n}-\mathbf{I}\Big)}_{=\mathbf{N}_{n}}\Big)$	(20)
	$\displaystyle=\underbrace{\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n-1}\mathbf{N}_{i}\right)}_{=\mathbf{K}_{n}-O(\beta_{1}^{2})}\Big(\mathbf{I}-\frac{\beta_{1}}{2}\mathbf{N}_{n}\Big)+O(\beta_{1}^{2})$	(21)
	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n-1}\mathbf{N}_{i}-\frac{\beta_{1}}{2}\mathbf{N}_{n}+O(\beta_{1}^{2})\right)+O(\beta_{1}^{2})$	(22)
	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n}\mathbf{N}_{i}\right)+O(\beta_{1}^{2})$	(23)

which is exactly the claim when $t=n+1$ .

Thus, by induction, the claim holds.

Notice that $\mathbf{N}_{i}$ is symmetric by definition. It is easy to see that

$\displaystyle\mathbf{K}_{t}\mathbf{K}_{t}^{\top}$	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)^{\top}\mathbf{K}_{0}^{\top}+O(\beta_{1}^{2})$	(24)
	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\mathbf{K}_{0}^{\top}+O(\beta_{1}^{2})$	(25)
	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\beta_{1}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\mathbf{K}_{0}^{\top}+O(\beta_{1}^{2})$	(26)

Thus, the claim also holds.

D.3 Proof of Lemma D.3

We first show the base case when $t=0$. By the assumption, we have $\mathbf{K}_{0}=\hat{\mathbf{K}}_{0}$. Similarly, we have $\mathbf{U}_{0}=\hat{\mathbf{U}}_{0}$ by the assumption.

By definition, we have

$\displaystyle\mathbf{N}_{0}$	$\displaystyle=\mathbf{K}_{0}^{\top}\mathbf{U}_{0}\mathbf{K}_{0}+\lambda\mathbf{K}_{0}^{\top}\mathbf{K}_{0}-\mathbf{I}$	(27)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{U}}_{0}\hat{\mathbf{K}}_{0}+\lambda\hat{\mathbf{K}}_{0}^{\top}\hat{\mathbf{K}}_{0}-\mathbf{I}$	(28)
	$\displaystyle=\hat{\mathbf{N}}_{0}$	(29)

Thus, the claim holds when $t=0$ .

When $t>0$ , we can use Lemma D.2 to obtain the claim. Notice that

$\displaystyle\mathbf{N}_{n+1}$	$\displaystyle=\mathbf{K}_{n+1}^{\top}\mathbf{U}_{n+1}\mathbf{K}_{n+1}+\lambda\mathbf{K}_{n+1}^{\top}\mathbf{K}_{n+1}-\mathbf{I}$	(30)
	$\displaystyle=\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n}\mathbf{N}_{i}\right)^{\top}\mathbf{K}_{0}^{\top}\big(\mathbf{U}_{n+1}+\lambda\mathbf{I}\big)\mathbf{K}_{0}\left(\mathbf{I}-\frac{\beta_{1}}{2}\sum_{i=0}^{n}\mathbf{N}_{i}\right)-\mathbf{I}+O(\beta_{1}^{2})\,\,\text{(Lemma D.2)}$	(31)
	$\displaystyle=\mathbf{K}_{0}^{\top}\big(\mathbf{U}_{n+1}+\lambda\mathbf{I}\big)\mathbf{K}_{0}-\mathbf{I}+O(\beta_{1})+O(\beta_{1}^{2})$	(32)
	$\displaystyle=\hat{\mathbf{K}}_{0}^{\top}\big(\hat{\mathbf{U}}_{n+1}+\lambda\mathbf{I}\big)\hat{\mathbf{K}}_{0}-\mathbf{I}+O(\beta_{1})\,\,\text{(Assumption)}$	(33)
	$\displaystyle=\hat{\mathbf{N}}_{n+1}+O(\beta_{1})$	(34)

D.4 Proof of Theorem 1

It is sufficient to show that the following claim holds at iteration $t$, since $\bar{\mathbf{S}}_{t}$ is non-singular.

$\displaystyle\mathbf{K}_{t}\mathbf{K}_{t}^{\top}\bar{\mathbf{S}}_{t}=\mathbf{I}+O(\beta_{1}^{2})$

where we use $\bar{\mathbf{S}}_{t}$ to denote $\bar{\mathbf{S}}_{K}$ at iteration $t$.

By assumption, Lemmas D.1, D.2, and D.3 hold. Moreover, we have $\mathbf{K}_{0}=\hat{\mathbf{K}}_{0}$. Thus, we have

$\displaystyle\mathbf{K}_{t}\mathbf{K}_{t}^{\top}\bar{\mathbf{S}}_{t}$	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\beta_{1}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\mathbf{K}_{0}^{\top}\bar{\mathbf{S}}_{t}+O(\beta_{1}^{2})\text{ (by Lemma D.2)}$	(35)
	$\displaystyle=\mathbf{K}_{0}\left(\mathbf{I}-\beta_{1}\sum_{i=0}^{t-1}\mathbf{N}_{i}\right)\mathbf{K}_{0}^{\top}\hat{\mathbf{K}}_{0}^{-\top}\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{t-1}\hat{\mathbf{N}}_{i}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})\text{ (by Lemma D.1)}$	(36)
	$\displaystyle=\hat{\mathbf{K}}_{0}\left(\mathbf{I}-\beta_{1}\sum_{i=0}^{t-1}\hat{\mathbf{N}}_{i}+O(\beta_{1}^{2})\right)\left(\mathbf{I}+\beta_{1}\sum_{i=0}^{t-1}\hat{\mathbf{N}}_{i}\right)\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})\text{ (by Lemma D.3)}$	(37)
	$\displaystyle=\hat{\mathbf{K}}_{0}\mathbf{I}\hat{\mathbf{K}}_{0}^{-1}+O(\beta_{1}^{2})$	(38)
	$\displaystyle=\mathbf{I}+O(\beta_{1}^{2})$	(39)
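The $O(\beta_{1}^{2})$ rate can be checked numerically for a single step. The following is a minimal sketch and not the authors' implementation. It assumes the inverse-free factor update $\mathbf{K}\leftarrow\mathbf{K}(\mathbf{I}-\frac{\beta_{1}}{2}\mathbf{N})$ with $\mathbf{N}$ as in (27), and the damped KFAC moving average $\bar{\mathbf{S}}\leftarrow(1-\beta_{1})\bar{\mathbf{S}}+\beta_{1}(\mathbf{U}+\lambda\mathbf{I})$; both are read off from the identities used in the proof rather than taken from released code. The function name `one_step_gap`, the dimension, and the random test matrices are illustrative choices.

```python
import numpy as np


def one_step_gap(beta1, lam=0.1, d=4, seed=0):
    """Frobenius norm of K_new K_new^T S_bar_new - I after one update step."""
    rng = np.random.default_rng(seed)
    eye = np.eye(d)

    # Illustrative inputs: a symmetric PSD curvature U and an invertible K.
    A = rng.standard_normal((d, d))
    U = A @ A.T / d
    K = eye + 0.1 * rng.standard_normal((d, d))

    # Start from exact agreement, K K^T S_bar = I (the theorem's assumption).
    S_bar = np.linalg.inv(K @ K.T)

    # N = K^T U K + lam K^T K - I, as in Eq. (27).
    N = K.T @ U @ K + lam * K.T @ K - eye

    # Assumed inverse-free update of K.
    K_new = K @ (eye - 0.5 * beta1 * N)

    # Assumed damped KFAC moving average of S_bar = S + lam I.
    S_bar_new = (1.0 - beta1) * S_bar + beta1 * (U + lam * eye)

    return np.linalg.norm(K_new @ K_new.T @ S_bar_new - eye)


# Halving beta1 should shrink the gap by roughly a factor of four.
gap_coarse = one_step_gap(1e-2)
gap_fine = one_step_gap(5e-3)
ratio = gap_coarse / gap_fine
```

Under these assumptions the product equals $\mathbf{K}\big(\mathbf{I}-\tfrac{3}{4}\beta_{1}^{2}\mathbf{N}^{2}+\tfrac{1}{4}\beta_{1}^{3}\mathbf{N}^{3}\big)\mathbf{K}^{-1}$, so `ratio` should approach 4 as $\beta_{1}$ shrinks, which is the quadratic behaviour the theorem states.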

Appendix E Invariance of INGD and SINGD

INGD and SINGD are invariant to how the Kronecker approximation is scaled, whereas KFAC and IKFAC are not. Recall that we use the following Kronecker approximation to approximate the Hessian.

$\displaystyle\mathbf{U}\otimes\mathbf{G}\approx\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$

However, such an approximation is not unique. We can consider an equivalent approximation such as

$\displaystyle(\alpha\mathbf{U})\otimes(\alpha^{-1}\mathbf{G})\approx\nabla_{\mu}^{2}\ell(\boldsymbol{\mu})$

where $\alpha\neq 0$ is an arbitrary scalar.

INGD is invariant since the update scheme involving the approximation is scale invariant: $\mathrm{Tr}(\mathbf{H}_{C})\mathbf{H}_{K}=\mathrm{Tr}(\mathbf{C}^{\top}\mathbf{G}\mathbf{C})\mathbf{K}^{\top}\mathbf{U}\mathbf{K}=\mathrm{Tr}(\mathbf{C}^{\top}(\alpha^{-1}\mathbf{G})\mathbf{C})\mathbf{K}^{\top}(\alpha\mathbf{U})\mathbf{K}$. The invariance is also preserved in SINGD since structures and their subspace projection maps are closed under scalar multiplication.
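The identity above can be verified numerically. The sketch below is illustrative only: the helper name `ingd_curvature_term`, the matrix sizes, and the value of `alpha` are assumptions, and the random matrices stand in for the actual factors and curvature statistics.

```python
import numpy as np


def ingd_curvature_term(K, C, U, G):
    """Tr(H_C) * H_K with H_C = C^T G C and H_K = K^T U K."""
    return np.trace(C.T @ G @ C) * (K.T @ U @ K)


rng = np.random.default_rng(0)
K = rng.standard_normal((3, 3))
C = rng.standard_normal((2, 2))
U = rng.standard_normal((3, 3))
G = rng.standard_normal((2, 2))
alpha = 7.5  # any non-zero scalar

original = ingd_curvature_term(K, C, U, G)
rescaled = ingd_curvature_term(K, C, alpha * U, G / alpha)

# The factors alpha and 1/alpha cancel, so both terms agree.
invariant = np.allclose(original, rescaled)
```

The cancellation happens because the trace contributes a factor $\alpha^{-1}$ and the matrix term a factor $\alpha$, independently of the values of $\mathbf{K}$ and $\mathbf{C}$.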

In contrast, the updates of KFAC and IKFAC are not scale invariant. As an example, we consider using curvature approximations $\mathbf{U}$ and $(\alpha\mathbf{U})$ to update $\mathbf{S}_{K}^{-1}$ in KFAC, and denote the updated $\mathbf{S}_{K}^{-1}$ by $\hat{\mathbf{S}}_{K}^{-1}$ and $\bar{\mathbf{S}}_{K}^{-1}$, respectively. As shown below, we cannot recover $\hat{\mathbf{S}}_{K}^{-1}$ from $\bar{\mathbf{S}}_{K}^{-1}$ by a scale transformation, and thus the KFAC update is not scale invariant.

$\displaystyle\hat{\mathbf{S}}_{K}^{-1}=\big[(1-\beta_{1})\hat{\mathbf{S}}_{K}+\beta_{1}\mathbf{U}+\lambda\mathbf{I}\big]^{-1}\neq\big[(1-\beta_{1})\bar{\mathbf{S}}_{K}+\beta_{1}(\alpha\mathbf{U})+\lambda\mathbf{I}\big]^{-1}=\bar{\mathbf{S}}_{K}^{-1}$

An attempt to make the update of $\mathbf{S}_{K}$ invariant is to set the damping weight to be $\alpha\lambda$. However, the update of $\mathbf{S}_{C}$ requires us to set the damping weight to be $\alpha^{-1}\lambda$, as shown below. Thus, it is impossible to make KFAC invariant without introducing individual damping weights.

$\displaystyle\hat{\mathbf{S}}_{C}^{-1}=\big[(1-\beta_{1})\hat{\mathbf{S}}_{C}+\beta_{1}\mathbf{G}+\lambda\mathbf{I}\big]^{-1}\neq\big[(1-\beta_{1})\bar{\mathbf{S}}_{C}+\beta_{1}(\alpha^{-1}\mathbf{G})+\lambda\mathbf{I}\big]^{-1}=\bar{\mathbf{S}}_{C}^{-1}$
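The role of the damping weight can be illustrated with a small numerical sketch. It is not the authors' code: it assumes that the running averages are rescaled together with the curvature (so $\mathbf{S}_{K}$ becomes $\alpha\mathbf{S}_{K}$ and $\mathbf{S}_{C}$ becomes $\alpha^{-1}\mathbf{S}_{C}$), and it compares the resulting Kronecker-factored preconditioners. All names (`damped_inverse`, `preconditioner`, `random_spd`) and all numerical values are hypothetical.

```python
import numpy as np


def random_spd(d, rng):
    """Illustrative symmetric positive-definite matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + 0.1 * np.eye(d)


def damped_inverse(S, curv, beta1, damping):
    """Inverse of (1 - beta1) S + beta1 curv + damping I, as in the text."""
    d = S.shape[0]
    return np.linalg.inv((1 - beta1) * S + beta1 * curv + damping * np.eye(d))


rng = np.random.default_rng(0)
U, G = random_spd(3, rng), random_spd(2, rng)
S_K, S_C = random_spd(3, rng), random_spd(2, rng)
alpha, beta1, lam = 10.0, 0.1, 0.01


def preconditioner(scale, lam_K, lam_C):
    """Kronecker product of the two damped inverses under rescaling."""
    inv_K = damped_inverse(scale * S_K, scale * U, beta1, lam_K)
    inv_C = damped_inverse(S_C / scale, G / scale, beta1, lam_C)
    return np.kron(inv_K, inv_C)


reference = preconditioner(1.0, lam, lam)

# One shared damping weight: the rescaled preconditioner changes.
shared = preconditioner(alpha, lam, lam)
shared_matches = np.allclose(reference, shared)

# Individual damping weights alpha*lam and lam/alpha: it is unchanged.
individual = preconditioner(alpha, alpha * lam, lam / alpha)
individual_matches = np.allclose(reference, individual)
```

In this sketch `shared_matches` is false while `individual_matches` is true, in line with the statement that invariance would require a separate damping weight for each Kronecker factor.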

[Figure: the matrices $\mathbf{K}$, $\mathbf{K}\mathbf{K}^{\top}$, and $(\mathbf{K}\mathbf{K}^{\top})^{-1}$ (columns) for each structure (rows): Dense, Diagonal, Block-diag., Tril-Toepl., Triu-Toepl., Hierarchical, Sparse Triu. (two rows), Sparse Tril. (two rows).]
