An Improved Finite-time Analysis of Temporal Difference
Learning with Deep Neural Networks

Zhifa Ke Zaiwen Wen Junyu Zhang

Abstract

Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.

Machine Learning, ICML

1 Introduction

The temporal difference (TD) learning method, first designed for policy evaluation (Sutton, 1988), is a fundamental building block of many popular reinforcement learning (RL) algorithms. In standard TD learning algorithms for tabular MDPs, based on the Bellman operator, the agent iteratively obtains a state-action-reward-transition tuple and then updates the Q values by a weighted average of the current value and the TD target. Once the algorithm converges, the Q function is taken as the expected return obtained by executing the target policy from a given initial state-action pair.

For large-scale RL problems, appropriate parameterization of the Q function is crucial for better scalability of the TD algorithms. Common examples include linear (Tesauro et al., 1995), general smooth nonlinear (Maei et al., 2009), and neural network (Mnih et al., 2013) function approximations. However, it is well known that the naive extension of TD learning and Q-learning algorithms can diverge under general function approximation (Tsitsiklis & Van Roy, 1996). To encourage convergence, numerous variants of TD and Q-learning have been proposed, including Least-squares TD (LSTD) (Bradtke & Barto, 1996; Boyan, 2002) and gradient TD (GTD) (Sutton et al., 2009a, b), to name a few.

The applications of neural network function approximation have witnessed huge empirical success in many real-world tasks, including Deep Q-network (DQN) algorithms (Mnih et al., 2013; Van Hasselt et al., 2016), policy improvement method (Sutton et al., 1999), trust region policy optimization (Schulman et al., 2015) and the actor-critic algorithms (Konda & Tsitsiklis, 1999; Lillicrap et al., 2015; Fujimoto et al., 2018), etc. However, due to the analysis difficulties brought by the function approximation, a significant gap exists between the empirical success and the theoretical understanding of these algorithms. Hence analyzing the convergence and sample complexity of TD learning and Q-learning under various Q function parameterizations has always been an active topic in the RL community during the past decades.

Early works focus on the asymptotic convergence of the algorithms with tabular or linear function approximation. For the tabular (stochastic) TD or Q-learning method, Jaakkola et al. (1993) established the asymptotic convergence for the first time. Later on, the asymptotic convergence of algorithms with linear function approximation has been extensively discussed using ODE-based methods, see e.g. Tsitsiklis & Van Roy (1996); Perkins & Pendrith (2002); Borkar (2009). Meanwhile, in contrast to the convergence results for RL algorithms under the tabular or linear settings, TD with nonlinear function approximation is known to diverge in general (Tsitsiklis & Van Roy, 1996; Brandfonbrener & Bruna, 2019). To overcome this issue, Maei et al. (2009) proposed to optimize the Mean Squared Projected Bellman Error (MSPBE) via a gradient-based algorithm. Due to the nonconvexity of the problem, only asymptotic convergence to stationary points can be guaranteed.

More recently, benefiting from the improved techniques for analyzing stochastic optimization algorithms, there has been a growing body of research on providing finite-time analysis for TD and Q-learning algorithms with function approximations.

For linear function approximation, the non-asymptotic results of TD learning and its variants are relatively well understood, including TD (Bhandari et al., 2018; Dalal et al., 2018; Zou et al., 2019), gradient TD (Dalal et al., 2018; Touati et al., 2018; Liu et al., 2020a), and Least-Squares TD (Lazaric et al., 2010; Prashanth et al., 2014; Tagorti & Scherrer, 2015), etc. In particular, Bhandari et al. (2018) established the first finite-time analysis of linear Q-learning under both i.i.d. sampling and Markovian sampling settings.

For neural network function approximation, which is directly related to this paper, we provide a more detailed discussion. Based on the recent advances in the understanding of optimizing ReLU networks (Jacot et al., 2018; Du et al., 2018; Allen-Zhu et al., 2019a, b; Cao & Gu, 2019, 2020), a few recent works have successfully developed the finite-time analysis of the neural TD and neural Q-learning algorithms, as long as the Q network is sufficiently wide. Let $Q^{*}$ be the true action-value function and let $Q(s,a;\boldsymbol{\theta})$ denote the action-value function parameterized by a neural network with weights $\boldsymbol{\theta}$, at any state-action pair $(s,a)$. Then we aim to find some $\epsilon$-optimal parameter $\bar{\boldsymbol{\theta}}$ such that $\mathbb{E}\big[(Q(s,a;\bar{\boldsymbol{\theta}})-Q^{*}(s,a))^{2}\big]\leq\epsilon+\epsilon_{\mathcal{F}}$, where the expectation is taken over the possible randomness in the output $\bar{\boldsymbol{\theta}}$ as well as the distribution over the state-action pairs $(s,a)$, and $\epsilon_{\mathcal{F}}$ is the optimal approximation error of the parameterization function class. In (Xu & Gu, 2020), a neural Q-learning algorithm with a general $L$-layer ReLU network is analyzed, and an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is guaranteed given that the network is sufficiently wide. In (Cai et al., 2023), the authors studied both the neural TD learning and neural Q-learning algorithms for minimizing the MSPBE for policy evaluation and policy optimization, respectively. For policy evaluation, the $Q^{*}$ in the definition of an $\epsilon$-optimal solution is taken to be $Q^{\pi}$ with $\pi$ being the policy to be evaluated. For both cases, an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is guaranteed for wide two-layer ReLU networks. In (Sun et al., 2022), an $\tilde{\mathcal{O}}(\epsilon^{-\frac{2}{2-\alpha}})$ complexity has been achieved by an adaptive neural TD algorithm with multi-layer ReLU networks, where $\alpha\in(0,1]$ is a constant that characterizes the sparsity and decay rate of the stochastic semi-gradients. However, without additional assumptions, only an $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity with $\alpha=1$ can be theoretically guaranteed. Finally, for policy evaluation problems, there are also several works that aim at reducing the width of the over-parameterized Q networks in the existing works (Tian et al., 2022; Cayci et al., 2023). In terms of complexity, both require $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples to obtain an $\epsilon$-optimal solution.

Despite the fact that existing analysis of the neural TD or neural Q-learning algorithms merely provides the $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under various settings, an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity should be expected. In fact, a double-loop fitted Q-iteration (FQI) method (Fan et al., 2020) and its single-loop Gauss-Newton variant (Ke et al., 2023) can achieve an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for two-layer Q networks. Let $\mathcal{T}$ be the Bellman (optimality) operator; then the FQI method repeatedly solves a nonlinear least squares subproblem to obtain the next iterate: $\boldsymbol{\theta}_{k+1}\approx\mathop{\mathrm{argmin}}_{\boldsymbol{\theta}\in\Theta}\mathbb{E}\big[(Q(s,a;\boldsymbol{\theta})-\mathcal{T}Q(s,a;\boldsymbol{\theta}_{k}))^{2}\big]$. Compared to the single-loop neural TD or neural Q-learning method that takes only one sample (or a mini-batch) to update the weights of Q networks, the update scheme of FQI requires repeatedly solving a subproblem to sufficiently high accuracy to enable convergence, which makes it inefficient and less favorable in practice. Therefore, we would like to raise a question:

Can we improve the existing analysis of the neural temporal difference learning algorithm and obtain an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity under general multi-layer Q neural networks?

To answer this question, we revisit the convergence analysis of the neural TD learning or Q-learning algorithms under the non-i.i.d. Markovian observations where a general $L$ -layer neural network is used for Q function parameterization. By proposing a new subspace analysis technique, under suitable conditions, we derive a brand new $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for neural TD learning or Q-learning, improving the state-of-the-art $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works. Our contributions are summarized as follows.

Work	Neural Approximation	Network Depth	Network Width	Activation	Sample Complexity
(Bhandari et al., 2018)	No	NA	NA	NA	$\mathcal{O}(1/\epsilon)$
(Cai et al., 2023)	Yes	2	$\Omega(1/\epsilon^{4})$	ReLU	$\mathcal{O}(1/\epsilon^{2})$
(Xu & Gu, 2020)	Yes	$L$	$\Omega(1/\epsilon^{6})$	ReLU	$\mathcal{O}(1/\epsilon^{2})$
(Sun et al., 2022)	Yes	$L$	$\Omega(1/\epsilon^{6})$	ReLU	$\mathcal{O}(1/\epsilon^{\frac{2}{2-\alpha}}),\alpha\in(0,1]$
(Tian et al., 2022)	Yes	$L$	$\Omega(1/\epsilon^{2})$	ELU, GeLU	$\mathcal{O}(1/\epsilon^{2})$
Ours	Yes	$L$	$\Omega(1/\epsilon^{2})$	ELU, GeLU	$\mathcal{O}(1/\epsilon)$

Table 1: Sample complexity for parameterized Q learning to find some $\bar{\boldsymbol{\theta}}$ such that $\mathbb{E}\left[\|Q(s,a;\bar{\boldsymbol{\theta}})-Q^{*}(s,a)\|_{\mu}^{2}\right]\leq\epsilon$, where $\|f\|_{\mu}^{2}:=\int|f|^{2}d\mu$ and $Q^{*}(s,a)$ satisfies the Bellman optimality equation $Q^{*}(s,a)=\mathcal{T}Q^{*}(s,a)$.

• Under the non-i.i.d. Markovian sampling setting, we derive an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for both neural TD learning and Q-learning methods under the multi-layer network approximation for Q functions. Our result also improves the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works.

• Based on our newly developed techniques, we further provide a finite-sample analysis for a minimax neural Q-learning algorithm that solves two-player zero-sum Markov games. An $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is obtained under the non-i.i.d. Markovian sampling setting.

Technically, the subspace analysis approach that we propose to establish the $\tilde{O}(\epsilon^{-1})$ sample complexity is by itself of independent interest. We believe this technique can potentially be applied to linear Q-learning algorithms and linear Actor-Critic algorithms without requiring the positive definiteness assumption of the feature covariance matrix (Bhandari et al., 2018; Zou et al., 2019; Barakat et al., 2022), while maintaining the $\tilde{O}(\epsilon^{-1})$ complexity.

In summary, we provide a comprehensive comparison between our work and the most related works in their respective settings and sample complexity in Table 1. Our work establishes an optimal sample complexity analysis within a broader contextual framework.

2 Preliminaries

We consider the infinite-horizon discounted Markov decision process (MDP), which is denoted as $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,\gamma)$. We consider a general state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. At any state $s\in\mathcal{S}$, if the agent takes an action $a\in\mathcal{A}$, it will receive a reward $r(s,a)\in[-R_{\max},R_{\max}]$ and transition to the next state $s^{\prime}\in\mathcal{S}$ with probability $\mathbb{P}(s^{\prime}|s,a)$. We call $r$ the reward function and $\mathbb{P}$ the transition kernel. Let $\gamma\in(0,1)$ be a discount factor; the goal in an MDP is then to find a sequence of actions $\{a_{t}\}_{t\geq 0}$ that maximizes the expected discounted cumulative reward $\mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,|\,s_{0}\sim\mu\big]$, where $\mu$ is the distribution of the initial state $s_{0}$.

Let $\Delta_{\mathcal{A}}$ denote the set of all probability distributions over the action space $\mathcal{A}$ , and let a policy $\pi:\mathcal{S}\mapsto\Delta_{\mathcal{A}}$ be a mapping that returns a probability distribution $\pi(\cdot|s)\in\Delta_{\mathcal{A}}$ given any state $s\in\mathcal{S}$ . If an agent follows a policy $\pi$ , then at any state $s_{t}$ , it will act by sampling an action $a_{t}\sim\pi(\cdot|s_{t})$ . Therefore, the action-value function (Q-function) under the policy $\pi$ is

Q^{\pi}(s,a):=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\right],

for $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$ , where all actions except $a_{0}$ are sampled according to $\pi$ . For any mapping $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ , let the Bellman operator $\mathcal{T}^{\pi}$ be

\mathcal{T}^{\pi}Q(s,a):=r(s,a)+\gamma\,\mathbb{E}\left[Q(s^{\prime},a^{\prime})\mid s^{\prime}\sim\mathbb{P}(\cdot\mid s,a),\,a^{\prime}\sim\pi(\cdot\mid s^{\prime})\right],\quad\forall s,a.

Then $\mathcal{T}^{\pi}$ is a $\gamma$-contraction under the infinity norm and $Q^{\pi}$ is the unique solution to the fixed-point equation $Q=\mathcal{T}^{\pi}Q$ (Bertsekas, 2012). If the Q function is parameterized by some function $Q(s,a;\boldsymbol{\theta})$ to gain better scalability for large-scale RL problems, popular approaches for finding a good $\boldsymbol{\theta}$ include minimizing the Mean-Squared Bellman Error (MSBE):

\min_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{(s,a)\sim\mu}\Big[\left(Q(s,a;\boldsymbol{\theta})-\mathcal{T}^{\pi}Q(s,a;\boldsymbol{\theta})\right)^{2}\Big],

(1)

and minimizing the Mean-Squared Projected Bellman Error (MSPBE):

\min_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{(s,a)\sim\mu}\left[\left(Q(s,a;\boldsymbol{\theta})-\Pi_{\mathcal{F}}\mathcal{T}^{\pi}Q(s,a;\boldsymbol{\theta})\right)^{2}\right],

(2)

where $\Theta$ is a feasible domain of the parameter $\boldsymbol{\theta}$, $\mu$ is some distribution over state-action pairs, and $\Pi_{\mathcal{F}}$ is the projection onto some function class $\mathcal{F}$. Typical choices of $\mathcal{F}$ include the Q function parameterization class itself $\mathcal{F}:=\{Q(\cdot;\boldsymbol{\theta}):\boldsymbol{\theta}\in\Theta\}$ (Maei et al., 2009), and some local linearization of the parameterization function class (Cai et al., 2023).
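To make the Bellman operator $\mathcal{T}^{\pi}$ and its contraction property concrete, the following is a minimal tabular sketch. The two-state, two-action MDP, its transition probabilities, rewards and policy are all made-up numbers chosen for illustration, and the names `bellman` and `sup_dist` are hypothetical helpers, not part of the method analyzed in this paper.

```python
# Toy tabular illustration of the Bellman operator T^pi (all numbers are made up).
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s2]: transition probabilities; R[s][a]: rewards; PI[s][a]: policy.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [-0.5, 2.0]]
PI = [[0.6, 0.4], [0.25, 0.75]]


def bellman(Q):
    """Apply T^pi to a table Q[s][a]."""
    out = [[0.0 for _ in ACTIONS] for _ in STATES]
    for s in STATES:
        for a in ACTIONS:
            # Expectation over s' ~ P(.|s,a) and a' ~ pi(.|s').
            expected_next = sum(
                P[s][a][s2] * sum(PI[s2][a2] * Q[s2][a2] for a2 in ACTIONS)
                for s2 in STATES
            )
            out[s][a] = R[s][a] + GAMMA * expected_next
    return out


def sup_dist(Q1, Q2):
    """Infinity-norm distance between two Q tables."""
    return max(abs(Q1[s][a] - Q2[s][a]) for s in STATES for a in ACTIONS)


Q_a = [[0.0, 0.0], [0.0, 0.0]]
Q_b = [[3.0, -1.0], [2.0, 5.0]]
# Contraction check: ||T Q_a - T Q_b||_inf <= gamma * ||Q_a - Q_b||_inf.
contraction_ratio = sup_dist(bellman(Q_a), bellman(Q_b)) / sup_dist(Q_a, Q_b)

# Repeated application of T^pi approaches the fixed point Q^pi.
Q_pi = Q_a
for _ in range(500):
    Q_pi = bellman(Q_pi)
fixed_point_residual = sup_dist(Q_pi, bellman(Q_pi))
```

In this toy setting the MSBE objective (1) is zero exactly at the table `Q_pi`, which is why the residual above is the natural quantity to monitor.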

In this paper, we study the neural temporal difference learning method where the action-value function is parameterized by some multi-layer neural network. Let us define a feedforward neural network by the following recursion:

\boldsymbol{x}^{(l)}=\frac{1}{\sqrt{m}}\sigma\left(\boldsymbol{W}_{l}\boldsymbol{x}^{(l-1)}\right),\quad l\in\{1,2,\cdots,L\},

(3)

where $\boldsymbol{W}_{1}\in\mathbb{R}^{m\times d}$, $\boldsymbol{W}_{l}\in\mathbb{R}^{m\times m}$ for $2\leq l\leq L$ are the weight matrices of the network, $\sigma(\cdot)$ is an activation function, and the input is a feature map $\boldsymbol{x}^{(0)}=\phi(s,a)\in\mathbb{R}^{d}$ for any state-action pair $(s,a)$. For simplicity of notation, we write $\boldsymbol{x}=\boldsymbol{x}^{(0)}$, then $Q(s,a)$ is parameterized by

Q(\boldsymbol{x};\boldsymbol{\theta})=\frac{1}{\sqrt{m}}\boldsymbol{b}^{\top}\boldsymbol{x}^{(L)},

(4)

where the parameter $\boldsymbol{\theta}=\left(\mbox{Vec}(\boldsymbol{W}_{1});\cdots;\mbox{Vec}(\boldsymbol{W}_{L})\right)$ denotes the collection of all weight matrices, and $\boldsymbol{b}$ is given by a random initialization. $\mbox{Vec}(\cdot)$ stands for the vectorization operator that reshapes a matrix to a column vector by stacking its columns one by one and the “;” separator in $\boldsymbol{\theta}$ stands for the vertical stacking of the elements. That is, we reshape $\boldsymbol{\theta}$ to a long column vector for notational convenience in later discussion.
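The recursion (3) and the output layer (4) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the sizes `d`, `m`, `L`, the input vector, and the helper names `forward` and `init_params` are assumptions made here, and `math.tanh` is used as a stand-in smooth activation rather than one prescribed by the paper.

```python
import math
import random


def forward(x, weights, b, sigma=math.tanh):
    """Evaluate Q(x; theta) via recursion (3) and output layer (4).

    weights: list of L matrices (lists of rows); weights[0] is m x d, the rest m x m.
    b: length-m vector with entries in {-1, +1}, kept fixed during training.
    """
    m = len(b)
    h = list(x)
    for W in weights:
        # x^(l) = sigma(W_l x^(l-1)) / sqrt(m)
        h = [sigma(sum(w * v for w, v in zip(row, h))) / math.sqrt(m) for row in W]
    # Q(x; theta) = b^T x^(L) / sqrt(m)
    return sum(bi * hi for bi, hi in zip(b, h)) / math.sqrt(m)


def init_params(d, m, L, rng):
    """Draw weight entries from N(0, 1) and entries of b uniformly from {-1, +1}."""
    in_dims = [d] + [m] * (L - 1)
    weights = [
        [[rng.gauss(0.0, 1.0) for _ in range(in_dims[l])] for _ in range(m)]
        for l in range(L)
    ]
    b = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    return weights, b


rng = random.Random(0)
d, m, L = 3, 8, 2                      # illustrative sizes only
weights, b = init_params(d, m, L, rng)
x = [0.5, -0.2, 0.1]                   # stands in for the feature map phi(s, a)
q_value = forward(x, weights, b)
```

With a bounded activation such as `tanh`, every entry of $\boldsymbol{x}^{(L)}$ is at most $1/\sqrt{m}$ in magnitude, so the output of this sketch always lies in $[-1,1]$.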

Assumption 2.1.

The activation function $\sigma(\cdot)$ is $L_{1}$ -Lipschitz and $L_{2}$ -smooth, i.e. , for $\forall y_{1},y_{2}\in\mathbb{R}:$

\left|\sigma(y_{1})-\sigma(y_{2})\right|\leq L_{1}|y_{1}-y_{2}|

and

\left|\sigma^{\prime}(y_{1})-\sigma^{\prime}(y_{2})\right|\leq L_{2}|y_{1}-y_{2}|.

Assumption 2.1 indicates that our results below are not based on the popular ReLU activation function. Instead, we primarily focus on twice-differentiable activation functions (such as Sigmoid, ELU, GeLU, etc.), which are smooth approximations of the ReLU function and are frequently utilized in practical problems (Devlin et al., 2018; Godfrey, 2019). Such a setup aligns with (Liu et al., 2020b), and provides an $\mathcal{O}(m^{-\frac{1}{2}})$-smooth property for the neural Q-function.

Let $\boldsymbol{\theta}^{0}=\left(\mbox{Vec}(\boldsymbol{W}_{1}^{0});\cdots;\mbox{Vec}(\boldsymbol{W}_{L}^{0})\right)$ be the initial solution. For each $l$, we initialize the weights of $\boldsymbol{W}_{l}^{0}$ element-wise from a normal distribution $\mathcal{N}(0,1)$ and each element of $\boldsymbol{b}$ is drawn uniformly from $\{-1,+1\}$. The parameter $\boldsymbol{b}$ will not be optimized during training. For regularity purposes, we would like to restrict the iterates to a bounded set around $\boldsymbol{\theta}^{0}$, which is defined as

S_{\omega}:=\left\{\boldsymbol{\theta}=\left(\mbox{Vec}\left(\boldsymbol{W}_{1}\right);\cdots;\mbox{Vec}(\boldsymbol{W}_{L})\right):\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|_{2}\leq\omega,\ 1\leq l\leq L\right\}.
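Since $S_{\omega}$ is a Euclidean ball around $\boldsymbol{\theta}^{0}$, the projection onto it has a closed form: a point outside the ball is pulled back along the line towards $\boldsymbol{\theta}^{0}$. A minimal sketch, with the function name `project_onto_ball` and the numerical inputs chosen here purely for illustration:

```python
import math


def project_onto_ball(theta, theta0, omega):
    """Euclidean projection of theta onto {v : ||v - theta0||_2 <= omega}."""
    diff = [t - t0 for t, t0 in zip(theta, theta0)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm <= omega:
        return list(theta)          # already feasible, nothing to do
    scale = omega / norm            # shrink the offset back onto the sphere
    return [t0 + scale * c for t0, c in zip(theta0, diff)]


theta0 = [0.0, 0.0]
inside = project_onto_ball([0.3, 0.4], theta0, omega=1.0)    # norm 0.5, unchanged
outside = project_onto_ball([3.0, 4.0], theta0, omega=1.0)   # norm 5, rescaled
```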

In each iteration $t$, the neural TD learning algorithm obtains a sample of state-action-reward-transition tuple $(s_{t},a_{t},r_{t},s_{t+1},a_{t+1})$ and computes the TD error by

\Delta_{t}=Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\Big(r_{t}+\gamma Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\Big)

(5)

with $\boldsymbol{x}_{t}=\phi(s_{t},a_{t}),\boldsymbol{x}_{t+1}=\phi(s_{t+1},a_{t+1}).$ Then a projected stochastic semi-gradient step is performed to update the weight matrices:

\boldsymbol{\theta}^{t+1}=\Pi_{S_{\omega}}\Big(\boldsymbol{\theta}^{t}-\eta_{t}\boldsymbol{g}\left(\boldsymbol{\theta}^{t}\right)\Big)

(6)

with

\boldsymbol{g}(\boldsymbol{\theta}^{t})=\Delta_{t}\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}).
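One update of the form (5)-(6) can be sketched as follows. To keep the example short, the Q values and the gradient at the current weights are passed in as plain numbers, and the usage below substitutes a hypothetical linear model $Q(\boldsymbol{x};\boldsymbol{\theta})=\langle\boldsymbol{\theta},\boldsymbol{x}\rangle$ for the neural network; the function names and all numerical values are assumptions made for illustration.

```python
import math


def td_step(theta, theta0, q_t, q_next, grad_q_t, reward, gamma, eta, omega):
    """One projected semi-gradient update following (5)-(6).

    q_t, q_next: Q(x_t; theta) and Q(x_{t+1}; theta) at the current weights.
    grad_q_t: gradient of Q(x_t; .) at theta, as a flat list.
    """
    delta = q_t - (reward + gamma * q_next)        # TD error (5)
    g = [delta * gi for gi in grad_q_t]            # semi-gradient: target is not differentiated
    step = [t - eta * gi for t, gi in zip(theta, g)]
    # Projection onto the ball S_omega centered at theta0.
    diff = [s - t0 for s, t0 in zip(step, theta0)]
    norm = math.sqrt(sum(c * c for c in diff))
    if norm > omega:
        step = [t0 + (omega / norm) * c for t0, c in zip(theta0, diff)]
    return step, delta


def linear_q(x, theta):
    """Hypothetical linear stand-in for the network: Q(x; theta) = <theta, x>."""
    return sum(xi * ti for xi, ti in zip(x, theta))


theta0 = [0.0, 0.0]
theta = [0.5, -0.5]
x_t, x_next = [1.0, 0.0], [0.0, 1.0]
# For the linear stand-in, the gradient of Q(x_t; .) is x_t itself.
new_theta, delta = td_step(theta, theta0, linear_q(x_t, theta), linear_q(x_next, theta),
                           x_t, reward=1.0, gamma=0.9, eta=0.1, omega=10.0)
# The same step with a small radius, so that the projection is active.
clipped, _ = td_step(theta, theta0, linear_q(x_t, theta), linear_q(x_next, theta),
                     x_t, reward=1.0, gamma=0.9, eta=0.1, omega=0.1)
```

Note that only $Q(\boldsymbol{x}_{t};\cdot)$ is differentiated; the target $r_{t}+\gamma Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})$ is treated as a constant, which is what makes this a semi-gradient rather than a gradient of a fixed loss.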

We formally describe the neural TD learning method in Algorithm 1.

Algorithm 1 Neural Temporal Difference Learning with Markovian Sampling

Input: A learning policy $\pi$, a discount factor $\gamma\in(0,1)$, a sequence of learning rates $\{\eta_{t}\}_{t\geq 0}$, a maximum iteration number $T$, a projection radius $\omega>0$, a Q network with architecture (4).

Initialization: Generate each entry of $\boldsymbol{W}_{l}^{0}$ independently from $\mathcal{N}(0,1)$, for $l=1,2,\cdots,L$, and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$. Generate $s_{0}\sim\mu,a_{0}\sim\pi(\cdot|s_{0})$.

for $t=0,1,\cdots,T-1$

    Sample $(s_{t},a_{t},r_{t},s_{t+1},a_{t+1})$ from the learning policy $\pi$ with $a_{t+1}\sim\pi(\cdot|s_{t+1})$.

    Compute the TD error $\Delta_{t}$ by (5).

    Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (6).

end for

Output: $\boldsymbol{\theta}^{T}$

One remark is that, under the non-i.i.d. Markovian sampling setting, the agent is only able to generate a trajectory of samples following some given learning policy $\pi$, which is very common in offline RL (Wu et al., 2019; Levine et al., 2020; Kostrikov et al., 2021) where the data trajectories are generated by some learning policy.

In later sections, we will revisit Algorithm 1 and design a novel subspace analysis technique for this method, achieving an improved sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1})$. Moreover, by replacing the TD error (5) induced by the Bellman operator with the TD error induced by the Bellman optimality operator: $\Delta_{t}=Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\big(r_{t}+\gamma\max_{b\in\mathcal{A}}Q(s^{\prime},b;\boldsymbol{\theta}^{t})\big)$, Algorithm 1 reduces to the neural Q-learning method for finding the optimal state-action value $Q^{*}$. Our analysis for neural TD learning can be extended to neural Q-learning analogously, yielding the same $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity.

3 Convergence of Neural Temporal Difference Learning

3.1 Basic Settings and Assumptions

To analyze Algorithm 1, let us first define the local linearization function class of the multi-layer Q network (4) at the random initialization $\boldsymbol{\theta}^{0}$ :

\mathcal{F}_{\omega,m}:=\left\{\widehat{Q}(\cdot\,;\boldsymbol{\theta})=Q(\cdot\,;\boldsymbol{\theta}^{0})+\left<\nabla_{\boldsymbol{\theta}}Q(\cdot\,;\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right>\right\}

(7)

for any $\boldsymbol{\theta}\in S_{\omega}$ . Consider the MSPBE minimization problem:

\min_{\boldsymbol{\theta}\in S_{\omega}}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}^{\pi}Q(\boldsymbol{x};\boldsymbol{\theta})\right)^{2}\right],

(8)

where $\mu$ is the initial state distribution, $\pi$ is the learning policy, and $\mathbb{P}$ is the transition kernel. The expectation $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ is taken over $s\sim\mu$, $a\sim\pi(\cdot|s)$, and $s^{\prime}\sim\mathbb{P}(\cdot|s,a),a^{\prime}\sim\pi(\cdot|s^{\prime})$ in $\mathcal{T}^{\pi}$. Define the set $\Xi_{\beta}$ as

\Xi_{\beta}:=\left\{\boldsymbol{\theta}\in S_{\beta}:\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}^{\pi}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}),\ \forall\boldsymbol{x}=\phi(s,a)\right\}.

(9)

Then the set $\Xi_{\omega}$ consists of the points $\boldsymbol{\theta}$ with which $\widehat{Q}(\cdot\,;\boldsymbol{\theta})$ forms a fixed point of the projected Bellman operator $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}^{\pi}$ for the problem (8). By Section 4.1 in (Cai et al., 2023), the fixed point of $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}^{\pi}$ is unique for $\boldsymbol{\theta}\in S_{\omega}$ . Therefore, the following relationship holds

\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{\prime})

(10)

for $\forall\boldsymbol{x}=\phi(s,a),\,\,\forall(s,a)\in\mathcal{S}\times\mathcal{A},\,\,\forall\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\Xi_{\beta},\forall\beta\geq\omega.$ Moreover, it is also shown that a point $\boldsymbol{\theta}^{*}$ belongs to $\Xi_{\omega}$ if and only if it satisfies the stationarity condition:

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{*}\right)\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\big\rangle\right]\geq 0,

(11)

where $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})\in\mathcal{F}_{\omega,m}$ is a local linearization provided by (7) and $\widehat{\Delta}$ is defined as

\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{*}\right)=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Big(r(s,a)+\gamma\widehat{Q}\left(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{*}\right)\Big).

Hence one may analyze the gap between $Q^{\pi}(\cdot)$ and $Q(\cdot\,;\boldsymbol{\theta}^{T})$ by first connecting it to $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})$. Based on this, Cai et al. (2023) derived an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for the neural TD method. Now we define

\Sigma_{\pi}=\mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})^{\top}\right].

(12)

It is worth noting that the matrix $\Sigma_{\pi}$ only depends on $\pi$ and $\boldsymbol{\theta}^{0}$. Prior works (Zou et al., 2019; Xu & Gu, 2020) in fact assumed positive definiteness ($\succ 0$) of the matrix $\Sigma_{\pi}$ in (12), which can be viewed as a generalized version of the positive definite feature covariance matrix assumption in the analysis of linear TD and linear Q-learning, see e.g. (Zou et al., 2019). However, in this paper we adopt the following weaker regularity assumption.

Assumption 3.1.

Let $\overline{\sigma_{\min}}(\Sigma_{\pi})$ denote the minimum non-zero singular value of the matrix $\Sigma_{\pi}$ , then there exist constants $\lambda_{0},m^{*}>0$ such that $\overline{\sigma_{\min}}(\Sigma_{\pi})\geq\lambda_{0}$ as long as the Q network width $m\geq m^{*}$ .

For neural Q function approximation, a sufficient but not necessary condition for Assumption 3.1 can be obtained by exploiting the theory of over-parameterized neural networks. Roughly speaking, for a finite MDP with an $L$-layer ReLU Q network, if the feature map satisfies $\phi(s,a)\nparallel\phi(s^{\prime},a^{\prime})$ for $\forall(s,a)\neq(s^{\prime},a^{\prime})$, the results of (Jacot et al., 2018; Allen-Zhu et al., 2019a, b; Cao & Gu, 2019, 2020) suggest that there exist $\lambda^{\prime},m^{*}>0$ such that with high probability $\mathrm{Gram}(\boldsymbol{\theta}^{0})\succ\lambda^{\prime}\cdot\mathbf{I}$ for networks with width $m\geq m^{*}.$ Here $\mathrm{Gram}(\boldsymbol{\theta}^{0})$ stands for the Gram matrix of the network at the initialization $\boldsymbol{\theta}^{0}$. A lower bound on $\overline{\sigma_{\min}}(\Sigma_{\pi})$ can then be constructed with $\lambda^{\prime}$; see Remark D.6 in Appendix D.

Finally, to facilitate the sample complexity analysis under the non-i.i.d. Markovian sampling setting, let us make the following assumption on the fast mixing rate of the MDP sample trajectories, which is widely adopted in the related analysis (Zou et al., 2019; Xu & Gu, 2020; Cai et al., 2023).

Assumption 3.2.

We assume that the Markov chain $\left\{s_{t}\right\}_{t=0,1,\ldots}$ induced by the learning policy $\pi$ and the transition kernel $\mathbb{P}$ is uniformly ergodic with its invariant measure $\mathbb{P}^{\pi}$ . Furthermore, we assume that there are constants $\kappa>0,\rho\in(0,1)$ such that

\sup_{s\in\mathcal{S}}d_{TV}\left(\mathbb{P}\left(s_{t}\in\cdot\mid s_{0}=s\right),\mathbb{P}^{\pi}\right)\leq\kappa\rho^{t}

for all $t\geq 0.$
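As a sanity check on what Assumption 3.2 asks for, the following sketch computes the worst-case total variation distance to the invariant measure for a hypothetical two-state Markov chain. The switching probabilities `a` and `b` are made-up numbers; for such a chain the distance decays exactly geometrically, with $\rho=|1-a-b|$ and $\kappa=\max(a,b)/(a+b)$.

```python
def step_distribution(dist, P):
    """One step of the chain: returns dist @ P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv_distance(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical two-state chain with switching probabilities a and b.
a, b = 0.3, 0.2
P = [[1 - a, a], [b, 1 - b]]
stationary = [b / (a + b), a / (a + b)]   # invariant measure of P
rho = abs(1 - a - b)                      # modulus of the second eigenvalue of P
kappa = max(a, b) / (a + b)               # worst-case TV distance at t = 0

worst_tv = []
dists = [[1.0, 0.0], [0.0, 1.0]]          # point masses at s_0 = 0 and s_0 = 1
for t in range(20):
    # sup over initial states of the TV distance after t steps
    worst_tv.append(max(tv_distance(d, stationary) for d in dists))
    dists = [step_distribution(d, P) for d in dists]
```

Each entry `worst_tv[t]` is bounded by `kappa * rho ** t`, which is the form of bound required by the assumption.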

Without loss of generality, we also make the following technical assumption, which, unlike Assumptions 3.1 and 3.2, is not fundamental.

Assumption 3.3.

We assume the initial state distribution $\mu$ to be the stationary state distribution under policy $\pi$ .

This assumption is in fact very natural. The stationarity of $\mu$ can always be guaranteed by discarding the first $\tilde{\mathcal{O}}(t_{\mathrm{mix}})$ samples, and Assumption 3.2 indicates that the mixing time $t_{\mathrm{mix}}=\tilde{\mathcal{O}}(1)$. This assumption guarantees that the operator $\mathcal{T}^{\pi}$ is $\gamma$-contractive w.r.t. $\|\cdot\|_{\mu}$ in policy evaluation. Similar assumptions are included in (Bhandari et al., 2018; Cai et al., 2023).

3.2 An Improved Complexity of Neural TD Learning

To derive the $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity, we rely on the following key observation on subspace decomposition, which is beyond the existing analysis framework.

Proposition 3.4.

Let $\mathcal{R}(\Sigma_{\pi})$ and $\mathcal{K}(\Sigma_{\pi})$ denote the range space and kernel space of the matrix $\Sigma_{\pi}$ , respectively. Then for any parameter $\boldsymbol{\theta}\in S_{\omega}$ , there exists $\boldsymbol{\theta}_{*}$ such that

\boldsymbol{\theta}_{*}\in\Xi_{2\omega}\qquad\mbox{and}\qquad\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\in\mathcal{R}(\Sigma_{\pi}),

which also implies that the projections of $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_{*}$ onto the subspace $\mathcal{K}(\Sigma_{\pi})$ are identical.

Based on this argument, for the iteration sequence $\{\boldsymbol{\theta}^{t}\}_{t\geq 0}$ generated by Algorithm 1, there exists a sequence $\{\boldsymbol{\theta}^{t}_{*}\}_{t\geq 0}\subseteq\Xi_{2\omega}$ such that $\{\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\}_{t\geq 0}\subseteq\mathcal{R}(\Sigma_{\pi})$. Therefore, unlike the existing works that analyze $\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\|^{2}$ for some $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$, cf. (Cai et al., 2023; Xu & Gu, 2020), we will prove a much faster convergence in $\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}$. Combined with (10), this further indicates the improved sample complexity in this paper. The proof of this proposition is presented as follows.

Proof.

For the ease of discussion, let us denote the dimension of weight parameter $\boldsymbol{\theta}$ as $n$ . Then we may denote $\Sigma_{\pi}\in\mathbb{R}^{n\times n}$ and $\boldsymbol{\theta}\in\mathbb{R}^{n}$ . First of all let us fix an arbitrary $\bar{\boldsymbol{\theta}}\in\Xi_{\omega}$ , then we may decompose it into two orthogonal components:

\bar{\boldsymbol{\theta}}=\bar{\boldsymbol{\theta}}_{\parallel}+\bar{\boldsymbol{\theta}}_{\bot}\quad\mbox{s.t.}\quad\bar{\boldsymbol{\theta}}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})\mbox{ and }\bar{\boldsymbol{\theta}}_{\bot}\in\mathcal{K}(\Sigma_{\pi}).

Similarly, we can decompose the currently considered vector $\boldsymbol{\theta}$ as

\boldsymbol{\theta}=\boldsymbol{\theta}_{\parallel}+\boldsymbol{\theta}_{\bot}\quad\mbox{s.t.}\quad\boldsymbol{\theta}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})\mbox{ and }\boldsymbol{\theta}_{\bot}\in\mathcal{K}(\Sigma_{\pi}).

Note that having an arbitrary vector $\boldsymbol{v}\in\mathbb{R}^{n}$ in the kernel space of $\Sigma_{\pi}$ means that $\Sigma_{\pi}\boldsymbol{v}=0$ , which further indicates that

0=\boldsymbol{v}^{\top}\Sigma_{\pi}\boldsymbol{v}=\boldsymbol{v}^{\top}\mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})^{\top}\right]\boldsymbol{v}=\mathbb{E}_{\mu,\pi}\left[\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{v}\right\rangle^{2}\right].

Therefore, under the measure $(s,a)\sim\mu\times\pi$ , we have

\boldsymbol{v}\in\mathcal{K}(\Sigma_{\pi})\quad\Longrightarrow\quad\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{v}\right\rangle=0\quad a.s.

(13)

where $a.s.$ stands for almost surely. Therefore, defining $\boldsymbol{\theta}_{*}=\bar{\boldsymbol{\theta}}_{\parallel}+\boldsymbol{\theta}_{\bot}$, we can check the stationarity condition (11) for $\boldsymbol{\theta}_{*}$ by establishing:

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{*}\right),\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}\big\rangle\right]=\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right]\geq 0.

(14)

The proof of (14) is lengthy and is thus moved to Appendix A.1 for succinctness. As a result, we have $\boldsymbol{\theta}_{*}\in\Xi_{2\omega}$ and $\boldsymbol{\theta}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}_{\parallel}-\bar{\boldsymbol{\theta}}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})$. Note that for any $\boldsymbol{\theta}\in S_{\omega}$,

\begin{aligned}
\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{0}\|&\leq\|\bar{\boldsymbol{\theta}}_{\parallel}-\boldsymbol{\theta}^{0}_{\parallel}\|+\|\boldsymbol{\theta}_{\bot}-\boldsymbol{\theta}^{0}_{\bot}\|\\
&\leq\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|\leq 2\omega,
\end{aligned}

which completes the proof. ∎
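The construction of $\boldsymbol{\theta}_{*}$ can be illustrated numerically. The following NumPy sketch is not the authors' code: the random low-rank matrix `grads` is a hypothetical stand-in for the gradient features $\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})$, and the sampling measure is assumed uniform over the rows. It splits two parameter vectors into their range and kernel components with respect to the empirical $\Sigma_{\pi}$ and compares the linearized values of $\bar{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}_{*}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for grad_theta Q(x; theta^0): 50 sampled gradient
# vectors in R^8 that only span a 3-dimensional subspace.
n, rank, num_samples = 8, 3, 50
basis = rng.standard_normal((n, rank))
grads = rng.standard_normal((num_samples, rank)) @ basis.T  # shape (50, 8)

# Empirical covariance Sigma = E[grad grad^T] under a uniform measure.
sigma = grads.T @ grads / num_samples

# Orthogonal projector onto the range space R(Sigma), built from the
# eigenvectors with non-negligible eigenvalues; I - P projects onto K(Sigma).
eigvals, eigvecs = np.linalg.eigh(sigma)
tol = 1e-10 * eigvals.max()
range_vecs = eigvecs[:, eigvals > tol]
p_range = range_vecs @ range_vecs.T
p_kernel = np.eye(n) - p_range

theta_bar = rng.standard_normal(n)  # plays the role of \bar{theta}
theta = rng.standard_normal(n)      # plays the role of theta

# theta_* = \bar{theta}_parallel + theta_perp, as in the proof.
theta_star = p_range @ theta_bar + p_kernel @ theta

# Linear parts <grad, .> of the local linearization at every sample.
lin_bar = grads @ theta_bar
lin_star = grads @ theta_star
print("max difference in linearized values:", np.abs(lin_bar - lin_star).max())
```

The two vectors differ only by a kernel component, so by (13) their linearized values agree at every sampled point up to round-off.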

Following basic linear algebra analysis, we also have the following proposition.

Proposition 3.5.

Under Assumption 3.1, suppose the adopted Q network is sufficiently wide so that $m\geq m^{*}$. Then for any $\boldsymbol{\theta}\in\mathcal{R}(\Sigma_{\pi})$, we have $\boldsymbol{\theta}^{\top}\Sigma_{\pi}\boldsymbol{\theta}\geq\lambda_{0}\|\boldsymbol{\theta}\|^{2}_{2}$.
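A small numerical check makes the role of the subspace restriction concrete. The sketch below is an illustration only: `sigma` is a hypothetical rank-deficient positive semi-definite matrix, and `lambda_0` is taken to be its smallest nonzero eigenvalue, which is our reading of the constant in Assumption 3.1 rather than a definition quoted from it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical rank-deficient PSD matrix standing in for Sigma_pi.
n, rank = 6, 4
factor = rng.standard_normal((n, rank))
sigma = factor @ factor.T / rank

eigvals, eigvecs = np.linalg.eigh(sigma)
tol = 1e-10 * eigvals.max()
nonzero = eigvals > tol
lambda_0 = eigvals[nonzero].min()  # smallest nonzero eigenvalue (assumed role of lambda_0)
range_vecs = eigvecs[:, nonzero]   # orthonormal basis of R(Sigma)


def rayleigh_gap(theta):
    """Return theta^T Sigma theta - lambda_0 * ||theta||^2."""
    return float(theta @ sigma @ theta - lambda_0 * (theta @ theta))


# Vectors inside R(Sigma) satisfy the bound (gap >= 0 up to round-off).
gaps_in_range = [
    rayleigh_gap(range_vecs @ rng.standard_normal(rank)) for _ in range(100)
]

# A unit kernel vector violates it: theta^T Sigma theta = 0 < lambda_0.
kernel_vec = eigvecs[:, ~nonzero][:, 0]
gap_in_kernel = rayleigh_gap(kernel_vec)

print("smallest gap on the range space:", min(gaps_in_range))
print("gap for a kernel vector:", gap_in_kernel)
```

The bound therefore cannot hold on all of $\mathbb{R}^{n}$ when $\Sigma_{\pi}$ is singular, which is why the statement is restricted to $\mathcal{R}(\Sigma_{\pi})$.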

Proposition 3.4 indicates that the variations in the local linearization of Q-function values depend solely on the variations in parameters within the subspace $\mathcal{R}(\Sigma_{\pi})$. Meanwhile, Proposition 3.5 indicates that such local linearization is non-singular within $\mathcal{R}(\Sigma_{\pi})$. Based on these observations, we first establish a fast convergence rate of $\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}=\mathcal{O}(1/T)$ and then show that $\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big)^{2}\right]=\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T}_{*})\big)^{2}\right]\leq\mathcal{O}(1/T)$ for any $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$. We summarize this result in Theorem 3.6 and present its proof in Appendix A.2.

Theorem 3.6.

Suppose Assumptions 3.1, 3.2 and 3.3 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2(1-\gamma)\lambda_{0}(t+1)}$. If the feature map satisfies $\|\phi(s,a)\|=1$ for each state-action pair $(s,a)$ and the network width satisfies $m\geq m^{*}$, then the output $\boldsymbol{\theta}^{T}$ of Algorithm 1 satisfies

\begin{aligned}
&\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big)^{2}\mid\boldsymbol{\theta}^{0}\right]\\
&\leq\frac{\widetilde{C}_{3}(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{(1-\gamma)\lambda_{0}}\cdot\sqrt{\log(T/\delta)}\\
&\quad+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T},
\end{aligned}

with probability at least $1-2\delta-2L\exp\big(-\widetilde{C}_{2}m\big)$, where $\tau^{*}$ is the mixing time of the Markov chain in Assumption 3.2, and $\widetilde{C}_{1},\cdots,\widetilde{C}_{5}>0$ are universal constants.

Let $Q^{\pi}$ be the true state-action value function that satisfies the Bellman equation $Q^{\pi}=\mathcal{T}^{\pi}Q^{\pi}$ . Then based on the convergence of the local linearization in Theorem 3.6, we establish the global convergence of neural temporal difference learning as Theorem 3.7.

Theorem 3.7.

Suppose the conditions in Theorem 3.6 hold. Then the output of Algorithm 1 satisfies

\begin{aligned}
&\mathbb{E}\Big[\big(Q(\phi(s,a);\boldsymbol{\theta}^{T})-Q^{\pi}(s,a)\big)^{2}\mid\boldsymbol{\theta}^{0}\Big]\\
&\leq\frac{3\mathbb{E}\left[\left(Q^{\pi}(s,a)-\Pi_{\mathcal{F}_{\omega,m}}Q^{\pi}(s,a)\right)^{2}\right]}{(1-\gamma)^{2}}+\widetilde{C}_{6}m^{-1}\\
&\quad+\frac{\widetilde{C}_{7}(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{8}m^{-1/2}}{(1-\gamma)\lambda_{0}}\sqrt{\log(T/\delta)}\\
&\quad+\frac{\widetilde{C}_{9}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T}
\end{aligned}

(15)

with probability at least $1-2\delta-2L\exp\big(-\widetilde{C}_{2}m\big)$, where $\widetilde{C}_{6},\cdots,\widetilde{C}_{9}>0$ are universal constants.

Let $\epsilon_{\mathcal{F}}:=\frac{3}{(1-\gamma)^{2}}\mathbb{E}\big[\big(Q^{\pi}(s,a)-\Pi_{\mathcal{F}_{\omega,m}}Q^{\pi}(s,a)\big)^{2}\big]$ be the optimal approximation error of the function class $\mathcal{F}_{\omega,m}$. Then Theorem 3.7 demonstrates that, under suitable parameter choices, the neural TD learning method attains an approximation error of $\mathcal{O}(\epsilon_{\mathcal{F}}+\epsilon+m^{-\frac{1}{2}})$ within $\tilde{\mathcal{O}}(\epsilon^{-1})$ samples. Existing works, including Cai et al. (2023), Xu & Gu (2020), Tian et al. (2022), and Cayci et al. (2023), achieve an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity, and Sun et al. (2022) achieve $\mathcal{O}\big(\epsilon^{-\frac{2}{2-a}}\big)$, $a\in(0,1]$, with additional assumptions.
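To see where the $\tilde{\mathcal{O}}(\epsilon^{-1})$ count comes from, the following back-of-the-envelope calculation reads off the $T$-dependence of (15). It is a sketch rather than part of the formal analysis, and it assumes $\tau^{*}\geq 1$, $T\geq 3$, $\delta\in(0,1)$, and that each iteration of Algorithm 1 consumes one sample. Under these assumptions $\log T+1\leq 2\log T\leq 2\tau^{*}\left(\log(T/\delta)+1\right)\log T$, so the two terms of (15) that decay in $T$ satisfy

\frac{\widetilde{C}_{7}(\log T+1)+\widetilde{C}_{9}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T}\leq\frac{(2\widetilde{C}_{7}+\widetilde{C}_{9})\,\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T}.

The right-hand side is at most $\epsilon$ as soon as

T\geq\frac{(2\widetilde{C}_{7}+\widetilde{C}_{9})\,\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}\,\epsilon},

that is, $T=\tilde{\mathcal{O}}\big(\tau^{*}(1-\gamma)^{-2}\lambda_{0}^{-2}\epsilon^{-1}\big)$ with logarithmic factors hidden. By comparison, a bound decaying like $1/\sqrt{T}$ would need $T$ of order $\epsilon^{-2}$ to reach the same accuracy. The remaining terms of (15) depend on $T$ only through $\sqrt{\log(T/\delta)}$ and are controlled by the width $m$.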

Following a similar analysis while adopting an additional regularity assumption on the matrix $\Sigma_{\pi}$, one can further extend the above analysis to neural Q-learning by substituting the Bellman operator with the Bellman optimality operator. A similar $\mathcal{O}(\epsilon^{-1})$ sample complexity can still be achieved; the details are relegated to the Appendix for succinctness.

4 Convergence of Minimax Neural Q-Learning

A two-player zero-sum Markov game (Littman, 1994; Bowling & Veloso, 2001; Perolat et al., 2018), as a simple variant of the MDP, is defined as a six-tuple $\mathcal{M}=(\mathcal{S},\mathcal{A}_{1},\mathcal{A}_{2},\mathbb{P},r,\gamma)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ are the action spaces of the first and second player, respectively, $\mathbb{P}:\mathcal{S}\times\mathcal{A}_{1}\times\mathcal{A}_{2}\rightarrow\mathcal{P}(\mathcal{S})$ is the transition probability, $r:\mathcal{S}\times\mathcal{A}_{1}\times\mathcal{A}_{2}\rightarrow\mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. At time $t$, player 1 and player 2 simultaneously take actions $a^{1}_{t}\in\mathcal{A}_{1}$ and $a^{2}_{t}\in\mathcal{A}_{2}$. Player 1 obtains the reward $r(s_{t},a^{1}_{t},a^{2}_{t})$, while player 2 obtains $-r(s_{t},a^{1}_{t},a^{2}_{t})$. The goal of each player is to maximize its own cumulative reward. For a policy pair $(\pi_{1},\pi_{2})$, we can define the state-action value function as follows:

Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2})=\mathbb{E}_{\pi_{1},\pi_{2}}\left[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a^{1}_{t},a^{2}_{t})\mid s_{0}=s,\ a^{1}_{0}=a^{1},\ a^{2}_{0}=a^{2}\right],\quad\forall s,a^{1},a^{2}.

The optimal state-action value function $Q^{*}$ is defined as

\begin{aligned}
Q^{*}(s,a^{1},a^{2})&=\max_{\pi_{1}}\min_{\pi_{2}}Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2})\\
&=\min_{\pi_{2}}\max_{\pi_{1}}Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2}).
\end{aligned}

We call $\pi^{*}=\{\pi_{1}^{*},\pi_{2}^{*}\}$ the optimal policy pair if $Q^{*}(s,a^{1},a^{2})=Q^{\pi_{1}^{*},\pi_{2}^{*}}(s,a^{1},a^{2})$. Moreover, the Minimax Bellman operator $\mathcal{H}$ for the Markov game is defined as

\mathcal{H}Q(s,a^{1},a^{2})=r(s,a^{1},a^{2})+\gamma\,\mathbb{E}\left[\max_{b^{1}}\min_{b^{2}}Q(s^{\prime},b^{1},b^{2})\mid s^{\prime}\sim\mathbb{P}(\cdot\mid s,a^{1},a^{2})\right],\quad\forall s,a^{1},a^{2}.

Thus $\mathcal{H}Q^{*}=Q^{*}$ . Let the feature map $\boldsymbol{x}=\phi(s,a^{1},a^{2})$ and $\pi=\{\pi^{1},\pi^{2}\}$ be a given learning policy for players 1 and 2. Assume that $\{s_{t},a^{1}_{t},a^{2}_{t},r_{t}\}_{t=0}^{T}$ is a sampled trajectory of states, actions and rewards obtained from the environment using policy $\pi$ . Let us recall the definition of the local linearization function class $\mathcal{F}_{\omega,m}$ introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:

\min_{\boldsymbol{\theta}\in S_{\omega}}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{H}Q(\boldsymbol{x};\boldsymbol{\theta})\right)^{2}\right].

To solve this problem, we still adopt the projected stochastic semi-gradient iteration described by (6), that is,

\boldsymbol{\theta}^{t+1}=\Pi_{S_{\omega}}\Big(\boldsymbol{\theta}^{t}-\eta_{t}\boldsymbol{g}\left(\boldsymbol{\theta}^{t}\right)\Big),

(16)

while redefining the stochastic semi-gradient estimator $\boldsymbol{g}(\boldsymbol{\theta}^{t})$ as

\boldsymbol{g}(\boldsymbol{\theta}^{t})=\Delta\left(s_{t},a^{1}_{t},a^{2}_{t},s_{t+1};\boldsymbol{\theta}^{t}\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}),

where $\boldsymbol{x}_{t}:=\phi(s_{t},a^{1}_{t},a^{2}_{t})$ and

\Delta\left(s_{t},a^{1}_{t},a^{2}_{t},s_{t+1};\boldsymbol{\theta}^{t}\right)=Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\Big(r(s_{t},a^{1}_{t},a^{2}_{t})+\gamma\max_{b^{1}\in\mathcal{A}_{1}}\min_{b^{2}\in\mathcal{A}_{2}}Q(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})\Big).

(17)
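One iteration of (16) with the TD error (17) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the two-layer ELU network of Section 5 as the Q function, finite action sets, a one-hot feature map `phi`, Gaussian initialization of `theta0` with random signs `b`, and it takes $S_{\omega}$ to be the Euclidean ball of radius $\omega$ around $\boldsymbol{\theta}^{0}$. All function names are hypothetical.

```python
import numpy as np


def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def elu_grad(z):
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))


def q_value(theta, b, x):
    """Two-layer network Q(x; theta) = m^{-1/2} * sum_r b_r * elu(theta_r^T x)."""
    m = theta.shape[0]
    return float(b @ elu(theta @ x) / np.sqrt(m))


def q_grad(theta, b, x):
    """Gradient of Q with respect to theta, with the same shape (m, d) as theta."""
    m = theta.shape[0]
    return (b * elu_grad(theta @ x))[:, None] * x[None, :] / np.sqrt(m)


def maxmin_value(theta, b, phi, s_next, n_a1, n_a2):
    """max over b1 of min over b2 of Q(phi(s', b1, b2); theta)."""
    table = np.array([[q_value(theta, b, phi(s_next, b1, b2))
                       for b2 in range(n_a2)] for b1 in range(n_a1)])
    return table.min(axis=1).max()


def project_ball(theta, theta0, omega):
    """Euclidean projection onto {theta : ||theta - theta0|| <= omega}."""
    diff = theta - theta0
    norm = np.linalg.norm(diff)
    return theta if norm <= omega else theta0 + diff * (omega / norm)


def minimax_td_step(theta, theta0, b, phi, transition, gamma, eta, omega, n_a1, n_a2):
    s, a1, a2, r, s_next = transition
    x = phi(s, a1, a2)
    target = r + gamma * maxmin_value(theta, b, phi, s_next, n_a1, n_a2)
    delta = q_value(theta, b, x) - target                # TD error, cf. (17)
    g = delta * q_grad(theta, b, x)                      # stochastic semi-gradient
    return project_ball(theta - eta * g, theta0, omega)  # projected update, cf. (16)


# Toy usage with 3 states and 2 actions per player (all values hypothetical).
rng = np.random.default_rng(0)
n_s, n_a1, n_a2, m = 3, 2, 2, 16
d = n_s * n_a1 * n_a2


def phi(s, a1, a2):
    x = np.zeros(d)
    x[(s * n_a1 + a1) * n_a2 + a2] = 1.0  # one-hot feature with unit norm
    return x


theta0 = rng.standard_normal((m, d))
b = rng.choice([-1.0, 1.0], size=m)
gamma, eta, omega = 0.9, 0.1, 0.5
theta1 = minimax_td_step(theta0, theta0, b, phi, (0, 1, 0, 1.0, 2),
                         gamma, eta, omega, n_a1, n_a2)
```

The gradient is taken only through $Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})$ and not through the max-min target, which is what makes the update a semi-gradient step.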

Now we redefine the function class $\mathcal{F}_{\omega,m}$ as the collection of all local linearizations of $Q(\boldsymbol{x};\boldsymbol{\theta})$ at the initial point $\boldsymbol{\theta}^{0}$:

\mathcal{F}_{\omega,m}=\left\{\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=Q(\boldsymbol{x};\boldsymbol{\theta}^{0})+\left<\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right>,\ \boldsymbol{\theta}\in S_{\omega}\right\}.

To analyze this method, for any $\beta>0$ , we redefine the set $\Xi_{\beta}$ introduced in (3.1) by replacing the Bellman operator $\mathcal{T}^{\pi}$ with the Minimax Bellman operator $\mathcal{H}$ . Similar to (10), we still have a point $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$ if and only if

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}^{*}\right)\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\phi(s,a^{1},a^{2});\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\big\rangle\right]\geq 0,

where $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})\in\mathcal{F}_{\omega,m}$, and $\widehat{\Delta}\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}\right)$ has the same structure as $\Delta\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}\right)$ except that the function $Q(\cdot;\boldsymbol{\theta})$ is replaced by $\widehat{Q}(\cdot;\boldsymbol{\theta})$.

Unlike the neural temporal difference learning method, which aims at evaluating the state-action values of a fixed learning policy, the Minimax Bellman operator significantly complicates the analysis. Let us redefine the feature covariance matrix $\Sigma_{\pi}$ with respect to the learning policy $\pi=\{\pi^{1},\pi^{2}\}$, that is,

\Sigma_{\pi}=\mathbb{E}_{\pi}\left[\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0})^{\top}\right].

Let the actions $(a^{1}_{\boldsymbol{\theta}},a^{2}_{\boldsymbol{\theta}})$ satisfy $\left<\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}},a^{2}_{\boldsymbol{\theta}};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right>=\max_{a^{1}\in\mathcal{A}_{1}}\min_{a^{2}\in\mathcal{A}_{2}}\left<\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right>$. For each parameter pair $\boldsymbol{\theta}_{s}=(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})$, we define the action pair $(a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}})$ that satisfies

(a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}})={\arg\max}_{(a^{1},a^{2})\in\left\{\left(a^{1}_{\boldsymbol{\theta}_{1}},a^{2}_{\boldsymbol{\theta}_{2}}\right),\left(a^{1}_{\boldsymbol{\theta}_{2}},a^{2}_{\boldsymbol{\theta}_{1}}\right)\right\}}\Big\{\left\|\left<\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\right>\right\|\Big\}.

Then for any $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}$ , the minimax feature covariance matrix is defined as follows:

\Sigma^{*}_{\pi}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})=\mathbb{E}_{\pi}\left[\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}};\boldsymbol{\theta}^{0})^{\top}\right].

Figure 1: Training curves and the ratio of the largest and smallest non-zero singular values of $\Sigma_{\pi}$ over different network widths $m$.

Assumption 4.1.

For any $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}$, there exists a constant $\nu\in(0,1)$ such that $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\succeq 0$.
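For given matrices, the condition in Assumption 4.1 can be checked through the smallest eigenvalue of the difference. The sketch below is an illustration with hypothetical diagonal matrices standing in for $\Sigma_{\pi}$ and $\Sigma_{\pi}^{*}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})$; the function names and the tolerance are our own choices.

```python
import numpy as np


def gap_min_eigenvalue(sigma, sigma_star, gamma, nu):
    """Smallest eigenvalue of (1 - nu)^2 * sigma - gamma^2 * sigma_star."""
    gap = (1.0 - nu) ** 2 * sigma - gamma ** 2 * sigma_star
    gap = 0.5 * (gap + gap.T)  # symmetrize against round-off
    return float(np.linalg.eigvalsh(gap).min())


def satisfies_assumption(sigma, sigma_star, gamma, nu, tol=1e-10):
    """Positive semi-definiteness check, up to a numerical tolerance."""
    return gap_min_eigenvalue(sigma, sigma_star, gamma, nu) >= -tol


# Hypothetical singular covariance matrices sharing the same kernel direction.
sigma = np.diag([1.0, 1.0, 0.0])
sigma_star = np.diag([0.5, 0.5, 0.0])

ok_small_nu = satisfies_assumption(sigma, sigma_star, gamma=0.9, nu=0.1)
ok_large_nu = satisfies_assumption(sigma, sigma_star, gamma=0.9, nu=0.5)
min_eig = gap_min_eigenvalue(sigma, sigma_star, gamma=0.9, nu=0.1)
```

In this toy case the semi-definite condition holds for $\nu=0.1$ and fails for $\nu=0.5$. The smallest eigenvalue for $\nu=0.1$ is exactly zero because both matrices are singular in the same direction, so a strictly positive definite version of the condition could not hold here.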

Note that the original version of this assumption in Zou et al. (2019) in fact requires a strictly positive definite condition: $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\succ 0$. Under this additional assumption, Zou et al. (2019) obtained an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for minimax Q-learning with linear function approximation. With the help of our subspace analysis technique, in this paper, we relax it to positive semi-definiteness ($\succeq 0$). Now we are ready to state our result for minimax neural Q-learning.

Theorem 4.2.

Suppose Assumptions 3.1, 3.2 and 4.1 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$. If the feature map satisfies $\|\phi(s,a^{1},a^{2})\|=1$ for each state-action pair $(s,a^{1},a^{2})$ and the network width satisfies $m\geq m^{*}$, then the output $\boldsymbol{\theta}^{T}$ of neural minimax Q-learning Algorithm 3 satisfies

\begin{aligned}
&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]\\
&\leq\frac{\widetilde{C}_{3}(\log T+1)}{\nu^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\cdot\sqrt{\log(T/\delta)}\\
&\quad+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{\nu^{2}\lambda_{0}^{2}T},
\end{aligned}

with probability at least $1-2\delta-2L\exp{(-\widetilde{C}_{2}m)}$, where $\tau^{*}$ is the mixing time of the Markov chain in Assumption 3.2, and $\left\{\widetilde{C}_{i}>0\right\}_{i=1,\ldots,5}$ are universal constants.

Theorem 4.2 establishes a finite-time analysis of $\tilde{\mathcal{O}}(\epsilon^{-1})$ -sample complexity for minimax neural Q-learning in terms of the function class $\mathcal{F}_{\omega,m}$ . For a more specific description and theorem proof, see Appendix C. To the best of our knowledge, this is the first analysis of minimax Q-learning with neural network function approximation, characterized by a complexity bound of $\tilde{\mathcal{O}}(\epsilon^{-1})$ .

5 Experiments

Finally, we conduct several experiments on OpenAI Gym (Brockman et al., 2016) tasks to validate our theoretical findings. We consider a two-layer neural network of the form

Q(s,a;\boldsymbol{\theta}):=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}b_{r}\sigma(\boldsymbol{\theta}_{r}^{\top}\phi(s,a)),

where $\sigma(\cdot)$ is the ELU activation in this section. Details regarding the initialization and iteration methods for the parameters can be found in Section 2. For all experiments, we generate samples based on a prescribed $\epsilon$-greedy policy with $\epsilon=0.1$. To prevent redundancy in the features $\phi(s,a)$, we employ one-hot encoding for discrete state-action spaces and a fixed grid discretization for continuous spaces. When both $\phi(s,a)$ and $\phi(s^{\prime},a^{\prime})$ belong to the same one-hot encoding or grid cell, we treat them as the same sample point. We investigate the impact of network width on the TD learning algorithm from two perspectives: (i) examining whether the network width $m$ is correlated with the TD error, and (ii) exploring the existence of constants $m^{*}$ and $\lambda_{0}$ that satisfy Assumption 3.1.
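The feature construction and the network above can be sketched as follows. This is an illustrative reading of the setup, not the experimental code: the helper names, the uniform grid over a box `[low, high]`, and the treatment of the action as an extra one-hot index are assumptions.

```python
import numpy as np


def one_hot_feature(s, a, n_states, n_actions):
    """One-hot feature phi(s, a) for discrete states and actions."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x


def grid_feature(state, a, low, high, bins, n_actions):
    """Fixed-grid discretization of a continuous state, then one-hot with the action."""
    state, low, high = (np.asarray(v, dtype=float) for v in (state, low, high))
    frac = (state - low) / (high - low)
    idx = np.clip((frac * bins).astype(int), 0, bins - 1)
    cell = int(np.ravel_multi_index(tuple(idx), (bins,) * state.size))
    return one_hot_feature(cell, a, bins ** state.size, n_actions)


def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def q_network(x, theta, b):
    """Q(s, a; theta) = m^{-1/2} * sum_r b_r * elu(theta_r^T phi(s, a))."""
    m = theta.shape[0]
    return float(b @ elu(theta @ x) / np.sqrt(m))


# Toy usage: a 2-dimensional state in [-1, 1]^2, 4 bins per axis, 2 actions.
x = grid_feature([0.5, -0.2], a=1, low=[-1.0, -1.0], high=[1.0, 1.0],
                 bins=4, n_actions=2)
rng = np.random.default_rng(0)
m = 32
theta = rng.standard_normal((m, x.size))  # hypothetical initialization
b = rng.choice([-1.0, 1.0], size=m)
q = q_network(x, theta, b)
```

Both feature maps return unit-norm vectors, so they satisfy the normalization $\|\phi(s,a)\|=1$ required in Theorem 3.6, and two inputs falling in the same grid cell share exactly the same feature.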

The four subfigures in Figure 1 cover two types of environments: one with a discrete state space and the other with a continuous state space. The first two subfigures depict the convergence of the TD algorithm at different network widths. We generate 2,000 sample points and run for 500 epochs. Notably, as the width $m$ increases, the TD algorithm converges faster and reaches smaller final TD errors. The latter two subfigures illustrate the existence of $m^{*}$ and $\lambda_{0}$. Specifically, we compute the largest non-zero singular value $\sigma_{\max}$ and the smallest non-zero singular value $\sigma_{\min}$ of the matrix $\Sigma_{\pi}$. To remove the dependence on the absolute magnitude of $\sigma_{\min}$, we use the ratio $r=\sigma_{\max}/\sigma_{\min}$ as a metric to validate Assumption 3.1. The value of $r$ approaches a constant as $m$ increases in all cases, providing empirical support for the assumption.
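The ratio $r$ can be computed from sampled gradient features as in the sketch below. It is an illustration only: the function name and the relative threshold `rel_tol` used to decide which singular values count as non-zero are assumptions, since the text does not state how the cutoff is chosen.

```python
import numpy as np


def nonzero_singular_ratio(grads, rel_tol=1e-8):
    """Return (r, sigma_max, sigma_min) for the empirical Sigma = E[g g^T].

    grads has shape (num_samples, n); row i is grad_theta Q(x_i; theta^0).
    Singular values below rel_tol * sigma_max are treated as zero.
    """
    sigma = grads.T @ grads / grads.shape[0]
    svals = np.linalg.svd(sigma, compute_uv=False)  # sorted in descending order
    nonzero = svals[svals > rel_tol * svals[0]]
    return nonzero[0] / nonzero[-1], nonzero[0], nonzero[-1]


# Toy usage: two samples in R^3, so Sigma = diag(2, 0.5, 0) has one zero direction.
toy_grads = np.array([[2.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
r, s_max, s_min = nonzero_singular_ratio(toy_grads)
```

Because $r$ is a ratio, it is unchanged when all gradients are rescaled by a common factor, which is what makes it comparable across widths $m$.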

6 Conclusion

We present a finite-time analysis of the TD and Q-learning methods with neural network approximation, where the state-action pairs are generated by a given policy under Markovian sampling. Besides establishing convergence to the true action-value function up to an unavoidable function approximation error, we introduce an improved analysis technique that yields an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity for the neural TD and Q-learning methods, improving on the existing $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity. For future work, it would be interesting to investigate whether the proposed technique can improve the current complexity estimates of actor-critic methods, which are partially built upon the neural TD methods.

7 Acknowledgements

Dr. Zaiwen Wen is supported in part by the NSFC grant 12331010. Dr. Junyu Zhang is supported in part by the MOE AcRF grant A-0009530-05-00.

References

Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019a.
Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.
Barakat et al. (2022) Barakat, A., Bianchi, P., and Lehmann, J. Analysis of a target-based actor-critic algorithm with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 991–1040. PMLR, 2022.
Bertsekas (2012) Bertsekas, D. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
Bhandari et al. (2018) Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory, pp. 1691–1692. PMLR, 2018.
Borkar (2009) Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
Bowling & Veloso (2001) Bowling, M. and Veloso, M. Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pp. 1021–1026. Citeseer, 2001.
Boyan (2002) Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine learning, 49(2):233–246, 2002.
Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996.
Brandfonbrener & Bruna (2019) Brandfonbrener, D. and Bruna, J. Geometric insights into the convergence of nonlinear td learning. arXiv preprint arXiv:1905.12185, 2019.
Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Cai et al. (2023) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. Neural temporal difference and q learning provably converge to global optima. Mathematics of Operations Research, 2023.
Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
Cao & Gu (2020) Cao, Y. and Gu, Q. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3349–3356, 2020.
Cayci et al. (2023) Cayci, S., Satpathi, S., He, N., and Srikant, R. Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation. IEEE Transactions on Automatic Control, 2023.
Dalal et al. (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Fan et al. (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep q-learning. In Learning for dynamics and control, pp. 486–489. PMLR, 2020.
Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
Godfrey (2019) Godfrey, L. B. An evaluation of parametric activation functions for deep learning. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 3006–3011. IEEE, 2019.
Jaakkola et al. (1993) Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.
Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Ke et al. (2023) Ke, Z., Wen, Z., and Zhang, J. Provably efficient gauss-newton temporal difference learning method with function approximation. arXiv preprint arXiv:2302.13087, 2023.
Konda & Tsitsiklis (1999) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
Kostrikov et al. (2021) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.
Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of lstd. In ICML-27th International Conference on Machine Learning, pp. 615–622, 2010.
Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Littman (1994) Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157–163. Elsevier, 1994.
Liu et al. (2020a) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient td algorithms. arXiv preprint arXiv:2006.14364, 2020a.
Liu et al. (2020b) Liu, C., Zhu, L., and Belkin, M. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020b.
Maei et al. (2009) Maei, H., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in neural information processing systems, 22, 2009.
Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Perkins & Pendrith (2002) Perkins, T. J. and Pendrith, M. D. On the existence of fixed points for q-learning and sarsa in partially observable domains. In ICML, pp. 490–497, 2002.
Perolat et al. (2018) Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pp. 919–928. PMLR, 2018.
Prashanth et al. (2014) Prashanth, L., Korda, N., and Munos, R. Fast lstd using stochastic approximation: Finite time analysis and application to traffic control. In Joint European conference on machine learning and knowledge discovery in databases, pp. 66–81. Springer, 2014.
Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015.
Sun et al. (2022) Sun, T., Li, D., and Wang, B. Finite-time analysis of adaptive temporal difference learning with deep neural networks. Advances in Neural Information Processing Systems, 35:19592–19604, 2022.
Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning, pp. 993–1000, 2009a.
Sutton et al. (2009b) Sutton, R. S., Szepesvári, C., and Maei, H. R. A convergent o (n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in neural information processing systems, 21(21):1609–1616, 2009b.
Tagorti & Scherrer (2015) Tagorti, M. and Scherrer, B. On the rate of convergence and error bounds for lstd ( $\lambda$ ). In International Conference on Machine Learning, pp. 1521–1529. PMLR, 2015.
Tesauro et al. (1995) Tesauro, G. et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
Tian et al. (2022) Tian, H., Paschalidis, I., and Olshevsky, A. On the performance of temporal difference learning with neural networks. In The Eleventh International Conference on Learning Representations, 2022.
Touati et al. (2018) Touati, A., Bacon, P.-L., Precup, D., and Vincent, P. Convergent tree backup and retrace with function approximation. In International Conference on Machine Learning, pp. 4955–4964. PMLR, 2018.
Tsitsiklis & Van Roy (1996) Tsitsiklis, J. and Van Roy, B. Analysis of temporal-difference learning with function approximation. Advances in neural information processing systems, 9, 1996.
Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
Xu & Gu (2020) Xu, P. and Gu, Q. A finite-time analysis of q-learning with neural network function approximation. In International Conference on Machine Learning, pp. 10555–10565. PMLR, 2020.
Zou et al. (2019) Zou, S., Xu, T., and Liang, Y. Finite-sample analysis for sarsa with linear function approximation. Advances in neural information processing systems, 32, 2019.

Appendix A Details of Section 3

A.1 Proof of (14)

\begin{aligned}
&\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{*}\right),\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}\big\rangle\right]\\
&\overset{(i)}{=}\mathbb{E}_{\mu,\pi}\left[\mathbb{E}_{\mathbb{P}}\big[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\big]\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right]\\
&\overset{(ii)}{=}\mathbb{E}_{\mu,\pi}\left[\mathbb{E}_{\mathbb{P}}\big[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\big]\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right]\\
&=\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right],
\end{aligned}

where (i) is because $\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\cdot\,;\boldsymbol{\theta}_{*}\right)=\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\cdot\,;\boldsymbol{\theta}_{0}\right)$, the decomposition

\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}+\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}+(\bar{\boldsymbol{\theta}}_{\bot}-\boldsymbol{\theta}_{\bot}),

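the fact that $(\bar{\boldsymbol{\theta}}_{\bot}-\boldsymbol{\theta}_{\bot})\in\mathcal{K}(\Sigma_{\pi})$, and (13); (ii) is because $(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}_{*})\in\mathcal{K}(\Sigma_{\pi})$, so that by (13) the linearized values $\widehat{Q}(\cdot\,;\boldsymbol{\theta}_{*})$ and $\widehat{Q}(\cdot\,;\bar{\boldsymbol{\theta}})$ coincide almost surely and the two TD errors agree.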

A.2 Proof of Theorem 3.6

Proof.

Recall the definition of the semi-gradient in (6). We denote by $\bar{\mathbf{g}}(\boldsymbol{\theta})$ its expectation. Let $\bar{\mathbf{m}}(\boldsymbol{\theta})$ and $\mathbf{m}(\boldsymbol{\theta})$ also be the corresponding semi-gradients based on the linearized function $\widehat{Q}(\cdot;\boldsymbol{\theta})$, that is,

	$\displaystyle\mathbf{g}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta% }^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta% }^{t}),\quad\bar{\mathbf{g}}(\boldsymbol{\theta}^{t})\ =\ \mathbb{E}_{\mu,\pi,% \mathbb{P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right]$
	$\displaystyle\mathbf{m}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};% \boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};% \boldsymbol{\theta}^{0}),\quad\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\ =\ % \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],$

where

	$\displaystyle\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta% }^{t})$	$\displaystyle=$	$\displaystyle Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t% })+\gamma\cdot Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right),$
	$\displaystyle\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};% \boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s% _{t},a_{t})+\gamma\cdot\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t% })\right).$

To simplify the notation, let $\Delta_{t}:=\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}% ^{t})$ and $\widehat{\Delta}_{t}:=\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1}% ;\boldsymbol{\theta}^{t})$ . Recall the definition of the range space $\mathcal{R}(\Sigma_{\pi})$ and the kernel space $\mathcal{K}(\Sigma_{\pi})$ . By Proposition 3.4, we know that $\boldsymbol{v}_{1}^{\top}\boldsymbol{v}_{2}=0$ for any vector $\boldsymbol{v}_{1}\in\mathcal{R}(\Sigma_{\pi}),\boldsymbol{v}_{2}\in\mathcal{K% }(\Sigma_{\pi})$ thus $\left<\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),% \boldsymbol{\theta}_{\bot}\right>=0$ for any feature map $\boldsymbol{x}$ and parameter $\boldsymbol{\theta}_{\bot}\in\mathcal{K}(\Sigma_{\pi})$ . Then we can decompose $\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$ as

$\displaystyle\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$	$\displaystyle=$	$\displaystyle\left\|\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})\right)-\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t+1}_{*}\right)\right\|^{2}$
	$\displaystyle\leq$	$\displaystyle\|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$
	$\displaystyle=$	$\displaystyle\|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t}_{*}+\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$
	$\displaystyle=$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}+\|\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>$
		$\displaystyle-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>+2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\right>$
	$\displaystyle\overset{(i)}{=}$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>$
		$\displaystyle-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2},$

where (i) follows

\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\boldsymbol{\theta}^% {t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\right>=\left<\boldsymbol{\theta}^{t}_{% \parallel}-\boldsymbol{\theta}^{*}_{\parallel},\boldsymbol{\theta}^{t}_{\bot}-% \boldsymbol{\theta}^{t+1}_{\bot}\right>=0,

and

	$\displaystyle\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})\right>$	$\displaystyle=$	$\displaystyle\left<\boldsymbol{\theta}^{t}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot},\mathbf{m}(\boldsymbol{\theta}^{t})\right>$
		$\displaystyle=$	$\displaystyle\widehat{\Delta}_{t}\cdot\left<\boldsymbol{\theta}^{t}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot},\nabla Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right>\ =\ 0.$

Recall the stationarity condition (11), for any $t\in\{1,2,\cdots,T\}$ ,

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{*}\right)\big{\langle}\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\big{\rangle}\right]$
	$\displaystyle=$	$\displaystyle\big{\langle}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}_{*}\right)\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}_{*}\right)\right],\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\big{\rangle}$
	$\displaystyle\overset{(i)}{=}$	$\displaystyle\big{\langle}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}_{*}\right)\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}_{*}\right)\right],\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big{\rangle}\ =\ \left<\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>,$

where (i) is the same as the proof in Section A.1. Therefore,

	$\displaystyle\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$	(18)
$\displaystyle\leq$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}$
$\displaystyle=$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>$
	$\displaystyle-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}$
$\displaystyle\overset{(i)}{\leq}$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\underbrace{\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}}_{\mbox{I}_{1}\mbox{:\ Gradient Bound}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>}_{\mbox{I}_{2}\mbox{:\ Gradient Gap}}$
	$\displaystyle-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>}_{\mbox{I}_{3}\mbox{:\ Markov Sampling Error}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>}_{\mbox{I}_{4}\mbox{:\ Gradient Descent}},$

where (i) follows $\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(% \boldsymbol{\theta}^{t}_{*})\right>\geq 0$ for any $0\leq t\leq T-1$ .

Next, we bound $\mathbf{I}_{1}$, $\mathbf{I}_{2}$, $\mathbf{I}_{3}$ and $\mathbf{I}_{4}$ one by one. To simplify the notation, let $\left\{C_{i}>0\right\}_{i=1,\ldots,8}$ be universal constants in this section. We set $\omega=C_{1}$ and $\delta\in(0,1)$. By Lemma D.5, we have

\|\boldsymbol{g}(\boldsymbol{\theta}^{t})\|^{2}\leq C_{2}\sqrt{\log(T/\delta)}

(19)

and

	$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right>\right|\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq$	$\displaystyle\left(C_{3}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}+C_{4}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|$
		$\displaystyle\overset{(i)}{\leq}$	$\displaystyle C_{5}m^{-1/2}\sqrt{\log(T/\delta)},$		(20)

with probability at least $1-2\delta-2\exp{(-C_{6}m)}$ , where (i) follows $\omega=C_{1}$ and

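$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\|$	$\displaystyle\leq$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|$
	$\displaystyle=$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot}\|$
	$\displaystyle\leq$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{t+1}\|\ \leq\ 3\omega.$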

Thus (19) and (20) provide upper bounds on $\mathbf{I}_{1}$ and $\mathbf{I}_{2}$ , respectively. The next lemma provides an estimate of the Markov sampling error.

Lemma A.1.

Suppose the learning rate sequence $\left\{\eta_{0},\eta_{1},\ldots,\eta_{T}\right\}$ is non-increasing. Under Assumption 3.2, it holds that

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\mathbf{m}\left(\boldsymbol{% \theta}^{t}\right)-\overline{\mathbf{m}}\left(\boldsymbol{\theta}^{t}\right),% \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\right\rangle\mid\boldsymbol{% \theta}^{0}\right]\leq C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_% {\max\left\{0,t-\tau^{*}\right\}},

(21)

for any fixed $t\leq T$ , where

\tau^{*}=\min\left\{t=0,1,2,\ldots\mid\kappa\rho^{t}\leq\eta_{T}\right\}

is the mixing time of the Markov chain $\left\{s_{t},a_{t}\right\}_{t=0,1,\ldots}$.

Proof.

We follow the proof of Lemma 6.2 in Xu & Gu (2020). The only changes come from the neural network setting, which alters the bounds on the norms of the gradients and the parameters and hence the constants in the result. We thus obtain

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\mathbf{m}\left(\boldsymbol{% \theta}^{t}\right)-\overline{\mathbf{m}}\left(\boldsymbol{\theta}^{t}\right),% \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\right\rangle\mid\boldsymbol{% \theta}^{0}\right]\leq C_{7}\left(\log(T/\delta)+\omega^{2}\right)\tau^{*}\eta% _{\max\left\{0,t-\tau^{*}\right\}}.

∎
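As an illustration of the definition of $\tau^{*}$ in Lemma A.1, the following minimal Python sketch computes the smallest $t$ with $\kappa\rho^{t}\leq\eta_{T}$ by direct search. The function name `mixing_time` and the numerical values are hypothetical and chosen only for illustration; $\kappa$ and $\rho$ stand for the constants appearing in the definition of $\tau^{*}$.

```python
def mixing_time(kappa: float, rho: float, eta_T: float) -> int:
    """Return the smallest integer t >= 0 with kappa * rho**t <= eta_T."""
    if kappa <= 0.0 or not (0.0 < rho < 1.0) or eta_T <= 0.0:
        raise ValueError("need kappa > 0, 0 < rho < 1 and eta_T > 0")
    t = 0
    # kappa * rho**t decreases geometrically, so the loop terminates.
    while kappa * rho ** t > eta_T:
        t += 1
    return t


# Hypothetical constants: halving the target eta_T adds one step when rho = 0.5.
tau_coarse = mixing_time(kappa=1.0, rho=0.5, eta_T=0.125)
tau_fine = mixing_time(kappa=1.0, rho=0.5, eta_T=0.0625)
```

Since $\kappa\rho^{t}$ decays geometrically, $\tau^{*}$ grows only logarithmically in $1/\eta_{T}$, which is why the mixing time contributes at most logarithmic factors when $\eta_{T}$ decays polynomially in $T$.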

Looking back at the definitions of $\lambda_{0}$ and $\Sigma_{\pi}$ , and the discussion in Section 3.2, we derive Lemmas A.2 and A.3 to estimate $\mathbf{I}_{4}$ .

Lemma A.2.

Let $\lambda_{0}$ be the minimum nonzero singular value of $\Sigma_{\pi}$. For any $\boldsymbol{\theta}\in\mathcal{R}(\Sigma_{\pi})$, we have

\boldsymbol{\theta}^{\top}\Sigma_{\pi}\boldsymbol{\theta}\geq\lambda_{0}\|% \boldsymbol{\theta}\|_{2}^{2}.
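The role of the range-space restriction in Lemma A.2 can be checked on a small example. The sketch below builds a hypothetical rank-deficient $3\times 3$ matrix `Sigma` with known eigenvalues $2$, $0.5$ and $0$ (so $\lambda_{0}=0.5$); the vectors `u`, `v`, `w` and all numbers are assumptions made for illustration, not quantities from the paper.

```python
import math


def quad_form(S, x):
    """Compute x^T S x for a square matrix S given as nested lists."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))


def sq_norm(x):
    """Squared Euclidean norm of x."""
    return sum(xi * xi for xi in x)


def scaled_outer(x, scale):
    """Return scale * x x^T as nested lists."""
    return [[scale * xi * xj for xj in x] for xi in x]


# Orthonormal vectors: u and v span the range of Sigma, w spans its kernel.
u = [1 / math.sqrt(3)] * 3
v = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
w = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]

a, b = 2.0, 0.5  # nonzero eigenvalues of Sigma
Sigma = [[p + q for p, q in zip(row_u, row_v)]
         for row_u, row_v in zip(scaled_outer(u, a), scaled_outer(v, b))]
lambda_0 = min(a, b)

theta_range = [3.0 * ui - 1.5 * vi for ui, vi in zip(u, v)]  # lies in R(Sigma)
theta_kernel = w                                             # lies in K(Sigma)

holds_on_range = quad_form(Sigma, theta_range) >= lambda_0 * sq_norm(theta_range)
holds_on_kernel = quad_form(Sigma, theta_kernel) >= lambda_0 * sq_norm(theta_kernel)
```

In this example the inequality holds for `theta_range` but fails for `theta_kernel`, whose quadratic form vanishes; this is why the lemma is stated only for $\boldsymbol{\theta}\in\mathcal{R}(\Sigma_{\pi})$.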

Lemma A.3.

Under Assumption 3.3, we have that

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol% {\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(% \boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]\geq(1-% \gamma)\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|% ^{2}.

(22)

Proof.

Let $d=\mu\times\pi$ denote the stationary distribution of $(s,a)$ under the policy $\pi$. To begin with, the Bellman operator $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in the $\ell_{2}$-norm weighted by $d$. In detail, consider

	$\displaystyle\mathbb{E}_{(s,a)\sim d}\left[\left(\mathcal{T}^{\pi}Q_{1}(\boldsymbol{x})-\mathcal{T}^{\pi}Q_{2}(\boldsymbol{x})\right)^{2}\right]$	$\displaystyle=\gamma^{2}\mathbb{E}_{(s,a)\sim d}\left[\mathbb{E}\left[\left(Q_{1}(\boldsymbol{x}^{\prime})-Q_{2}(\boldsymbol{x}^{\prime})\right)^{2}\mid s^{\prime}\sim\mathbb{P}(\cdot\mid s,a),a^{\prime}\sim\pi(\cdot\mid s^{\prime})\right]\right]$
		$\displaystyle\overset{(i)}{\leq}\gamma^{2}\mathbb{E}_{(s,a)\sim d}\left[\left(Q_{1}(\boldsymbol{x})-Q_{2}(\boldsymbol{x})\right)^{2}\right],$		(23)

where (i) holds because $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ have the same stationary distribution. To simplify the notation, we write $\mathbb{E}[\cdot]$ for $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ in the proof of this lemma. Then we compute

	$\displaystyle\mathbb{E}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]$	(24)
$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{\Delta}(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
	$\displaystyle-\gamma\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\right]$
$\displaystyle\overset{(i)}{\geq}$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}\cdot\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}$
$\displaystyle\overset{(ii)}{\geq}$	$\displaystyle\left(1-\gamma\right)\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]$
$\displaystyle=$	$\displaystyle(1-\gamma)(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})$
$\displaystyle\overset{(iii)}{\geq}$	$\displaystyle(1-\gamma)\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2},$

where (i) follows the Cauchy-Schwarz inequality, (ii) follows (23), and (iii) follows $\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}=\boldsymbol{\theta}^{t}_{% \parallel}-\bar{\boldsymbol{\theta}}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})$ and Lemma A.2, which provides $\lambda_{0}$ -strong convexity. Thus we complete the proof of Lemma A.3. ∎
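The key step of the proof, $\mathbb{E}[u(\boldsymbol{x})^{2}]-\gamma\mathbb{E}[u(\boldsymbol{x})u(\boldsymbol{x}^{\prime})]\geq(1-\gamma)\mathbb{E}[u(\boldsymbol{x})^{2}]$ when $\boldsymbol{x}$ is drawn from the stationary distribution, can be checked numerically. The sketch below uses a hypothetical two-point Markov chain `P` with stationary distribution `d`; the function name `stationary_gap` and the test vectors are assumptions for illustration, with `u` playing the role of $\widehat{Q}(\cdot;\boldsymbol{\theta}^{t})-\widehat{Q}(\cdot;\boldsymbol{\theta}^{t}_{*})$.

```python
def stationary_gap(P, d, u, gamma):
    """Return (lhs, rhs) of E[u(x)^2] - gamma E[u(x) u(x')] >= (1 - gamma) E[u(x)^2].

    x is drawn from d and x' from the row P[x].
    """
    n = len(d)
    second_moment = sum(d[i] * u[i] ** 2 for i in range(n))
    cross = sum(d[i] * u[i] * P[i][j] * u[j] for i in range(n) for j in range(n))
    return second_moment - gamma * cross, (1.0 - gamma) * second_moment


# Hypothetical two-point chain; d satisfies d P = d.
P = [[0.9, 0.1], [0.2, 0.8]]
d = [2.0 / 3.0, 1.0 / 3.0]
gamma = 0.9

results = [stationary_gap(P, d, u, gamma)
           for u in ([1.0, -2.0], [1.0, 1.0], [-3.0, 0.5], [0.0, 4.0])]
```

For a constant `u` the two sides coincide, so the factor $(1-\gamma)$ cannot be improved in general. If `d` were replaced by a non-stationary distribution, the Cauchy-Schwarz step would still hold but step (ii) would no longer follow from (23).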

Given $\boldsymbol{\theta}^{0}$ , taking the expectation on both sides of (18) and plugging (19) $\sim$ (22) into (18) yields that

	$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq(1-2\eta_{t}(1-\gamma)\lambda_{0})\mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{2}\eta_{t}^{2}$
		$\displaystyle+2\eta_{t}C_{5}m^{-1/2}\sqrt{\log(T/\delta)}+2\eta_{t}C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}.$

We choose $\eta_{t}=\frac{1}{2(1-\gamma)\lambda_{0}(t+1)}$ and have that

	$\displaystyle(1-\gamma)\lambda_{0}(t+1)\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq(1-\gamma)\lambda_{0}t\ \mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{2}\eta_{t}$
		$\displaystyle+C_{5}m^{-1/2}\sqrt{\log(T/\delta)}+C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}.$

Summing the above inequality over $t=0,1,\cdots,T-1$ yields that

$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{T}-\boldsymbol{\theta}^{T}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq$	$\displaystyle\frac{1}{(1-\gamma)\lambda_{0}T}\sum_{t=0}^{T-1}\left(C_{2}\eta_{t}+C_{5}m^{-1/2}\sqrt{\log(T/\delta)}\right.$
		$\displaystyle\left.+C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}\right)$
	$\displaystyle\leq$	$\displaystyle\frac{C_{2}(\log T+1)}{2(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{C_{5}m^{-1/2}\sqrt{\log(T/\delta)}}{(1-\gamma)\lambda_{0}}+\frac{C_{8}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{2(1-\gamma)^{2}\lambda_{0}^{2}T}.$
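To see how the step size $\eta_{t}=\frac{1}{2(1-\gamma)\lambda_{0}(t+1)}$ produces the $\log T/T$ rate, the following sketch iterates a simplified scalar version of the recursion, $a_{t+1}=(1-2\eta_{t}\mu)a_{t}+C\eta_{t}^{2}$ with $\mu=(1-\gamma)\lambda_{0}$, taken with equality. It keeps only the $C_{2}\eta_{t}^{2}$ term and drops the network-width and Markov-sampling terms; the names `run_recursion` and `leading_bound` and all constants are hypothetical.

```python
import math


def run_recursion(T, mu, C, a0=1.0):
    """Iterate a_{t+1} = (1 - 2*eta_t*mu) * a_t + C * eta_t**2, eta_t = 1/(2*mu*(t+1))."""
    a = a0
    for t in range(T):
        eta = 1.0 / (2.0 * mu * (t + 1))
        a = (1.0 - 2.0 * eta * mu) * a + C * eta ** 2
    return a


def leading_bound(T, mu, C):
    """First term of the final bound: C * (log T + 1) / (2 * mu**2 * T)."""
    return C * (math.log(T) + 1.0) / (2.0 * mu ** 2 * T)


mu, C = 0.05, 1.0  # hypothetical values of (1 - gamma) * lambda_0 and C_2
errors = {T: run_recursion(T, mu, C) for T in (10, 100, 1000)}
bounds = {T: leading_bound(T, mu, C) for T in (10, 100, 1000)}
```

With this step size the contraction factor equals $t/(t+1)$, so multiplying by $(t+1)$ telescopes the recursion and the initial error is removed after the first step; what remains is a harmonic sum, which is the source of the $\log T+1$ factor.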

Therefore, according to the gradient bound (19), we have

$\displaystyle\mathbb{E}_{\mu,\pi}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T}_{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]$
	$\displaystyle\leq$	$\displaystyle C_{3}^{2}m\mathbb{E}\left[\|\boldsymbol{\theta}^{T}-\boldsymbol{\theta}^{T}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]$
	$\displaystyle\leq$	$\displaystyle\frac{C_{2}^{3}(\log T+1)}{2(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{C_{2}^{2}C_{5}m^{-1/2}\sqrt{\log(T/\delta)}}{(1-\gamma)\lambda_{0}}$
		$\displaystyle+\frac{C_{2}^{2}C_{8}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{2(1-\gamma)^{2}\lambda_{0}^{2}T}$

with probability at least $1-2\delta-2L\exp{(-C_{6}m)}$ . Let $\widetilde{C}_{1}=\max\{1,C_{1}\},\widetilde{C}_{2}=C_{6},\widetilde{C}_{3}=% \frac{C_{2}^{3}}{2},\widetilde{C}_{4}=\frac{C_{2}^{2}C_{5}}{2}$ , and $\widetilde{C}_{5}=\frac{C_{2}^{2}C_{8}}{2}$ , and we complete the proof. ∎

A.3 Proof of Theorem 3.7

Proof.

Let $(s,a)\sim\mu\times\pi=:d$ . To simplify the notation, we denote $\mathbb{E}[\cdot]$ as $\mathbb{E}_{(s,a)\sim d}[\cdot]$ in this subsection. Note that

	$\displaystyle\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-Q^{*}(s,a)\right)^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq 3\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]$
		$\displaystyle+3\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]+3\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right)^{2}\mid\boldsymbol{\theta}^{0}\right].$
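The factor $3$ in this decomposition comes from an elementary inequality. As a short worked step, for generic real numbers $a,b,c$ (symbols introduced here only for illustration), the Cauchy-Schwarz inequality gives

(a+b+c)^{2}=(1\cdot a+1\cdot b+1\cdot c)^{2}\leq(1^{2}+1^{2}+1^{2})(a^{2}+b^{2}+c^{2})=3(a^{2}+b^{2}+c^{2}).

Applying this pointwise with $a=Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})$, $b=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$ and $c=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)$, and then taking the conditional expectation, yields the three terms.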

By Lemma D.4, we have

\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(% \boldsymbol{x};\boldsymbol{\theta}^{T})\right)^{2}\mid\boldsymbol{\theta}^{0}% \right]\leq C_{8}m^{-1}

(25)

with probability at least $1-\delta$ . Recall that $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$ is the fixed point of $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}$ and $Q^{*}(s,a)$ is the fixed point of $\mathcal{T}$ . We define the $\ell_{2}$ -norm $\|f(s,a)\|_{d}^{2}=\mathbb{E}_{(s,a)\sim d}\left[f(s,a)^{2}\right]$ . Thus

$\displaystyle\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}$	$\displaystyle=$	$\displaystyle\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}$
	$\displaystyle\overset{(i)}{=}$	$\displaystyle\left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}$
	$\displaystyle\leq$	$\displaystyle\left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}$
	$\displaystyle\overset{(ii)}{\leq}$	$\displaystyle\gamma\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d},$

where (i) is due to the properties of the fixed point, and (ii) holds because $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}$ is $\gamma$-contractive on the $\infty$-norm. Rearranging the resulting inequality and squaring both sides, we obtain

\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_% {d}^{2}\leq\frac{1}{(1-\gamma)^{2}}\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,% a)-Q^{*}(s,a)\right\|_{d}^{2}.

(26)

Plugging (25) and (26) into the three-term decomposition at the beginning of this proof and using Theorem 3.6, we complete the proof. ∎
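The fixed-point argument behind (26) can be illustrated on a toy problem. The sketch below is a simplified stand-in, not the setting of the theorem: it assumes a hypothetical two-point Markov chain `P` with stationary distribution `d`, a one-dimensional linear feature `phi` without any radius constraint in place of $\mathcal{F}_{\omega,m}$, and the policy-evaluation Bellman operator $\mathcal{T}f=r+\gamma Pf$. It compares the error of the projected fixed point with the projection error of the true fixed point.

```python
import math


def d_inner(f, g, d):
    """Weighted inner product <f, g>_d = sum_i d_i f_i g_i."""
    return sum(di * fi * gi for di, fi, gi in zip(d, f, g))


def d_norm(f, d):
    """Weighted norm ||f||_d = sqrt(<f, f>_d)."""
    return math.sqrt(d_inner(f, f, d))


gamma = 0.9
P = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical transition matrix
d = [2.0 / 3.0, 1.0 / 3.0]     # its stationary distribution
r = [1.0, 0.0]                 # hypothetical rewards
phi = [1.0, 0.5]               # hypothetical one-dimensional feature

# True fixed point Q* = (I - gamma P)^{-1} r, solved with Cramer's rule.
A = [[1.0 - gamma * P[0][0], -gamma * P[0][1]],
     [-gamma * P[1][0], 1.0 - gamma * P[1][1]]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
q_star = [(A[1][1] * r[0] - A[0][1] * r[1]) / det,
          (A[0][0] * r[1] - A[1][0] * r[0]) / det]

# Projection of Q* onto span{phi} in the d-weighted norm.
coef_proj = d_inner(phi, q_star, d) / d_inner(phi, phi, d)
proj_q_star = [coef_proj * p for p in phi]

# Fixed point w * phi of the projected Bellman operator.
P_phi = [sum(P[i][j] * phi[j] for j in range(2)) for i in range(2)]
w = d_inner(phi, r, d) / (d_inner(phi, phi, d) - gamma * d_inner(phi, P_phi, d))
q_hat = [w * p for p in phi]

fixed_point_error = d_norm([x - y for x, y in zip(q_hat, q_star)], d)
projection_error = d_norm([x - y for x, y in zip(proj_q_star, q_star)], d)
```

In this example `fixed_point_error` lies between `projection_error` and `projection_error / (1 - gamma)`, which mirrors (26) before squaring.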

Appendix B Convergence Results of Neural Q-learning

B.1 Neural Q-Learning Algorithm

For neural Q-learning, we redefine some of the above notation. Let the optimal Q-function be $Q^{*}(s,a)=\sup_{\pi}Q^{\pi}(s,a)$ for all state-action pairs $(s,a)$. Then the optimal sequence of actions that maximizes the expected cumulative reward follows $a_{t}=\mathop{\mathrm{argmax}}_{a^{\prime}\in\mathcal{A}}Q^{*}(s_{t},a^{\prime}),t\geq 0$. Therefore, to obtain a near-optimal policy, it is sufficient to find some $\hat{Q}$ that approximates $Q^{*}$ well. Define the Bellman optimality operator $\mathcal{T}$ as

\mathcal{T}Q(s,a):=r(s,a)+\gamma\mathbb{E}\left[\max_{a^{\prime}}Q(s^{\prime},% a^{\prime})\mid s^{\prime}\sim\mathbb{P}(\cdot\mid s,a)\right],

for any $(s,a)$. We retain the definition of the local linearization function class $\mathcal{F}_{\omega,m}$ introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:

\min_{\boldsymbol{\theta}\in S_{\omega}}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[% \left(Q(\boldsymbol{x};\boldsymbol{\theta})-\Pi_{\mathcal{F}_{\omega,m}}% \mathcal{T}Q(\boldsymbol{x};\boldsymbol{\theta})\right)^{2}\right].

Then the projected neural Q-learning algorithm can be written as follows:

\boldsymbol{\theta}^{t+1}=\Pi_{S_{\omega}}\Big{(}\boldsymbol{\theta}^{t}-\eta_% {t}\boldsymbol{g}\left(\boldsymbol{\theta}^{t}\right)\!\!\Big{)},\quad\mbox{% with}\quad\boldsymbol{g}(\boldsymbol{\theta}^{t})=\Delta\left(s_{t},a_{t},s_{t% +1};\boldsymbol{\theta}^{t}\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\phi(s_{t% },a_{t});\boldsymbol{\theta}^{t})

(27)

where

\displaystyle\Delta\left(s,a,s^{\prime};\boldsymbol{\theta}^{t}\right)=

\displaystyle Q(\phi(s,a);\boldsymbol{\theta}^{t})-\Big{(}r(s,a)+\gamma\max_{b% \in\mathcal{A}}Q\left(\phi\left(s^{\prime},b\right);\boldsymbol{\theta}^{t}% \right)\Big{)}.

(28)

The algorithm details can be described by Algorithm 2 as follows.

Algorithm 2 Neural Q-Learning with Markovian Sampling

Input: A learning policy $\pi$, a discount factor $\gamma\in(0,1)$, a sequence of learning rates $\{\eta_{t}\}_{t\geq 0}$, a maximum iteration number $T$, a projection radius $\omega>0$, a Q network with architecture (4).

Initialization: Generate each entry of $\boldsymbol{W}_{l}^{0}$ independently from $\mathcal{N}(0,1)$, for $l=1,2,\cdots,L$, and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$.

for $t=0,1,\cdots,T-1$

Sample $(s_{t},a_{t},r_{t},s_{t+1})$ from the learning policy $\pi$ with $a_{t}\sim\pi(\cdot|s_{t})$.

Compute the TD error $\Delta_{t}$ by (28).

Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (27).

end for

Output: $\boldsymbol{\theta}^{T}$
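The steps of Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not part of the paper: a two-layer ReLU network stands in for architecture (4), only the hidden weights `W` are trained while the output signs `b` stay fixed, the MDP is a hypothetical toy problem with two states and two actions (the dynamics in `step` are invented), the learning policy is uniform, and the step size $1/(t+1)$ is chosen for simplicity rather than taken from Theorem B.2.

```python
import math
import random


def q_value(W, b, x):
    """Two-layer ReLU network: Q(x) = m^{-1/2} * sum_r b_r * relu(<w_r, x>)."""
    m = len(W)
    return sum(b[r] * max(0.0, sum(w * xi for w, xi in zip(W[r], x)))
               for r in range(m)) / math.sqrt(m)


def q_grad(W, b, x):
    """Gradient of Q(x) with respect to W (the signs b are not trained)."""
    scale = 1.0 / math.sqrt(len(W))
    grad = []
    for r in range(len(W)):
        active = sum(w * xi for w, xi in zip(W[r], x)) > 0.0
        grad.append([scale * b[r] * xi if active else 0.0 for xi in x])
    return grad


def project(W, W0, omega):
    """Euclidean projection onto the ball {W : ||W - W0||_F <= omega}."""
    diff = [[w - w0 for w, w0 in zip(row, row0)] for row, row0 in zip(W, W0)]
    norm = math.sqrt(sum(v * v for row in diff for v in row))
    if norm <= omega:
        return W
    c = omega / norm
    return [[w0 + c * v for w0, v in zip(row0, drow)] for row0, drow in zip(W0, diff)]


def neural_q_learning(T=200, m=16, omega=1.0, gamma=0.9, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    dim = n_states * n_actions

    def phi(s, a):  # one-hot feature map, so ||phi(s, a)|| = 1
        x = [0.0] * dim
        x[s * n_actions + a] = 1.0
        return x

    def step(s, a):  # hypothetical toy dynamics and reward
        s_next = a if rng.random() < 0.8 else 1 - a
        return (1.0 if s_next == 1 else 0.0), s_next

    W0 = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
    b = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    W = [row[:] for row in W0]
    s = 0
    for t in range(T):
        a = rng.randrange(n_actions)                 # uniform learning policy
        r, s_next = step(s, a)
        x = phi(s, a)
        target = r + gamma * max(q_value(W, b, phi(s_next, c)) for c in range(n_actions))
        delta = q_value(W, b, x) - target            # TD error, cf. (28)
        eta = 1.0 / (t + 1)                          # illustrative step size
        g = q_grad(W, b, x)
        W = project([[w - eta * delta * gw for w, gw in zip(row, grow)]
                     for row, grow in zip(W, g)], W0, omega)  # update, cf. (27)
        s = s_next                                   # Markovian sampling: continue the trajectory
    return W, W0
```

A call such as `W, W0 = neural_q_learning()` returns the final and initial hidden weights. Because every iterate is projected onto the ball of radius $\omega$ around the initialization, the returned `W` always stays within distance $\omega$ of `W0`, which is the property the analysis relies on to keep the network close to its linearization.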

B.2 Global Convergence

Similar to Section 3, we define the function class $\mathcal{F}_{\omega,m}$ as the collection of all local linearizations of $Q(\boldsymbol{x};\boldsymbol{\theta})$ at the initial point $\boldsymbol{\theta}^{0}$:

\mathcal{F}_{\omega,m}:=\left\{\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})% =Q(\boldsymbol{x};\boldsymbol{\theta}^{0})+\left<\nabla_{\boldsymbol{\theta}}Q% (\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{% \theta}^{0}\right>,\ \boldsymbol{\theta}\in S_{\omega}\right\}.

Let $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})\in\mathcal{F}_{\omega,m}$, and let $\widehat{\Delta}\left(s,a,s^{\prime};\boldsymbol{\theta}\right)$ have the same structure as $\Delta\left(s,a,s^{\prime};\boldsymbol{\theta}\right)$ except that the function $Q(\cdot;\boldsymbol{\theta})$ is replaced by $\widehat{Q}(\cdot;\boldsymbol{\theta})$. The stationary point $\boldsymbol{\theta}^{*}$ satisfies $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})=\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$ for neural Q-learning. We redefine $\Xi_{\beta}$ by replacing the Bellman operator $\mathcal{T}^{\pi}$ in Section 3 with the Bellman optimality operator $\mathcal{T}$. A point $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$ if and only if

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(s,a,s^{\prime};% \boldsymbol{\theta}^{*}\right)\big{\langle}\nabla_{\boldsymbol{\theta}}% \widehat{Q}\left(\phi(s,a);\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}-% \boldsymbol{\theta}^{*}\big{\rangle}\right]\geq 0.

The maximum operator introduced by the Bellman optimality operator significantly complicates the analysis. We retain the definition of $\Sigma_{\pi}$ in (12), and define $\Sigma^{*}_{\pi}(\boldsymbol{\theta})$ as follows:

\displaystyle\mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}}Q(\phi(s,a^% {\boldsymbol{\theta}}_{\max});\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{% \theta}}Q(\phi(s,a^{\boldsymbol{\theta}}_{\max});\boldsymbol{\theta}^{0})^{% \top}\right],

(29)

where $a^{\boldsymbol{\theta}}_{\max}=\arg\max_{a\in\mathcal{A}}\left|\left<\nabla_{% \boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right>\right|$ . To facilitate the analysis of neural Q-learning, we further assume the following regularity condition introduced by (Xu & Gu, 2020).

Assumption B.1.

$\exists\nu\in(0,1)$ such that $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta})\succeq 0$ for any $\boldsymbol{\theta}^{0}$ and $\boldsymbol{\theta}\in S_{\omega}$ .

The original version of this assumption comes from Xu & Gu (2020), which requires strict positive definiteness: $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta})\succ 0$. Under this additional assumption, Xu & Gu (2020) obtained an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for neural Q-learning. A similar complexity result was also derived in (Cai et al., 2023) under a similar regularity condition on the learning policy $\pi$. Here, we relax the condition to positive semi-definiteness ($\succeq 0$) and provide a convergence result for neural Q-learning; see Theorem B.2.

Theorem B.2.

Suppose Assumptions 3.1, 3.2 and B.1 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$ . If the feature map $\|\phi(s,a)\|=1$ for each state-action pair $(s,a)$ and the network width $m\geq m^{*}$ , then the output $\boldsymbol{\theta}^{T}$ of neural Q-learning algorithm (i.e. (27)) satisfies

\displaystyle\mathbb{E}\left[\big{(}\widehat{Q}(\boldsymbol{x};\boldsymbol{% \theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big{)}^{2}% \mid\boldsymbol{\theta}^{0}\right]\leq\frac{\widetilde{C}_{3}(\log T+1)}{\nu^{% 2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\cdot\sqrt% {\log(T/\delta)}+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)% \log T}{\nu^{2}\lambda_{0}^{2}T},

with probability at least $1-2\delta-2L\exp\!\big{(}-\widetilde{C}_{2}m\big{)}$ , where $\tau^{*}$ is the mixing time of Markov chain in Assumption 3.2, and $\widetilde{C}_{1},\cdots,\widetilde{C}_{5}>0$ are universal constants.

Proof.

With a slight abuse of notation, we redefine

	$\displaystyle\mathbf{g}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\Delta(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})\cdot% \nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}),% \quad\bar{\mathbf{g}}(\boldsymbol{\theta}^{t})\ =\ \mathbb{E}_{\mu,\pi,\mathbb% {P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right]$
	$\displaystyle\mathbf{m}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{\Delta}(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{% t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{% 0}),\quad\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\ =\ \mathbb{E}_{\mu,\pi,% \mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],$

where

	$\displaystyle\Delta(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t% })+\gamma\max_{b\in\mathcal{A}}Q(\phi(s_{t+1},b);\boldsymbol{\theta}^{t})% \right),$
	$\displaystyle\widehat{\Delta}(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{% t})$	$\displaystyle=$	$\displaystyle\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s% _{t},a_{t})+\gamma\max_{b\in\mathcal{A}}\widehat{Q}(\phi(s_{t+1},b);% \boldsymbol{\theta}^{t})\right).$

Let $\Delta_{t}=\Delta(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})$ and $\widehat{\Delta}_{t}=\widehat{\Delta}(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{% \theta}^{t})$ . Similarly, (18) can be derived in neural Q-learning. To estimate the terms $\mathbf{I}_{1}\sim\mathbf{I}_{4}$ , we can apply Lemmas D.5 and A.1. However, due to the utilization of the Bellman optimality operator in neural Q-learning, some modifications based on Lemma A.3 are required.

Lemma B.3.

Under Assumption B.1, we have that

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol% {\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(% \boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]\geq\nu% \lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}.

(30)

Proof.

			$\displaystyle\mathbb{E}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{\Delta}(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})-\widehat{\Delta}(s_{t},a_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
			$\displaystyle-\gamma\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\left<\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\mid\boldsymbol{\theta}^{0}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\right].$

For the second term of (B.2), we consider

$\displaystyle\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}_{*}^{t})\right)^{2}\right]$	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[\max_{a\in\mathcal{A}}\left\|\widehat{Q}(s,a;\boldsymbol{\theta}^{t})-\widehat{Q}(s,a;\boldsymbol{\theta}^{t}_{*})\right\|^{2}\right]$	(32)
	$\displaystyle\overset{(i)}{=}$	$\displaystyle\mathbb{E}\left[\max_{a\in\mathcal{A}}\left\|\widehat{Q}(s,a;\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})\right\|^{2}\right]$
	$\displaystyle=$	$\displaystyle(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}^{*}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})$
	$\displaystyle\overset{(ii)}{\leq}$	$\displaystyle\frac{(1-\nu)^{2}}{\gamma^{2}}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}\cdot(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})$
	$\displaystyle\overset{(iii)}{=}$	$\displaystyle\frac{(1-\nu)^{2}}{\gamma^{2}}\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)^{2}\right],$

where (i) and (iii) follow from the linearity of $\widehat{Q}(\boldsymbol{x};\cdot)$ , and (ii) follows from Assumption B.1. Therefore,

			$\displaystyle\mathbb{E}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\right]$
		$\displaystyle\geq$	$\displaystyle\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}\cdot\sqrt{\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}$
		$\displaystyle\overset{(i)}{\geq}$	$\displaystyle\left(1-\gamma\cdot\frac{1-\nu}{\gamma}\right)\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x},\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]$
		$\displaystyle=$	$\displaystyle\nu\cdot(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})$
		$\displaystyle\overset{(ii)}{\geq}$	$\displaystyle\nu\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2},$

where (i) follows from the Cauchy-Schwarz step above combined with (32), and (ii) follows from $\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}=\boldsymbol{\theta}^{t}_{\parallel}-\boldsymbol{\theta}^{*}\in\mathcal{R}(\mathbf{\Sigma_{\pi}})$ and Lemma A.2, which provides $\lambda_{0}$ -strong convexity. ∎
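The last chain of inequalities only uses Cauchy-Schwarz and the ratio bound (32). The following is a minimal numerical sketch of that step, not part of the paper's proof: the arrays `a` and `b` are hypothetical stand-ins for $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})$ and $\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})$, and the values of `gamma` and `nu` are arbitrary.

```python
import numpy as np


def monotonicity_gap(a, b, gamma):
    """Return the empirical value of E[A^2] - gamma * E[A * B]."""
    return np.mean(a * a) - gamma * np.mean(a * b)


rng = np.random.default_rng(0)
gamma, nu = 0.9, 0.3  # arbitrary illustrative values

# Hypothetical stand-ins for the two differences appearing in the proof.
a = rng.normal(size=10_000)
b = rng.normal(size=10_000)

# Rescale b so that E[B^2] = ((1 - nu) / gamma)^2 * E[A^2],
# i.e. the extreme case still permitted by (32).
b *= (1.0 - nu) / gamma * np.sqrt(np.mean(a * a) / np.mean(b * b))

gap = monotonicity_gap(a, b, gamma)
lower = nu * np.mean(a * a)
print(gap >= lower)
```

Taking `b` proportional to `a` with factor `(1 - nu) / gamma` makes the bound tight, which suggests the constant $\nu$ cannot be improved by this argument alone.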

Now given $\boldsymbol{\theta}^{0}$ , we can deduce that

	$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]$	$\displaystyle\leq(1-2\eta_{t}\nu\lambda_{0})\mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{1}\eta_{t}^{2}$
		$\displaystyle+2\eta_{t}C_{2}m^{-1/2}\sqrt{\log(T/\delta)}+2\eta_{t}C_{3}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}},$

with probability at least $1-2\delta-2L\exp(-C_{4}m)$ , where $\left\{C_{i}>0\right\}_{i=1,\ldots,4}$ are universal constants in this subsection. Choosing $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$ yields results similar to (A.2). The techniques outlined in Section A.2 can therefore be used to complete the remaining proof of Theorem B.2, which concludes the proof.

∎
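To see why the step size $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$ is natural, note that it makes the contraction factor $1-2\eta_{t}\nu\lambda_{0}$ equal to $t/(t+1)$. The sketch below iterates a simplified version of the recursion that keeps only the $C_{1}\eta_{t}^{2}$ term; dropping the $m^{-1/2}$ and mixing-time terms is an assumption made purely for illustration, and all numerical constants are arbitrary. Under this simplification the recursion telescopes to $e_{T}=\frac{C_{1}}{4\nu^{2}\lambda_{0}^{2}}\cdot\frac{H_{T}}{T}$ , where $H_{T}$ is the $T$ -th harmonic number.

```python
import math


def run_recursion(T, nu, lam0, c1, e0=1.0):
    """Iterate e_{t+1} = (1 - 2*eta_t*nu*lam0) * e_t + c1 * eta_t**2
    with eta_t = 1 / (2*nu*lam0*(t+1))."""
    e = e0
    for t in range(T):
        eta = 1.0 / (2.0 * nu * lam0 * (t + 1))
        e = (1.0 - 2.0 * eta * nu * lam0) * e + c1 * eta ** 2
    return e


# Arbitrary illustrative constants (not taken from the paper).
T, nu, lam0, c1 = 1000, 0.3, 0.5, 2.0

e_T = run_recursion(T, nu, lam0, c1)
harmonic = sum(1.0 / k for k in range(1, T + 1))
closed_form = c1 / (4.0 * nu ** 2 * lam0 ** 2) * harmonic / T
log_bound = c1 / (4.0 * nu ** 2 * lam0 ** 2) * (1.0 + math.log(T)) / T
print(e_T, closed_form, log_bound)
```

The initial error `e0` drops out after the first step because the contraction factor is zero at $t=0$ , so the simplified recursion decays at the rate $\mathcal{O}(\log T/T)$ regardless of initialization.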

Appendix C Details of Section 4

We formally describe the minimax neural Q-learning method in Algorithm 3.

Algorithm 3 Minimax Neural Q-Learning with Gaussian Initialization

Input: A learning policy pair $\pi=(\pi^{1},\pi^{2})$ , a discount factor $\gamma\in(0,1)$ , a sequence of learning rates $\{\eta_{t}\}_{t\geq 0}$ , a maximum iteration number $T$ , a projection radius $\omega>0$ , a Q network with architecture (4).

Initialization: Generate each entry of $\boldsymbol{W}_{l}^{0}$ independently from $\mathcal{N}(0,1)$ , for $l=1,2,\cdots,L$ , and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$ .

for $t=0,1,\cdots,T-1$ do

Sample $(s_{t},a^{1}_{t},a^{2}_{t},r_{t},s_{t+1})$ from the learning policy pair $\pi$ with $a^{1}_{t}\sim\pi^{1}(\cdot|s_{t}),a^{2}_{t}\sim\pi^{2}(\cdot|s_{t})$ .

Compute the TD error $\Delta_{t}$ by (17).

Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (16).

end for

Output: $\boldsymbol{\theta}^{T}$

C.1 Proof of Theorem 4.2

The proof of Theorem 4.2 is similar to Sections A.2 and B.2. However, due to the difference in Bellman operators, we still need to make some modifications to Lemma A.3 or Lemma B.3. See Lemma C.1.

Lemma C.1.

Under Assumption 4.1, we have that

\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol% {\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(% \boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]\geq\nu% \lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}.

(33)

Proof.

To simplify the notation, we denote $\mathbb{E}[\cdot]$ as $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ in the proof of this lemma. Define $\widehat{Q}^{\#}(s;\boldsymbol{\theta}):=\max_{a^{1}\in\mathcal{A}}\min_{a^{2}\in\mathcal{A}}\widehat{Q}(\phi(s,a^{1},a^{2});\boldsymbol{\theta})$ , and let $(a^{1}_{\boldsymbol{\theta}},a^{2}_{\boldsymbol{\theta}})$ denote an action pair attaining this max-min value. Define the sets $\mathcal{S}_{+}=\{s:\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})>\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\}$ and $\mathcal{S}_{-}=\mathcal{S}\setminus\mathcal{S}_{+}$ . For each $s\in\mathcal{S}_{+}$ ,

$\displaystyle\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})$	$\displaystyle=$	$\displaystyle\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\right>$
	$\displaystyle=$	$\displaystyle\left(\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}\right>\right)+$
		$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>+$
		$\displaystyle\left(\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\right>\right)$
	$\displaystyle\leq$	$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>$

and

$\displaystyle\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})$	$\displaystyle=$	$\displaystyle\left(\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}\right>\right)+$
		$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>+$
		$\displaystyle\left(\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\right>\right)$
	$\displaystyle\geq$	$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>.$

In the same way, for each $s\in\mathcal{S}_{-}$ ,

	$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>$	$\displaystyle\leq$	$\displaystyle\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})$
		$\displaystyle\leq$	$\displaystyle\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>.$

Therefore,

	$\displaystyle\left\|\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right\|$	$\displaystyle\leq$	$\displaystyle\max\left\{\Big{\|}\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\Big{\|},\right.$		(34)
			$\displaystyle\left.\Big{\|}\left<\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>\Big{\|}\right\}.$

By Assumption 4.1, we compute

$\displaystyle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\|\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right\|^{2}\right]$	$\displaystyle\overset{(i)}{\leq}$	$\displaystyle\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)^{\top}\Sigma_{\pi}^{*}(\boldsymbol{\theta}^{t},\boldsymbol{\theta}^{t}_{*})\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)$
	$\displaystyle\overset{(ii)}{\leq}$	$\displaystyle\frac{(1-\nu)^{2}}{\gamma^{2}}\cdot\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)^{\top}\Sigma_{\pi}\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)$
	$\displaystyle=$	$\displaystyle\frac{(1-\nu)^{2}}{\gamma^{2}}\cdot\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\|\widehat{Q}(s,a^{1},a^{2};\boldsymbol{\theta}^{t})-\widehat{Q}(s,a^{1},a^{2};\boldsymbol{\theta}^{t}_{*})\right\|^{2}\right],$

where (i) is due to (34), and (ii) is due to Assumption 4.1. Similar to Lemma B.3, we can also obtain (B.2) in this lemma. By substituting (C.1) into (B.2), the proof can be completed.

∎

Now we are ready to prove Theorem 4.2. With a slight abuse of notation, we redefine

	$\displaystyle\mathbf{g}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{% \theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{% \theta}^{t}),\quad\bar{\mathbf{g}}(\boldsymbol{\theta}^{t})\ =\ \mathbb{E}_{% \mu,\pi,\mathbb{P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right]$
	$\displaystyle\mathbf{m}(\boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},% \boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};% \boldsymbol{\theta}^{0}),\quad\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\ =\ % \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],$

where

	$\displaystyle\Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{% \theta}^{t})$	$\displaystyle=$	$\displaystyle Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t% })+\gamma\max_{b^{1}\in\mathcal{A}}\min_{b^{2}\in\mathcal{A}}Q(\phi(s_{t+1},b^% {1},b^{2});\boldsymbol{\theta}^{t})\right),$
	$\displaystyle\widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},% \boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s% _{t},a_{t})+\gamma\max_{b^{1}\in\mathcal{A}}\min_{b^{2}\in\mathcal{A}}\widehat% {Q}(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})\right).$

Let $\Delta_{t}=\Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}% ^{t})$ and $\widehat{\Delta}_{t}=\widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t}% ,\boldsymbol{\theta}^{t})$ . After redefining the corresponding notation, we can similarly derive (18) and adopt the associated lemmas. Due to the introduction of the additional Assumption 4.1, we provide Lemma C.1, ensuring that terms $\mathbf{I}_{1}\sim\mathbf{I}_{4}$ can be estimated. The remainder of the proof is entirely analogous to Sections A.2 and B.2. Thus, we conclude the proof.
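The only structural change relative to neural Q-learning is that the TD target uses the max-min operator over the two players' actions. The sketch below evaluates the TD error $\Delta$ as defined above for a finite action set, assuming the next-state Q values are already available as a matrix; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def maxmin_value(q_next):
    """Max over player-1 actions (rows) of the min over player-2 actions (columns)."""
    return np.max(np.min(q_next, axis=1))


def minimax_td_error(q_sa, reward, q_next, gamma):
    """TD error Q(x_t) - (r + gamma * max_{b1} min_{b2} Q(phi(s_{t+1}, b1, b2)))."""
    return q_sa - (reward + gamma * maxmin_value(q_next))


# Toy next-state Q values: rows index b^1, columns index b^2.
q_next = np.array([[1.0, 3.0],
                   [2.0, 0.5]])

# Row minima are 1.0 and 0.5, so the max-min value is 1.0.
delta = minimax_td_error(q_sa=1.2, reward=0.1, q_next=q_next, gamma=0.9)
print(delta)
```

In the algorithm itself the entries of `q_next` would come from the network $Q(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})$ ; here they are fixed by hand to keep the example self-contained.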

Appendix D Supporting Lemmas for Multi-layer Neural Network

Recalling the definition of the parameterized Q-function, we present the following lemmas related to neural network functions, which play a crucial role in illustrating the main results of our paper. $\left\{\tau_{i}>0\right\}_{i=1,\ldots,10}$ mentioned below are universal constants.

Lemma D.1.

For any $t\in\{1,2,\cdots,T\}$ , we have

\|\boldsymbol{\theta}^{t}\|\leq\tau_{1}\sqrt{m},\quad w.p.\ 1-L\exp{(-\tau_{2}m)}.

Proof.

By Lemma G.2 in (Du et al., 2019), $\|\boldsymbol{\theta}^{0}\|\leq\mathcal{O}\left(\sqrt{m}\right)$ with probability at least $1-L\exp{(-\tau_{2}m)}$ . Therefore,

$\displaystyle\|\boldsymbol{\theta}^{t}\|_{2}$	$\displaystyle\leq$	$\displaystyle\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}\|$
	$\displaystyle\leq$	$\displaystyle\omega+\|\boldsymbol{\theta}^{0}\|$
	$\displaystyle\leq$	$\displaystyle\mathcal{O}\left(\sqrt{m}\right).$

∎

Lemma D.2.

For any $l\in\{1,2,\cdots,L\}$ , we have

\|\boldsymbol{x}^{(l)}\|\leq\tau_{3}\sqrt{m},\quad\text{and}\quad\left\|\nabla% _{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})\right\|\leq\tau_{4% },\quad w.p.\ 1-L\exp{(-\tau_{2}m)}.

Lemma D.2 has been proved by (Tian et al., 2022) (Lemmas A.6 $\sim$ A.10).

Lemma D.3.

For any $t\in\{1,2,\cdots,T\}$ and $\boldsymbol{\theta}\in S_{\omega}$ , we have

\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})\right|\leq\tau_{5}\sqrt{\log(T% /\delta)},\quad w.p.\ 1-\delta-L\exp{(-\tau_{2}m)}.

Proof.

By Lemma D.2, we have $\frac{1}{\sqrt{m}}\|\boldsymbol{x}^{(L)}\|\leq\tau_{3},\ w.p.\ 1-L\exp{(-\tau_% {2}m)}$ . Recall the definition of the parameterized Q-function:

Q(\boldsymbol{x};\boldsymbol{\theta})=\frac{1}{\sqrt{m}}\boldsymbol{b}^{\top}% \boldsymbol{x}^{(L)},

where each element of $\boldsymbol{b}$ is generated from a uniform distribution over $\{-1,+1\}$ . For each $\boldsymbol{x}$ , by Hoeffding inequality, we have

\mathbb{P}\left(\left|\left<\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)},\boldsymbol{b}\right>\right|\geq t\right)\leq 2\exp\left(-\frac{2t^{2}}{4\|\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)}\|^{2}}\right)\overset{(i)}{\leq}2\exp\left(-\frac{t^{2}}{2\tau_{3}^{2}}\right),

where (i) follows from Lemma D.2 and the normalization $\frac{1}{\sqrt{m}}$ matches the definition of $Q$ above. Substituting $\frac{\delta}{T}=2\exp\left(-\frac{t^{2}}{2\tau_{3}^{2}}\right)$ , we get

\mathbb{P}\left(\left|\left<\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)},\boldsymbol{b}\right>\right|\geq\tau_{3}\sqrt{2\log(T/\delta)}\right)\leq\frac{\delta}{T}.

Now, by the union bound, if we set $\tau_{5}=\tau_{3}\sqrt{2}$ , then

\mathbb{P}\left(\max_{t\in[T]}\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})% \right|\geq\tau_{5}\sqrt{\log(T/\delta)}\right)\leq\sum_{t=1}^{T}\mathbb{P}% \left(\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})\right|\geq\tau_{5}\sqrt{% \log(T/\delta)}\right)\leq\delta,

which completes the proof. ∎
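The Hoeffding step can be checked empirically. The following Monte Carlo sketch assumes a fixed last-layer output with $\|\boldsymbol{x}^{(L)}\|=\tau_{3}\sqrt{m}$ and draws $\boldsymbol{b}$ uniformly from $\{-1,+1\}^{m}$ ; the width, the threshold, the value of $\tau_{3}$ and the number of trials are arbitrary choices made for illustration.

```python
import numpy as np


def rademacher_tail(v, t, n_trials, rng):
    """Empirical estimate of P(|<v, b>| >= t) for b uniform on {-1, +1}^m."""
    b = rng.choice([-1.0, 1.0], size=(n_trials, v.size))
    return np.mean(np.abs(b @ v) >= t)


rng = np.random.default_rng(0)
m, tau3, t = 256, 1.0, 2.0  # illustrative values only

# Hypothetical last-layer output, rescaled so that ||x^(L)|| = tau3 * sqrt(m).
x_L = rng.normal(size=m)
x_L *= tau3 * np.sqrt(m) / np.linalg.norm(x_L)

v = x_L / np.sqrt(m)  # Q(x; theta) = <v, b> with ||v|| = tau3
empirical = rademacher_tail(v, t, n_trials=10_000, rng=rng)
bound = 2.0 * np.exp(-t ** 2 / (2.0 * tau3 ** 2))
print(empirical, bound)
```

The empirical tail frequency is typically well below the Hoeffding bound, which is expected since the bound holds for every direction `v` of the given norm.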

Lemma D.4.

Denote $\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})$ as the Hessian matrix of $Q(\boldsymbol{x};\boldsymbol{\theta})$ . Then for all $\boldsymbol{x},\boldsymbol{\theta}\in S_{\omega}$ , we have $w.p.\ 1-\delta$ that

\|\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})\|_{2}% \leq\tau_{6}m^{-\frac{1}{2}}

and

\left|Q(\boldsymbol{x};\boldsymbol{\theta})-\widehat{Q}(\boldsymbol{x};% \boldsymbol{\theta})\right|\leq\tau_{7}m^{-\frac{1}{2}}.

Proof.

The first inequality in Lemma D.4 has been proved by (Liu et al., 2020b) (Theorem 3.2), which implies that $Q(\boldsymbol{x};\boldsymbol{\theta})$ is $\mathcal{O}(m^{-\frac{1}{2}})$ -smooth w.r.t. $\boldsymbol{\theta}$ . Therefore, by Taylor's theorem and $\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|\leq\omega$ for $\boldsymbol{\theta}\in S_{\omega}$ ,

\left|Q(\boldsymbol{x};\boldsymbol{\theta})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})\right|=\left|Q(\boldsymbol{x};\boldsymbol{\theta})-Q(\boldsymbol{x};\boldsymbol{\theta}^{0})-\left<\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right>\right|\leq\frac{\tau_{6}}{2}m^{-\frac{1}{2}}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|^{2}=\mathcal{O}(m^{-\frac{1}{2}}).

∎
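The $m^{-\frac{1}{2}}$ linearization error can be illustrated on a much simpler model than the one analyzed in the paper. The sketch below uses a two-layer network with tanh activation, trains only the first-layer weights, and takes $\widehat{Q}$ to be the first-order Taylor expansion at initialization; all of these are simplifying assumptions chosen so that a deterministic bound is available. Since $|\tanh''|\leq 4/(3\sqrt{3})$ , a per-neuron Taylor remainder gives $|Q-\widehat{Q}|\leq\frac{2}{3\sqrt{3}}\,\omega^{2}m^{-\frac{1}{2}}$ for $\|\boldsymbol{x}\|=1$ and a perturbation of Frobenius norm $\omega$ .

```python
import numpy as np


def q_net(x, W, b):
    """Two-layer network Q(x; W) = b^T tanh(W x) / sqrt(m) (illustrative model)."""
    return b @ np.tanh(W @ x) / np.sqrt(W.shape[0])


def q_lin(x, W, W0, b):
    """First-order Taylor expansion of q_net around W0, evaluated at W."""
    z0 = W0 @ x
    grad_dot = b @ ((1.0 - np.tanh(z0) ** 2) * ((W - W0) @ x)) / np.sqrt(W0.shape[0])
    return q_net(x, W0, b) + grad_dot


def linearization_gap(m, omega, d, rng):
    """|Q - Q_hat| for a random perturbation of Frobenius norm omega."""
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    W0 = rng.normal(size=(m, d))
    b = rng.choice([-1.0, 1.0], size=m)
    delta = rng.normal(size=(m, d))
    delta *= omega / np.linalg.norm(delta)  # Frobenius norm for 2-D arrays
    return abs(q_net(x, W0 + delta, b) - q_lin(x, W0 + delta, W0, b))


rng = np.random.default_rng(0)
omega, d = 1.0, 8
SECOND_DERIV_MAX = 4.0 / (3.0 * np.sqrt(3.0))  # sup |tanh''|

gaps = {m: linearization_gap(m, omega, d, rng) for m in (64, 1024, 16384)}
bounds = {m: 0.5 * SECOND_DERIV_MAX * omega ** 2 / np.sqrt(m) for m in gaps}
for m in gaps:
    print(m, gaps[m], bounds[m])
```

The observed gaps are usually far smaller than the worst-case bound, because a random perturbation spreads its norm across neurons rather than concentrating it.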

Lemma D.5.

Let $\boldsymbol{\theta}\in S_{\omega}$ with the radius satisfying $\omega=\mathcal{O}(1)$ . Then for all $\|\boldsymbol{x}\|_{2}=1$ and $\boldsymbol{\theta}^{t}_{*}\in S_{2\omega}$ in the neural temporal difference learning algorithm 1, it holds that

\displaystyle\left|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-% \mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\boldsymbol{\theta}^{t}-% \boldsymbol{\theta}^{t+1}_{*}\right>\right|

\displaystyle\leq

\displaystyle\left(\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)% }+2\tau_{4}\tau_{8}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-% \boldsymbol{\theta}^{t+1}_{*}\right\|

with probability at least $1-2\delta-2L\exp{(-\tau_{2}m)}$ over the randomness of the initial point, and $\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)\right\|_{2}\leq\tau_{10}% \sqrt{\log(T/\delta)}$ holds with probability at least $1-\delta-2L\exp{(-\tau_{2}m)}$ .

Proof.

Note that

$\displaystyle\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|$	$\displaystyle=$	$\displaystyle\left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|$
	$\displaystyle\leq$	$\displaystyle\left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\left(\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right)\right\|$
		$\displaystyle+\left\|\left(\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|,$

where

	$\displaystyle\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta% }^{t})$	$\displaystyle=$	$\displaystyle Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t% })+\gamma Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right),$
	$\displaystyle\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},% \boldsymbol{\theta}^{t})$	$\displaystyle=$	$\displaystyle\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s% _{t},a_{t})+\gamma\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})% \right).$

For the first term in (D), by Lemma D.3, we have $w.p.\ 1-\delta-L\exp{(-\tau_{2}m)}$ that

\left|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})% \right|\leq\tau_{8}\sqrt{\log(T/\delta)}.

By Lemma D.4, we get that

\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|\leq\|\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\|_{2}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|\leq\tau_{9}\omega m^{-\frac{1}{2}}.

Therefore,

\left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})% \cdot\left(\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta% }^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0% })\right)\right\|\leq\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/% \delta)}.

(37)

For the second term in (D), we decompose it into

	$\displaystyle\left\|\left(\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|$	(38)
$\displaystyle\leq$	$\displaystyle\left\|\left(Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|+\left\|\left(Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|$
$\displaystyle\leq$	$\displaystyle\left\|Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\right\|\cdot\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|+\left\|Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right\|\cdot\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|$
$\displaystyle\overset{(i)}{\leq}$	$\displaystyle 2\tau_{4}\tau_{8}m^{-\frac{1}{2}},$

with probability at least $1-\delta-L\exp{(-\tau_{2}m)}$ , where (i) is due to Lemmas D.3 and D.4. Plugging (37) and (38) into (D) yields that

	$\displaystyle\left\|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right>\right\|$	$\displaystyle\leq$	$\displaystyle\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|\cdot\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|$
		$\displaystyle\leq$	$\displaystyle\left(\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}+2\tau_{4}\tau_{8}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|.$

Thus we complete the proof. ∎

Lemma D.5 provides the upper bounds on $\mathbf{I}_{1}$ and $\mathbf{I}_{2}$ in Section A.2. As discussed in Section 3.1, for a finite MDP, the Gram matrix of the L-layer neural network function is positive definite and has a minimum eigenvalue of $\mathcal{O}(1)$ when the network width is sufficiently large. This, in fact, serves as an upper bound for $m^{*}$ in Assumption 3.1. Further details are provided in Remark D.6.

Remark D.6.

In this special case, we assume that both the state space and the action space are finite. Let $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state space and action space, respectively. For simplicity of notation, we view $\mathbf{Q}(\boldsymbol{\theta})$ as an $|\mathcal{S}||\mathcal{A}|\times 1$ column vector, with $(s,a)$ being a multi-index arranged in lexicographical order. Let $d$ denote the joint distribution of $(s,a)$ under $\mu\times\pi$ , and let $\mathbf{D}=\text{diag}(d)$ be an $|\mathcal{S}||\mathcal{A}|$ -dimensional diagonal matrix whose $(s,a)$ -th diagonal entry is $d(s,a)$ , with the pairs $(s,a)$ ordered as in $\mathbf{Q}(\boldsymbol{\theta})$ . Denote by $\mathbf{J}$ the Jacobian matrix of $\mathbf{Q}(\boldsymbol{\theta}^{0})$ and let $\mathbf{J_{D}}=\mathbf{D}^{\frac{1}{2}}\mathbf{J}$ . Thus we can rewrite $\Sigma_{\pi}=\mathbf{J_{D}^{\top}}\mathbf{J_{D}}$ . Notice that $\Sigma_{\pi}$ is different from the Gram matrix (Jacot et al., 2018; Du et al., 2018, 2019; Cao & Gu, 2019; Allen-Zhu et al., 2019b) used for deep neural networks. To derive the $\mu$ -weighted Gram matrix $\text{Gram}(\boldsymbol{\theta}^{0})$ , we provide the following definition.

Definition D.7.

(Du et al. (2019); Cao & Gu (2019); Allen-Zhu et al. (2019b, a), Neural Tangent Kernel Matrix). For any $i,j\in\left[|\mathcal{S}||\mathcal{A}|\right]$ , define

		$\displaystyle\widetilde{\boldsymbol{\Theta}}_{i,j}^{(1)}=\boldsymbol{\Sigma}_{% i,j}^{(1)}=\left\langle\widehat{\boldsymbol{x}}_{i},\widehat{\boldsymbol{x}}_{% j}\right\rangle,\quad\mathbf{A}_{ij}^{(l)}=\left(\begin{array}[]{cc}% \boldsymbol{\Sigma}_{i,i}^{(l)}&\boldsymbol{\Sigma}_{i,j}^{(l)}\\ \boldsymbol{\Sigma}_{i,j}^{(l)}&\boldsymbol{\Sigma}_{j,j}^{(l)}\end{array}% \right),$
		$\displaystyle\boldsymbol{\Sigma}_{i,j}^{(l+1)}=2\cdot\mathbb{E}_{(u,v)\sim N% \left(\mathbf{0},\mathbf{A}_{ij}^{(l)}\right)}[\sigma(u)\sigma(v)],$
		$\displaystyle\widetilde{\boldsymbol{\Theta}}_{i,j}^{(l+1)}=\widetilde{% \boldsymbol{\Theta}}_{i,j}^{(l)}\cdot 2\cdot\mathbb{E}_{(u,v)\sim N\left(% \mathbf{0},\mathbf{A}_{ij}^{(l)}\right)}\left[\sigma^{\prime}(u)\sigma^{\prime% }(v)\right]+\boldsymbol{\Sigma}_{i,j}^{(l+1)},$

where $\widehat{\boldsymbol{x}}=\sqrt{d(s,a)}\phi(s,a)$ . Then we call $\boldsymbol{\Theta}^{(L)}=\left[\left(\widetilde{\boldsymbol{\Theta}}_{i,j}^{(% L)}+\boldsymbol{\Sigma}_{i,j}^{(L)}\right)/2\right]_{|\mathcal{S}||\mathcal{A}% |\times|\mathcal{S}||\mathcal{A}|}$ the $\mu$ -weighted neural tangent kernel matrix of an $L$ -layer ReLU network on $\mu$ -weighted state-action pairs $\widehat{\boldsymbol{x}}_{1},\ldots,\widehat{\boldsymbol{x}}_{|\mathcal{S}||% \mathcal{A}|}$ .
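For the ReLU activation, both Gaussian expectations in Definition D.7 have well-known closed forms (the arc-cosine kernel formulas), so the recursion can be evaluated directly. The sketch below is one possible implementation under that assumption; the function names, the toy feature map and the toy distribution `d` are hypothetical, and the code requires $d(s,a)>0$ so that no diagonal entry vanishes.

```python
import numpy as np


def relu_expectations(S):
    """Closed forms of E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for
    (u, v) ~ N(0, [[S_ii, S_ij], [S_ij, S_jj]]), for all pairs (i, j)."""
    std = np.sqrt(np.diag(S))
    norm = np.outer(std, std)
    rho = np.clip(S / norm, -1.0, 1.0)
    angle = np.arccos(rho)
    e_sigma = norm * (np.sqrt(1.0 - rho ** 2) + rho * (np.pi - angle)) / (2.0 * np.pi)
    e_dsigma = (np.pi - angle) / (2.0 * np.pi)
    return e_sigma, e_dsigma


def weighted_ntk(X_hat, L):
    """Theta^(L) of Definition D.7 for the weighted inputs given as rows of X_hat."""
    Sigma = X_hat @ X_hat.T          # Sigma^(1)
    Theta = Sigma.copy()             # Theta_tilde^(1)
    for _ in range(L - 1):
        e_sigma, e_dsigma = relu_expectations(Sigma)
        Sigma_next = 2.0 * e_sigma
        Theta = Theta * 2.0 * e_dsigma + Sigma_next
        Sigma = Sigma_next
    return (Theta + Sigma) / 2.0


rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi(s, a)|| = 1
d = np.array([0.1, 0.2, 0.3, 0.4])                  # toy distribution with d(s, a) > 0
X_hat = np.sqrt(d)[:, None] * phi

Theta_L = weighted_ntk(X_hat, L=3)
print(np.linalg.eigvalsh(Theta_L).min())
```

For a single unit-norm input with $d=1$ and $L=3$ the recursion gives $\widetilde{\boldsymbol{\Theta}}^{(3)}=3$ and $\boldsymbol{\Sigma}^{(3)}=1$ , hence $\boldsymbol{\Theta}^{(3)}=2$ , which serves as a quick sanity check of the implementation.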

There is a large body of work (Jacot et al., 2018; Du et al., 2018, 2019; Cao & Gu, 2019; Allen-Zhu et al., 2019b) exploring the positive definiteness of $\text{Gram}(\boldsymbol{\theta}^{0})$ . Suppose that for all pairs $(s,a)\neq(s^{\prime},a^{\prime})$ , $\|\phi(s,a)\|=1,d(s,a)>0$ and $\phi(s,a)\nparallel\phi(s^{\prime},a^{\prime})$ . Theorem 1 and Proposition 2 in Jacot et al. (2018) show that for an $L$ -layer neural network with Gaussian initialization parameters,

\left<\nabla_{\boldsymbol{\theta}}Q(\widehat{\boldsymbol{x}}_{i};\boldsymbol{% \theta}^{0}),\nabla_{\boldsymbol{\theta}}Q(\widehat{\boldsymbol{x}}_{j},% \boldsymbol{\theta}^{0})\right>\rightarrow\boldsymbol{\Theta}^{(L)}_{i,j},% \quad\text{as}\ m\rightarrow\infty.

That is, under the NTK regime, the $\mu$ -weighted Gram matrix $\text{Gram}(\boldsymbol{\theta}^{0})=\mathbf{J_{D}}\mathbf{J_{D}^{\top}}$ converges to $\boldsymbol{\Theta}^{(L)}$ when $m$ is sufficiently large. Let $\lambda_{\min}\left(\boldsymbol{\Theta}^{(L)}\right)=2\lambda^{\prime}=% \mathcal{O}(1)$ . For any $\delta\in(0,1)$ , there exists $m^{*}=\text{Poly}(|\mathcal{S}||\mathcal{A}|,L,\delta,\lambda^{\prime})$ such that if $m\geq m^{*}$ , we have

\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)\geq\lambda^{% \prime}\quad w.p.\ 1-\delta.

This signifies that if the network width $m\geq m^{*}$ , then we have $\lambda_{0}\geq\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)% \geq\lambda^{\prime}$ , thereby substantiating our claim.
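The link between $\Sigma_{\pi}=\mathbf{J_{D}^{\top}}\mathbf{J_{D}}$ and $\text{Gram}(\boldsymbol{\theta}^{0})=\mathbf{J_{D}}\mathbf{J_{D}^{\top}}$ rests on the standard fact that the two products share the same non-zero eigenvalues. The sketch below checks this on a random matrix standing in for the Jacobian; the sizes and the random `J` are hypothetical and only illustrate the linear-algebra step, not the network itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_params = 6, 40   # stand-ins for |S||A| and the number of parameters

J = rng.normal(size=(n_pairs, n_params))   # hypothetical Jacobian of Q at theta^0
d = rng.random(n_pairs) + 0.1
d /= d.sum()                               # a distribution with d(s, a) > 0

J_D = np.sqrt(d)[:, None] * J              # D^{1/2} J
Sigma_pi = J_D.T @ J_D                     # n_params x n_params, rank <= n_pairs
gram = J_D @ J_D.T                         # n_pairs x n_pairs

gram_eigs = np.linalg.eigvalsh(gram)       # ascending order
sigma_eigs = np.linalg.eigvalsh(Sigma_pi)  # ascending order
nonzero_sigma_eigs = sigma_eigs[-n_pairs:]
print(gram_eigs.min(), nonzero_sigma_eigs.min())
```

In this full-row-rank example the smallest non-zero eigenvalue of `Sigma_pi` coincides with the smallest eigenvalue of `gram`, which is the quantity the remark bounds from below by $\lambda^{\prime}$ .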

Appendix E Additional Notes on the Experiments in Section 5

In this section, we further discuss the experimental setup introduced in Section 5. As mentioned in Section 5, our experiments mainly test the following two aspects: (1) how the network width $m$ affects the final error of the algorithm (first two subfigures in Figure 1); (2) the behavior of the minimum nonzero singular value in Assumption 3.3 (latter two subfigures in Figure 1).

For point (1), we first generate 2000 samples according to a given policy $\pi$ to imitate the Markov process of Algorithm 1. A two-layer neural network with ELU activation is introduced, and the parameters are initialized using Algorithm 1. We set the initial learning rate at 0.001 with linear decay (per epoch) and a batch size of 100. Notably, as the parameter $m$ increases, the TD algorithm demonstrates smaller final TD errors.

For point (2), our experiments are based on three main points. First, note that the norm of the feature map and the random initialization of the parameters affect the scaling of the gradient norm of $Q(s,a;\boldsymbol{\theta})$ . Thus we employ the ratio $r=\sigma_{\max}/\sigma_{\min}$ , where $\sigma_{\min}$ is the minimum non-zero singular value, in order to eliminate the impact of numerical scaling. Second, we set varying network widths to verify the existence of $\lambda_{0}$ . Finally, it is difficult to directly obtain the joint distribution of $(s,a)$ under a fixed learning policy. However, we have that $\left\|\frac{1}{N}\sum_{(s,a)\sim\mu\times\pi}\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta})^{\top}-\Sigma_{\pi}\right\|_{2}=\mathcal{O}(1/N)$ . To avoid the effects of sampling randomness, we estimate $\Sigma_{\pi}$ from different samples. The experiments demonstrate that the minimum non-zero singular value $\lambda_{0}$ converges to a constant as $m$ increases.
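The ratio $r=\sigma_{\max}/\sigma_{\min}$ can be computed from sampled gradients as sketched below. This is not the authors' experimental code: the function name, the relative tolerance used to decide which singular values count as non-zero, and the toy gradients are all assumptions made for illustration.

```python
import numpy as np


def singular_value_ratio(grads, rel_tol=1e-8):
    """sigma_max / sigma_min over the non-zero singular values of the
    empirical matrix (1/N) * sum_i g_i g_i^T, with rows of `grads` as g_i."""
    n = grads.shape[0]
    sigma_hat = grads.T @ grads / n
    s = np.linalg.svd(sigma_hat, compute_uv=False)   # descending order
    nonzero = s[s > rel_tol * s[0]]                  # drop numerically zero values
    return nonzero[0] / nonzero[-1]


# Toy gradients spanning a 2-dimensional subspace of R^3.
grads = np.array([[2.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
ratio = singular_value_ratio(grads)
print(ratio)
```

Here the empirical matrix is $\text{diag}(2,0.5,0)$ , so the zero singular value is discarded and the ratio equals 4. Because the ratio is invariant to rescaling all gradients by a common factor, it removes the dependence on the overall gradient scale mentioned above.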

	$\displaystyle\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{0}\|\leq$	$\displaystyle\ \|\bar{\boldsymbol{\theta}}_{\parallel}-\boldsymbol{\theta}^{0}_{\parallel}\|+\|\boldsymbol{\theta}_{\bot}-\boldsymbol{\theta}^{0}_{\bot}\|$
	$\displaystyle\leq$	$\displaystyle\ \|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{0}\|+\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{0}\|\leq 2\omega,$

$$\begin{aligned}
\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2} &= \left\|\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})\right)-\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t+1}_{*}\right)\right\|^{2} \\
&\leq \|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t+1}_{*}\|^{2} \\
&= \|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t}_{*}+\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2} \\
&= \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}+\|\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right> \\
&\quad -2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>+2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\right> \\
&\overset{(i)}{=} \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right> \\
&\quad -2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2},
\end{aligned}$$
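The first inequality in this chain relies on the fact that Euclidean projection onto a closed convex set is nonexpansive. A small numerical sketch of that property follows, assuming for illustration that $S_{2\omega}$ is a Euclidean ball of radius $2\omega$ around a center point; the function name and the ball-shaped set are assumptions, not taken from the text.

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection of theta onto {x : ||x - center|| <= radius}."""
    diff = theta - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return theta.copy()          # already inside the ball: unchanged
    return center + diff * (radius / norm)

def is_nonexpansive(u, v, center, radius, tol=1e-12):
    """Check ||P(u) - P(v)|| <= ||u - v|| for one pair of points."""
    pu = project_ball(u, center, radius)
    pv = project_ball(v, center, radius)
    return np.linalg.norm(pu - pv) <= np.linalg.norm(u - v) + tol
```

When one of the two points already lies in the set, its projection is the point itself, which is how the projected comparison point can be replaced by the unprojected one in the bound.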

$$\begin{aligned}
&\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2} \\
&\leq \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2} \\
&= \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right> \\
&\quad -2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2} \\
&\overset{(i)}{\leq} \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\underbrace{\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}}_{\text{I}_{1}\text{: Gradient Bound}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>}_{\text{I}_{2}\text{: Gradient Gap}} \\
&\quad -2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>}_{\text{I}_{3}\text{: Markov Sampling Error}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>}_{\text{I}_{4}\text{: Gradient Descent}},
\end{aligned} \tag{18}$$

$$\begin{aligned}
\|\theta^{t}-\theta^{t+1}_{*}\| &\leq \|\theta^{t}-\theta^{0}\|+\|\theta^{0}-\theta^{0}_{*}\|+\|\theta^{0}_{*}-\theta^{t+1}_{*}\| \\
&= \|\theta^{t}-\theta^{0}\|+\|\theta^{0}-\theta^{0}_{*}\|+\|\theta^{0}_{\bot}-\theta^{t+1}_{\bot}\| \\
&\leq \|\theta^{t}-\theta^{0}\|+\|\theta^{0}-\theta^{0}_{*}\|+\|\theta^{0}-\theta^{t+1}\| \leq 3\omega.
\end{aligned}$$

$$\begin{aligned}
\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d} &= \left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
&\overset{(i)}{=} \left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
&\leq \left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
&\overset{(ii)}{\leq} \gamma\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d},
\end{aligned}$$
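
Rearranging the last inequality makes the resulting bound explicit. The step below is a short sketch, assuming that $\gamma\in[0,1)$ is the discount factor, that the left-hand side is finite, and writing $\boldsymbol{\theta}^{*}$ for the stationary point and $Q^{*}$ for the target action-value function. Moving the $\gamma$ term to the left and dividing by $1-\gamma$ gives

$$(1-\gamma)\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}\leq\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}
\quad\Longrightarrow\quad
\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}\leq\frac{1}{1-\gamma}\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}.$$

In words, the error of the stationary point is at most $1/(1-\gamma)$ times the projection error of the target onto the function class $\mathcal{F}_{\omega,m}$.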
