An Improved Finite-time Analysis of Temporal Difference
Learning with Deep Neural Networks

Zhifa Ke    Zaiwen Wen    Junyu Zhang
Abstract

Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.

Machine Learning, ICML

1 Introduction

The temporal difference (TD) learning method, first designed for policy evaluation (Sutton, 1988), is a fundamental building block of many popular reinforcement learning (RL) algorithms. In standard TD learning algorithms for tabular MDPs, based on the Bellman operator, the agent iteratively obtains a state-action-reward-transition tuple and then updates the Q values by a weighted average of the current value and the TD target. Once the algorithm converges, the Q function is taken to be the return obtained by executing the target policy from a given initial state-action pair.
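For concreteness, the weighted-average update described above can be written as the following worked equation. This is a sketch of the standard on-policy tabular form; the step size $\alpha_t\in(0,1]$ and the symbol $\Delta_t$ for the TD error are notation assumed here for illustration.

```latex
Q(s_t,a_t) \leftarrow (1-\alpha_t)\,Q(s_t,a_t)
            + \alpha_t\big(r_t + \gamma\,Q(s_{t+1},a_{t+1})\big)
          = Q(s_t,a_t) - \alpha_t\,\Delta_t,
\qquad
\Delta_t := Q(s_t,a_t) - \big(r_t + \gamma\,Q(s_{t+1},a_{t+1})\big).
```

The second form shows that the weighted average is the same as moving the current value against the TD error with step size $\alpha_t$.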

For large-scale RL problems, appropriate parameterization of the Q function is crucial for better scalability of the TD algorithms. Common examples include linear (Tesauro et al., 1995), general smooth nonlinear (Maei et al., 2009), and neural network (Mnih et al., 2013) function approximations. However, it is well known that the naive extension of TD learning and Q-learning algorithms can diverge under general function approximation (Tsitsiklis & Van Roy, 1996). To encourage convergence, numerous variants of TD and Q-learning have been proposed, including Least-squares TD (LSTD) (Bradtke & Barto, 1996; Boyan, 2002) and gradient TD (GTD) (Sutton et al., 2009a, b), to name a few.

Neural network function approximation has achieved huge empirical success in many real-world tasks, including Deep Q-network (DQN) algorithms (Mnih et al., 2013; Van Hasselt et al., 2016), policy improvement methods (Sutton et al., 1999), trust region policy optimization (Schulman et al., 2015), and actor-critic algorithms (Konda & Tsitsiklis, 1999; Lillicrap et al., 2015; Fujimoto et al., 2018). However, due to the analytical difficulties introduced by function approximation, a significant gap exists between the empirical success and the theoretical understanding of these algorithms. Hence, analyzing the convergence and sample complexity of TD learning and Q-learning under various Q function parameterizations has been an active topic in the RL community over the past decades.

Early works focus on the asymptotic convergence of the algorithms with tabular or linear function approximation. For the tabular (stochastic) TD or Q-learning method, Jaakkola et al. (1993) established the asymptotic convergence for the first time. Later on, the asymptotic convergence of algorithms with linear function approximation was extensively studied using ODE-based methods; see, e.g., Tsitsiklis & Van Roy (1996); Perkins & Pendrith (2002); Borkar (2009). Meanwhile, in contrast to the convergence results for RL algorithms under the tabular or linear settings, TD with nonlinear function approximation is known to diverge in general (Tsitsiklis & Van Roy, 1996; Brandfonbrener & Bruna, 2019). To overcome this issue, Maei et al. (2009) proposed to optimize the Mean Squared Projected Bellman Error (MSPBE) via a gradient-based algorithm. Because the problem is nonconvex, only asymptotic convergence to stationary points can be guaranteed.

More recently, benefiting from improved techniques for analyzing stochastic optimization algorithms, there has been a growing body of research on finite-time analysis of TD and Q-learning algorithms with function approximation.

For linear function approximation, the non-asymptotic behavior of TD learning and its variants is relatively well understood, including TD (Bhandari et al., 2018; Dalal et al., 2018; Zou et al., 2019), gradient TD (Dalal et al., 2018; Touati et al., 2018; Liu et al., 2020a), and Least-Squares TD (Lazaric et al., 2010; Prashanth et al., 2014; Tagorti & Scherrer, 2015). In particular, Bhandari et al. (2018) established the first finite-time analysis of linear Q-learning under both i.i.d. sampling and Markovian sampling settings.

For neural network function approximation, which is directly related to this paper, we provide a more detailed discussion. Based on recent advances in the understanding of optimizing ReLU networks (Jacot et al., 2018; Du et al., 2018; Allen-Zhu et al., 2019a, b; Cao & Gu, 2019, 2020), a few recent works have successfully developed finite-time analyses of the neural TD and neural Q-learning algorithms, as long as the Q network is sufficiently wide. Let $Q^{*}$ be the true action-value function and let $Q(s,a;\boldsymbol{\theta})$ denote the action-value function parameterized by a neural network with weights $\boldsymbol{\theta}$, at any state-action pair $(s,a)$. Then we aim to find some $\epsilon$-optimal parameter $\bar{\boldsymbol{\theta}}$ such that $\mathbb{E}\big[(Q(s,a;\bar{\boldsymbol{\theta}})-Q^{*}(s,a))^{2}\big]\leq\epsilon+\epsilon_{\mathcal{F}}$, where the expectation is taken over the possible randomness in the output $\bar{\boldsymbol{\theta}}$ as well as the distribution over the state-action pairs $(s,a)$, and $\epsilon_{\mathcal{F}}$ is the optimal approximation error of the parameterization function class.
In (Xu & Gu, 2020), a neural Q-learning algorithm with a general $L$-layer ReLU network is analyzed, and an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is guaranteed given that the network is sufficiently wide. In (Cai et al., 2023), the authors studied both the neural TD learning and neural Q-learning algorithms for minimizing the MSPBE for policy evaluation and policy optimization, respectively. For policy evaluation, the $Q^{*}$ in the definition of an $\epsilon$-optimal solution defaults to $Q^{\pi}$, with $\pi$ being the policy to be evaluated. For both cases, an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is guaranteed for wide two-layer ReLU networks. In (Sun et al., 2022), an $\tilde{\mathcal{O}}(\epsilon^{-\frac{2}{2-\alpha}})$ complexity has been achieved by an adaptive neural TD algorithm with multi-layer ReLU networks, where $\alpha\in(0,1]$ is a constant that characterizes the sparsity and decay rate of the stochastic semi-gradients. However, without additional assumptions, only an $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity with $\alpha=1$ can be theoretically guaranteed.
Finally, for policy evaluation problems, there are also several works that aim at reducing the width of the over-parameterized Q networks required in the existing works (Tian et al., 2022; Cayci et al., 2023). In terms of complexity, both of them require $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples to obtain an $\epsilon$-optimal solution.

Although existing analyses of the neural TD or neural Q-learning algorithms provide only an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under various settings, an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity should be expected. In fact, a double-loop fitted Q-iteration (FQI) method (Fan et al., 2020) and its single-loop Gauss-Newton variant (Ke et al., 2023) can achieve an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for two-layer Q networks. Let $\mathcal{T}$ be the Bellman (optimality) operator, then the FQI method repeatedly solves a nonlinear least squares subproblem to obtain the next iterate: $\boldsymbol{\theta}_{k+1}\approx\mathop{\mathrm{argmin}}_{\boldsymbol{\theta}\in\Theta}\mathbb{E}\big[(Q(s,a;\boldsymbol{\theta})-\mathcal{T}Q(s,a;\boldsymbol{\theta}_{k}))^{2}\big]$.
Compared to the single-loop neural TD or neural Q-learning method that takes only one sample (or a mini-batch) to update the weights of the Q network, the update scheme of FQI requires repeatedly solving a subproblem to sufficiently high accuracy to ensure convergence, which makes it inefficient and less favorable in practice. Therefore, we would like to raise a question:

Can we improve the existing analysis of the neural temporal difference learning algorithm and obtain an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity under general multi-layer Q neural networks?

To answer this question, we revisit the convergence analysis of the neural TD learning or Q-learning algorithms under the non-i.i.d. Markovian observations where a general $L$-layer neural network is used for Q function parameterization. By proposing a new subspace analysis technique, under suitable conditions, we derive a new $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for neural TD learning or Q-learning, improving the state-of-the-art $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works. Our contributions are summarized as follows.

| Work | Neural Approximation | Network Depth | Network Width | Activation | Sample Complexity |
|---|---|---|---|---|---|
| (Bhandari et al., 2018) | No | NA | NA | NA | $\mathcal{O}(1/\epsilon)$ |
| (Cai et al., 2023) | Yes | 2 | $\Omega(1/\epsilon^{4})$ | ReLU | $\mathcal{O}(1/\epsilon^{2})$ |
| (Xu & Gu, 2020) | Yes | $L$ | $\Omega(1/\epsilon^{6})$ | ReLU | $\mathcal{O}(1/\epsilon^{2})$ |
| (Sun et al., 2022) | Yes | $L$ | $\Omega(1/\epsilon^{6})$ | ReLU | $\mathcal{O}(1/\epsilon^{\frac{2}{2-\alpha}})$, $\alpha\in(0,1]$ |
| (Tian et al., 2022) | Yes | $L$ | $\Omega(1/\epsilon^{2})$ | ELU, GeLU | $\mathcal{O}(1/\epsilon^{2})$ |
| Ours | Yes | $L$ | $\Omega(1/\epsilon^{2})$ | ELU, GeLU | $\mathcal{O}(1/\epsilon)$ |

Table 1: Sample complexity for parameterized Q learning to find some $\bar{\boldsymbol{\theta}}$ such that $\mathbb{E}\left[\|Q(s,a;\bar{\boldsymbol{\theta}})-Q^{*}(s,a)\|_{\mu}^{2}\right]\leq\varepsilon$, where $\|f\|_{\mu}^{2}:=\int|f|^{2}d\mu$ and $Q^{*}(s,a)$ satisfies the Bellman optimality equation $Q^{*}(s,a)=\mathcal{T}Q^{*}(s,a)$.
  • Under the non-i.i.d. Markovian sampling setting, we derive an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for both neural TD learning and Q-learning methods under the multi-layer network approximation for Q functions. Our result also improves the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works.

  • Based on our newly developed techniques, we further provide a finite-sample analysis for a minimax neural Q-learning algorithm that solves two-player zero-sum Markov games. An $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is obtained under the non-i.i.d. Markovian sampling setting.

Technically, the subspace analysis approach that we propose to establish the $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is by itself of independent interest. We believe this technique can potentially be applied to linear Q-learning algorithms and linear Actor-Critic algorithms without requiring the positive definiteness assumption of the feature covariance matrix (Bhandari et al., 2018; Zou et al., 2019; Barakat et al., 2022), while maintaining the $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity.

In summary, Table 1 compares our work with the most closely related works in terms of their respective settings and sample complexities. Our work establishes an optimal sample complexity analysis within a broader contextual framework.

2 Preliminaries

We consider the infinite-horizon discounted Markov decision process (MDP), which is denoted as $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,\gamma)$. We consider a general state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. At any state $s\in\mathcal{S}$, if the agent takes an action $a\in\mathcal{A}$, it will receive a reward $r(s,a)\in[-R_{\max},R_{\max}]$ and transition to the next state $s'\in\mathcal{S}$ with probability $\mathbb{P}(s'|s,a)$. We call $r$ the reward function and $\mathbb{P}$ the transition kernel.
Let $\gamma\in(0,1)$ be a discount factor. The goal in an MDP is to find a sequence of actions $\{a_{t}\}_{t\geq 0}$ that maximizes the expected discounted cumulative reward $\mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\big|\,s_{0}\sim\mu\big]$, where $\mu$ is the distribution of the initial state $s_{0}$.
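As a small illustration of the discounted objective, the sketch below evaluates the discounted sum for one finite reward trajectory. The function names and the truncation to a finite horizon are illustrative assumptions; the only fact used from the text is that rewards lie in $[-R_{\max},R_{\max}]$, which bounds the infinite sum by $R_{\max}/(1-\gamma)$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * rewards[t] for a finite (truncated) trajectory."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def return_bound(r_max, gamma):
    """Upper bound R_max / (1 - gamma) on the absolute discounted return."""
    return r_max / (1.0 - gamma)


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
    print(return_bound(1.0, 0.9))
```

The backward recursion avoids computing powers of $\gamma$ explicitly and mirrors the one-step structure that the Bellman operator exploits.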

Let $\Delta_{\mathcal{A}}$ denote the set of all probability distributions over the action space $\mathcal{A}$, and let a policy $\pi:\mathcal{S}\mapsto\Delta_{\mathcal{A}}$ be a mapping that returns a probability distribution $\pi(\cdot|s)\in\Delta_{\mathcal{A}}$ given any state $s\in\mathcal{S}$. If an agent follows a policy $\pi$, then at any state $s_{t}$, it will act by sampling an action $a_{t}\sim\pi(\cdot|s_{t})$. Therefore, the action-value function (Q-function) under the policy $\pi$ is

$$Q^{\pi}(s,a):=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\right],$$

for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, where all actions except $a_{0}$ are sampled according to $\pi$. For any mapping $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, let the Bellman operator $\mathcal{T}^{\pi}$ be

$$\mathcal{T}^{\pi}Q(s,a):=r(s,a)+\gamma\,\mathbb{E}\left[Q(s',a')\mid s'\sim\mathbb{P}(\cdot\mid s,a),\ a'\sim\pi(\cdot\mid s')\right],\quad\forall s,a.$$

Then $\mathcal{T}^{\pi}$ is a $\gamma$-contraction under the infinity norm and $Q^{\pi}$ is the unique solution to the fixed-point equation $Q=\mathcal{T}^{\pi}Q$ (Bertsekas, 2012). If the Q function is parameterized by some function $Q(s,a;\boldsymbol{\theta})$ to gain better scalability for large-scale RL problems, popular approaches for finding a good $\boldsymbol{\theta}$ include minimizing the Mean-Squared Bellman Error (MSBE):

$$\min_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{(s,a)\sim\mu}\Big[\left(Q(s,a;\boldsymbol{\theta})-\mathcal{T}^{\pi}Q(s,a;\boldsymbol{\theta})\right)^{2}\Big],\tag{1}$$

and minimizing the Mean-Squared Projected Bellman Error (MSPBE):

$$\min_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{(s,a)\sim\mu}\left[\left(Q(s,a;\boldsymbol{\theta})-\Pi_{\mathcal{F}}\mathcal{T}^{\pi}Q(s,a;\boldsymbol{\theta})\right)^{2}\right],\tag{2}$$

where $\Theta$ is a feasible domain of the parameter $\boldsymbol{\theta}$, $\mu$ is some distribution over state-action pairs, and $\Pi_{\mathcal{F}}$ is the projection onto some function class $\mathcal{F}$. Typical choices of $\mathcal{F}$ include the Q function parameterization class itself, $\mathcal{F}:=\{Q(\cdot;\boldsymbol{\theta}):\boldsymbol{\theta}\in\Theta\}$ (Maei et al., 2009), and some local linearization of the parameterization function class (Cai et al., 2023).
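To make the Bellman operator, its contraction property, and the MSBE objective (1) concrete, here is a minimal sketch on a hypothetical two-state, two-action MDP with a tabular Q function. All numbers in `P`, `R`, and `PI`, as well as the function names, are illustrative assumptions and are not taken from the paper. The MSPBE (2) is not implemented, since it additionally requires a projection onto a chosen function class.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
P = [  # P[s][a][s2] = probability of moving from s to s2 under action a
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
R = [[1.0, 0.0], [-1.0, 0.5]]   # R[s][a] = reward, so R_max = 1
PI = [[0.6, 0.4], [0.2, 0.8]]   # PI[s][a] = policy probability pi(a|s)


def bellman(Q):
    """Apply T^pi: (T^pi Q)(s,a) = r(s,a) + gamma * E[Q(s',a')]."""
    n_s, n_a = len(R), len(R[0])
    # Inner expectation over a' ~ pi(.|s') for every next state s'.
    V = [sum(PI[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


def sup_dist(Q1, Q2):
    """Infinity-norm distance between two tabular Q functions."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def msbe(Q, mu):
    """Mean-squared Bellman error of Q under the weights mu[s][a]."""
    TQ = bellman(Q)
    return sum(mu[s][a] * (Q[s][a] - TQ[s][a]) ** 2
               for s in range(len(Q)) for a in range(len(Q[0])))


def evaluate_policy(n_iters=500):
    """Approximate Q^pi by iterating Q <- T^pi Q from zero."""
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_iters):
        Q = bellman(Q)
    return Q


if __name__ == "__main__":
    Q_pi = evaluate_policy()
    uniform = [[0.25, 0.25], [0.25, 0.25]]
    print("Q^pi ~", Q_pi)
    print("MSBE at Q^pi:", msbe(Q_pi, uniform))
```

Because each application of `bellman` shrinks the infinity-norm error by a factor of at most $\gamma$, repeated application converges to the fixed point $Q^{\pi}$, at which the MSBE vanishes.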

In this paper, we study the neural temporal difference learning method where the action-value function is parameterized by some multi-layer neural network. Let us define a feedforward neural network by the following recursion:

$$\boldsymbol{x}^{(l)}=\frac{1}{\sqrt{m}}\sigma\left(\boldsymbol{W}_{l}\boldsymbol{x}^{(l-1)}\right),\quad l\in\{1,2,\cdots,L\},\tag{3}$$

where $\boldsymbol{W}_{1}\in\mathbb{R}^{m\times d}$, $\boldsymbol{W}_{l}\in\mathbb{R}^{m\times m}$ for $2\leq l\leq L$ are the weight matrices of the network, $\sigma(\cdot)$ is an activation function, and the input is a feature map $\boldsymbol{x}^{(0)}=\phi(s,a)\in\mathbb{R}^{d}$ for any state-action pair $(s,a)$. For simplicity of notation, we write $\boldsymbol{x}=\boldsymbol{x}^{(0)}$, then $Q(s,a)$ is parameterized by

$$Q(\boldsymbol{x};\boldsymbol{\theta})=\frac{1}{\sqrt{m}}\boldsymbol{b}^{\top}\boldsymbol{x}^{(L)},\tag{4}$$

where the parameter $\boldsymbol{\theta}=\left(\mathrm{Vec}(\boldsymbol{W}_{1});\cdots;\mathrm{Vec}(\boldsymbol{W}_{L})\right)$ denotes the collection of all weight matrices, and $\boldsymbol{b}$ is given by a random initialization. $\mathrm{Vec}(\cdot)$ stands for the vectorization operator that reshapes a matrix to a column vector by stacking its columns one by one, and the ";" separator in $\boldsymbol{\theta}$ stands for the vertical stacking of the elements. That is, we reshape $\boldsymbol{\theta}$ to a long column vector for notational convenience in later discussion.
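The recursion (3), the readout (4), and the column-stacking vectorization can be sketched in plain Python as follows. The function names, the use of nested lists for matrices, and the logistic activation are illustrative assumptions, chosen only to keep the example dependency-free.

```python
import math


def sigmoid(y):
    """Logistic function, used here as one illustrative smooth activation."""
    return 1.0 / (1.0 + math.exp(-y))


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def q_value(weights, b, x, act=sigmoid):
    """Evaluate Q(x; theta) using recursion (3) followed by readout (4)."""
    scale = 1.0 / math.sqrt(len(b))          # the 1/sqrt(m) factor
    for W in weights:                        # layers l = 1, ..., L
        x = [scale * act(z) for z in matvec(W, x)]
    return scale * sum(bi * xi for bi, xi in zip(b, x))


def vec(W):
    """Stack the columns of W one by one into a flat list."""
    return [W[i][j] for j in range(len(W[0])) for i in range(len(W))]


def flatten(weights):
    """Build theta = (Vec(W_1); ...; Vec(W_L)) as one long list."""
    return [v for W in weights for v in vec(W)]


if __name__ == "__main__":
    # One hidden layer with m = 2, d = 2, and all-zero weights.
    W1 = [[0.0, 0.0], [0.0, 0.0]]
    print(q_value([W1], [1.0, 1.0], [1.0, 2.0]))
    print(vec([[1, 2], [3, 4]]))
```

Note that $\boldsymbol{b}$ is passed separately from `weights`, reflecting that it is not part of the trainable parameter $\boldsymbol{\theta}$.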

Assumption 2.1.

The activation function $\sigma(\cdot)$ is $L_{1}$-Lipschitz and $L_{2}$-smooth, i.e., for all $y_{1},y_{2}\in\mathbb{R}$:

$$\left|\sigma(y_{1})-\sigma(y_{2})\right|\leq L_{1}|y_{1}-y_{2}|$$

and

$$\left|\sigma'(y_{1})-\sigma'(y_{2})\right|\leq L_{2}|y_{1}-y_{2}|.$$

Assumption 2.1 indicates that our results below do not cover the popular ReLU activation function. Instead, we primarily focus on twice-differentiable activation functions (such as Sigmoid, ELU, GeLU, etc.), which are smooth approximations of the ReLU function and are frequently utilized in practical problems (Devlin et al., 2018; Godfrey, 2019). Such a setup aligns with (Liu et al., 2020b), and provides an $\mathcal{O}(m^{-\frac{1}{2}})$-smoothness property for the neural Q-function.
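As a quick numerical sanity check of Assumption 2.1, the sketch below estimates the two constants for the logistic sigmoid and contrasts it with ReLU. For the sigmoid, $\sigma'=\sigma(1-\sigma)\leq 1/4$ and $|\sigma''|=|\sigma(1-\sigma)(1-2\sigma)|\leq 1/(6\sqrt{3})\approx 0.0962$, so $L_{1}=1/4$ and $L_{2}=1/(6\sqrt{3})$ suffice. The grid range, grid size, and function names are illustrative assumptions, and a finite grid only gives an estimate, not a proof.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def d_sigmoid(y):
    """Derivative of the sigmoid: sigma'(y) = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


def relu_grad(y):
    """Derivative of ReLU away from 0 (value at 0 set to 0 by convention)."""
    return 1.0 if y > 0 else 0.0


def max_slope(f, lo=-6.0, hi=6.0, n=12001):
    """Largest difference quotient of f over adjacent points of a uniform grid."""
    step = (hi - lo) / (n - 1)
    ys = [lo + i * step for i in range(n)]
    return max(abs(f(b) - f(a)) / (b - a) for a, b in zip(ys, ys[1:]))


if __name__ == "__main__":
    print("sigmoid L1 estimate:", max_slope(sigmoid))
    print("sigmoid L2 estimate:", max_slope(d_sigmoid))
    # The ReLU derivative jumps at 0, so its quotient grows like 1/step.
    print("ReLU 'L2' estimate:", max_slope(relu_grad))
```

Refining the grid leaves the sigmoid estimates essentially unchanged, while the ReLU estimate grows without bound, which is why no finite $L_{2}$ exists for ReLU.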

Let $\boldsymbol{\theta}^{0}=\left(\mathrm{Vec}(\boldsymbol{W}_{1}^{0});\cdots;\mathrm{Vec}(\boldsymbol{W}_{L}^{0})\right)$ be the initial solution. For each $l$, we initialize the weights of $\boldsymbol{W}_{l}^{0}$ element-wise from a normal distribution $\mathcal{N}(0,1)$, and each element of $\boldsymbol{b}$ is drawn uniformly from $\{-1,+1\}$. The parameter $\boldsymbol{b}$ will not be optimized during training. For regularity purposes, we would like to restrict the iterates to a bounded set around $\boldsymbol{\theta}^{0}$, which is defined as

$$S_{\omega}:=\left\{\boldsymbol{\theta}=\left(\mathrm{Vec}(\boldsymbol{W}_{1});\cdots;\mathrm{Vec}(\boldsymbol{W}_{L})\right):\ \|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|_{2}\leq\omega,\ 1\leq l\leq L\right\}.$$

In each iteration $t$, the neural TD learning algorithm obtains a state-action-reward-transition tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ and computes the TD error by

$$\Delta_t = Q(\boldsymbol{x}_t; \boldsymbol{\theta}^t) - \Big(r_t + \gamma Q(\boldsymbol{x}_{t+1}; \boldsymbol{\theta}^t)\Big) \tag{5}$$

with $\boldsymbol{x}_t = \phi(s_t, a_t)$ and $\boldsymbol{x}_{t+1} = \phi(s_{t+1}, a_{t+1})$. Then a projected stochastic semi-gradient step is performed to update the weight matrices:

$$\boldsymbol{\theta}^{t+1} = \Pi_{S_\omega}\Big(\boldsymbol{\theta}^t - \eta_t \boldsymbol{g}(\boldsymbol{\theta}^t)\Big) \tag{6}$$

with

$$\boldsymbol{g}(\boldsymbol{\theta}^t) = \Delta_t \cdot \nabla_{\boldsymbol{\theta}} Q(\boldsymbol{x}_t; \boldsymbol{\theta}^t).$$

We formally describe the neural TD learning method in Algorithm 1.

Algorithm 1 Neural Temporal Difference Learning with Markovian Sampling
  Input: A learning policy $\pi$, a discount factor $\gamma \in (0,1)$, a sequence of learning rates $\{\eta_t\}_{t \geq 0}$, a maximum iteration number $T$, a projection radius $\omega > 0$, a Q network with architecture (4).
  Initialization: Generate each entry of $\boldsymbol{W}_l^0$ independently from $\mathcal{N}(0,1)$, for $l = 1, 2, \cdots, L$, and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$. Generate $s_0 \sim \mu$, $a_0 \sim \pi(\cdot|s_0)$.
  for $t = 0, 1, \cdots, T-1$ do
     Sample $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ from the learning policy $\pi$ with $a_{t+1} \sim \pi(\cdot|s_{t+1})$.
     Compute the TD error $\Delta_t$ by (5).
     Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (6).
  end for
  Output: $\boldsymbol{\theta}^T$.
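As an illustration only, the steps of Algorithm 1 can be sketched in a few lines of NumPy. The network below is a hypothetical one-hidden-layer stand-in, $Q(\boldsymbol{x};\boldsymbol{W}) = m^{-1/2}\,\boldsymbol{b}^\top\sigma(\boldsymbol{W}\boldsymbol{x})$ with a sigmoid activation, not the $L$-layer architecture (4). The toy problem, the uniform initial state-action pair, and all function names (`q_value`, `q_grad`, `project`, `neural_td`) are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def q_value(W, b, x):
    """Toy network Q(x; W) = b^T sigmoid(W x) / sqrt(m); b stays fixed."""
    m = W.shape[0]
    return b @ sigmoid(W @ x) / np.sqrt(m)

def q_grad(W, b, x):
    """Gradient of q_value with respect to W (same shape as W)."""
    m = W.shape[0]
    s = sigmoid(W @ x)
    return np.outer(b * s * (1.0 - s), x) / np.sqrt(m)

def project(W, W0, omega):
    """Euclidean projection onto the ball of radius omega around W0."""
    diff = W - W0
    dist = np.linalg.norm(diff)
    return W if dist <= omega else W0 + (omega / dist) * diff

def neural_td(features, P, rewards, gamma, eta, omega, T, m=64):
    """Projected semi-gradient TD along one Markovian trajectory.

    features[i] is the feature vector of state-action pair i, and
    P[i, j] is the probability of moving from pair i to pair j under
    the (hypothetical) learning policy.
    """
    n_pairs, d = features.shape
    W0 = rng.standard_normal((m, d))        # N(0, 1) entries
    b = rng.choice([-1.0, 1.0], size=m)     # fixed output weights
    W = W0.copy()
    i = rng.integers(n_pairs)               # initial pair (uniform, an assumption)
    for _ in range(T):
        j = rng.choice(n_pairs, p=P[i])     # next pair on the same trajectory
        x, x_next = features[i], features[j]
        # TD error (5): both Q terms use the current parameters.
        delta = q_value(W, b, x) - (rewards[i] + gamma * q_value(W, b, x_next))
        # Semi-gradient step (6): differentiate only through Q(x_t; theta).
        W = project(W - eta * delta * q_grad(W, b, x), W0, omega)
        i = j
    return W, W0, b

# Tiny hypothetical problem: 4 state-action pairs with unit-norm features.
features = rng.standard_normal((4, 3))
features /= np.linalg.norm(features, axis=1, keepdims=True)
P = np.full((4, 4), 0.25)
rewards = np.array([0.0, 1.0, 0.0, 0.5])
W, W0, b = neural_td(features, P, rewards, gamma=0.9, eta=0.05, omega=5.0, T=500)
```

Because every update ends with the projection, the returned iterate always stays within distance `omega` of the initialization, mirroring the constraint $\boldsymbol{\theta}^t \in S_\omega$.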

One remark is that, under the non-i.i.d. Markovian sampling setting, the agent can only generate a trajectory of samples following a given learning policy $\pi$. This setting is very common in offline RL (Wu et al., 2019; Levine et al., 2020; Kostrikov et al., 2021), where the data trajectories are generated by some learning policy.

In later sections, we will revisit Algorithm 1, design a novel subspace analysis technique for it, and achieve an improved sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1})$. Moreover, by replacing the TD error (5) induced by the Bellman operator with the TD error induced by the Bellman optimality operator, $\Delta_t = Q(\boldsymbol{x}_t; \boldsymbol{\theta}^t) - \big(r_t + \gamma \max_{b \in \mathcal{A}} Q(s', b; \boldsymbol{\theta}^t)\big)$, Algorithm 1 reduces to the neural Q-learning method for finding the optimal state-action value $Q^*$. Our analysis for neural TD learning can be extended analogously to neural Q-learning and yields the same $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity.

3 Convergence of Neural Temporal Difference Learning

3.1 Basic Settings and Assumptions

To analyze Algorithm 1, let us first define the local linearization function class of the multi-layer Q network (4) at the random initialization $\boldsymbol{\theta}^0$:

$$\mathcal{F}_{\omega,m} := \left\{\widehat{Q}(\cdot\,; \boldsymbol{\theta}) = Q(\cdot\,; \boldsymbol{\theta}^0) + \left\langle \nabla_{\boldsymbol{\theta}} Q(\cdot\,; \boldsymbol{\theta}^0), \boldsymbol{\theta} - \boldsymbol{\theta}^0 \right\rangle\right\} \tag{7}$$

for any $\boldsymbol{\theta} \in S_\omega$. Consider the MSPBE minimization problem:

$$\min_{\boldsymbol{\theta} \in S_\omega} \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left(Q(\boldsymbol{x}; \boldsymbol{\theta}) - \Pi_{\mathcal{F}_{\omega,m}} \mathcal{T}^\pi Q(\boldsymbol{x}; \boldsymbol{\theta})\right)^2\right], \tag{8}$$

where $\mu$ is the initial state distribution, $\pi$ is the learning policy, and $\mathbb{P}$ is the transition kernel. The expectation $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ is taken over $s \sim \mu$, $a \sim \pi(\cdot|s)$, and $s' \sim \mathbb{P}(\cdot|s,a)$, $a' \sim \pi(\cdot|s')$ in $\mathcal{T}^\pi$. Define the set $\Xi_\beta$ as

$$\Xi_\beta := \left\{\boldsymbol{\theta} \in S_\beta : \widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}) = \Pi_{\mathcal{F}_{\omega,m}} \mathcal{T}^\pi \widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}),\ \forall \boldsymbol{x} = \phi(s,a)\right\}. \tag{9}$$

Then the set $\Xi_\omega$ consists of the points $\boldsymbol{\theta}$ for which $\widehat{Q}(\cdot\,; \boldsymbol{\theta})$ is a fixed point of the projected Bellman operator $\Pi_{\mathcal{F}_{\omega,m}} \mathcal{T}^\pi$ for the problem (8). By Section 4.1 in (Cai et al., 2023), the fixed point of $\Pi_{\mathcal{F}_{\omega,m}} \mathcal{T}^\pi$ is unique for $\boldsymbol{\theta} \in S_\omega$. Therefore, the following relationship holds:

$$\widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}) = \widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}') \tag{10}$$

for all $\boldsymbol{x} = \phi(s,a)$ with $(s,a) \in \mathcal{S} \times \mathcal{A}$, all $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \Xi_\beta$, and all $\beta \geq \omega$. Moreover, it is also shown that a point $\boldsymbol{\theta}^* \in \Xi_\omega$ if and only if it satisfies the stationarity condition:

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}(\boldsymbol{x}, \boldsymbol{x}'; \boldsymbol{\theta}^*) \big\langle \nabla_{\boldsymbol{\theta}} \widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}^*), \boldsymbol{\theta} - \boldsymbol{\theta}^* \big\rangle\right] \geq 0, \tag{11}$$

where $\widehat{Q}(\cdot\,; \boldsymbol{\theta}^*) \in \mathcal{F}_{\omega,m}$ is a local linearization provided by (7) and $\widehat{\Delta}$ is defined as

$$\widehat{\Delta}(\boldsymbol{x}, \boldsymbol{x}'; \boldsymbol{\theta}^*) = \widehat{Q}(\boldsymbol{x}; \boldsymbol{\theta}^*) - \Big(r(s,a) + \gamma \widehat{Q}(\boldsymbol{x}'; \boldsymbol{\theta}^*)\Big).$$

Hence one may analyze the gap between $Q^\pi(\cdot)$ and $Q(\cdot\,; \boldsymbol{\theta}^T)$ by first connecting it to $\widehat{Q}(\cdot\,; \boldsymbol{\theta}^*)$. Based on this, Cai et al. (2023) derived an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for the neural TD method. Now we define

$$\Sigma_\pi = \mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}} Q(\boldsymbol{x}; \boldsymbol{\theta}^0) \nabla_{\boldsymbol{\theta}} Q(\boldsymbol{x}; \boldsymbol{\theta}^0)^\top\right]. \tag{12}$$

It is worth noting that the matrix $\Sigma_\pi$ only depends on $\pi$ and $\boldsymbol{\theta}^0$. In their original assumption on (12), Zou et al. (2019) and Xu & Gu (2020) in fact assumed positive definiteness ($\succ 0$) of $\Sigma_\pi$, which can be viewed as a generalized version of the positive definite feature covariance matrix assumption in the analysis of linear TD and linear Q-learning; see e.g. (Zou et al., 2019). In this paper, however, we adopt the following weaker regularity assumption.

Assumption 3.1.

Let $\overline{\sigma_{\min}}(\Sigma_\pi)$ denote the minimum non-zero singular value of the matrix $\Sigma_\pi$. Then there exist constants $\lambda_0, m^* > 0$ such that $\overline{\sigma_{\min}}(\Sigma_\pi) \geq \lambda_0$ as long as the Q network width satisfies $m \geq m^*$.

For neural Q function approximation, a sufficient but not necessary condition for Assumption 3.1 can be obtained by exploiting the theory of over-parameterized neural networks. Roughly speaking, for a finite MDP with an $L$-layer ReLU Q network, if the feature map satisfies $\phi(s,a) \nparallel \phi(s',a')$ for all $(s,a) \neq (s',a')$, the results of (Jacot et al., 2018; Allen-Zhu et al., 2019a, b; Cao & Gu, 2019, 2020) suggest that there exist $\lambda', m^* > 0$ such that, with high probability, $\mathrm{Gram}(\boldsymbol{\theta}^0) \succ \lambda' \cdot \mathbf{I}$ for networks with width $m \geq m^*$. Here $\mathrm{Gram}(\boldsymbol{\theta}^0)$ stands for the Gram matrix of the network at the initialization $\boldsymbol{\theta}^0$. A lower bound on $\overline{\sigma_{\min}}(\Sigma_\pi)$ can then be constructed with $\lambda'$; see Remark D.6 in Appendix D.
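To see why a bound on the minimum non-zero singular value is weaker than positive definiteness, consider the following small numerical sketch. It builds a matrix of the form (12) from a handful of hypothetical gradient vectors. The gradients, the sampling weights, and the helper `min_nonzero_singular_value` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def min_nonzero_singular_value(sigma, tol=1e-10):
    """Smallest singular value of sigma above tol * (largest singular value).

    Assumes sigma is not the zero matrix.
    """
    s = np.linalg.svd(sigma, compute_uv=False)   # sorted in descending order
    nonzero = s[s > tol * s[0]]
    return nonzero[-1]

# Hypothetical gradients for 3 state-action pairs in a 5-dimensional parameter space.
rng = np.random.default_rng(1)
grads = rng.standard_normal((3, 5))
weights = np.array([0.5, 0.3, 0.2])      # stand-in for the sampling distribution

# Sigma = E[g g^T]; its rank is at most 3 < 5, so it cannot be positive definite.
sigma = (grads * weights[:, None]).T @ grads

rank = np.linalg.matrix_rank(sigma)
lam = min_nonzero_singular_value(sigma)
```

In this toy setting the parameter dimension exceeds the number of state-action pairs, so the smallest eigenvalue of `sigma` is zero, while `lam` remains strictly positive. This is the situation that Assumption 3.1 is meant to accommodate.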

Finally, to facilitate the sample complexity analysis under the non-i.i.d. Markovian sampling setting, let us make the following assumption on the fast mixing rate of the MDP sample trajectories, which is widely adopted in the related analysis (Zou et al., 2019; Xu & Gu, 2020; Cai et al., 2023).

Assumption 3.2.

We assume that the Markov chain $\{s_t\}_{t=0,1,\ldots}$ induced by the learning policy $\pi$ and the transition kernel $\mathbb{P}$ is uniformly ergodic with invariant measure $\mathbb{P}^\pi$. Furthermore, we assume that there are constants $\kappa > 0$ and $\rho \in (0,1)$ such that

$$\sup_{s \in \mathcal{S}} d_{TV}\left(\mathbb{P}(s_t \in \cdot \mid s_0 = s), \mathbb{P}^\pi\right) \leq \kappa \rho^t$$

for all $t \geq 0$.
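For intuition, the geometric decay in Assumption 3.2 can be observed directly on a small finite chain. The 3-state transition matrix below is a made-up example, and `worst_case_tv` is a hypothetical helper. The assumption itself covers general state spaces, which this sketch does not.

```python
import numpy as np

# Hypothetical 3-state transition matrix induced by some fixed policy.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()

def worst_case_tv(P, stationary, t):
    """Supremum over start states of the total variation distance after t steps."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - stationary).sum(axis=1).max()

tv = [worst_case_tv(P, stationary, t) for t in range(8)]
# Successive ratios that stay bounded below 1 are consistent with geometric decay.
ratios = [tv[t + 1] / tv[t] for t in range(6)]
```

Any constant $\rho$ that upper-bounds these ratios, together with $\kappa = 1$, would satisfy the displayed inequality for this toy chain.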

Without loss of generality, we also make the following technical assumption, which, unlike Assumptions 3.1 and 3.2, is not fundamental.

Assumption 3.3.

We assume the initial state distribution $\mu$ to be the stationary state distribution under policy $\pi$.

This assumption is in fact very natural. The stationarity of $\mu$ can always be guaranteed by discarding the first $\tilde{\mathcal{O}}(t_{\mathrm{mix}})$ samples, and Assumption 3.2 indicates that the mixing time satisfies $t_{\mathrm{mix}} = \tilde{\mathcal{O}}(1)$. The assumption guarantees that the operator $\mathcal{T}^\pi$ is $\gamma$-contractive w.r.t. $\|\cdot\|_\mu$ in policy evaluation. Similar assumptions are made in (Bhandari et al., 2018; Cai et al., 2023).
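The contraction property can be verified in a few lines. The derivation below is a sketch under two assumptions that are not stated in this section: the weighted norm is taken to be $\|Q\|_\mu^2 = \mathbb{E}_{s\sim\mu,\,a\sim\pi(\cdot|s)}[Q(s,a)^2]$, and the Bellman operator is $(\mathcal{T}^\pi Q)(s,a) = \mathbb{E}[r(s,a) + \gamma Q(s',a') \mid s,a]$ with $s'\sim\mathbb{P}(\cdot|s,a)$ and $a'\sim\pi(\cdot|s')$. The inequality is Jensen's inequality, and the second equality uses that $s'$ again has distribution $\mu$ when $\mu$ is stationary.

```latex
\begin{align*}
\|\mathcal{T}^{\pi}Q_1 - \mathcal{T}^{\pi}Q_2\|_{\mu}^2
&= \gamma^2\, \mathbb{E}_{s\sim\mu,\, a\sim\pi(\cdot|s)}
   \Big[\big(\mathbb{E}\big[Q_1(s',a') - Q_2(s',a') \,\big|\, s,a\big]\big)^2\Big] \\
&\leq \gamma^2\, \mathbb{E}_{s\sim\mu,\, a\sim\pi(\cdot|s)}
   \Big[\mathbb{E}\big[\big(Q_1(s',a') - Q_2(s',a')\big)^2 \,\big|\, s,a\big]\Big] \\
&= \gamma^2\, \mathbb{E}_{s'\sim\mu,\, a'\sim\pi(\cdot|s')}
   \Big[\big(Q_1(s',a') - Q_2(s',a')\big)^2\Big]
 = \gamma^2\, \|Q_1 - Q_2\|_{\mu}^2 .
\end{align*}
```

Taking square roots gives $\|\mathcal{T}^{\pi}Q_1 - \mathcal{T}^{\pi}Q_2\|_{\mu} \leq \gamma \|Q_1 - Q_2\|_{\mu}$.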

3.2 An Improved Complexity of Neural TD Learning

To derive the $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity, we rely on the following key observation on subspace decomposition, which goes beyond the existing analysis framework.

Proposition 3.4.

Let $\mathcal{R}(\Sigma_\pi)$ and $\mathcal{K}(\Sigma_\pi)$ denote the range space and kernel space of the matrix $\Sigma_\pi$, respectively. Then for any parameter $\boldsymbol{\theta} \in S_\omega$, there exists $\boldsymbol{\theta}_*$ such that

$$\boldsymbol{\theta}_* \in \Xi_{2\omega} \qquad \text{and} \qquad \boldsymbol{\theta} - \boldsymbol{\theta}_* \in \mathcal{R}(\Sigma_\pi),$$

which also implies that the projections of $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_*$ onto the subspace $\mathcal{K}(\Sigma_\pi)$ are identical.

Based on this argument, for the iteration sequence $\{\boldsymbol{\theta}^t\}_{t \geq 0}$ generated by Algorithm 1, there exists a sequence $\{\boldsymbol{\theta}^t_*\}_{t \geq 0} \subseteq \Xi_{2\omega}$ such that $\{\boldsymbol{\theta}^t - \boldsymbol{\theta}^t_*\}_{t \geq 0} \subseteq \mathcal{R}(\Sigma_\pi)$. Therefore, unlike the existing works that analyze $\|\boldsymbol{\theta}^t - \boldsymbol{\theta}^*\|^2$ for some $\boldsymbol{\theta}^* \in \Xi_\omega$, cf. (Cai et al., 2023; Xu & Gu, 2020), we will prove a much faster convergence in $\|\boldsymbol{\theta}^t - \boldsymbol{\theta}^t_*\|^2$. Combined with (10), this further indicates the improved sample complexity in this paper. The proof of this proposition is presented as follows.

Proof.

For ease of discussion, let us denote the dimension of the weight parameter $\boldsymbol{\theta}$ by $n$, so that $\Sigma_\pi \in \mathbb{R}^{n \times n}$ and $\boldsymbol{\theta} \in \mathbb{R}^n$. First, let us fix an arbitrary $\bar{\boldsymbol{\theta}} \in \Xi_\omega$ and decompose it into two orthogonal components:

$$\bar{\boldsymbol{\theta}} = \bar{\boldsymbol{\theta}}_\parallel + \bar{\boldsymbol{\theta}}_\bot \quad \text{s.t.} \quad \bar{\boldsymbol{\theta}}_\parallel \in \mathcal{R}(\Sigma_\pi) \text{ and } \bar{\boldsymbol{\theta}}_\bot \in \mathcal{K}(\Sigma_\pi).$$

Similarly, we can decompose the currently considered vector $\boldsymbol{\theta}$ as

$$\boldsymbol{\theta} = \boldsymbol{\theta}_\parallel + \boldsymbol{\theta}_\bot \quad \text{s.t.} \quad \boldsymbol{\theta}_\parallel \in \mathcal{R}(\Sigma_\pi) \text{ and } \boldsymbol{\theta}_\bot \in \mathcal{K}(\Sigma_\pi).$$

Note that an arbitrary vector $\boldsymbol{v}\in\mathbb{R}^{n}$ being in the kernel space of $\Sigma_{\pi}$ means that $\Sigma_{\pi}\boldsymbol{v}=0$, which further indicates that

$$\begin{aligned}
0 &= \boldsymbol{v}^{\top}\Sigma_{\pi}\boldsymbol{v} \\
&= \boldsymbol{v}^{\top}\mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})^{\top}\right]\boldsymbol{v} \\
&= \mathbb{E}_{\mu,\pi}\left[\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{v}\right\rangle^{2}\right].
\end{aligned}$$
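The identity $\boldsymbol{v}^{\top}\Sigma_{\pi}\boldsymbol{v}=\mathbb{E}[\langle\nabla_{\boldsymbol{\theta}}Q,\boldsymbol{v}\rangle^{2}]$ can be checked numerically on a toy example. The sketch below is only an illustration: the three gradient vectors are hypothetical stand-ins for $\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0})$ at three equally likely inputs, and the helper names (`second_moment`, `matvec`) are not from the paper.

```python
# Toy check of v^T Sigma v = E[<grad, v>^2] with hypothetical gradient samples.
# The vectors in `grads` are illustrative stand-ins, not values from the paper.

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def second_moment(grads):
    """Empirical Sigma = (1/N) * sum_i g_i g_i^T, returned as a nested list."""
    n, d = len(grads), len(grads[0])
    return [[sum(g[i] * g[j] for g in grads) / n for j in range(d)]
            for i in range(d)]

def matvec(mat, v):
    return [dot(row, v) for row in mat]

grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
sigma = second_moment(grads)

v = [0.0, 0.0, 1.0]   # orthogonal to every sample, hence in the kernel of sigma
w = [1.0, 1.0, 0.0]   # a direction that is not in the kernel

for name, vec in (("v", v), ("w", w)):
    quad_form = dot(vec, matvec(sigma, vec))                       # vec^T Sigma vec
    mean_sq = sum(dot(g, vec) ** 2 for g in grads) / len(grads)    # E[<g, vec>^2]
    print(name, quad_form, mean_sq, [dot(g, vec) for g in grads])
```

For the kernel direction `v`, every inner product with a gradient sample vanishes, which is the finite-sample analogue of the almost-sure statement derived next.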

Therefore, under the measure $(s,a)\sim\mu\times\pi$, we have

$$\boldsymbol{v}\in\mathcal{K}(\Sigma_{\pi})\quad\Longrightarrow\quad\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{v}\right\rangle=0\quad a.s. \tag{13}$$

where $a.s.$ stands for almost surely. Therefore, defining $\boldsymbol{\theta}_{*}=\bar{\boldsymbol{\theta}}_{\parallel}+\boldsymbol{\theta}_{\bot}$, we can check the stationarity condition (11) for $\boldsymbol{\theta}_{*}$ by establishing:

$$\begin{aligned}
&\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{*}\right),\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}\big\rangle\right] \\
&=\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right]\geq 0.
\end{aligned} \tag{14}$$

The proof of (14) is lengthy and is therefore deferred to Appendix A.1. As a result, we have $\boldsymbol{\theta}_{*}\in\Xi_{2\omega}$ and, since $\boldsymbol{\theta}$ and $\boldsymbol{\theta}_{*}$ share the same kernel component $\boldsymbol{\theta}_{\bot}$, $\boldsymbol{\theta}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}_{\parallel}-\bar{\boldsymbol{\theta}}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})$. Note that for any $\boldsymbol{\theta}\in S_{\omega}$,

$$\begin{aligned}
\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{0}\| &\leq \|\bar{\boldsymbol{\theta}}_{\parallel}-\boldsymbol{\theta}^{0}_{\parallel}\|+\|\boldsymbol{\theta}_{\bot}-\boldsymbol{\theta}^{0}_{\bot}\| \\
&\leq \|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|\leq 2\omega,
\end{aligned}$$

which completes the proof. ∎
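The construction used in this proof can be illustrated on a small example. The sketch below is a toy instance under stated assumptions: the orthonormal basis `range_basis` of $\mathcal{R}(\Sigma_{\pi})$ and the vectors `theta0`, `theta_bar`, `theta` are hypothetical, chosen only to show that $\boldsymbol{\theta}_{*}=\bar{\boldsymbol{\theta}}_{\parallel}+\boldsymbol{\theta}_{\bot}$ differs from $\boldsymbol{\theta}$ only within the range space and that the triangle-inequality bound on $\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{0}\|$ holds.

```python
import math

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def add(u, w):
    return [a + b for a, b in zip(u, w)]

def sub(u, w):
    return [a - b for a, b in zip(u, w)]

def norm(u):
    return math.sqrt(dot(u, u))

def project(basis, v):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for b in basis:
        coeff = dot(b, v)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Hypothetical orthonormal basis of R(Sigma_pi) inside R^3 (an assumption).
s = 1.0 / math.sqrt(2.0)
range_basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]

theta0 = [0.2, -0.1, 0.3]     # stand-in for the initialization theta^0
theta_bar = [0.5, 0.1, 0.4]   # stand-in for a stationary point
theta = [0.0, 0.4, 0.1]       # stand-in for the current iterate

theta_bar_par = project(range_basis, theta_bar)        # range part of theta_bar
theta_perp = sub(theta, project(range_basis, theta))   # kernel part of theta
theta_star = add(theta_bar_par, theta_perp)

diff = sub(theta, theta_star)
# The residual is (numerically) zero exactly when diff lies in the range space.
residual = norm(sub(diff, project(range_basis, diff)))
lhs = norm(sub(theta_star, theta0))
rhs = norm(sub(theta_bar, theta0)) + norm(sub(theta, theta0))
print(residual, lhs, rhs)
```

The bound `lhs <= rhs` relies only on the fact that orthogonal projections do not increase norms, which is the step used in the last display of the proof.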

Following basic linear algebra analysis, we also have the following proposition.

Proposition 3.5.

Under Assumption 3.1, suppose the adopted Q network is sufficiently wide so that $m\geq m^{*}$, then for any $\boldsymbol{\theta}\in\mathcal{R}(\Sigma_{\pi})$, we have $\boldsymbol{\theta}^{\top}\Sigma_{\pi}\boldsymbol{\theta}\geq\lambda_{0}\|\boldsymbol{\theta}\|^{2}_{2}$.
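The linear-algebra step behind this proposition can be sketched as follows. This is an illustration under an assumption of ours, namely that for $m\geq m^{*}$ the constant $\lambda_{0}$ lower-bounds the smallest positive eigenvalue of $\Sigma_{\pi}$ (Assumption 3.1 is stated outside this section). The eigenpairs $(\lambda_{i},\boldsymbol{u}_{i})$, the rank $r$ and the coefficients $c_{i}$ are notation introduced here only for this sketch.

```latex
\begin{align*}
\Sigma_{\pi} &= \sum_{i=1}^{r}\lambda_{i}\boldsymbol{u}_{i}\boldsymbol{u}_{i}^{\top},
\qquad \lambda_{1}\geq\cdots\geq\lambda_{r}>0,
\qquad \mathcal{R}(\Sigma_{\pi})=\operatorname{span}\{\boldsymbol{u}_{1},\dots,\boldsymbol{u}_{r}\},\\
\boldsymbol{\theta} &= \sum_{i=1}^{r}c_{i}\boldsymbol{u}_{i}
\quad\Longrightarrow\quad
\boldsymbol{\theta}^{\top}\Sigma_{\pi}\boldsymbol{\theta}
=\sum_{i=1}^{r}\lambda_{i}c_{i}^{2}
\geq\lambda_{r}\sum_{i=1}^{r}c_{i}^{2}
=\lambda_{r}\|\boldsymbol{\theta}\|_{2}^{2}
\geq\lambda_{0}\|\boldsymbol{\theta}\|_{2}^{2}.
\end{align*}
```

The restriction to $\mathcal{R}(\Sigma_{\pi})$ is what excludes the zero eigenvalues; for a vector with a nonzero kernel component the inequality can fail.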

Proposition 3.4 indicates that variations in the local linearization of the Q-function values depend solely on variations of the parameters within the subspace $\mathcal{R}(\Sigma_{\pi})$. Meanwhile, Proposition 3.5 indicates that this local linearization is non-singular within $\mathcal{R}(\Sigma_{\pi})$. Based on these observations, we first establish the fast convergence $\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}=\mathcal{O}(1/T)$ and then show that $\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big)^{2}\right]=\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T}_{*})\big)^{2}\right]\leq\mathcal{O}(1/T)$ for any $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$. We summarize this result in Theorem 3.6 and present its proof in Appendix A.2.

Theorem 3.6.

Suppose Assumptions 3.1, 3.2 and 3.3 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2(1-\gamma)\lambda_{0}(t+1)}$. If the feature map satisfies $\|\phi(s,a)\|=1$ for each state-action pair $(s,a)$ and the network width satisfies $m\geq m^{*}$, then the output $\boldsymbol{\theta}^{T}$ of Algorithm 1 satisfies

$$\begin{aligned}
&\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big)^{2}\mid\boldsymbol{\theta}^{0}\right] \\
&\leq \frac{\widetilde{C}_{3}(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{(1-\gamma)\lambda_{0}}\cdot\sqrt{\log(T/\delta)} \\
&\quad+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T},
\end{aligned}$$

with probability at least $1-2\delta-2L\exp\!\big(\!-\widetilde{C}_{2}m\big)$, where $\tau^{*}$ is the mixing time of the Markov chain in Assumption 3.2, and $\widetilde{C}_{1},\cdots,\widetilde{C}_{5}>0$ are universal constants.
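To get a feel for how the three terms of this bound behave, the following sketch evaluates the step size $\eta_{t}$ and each term separately. It is only an illustration: the universal constants are not specified in the theorem, so they are set to 1 here as placeholders, and the function names and the numerical values of $\gamma$, $\lambda_{0}$, $\tau^{*}$, $\delta$ and $m$ are hypothetical.

```python
import math

def step_size(t, gamma, lam0):
    """eta_t = 1 / (2 (1 - gamma) lam0 (t + 1)), as stated in Theorem 3.6."""
    return 1.0 / (2.0 * (1.0 - gamma) * lam0 * (t + 1))

def theorem_3_6_terms(T, m, gamma, lam0, tau, delta, c3=1.0, c4=1.0, c5=1.0):
    """Return the three right-hand-side terms of the bound.

    c3, c4, c5 are placeholders for the unspecified universal constants.
    """
    scale = (1.0 - gamma) ** 2 * lam0 ** 2 * T
    optimization = c3 * (math.log(T) + 1.0) / scale
    width = c4 * m ** -0.5 / ((1.0 - gamma) * lam0) * math.sqrt(math.log(T / delta))
    mixing = c5 * tau * (math.log(T / delta) + 1.0) * math.log(T) / scale
    return optimization, width, mixing

if __name__ == "__main__":
    for T in (10 ** 3, 10 ** 6):
        terms = theorem_3_6_terms(T, m=1024, gamma=0.9, lam0=0.5, tau=10.0, delta=0.05)
        print(T, step_size(T - 1, 0.9, 0.5), terms)
```

The first and third terms decay like $1/T$ up to logarithmic factors, whereas the middle term does not decay with $T$ and is controlled only by the width $m$.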

Let $Q^{\pi}$ be the true state-action value function that satisfies the Bellman equation $Q^{\pi}=\mathcal{T}^{\pi}Q^{\pi}$. Then, based on the convergence of the local linearization in Theorem 3.6, we establish the global convergence of neural temporal difference learning in Theorem 3.7.

Theorem 3.7.

Suppose the conditions in Theorem 3.6 hold. Then the output of Algorithm 1 satisfies

$$\begin{aligned}
&\mathbb{E}\Big[\big(Q(\phi(s,a);\boldsymbol{\theta}^{T})-Q^{\pi}(s,a)\big)^{2}\mid\boldsymbol{\theta}^{0}\Big] \\
&\leq \frac{3\mathbb{E}\left[\left(Q^{\pi}(s,a)-\Pi_{\mathcal{F}_{\omega,m}}Q^{\pi}(s,a)\right)^{2}\right]}{(1-\gamma)^{2}}+\widetilde{C}_{6}m^{-1} \\
&\quad+\frac{\widetilde{C}_{7}(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{8}m^{-1/2}}{(1-\gamma)\lambda_{0}}\sqrt{\log(T/\delta)} \\
&\quad+\frac{\widetilde{C}_{9}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T}
\end{aligned} \tag{15}$$

with probability at least $1-2\delta-2L\exp\!\big(\!-\widetilde{C}_{2}m\big)$, where $\widetilde{C}_{6},\cdots,\widetilde{C}_{9}>0$ are universal constants.

Let $\epsilon_{\mathcal{F}}:=\frac{3}{(1-\gamma)^{2}}\mathbb{E}\big[\big(Q^{\pi}(s,a)-\Pi_{\mathcal{F}_{\omega,m}}Q^{\pi}(s,a)\big)^{2}\big]$ be the optimal approximation error of the function class $\mathcal{F}_{\omega,m}$. Then Theorem 3.7 demonstrates that, under suitable parameter choices, the neural TD learning method attains an error bound of $\mathcal{O}(\epsilon_{\mathcal{F}}+\epsilon+m^{-\frac{1}{2}})$ within $\tilde{\mathcal{O}}(\epsilon^{-1})$ samples. Existing works, including Cai et al. (2023), Xu & Gu (2020), Tian et al. (2022), and Cayci et al. (2023), achieve an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity, and Sun et al. (2022) achieves $\mathcal{O}\big(\epsilon^{-\frac{2}{2-a}}\big)$, $a\in(0,1]$, under additional assumptions.
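The $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample count can be read off from (15) by isolating the terms that depend on $T$. The following is a sketch of that bookkeeping, treating $\delta$, $\tau^{*}$, $\gamma$ and $\lambda_{0}$ as fixed; the constant $C:=\max\{\widetilde{C}_{7},\widetilde{C}_{9}\}$ is introduced here only for illustration.

```latex
\begin{align*}
\frac{\widetilde{C}_{7}(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}T}
+\frac{\widetilde{C}_{9}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{(1-\gamma)^{2}\lambda_{0}^{2}T}
&\leq \frac{C\,(1+\tau^{*})\left(\log(T/\delta)+1\right)(\log T+1)}{(1-\gamma)^{2}\lambda_{0}^{2}\,T},\\
\text{so this part is at most }\epsilon\text{ once}\quad
T &= \tilde{\mathcal{O}}\!\left(\frac{1+\tau^{*}}{(1-\gamma)^{2}\lambda_{0}^{2}\,\epsilon}\right).
\end{align*}
```

The remaining terms in (15), namely $\epsilon_{\mathcal{F}}$, $\widetilde{C}_{6}m^{-1}$ and the $m^{-1/2}$ term, do not shrink with $T$; they account for the $\epsilon_{\mathcal{F}}$ and $m^{-\frac{1}{2}}$ parts of the error bound.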

Following a similar analysis while adopting an additional regularity assumption on the matrix $\Sigma_{\pi}$, one can further extend the above analysis to neural Q-learning by substituting the Bellman operator with the Bellman optimality operator. A similar $\mathcal{O}(\epsilon^{-1})$ sample complexity can still be achieved; the details are relegated to the Appendix for succinctness.

4 Convergence of Minimax Neural Q-Learning

A two-player zero-sum Markov game (Littman, 1994; Bowling & Veloso, 2001; Perolat et al., 2018), as a simple variant of MDP, is defined as a six-tuple $\mathcal{M}=(\mathcal{S},\mathcal{A}_{1},\mathcal{A}_{2},\mathbb{P},r,\gamma)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ are the action spaces of the first and second player, respectively, $\mathbb{P}:\mathcal{S}\times\mathcal{A}_{1}\times\mathcal{A}_{2}\rightarrow\mathcal{P}(\mathcal{S})$ is the transition probability, $r:\mathcal{S}\times\mathcal{A}_{1}\times\mathcal{A}_{2}\rightarrow\mathbb{R}$ is the reward function and $\gamma$ is the discount factor. At time $t$, player 1 and player 2 take actions $a^{1}_{t}\in\mathcal{A}_{1}$ and $a^{2}_{t}\in\mathcal{A}_{2}$ simultaneously. Player 1 obtains the reward $r(s_{t},a^{1}_{t},a^{2}_{t})$, while player 2 obtains $-r(s_{t},a^{1}_{t},a^{2}_{t})$. Each player aims to maximize its own cumulative reward. For a policy pair $(\pi_{1},\pi_{2})$, we can define the state-action value function as follows:

$$Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2})=\mathbb{E}_{\pi_{1},\pi_{2}}\left[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a^{1}_{t},a^{2}_{t})\mid s_{0}=s,\ a^{1}_{0}=a^{1},\ a^{2}_{0}=a^{2}\right],\quad\forall s,a^{1},a^{2}.$$

The optimal state-action value function $Q^{*}$ is defined as

$$\begin{aligned}
Q^{*}(s,a^{1},a^{2}) &= \max_{\pi_{1}}\min_{\pi_{2}}Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2}) \\
&= \min_{\pi_{2}}\max_{\pi_{1}}Q^{\pi_{1},\pi_{2}}(s,a^{1},a^{2}).
\end{aligned}$$

We call $\pi^{*}=\{\pi_{1}^{*},\pi_{2}^{*}\}$ an optimal policy pair if $Q^{*}(s,a^{1},a^{2})=Q^{\pi_{1}^{*},\pi_{2}^{*}}(s,a^{1},a^{2})$. Moreover, the minimax Bellman operator $\mathcal{H}$ for the Markov game is defined as

$$\mathcal{H}Q(s,a^{1},a^{2})=r(s,a^{1},a^{2})+\gamma\,\mathbb{E}\left[\min_{b^{1}}\max_{b^{2}}Q(s^{\prime},b^{1},b^{2})\mid s^{\prime}\sim\mathbb{P}(\cdot\mid s,a^{1},a^{2})\right],\quad\forall s,a^{1},a^{2}.$$
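For a finite game the operator $\mathcal{H}$ can be applied entry by entry. The sketch below does this for a tabular $Q$ stored in nested lists. It is a minimal illustration under stated assumptions: the two-state game, its rewards and its transition probabilities are hypothetical, the function name `minimax_bellman` is not from the paper, and the order of the inner $\min_{b^{1}}\max_{b^{2}}$ is kept exactly as in the displayed formula.

```python
def minimax_bellman(Q, r, P, gamma):
    """Apply the minimax Bellman operator H to a tabular Q.

    Q[s][a1][a2] and r[s][a1][a2] are nested lists of floats, and
    P[s][a1][a2] is a list of next-state probabilities.
    """
    # Stage value of each next state: min over b1 (rows) of max over b2 (columns).
    values = [min(max(row) for row in Q[s]) for s in range(len(Q))]
    return [[[r[s][a1][a2]
              + gamma * sum(p * v for p, v in zip(P[s][a1][a2], values))
              for a2 in range(len(Q[s][a1]))]
             for a1 in range(len(Q[s]))]
            for s in range(len(Q))]

# Hypothetical two-state game with two actions per player.
Q = [[[1.0, -1.0], [-1.0, 1.0]],
     [[0.0, 2.0], [3.0, 1.0]]]
r = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
uniform = [0.5, 0.5]   # every joint action leads to each state with probability 1/2
P = [[[uniform, uniform], [uniform, uniform]] for _ in range(2)]

HQ = minimax_bellman(Q, r, P, gamma=0.5)
print(HQ[0][0][0], HQ[1][1][0])
```

In this toy game the stage values are 1 for state 0 and 2 for state 1, so every entry of `HQ` equals the reward plus $0.5\cdot(0.5\cdot 1+0.5\cdot 2)=0.75$.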

Thus $\mathcal{H}Q^{*}=Q^{*}$. Let $\boldsymbol{x}=\phi(s,a^{1},a^{2})$ be the feature map and let $\pi=\{\pi^{1},\pi^{2}\}$ be a given learning policy for players 1 and 2. Assume that $\{s_{t},a^{1}_{t},a^{2}_{t},r_{t}\}_{t=0}^{T}$ is a sampled trajectory of states, actions and rewards obtained from the environment using policy $\pi$. Let us recall the definition of the local linearization function class $\mathcal{F}_{\omega,m}$ introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:

$$\min_{\boldsymbol{\theta}\in S_{\omega}}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{H}Q(\boldsymbol{x};\boldsymbol{\theta})\right)^{2}\right].$$

To solve this problem, we still adopt the projected stochastic semi-gradient iteration described by (6), that is,

\[
\boldsymbol{\theta}^{t+1}=\Pi_{S_{\omega}}\Big(\boldsymbol{\theta}^{t}-\eta_{t}\,\boldsymbol{g}\left(\boldsymbol{\theta}^{t}\right)\Big), \tag{16}
\]

while redefining the stochastic semi-gradient estimator $\boldsymbol{g}(\boldsymbol{\theta}^{t})$ as

\[
\boldsymbol{g}(\boldsymbol{\theta}^{t})=\Delta\left(s_{t},a^{1}_{t},a^{2}_{t},s_{t+1};\boldsymbol{\theta}^{t}\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}),
\]

where $\boldsymbol{x}_{t}:=\phi(s_{t},a^{1}_{t},a^{2}_{t})$ and

\[
\begin{aligned}
\Delta\left(s_{t},a^{1}_{t},a^{2}_{t},s_{t+1};\boldsymbol{\theta}^{t}\right)={}&Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\Big(r(s_{t},a^{1}_{t},a^{2}_{t})\\
&+\gamma\max_{b^{1}\in\mathcal{A}_{1}}\min_{b^{2}\in\mathcal{A}_{2}}Q(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})\Big).
\end{aligned} \tag{17}
\]
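As an illustration of (16) and (17), the following Python sketch performs one projected semi-gradient step for finite action sets. It is not the implementation used in this paper: the function names, the callables `q`, `grad_q` and `phi`, and the choice of a Euclidean ball around $\boldsymbol{\theta}^{0}$ as the projection set $S_{\omega}$ are assumptions made for the example.

```python
import math


def minimax_td_error(q, phi, theta, s, a1, a2, r, s_next,
                     actions1, actions2, gamma):
    """Delta in (17): Q(x_t) - (r + gamma * max_{b1} min_{b2} Q(phi(s', b1, b2)))."""
    target_value = max(
        min(q(phi(s_next, b1, b2), theta) for b2 in actions2)
        for b1 in actions1
    )
    return q(phi(s, a1, a2), theta) - (r + gamma * target_value)


def project_to_ball(theta, center, radius):
    """Euclidean projection onto {theta : ||theta - center|| <= radius}.

    Assumption: S_omega is modelled here as a single ball around theta^0.
    """
    diff = [t - c for t, c in zip(theta, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(theta)
    scale = radius / norm
    return [c + scale * d for c, d in zip(center, diff)]


def semi_gradient_step(q, grad_q, phi, theta, theta0, omega, eta,
                       transition, actions1, actions2, gamma):
    """One iteration of (16): theta <- Proj(theta - eta * Delta * grad Q(x_t))."""
    s, a1, a2, r, s_next = transition
    delta = minimax_td_error(q, phi, theta, s, a1, a2, r, s_next,
                             actions1, actions2, gamma)
    grad = grad_q(phi(s, a1, a2), theta)
    stepped = [t - eta * delta * g for t, g in zip(theta, grad)]
    return project_to_ball(stepped, theta0, omega)
```

With a linear model $Q(\boldsymbol{x};\boldsymbol{\theta})=\langle\boldsymbol{x},\boldsymbol{\theta}\rangle$ the gradient is simply the feature vector, which makes a single step easy to check by hand.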

Now we redefine the function class $\mathcal{F}_{\omega,m}$ as the collection of all local linearizations of $Q(\boldsymbol{x};\boldsymbol{\theta})$ at the initial point $\boldsymbol{\theta}^{0}$:

\[
\mathcal{F}_{\omega,m}=\left\{\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=Q(\boldsymbol{x};\boldsymbol{\theta}^{0})+\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right\rangle,\ \boldsymbol{\theta}\in S_{\omega}\right\}.
\]

To analyze this method, for any $\beta>0$, we redefine the set $\Xi_{\beta}$ introduced in (3.1) by replacing the Bellman operator $\mathcal{T}^{\pi}$ with the minimax Bellman operator $\mathcal{H}$. Similar to (10), we still have a point $\boldsymbol{\theta}^{*}\in\Xi_{\omega}$ if and only if

\[
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}^{*}\right)\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\phi(s,a^{1},a^{2});\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\big\rangle\right]\geq 0,
\]

where $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})\in\mathcal{F}_{\omega,m}$, and $\widehat{\Delta}\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}\right)$ has the same structure as $\Delta\left(s,a^{1},a^{2},s^{\prime};\boldsymbol{\theta}\right)$ except that the function $Q(\cdot\,;\boldsymbol{\theta})$ is replaced by $\widehat{Q}(\cdot\,;\boldsymbol{\theta})$.

In contrast to neural temporal difference learning, which only evaluates the state-action values of a fixed learning policy, minimax Q-learning involves the minimax Bellman operator, which significantly complicates the analysis. Let us redefine the feature covariance matrix $\Sigma_{\pi}$ with respect to the learning policy $\pi=\{\pi^{1},\pi^{2}\}$, that is,

\[
\Sigma_{\pi}=\mathbb{E}_{\pi}\left[\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0})^{\top}\right].
\]

Let the action pair $(a^{1}_{\boldsymbol{\theta}},a^{2}_{\boldsymbol{\theta}})$ satisfy $\left\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}},a^{2}_{\boldsymbol{\theta}};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right\rangle=\max_{a^{1}\in\mathcal{A}_{1}}\min_{a^{2}\in\mathcal{A}_{2}}\left\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right\rangle$.
For each parameter pair $\boldsymbol{\theta}_{s}=(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})$, we define the action pair $(a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}})$ that satisfies

\[
(a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}})=\mathop{\arg\max}_{(a^{1},a^{2})\in\left\{\left(a^{1}_{\boldsymbol{\theta}_{1}},a^{2}_{\boldsymbol{\theta}_{2}}\right),\left(a^{1}_{\boldsymbol{\theta}_{2}},a^{2}_{\boldsymbol{\theta}_{1}}\right)\right\}}\left|\left\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\right\rangle\right|.
\]

Then for any $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}$, the minimax feature covariance matrix is defined as follows:

\[
\Sigma^{*}_{\pi}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})=\mathbb{E}_{\pi}\left[\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}};\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}};\boldsymbol{\theta}^{0})^{\top}\right].
\]

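The selection rule behind $\Sigma^{*}_{\pi}$ can be made concrete for a single state with finite action sets. The Python sketch below is illustrative only: `grad[a1][a2]` is assumed to hold the gradient vector $\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0})$ for a fixed state $s$, ties are broken by first occurrence, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxmin_actions(grad, theta):
    """Return (a1, a2) attaining max_{a1} min_{a2} <grad[a1][a2], theta>."""
    best = None
    for a1, row in enumerate(grad):
        # Inner minimisation over the second player's actions.
        a2 = min(range(len(row)), key=lambda j: dot(row[j], theta))
        value = dot(row[a2], theta)
        if best is None or value > best[0]:
            best = (value, a1, a2)
    return best[1], best[2]


def select_pair(grad, theta1, theta2):
    """Pick the candidate pair maximising |<grad, theta1 - theta2>|."""
    a1_first, a2_first = maxmin_actions(grad, theta1)
    a1_second, a2_second = maxmin_actions(grad, theta2)
    diff = [p - q for p, q in zip(theta1, theta2)]
    candidates = [(a1_first, a2_second), (a1_second, a2_first)]
    return max(candidates, key=lambda c: abs(dot(grad[c[0]][c[1]], diff)))


def minimax_covariance_term(grad, theta1, theta2):
    """Outer product of the selected gradient for this single state."""
    a1, a2 = select_pair(grad, theta1, theta2)
    g = grad[a1][a2]
    return [[gi * gj for gj in g] for gi in g]
```

Averaging `minimax_covariance_term` over sampled states would give an empirical counterpart of $\Sigma^{*}_{\pi}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})$ under the assumptions above.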
Figure 1: Training curves and the ratio of the largest and smallest non-zero singular values of $\Sigma_{\pi}$ over different network widths $m$.

Assumption 4.1.

There exists a constant $\nu\in(0,1)$ such that, for any $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}$, $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\succeq 0$.
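
As a short unpacking of this condition (an illustration, not part of the original statement), recall that $\boldsymbol{v}^{\top}\Sigma_{\pi}\boldsymbol{v}=\mathbb{E}_{\pi}\big[\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{v}\rangle^{2}\big]$ for any vector $\boldsymbol{v}$, and similarly for $\Sigma^{*}_{\pi}$. The semi-definite inequality is therefore the same as requiring, for every $\boldsymbol{v}$,

\[
\gamma^{2}\,\mathbb{E}_{\pi}\Big[\big\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1}_{\boldsymbol{\theta}_{s}},a^{2}_{\boldsymbol{\theta}_{s}};\boldsymbol{\theta}^{0}),\boldsymbol{v}\big\rangle^{2}\Big]\leq(1-\nu)^{2}\,\mathbb{E}_{\pi}\Big[\big\langle\nabla_{\boldsymbol{\theta}}Q(s,a^{1},a^{2};\boldsymbol{\theta}^{0}),\boldsymbol{v}\big\rangle^{2}\Big].
\]

In particular, when $\gamma>0$, any direction $\boldsymbol{v}$ with $\boldsymbol{v}^{\top}\Sigma_{\pi}\boldsymbol{v}=0$ must also satisfy $\boldsymbol{v}^{\top}\Sigma^{*}_{\pi}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\boldsymbol{v}=0$, so the condition allows $\Sigma_{\pi}$ to be singular.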

Note that the original version of this assumption in (Zou et al., 2019) in fact requires the strict positive definiteness condition $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\succ 0$. Under this stronger condition, Zou et al. (2019) obtained an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for minimax Q-learning with linear function approximation. With the help of our subspace analysis technique, we relax it in this paper to positive semi-definiteness ($\succeq 0$). Now we are ready to state our result for minimax neural Q-learning.

Theorem 4.2.

Suppose Assumptions 3.1, 3.2 and 4.1 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$. If the feature map satisfies $\|\phi(s,a^{1},a^{2})\|=1$ for each state-action tuple $(s,a^{1},a^{2})$ and the network width satisfies $m\geq m^{*}$, then the output $\boldsymbol{\theta}^{T}$ of the neural minimax Q-learning algorithm (Algorithm 3) satisfies

\[
\begin{aligned}
&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]\\
\leq{}&\frac{\widetilde{C}_{3}(\log T+1)}{\nu^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\cdot\sqrt{\log(T/\delta)}+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{\nu^{2}\lambda_{0}^{2}T},
\end{aligned}
\]

with probability at least $1-2\delta-2L\exp(-\widetilde{C}_{2}m)$, where $\tau^{*}$ is the mixing time of the Markov chain in Assumption 3.2, and $\{\widetilde{C}_{i}>0\}_{i=1,\ldots,5}$ are universal constants.

Theorem 4.2 establishes a finite-time analysis with $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for minimax neural Q-learning in terms of the function class $\mathcal{F}_{\omega,m}$. For a more detailed statement and the proof, see Appendix C. To the best of our knowledge, this is the first analysis of minimax Q-learning with neural network function approximation that achieves a complexity bound of $\tilde{\mathcal{O}}(\epsilon^{-1})$.
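
As a rough reading of how the bound in Theorem 4.2 translates into this rate (an illustration with constants and logarithmic factors suppressed, assuming $\tau^{*}\geq 1$, and not a restatement of the theorem), fix a target accuracy $\epsilon$ and split it evenly between the terms that decay in $T$ and the term that decays in $m$:

\[
\frac{\widetilde{C}_{3}(\log T+1)+\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{\nu^{2}\lambda_{0}^{2}T}\leq\frac{\epsilon}{2}
\quad\text{is ensured by}\quad
T=\tilde{\mathcal{O}}\!\left(\frac{\tau^{*}}{\nu^{2}\lambda_{0}^{2}\,\epsilon}\right),
\]

\[
\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\sqrt{\log(T/\delta)}\leq\frac{\epsilon}{2}
\quad\Longleftrightarrow\quad
m\geq\frac{4\widetilde{C}_{4}^{2}\log(T/\delta)}{\nu^{2}\lambda_{0}^{2}\,\epsilon^{2}}.
\]

The number of samples $T$ thus scales as $\epsilon^{-1}$ up to logarithmic factors, provided the width $m$ is large enough for the linearization error term to stay below $\epsilon/2$.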

5 Experiments

Finally, we conduct several experiments on the OpenAI Gym (Brockman et al., 2016) tasks to validate our theoretical findings. We consider a two-layer neural network, as follows:

\[
Q(s,a;\boldsymbol{\theta}):=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}b_{r}\sigma(\boldsymbol{\theta}_{r}^{\top}\phi(s,a)),
\]

where $\sigma(\cdot)$ is the ELU activation in this section. Details regarding the initialization and iteration methods for the parameters can be found in Section 2. For all experiments, we generate samples based on a prescribed $\epsilon$-greedy policy with $\epsilon=0.1$. To prevent redundancy in the features $\phi(s,a)$, we employ one-hot encoding for discrete state-action spaces and a fixed grid discretization for continuous spaces. When both $\phi(s,a)$ and $\phi(s^{\prime},a^{\prime})$ belong to the same one-hot encoding or grid cell, we treat them as the same sample point. We investigate the impact of network width on the TD learning algorithm from two perspectives: (i) whether the network width $m$ is correlated with the TD error, and (ii) whether constants $m^{*}$ and $\lambda_{0}$ satisfying Assumption 3.1 exist.
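
To make the two-layer model and its local linearization (the elements of $\mathcal{F}_{\omega,m}$) concrete, the following Python sketch evaluates $Q$, its gradient with respect to each $\boldsymbol{\theta}_{r}$, and the linearized value $\widehat{Q}$ around an initial point. It is a minimal sketch under stated assumptions, not the experimental code: the output weights `b` are treated as fixed, the ELU parameter is taken to be $\alpha=1$, the feature vector `x` stands for $\phi(s,a)$, and all function names are hypothetical.

```python
import math


def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def elu_grad(z, alpha=1.0):
    return 1.0 if z > 0 else alpha * math.exp(z)


def q_value(x, theta, b):
    """Q(x; theta) = (1/sqrt(m)) * sum_r b_r * ELU(<theta_r, x>)."""
    m = len(theta)
    total = 0.0
    for theta_r, b_r in zip(theta, b):
        pre = sum(w * xi for w, xi in zip(theta_r, x))
        total += b_r * elu(pre)
    return total / math.sqrt(m)


def q_grad(x, theta, b):
    """Gradient of Q with respect to each hidden weight vector theta_r."""
    m = len(theta)
    grads = []
    for theta_r, b_r in zip(theta, b):
        pre = sum(w * xi for w, xi in zip(theta_r, x))
        coeff = b_r * elu_grad(pre) / math.sqrt(m)
        grads.append([coeff * xi for xi in x])
    return grads


def q_linearized(x, theta, theta0, b):
    """Q_hat(x; theta) = Q(x; theta0) + <grad Q(x; theta0), theta - theta0>."""
    grads0 = q_grad(x, theta0, b)
    correction = sum(
        g * (w - w0)
        for g_r, theta_r, theta0_r in zip(grads0, theta, theta0)
        for g, w, w0 in zip(g_r, theta_r, theta0_r)
    )
    return q_value(x, theta0, b) + correction
```

By construction `q_linearized` agrees with `q_value` at `theta0`, and for parameters close to `theta0` the two differ only by a second-order term in the perturbation.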

The four subfigures in Figure 1 represent two types of environments: one with a discrete state space and the other with a continuous state space. The first two subfigures depict the convergence of the TD algorithm at different network widths. We generate 2,000 sample points and run for 500 epochs. Notably, as the width $m$ increases, the TD algorithm converges faster and reaches smaller final TD errors. The latter two subfigures illustrate the existence of $m^{*}$ and $\lambda_{0}$. Specifically, we compute the largest non-zero singular value $\sigma_{\max}$ and the smallest non-zero singular value $\sigma_{\min}$ of the matrix $\Sigma_{\pi}$. To factor out the absolute magnitude of $\sigma_{\min}$, we use the ratio $r=\sigma_{\max}/\sigma_{\min}$ as a metric to validate Assumption 3.1. In all cases, $r$ approaches a constant as $m$ increases, providing empirical support for the assumption.
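
The ratio $r$ can be computed from sampled gradients as in the following dependency-free Python sketch. It is an illustration rather than the code behind Figure 1: it forms an empirical $\Sigma_{\pi}$ as an average of outer products, uses the fact that the singular values of a symmetric positive semi-definite matrix coincide with its eigenvalues, and obtains those with a simple cyclic Jacobi iteration. The function names and the relative threshold `rel_tol` used to decide which values count as non-zero are assumptions; in practice a library SVD routine would be the natural choice.

```python
import math


def feature_covariance(gradients):
    """Empirical Sigma_pi: average of outer products g g^T."""
    n, d = len(gradients), len(gradients[0])
    sigma = [[0.0] * d for _ in range(d)]
    for g in gradients:
        for i in range(d):
            for j in range(d):
                sigma[i][j] += g[i] * g[j] / n
    return sigma


def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-24):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                angle = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(angle), math.sin(angle)
                for k in range(n):  # A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def singular_value_ratio(sigma, rel_tol=1e-10):
    """r = sigma_max / sigma_min over the non-zero singular values."""
    eigs = symmetric_eigenvalues(sigma)
    largest = eigs[-1]
    if largest <= 0.0:
        raise ValueError("matrix has no non-zero singular values")
    nonzero = [e for e in eigs if e > rel_tol * largest]
    return largest / min(nonzero)
```

Zero singular values are excluded through the relative threshold, so a rank-deficient $\Sigma_{\pi}$ still yields a finite ratio.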

6 Conclusion

We study the finite-time analysis of the TD and Q-learning methods with neural network approximation, where the state-action pairs are generated by a given policy under Markovian sampling. Beyond convergence to the true action-value function up to an inevitable function approximation error, we introduce an improved analysis technique that establishes an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity for the neural TD and Q-learning methods, improving on the existing $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity. For future work, it would be interesting to investigate whether the proposed technique can improve the current complexity estimates for actor-critic methods, which are partially built upon neural TD methods.

7 Acknowledgements

Dr. Zaiwen Wen is supported in part by the NSFC grant 12331010. Dr. Junyu Zhang is supported in part by the MOE AcRF grant A-0009530-05-00.

References

  • Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019a.
  • Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.
  • Barakat et al. (2022) Barakat, A., Bianchi, P., and Lehmann, J. Analysis of a target-based actor-critic algorithm with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp.  991–1040. PMLR, 2022.
  • Bertsekas (2012) Bertsekas, D. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
  • Bhandari et al. (2018) Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory, pp.  1691–1692. PMLR, 2018.
  • Borkar (2009) Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • Bowling & Veloso (2001) Bowling, M. and Veloso, M. Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pp.  1021–1026. Citeseer, 2001.
  • Boyan (2002) Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine learning, 49(2):233–246, 2002.
  • Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996.
  • Brandfonbrener & Bruna (2019) Brandfonbrener, D. and Bruna, J. Geometric insights into the convergence of nonlinear td learning. arXiv preprint arXiv:1905.12185, 2019.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cai et al. (2023) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. Neural temporal difference and q learning provably converge to global optima. Mathematics of Operations Research, 2023.
  • Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
  • Cao & Gu (2020) Cao, Y. and Gu, Q. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  3349–3356, 2020.
  • Cayci et al. (2023) Cayci, S., Satpathi, S., He, N., and Srikant, R. Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation. IEEE Transactions on Automatic Control, 2023.
  • Dalal et al. (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Fan et al. (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep q-learning. In Learning for dynamics and control, pp.  486–489. PMLR, 2020.
  • Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
  • Godfrey (2019) Godfrey, L. B. An evaluation of parametric activation functions for deep learning. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp.  3006–3011. IEEE, 2019.
  • Jaakkola et al. (1993) Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Ke et al. (2023) Ke, Z., Wen, Z., and Zhang, J. Provably efficient gauss-newton temporal difference learning method with function approximation. arXiv preprint arXiv:2302.13087, 2023.
  • Konda & Tsitsiklis (1999) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  • Kostrikov et al. (2021) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.
  • Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of lstd. In ICML-27th International Conference on Machine Learning, pp.  615–622, 2010.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Littman (1994) Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp.  157–163. Elsevier, 1994.
  • Liu et al. (2020a) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient td algorithms. arXiv preprint arXiv:2006.14364, 2020a.
  • Liu et al. (2020b) Liu, C., Zhu, L., and Belkin, M. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020b.
  • Maei et al. (2009) Maei, H., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in neural information processing systems, 22, 2009.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Perkins & Pendrith (2002) Perkins, T. J. and Pendrith, M. D. On the existence of fixed points for q-learning and sarsa in partially observable domains. In ICML, pp.  490–497, 2002.
  • Perolat et al. (2018) Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pp.  919–928. PMLR, 2018.
  • Prashanth et al. (2014) Prashanth, L., Korda, N., and Munos, R. Fast lstd using stochastic approximation: Finite time analysis and application to traffic control. In Joint European conference on machine learning and knowledge discovery in databases, pp.  66–81. Springer, 2014.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015.
  • Sun et al. (2022) Sun, T., Li, D., and Wang, B. Finite-time analysis of adaptive temporal difference learning with deep neural networks. Advances in Neural Information Processing Systems, 35:19592–19604, 2022.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  • Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning, pp.  993–1000, 2009a.
  • Sutton et al. (2009b) Sutton, R. S., Szepesvári, C., and Maei, H. R. A convergent o (n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in neural information processing systems, 21(21):1609–1616, 2009b.
  • Tagorti & Scherrer (2015) Tagorti, M. and Scherrer, B. On the rate of convergence and error bounds for lstd ($\lambda$). In International Conference on Machine Learning, pp. 1521–1529. PMLR, 2015.
  • Tesauro et al. (1995) Tesauro, G. et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  • Tian et al. (2022) Tian, H., Paschalidis, I., and Olshevsky, A. On the performance of temporal difference learning with neural networks. In The Eleventh International Conference on Learning Representations, 2022.
  • Touati et al. (2018) Touati, A., Bacon, P.-L., Precup, D., and Vincent, P. Convergent tree backup and retrace with function approximation. In International Conference on Machine Learning, pp. 4955–4964. PMLR, 2018.
  • Tsitsiklis & Van Roy (1996) Tsitsiklis, J. and Van Roy, B. Analysis of temporal-difference learning with function approximation. Advances in neural information processing systems, 9, 1996.
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
  • Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xu & Gu (2020) Xu, P. and Gu, Q. A finite-time analysis of q-learning with neural network function approximation. In International Conference on Machine Learning, pp. 10555–10565. PMLR, 2020.
  • Zou et al. (2019) Zou, S., Xu, T., and Liang, Y. Finite-sample analysis for sarsa with linear function approximation. Advances in neural information processing systems, 32, 2019.

Appendix A Details of Section 3

A.1 Proof of (14)

\begin{align*}
&\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{*}\right),\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}\big\rangle\right] \\
\overset{(i)}{=}\ &\mathbb{E}_{\mu,\pi}\left[\mathbb{E}_{\mathbb{P}}\big[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}_{*}\right)\big]\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right] \\
\overset{(ii)}{=}\ &\mathbb{E}_{\mu,\pi}\left[\mathbb{E}_{\mathbb{P}}\big[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\big]\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right] \\
=\ &\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x},\boldsymbol{x}^{\prime};\bar{\boldsymbol{\theta}}\right)\cdot\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x};\boldsymbol{\theta}_{0}\right),\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}\big\rangle\right]
\end{align*}

where (i) is because $\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\cdot\,;\boldsymbol{\theta}_{*}\right)=\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\cdot\,;\boldsymbol{\theta}_{0}\right)$, the decomposition

\begin{equation*}
\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}+\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}_{*}=\boldsymbol{\theta}^{\prime}-\bar{\boldsymbol{\theta}}+(\bar{\boldsymbol{\theta}}_{\bot}-\boldsymbol{\theta}_{\bot}),
\end{equation*}

the fact that $(\bar{\boldsymbol{\theta}}_{\bot}-\boldsymbol{\theta}_{\bot})\in\mathcal{K}(\Sigma_{\pi})$, and (13); (ii) is because $(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*})\in\mathcal{K}(\Sigma_{\pi})$.

A.2 Proof of Theorem 3.6

Proof.

Recall the definition of the semi-gradient in Section (6). We denote $\bar{\mathbf{g}}(\boldsymbol{\theta})$ as its expectation. Let $\bar{\mathbf{m}}(\boldsymbol{\theta})$ and $\mathbf{m}(\boldsymbol{\theta})$ also be the corresponding semi-gradients based on the linearized function $\widehat{Q}(\cdot;\boldsymbol{\theta})$, that is,

\begin{align*}
\mathbf{g}(\boldsymbol{\theta}^{t}) &= \Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}), &\quad \bar{\mathbf{g}}(\boldsymbol{\theta}^{t}) &= \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right] \\
\mathbf{m}(\boldsymbol{\theta}^{t}) &= \widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0}), &\quad \bar{\mathbf{m}}(\boldsymbol{\theta}^{t}) &= \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],
\end{align*}

where

\begin{align*}
\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}) &= Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\cdot Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right), \\
\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}) &= \widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\cdot\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right).
\end{align*}
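As a concrete illustration of these definitions, the following Python sketch evaluates both TD errors and both semi-gradients on a single transition. It is only a sketch under stated assumptions: a two-layer ReLU network with fixed output weights stands in for the general $L$-layer network, the linearization is taken as the first-order expansion of $Q$ around the initial weights, and all function names, the width, the features and the reward are hypothetical choices made for illustration.

```python
import math
import random

def relu_net(x, W, b):
    """Q(x; W) = (1/sqrt(m)) * sum_j b_j * relu(<w_j, x>); b is fixed (assumption)."""
    m = len(W)
    return sum(bj * max(0.0, sum(wi * xi for wi, xi in zip(w, x)))
               for w, bj in zip(W, b)) / math.sqrt(m)

def relu_net_grad(x, W, b):
    """Gradient of Q(x; W) with respect to W, flattened row by row."""
    m = len(W)
    grad = []
    for w, bj in zip(W, b):
        active = 1.0 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0.0
        grad.extend(bj * active * xi / math.sqrt(m) for xi in x)
    return grad

def flatten(W):
    return [wi for w in W for wi in w]

def linearized_q(x, W, W0, b):
    """Q_hat(x; W) = Q(x; W0) + <grad Q(x; W0), W - W0>."""
    g0 = relu_net_grad(x, W0, b)
    diff = [a - c for a, c in zip(flatten(W), flatten(W0))]
    return relu_net(x, W0, b) + sum(gi * di for gi, di in zip(g0, diff))

def semi_gradients(x_t, x_next, reward, gamma, W, W0, b):
    """Return (g, m): the neural and the linearized semi-gradients."""
    # TD error of the network itself.
    delta = relu_net(x_t, W, b) - (reward + gamma * relu_net(x_next, W, b))
    # TD error of the linearized function.
    delta_hat = linearized_q(x_t, W, W0, b) - (
        reward + gamma * linearized_q(x_next, W, W0, b))
    # g uses the gradient at the current weights, m the gradient at W0.
    g = [delta * gi for gi in relu_net_grad(x_t, W, b)]
    m = [delta_hat * gi for gi in relu_net_grad(x_t, W0, b)]
    return g, m

# Hypothetical toy data: input dimension 3, width 8, one transition.
rng = random.Random(0)
d, width = 3, 8
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(width)]
b = [rng.choice([-1.0, 1.0]) for _ in range(width)]
x_t = [0.6, 0.0, 0.8]
x_next = [0.0, 1.0, 0.0]

# At the initialization the two semi-gradients coincide by construction.
g0, m0 = semi_gradients(x_t, x_next, reward=1.0, gamma=0.9, W=W0, W0=W0, b=b)
gap_at_init = max(abs(a - c) for a, c in zip(g0, m0))
```

At the initial weights the network and its linearization agree, so `gap_at_init` vanishes; away from the initialization the difference $\mathbf{g}-\mathbf{m}$ is the linearization error that the proof has to control.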

To simplify the notation, let $\Delta_{t}:=\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})$ and $\widehat{\Delta}_{t}:=\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})$. Recall the definition of the range space $\mathcal{R}(\Sigma_{\pi})$ and the kernel space $\mathcal{K}(\Sigma_{\pi})$.
By Proposition 3.4, we know that $\boldsymbol{v}_{1}^{\top}\boldsymbol{v}_{2}=0$ for any vector $\boldsymbol{v}_{1}\in\mathcal{R}(\Sigma_{\pi}),\boldsymbol{v}_{2}\in\mathcal{K}(\Sigma_{\pi})$, thus $\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}_{\bot}\right\rangle=0$ for any feature map $\boldsymbol{x}$ and parameter $\boldsymbol{\theta}_{\bot}\in\mathcal{K}(\Sigma_{\pi})$. Then we can decompose $\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}$ as

\begin{align*}
\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2} &= \left\|\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})\right)-\Pi_{S_{2\omega}}\left(\boldsymbol{\theta}^{t+1}_{*}\right)\right\|^{2} \\
&\leq \|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t+1}_{*}\|^{2} \\
&= \|\boldsymbol{\theta}^{t}-\eta_{t}\mathbf{g}(\boldsymbol{\theta}^{t})-\boldsymbol{\theta}^{t}_{*}+\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2} \\
&= \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}+\|\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}-2\eta_{t}\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right\rangle \\
&\quad -2\eta_{t}\left\langle\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right\rangle+2\eta_{t}\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\right\rangle \\
&\overset{(i)}{=} \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})\right\rangle \\
&\quad -2\eta_{t}\left\langle\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right\rangle+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2},
\end{align*}

where (i) follows from

\begin{equation*}
\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*}\right\rangle=\left\langle\boldsymbol{\theta}^{t}_{\parallel}-\boldsymbol{\theta}^{*}_{\parallel},\boldsymbol{\theta}^{t}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot}\right\rangle=0,
\end{equation*}

and

\begin{align*}
\left\langle\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\mathbf{m}(\boldsymbol{\theta}^{t})\right\rangle &= \left\langle\boldsymbol{\theta}^{t}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot},\mathbf{m}(\boldsymbol{\theta}^{t})\right\rangle \\
&= \widehat{\Delta}_{t}\cdot\left\langle\boldsymbol{\theta}^{t}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot},\nabla Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\rangle\ =\ 0.
\end{align*}

Recall the stationarity condition (11): for any $t\in\{1,2,\cdots,T\}$,

$$\begin{aligned}
0 &\leq \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{*}\right)\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{*}\right),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\big\rangle\right]\\
&= \big\langle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}_{*}\right)\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}_{*}\right)\right],\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\big\rangle\\
&\overset{(i)}{=} \big\langle\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t}_{*}\right)\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}_{*}\right)\right],\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle \ =\ \left<\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*}),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right>,
\end{aligned}$$

where (i) follows by the same argument as in the proof in Section A.1. Therefore,

$$\begin{aligned}
&\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\\
\leq{}& \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}\\
={}& \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>\\
& -2\eta_{t}\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>-2\eta_{t}\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>+\eta_{t}^{2}\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}\\
\overset{(i)}{\leq}{}& \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}+\eta_{t}^{2}\underbrace{\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}}_{\mathbf{I}_{1}\text{: Gradient Bound}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>}_{\mathbf{I}_{2}\text{: Gradient Gap}}\\
& -2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \mathbf{m}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>}_{\mathbf{I}_{3}\text{: Markov Sampling Error}}-2\eta_{t}\underbrace{\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>}_{\mathbf{I}_{4}\text{: Gradient Descent}},
\end{aligned}\tag{18}$$

where (i) follows from $\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\geq 0$ for any $0\leq t\leq T-1$.
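
Step (i) in (18) combines two small manipulations, which may be easier to follow when written out separately. The block below is only a sketch of that algebra using the symbols already defined above; it assumes a positive step size $\eta_t>0$ and introduces no new quantities.

```latex
% (a) Split the mean-path term around \bar{m}(\theta^t_*) and drop the
%     nonnegative inner product <\theta^t - \theta^t_*, \bar{m}(\theta^t_*)> >= 0:
\begin{aligned}
-2\eta_t\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})\right>
&= -2\eta_t\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>
   -2\eta_t\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\\
&\leq -2\eta_t\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>.
\end{aligned}

% (b) Merge the two terms that involve g - m by linearity of the inner product:
\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>
+\left<\boldsymbol{\theta}^{t}_{*}-\boldsymbol{\theta}^{t+1}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>
=\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*},\ \mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})\right>.
```

Part (a) produces the term labelled $\mathbf{I}_4$ and part (b) produces the term labelled $\mathbf{I}_2$.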

Next, we bound $\mathbf{I}_{1}$, $\mathbf{I}_{2}$, $\mathbf{I}_{3}$ and $\mathbf{I}_{4}$ one by one. To simplify the notation, let $\left\{C_{i}>0\right\}_{i=1,\ldots,7}$ be universal constants in this section. We set $\omega=C_{1}$ and $\delta\in(0,1)$. By Lemma D.5, we have

$$\|\mathbf{g}(\boldsymbol{\theta}^{t})\|^{2}\leq C_{2}\sqrt{\log(T/\delta)}\tag{19}$$

and

$$\begin{aligned}
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right>\right|\mid\boldsymbol{\theta}^{0}\right]
&\leq \left(C_{3}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}+C_{4}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|\\
&\overset{(i)}{\leq} C_{5}m^{-1/2}\sqrt{\log(T/\delta)},
\end{aligned}\tag{20}$$

with probability at least $1-2\delta-2\exp(-C_{6}m)$, where (i) follows from $\omega=C_{1}$ and

$$\begin{aligned}
\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\| &\leq \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}_{*}-\boldsymbol{\theta}^{t+1}_{*}\|\\
&= \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot}\|\\
&\leq \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{0}_{*}\|+\|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{t+1}\|\ \leq\ 3\omega.
\end{aligned}$$
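
The equality and the last two inequalities in the chain above are stated without comment. One plausible reading is sketched here. It assumes that $\boldsymbol{\theta}^{t}_{*}=\boldsymbol{\theta}^{*}_{\parallel}+\boldsymbol{\theta}^{t}_{\bot}$, which is what the orthogonality identities at the start of this argument suggest, and that $(\cdot)_{\bot}$ is an orthogonal projection, written $P_{\bot}$ here purely for illustration.

```latex
% The parallel components cancel, leaving only the orthogonal parts:
\boldsymbol{\theta}^{0}_{*}-\boldsymbol{\theta}^{t+1}_{*}
 = \left(\boldsymbol{\theta}^{*}_{\parallel}+\boldsymbol{\theta}^{0}_{\bot}\right)
 - \left(\boldsymbol{\theta}^{*}_{\parallel}+\boldsymbol{\theta}^{t+1}_{\bot}\right)
 = \boldsymbol{\theta}^{0}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot}.

% An orthogonal projection is linear and non-expansive:
\|\boldsymbol{\theta}^{0}_{\bot}-\boldsymbol{\theta}^{t+1}_{\bot}\|
 = \|P_{\bot}(\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{t+1})\|
 \leq \|\boldsymbol{\theta}^{0}-\boldsymbol{\theta}^{t+1}\|.
```

The final bound by $3\omega$ then presumably uses that each of the three remaining distances is at most $\omega$; that fact is taken from the surrounding setup and is not re-derived here.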

Thus (19) and (20) provide upper bounds on $\mathbf{I}_{1}$ and $\mathbf{I}_{2}$, respectively. The next lemma provides an estimate of the Markov sampling error.

Lemma A.1.

Suppose the learning rate sequence $\left\{\eta_{0},\eta_{1},\ldots,\eta_{T}\right\}$ is non-increasing. Under Assumption 3.2, it holds that

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)-\overline{\mathbf{m}}\left(\boldsymbol{\theta}^{t}\right),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right]\leq C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,\,t-\tau^{*}\right\}},\tag{21}$$

for any fixed $t\leq T$, where

$$\tau^{*}=\min\left\{t=0,1,2,\ldots\mid\kappa\rho^{t}\leq\eta_{T}\right\}$$

is the mixing time of the Markov chain $\left\{s_{t},a_{t}\right\}_{t=0,1,\ldots}$.

Proof.

We follow the proof of Lemma 6.2 in Xu & Gu (2020). The neural network setting here differs from theirs, which changes the bounds on the norms of the gradients and parameters and therefore leads to slightly different constants in the result. We obtain

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)-\overline{\mathbf{m}}\left(\boldsymbol{\theta}^{t}\right),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right]\leq C_{7}\left(\log(T/\delta)+\omega^{2}\right)\tau^{*}\eta_{\max\left\{0,\,t-\tau^{*}\right\}}.$$
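
As a small numerical illustration of the mixing time $\tau^{*}$ from Lemma A.1, the sketch below evaluates $\tau^{*}=\min\{t\geq 0\mid\kappa\rho^{t}\leq\eta_{T}\}$ by direct search. The function name `mixing_time` and the sample values of $\kappa$, $\rho$ and $\eta_{T}$ are hypothetical and chosen only for illustration; $\kappa>0$ and $\rho\in(0,1)$ are assumed to be the constants of the geometric mixing condition that Assumption 3.2 is taken to provide.

```python
def mixing_time(kappa: float, rho: float, eta_T: float) -> int:
    """Smallest integer t >= 0 with kappa * rho**t <= eta_T.

    Illustrative helper only: kappa and rho are assumed to be the
    constants of a geometric mixing bound, eta_T the final step size.
    """
    if kappa <= 0.0 or eta_T <= 0.0:
        raise ValueError("kappa and eta_T must be positive")
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")

    t = 0
    bound = kappa  # equals kappa * rho**t for the current t
    while bound > eta_T:
        bound *= rho
        t += 1
    return t


# Hypothetical values: kappa = 1, rho = 0.5, final step size 0.1.
print(mixing_time(1.0, 0.5, 0.1))  # → 4, since 0.5**4 = 0.0625 <= 0.1 < 0.5**3
```

Because $\kappa\rho^{t}$ decays geometrically, $\tau^{*}$ grows only logarithmically in $1/\eta_{T}$.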

Recalling the definitions of $\lambda_{0}$ and $\Sigma_{\pi}$ and the discussion in Section 3.2, we derive Lemmas A.2 and A.3 to estimate $\mathbf{I}_{4}$.

Lemma A.2.

Let $\lambda_{0}$ be the minimum nonzero singular value of $\Sigma_{\pi}$. For any $\boldsymbol{\theta}\in\mathcal{R}(\Sigma_{\pi})$, we have

$$\boldsymbol{\theta}^{\top}\Sigma_{\pi}\boldsymbol{\theta}\geq\lambda_{0}\|\boldsymbol{\theta}\|_{2}^{2}.$$
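
Lemma A.2 is the familiar restricted-eigenvalue property of a symmetric positive semidefinite matrix. The toy check below, in plain Python, builds a rank-2 matrix $\Sigma=\lambda_{1}u_{1}u_{1}^{\top}+\lambda_{2}u_{2}u_{2}^{\top}$ in $\mathbb{R}^{3}$ and compares both sides of the inequality. All numbers and helper names (`dot`, `quad_form`, `in_range`) are hypothetical and not taken from the paper, and $\Sigma_{\pi}$ is assumed to be symmetric positive semidefinite so that its nonzero singular values and eigenvalues coincide.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quad_form(eigvecs, eigvals, theta):
    """theta^T Sigma theta for Sigma = sum_i eigvals[i] * u_i u_i^T."""
    return sum(lam * dot(u, theta) ** 2 for u, lam in zip(eigvecs, eigvals))


# Hypothetical spectrum: two nonzero eigenvalues and one zero eigenvalue.
s = 1.0 / math.sqrt(2.0)
u1 = (s, s, 0.0)      # eigenvector for eigenvalue 3.0
u2 = (s, -s, 0.0)     # eigenvector for eigenvalue 0.5
u3 = (0.0, 0.0, 1.0)  # null direction, orthogonal to the range of Sigma
eigvecs, eigvals = (u1, u2), (3.0, 0.5)
lambda_0 = min(eigvals)  # minimum nonzero eigenvalue


def in_range(a, b):
    """Vector a*u1 + b*u2, which lies in the range of Sigma."""
    return tuple(a * x + b * y for x, y in zip(u1, u2))


# Inside the range the inequality holds.
theta_in = in_range(2.0, -1.0)
lhs_in = quad_form(eigvecs, eigvals, theta_in)  # about 3*2**2 + 0.5*1**2 = 12.5
rhs_in = lambda_0 * dot(theta_in, theta_in)     # about 0.5 * 5 = 2.5

# Outside the range it can fail: the null direction gives 0 on the left.
lhs_out = quad_form(eigvecs, eigvals, u3)
rhs_out = lambda_0 * dot(u3, u3)

print(lhs_in >= rhs_in, lhs_out >= rhs_out)
```

The second comparison shows why the lemma restricts $\boldsymbol{\theta}$ to $\mathcal{R}(\Sigma_{\pi})$: along a null direction the quadratic form vanishes while $\lambda_{0}\|\boldsymbol{\theta}\|_{2}^{2}$ does not.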
Lemma A.3.

Under Assumption 3.3, we have

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left<\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right>\mid\boldsymbol{\theta}^{0}\right]\geq(1-\gamma)\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}.\tag{22}$$
Proof.

Define $d\sim\mu\times\pi$. To begin with, the Bellman operator $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in the $\ell_{2}$-norm, since $d$ is the stationary distribution of $(s,a)$ under the policy $\pi$. In detail, consider

$$\begin{aligned}
\mathbb{E}_{(s,a)\sim d}\left[\left(\mathcal{T}^{\pi}Q_{1}(\boldsymbol{x})-\mathcal{T}^{\pi}Q_{2}(\boldsymbol{x})\right)^{2}\right]
&=\gamma^{2}\,\mathbb{E}_{(s,a)\sim d}\left[\mathbb{E}\left[\left(Q_{1}(\boldsymbol{x}^{\prime})-Q_{2}(\boldsymbol{x}^{\prime})\right)^{2}\mid s^{\prime}\sim\mathbb{P}(\cdot|s,a),\ a^{\prime}\sim\pi(\cdot|s^{\prime})\right]\right]\\
&\overset{(i)}{\leq}\gamma^{2}\,\mathbb{E}_{(s,a)\sim d}\left[\left(Q_{1}(\boldsymbol{x})-Q_{2}(\boldsymbol{x})\right)^{2}\right],
\end{aligned}\tag{23}$$

where (i) holds because $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ have the same stationary distribution. To simplify the notation, we write $\mathbb{E}[\cdot]$ for $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ in the proof of this lemma. Then we compute

$$
\begin{aligned}
&\mathbb{E}\left[\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\,\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right\rangle\mid\boldsymbol{\theta}^{0}\right] \\
={}&\mathbb{E}\left[\left(\widehat{\Delta}(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\,\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right] \\
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\,\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right] \\
&-\gamma\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\,\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right] \\
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)\right] \\
\overset{(i)}{\geq}{}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}\cdot\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}^{\prime};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]} \\
\overset{(ii)}{\geq}{}&\left(1-\gamma\right)\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right] \\
={}&(1-\gamma)(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}) \\
\overset{(iii)}{\geq}{}&(1-\gamma)\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2},
\end{aligned}
\tag{24}
$$

where (i) follows from the Cauchy-Schwarz inequality, (ii) follows from (23), and (iii) follows from $\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}=\boldsymbol{\theta}^{t}_{\parallel}-\bar{\boldsymbol{\theta}}_{\parallel}\in\mathcal{R}(\Sigma_{\pi})$ and Lemma A.2, which provides $\lambda_{0}$-strong convexity. This completes the proof of Lemma A.3. ∎
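The Cauchy-Schwarz and stationarity steps in (24) can be sanity-checked numerically on a small finite chain. The sketch below is only an illustration under stated assumptions: `P` is a hypothetical row-stochastic transition matrix over state-action pairs under the policy, `u` stands in for the difference $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})$, and the function names are not from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution d of a row-stochastic matrix P, i.e. d @ P = d."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    # Take the left eigenvector whose eigenvalue is closest to 1.
    idx = np.argmin(np.abs(eigvals - 1.0))
    d = np.real(eigvecs[:, idx])
    return d / d.sum()

def monotonicity_gap(P, u, gamma):
    """Return (lhs, rhs) of E[u(x)^2] - gamma E[u(x)u(x')] >= (1 - gamma) E[u(x)^2]."""
    d = stationary_distribution(P)
    second_moment = d @ (u ** 2)        # E_d[u(x)^2]
    cross_term = d @ (u * (P @ u))      # E_d[u(x) u(x')], with x' drawn from P(x, .)
    lhs = second_moment - gamma * cross_term
    rhs = (1.0 - gamma) * second_moment
    return lhs, rhs

# Hypothetical toy instance: 6 state-action pairs, strictly positive transitions.
rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
u = rng.standard_normal(n)

lhs, rhs = monotonicity_gap(P, u, gamma=0.9)
print("inequality holds:", lhs >= rhs - 1e-12)
```

The check relies on $\boldsymbol{x}^{\prime}$ having the same marginal as $\boldsymbol{x}$ under the stationary distribution, which is exactly what step (i) of (23) uses.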

Given $\boldsymbol{\theta}^{0}$, taking the expectation on both sides of (18) and plugging (19) through (22) into (18) yields

$$
\begin{aligned}
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]
\leq{}&(1-2\eta_{t}(1-\gamma)\lambda_{0})\,\mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{2}\eta_{t}^{2} \\
&+2\eta_{t}C_{5}m^{-1/2}\sqrt{\log(T/\delta)}+2\eta_{t}C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}.
\end{aligned}
$$

We choose $\eta_{t}=\frac{1}{2(1-\gamma)\lambda_{0}(t+1)}$ and obtain

$$
\begin{aligned}
(1-\gamma)\lambda_{0}(t+1)\,\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]
\leq{}&(1-\gamma)\lambda_{0}t\,\mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{2}\eta_{t} \\
&+C_{5}m^{-1/2}\sqrt{\log(T/\delta)}+C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}.
\end{aligned}
$$
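To see how this step size produces the weighted form, the following short worked step uses only the quantities already defined above:

$$
2\eta_{t}(1-\gamma)\lambda_{0}=\frac{1}{t+1},\qquad
1-2\eta_{t}(1-\gamma)\lambda_{0}=\frac{t}{t+1},\qquad
(1-\gamma)\lambda_{0}(t+1)\cdot C_{2}\eta_{t}^{2}=\frac{C_{2}\eta_{t}}{2}\leq C_{2}\eta_{t}.
$$

Multiplying the recursion by $(1-\gamma)\lambda_{0}(t+1)$ therefore turns the contraction factor into $(1-\gamma)\lambda_{0}t$ and turns the prefactor $2\eta_{t}$ of the last two terms into $1$, so the weighted errors telescope when summed over $t$.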

Summing the above inequality over $t=0,1,\cdots,T-1$ yields

$$
\begin{aligned}
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{T}-\boldsymbol{\theta}^{T}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]
\leq{}&\frac{1}{(1-\gamma)\lambda_{0}T}\sum_{t=0}^{T-1}\left(C_{2}\eta_{t}+C_{5}m^{-1/2}\sqrt{\log(T/\delta)}+C_{7}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\left\{0,t-\tau^{*}\right\}}\right) \\
\leq{}&\frac{C_{2}(\log T+1)}{2(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{C_{5}m^{-1/2}\sqrt{\log(T/\delta)}}{(1-\gamma)\lambda_{0}}+\frac{C_{8}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{2(1-\gamma)^{2}\lambda_{0}^{2}T}.
\end{aligned}
$$
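The $\mathcal{O}(\log T/T)$ decay of the leading term can be illustrated with a small numerical sketch. It iterates only the noise-free part of the recursion, $a_{t+1}=(1-2\eta_{t}(1-\gamma)\lambda_{0})a_{t}+C_{2}\eta_{t}^{2}$, and compares it with the first term of the bound. The constants passed in (`gamma`, `lam0`, `c2`, `a0`) are arbitrary illustrative values, and the function names are hypothetical, not from the paper.

```python
import math

def step_size(t, gamma, lam0):
    """eta_t = 1 / (2 (1 - gamma) lam0 (t + 1)), the step size chosen in the proof."""
    return 1.0 / (2.0 * (1.0 - gamma) * lam0 * (t + 1))

def step_size_sum(T, gamma, lam0):
    """Sum of eta_t over t = 0, ..., T-1 (a scaled harmonic sum)."""
    return sum(step_size(t, gamma, lam0) for t in range(T))

def run_recursion(T, gamma, lam0, c2, a0):
    """Iterate a_{t+1} = (1 - 2 eta_t (1 - gamma) lam0) a_t + c2 eta_t^2 for T steps.

    The m^{-1/2} and mixing-time terms of the full recursion are dropped here.
    """
    a = a0
    for t in range(T):
        eta = step_size(t, gamma, lam0)
        a = (1.0 - 2.0 * eta * (1.0 - gamma) * lam0) * a + c2 * eta ** 2
    return a

def leading_bound(T, gamma, lam0, c2):
    """First term of the summed bound: c2 (log T + 1) / (2 (1 - gamma)^2 lam0^2 T)."""
    return c2 * (math.log(T) + 1.0) / (2.0 * (1.0 - gamma) ** 2 * lam0 ** 2 * T)

# Illustrative constants only.
gamma, lam0, c2, a0 = 0.9, 0.5, 2.0, 5.0
for T in (1, 10, 100, 1000):
    print(T, run_recursion(T, gamma, lam0, c2, a0), leading_bound(T, gamma, lam0, c2))
```

Because the contraction factor equals $t/(t+1)$, the initial error `a0` is multiplied by zero at $t=0$, which is why the bound does not depend on it.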

Therefore, according to the gradient bound (19), we have

$$
\begin{aligned}
\mathbb{E}_{\mu,\pi}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T}_{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right] \\
\leq{}&C_{3}^{2}m\,\mathbb{E}\left[\|\boldsymbol{\theta}^{T}-\boldsymbol{\theta}^{T}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right] \\
\leq{}&\frac{C_{2}^{3}(\log T+1)}{2(1-\gamma)^{2}\lambda_{0}^{2}T}+\frac{C_{2}^{2}C_{5}m^{-1/2}\sqrt{\log(T/\delta)}}{(1-\gamma)\lambda_{0}} \\
&+\frac{C_{2}^{2}C_{8}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{2(1-\gamma)^{2}\lambda_{0}^{2}T}
\end{aligned}
$$

with probability at least $1-2\delta-2L\exp(-C_{6}m)$. Setting $\widetilde{C}_{1}=\max\{1,C_{1}\}$, $\widetilde{C}_{2}=C_{6}$, $\widetilde{C}_{3}=\frac{C_{2}^{3}}{2}$, $\widetilde{C}_{4}=\frac{C_{2}^{2}C_{5}}{2}$, and $\widetilde{C}_{5}=\frac{C_{2}^{2}C_{8}}{2}$ completes the proof. ∎

A.3 Proof of Theorem 3.7

Proof.

Let $(s,a)\sim\mu\times\pi=:d$. To simplify the notation, we write $\mathbb{E}[\cdot]$ for $\mathbb{E}_{(s,a)\sim d}[\cdot]$ in this subsection. Note that

$$
\begin{aligned}
\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-Q^{*}(s,a)\right)^{2}\mid\boldsymbol{\theta}^{0}\right]
\leq{}&3\,\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})\right)^{2}\mid\boldsymbol{\theta}^{0}\right] \\
&+3\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]+3\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right)^{2}\mid\boldsymbol{\theta}^{0}\right].
\end{aligned}
$$
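The factor $3$ comes from an elementary inequality. As a brief worked step, write $a=Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})$, $b=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$, and $c=\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)$ (shorthand introduced here only for illustration), so that $a+b+c=Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-Q^{*}(s,a)$. The Cauchy-Schwarz inequality applied to the vectors $(1,1,1)$ and $(a,b,c)$ gives

$$
(a+b+c)^{2}\leq 3\left(a^{2}+b^{2}+c^{2}\right),
$$

and taking the conditional expectation of both sides yields the three-term decomposition.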

By Lemma D.4, we have

$$
\mathbb{E}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})\right)^{2}\mid\boldsymbol{\theta}^{0}\right]\leq C_{8}m^{-1}
\tag{25}
$$

with probability at least $1-\delta$. Recall that $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$ is the fixed point of $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}$ and $Q^{*}(s,a)$ is the fixed point of $\mathcal{T}$. We define the $\ell_{2}$-norm $\|f(s,a)\|_{d}^{2}=\mathbb{E}_{(s,a)\sim d}\left[f(s,a)^{2}\right]$. Thus

$$
\begin{aligned}
\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}
={}&\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
\overset{(i)}{=}{}&\left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)+\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
\leq{}&\left\|\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
\overset{(ii)}{\leq}{}&\gamma\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}+\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d},
\end{aligned}
$$

where (i) is due to the properties of the fixed point, and (ii) holds because $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}$ is $\gamma$-contractive in the $\infty$-norm. This further means that

$$\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}^{2}\leq\frac{1}{(1-\gamma)^{2}}\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}^{2}. \tag{26}$$
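The step from the inequality chain to (26) is a one-line rearrangement. The sketch below spells it out, assuming that the left-hand side of the chain is $\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\|_{d}$ (the quantity bounded in (26)) and using $\gamma\in(0,1)$, so that $1-\gamma>0$:

```latex
\begin{align*}
\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}
  &\leq \gamma\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}
      +\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
\Longrightarrow\quad
(1-\gamma)\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}
  &\leq \left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d} \\
\Longrightarrow\quad
\left\|\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})-Q^{*}(s,a)\right\|_{d}^{2}
  &\leq \frac{1}{(1-\gamma)^{2}}\left\|\Pi_{\mathcal{F}_{\omega,m}}Q^{*}(s,a)-Q^{*}(s,a)\right\|_{d}^{2}.
\end{align*}
```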

Plugging (25) and (26) into (A.3) and using Theorem 3.6, we complete the proof. ∎

Appendix B Convergence Results of Neural Q-learning

B.1 Neural Q-Learning Algorithm

For neural Q-learning, let us redefine some of the above notation. Let the optimal Q-function be $Q^{*}(s,a)=\sup_{\pi}Q^{\pi}(s,a)$ for all state-action pairs $(s,a)$; then the optimal sequence of actions that maximizes the expected cumulative reward follows $a_{t}=\mathop{\mathrm{argmax}}_{a'\in\mathcal{A}}Q^{*}(s_{t},a')$, $t\geq 0$. Therefore, to obtain a near-optimal policy, it is sufficient to find some $\hat{Q}$ that approximates $Q^{*}$ well. Define the Bellman optimality operator $\mathcal{T}$ as

$$\mathcal{T}Q(s,a):=r(s,a)+\gamma\,\mathbb{E}\left[\max_{a'}Q(s',a')\mid s'\sim\mathbb{P}(\cdot\mid s,a)\right],$$

for any $(s,a)$. We retain the definition of the local linearization function class $\mathcal{F}_{\omega,m}$ introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:

$$\min_{\boldsymbol{\theta}\in S_{\omega}}\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left(Q(\boldsymbol{x};\boldsymbol{\theta})-\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}Q(\boldsymbol{x};\boldsymbol{\theta})\right)^{2}\right].$$
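To make the Bellman optimality operator $\mathcal{T}$ defined above concrete, the following is a minimal tabular sketch in NumPy. The toy MDP (three states, two actions, random rewards and transitions) and the function name `bellman_optimality` are assumptions made purely for illustration; they do not come from the paper, which works with neural network approximation rather than tables.

```python
import numpy as np

def bellman_optimality(Q, r, P, gamma):
    """Apply (TQ)(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)}[max_{a'} Q(s',a')].

    Q, r: arrays of shape (S, A); P: array of shape (S, A, S) whose last
    axis is a probability distribution over next states.
    """
    v_next = Q.max(axis=1)            # max_{a'} Q(s', a') for every s'
    return r + gamma * (P @ v_next)   # expectation over s', shape (S, A)

# Hypothetical toy MDP, used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)     # normalize to valid transition kernels
r = rng.random((S, A))

# Two arbitrary Q tables: the unprojected operator is a gamma-contraction
# in the sup-norm, so lhs should not exceed rhs.
Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
lhs = np.abs(bellman_optimality(Q1, r, P, gamma)
             - bellman_optimality(Q2, r, P, gamma)).max()
rhs = gamma * np.abs(Q1 - Q2).max()

# Value iteration: repeated application converges to the fixed point Q*.
Q_star = np.zeros((S, A))
for _ in range(500):
    Q_star = bellman_optimality(Q_star, r, P, gamma)
```

Iterating the operator, as in the last loop, recovers $Q^{*}$ in the tabular case; the projected operator $\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}$ in the MSPBE problem replaces this exact iteration when $Q$ is restricted to a function class.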

Then the projected neural Q-learning algorithm can be written as follows:

$$\boldsymbol{\theta}^{t+1}=\Pi_{S_{\omega}}\Big(\boldsymbol{\theta}^{t}-\eta_{t}\boldsymbol{g}\left(\boldsymbol{\theta}^{t}\right)\Big),\quad\text{with}\quad\boldsymbol{g}(\boldsymbol{\theta}^{t})=\Delta\left(s_{t},a_{t},s_{t+1};\boldsymbol{\theta}^{t}\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\phi(s_{t},a_{t});\boldsymbol{\theta}^{t}) \tag{27}$$

where

$$\Delta\left(s,a,s';\boldsymbol{\theta}^{t}\right)=Q(\phi(s,a);\boldsymbol{\theta}^{t})-\Big(r(s,a)+\gamma\max_{b\in\mathcal{A}}Q\left(\phi\left(s',b\right);\boldsymbol{\theta}^{t}\right)\Big). \tag{28}$$

The algorithm details can be described by Algorithm 2 as follows.

Algorithm 2 Neural Q-Learning with Markovian Sampling
  Input: A learning policy $\pi$, a discount factor $\gamma\in(0,1)$, a sequence of learning rates $\{\eta_{t}\}_{t\geq 0}$, a maximum iteration number $T$, a projection radius $\omega>0$, a Q network with architecture (4).
  Initialization: Generate each entry of $\boldsymbol{W}_{l}^{0}$ independently from $\mathcal{N}(0,1)$, for $l=1,2,\cdots,L$, and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$.
  for $t=0,1,\cdots,T-1$ do
     Sample $(s_{t},a_{t},r_{t},s_{t+1})$ from the learning policy $\pi$ with $a_{t}\sim\pi(\cdot|s_{t})$.
     Compute the TD error $\Delta_{t}$ by (28).
     Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (27).
  end for
  Output: $\boldsymbol{\theta}^{T}$.
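The loop in Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative sketch under several assumptions that are not taken from the paper: a two-layer ReLU network stands in for architecture (4), $S_{\omega}$ is taken to be a Frobenius ball of radius $\omega$ around the initial weights, the learning policy is uniform, the toy MDP is random, and `nu_lambda0` is a placeholder for the product $\nu\lambda_{0}$ in the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, m = 4, 2, 6, 64            # toy sizes (assumptions)
gamma, omega, T = 0.9, 5.0, 200
nu_lambda0 = 1.0                    # placeholder for nu * lambda_0 (assumption)

# Hypothetical toy MDP with unit-norm features phi(s, a).
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
phi = rng.normal(size=(S, A, d))
phi /= np.linalg.norm(phi, axis=2, keepdims=True)

# Two-layer ReLU network Q(x; W) = b^T relu(W x) / sqrt(m); b stays fixed.
W0 = rng.normal(size=(m, d))              # entries from N(0, 1)
b = rng.choice([-1.0, 1.0], size=m)       # entries from Unif{-1, +1}

def q_value(x, W):
    return b @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def q_grad(x, W):
    # Gradient of Q with respect to W (same shape as W).
    return (b * (W @ x > 0))[:, None] * x[None, :] / np.sqrt(m)

def project(W):
    # Projection onto the Frobenius ball of radius omega centred at W0.
    diff = W - W0
    norm = np.linalg.norm(diff)
    return W if norm <= omega else W0 + diff * (omega / norm)

W, s = W0.copy(), 0
for t in range(T):
    a = rng.integers(A)                       # uniform learning policy
    s_next = rng.choice(S, p=P[s, a])         # single Markovian trajectory
    target = R[s, a] + gamma * max(q_value(phi[s_next, c], W) for c in range(A))
    delta = q_value(phi[s, a], W) - target    # TD error, as in (28)
    eta = 1.0 / (2.0 * nu_lambda0 * (t + 1))  # decaying step size
    W = project(W - eta * delta * q_grad(phi[s, a], W))   # step (27)
    s = s_next
```

Note that the next state of one step is the current state of the following step, which is what distinguishes Markovian sampling from drawing i.i.d. tuples from the stationary distribution.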

B.2 Global Convergence

Similar to Section 3, we define the function class $\mathcal{F}_{\omega,m}$ as the collection of all local linearizations of $Q(\boldsymbol{x};\boldsymbol{\theta})$ at the initial point $\boldsymbol{\theta}^{0}$:

$$\mathcal{F}_{\omega,m}:=\left\{\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=Q(\boldsymbol{x};\boldsymbol{\theta}^{0})+\left\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right\rangle,\ \boldsymbol{\theta}\in S_{\omega}\right\}.$$
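The local linearization that defines $\mathcal{F}_{\omega,m}$ can be illustrated numerically. The sketch below again assumes a two-layer ReLU network as a stand-in for the paper's $L$-layer architecture; the names `q_value`, `q_grad` and `q_lin` are hypothetical. It shows that $\widehat{Q}$ agrees with $Q$ at $\boldsymbol{\theta}^{0}$ and is affine in $\boldsymbol{\theta}$, with a gradient that is frozen at initialization.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 32                               # toy sizes (assumptions)
W0 = rng.normal(size=(m, d))               # initial point theta^0
b = rng.choice([-1.0, 1.0], size=m)

def q_value(x, W):
    # Two-layer ReLU network, used here only as an illustrative stand-in.
    return b @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def q_grad(x, W):
    # Gradient of q_value with respect to W.
    return (b * (W @ x > 0))[:, None] * x[None, :] / np.sqrt(m)

def q_lin(x, W):
    """Local linearization: Q(x; W0) + <grad Q(x; W0), W - W0>."""
    return q_value(x, W0) + np.sum(q_grad(x, W0) * (W - W0))

x = rng.normal(size=d)
x /= np.linalg.norm(x)                     # unit-norm feature vector
W1 = W0 + 0.1 * rng.normal(size=(m, d))
W2 = W0 + 0.1 * rng.normal(size=(m, d))

gap_at_init = abs(q_lin(x, W0) - q_value(x, W0))
midpoint_gap = abs(q_lin(x, 0.5 * (W1 + W2))
                   - 0.5 * (q_lin(x, W1) + q_lin(x, W2)))
```

Because `q_lin` is affine in `W`, its gradient is `q_grad(x, W0)` regardless of `W`, which is why the linearized quantities in the analysis are paired with $\nabla_{\boldsymbol{\theta}}Q(\cdot;\boldsymbol{\theta}^{0})$ rather than $\nabla_{\boldsymbol{\theta}}Q(\cdot;\boldsymbol{\theta}^{t})$.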

Let $\widehat{Q}(\cdot\,;\boldsymbol{\theta}^{*})\in\mathcal{F}_{\omega,m}$, and let $\widehat{\Delta}\left(s,a,s';\boldsymbol{\theta}\right)$ have the same structure as $\Delta\left(s,a,s';\boldsymbol{\theta}\right)$ except that the function $Q(\cdot;\boldsymbol{\theta})$ is replaced by $\widehat{Q}(\cdot;\boldsymbol{\theta})$. The stationary point $\boldsymbol{\theta}^{*}$ satisfies $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})=\Pi_{\mathcal{F}_{\omega,m}}\mathcal{T}\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})$ for neural Q-learning. We redefine $\Xi_{\beta}$ by replacing the Bellman operator $\mathcal{T}^{\pi}$ in Section 3 with the Bellman optimality operator $\mathcal{T}$.
A point $\boldsymbol{\theta}^{*}$ belongs to $\Xi_{\omega}$ if and only if

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\widehat{\Delta}\left(s,a,s';\boldsymbol{\theta}^{*}\right)\big\langle\nabla_{\boldsymbol{\theta}}\widehat{Q}\left(\phi(s,a);\boldsymbol{\theta}^{*}\right),\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\big\rangle\right]\geq 0.$$

The maximum operator introduced by the Bellman optimality operator significantly complicates the analysis. We retain the definition of $\Sigma_{\pi}$ in (12) and define $\Sigma^{*}_{\pi}(\boldsymbol{\theta})$ as follows:

$$\mathbb{E}_{\mu,\pi}\left[\nabla_{\boldsymbol{\theta}}Q(\phi(s,a^{\boldsymbol{\theta}}_{\max});\boldsymbol{\theta}^{0})\nabla_{\boldsymbol{\theta}}Q(\phi(s,a^{\boldsymbol{\theta}}_{\max});\boldsymbol{\theta}^{0})^{\top}\right], \tag{29}$$

where $a^{\boldsymbol{\theta}}_{\max}=\arg\max_{a\in\mathcal{A}}\left|\left\langle\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\right\rangle\right|$. To facilitate the analysis of neural Q-learning, we further assume the following regularity condition introduced by Xu & Gu (2020).

Assumption B.1.

There exists $\nu\in(0,1)$ such that $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta})\succeq 0$ for any $\boldsymbol{\theta}^{0}$ and $\boldsymbol{\theta}\in S_{\omega}$.

The original version of this assumption comes from Xu & Gu (2020), which requires a strictly positive definite condition: $(1-\nu)^{2}\Sigma_{\pi}-\gamma^{2}\Sigma_{\pi}^{*}(\boldsymbol{\theta})\succ 0$. Under this additional assumption, Xu & Gu (2020) obtained an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for neural Q-learning. A similar complexity result was also derived in (Cai et al., 2023) under a similar regularity condition on the learning policy $\pi$. Here, we relax it to positive semi-definiteness ($\succeq 0$) and provide a convergence result for neural Q-learning; see Theorem B.2.
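For a finite toy problem, the condition in Assumption B.1 can be checked numerically by forming both matrices and inspecting the smallest eigenvalue of their weighted difference. The sketch below is illustrative only: it assumes that $\Sigma_{\pi}$ in (12) has the form $\mathbb{E}_{\mu,\pi}[\nabla_{\boldsymbol{\theta}}Q\,\nabla_{\boldsymbol{\theta}}Q^{\top}]$ with gradients taken at $\boldsymbol{\theta}^{0}$, and the array `G` of per-pair gradients, the distributions `mu` and `pi`, and the function name are all hypothetical.

```python
import numpy as np

def check_assumption_b1(G, mu, pi, theta, gamma, nu, tol=1e-10):
    """Check (1 - nu)^2 * Sigma_pi - gamma^2 * Sigma_pi^*(theta) >= 0 (PSD).

    G:  (S, A, p) gradients of Q at theta^0 for every state-action pair.
    mu: (S,) state distribution; pi: (S, A) learning policy.
    """
    S = G.shape[0]
    weights = mu[:, None] * pi                           # joint law of (s, a)
    sigma_pi = np.einsum("sa,sai,saj->ij", weights, G, G)    # assumed form of (12)
    a_max = np.abs(G @ theta).argmax(axis=1)             # a_max^theta per state
    G_max = G[np.arange(S), a_max]                       # shape (S, p)
    sigma_star = np.einsum("s,si,sj->ij", mu, G_max, G_max)  # as in (29)
    M = (1.0 - nu) ** 2 * sigma_pi - gamma ** 2 * sigma_star
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

# Hypothetical single-action example: Sigma_pi^* equals Sigma_pi, so the
# condition reduces to (1 - nu)^2 >= gamma^2.
rng = np.random.default_rng(0)
S, A, p = 5, 1, 3
G = rng.normal(size=(S, A, p))
mu = np.full(S, 1.0 / S)
pi = np.ones((S, A)) / A
theta = rng.normal(size=p)

holds_small_gamma = check_assumption_b1(G, mu, pi, theta, gamma=0.5, nu=0.3)
holds_large_gamma = check_assumption_b1(G, mu, pi, theta, gamma=0.9, nu=0.3)
```

In the single-action case the check passes for $\gamma=0.5$ and fails for $\gamma=0.9$ when $\nu=0.3$, since $(0.7)^{2}=0.49$ lies between $0.25$ and $0.81$. With several actions the outcome also depends on how much probability the learning policy places on the maximizing action.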

Theorem B.2.

Suppose Assumptions 3.1, 3.2 and B.1 hold. We set $\omega=\widetilde{C}_{1}$ and the learning rate $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$. If the feature map satisfies $\|\phi(s,a)\|=1$ for each state-action pair $(s,a)$ and the network width satisfies $m\geq m^{*}$, then the output $\boldsymbol{\theta}^{T}$ of the neural Q-learning algorithm (i.e., (27)) satisfies

$$\mathbb{E}\left[\big(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{T})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{*})\big)^{2}\mid\boldsymbol{\theta}^{0}\right]\leq\frac{\widetilde{C}_{3}(\log T+1)}{\nu^{2}\lambda_{0}^{2}T}+\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\cdot\sqrt{\log(T/\delta)}+\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{\nu^{2}\lambda_{0}^{2}T},$$

with probability at least $1-2\delta-2L\exp\!\big(-\widetilde{C}_{2}m\big)$, where $\tau^{*}$ is the mixing time of the Markov chain in Assumption 3.2, and $\widetilde{C}_{1},\cdots,\widetilde{C}_{5}>0$ are universal constants.
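As a rough reading of this bound (a sketch that treats $\nu$, $\lambda_{0}$, $\tau^{*}$, $\delta$ and the constants as fixed, and is not part of the formal statement), the first and third terms decay like $\log^{2}T/T$, while the second term is controlled by the width $m$. Asking each group to be at most $\epsilon/2$ gives:

```latex
\begin{align*}
\frac{\widetilde{C}_{3}(\log T+1)}{\nu^{2}\lambda_{0}^{2}T}
 +\frac{\widetilde{C}_{5}\tau^{*}\left(\log(T/\delta)+1\right)\log T}{\nu^{2}\lambda_{0}^{2}T}
 \leq \frac{\epsilon}{2}
 \quad &\text{is achievable with}\quad T=\tilde{\mathcal{O}}(\epsilon^{-1}),\\
\frac{\widetilde{C}_{4}m^{-1/2}}{\nu\lambda_{0}}\sqrt{\log(T/\delta)}
 \leq \frac{\epsilon}{2}
 \quad &\Longleftrightarrow\quad
 m\geq\frac{4\widetilde{C}_{4}^{2}\log(T/\delta)}{\nu^{2}\lambda_{0}^{2}\epsilon^{2}}.
\end{align*}
```

Under these simplifications, the number of samples $T$ needed for an $\epsilon$-accurate linearized Q-function scales as $\tilde{\mathcal{O}}(\epsilon^{-1})$, provided the width also satisfies $m\geq m^{*}$.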

Proof.

With a slight abuse of notation, we redefine

$$\begin{aligned}
\mathbf{g}(\boldsymbol{\theta}^{t}) &= \Delta(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}),\quad \bar{\mathbf{g}}(\boldsymbol{\theta}^{t})=\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right] \\
\mathbf{m}(\boldsymbol{\theta}^{t}) &= \widehat{\Delta}(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0}),\quad \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})=\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],
\end{aligned}$$

where

$$\begin{aligned}
\Delta(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t}) &= Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\max_{b\in\mathcal{A}}Q(\phi(s_{t+1},b);\boldsymbol{\theta}^{t})\right), \\
\widehat{\Delta}(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t}) &= \widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\max_{b\in\mathcal{A}}\widehat{Q}(\phi(s_{t+1},b);\boldsymbol{\theta}^{t})\right).
\end{aligned}$$

Let $\Delta_{t}=\Delta(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t})$ and $\widehat{\Delta}_{t}=\widehat{\Delta}(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t})$. An analogue of (18) can be derived similarly for neural Q-learning. To estimate the terms $\mathbf{I}_{1}\sim\mathbf{I}_{4}$, we can apply Lemmas D.5 and A.1. However, because neural Q-learning uses the Bellman optimality operator, some modifications based on Lemma A.3 are required.

Lemma B.3.

Under Assumption B.1, we have that

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right\rangle\mid\boldsymbol{\theta}^{0}\right]\geq\nu\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}. \tag{30}$$
Proof.

To simplify the notation, we write $\mathbb{E}[\cdot]$ for $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ in the proof of this lemma. Define $\widehat{Q}^{\#}(s;\boldsymbol{\theta}):=\max_{a\in\mathcal{A}}\widehat{Q}(\phi(s,a);\boldsymbol{\theta})$. Then we have

$$\begin{aligned}
&\mathbb{E}\left[\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right\rangle\mid\boldsymbol{\theta}^{0}\right]\\
={}&\mathbb{E}\left[\left(\widehat{\Delta}(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t})-\widehat{\Delta}(s_{t},a_{t},s'_{t},\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right]\\
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right]\\
&-\gamma\,\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\left\langle\nabla Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\ \boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right\rangle\mid\boldsymbol{\theta}^{0}\right]\\
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\right].
\end{aligned}$$

To bound the second term on the right-hand side of the last equality above, we consider

$$\begin{aligned}
\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]
\leq{}&\mathbb{E}\left[\max_{a\in\mathcal{A}}\left|\widehat{Q}(s,a;\boldsymbol{\theta}^{t})-\widehat{Q}(s,a;\boldsymbol{\theta}^{t}_{*})\right|^{2}\right]\\
\overset{(i)}{=}{}&\mathbb{E}\left[\max_{a\in\mathcal{A}}\left|\widehat{Q}(s,a;\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})\right|^{2}\right]\\
={}&(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}^{*}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})\\
\overset{(ii)}{\leq}{}&\frac{(1-\nu)^{2}}{\gamma^{2}}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}\cdot(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})\\
\overset{(iii)}{=}{}&\frac{(1-\nu)^{2}}{\gamma^{2}}\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right],
\end{aligned} \tag{32}$$

where (i) and (iii) follow from the linearity of $\widehat{Q}(\boldsymbol{x};\cdot)$, and (ii) follows from Assumption B.1. Therefore,

$$\begin{aligned}
&\mathbb{E}\left[\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right\rangle\mid\boldsymbol{\theta}^{0}\right]\\
={}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\,\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)\cdot\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)\right]\\
\geq{}&\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]-\gamma\sqrt{\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}\cdot\sqrt{\mathbb{E}\left[\left(\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]}\\
\overset{(i)}{\geq}{}&\left(1-\gamma\cdot\frac{1-\nu}{\gamma}\right)\mathbb{E}\left[\left(\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta}^{t}_{*})\right)^{2}\right]\\
={}&\nu\cdot(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})^{\top}\Sigma_{\pi}(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*})\\
\overset{(ii)}{\geq}{}&\nu\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2},
\end{aligned}$$

where the unlabeled inequality is the Cauchy-Schwarz inequality, (i) follows from (32), and (ii) follows from $\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}=\boldsymbol{\theta}^{t}_{\parallel}-\boldsymbol{\theta}^{*}\in\mathcal{R}(\Sigma_{\pi})$ and Lemma A.2, which provides $\lambda_{0}$-strong convexity. ∎
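Remark. Two steps of the proof above can be spelled out as follows. This is only a sketch: it assumes that $\mathcal{A}$ is finite and that $0<\nu\leq 1$, and the symbols $f$, $g$, $X$, $Y$ are shorthand introduced here for illustration, not notation from the paper. The first display is the elementary fact behind the first inequality of (32); the second shows how Cauchy-Schwarz and (32) combine to produce the factor $\nu$.

```latex
% Step 1: the max operator is 1-Lipschitz in the sup norm.
% Here f(a) = \widehat{Q}(s,a;\theta^t) and g(a) = \widehat{Q}(s,a;\theta^t_*).
\begin{align*}
\Bigl|\max_{a\in\mathcal{A}} f(a)-\max_{a\in\mathcal{A}} g(a)\Bigr|
  \;\leq\; \max_{a\in\mathcal{A}}\bigl|f(a)-g(a)\bigr| .
\end{align*}

% Step 2: Cauchy-Schwarz, then (32).
% Here X = \widehat{Q}(x;\theta^t) - \widehat{Q}(x;\theta^t_*)
% and  Y = \widehat{Q}^{\#}(s;\theta^t) - \widehat{Q}^{\#}(s;\theta^t_*).
\begin{align*}
\gamma\,\mathbb{E}[XY]
  &\leq \gamma\sqrt{\mathbb{E}[X^{2}]}\,\sqrt{\mathbb{E}[Y^{2}]}
   \leq \gamma\cdot\frac{1-\nu}{\gamma}\,\mathbb{E}[X^{2}]
   = (1-\nu)\,\mathbb{E}[X^{2}],\\
\mathbb{E}[X^{2}]-\gamma\,\mathbb{E}[XY]
  &\geq \nu\,\mathbb{E}[X^{2}] .
\end{align*}
```

Squaring Step 1 and taking expectations gives the first line of (32); Step 2 is the passage from the Cauchy-Schwarz bound to inequality (i) in the final chain.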

Now given $\boldsymbol{\theta}^{0}$, we can deduce that

$$\begin{aligned}
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\|\boldsymbol{\theta}^{t+1}-\boldsymbol{\theta}^{t+1}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]
\leq{}&(1-2\eta_{t}\nu\lambda_{0})\,\mathbb{E}\left[\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}\mid\boldsymbol{\theta}^{0}\right]+C_{1}\eta_{t}^{2}\\
&+2\eta_{t}C_{2}m^{-1/2}\sqrt{\log(T/\delta)}+2\eta_{t}C_{3}\left(\log(T/\delta)+C_{1}^{2}\right)\tau^{*}\eta_{\max\{0,\,t-\tau^{*}\}},
\end{aligned}$$

with probability at least $1-2\delta-2L\exp(-C_{4}m)$, where $\{C_{i}>0\}_{i=1,\ldots,4}$ are universal constants in this subsection. Choosing $\eta_{t}=\frac{1}{2\nu\lambda_{0}(t+1)}$ yields results similar to those in (A.2). Consequently, the techniques outlined in Section A.2 complete the remaining proof of Theorem B.2.
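As a rough numerical illustration of why this step size leads to a rate of order $\log(T)/T$, the sketch below iterates a simplified version of the recursion that keeps only the contraction term and the $C_{1}\eta_{t}^{2}$ term. The network-width term and the mixing-time term are dropped, and the values of `nu`, `lam0`, `C1`, and `e0` are arbitrary placeholders, not constants from the paper. Under these simplifying assumptions the contraction factor equals $t/(t+1)$, so the iterate after $T$ steps coincides with $c\,H_{T}/T$, where $c=C_{1}/(2\nu\lambda_{0})^{2}$ and $H_{T}$ is the $T$-th harmonic number.

```python
def simplified_recursion(T, nu, lam0, C1, e0):
    """Iterate e_{t+1} = (1 - 2*eta_t*nu*lam0) * e_t + C1 * eta_t**2
    with eta_t = 1 / (2*nu*lam0*(t+1)).

    Simplified stand-in for the full recursion: the m^{-1/2} and
    mixing-time terms are omitted (an assumption made for illustration).
    """
    e = e0
    for t in range(T):
        eta = 1.0 / (2.0 * nu * lam0 * (t + 1))
        e = (1.0 - 2.0 * eta * nu * lam0) * e + C1 * eta ** 2
    return e


def closed_form(T, nu, lam0, C1):
    """Closed form c * H_T / T of the simplified recursion, c = C1 / (2*nu*lam0)**2."""
    c = C1 / (2.0 * nu * lam0) ** 2
    harmonic = sum(1.0 / k for k in range(1, T + 1))
    return c * harmonic / T


if __name__ == "__main__":
    # Placeholder constants; the initial error e0 is forgotten after the first step.
    for T in (10, 100, 1000):
        print(T, simplified_recursion(T, 0.1, 0.5, 2.0, 5.0), closed_form(T, 0.1, 0.5, 2.0))
```

The initial error drops out because the contraction factor is zero at $t=0$; in the full recursion the two omitted terms add further contributions that this sketch does not model.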

Appendix C Details of Section 4

We formally describe the minimax neural Q-learning method in Algorithm 3.

Algorithm 3 Minimax Neural Q-Learning with Gaussian Initialization
  Input: A learning policy pair $\pi=(\pi^{1},\pi^{2})$, a discount factor $\gamma\in(0,1)$, a sequence of learning rates $\{\eta_{t}\}_{t\geq 0}$, a maximum iteration number $T$, a projection radius $\omega>0$, a Q network with architecture (4).
  Initialization: Generate each entry of $\boldsymbol{W}_{l}^{0}$ independently from $\mathcal{N}(0,1)$, for $l=1,2,\cdots,L$, and each entry of $\boldsymbol{b}$ independently from $\text{Unif}\{-1,+1\}$.
  for $t=0,1,\cdots,T-1$ do
     Sample $(s_{t},a^{1}_{t},a^{2}_{t},r_{t},s_{t+1})$ from the learning policy pair $\pi$ with $a^{1}_{t}\sim\pi^{1}(\cdot|s_{t})$, $a^{2}_{t}\sim\pi^{2}(\cdot|s_{t})$.
     Compute the TD error $\Delta_{t}$ by (17).
     Update $\boldsymbol{\theta}^{t+1}$ by the projected stochastic semi-gradient step (16).
  end for
  Output: $\boldsymbol{\theta}^{T}$.
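One iteration of the loop in Algorithm 3 can be sketched in code as follows. Since equations (16) and (17) and the network architecture (4) are not reproduced in this appendix, the sketch makes several assumptions: the Q function is replaced by a linear-in-parameters stand-in $\phi^{\top}\boldsymbol{\theta}$, the TD error is taken to be $Q(s_t,a^1_t,a^2_t)-r_t-\gamma\max_{a^1}\min_{a^2}Q(s_{t+1},a^1,a^2)$, and the projection is onto a Euclidean ball of radius $\omega$ around $\boldsymbol{\theta}^{0}$. The function names `maxmin_value`, `project_ball`, and `minimax_td_step` are hypothetical.

```python
import numpy as np


def maxmin_value(q_table):
    """Max over player-1 actions (rows) of the min over player-2 actions (columns)."""
    return q_table.min(axis=1).max()


def project_ball(theta, center, radius):
    """Euclidean projection of theta onto the ball {x : ||x - center|| <= radius}."""
    diff = theta - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return theta
    return center + diff * (radius / norm)


def minimax_td_step(theta, theta0, feats, next_feats, reward, gamma, eta, omega):
    """One projected semi-gradient step with a linear stand-in Q = feats @ theta.

    feats:      phi(s_t, a1_t, a2_t), shape (d,)
    next_feats: phi(s_{t+1}, a1, a2) for all action pairs, shape (A1, A2, d)
    Assumed forms (not taken from the paper's equations (16)-(17)):
      td_error = Q(s_t, a1_t, a2_t) - (reward + gamma * max_a1 min_a2 Q(s_{t+1}, a1, a2))
      theta_new = Proj(theta - eta * td_error * grad_theta Q)
    """
    q_sa = feats @ theta
    q_next = next_feats @ theta  # shape (A1, A2)
    td_error = q_sa - (reward + gamma * maxmin_value(q_next))
    grad = feats  # gradient of the linear stand-in with respect to theta
    theta_new = project_ball(theta - eta * td_error * grad, theta0, omega)
    return theta_new, td_error
```

The target is treated as a constant when differentiating, which is what makes this a semi-gradient step; with a neural Q network, `grad` would be the gradient of the network output with respect to its parameters.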

C.1 Proof of Theorem 4.2

The proof of Theorem 4.2 is similar to those in Sections A.2 and B.2. However, because the Bellman operator is different, Lemma A.3 and Lemma B.3 need to be modified; the modified statement is given in Lemma C.1.

Lemma C.1.

Under Assumption 4.1, we have that

$$\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left\langle\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*},\ \bar{\mathbf{m}}(\boldsymbol{\theta}^{t})-\bar{\mathbf{m}}(\boldsymbol{\theta}^{t}_{*})\right\rangle\mid\boldsymbol{\theta}^{0}\right]\geq\nu\lambda_{0}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\|^{2}. \tag{33}$$
Proof.

To simplify the notation, we write $\mathbb{E}[\cdot]$ for $\mathbb{E}_{\mu,\pi,\mathbb{P}}[\cdot]$ in the proof of this lemma. Define $\widehat{Q}^{\#}(s;\boldsymbol{\theta}):=\max_{a^{1}\in\mathcal{A}}\min_{a^{2}\in\mathcal{A}}\widehat{Q}(\phi(s,a^{1},a^{2});\boldsymbol{\theta})$. Define the sets $\mathcal{S}_{+}=\{s:\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})>\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\}$ and $\mathcal{S}_{-}=\mathcal{S}\setminus\mathcal{S}_{+}$.
For each $s\in\mathcal{S}_{+}$,

Q^#(s;𝜽t)Q^#(s;𝜽t)superscript^𝑄#𝑠superscript𝜽𝑡superscript^𝑄#𝑠subscriptsuperscript𝜽𝑡\displaystyle\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;% \boldsymbol{\theta}^{t}_{*})over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ( italic_s ; bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ( italic_s ; bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) =\displaystyle== <𝜽Q(ϕ(s,a𝜽t1,a𝜽t2);𝜽0),𝜽t>𝜽Q(ϕ(s,a𝜽t1,a𝜽t2);𝜽0),𝜽tformulae-sequenceabsentsubscript𝜽𝑄italic-ϕ𝑠subscriptsuperscript𝑎1superscript𝜽𝑡subscriptsuperscript𝑎2superscript𝜽𝑡superscript𝜽0superscript𝜽𝑡subscript𝜽𝑄italic-ϕ𝑠subscriptsuperscript𝑎1subscriptsuperscript𝜽𝑡subscriptsuperscript𝑎2subscriptsuperscript𝜽𝑡superscript𝜽0subscriptsuperscript𝜽𝑡\displaystyle\Big{<}\nabla_{\boldsymbol{\theta}}Q\left(\phi(s,a^{1}_{% \boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^% {0}\right),\boldsymbol{\theta}^{t}\Big{>}-\left<\nabla_{\boldsymbol{\theta}}Q% \left(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t% }_{*}});\boldsymbol{\theta}^{0}\right),\boldsymbol{\theta}^{t}_{*}\right>< ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_Q ( italic_ϕ ( italic_s , italic_a start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ; bold_italic_θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) , bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT > - ⟨ ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_Q ( italic_ϕ ( italic_s , italic_a start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_θ 
start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ; bold_italic_θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) , bold_italic_θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ⟩
$$
\begin{aligned}
={}& \Big(\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}\big\rangle - \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}\big\rangle\Big) \\
&+ \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle \\
&+ \Big(\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}_{*}\big\rangle - \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}_{*}\big\rangle\Big) \\
\leq{}& \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle
\end{aligned}
$$

and

$$
\begin{aligned}
\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*}) ={}& \Big(\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}\big\rangle - \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}\big\rangle\Big) \\
&+ \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle \\
&+ \Big(\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}_{*}\big\rangle - \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}_{*}\big\rangle\Big) \\
\geq{}& \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle.
\end{aligned}
$$

In the same way, for each $s\in\mathcal{S}_{-}$,

$$
\begin{aligned}
\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle \leq{}& \widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*}) \\
\leq{}& \big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle.
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
\left|\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right| \leq \max\Big\{ & \left|\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}},a^{2}_{\boldsymbol{\theta}^{t}_{*}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle\right|, \\
& \left|\big\langle \nabla_{\boldsymbol{\theta}}Q\big(\phi(s,a^{1}_{\boldsymbol{\theta}^{t}_{*}},a^{2}_{\boldsymbol{\theta}^{t}});\boldsymbol{\theta}^{0}\big),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\big\rangle\right| \Big\}.
\end{aligned}
\tag{34}
$$

By Assumption 4.1, we compute

$$
\begin{aligned}
\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left|\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t})-\widehat{Q}^{\#}(s;\boldsymbol{\theta}^{t}_{*})\right|^{2}\right] \overset{(i)}{\leq}{}& \left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)^{\top}\Sigma_{\pi}^{*}(\boldsymbol{\theta}^{t},\boldsymbol{\theta}^{t}_{*})\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right) \\
\overset{(ii)}{\leq}{}& \frac{(1-\nu)^{2}}{\gamma^{2}}\cdot\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right)^{\top}\Sigma_{\pi}\left(\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t}_{*}\right) \\
={}& \frac{(1-\nu)^{2}}{\gamma^{2}}\cdot\mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\left|\widehat{Q}(s,a^{1},a^{2};\boldsymbol{\theta}^{t})-\widehat{Q}(s,a^{1},a^{2};\boldsymbol{\theta}^{t}_{*})\right|^{2}\right],
\end{aligned}
$$

where (i) is due to (34), and (ii) is due to Assumption 4.1. Similar to Lemma B.3, we can also obtain (B.2) in this lemma. By substituting (C.1) into (B.2), the proof can be completed.
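As a loose numerical companion to (34), the following sketch checks a coarser and assumption-free relative of that bound: for any finite action sets, the max-min value of the linearized function moves by no more than the largest inner product $|\langle \nabla_{\boldsymbol{\theta}}Q(\cdot;\boldsymbol{\theta}^{0}), \boldsymbol{\theta}-\boldsymbol{\theta}'\rangle|$ over all action pairs. It is not the argument of the lemma, which uses the specific action pairs and Assumption 4.1. The random gradient table `grads`, the dimensions, and the helper names are all hypothetical.

```python
import random


def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def maxmin_value(grads, theta):
    """max over a1 of min over a2 of <grads[a1][a2], theta>."""
    return max(min(inner(g, theta) for g in row) for row in grads)


def maxmin_gap_and_bound(num_actions=4, dim=6, seed=0):
    """Return (|maxmin(theta) - maxmin(theta_star)|, max_{a1,a2} |<g, theta - theta_star>|)."""
    rng = random.Random(seed)
    # Hypothetical stand-in for the gradients at theta^0, one per action pair.
    grads = [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_actions)]
        for _ in range(num_actions)
    ]
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    theta_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    diff = [a - b for a, b in zip(theta, theta_star)]
    gap = abs(maxmin_value(grads, theta) - maxmin_value(grads, theta_star))
    bound = max(abs(inner(g, diff)) for row in grads for g in row)
    return gap, bound
```

The inequality `gap <= bound` holds for every draw because the max-min operator is 1-Lipschitz in the sup norm; the lemma sharpens this by restricting to two particular action pairs.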

Now we are ready to prove Theorem 4.2. With a slight abuse of notation, we redefine

$$
\begin{aligned}
\mathbf{g}(\boldsymbol{\theta}^{t}) ={}& \Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t}), \quad \bar{\mathbf{g}}(\boldsymbol{\theta}^{t}) = \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{g}(\boldsymbol{\theta}^{t})\right], \\
\mathbf{m}(\boldsymbol{\theta}^{t}) ={}& \widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0}), \quad \bar{\mathbf{m}}(\boldsymbol{\theta}^{t}) = \mathbb{E}_{\mu,\pi,\mathbb{P}}\left[\mathbf{m}(\boldsymbol{\theta}^{t})\right],
\end{aligned}
$$

where

$$
\begin{aligned}
\Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t}) ={}& Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\max_{b^{1}\in\mathcal{A}}\min_{b^{2}\in\mathcal{A}}Q(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})\right), \\
\widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t}) ={}& \widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\max_{b^{1}\in\mathcal{A}}\min_{b^{2}\in\mathcal{A}}\widehat{Q}(\phi(s_{t+1},b^{1},b^{2});\boldsymbol{\theta}^{t})\right).
\end{aligned}
$$
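The linearized residual $\widehat{\Delta}$ and the direction $\mathbf{m}$ defined above can be sketched on a toy instance. This is only a minimal illustration, assuming finite action sets and taking $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})=\langle\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}\rangle$, as in the inner products used in the proof of Lemma C.1. The gradient vectors are supplied directly as made-up numbers rather than computed from a network, and all function names are hypothetical.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def q_linearized(grad0, theta):
    """Linearized value <grad_theta Q(x; theta^0), theta> for one feature x."""
    return inner(grad0, theta)


def minimax_td_error(grad0_t, reward, next_grads0, theta, gamma):
    """Delta-hat: current value minus (reward + gamma * max_{b1} min_{b2} next value).

    next_grads0[b1][b2] is the (hypothetical) gradient at theta^0 for the
    feature phi(s_{t+1}, b1, b2).
    """
    next_value = max(
        min(q_linearized(g, theta) for g in row) for row in next_grads0
    )
    return q_linearized(grad0_t, theta) - (reward + gamma * next_value)


def semi_gradient(grad0_t, reward, next_grads0, theta, gamma):
    """m(theta) = Delta-hat * grad_theta Q(x_t; theta^0)."""
    delta_hat = minimax_td_error(grad0_t, reward, next_grads0, theta, gamma)
    return [delta_hat * g for g in grad0_t]


# Toy instance: two actions per player, parameter dimension 2.
theta = [2.0, 1.0]
grad0_t = [1.0, 0.0]
next_grads0 = [
    [[1.0, 1.0], [0.0, 1.0]],  # b1 = 0: values 3 and 1, so the min is 1
    [[2.0, 0.0], [1.0, 2.0]],  # b1 = 1: values 4 and 4, so the min is 4
]
delta_hat = minimax_td_error(grad0_t, 0.5, next_grads0, theta, gamma=0.9)
m_vec = semi_gradient(grad0_t, 0.5, next_grads0, theta, gamma=0.9)
```

In this instance the max-min next value is 4, so $\widehat{\Delta} = 2 - (0.5 + 0.9\cdot 4) = -2.1$ up to floating-point rounding. Replacing the fixed gradients at $\boldsymbol{\theta}^{0}$ by the network value and gradient at $\boldsymbol{\theta}^{t}$ would give the analogous sketch for $\Delta$ and $\mathbf{g}$.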

Let $\Delta_{t}=\Delta(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})$ and $\widehat{\Delta}_{t}=\widehat{\Delta}(s_{t},a^{1}_{t},a^{2}_{t},s^{\prime}_{t},\boldsymbol{\theta}^{t})$. After redefining the corresponding notation, we can similarly derive (18) and adopt the associated lemmas. Due to the introduction of the additional Assumption 4.1, we provide Lemma C.1, ensuring that terms $\mathbf{I}_{1}\sim\mathbf{I}_{4}$ can be estimated. The remainder of the proof is entirely analogous to Sections A.2 and B.2. Thus, we conclude the proof.

Appendix D Supporting Lemmas for Multi-layer Neural Network

Recalling the definition of the parameterized Q-function, we present the following lemmas related to neural network functions, which play a crucial role in illustrating the main results of our paper. The constants $\left\{\tau_{i}>0\right\}_{i=1,\ldots,10}$ mentioned below are universal constants.

Lemma D.1.

For any $t\in\{1,2,\cdots,T\}$, we have

$$
\|\boldsymbol{\theta}^{t}\|\leq\tau_{1}\sqrt{m},\quad \text{w.p. } 1-L\exp(-\tau_{2}m).
$$
Proof.

By Lemma G.2 in (Du et al., 2019), $\|\boldsymbol{\theta}^{0}\|\leq\mathcal{O}\left(\sqrt{m}\right)$ with probability at least $1-L\exp(-\tau_{2}m)$. Therefore,

$$
\begin{aligned}
\|\boldsymbol{\theta}^{t}\|_{2} \leq{}& \|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|+\|\boldsymbol{\theta}^{0}\| \\
\leq{}& \omega+\|\boldsymbol{\theta}^{0}\| \\
\leq{}& \mathcal{O}\left(\sqrt{m}\right).
\end{aligned}
$$

Lemma D.2.

For any $l\in\{1,2,\cdots,L\}$, we have

$$
\|\boldsymbol{x}^{(l)}\|\leq\tau_{3}\sqrt{m},\quad\text{and}\quad\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})\right\|\leq\tau_{4},\quad \text{w.p. } 1-L\exp(-\tau_{2}m).
$$

Lemma D.2 has been proved by (Tian et al., 2022) (Lemmas A.6 to A.10).

Lemma D.3.

For any $t\in\{1,2,\cdots,T\}$ and $\boldsymbol{\theta}\in S_{\omega}$, we have

$$
\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})\right|\leq\tau_{5}\sqrt{\log(T/\delta)},\quad \text{w.p.}\ 1-\delta-L\exp{(-\tau_{2}m)}.
$$
Proof.

By Lemma D.2, we have $\frac{1}{\sqrt{m}}\|\boldsymbol{x}^{(L)}\|\leq\tau_{3}$ w.p. $1-L\exp{(-\tau_{2}m)}$. Recall the definition of the parameterized Q-function:

$$
Q(\boldsymbol{x};\boldsymbol{\theta})=\frac{1}{\sqrt{m}}\boldsymbol{b}^{\top}\boldsymbol{x}^{(L)},
$$

where each element of $\boldsymbol{b}$ is generated from a uniform distribution over $\{-1,+1\}$. For each $\boldsymbol{x}$, by Hoeffding's inequality, we have

$$
\mathbb{P}\left(\left|\left<\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)},\boldsymbol{b}\right>\right|\geq t\right)\leq 2\exp\left(-\frac{2t^{2}}{4\left\|\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)}\right\|^{2}}\right)\overset{(i)}{\leq}2\exp\left(-\frac{t^{2}}{2\tau_{3}^{2}}\right),
$$

where (i) follows from Lemma D.2. Substituting $\frac{\delta}{T}=2\exp\left(-\frac{t^{2}}{2\tau_{3}^{2}}\right)$, we get

$$
\mathbb{P}\left(\left|\left<\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)},\boldsymbol{b}\right>\right|\geq\tau_{3}\sqrt{2\log(T/\delta)}\right)\leq\frac{\delta}{T}.
$$

Now, by the union bound, if we set $\tau_{5}=\tau_{3}\sqrt{2}$, then

$$
\mathbb{P}\left(\max_{t\in[T]}\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})\right|\geq\tau_{5}\sqrt{\log(T/\delta)}\right)\leq\sum_{t=1}^{T}\mathbb{P}\left(\left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta})\right|\geq\tau_{5}\sqrt{\log(T/\delta)}\right)\leq\delta,
$$

which completes the proof. ∎
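The concentration step in this proof can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `hoeffding_check` and all default values are hypothetical, and a fixed vector `v` with $\|v\|=\tau_3$ stands in for $\frac{1}{\sqrt{m}}\boldsymbol{x}^{(L)}$. It compares the empirical tail of $\langle v,\boldsymbol{b}\rangle$ under random signs $\boldsymbol{b}$ with the Hoeffding bound $2\exp(-t^{2}/(2\tau_{3}^{2}))$.

```python
import numpy as np

def hoeffding_check(m=256, trials=5000, tau3=1.0, t=2.0, seed=0):
    """Empirical tail of <v, b> versus the bound 2*exp(-t^2 / (2*tau3^2)).

    v is a stand-in for x^(L)/sqrt(m), rescaled so that ||v|| = tau3
    (an assumption made for illustration only).
    b has i.i.d. entries uniform on {-1, +1}, as in the definition of Q.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(m)
    v *= tau3 / np.linalg.norm(v)                   # enforce ||v|| = tau3
    b = rng.choice([-1.0, 1.0], size=(trials, m))   # one sign vector per row
    q = b @ v                                       # one draw of Q per row
    empirical = float(np.mean(np.abs(q) >= t))      # observed tail frequency
    bound = float(2.0 * np.exp(-t**2 / (2.0 * tau3**2)))
    return empirical, bound

empirical, bound = hoeffding_check()
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```

The observed frequency should fall well below the bound, since Hoeffding's inequality is not tight for sums of many comparable terms. Multiplying the bound by $T$ gives the union-bound step used above.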

Lemma D.4.

Denote $\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})$ as the Hessian matrix of $Q(\boldsymbol{x};\boldsymbol{\theta})$. Then for all $\boldsymbol{x},\boldsymbol{\theta}\in S_{\omega}$, we have w.p. $1-\delta$ that

$$
\|\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta})\|_{2}\leq\tau_{6}m^{-\frac{1}{2}}
$$

and

$$
\left|Q(\boldsymbol{x};\boldsymbol{\theta})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})\right|\leq\tau_{7}m^{-\frac{1}{2}}.
$$
Proof.

The first inequality in Lemma D.4 has been proved by (Liu et al., 2020b) (Theorem 3.2), which implies that $Q(\boldsymbol{x};\boldsymbol{\theta})$ is $\mathcal{O}(m^{-\frac{1}{2}})$-smooth w.r.t. $\boldsymbol{\theta}$. Therefore,

$$
\left|Q(\boldsymbol{x};\boldsymbol{\theta})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})\right|=\left|Q(\boldsymbol{x};\boldsymbol{\theta})-Q(\boldsymbol{x};\boldsymbol{\theta}^{0})-\left<\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right>\right|=\mathcal{O}(m^{-\frac{1}{2}}).
$$
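The passage from the Hessian bound to the linearization error can be spelled out with Taylor's theorem. The derivation below is a sketch under three assumptions that are not stated in this section: $S_{\omega}$ is the ball $\{\boldsymbol{\theta}:\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\|\leq\omega\}$, $Q$ is twice continuously differentiable in $\boldsymbol{\theta}$ on that ball, and $\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})$ is the first-order expansion of $Q$ at $\boldsymbol{\theta}^{0}$. Under these assumptions, one admissible choice of constant would be $\tau_{7}=\tau_{6}\omega^{2}/2$.

```latex
\begin{align*}
\left|Q(\boldsymbol{x};\boldsymbol{\theta})-\widehat{Q}(\boldsymbol{x};\boldsymbol{\theta})\right|
&= \left|Q(\boldsymbol{x};\boldsymbol{\theta})-Q(\boldsymbol{x};\boldsymbol{\theta}^{0})
   -\left<\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}^{0}),
   \boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right>\right| \\
&= \left|\int_{0}^{1}(1-s)\,
   (\boldsymbol{\theta}-\boldsymbol{\theta}^{0})^{\top}
   \nabla^{2}_{\boldsymbol{\theta}}Q\!\left(\boldsymbol{x};
   \boldsymbol{\theta}^{0}+s(\boldsymbol{\theta}-\boldsymbol{\theta}^{0})\right)
   (\boldsymbol{\theta}-\boldsymbol{\theta}^{0})\,\mathrm{d}s\right| \\
&\leq \frac{1}{2}\,\sup_{\boldsymbol{\theta}'\in S_{\omega}}
   \left\|\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x};\boldsymbol{\theta}')\right\|_{2}
   \cdot\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{0}\right\|^{2} \\
&\leq \frac{\tau_{6}\,\omega^{2}}{2}\,m^{-\frac{1}{2}}.
\end{align*}
```

The third line uses $\int_{0}^{1}(1-s)\,\mathrm{d}s=\frac{1}{2}$ together with the fact that the segment between $\boldsymbol{\theta}^{0}$ and $\boldsymbol{\theta}$ stays inside the ball.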

Lemma D.5.

Let $\boldsymbol{\theta}\in S_{\omega}$ with the radius satisfying $\omega=\mathcal{O}(1)$. Then for all $\|\boldsymbol{x}\|_{2}=1$ and $\boldsymbol{\theta}^{t}_{*}\in S_{2\omega}$ in the neural temporal difference learning Algorithm 1, it holds that

$$
\left|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right>\right|\leq\left(\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}+2\tau_{4}\tau_{8}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|
$$

with probability at least $1-2\delta-2L\exp{(-\tau_{2}m)}$ over the randomness of the initial point, and $\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)\right\|_{2}\leq\tau_{10}\sqrt{\log(T/\delta)}$ holds with probability at least $1-\delta-2L\exp{(-\tau_{2}m)}$.

Proof.

Note that

$$
\begin{aligned}
\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|
&= \left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\| \\
&\leq \left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\left(\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right)\right\| \\
&\quad+\left\|\left(\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|,
\end{aligned}
$$

where

$$
\begin{aligned}
\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t}) &= Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right), \\
\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t}) &= \widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\left(r(s_{t},a_{t})+\gamma\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right).
\end{aligned}
$$

For the first term in the above decomposition of $\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|$, by Lemma D.3, we have w.p. $1-\delta-L\exp{(-\tau_{2}m)}$ that

$$
\left|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\right|\leq\tau_{8}\sqrt{\log(T/\delta)}.
$$

By Lemma D.4, we get that

$$
\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|\leq\|\nabla^{2}_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\|_{2}\cdot\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{0}\|\leq\tau_{9}\omega m^{-\frac{1}{2}}.
$$

Therefore,

$$
\left\|\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\cdot\left(\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right)\right\|\leq\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}. \tag{37}
$$

For the second term in the above decomposition of $\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|$, we bound it as

$$
\begin{aligned}
&\left\|\left(\Delta(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})-\widehat{\Delta}(\boldsymbol{x}_{t},\boldsymbol{x}_{t+1},\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\| \\
\leq{}& \left\|\left(Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|+\left\|\left(Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right)\cdot\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\| \\
\leq{}& \left|Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t};\boldsymbol{\theta}^{t})\right|\cdot\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\|+\left|Q(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})-\widehat{Q}(\boldsymbol{x}_{t+1};\boldsymbol{\theta}^{t})\right|\cdot\left\|\nabla_{\boldsymbol{\theta}}Q(\boldsymbol{x}_{t};\boldsymbol{\theta}^{0})\right\| \\
\overset{(i)}{\leq}{}& 2\tau_{4}\tau_{8}m^{-\frac{1}{2}},
\end{aligned} \tag{38}
$$

with probability at least $1-\delta-L\exp{(-\tau_{2}m)}$, where (i) is due to Lemmas D.3 and D.4. Plugging (37) and (38) into the above decomposition of $\left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|$ yields that

$$
\begin{aligned}
\left|\left<\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right),\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right>\right|
&\leq \left\|\mathbf{g}\left(\boldsymbol{\theta}^{t}\right)-\mathbf{m}\left(\boldsymbol{\theta}^{t}\right)\right\|\cdot\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\| \\
&\leq \left(\tau_{8}\tau_{9}\omega m^{-\frac{1}{2}}\sqrt{\log(T/\delta)}+2\tau_{4}\tau_{8}m^{-\frac{1}{2}}\right)\left\|\boldsymbol{\theta}^{t}-\boldsymbol{\theta}^{t+1}_{*}\right\|.
\end{aligned}
$$

Thus we complete the proof. ∎
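The add-and-subtract step behind the decomposition of $\mathbf{g}(\boldsymbol{\theta}^{t})-\mathbf{m}(\boldsymbol{\theta}^{t})$ can be verified on a toy model. The sketch below makes several assumptions for brevity: it uses a two-layer ReLU network rather than the $L$-layer network analyzed here, it takes $\widehat{Q}$ to be the first-order expansion of $Q$ at the initialization, and all function names (`q_and_grad`, `linearized_q`, `td_directions`, `demo`) and numeric values are hypothetical. It checks that $\mathbf{g}-\mathbf{m}$ equals the sum of the two terms bounded in (37) and (38).

```python
import numpy as np

def q_and_grad(W, b, x):
    """Toy Q(x; W) = b^T relu(W x) / sqrt(m) and its gradient with respect to W."""
    m = W.shape[0]
    pre = W @ x
    q = float(b @ np.maximum(pre, 0.0)) / np.sqrt(m)
    grad = np.outer(b * (pre > 0), x) / np.sqrt(m)
    return q, grad

def linearized_q(W, W0, b, x):
    """Assumed linearization: Q_hat(x; W) = Q(x; W0) + <grad Q(x; W0), W - W0>."""
    q0, grad0 = q_and_grad(W0, b, x)
    return q0 + float(np.sum(grad0 * (W - W0)))

def td_directions(W, W0, b, x_t, x_next, r, gamma):
    """Return g, m, and the two terms whose sum equals g - m."""
    q_t, grad_t = q_and_grad(W, b, x_t)
    q_next, _ = q_and_grad(W, b, x_next)
    _, grad0_t = q_and_grad(W0, b, x_t)
    delta = q_t - (r + gamma * q_next)                      # TD error of Q
    delta_hat = (linearized_q(W, W0, b, x_t)
                 - (r + gamma * linearized_q(W, W0, b, x_next)))  # TD error of Q_hat
    g = delta * grad_t                                      # nonlinear direction
    m_lin = delta_hat * grad0_t                             # linearized direction
    term1 = delta * (grad_t - grad0_t)                      # bounded as in (37)
    term2 = (delta - delta_hat) * grad0_t                   # bounded as in (38)
    return g, m_lin, term1, term2

def demo(width=256, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((width, dim))                  # initialization
    b = rng.choice([-1.0, 1.0], size=width)                 # fixed output signs
    W = W0 + 0.5 * rng.standard_normal((width, dim)) / np.sqrt(width * dim)
    x_t, x_next = rng.standard_normal((2, dim))
    x_t, x_next = x_t / np.linalg.norm(x_t), x_next / np.linalg.norm(x_next)
    return td_directions(W, W0, b, x_t, x_next, r=0.3, gamma=0.9)

g, m_lin, term1, term2 = demo()
print(np.allclose(g - m_lin, term1 + term2))
```

The identity is purely algebraic, so it holds for any parameters. The width-dependent rates in the lemma come only from the bounds on the individual factors.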

Lemma D.5 provides the upper bounds on $\mathbf{I}_{1}$ and $\mathbf{I}_{2}$ in Section A.2. As discussed in Section 3.1, for a finite MDP, the Gram matrix of the $L$-layer neural network function is positive definite and has a minimum eigenvalue of $\mathcal{O}(1)$ when the network width is sufficiently large. This, in fact, serves as an upper bound for $m^{*}$ in Assumption 3.1. Further details are provided in Remark D.6.

Remark D.6.

In this special case, we assume that both the state space and the action space are finite. Let $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state space and the action space, respectively. For simplicity of notation, we view $\mathbf{Q}(\boldsymbol{\theta})$ as an $|\mathcal{S}||\mathcal{A}|\times 1$ column vector, with $(s,a)$ being a multi-index arranged in lexicographical order. Let $d\sim\mu\times\pi$ and let $\mathbf{D}=\text{diag}(d)$ be an $|\mathcal{S}||\mathcal{A}|$-dimensional diagonal matrix whose $(s,a)$-th diagonal entry is $d(s,a)$, with the order of $(s,a)$ in $\mathbf{D}$ the same as in $\mathbf{Q}(\boldsymbol{\theta})$. Denote by $\mathbf{J}$ the Jacobian matrix of $\mathbf{Q}(\boldsymbol{\theta}^{0})$ and let $\mathbf{J_{D}}=\mathbf{D}^{\frac{1}{2}}\mathbf{J}$. Thus we can rewrite $\Sigma_{\pi}=\mathbf{J_{D}^{\top}}\mathbf{J_{D}}$. Notice that $\Sigma_{\pi}$ is different from the Gram matrix of Jacot et al. (2018); Du et al. (2018, 2019); Cao & Gu (2019); Allen-Zhu et al. (2019b) for deep neural networks. To derive the $\mu$-weighted Gram matrix $\text{Gram}(\boldsymbol{\theta}^{0})$, we first recall the neural tangent kernel matrix in Definition D.7.
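The rewrite $\Sigma_{\pi}=\mathbf{J_{D}^{\top}}\mathbf{J_{D}}$ in Remark D.6 and its relation to a Gram-type matrix can be illustrated numerically. This is a minimal sketch with stand-in data: the random matrix `J`, the distribution `d`, the toy sizes, and the helper name `weighted_covariance` are hypothetical, and the $n\times n$ matrix $\mathbf{J_{D}}\mathbf{J_{D}^{\top}}$ is used only as an illustrative counterpart, not as the definition of $\text{Gram}(\boldsymbol{\theta}^{0})$.

```python
import numpy as np

def weighted_covariance(J, d):
    """Sigma_pi = J_D^T J_D with J_D = D^{1/2} J and D = diag(d)."""
    J_D = np.sqrt(d)[:, None] * J      # scale row (s, a) by sqrt(d(s, a))
    return J_D.T @ J_D

rng = np.random.default_rng(0)
n, p = 6, 10                           # n = |S||A| pairs, p parameters (toy sizes)
J = rng.standard_normal((n, p))        # stand-in for the Jacobian of Q(theta^0)
d = rng.random(n) + 0.1
d /= d.sum()                           # strictly positive distribution over (s, a)

Sigma = weighted_covariance(J, d)                               # p x p
Sigma_sum = sum(d[i] * np.outer(J[i], J[i]) for i in range(n))  # sum_i d_i J_i J_i^T
J_D = np.sqrt(d)[:, None] * J
gram_like = J_D @ J_D.T                                         # n x n counterpart

# The p x p and n x n matrices differ in size but share their nonzero eigenvalues.
top_sigma = np.linalg.eigvalsh(Sigma)[-n:]
eig_gram = np.linalg.eigvalsh(gram_like)
print(np.allclose(Sigma, Sigma_sum), np.allclose(top_sigma, eig_gram))
```

When $p>n$, the matrix $\Sigma_{\pi}$ is rank deficient, while the $n\times n$ matrix can still be positive definite. This is one way to see why the two objects must be distinguished.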

Definition D.7.

(Du et al. (2019); Cao & Gu (2019); Allen-Zhu et al. (2019b, a), Neural Tangent Kernel Matrix). For any $i,j\in\left[|\mathcal{S}||\mathcal{A}|\right]$, define

$$
\begin{aligned}
\widetilde{\boldsymbol{\Theta}}_{i,j}^{(1)} &= \boldsymbol{\Sigma}_{i,j}^{(1)} = \left\langle\widehat{\boldsymbol{x}}_{i},\widehat{\boldsymbol{x}}_{j}\right\rangle, \quad \mathbf{A}_{ij}^{(l)} = \left(\begin{array}{cc} \boldsymbol{\Sigma}_{i,i}^{(l)} & \boldsymbol{\Sigma}_{i,j}^{(l)} \\ \boldsymbol{\Sigma}_{i,j}^{(l)} & \boldsymbol{\Sigma}_{j,j}^{(l)} \end{array}\right), \\
\boldsymbol{\Sigma}_{i,j}^{(l+1)} &= 2\cdot\mathbb{E}_{(u,v)\sim N\left(\mathbf{0},\mathbf{A}_{ij}^{(l)}\right)}\left[\sigma(u)\sigma(v)\right], \\
\widetilde{\boldsymbol{\Theta}}_{i,j}^{(l+1)} &= \widetilde{\boldsymbol{\Theta}}_{i,j}^{(l)}\cdot 2\cdot\mathbb{E}_{(u,v)\sim N\left(\mathbf{0},\mathbf{A}_{ij}^{(l)}\right)}\left[\sigma^{\prime}(u)\sigma^{\prime}(v)\right] + \boldsymbol{\Sigma}_{i,j}^{(l+1)},
\end{aligned}
$$

where $\widehat{\boldsymbol{x}}=\sqrt{d(s,a)}\,\phi(s,a)$. Then we call $\boldsymbol{\Theta}^{(L)}=\left[\left(\widetilde{\boldsymbol{\Theta}}_{i,j}^{(L)}+\boldsymbol{\Sigma}_{i,j}^{(L)}\right)/2\right]_{|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|}$ the $\mu$-weighted neural tangent kernel matrix of an $L$-layer ReLU network on the $\mu$-weighted state-action pairs $\widehat{\boldsymbol{x}}_{1},\ldots,\widehat{\boldsymbol{x}}_{|\mathcal{S}||\mathcal{A}|}$.
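The recursion in Definition D.7 can be evaluated numerically. The following is a minimal sketch, not the authors' code: it assumes the ReLU activation named in the definition and uses the standard closed forms for the two Gaussian expectations, $2\,\mathbb{E}[\sigma(u)\sigma(v)]=\frac{\sqrt{ab}}{\pi}\left(\sin\vartheta+(\pi-\vartheta)\cos\vartheta\right)$ and $2\,\mathbb{E}[\sigma^{\prime}(u)\sigma^{\prime}(v)]=(\pi-\vartheta)/\pi$, where $a,b$ are the diagonal entries of $\mathbf{A}_{ij}^{(l)}$ and $\vartheta$ is the angle defined by its correlation. The function name `weighted_relu_ntk` and its arguments are hypothetical.

```python
import numpy as np

def weighted_relu_ntk(phi, d, L):
    """Sketch of the mu-weighted NTK matrix Theta^(L) of Definition D.7 (ReLU).

    phi : (n, p) array, one feature vector phi(s, a) per state-action pair
    d   : (n,) array of strictly positive weights d(s, a)
    L   : number of layers, L >= 1
    """
    x_hat = np.sqrt(d)[:, None] * phi      # weighted inputs x_hat_i
    sigma = x_hat @ x_hat.T                # Sigma^(1)
    theta_tilde = sigma.copy()             # Theta_tilde^(1) = Sigma^(1)
    for _ in range(L - 1):
        norms = np.sqrt(np.diag(sigma))    # sqrt of Sigma_ii^(l)
        outer = np.outer(norms, norms)
        cos = np.clip(sigma / outer, -1.0, 1.0)
        angle = np.arccos(cos)
        # 2 * E[relu(u) relu(v)] for (u, v) ~ N(0, A_ij^(l))
        sigma_next = outer * (np.sin(angle) + (np.pi - angle) * cos) / np.pi
        # 2 * E[relu'(u) relu'(v)] for the same Gaussian pair
        sigma_dot = (np.pi - angle) / np.pi
        theta_tilde = theta_tilde * sigma_dot + sigma_next
        sigma = sigma_next
    return (theta_tilde + sigma) / 2.0

# Toy usage with made-up data: three orthonormal features and a positive d.
theta = weighted_relu_ntk(np.eye(3), np.array([0.2, 0.3, 0.5]), L=2)
print(np.linalg.eigvalsh(theta).min())     # smallest eigenvalue of Theta^(2)
```

Since ReLU is positively homogeneous, the diagonal of $\boldsymbol{\Sigma}^{(l)}$ stays equal to $\|\widehat{\boldsymbol{x}}_{i}\|^{2}=d(s,a)$ across layers in this sketch, which offers a quick sanity check.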

There is a large body of work (Jacot et al. (2018); Du et al. (2018, 2019); Cao & Gu (2019); Allen-Zhu et al. (2019b)) exploring the positive definiteness of $\text{Gram}(\boldsymbol{\theta}^{0})$. Suppose that for all distinct pairs $(s,a),(s^{\prime},a^{\prime})$, we have $\|\phi(s,a)\|=1$, $d(s,a)>0$ and $\phi(s,a)\nparallel\phi(s^{\prime},a^{\prime})$. Theorem 1 and Proposition 2 in Jacot et al. (2018) show that for an $L$-layer neural network with Gaussian initialization parameters,

$$
\left\langle\nabla_{\boldsymbol{\theta}}Q(\widehat{\boldsymbol{x}}_{i};\boldsymbol{\theta}^{0}),\nabla_{\boldsymbol{\theta}}Q(\widehat{\boldsymbol{x}}_{j};\boldsymbol{\theta}^{0})\right\rangle\rightarrow\boldsymbol{\Theta}^{(L)}_{i,j},\quad\text{as}\ m\rightarrow\infty.
$$

That is, under the NTK regime, the $\mu$-weighted Gram matrix $\text{Gram}(\boldsymbol{\theta}^{0})=\mathbf{J_{D}}\mathbf{J_{D}}^{\top}$ converges to $\boldsymbol{\Theta}^{(L)}$ when $m$ is sufficiently large. Let $\lambda_{\min}\left(\boldsymbol{\Theta}^{(L)}\right)=2\lambda^{\prime}=\mathcal{O}(1)$. For any $\delta\in(0,1)$, there exists $m^{*}=\text{Poly}(|\mathcal{S}||\mathcal{A}|,L,\delta,\lambda^{\prime})$ such that if $m\geq m^{*}$, we have

$$
\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)\geq\lambda^{\prime}\quad\text{w.p.}\ 1-\delta.
$$
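One standard way to see why a factor of two is lost between $\lambda_{\min}(\boldsymbol{\Theta}^{(L)})=2\lambda^{\prime}$ and the bound above is Weyl's inequality for symmetric matrices. The sketch below assumes (this is not stated explicitly in the text) that $m^{*}$ is chosen so that $\left\|\text{Gram}(\boldsymbol{\theta}^{0})-\boldsymbol{\Theta}^{(L)}\right\|_{2}\leq\lambda^{\prime}$ holds with probability at least $1-\delta$:

```latex
\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)
  \;\geq\; \lambda_{\min}\left(\boldsymbol{\Theta}^{(L)}\right)
           - \left\|\text{Gram}(\boldsymbol{\theta}^{0}) - \boldsymbol{\Theta}^{(L)}\right\|_{2}
  \;\geq\; 2\lambda^{\prime} - \lambda^{\prime}
  \;=\; \lambda^{\prime}.
```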

This signifies that if the network width $m\geq m^{*}$, then we have $\lambda_{0}\geq\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)\geq\lambda^{\prime}$, thereby substantiating our claim.
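The link between $\lambda_{0}$ and the Gram matrix rests on the fact that $\Sigma_{\pi}=\mathbf{J_{D}}^{\top}\mathbf{J_{D}}$ and $\text{Gram}(\boldsymbol{\theta}^{0})=\mathbf{J_{D}}\mathbf{J_{D}}^{\top}$ share the same nonzero eigenvalues. The following sketch illustrates this numerically under several assumptions that are not taken from the paper: a toy two-layer ReLU network $Q(x;\boldsymbol{\theta})=b^{\top}\mathrm{relu}(Wx)/\sqrt{m}$, random unit-norm features, and a randomly drawn positive $d$. All function and variable names are hypothetical.

```python
import numpy as np

def two_layer_jacobian(X, W, b):
    """Rows are gradients of Q(x) = b^T relu(W x) / sqrt(m) w.r.t. (W, b)."""
    m = W.shape[0]
    pre = X @ W.T                          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)             # relu(W x)
    gate = (pre > 0).astype(float)         # relu'(W x)
    # dQ/dW[r, :] = b_r * relu'(w_r^T x) * x / sqrt(m)
    dW = (gate * b)[:, :, None] * X[:, None, :] / np.sqrt(m)
    db = act / np.sqrt(m)                  # dQ/db_r = relu(w_r^T x) / sqrt(m)
    return np.concatenate([dW.reshape(X.shape[0], -1), db], axis=1)

def min_eigenvalues(X, d, W, b, rel_tol=1e-10):
    """Return lambda_min(Gram), smallest nonzero eigenvalue of Sigma_pi, rank."""
    J_D = np.sqrt(d)[:, None] * two_layer_jacobian(X, W, b)   # D^{1/2} J
    gram = J_D @ J_D.T                     # n x n weighted Gram matrix
    sigma_pi = J_D.T @ J_D                 # p x p matrix, rank at most n
    eig_sigma = np.linalg.eigvalsh(sigma_pi)
    nonzero = eig_sigma[eig_sigma > rel_tol * eig_sigma[-1]]
    return np.linalg.eigvalsh(gram)[0], nonzero.min(), nonzero.size

rng = np.random.default_rng(0)
n, dim, m = 6, 4, 64                       # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||phi(s, a)|| = 1
d = rng.uniform(0.5, 1.5, size=n)
d /= d.sum()                               # positive weights summing to one
W = rng.normal(size=(m, dim))
b = rng.normal(size=m)

lam_gram, lambda_0, rank = min_eigenvalues(X, d, W, b)
print(lam_gram, lambda_0, rank)
```

When the Gram matrix is positive definite, the two printed eigenvalues agree up to floating-point error, so a lower bound on $\lambda_{\min}\left(\text{Gram}(\boldsymbol{\theta}^{0})\right)$ carries over to $\lambda_{0}$.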

Appendix E Additional Notes on the Experiments in Section 5

In this section, we further discuss the experimental setup introduced in Section 5. As mentioned in Section 5, our experiments mainly test the following two aspects: (1) how the network width $m$ affects the final error of the algorithm (first two subfigures in Figure 1); (2) the minimum nonzero singular value in Assumption 3.3 (latter two subfigures in Figure 1).

For point (1), we first generate 2000 samples according to a given policy $\pi$ to imitate the Markov process of Algorithm 1. We use a two-layer neural network with ELU activation, and the parameters are initialized as in Algorithm 1. We set the initial learning rate to 0.001 with linear decay (per epoch) and use a batch size of 100. Notably, as the width $m$ increases, the TD algorithm attains smaller final TD errors.

For point (2), our experiments are based on three main points. First, note that the norm of the feature map and the random initialization of the parameters affect the scaling of the gradient norm of $Q(s,a;\boldsymbol{\theta})$. Thus we employ the ratio $r=\sigma_{\max}/\sigma_{\min}$ to characterize the minimum non-zero singular value, in order to eliminate the impact of numerical scaling. Second, we set varying network widths to verify the existence of $\lambda_{0}$. Finally, it is difficult to directly obtain the joint distribution of $(s,a)$ under a fixed learning policy. However, we have that $\left\|\frac{1}{N}\sum_{(s,a)\sim\mu\times\pi}\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta})^{\top}-\Sigma_{\pi}\right\|_{2}=\mathcal{O}(1/N)$. To avoid the effects of sampling randomness, we estimate $\Sigma_{\pi}$ from different samples.
The experiments demonstrate that the minimum non-zero singular value $\lambda_{0}$ converges to a constant as $m$ increases.
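The scale-free ratio described above can be computed directly from sampled gradients. The sketch below is one possible reading, not the authors' script: it assumes the $N$ sampled gradients $\nabla_{\boldsymbol{\theta}}Q(s,a;\boldsymbol{\theta})$ are stacked as the rows of a matrix `G`, forms the spectrum of the empirical estimate $\frac{1}{N}G^{\top}G$ of $\Sigma_{\pi}$ through the singular values of $G/\sqrt{N}$, and discards numerically zero values using a relative tolerance. The function name and the tolerance are hypothetical.

```python
import numpy as np

def singular_value_ratio(G, rel_tol=1e-10):
    """Ratio sigma_max / sigma_min over the nonzero spectrum of (1/N) G^T G.

    G : (N, p) array whose rows are sampled gradients of Q w.r.t. theta.
    """
    N = G.shape[0]
    s = np.linalg.svd(G / np.sqrt(N), compute_uv=False)
    eig = s ** 2                           # eigenvalues of (1/N) G^T G
    nonzero = eig[eig > rel_tol * eig.max()]
    return nonzero.max() / nonzero.min()

# Toy usage with made-up gradients: rescaling G leaves the ratio unchanged,
# which is the reason for reporting r instead of the raw singular value.
rng = np.random.default_rng(0)
G = rng.normal(size=(200, 10))
print(singular_value_ratio(G), singular_value_ratio(5.0 * G))
```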