An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks
Abstract
Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.
1 Introduction
The temporal difference (TD) learning method, first designed for policy evaluation (Sutton, 1988), is a fundamental building block of many popular reinforcement learning (RL) algorithms. In standard TD learning algorithms for tabular MDPs, based on the Bellman operator, the agent iteratively obtains a state-action-reward-transition tuple and then updates the Q values by a weighted average of the current value and the TD target. Once the algorithm converges, the Q function is taken to be the expected return obtained by executing the target policy from a given initial state-action pair.
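The tabular update described above can be sketched as follows. This is a minimal illustration, not the algorithm analyzed in this paper; the step size `alpha`, the discount `gamma`, and the toy transition are assumptions chosen only for the example.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update of Q[s, a] for policy evaluation.

    (s, a, r, s_next) is the observed transition and a_next is the action
    drawn from the target policy at s_next. alpha and gamma are
    illustrative values, not values taken from the paper.
    """
    td_target = r + gamma * Q[s_next, a_next]
    # Weighted average of the current value and the TD target.
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * td_target
    return Q

# Toy usage: two states, two actions, a single observed transition.
Q = np.zeros((2, 2))
Q = td_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Only the visited entry `Q[0, 1]` changes; all other entries keep their previous values.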
For large-scale RL problems, appropriate parameterization of the Q function is crucial for better scalability of the TD algorithms. Common examples include linear (Tesauro et al., 1995), general smooth nonlinear (Maei et al., 2009), and neural network (Mnih et al., 2013) function approximations. However, it is well known that the naive extension of TD learning and Q-learning algorithms can diverge under general function approximation (Tsitsiklis & Van Roy, 1996). To encourage convergence, numerous variants of TD and Q-learning have been proposed, including least-squares TD (LSTD) (Bradtke & Barto, 1996; Boyan, 2002) and gradient TD (GTD) (Sutton et al., 2009a, b), to name a few.
The applications of neural network function approximation have witnessed huge empirical success in many real-world tasks, including Deep Q-network (DQN) algorithms (Mnih et al., 2013; Van Hasselt et al., 2016), policy improvement method (Sutton et al., 1999), trust region policy optimization (Schulman et al., 2015) and the actor-critic algorithms (Konda & Tsitsiklis, 1999; Lillicrap et al., 2015; Fujimoto et al., 2018), etc. However, due to the analysis difficulties brought by the function approximation, a significant gap exists between the empirical success and the theoretical understanding of these algorithms. Hence analyzing the convergence and sample complexity of TD learning and Q-learning under various Q function parameterizations has always been an active topic in the RL community during the past decades.
Early works focus on the asymptotic convergence of the algorithms with tabular or linear function approximation. For the tabular (stochastic) TD or Q-learning method, Jaakkola et al. (1993) established the asymptotic convergence for the first time. Later on, the asymptotic convergence of algorithms with linear function approximation was extensively discussed using ODE-based methods; see, e.g., Tsitsiklis & Van Roy (1996); Perkins & Pendrith (2002); Borkar (2009). Meanwhile, in contrast to the convergence results for RL algorithms under the tabular or linear settings, TD with nonlinear function approximation is known to diverge in general (Tsitsiklis & Van Roy, 1996; Brandfonbrener & Bruna, 2019). To overcome this issue, Maei et al. (2009) proposed to optimize the Mean Squared Projected Bellman Error (MSPBE) via a gradient-based algorithm. Because the problem is nonconvex, only asymptotic convergence to stationary points can be guaranteed.
More recently, benefiting from improved techniques for analyzing stochastic optimization algorithms, there has been a growing body of research on finite-time analysis of TD and Q-learning algorithms with function approximation.
For linear function approximation, the non-asymptotic results of TD learning and its variants are relatively well understood, including TD (Bhandari et al., 2018; Dalal et al., 2018; Zou et al., 2019), gradient TD (Dalal et al., 2018; Touati et al., 2018; Liu et al., 2020a), and least-squares TD (Lazaric et al., 2010; Prashanth et al., 2014; Tagorti & Scherrer, 2015). In particular, Bhandari et al. (2018) established the first finite-time analysis of linear Q-learning under both i.i.d. sampling and Markovian sampling settings.
For neural network function approximation, which is directly related to this paper, we provide a more detailed discussion. Based on the recent advances in the understanding of optimizing ReLU network Jacot et al. (2018); Du et al. (2018); Allen-Zhu et al. (2019a, b); Cao & Gu (2019, 2020), a few recent works have successfully developed the finite-time analysis of the neural TD and neural Q-learning algorithms, as long as the Q network is sufficiently wide. Let be the true action-value function and let denote the action-value function parameterized by a neural network with weights , at any state action pair . Then we aim to find some -optimal parameter such that , where the expectation is taken over the possible randomness in the output as well as the distribution over the state-action pairs , and is the optimal approximation error of the parameterization function class. In (Xu & Gu, 2020), a neural Q-learning algorithm with a general -layer ReLU network is analyzed, and an sample complexity is guaranteed given that the network is sufficiently wide. In (Cai et al., 2023), the authors studied both the neural TD learning and neural Q-learning algorithms for minimizing the MSPBE for policy evaluation and policy optimization, respectively. For policy evaluation, the in the definition of an -optimal solution is defaulted as with being the policy to be evaluated. For both cases, an sample complexity is guaranteed for wide two-layer ReLU networks. In (Sun et al., 2022), an complexity has been achieved by an adaptive neural TD algorithm with multi-layer ReLU networks, where is a constant that characterizes the sparsity and decay rate of the stochastic semi-gradients. However, without additional assumption, only an complexity with can be theoretically guaranteed. Finally, for policy evaluation problems, there are also several works that aim at reducing the width of the over-parameterized Q networks in the existing works (Tian et al., 2022; Cayci et al., 2023). 
In terms of complexity, both of them require $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples to obtain an $\epsilon$-optimal solution.
Although existing analyses of the neural TD or neural Q-learning algorithms provide only an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under various settings, an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity should be expected. In fact, a double-loop fitted Q-iteration (FQI) method (Fan et al., 2020) and its single-loop Gauss-Newton variant (Ke et al., 2023) can achieve an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for two-layer Q networks. Let $\mathcal{T}$ be the Bellman (optimality) operator; the FQI method then repeatedly solves a nonlinear least-squares subproblem to obtain the next iterate: $\theta_{k+1} \in \arg\min_{\theta} \mathbb{E}\big[(Q(s,a;\theta) - \mathcal{T}Q(s,a;\theta_k))^2\big]$. Compared to the single-loop neural TD or neural Q-learning method, which takes only one sample (or a mini-batch) to update the weights of the Q network, the update scheme of FQI requires repeatedly solving a subproblem to sufficiently high accuracy to enable convergence, which makes it inefficient and less favorable in practice. Therefore, we would like to raise a question:

Can we improve the existing analysis of the neural temporal difference learning algorithm and obtain an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity under general multi-layer Q neural networks?
To answer this question, we revisit the convergence analysis of the neural TD learning and Q-learning algorithms under non-i.i.d. Markovian observations, where a general $L$-layer neural network is used for Q function parameterization. By proposing a new subspace analysis technique, under suitable conditions, we derive a new $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for neural TD learning and Q-learning, improving the state-of-the-art $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works. Our contributions are summarized as follows.
| Reference | Neural Approximation | Network Depth | Network Width | Activation | Sample Complexity |
|---|---|---|---|---|---|
| (Bhandari et al., 2018) | No | NA | NA | NA | |
| (Cai et al., 2023) | Yes | 2 | | ReLU | |
| (Xu & Gu, 2020) | Yes | | | ReLU | |
| (Sun et al., 2022) | Yes | | | ReLU | |
| (Tian et al., 2022) | Yes | | | ELU, GeLU | |
| Ours | Yes | | | ELU, GeLU | |
- Under the non-i.i.d. Markovian sampling setting, we derive an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity for both neural TD learning and Q-learning methods under the multi-layer network approximation for Q functions. Our result improves the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the existing works.
- Based on our newly developed techniques, we further provide a finite-sample analysis for a minimax neural Q-learning algorithm that solves two-player zero-sum Markov games. An $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is obtained under the non-i.i.d. Markovian sampling setting.
Technically, the subspace analysis approach that we propose to establish the $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is of independent interest. We believe this technique can potentially be applied to linear Q-learning algorithms and linear actor-critic algorithms without requiring the positive definiteness assumption on the feature covariance matrix (Bhandari et al., 2018; Zou et al., 2019; Barakat et al., 2022), while maintaining the $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity.
In summary, we provide a comprehensive comparison between our work and the most related works in their respective settings and sample complexity in Table 1. Our work establishes an optimal sample complexity analysis within a broader contextual framework.
2 Preliminaries
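We consider the infinite-horizon discounted Markov decision process (MDP), which is denoted as $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$. We consider a general state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. At any state $s \in \mathcal{S}$, if the agent takes an action $a \in \mathcal{A}$, it will receive a reward $r(s,a)$ and transition to the next state $s'$ with probability $\mathcal{P}(s'|s,a)$. We call $r$ the reward function and $\mathcal{P}$ the transition kernel. Let $\gamma \in (0,1)$ be a discount factor; then an MDP aims to find a sequence of actions to maximize the expected discounted cumulative reward $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\big|\, s_0 \sim \mu_0\big]$, where $\mu_0$ is the distribution of the initial state $s_0$.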
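Let $\Delta(\mathcal{A})$ denote the set of all probability distributions over the action space $\mathcal{A}$, and let a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ be a mapping that returns a probability distribution given any state $s \in \mathcal{S}$. If an agent follows a policy $\pi$, then at any state $s$, it will act by sampling an action $a \sim \pi(\cdot|s)$. Therefore, the action-value function (Q-function) under the policy $\pi$ is
$$Q^\pi(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big]$$
for $(s,a) \in \mathcal{S} \times \mathcal{A}$, where all actions except $a_0$ are sampled according to $\pi$. For any mapping $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, let the Bellman operator $\mathcal{T}^\pi$ be
$$\mathcal{T}^\pi Q(s,a) = \mathbb{E}\big[r(s,a) + \gamma Q(s',a') \,\big|\, s' \sim \mathcal{P}(\cdot|s,a),\ a' \sim \pi(\cdot|s')\big].$$
Then $\mathcal{T}^\pi$ is a $\gamma$-contraction under the infinity norm and $Q^\pi$ is the unique solution to the fixed-point equation $Q = \mathcal{T}^\pi Q$ (Bertsekas, 2012). If the Q function is parameterized by some function $Q(\cdot;\theta)$ to gain better scalability for large-scale RL problems, popular approaches for finding a good $\theta$ include minimizing the Mean-Squared Bellman Error (MSBE):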
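$$\min_{\theta \in \Theta}\ \mathbb{E}_{(s,a) \sim \mu}\Big[\big(Q(s,a;\theta) - \mathcal{T}^\pi Q(s,a;\theta)\big)^2\Big] \tag{1}$$
and minimizing the Mean-Squared Projected Bellman Error (MSPBE):
$$\min_{\theta \in \Theta}\ \mathbb{E}_{(s,a) \sim \mu}\Big[\big(Q(s,a;\theta) - \Pi_{\mathcal{F}} \mathcal{T}^\pi Q(s,a;\theta)\big)^2\Big] \tag{2}$$
where $\Theta$ is a feasible domain of the parameter $\theta$, $\mu$ is some distribution over state-action pairs, and $\Pi_{\mathcal{F}}$ is the projection onto some function class $\mathcal{F}$. Typical choices of $\mathcal{F}$ include the Q function parameterization class itself (Maei et al., 2009) and some local linearization of the parameterization function class (Cai et al., 2023).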
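The contraction property of the Bellman operator noted above can be checked numerically on a small tabular problem. The following is a minimal sketch under stated assumptions: the random MDP, the random policy, and the sizes `n_s`, `n_a` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9  # hypothetical sizes and discount factor

# Hypothetical random MDP and policy (rows normalized to probabilities).
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)      # P[s, a, s'] = transition probability
r = rng.random((n_s, n_a))             # deterministic reward r(s, a)
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)    # pi[s, a] = probability of a at s

def bellman(Q):
    """Apply the policy-evaluation Bellman operator to a Q table."""
    v = (pi * Q).sum(axis=1)           # expected next value under pi
    return r + gamma * (P @ v)         # shape (n_s, n_a)

# Contraction check in the infinity norm for two arbitrary Q tables.
Q1 = rng.normal(size=(n_s, n_a))
Q2 = rng.normal(size=(n_s, n_a))
lhs = np.abs(bellman(Q1) - bellman(Q2)).max()
rhs = gamma * np.abs(Q1 - Q2).max()

# Fixed-point iteration converges to the unique solution of Q = T Q.
Q = np.zeros((n_s, n_a))
for _ in range(500):
    Q = bellman(Q)
residual = np.abs(bellman(Q) - Q).max()
```

Here `lhs <= rhs` reflects the contraction with modulus `gamma`, and the small `residual` reflects that repeated application of the operator reaches its fixed point.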
In this paper, we study the neural temporal difference learning method where the action-value function is parameterized by some multi-layer neural network. Let us define a feedforward neural network by the following recursion:
(3) |
where , for are the weight matrices of the network, is an activation function, and the input is a feature map for any state action pair . For simplicity of notation, we write , then is parameterized by
(4) |
where the parameter $\theta$ denotes the collection of all weight matrices, and the output-layer vector $b$ is given by a random initialization. Here $\mathrm{vec}(\cdot)$ stands for the vectorization operator that reshapes a matrix into a column vector by stacking its columns one by one, and the “;” separator in the definition of $\theta$ stands for the vertical stacking of the elements. That is, we reshape $\theta$ into a long column vector for notational convenience in later discussion.
Assumption 2.1.
The activation function is -Lipschitz and -smooth, i.e. , for
and
Assumption 2.1 indicates that our results below are not based on the popular ReLU activation function. However, we primarily focus on some twice-differentiable activation functions (such as Sigmoid, ELU, GeLU, etc.), which are smooth approximations of the ReLU function and are frequently utilized in practical problems (Devlin et al., 2018; Godfrey, 2019). Such a setup aligns with (Liu et al., 2020b), and provides a -smooth property for the neural Q-function.
Let be the initial solution. For each , we initialize the weights of element-wise from a normal distribution and each element of is drawn uniformly from . The parameter will not be optimized during training. For regularity purpose, we would like to restrict the iterations to a bounded set around , which is defined as
In each iteration , the neural Q-learning algorithm obtains a sample of state-action-reward-transition tuple and computes the TD error by
(5) |
with Then a projected stochastic semi-gradient step is performed to update the weight matrices:
(6) |
with
We formally describe the neural TD learning method in Algorithm 1.
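The projected semi-gradient step described above can be sketched for a two-layer network. This is a hedged illustration rather than Algorithm 1 itself: the $1/\sqrt{m}$ output scaling, the standard normal initialization, the Frobenius-ball projection radius `radius`, the sign convention of the TD error, and all function names are assumptions introduced for the example.

```python
import numpy as np

def elu(z):
    # ELU activation: z for z > 0, exp(z) - 1 otherwise.
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def elu_grad(z):
    # Derivative of ELU: 1 for z > 0, exp(z) otherwise.
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))

def q_value(W, b, x):
    # Two-layer network with fixed output weights b (assumed scaling 1/sqrt(m)).
    return b @ elu(W @ x) / np.sqrt(W.shape[0])

def q_grad(W, b, x):
    # Gradient of q_value with respect to the trainable matrix W.
    return np.outer(b * elu_grad(W @ x), x) / np.sqrt(W.shape[0])

def project(W, W0, radius):
    # Euclidean projection onto the ball of the given radius centered at W0.
    diff = W - W0
    norm = np.linalg.norm(diff)
    if norm > radius:
        diff = diff * (radius / norm)
    return W0 + diff

def neural_td_step(W, W0, b, x, reward, x_next, eta, gamma, radius):
    # TD error; the gradient is taken only through Q(x), hence "semi-gradient".
    delta = q_value(W, b, x) - reward - gamma * q_value(W, b, x_next)
    return project(W - eta * delta * q_grad(W, b, x), W0, radius)

# Toy usage with hypothetical features; one transition is reused for brevity.
rng = np.random.default_rng(0)
m, d = 64, 5
W0 = rng.normal(size=(m, d))
b = rng.choice([-1.0, 1.0], size=m)    # fixed, not trained
x, x_next = rng.normal(size=d), rng.normal(size=d)
x, x_next = x / np.linalg.norm(x), x_next / np.linalg.norm(x_next)

W = W0.copy()
for _ in range(100):
    W = neural_td_step(W, W0, b, x, 1.0, x_next, eta=0.5, gamma=0.9, radius=1.0)
```

In an actual run, `(x, reward, x_next)` would come from consecutive tuples of a single Markovian trajectory generated by the learning policy rather than from a repeated toy transition.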
One remark is that, under the non-i.i.d. Markovian sampling setting, the agent is only able to generate a trajectory of samples following some given learning policy $\pi$. This situation is very common in offline RL (Wu et al., 2019; Levine et al., 2020; Kostrikov et al., 2021), where the data trajectories are generated by some learning policy.
In later sections, we will revisit the Algorithm 1 and design a novel subspace analysis technique for this method and achieve an improved sample complexity of . Moreover, by replacing the TD error induced by the Bellman operator (5) with TD error induced by the Bellman optimality operator: , Algorithm 1 can be reduced to the neural Q-learning method for finding optimal state-action value . Our analysis for neural TD learning can be extended to the neural Q-learning analogously and obtain the same sample complexity.
3 Convergence of Neural Temporal Difference Learning
3.1 Basic Settings and Assumptions
To analyze Algorithm 1, let us first define the local linearization function class of the multi-layer Q network (4) at the random initialization :
(7) |
for any . Consider the MSPBE minimization problem:
(8) |
where is the initial state distribution, is the learning policy, and is the transition kernel, the expectation is taken over , , and in . Define the set as
(9) |
Then the set consists of the points with which forms a fixed point of the projected Bellman operator for the problem (8). By Section 4.1 in (Cai et al., 2023), the fixed point of is unique for . Therefore, the following relationship holds
(10) |
for Moreover, it is also shown that a point if and only if it satisfies the stationarity condition:
(11) |
where is a local linearization provided by (7) and is defined as
Hence people may analyze the gap between and by first connecting it to . Based on this, Cai et al. (2023) derived an sample complexity for the neural TD method. Now we define
(12) |
It is worth noting that the matrix only depends on and . In the original assumption about (12), (Zou et al., 2019; Xu & Gu, 2020) in fact assumed positive definiteness () of , which can be viewed as a generalized version of the positive definite feature covariance matrix assumption in the analysis of linear TD and linear Q-learning, see e.g. (Zou et al., 2019). However, in this paper we adopt the following weaker regularity assumption.
Assumption 3.1.
Let denote the minimum non-zero singular value of the matrix , then there exist constants such that as long as the Q network width .
For neural Q function approximation, a sufficient but not necessary condition for Assumption 3.1 can be obtained by exploiting the theory of over-parameterized neural networks. Roughly speaking, for a finite MDP with an -layer ReLU Q network, if the feature map satisfies for , the results of (Jacot et al., 2018; Allen-Zhu et al., 2019a, b; Cao & Gu, 2019, 2020) suggest that there exist such that with high probability for networks with width Here stands for the Gram matrix of the network at the initialization . A lower bound on can then be constructed with , refer to Remark D.6 in Appendix D.
Finally, to facilitate the sample complexity analysis under the non-i.i.d. Markovian sampling setting, let us make the following assumption on the fast mixing rate of the MDP sample trajectories, which is widely adopted in the related analysis (Zou et al., 2019; Xu & Gu, 2020; Cai et al., 2023).
Assumption 3.2.
We assume that the Markov chain induced by the learning policy and the transition kernel is uniformly ergodic with its invariant measure . Furthermore, we assume that there are constants such that
for all
Without loss of generality, we also make the following technical assumption, which, unlike Assumptions 3.1 and 3.2, is not fundamental.
Assumption 3.3.
We assume the initial state distribution to be the stationary state distribution under policy .
This assumption is in fact very natural. Concerning the stationarity of , it can always be guaranteed by abandoning the first samples while Assumption 3.2 indicates that the mixing time . This assumption guarantees that the operator is -contractive w.r.t. in policy evaluation. Similar assumptions are included in (Bhandari et al., 2018; Cai et al., 2023).
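The role of fast mixing and of discarding early samples can be illustrated on a small chain. The sketch below is an illustration only, assuming a hypothetical three-state transition matrix `P`; it measures the worst-case total variation distance between the $t$-step state distribution and the invariant distribution, which decays geometrically for a uniformly ergodic chain.

```python
import numpy as np

# Hypothetical 3-state transition matrix induced by a policy and a kernel.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Invariant distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
nu = np.real(evecs[:, np.argmax(np.real(evals))])
nu = nu / nu.sum()

def tv_to_stationary(t):
    """Worst-case (over start states) total variation distance after t steps."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - nu).sum(axis=1).max()

steps = (1, 5, 10, 20, 40)
tv = [tv_to_stationary(t) for t in steps]
```

After enough burn-in steps the distance in `tv` is negligible, so samples collected afterwards are approximately drawn from the stationary distribution.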
3.2 An Improved Complexity of Neural TD Learning
To derive the sample complexity, we rely on the following key observation on subspace decomposition, which is beyond the existing analysis framework.
Proposition 3.4.
Let and denote the range space and kernel space of the matrix , respectively. Then for any parameter , there exists such that
which also implies that the projections of and onto the subspace are identical.
Based on this argument, for the iteration sequence generated by Algorithm 1, there exists a sequence such that . Therefore, unlike the existing works that analyze for some , c.f. (Cai et al., 2023; Xu & Gu, 2020), we will prove a much faster convergence in . Combined with (10), this further indicates the improved sample complexity in this paper. The proof of this proposition is presented as follows.
Proof.
For the ease of discussion, let us denote the dimension of weight parameter as . Then we may denote and . First of all let us fix an arbitrary , then we may decompose it into two orthogonal components:
Similarly, we can decompose the currently considered vector as
Note that having an arbitrary vector in the kernel space of means that , which further indicates that
Therefore, under the measure , we have
(13) |
where stands for almost surely. Therefore, define , we can check the stationarity condition (11) for by establishing:
(14) | ||||
The proof of (14) is lengthy and is thus moved to Appendix A.1 for succinctness. As a result we have and . Note that for any ,
which completes the proof. ∎
Following basic linear algebra analysis, we also have the following proposition.
Proposition 3.5.
Under Assumption 3.1, suppose the adopted Q network is sufficiently wide so that , then for any , we have .
Proposition 3.4 indicates that the variations in the local linearization of Q-function values solely depend on the variations in parameters within the subspace . Meanwhile, Proposition 3.5 indicates that such local linearization is non-singular within . Based on these observations, we can first provide a fast convergence of and then show that for any . We summarize this result in Theorem 3.6 and present its proof in Appendix A.2.
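The two observations above can be illustrated numerically on a rank-deficient feature covariance matrix. This is a sketch under assumptions: the synthetic features `Phi`, the empirical matrix `Sigma`, and the dimensions are hypothetical stand-ins for the gradient features and the matrix in (12).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rank = 200, 10, 4  # hypothetical sample count, dimension, and rank

# Hypothetical features confined to a rank-4 subspace of R^10.
basis = rng.normal(size=(rank, p))
Phi = rng.normal(size=(n, rank)) @ basis     # rows are feature vectors
Sigma = Phi.T @ Phi / n                      # singular (rank-deficient) covariance

# Split R^p into the range space and the kernel space of Sigma.
evals, evecs = np.linalg.eigh(Sigma)
keep = evals > 1e-10 * evals.max()
U = evecs[:, keep]                           # orthonormal basis of the range space
lam_min = evals[keep].min()                  # smallest non-zero eigenvalue

theta = rng.normal(size=p)
theta_range = U @ (U.T @ theta)              # component in the range space
theta_kernel = theta - theta_range           # component in the kernel space

# Linearized values depend only on the range-space component.
gap = np.abs(Phi @ theta - Phi @ theta_range).max()

# Restricted to the range space, the quadratic form is bounded below.
quad = theta_range @ Sigma @ theta_range
bound = lam_min * (theta_range @ theta_range)
```

The kernel component changes none of the linearized values, while on the range space the quadratic form is bounded below by the smallest non-zero eigenvalue. This is why a full positive definiteness assumption is not needed in this toy setting.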
Theorem 3.6.
Suppose Assumptions 3.1, 3.2 and 3.3 hold. We set and the learning rate . If the feature map for each state-action pair and the network width , then the output of Algorithm 1 satisfies
with probability at least , where is the mixing time of Markov chain in Assumption 3.2, and are universal constants.
Let be the true state-action value function that satisfies the Bellman equation . Then based on the convergence of the local linearization in Theorem 3.6, we establish the global convergence of neural temporal difference learning as Theorem 3.7.
Theorem 3.7.
Let be the optimal approximation error of the function class . Then Theorem 3.7 demonstrates that, under suitable parameter choices, the neural TD learning method attains an approximation error bound of within samples. Existing works, including Cai et al. (2023), Xu & Gu (2020), Tian et al. (2022), and Cayci et al. (2023), achieve sample complexity, and Sun et al. (2022) achieves with additional assumptions.
Following a similar analysis while adopting an additional regularity assumption on the matrix , one can further extend the above analysis to the neural Q-learning by substituting the Bellman operator with the Bellman optimality operator. A similar sample complexity can still be achieved, which is relegated to Appendix for succinctness.
4 Convergence of Minimax Neural Q-Learning
A two-player zero-sum Markov game (Littman, 1994; Bowling & Veloso, 2001; Perolat et al., 2018), as a simple variant of MDP, is defined as a six-tuple . Here is state space, and are the action space of the first and second player, respectively, is the transition probability, is the reward function and is the discounted factor. At time , player 1 and player 2 take actions ( and ) simultaneously. Player 1 obtains the reward . while player 2 obtains . The goal of the two players is to maximize their cumulative rewards respectively. For a policy pair , we can define the state-action value function as follows:
The optimal state-action value function is defined as
We denote the optimal policy pair if . Moreover, the Minimax Bellman operator for the Markov game is defined as
Thus . Let the feature map and be a given learning policy for players 1 and 2. Assume that is a sampled trajectory of states, actions and rewards obtained from the environment using policy . Let us recall the definition of the local linearization function class introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:
To solve this problem, we still adopt the projected stochastic semi-gradient iteration described by (6), that is,
(16) |
while redefining the stochastic semi-gradient estimator as
where and
(17) | ||||
Now we redefine the function class as a collection of all local linearization of at the initial point :
To analyze this method, for any , we redefine the set introduced in (3.1) by replacing the Bellman operator with the Minimax Bellman operator . Similar to (10), we still have a point if and only if
where , and has the same structure as except that the function is replaced by .
Unlike the neural temporal difference learning method, which aims at evaluating the state-action values of a fixed learning policy, the minimax Bellman operator significantly complicates the analysis. Let us redefine the feature covariance matrix with respect to the learning policy , that is
Let the actions satisfies . For each parameter pair , we define the action pair that satisfies
Then for any , the minimax feature covariance matrix is defined as follows:
Assumption 4.1.
For any , there exists a constant such that .
Note the original version of this assumption in (Zou et al., 2019) in fact requires a strict positive definite condition: . Under this additional assumption, (Zou et al., 2019) obtained an sample complexity for minimax Q-learning with linear function approximation. With the help of our subspace analysis technique, in this paper, we relax it to the positive semi-definiteness (). Now we are ready to state our result for minimax neural Q-learning.
Theorem 4.2.
Suppose Assumptions 3.1, 3.2 and 4.1 hold. We set and the learning rate . If the feature map for each state-action pair and the network width , then the output of neural minimax Q-learning Algorithm 3 satisfies
with probability at least , where is the mixing time of Markov chain in Assumption 3.2, and are universal constants.
Theorem 4.2 establishes a finite-time analysis of -sample complexity for minimax neural Q-learning in terms of the function class . For a more specific description and theorem proof, see Appendix C. To the best of our knowledge, this is the first analysis of minimax Q-learning with neural network function approximation, characterized by a complexity bound of .
5 Experiments
Finally, we construct several experiments over the OpenAI Gym (Brockman et al., 2016) tasks and validate our theoretical findings. We consider a two-layer neural network, as follows:
where is the ELU activation in this section. Details regarding the initialization and iteration methods for the parameters can be found in Section 2. For all experiments, we generate samples based on a prescribed $\epsilon$-greedy policy with . To prevent redundancy in the features , we employ one-hot encoding for discrete state-action spaces and implement a fixed grid discretization for continuous spaces. When both and belong to the same one-hot encoding or grid cell, we treat them as the same sample point. Our investigation into the impact of network width on the TD learning algorithm is conducted from two perspectives: (i) examining whether the network width is correlated with the TD error, and (ii) exploring the existence of constants and that satisfy Assumption 3.1.
The four subfigures in Figure 1 represent two types of environments: one with a discrete state space and the other with a continuous state space. The first two subfigures depict the convergence performance of the TD algorithm at different network widths. We generate 2,000 sample points and run for 500 epochs. Notably, as the parameter increases, the TD algorithm demonstrates faster convergence, resulting in smaller final TD errors. The latter two subfigures illustrate the existence of and . Specifically, we compute the largest non-zero singular value and smallest non-zero singular value of the matrix . To mitigate the absolute magnitude of , we introduce the ratio as a metric to validate Assumption 3.1. It can be observed that the value of approaches a constant as increases for all cases, providing empirical support for the validity of the assumption.
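The singular-value check described above can be sketched as follows. This is an assumption-laden illustration, not the experimental code: the random unit-norm inputs `X`, the two-layer ELU gradient features, the $1/\sqrt{m}$ scaling, and the choice of ratio (largest over smallest non-zero singular value) are all hypothetical.

```python
import numpy as np

def elu_grad(z):
    # Derivative of the ELU activation.
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))

def singular_value_ratio(m, X, rng):
    """Ratio of largest to smallest non-zero singular value of the
    empirical gradient-feature covariance at a random initialization."""
    n, d = X.shape
    W = rng.normal(size=(m, d))              # hidden-layer weights at initialization
    b = rng.choice([-1.0, 1.0], size=m)      # fixed output weights
    # Row i: flattened gradient of the network output w.r.t. W at input X[i].
    G = np.stack([np.outer(b * elu_grad(W @ x), x).ravel() for x in X])
    G = G / np.sqrt(m)
    # Non-zero singular values of G^T G / n are the squared singular values of G over n.
    s = np.linalg.svd(G, compute_uv=False) ** 2 / n
    s = s[s > 1e-10 * s.max()]
    return s.max() / s.min()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # hypothetical unit-norm features
ratios = {m: singular_value_ratio(m, X, rng) for m in (16, 64, 256)}
```

Plotting `ratios` against the width `m` gives a curve of the same kind as the latter two subfigures, although the numbers depend on the assumed inputs and initialization.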
6 Conclusion
We study the finite-time analysis of the TD and Q-learning methods with neural network approximation, where the state-action pairs are generated by a given policy under Markovian sampling. In addition to establishing convergence to the true action-value function up to an inevitable function approximation error, we introduce an improved analysis technique that yields an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity for the neural TD and Q-learning methods, improving on the existing $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity. For future work, it would also be interesting to investigate whether the proposed technique can improve the current complexity estimates of actor-critic methods, which are partially built upon neural TD methods.
7 Acknowledgements
Dr. Zaiwen Wen is supported in part by the NSFC grant 12331010. Dr. Junyu Zhang is supported in part by the MOE AcRF grant A-0009530-05-00.
References
- Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019a.
- Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.
- Barakat et al. (2022) Barakat, A., Bianchi, P., and Lehmann, J. Analysis of a target-based actor-critic algorithm with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 991–1040. PMLR, 2022.
- Bertsekas (2012) Bertsekas, D. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
- Bhandari et al. (2018) Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory, pp. 1691–1692. PMLR, 2018.
- Borkar (2009) Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
- Bowling & Veloso (2001) Bowling, M. and Veloso, M. Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pp. 1021–1026. Citeseer, 2001.
- Boyan (2002) Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine learning, 49(2):233–246, 2002.
- Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996.
- Brandfonbrener & Bruna (2019) Brandfonbrener, D. and Bruna, J. Geometric insights into the convergence of nonlinear td learning. arXiv preprint arXiv:1905.12185, 2019.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Cai et al. (2023) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. Neural temporal difference and q learning provably converge to global optima. Mathematics of Operations Research, 2023.
- Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
- Cao & Gu (2020) Cao, Y. and Gu, Q. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3349–3356, 2020.
- Cayci et al. (2023) Cayci, S., Satpathi, S., He, N., and Srikant, R. Sample complexity and overparameterization bounds for temporal difference learning with neural network approximation. IEEE Transactions on Automatic Control, 2023.
- Dalal et al. (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Fan et al. (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep q-learning. In Learning for dynamics and control, pp. 486–489. PMLR, 2020.
- Fujimoto et al. (2018) Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
- Godfrey (2019) Godfrey, L. B. An evaluation of parametric activation functions for deep learning. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 3006–3011. IEEE, 2019.
- Jaakkola et al. (1993) Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Ke et al. (2023) Ke, Z., Wen, Z., and Zhang, J. Provably efficient gauss-newton temporal difference learning method with function approximation. arXiv preprint arXiv:2302.13087, 2023.
- Konda & Tsitsiklis (1999) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
- Kostrikov et al. (2021) Kostrikov, I., Fergus, R., Tompson, J., and Nachum, O. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.
- Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of lstd. In ICML-27th International Conference on Machine Learning, pp. 615–622, 2010.
- Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Littman (1994) Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157–163. Elsevier, 1994.
- Liu et al. (2020a) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient td algorithms. arXiv preprint arXiv:2006.14364, 2020a.
- Liu et al. (2020b) Liu, C., Zhu, L., and Belkin, M. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020b.
- Maei et al. (2009) Maei, H., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. Advances in neural information processing systems, 22, 2009.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Perkins & Pendrith (2002) Perkins, T. J. and Pendrith, M. D. On the existence of fixed points for q-learning and sarsa in partially observable domains. In ICML, pp. 490–497, 2002.
- Perolat et al. (2018) Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pp. 919–928. PMLR, 2018.
- Prashanth et al. (2014) Prashanth, L., Korda, N., and Munos, R. Fast lstd using stochastic approximation: Finite time analysis and application to traffic control. In Joint European conference on machine learning and knowledge discovery in databases, pp. 66–81. Springer, 2014.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015.
- Sun et al. (2022) Sun, T., Li, D., and Wang, B. Finite-time analysis of adaptive temporal difference learning with deep neural networks. Advances in Neural Information Processing Systems, 35:19592–19604, 2022.
- Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
- Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning, pp. 993–1000, 2009a.
- Sutton et al. (2009b) Sutton, R. S., Szepesvári, C., and Maei, H. R. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in neural information processing systems, 21(21):1609–1616, 2009b.
- Tagorti & Scherrer (2015) Tagorti, M. and Scherrer, B. On the rate of convergence and error bounds for LSTD(λ). In International Conference on Machine Learning, pp. 1521–1529. PMLR, 2015.
- Tesauro et al. (1995) Tesauro, G. et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
- Tian et al. (2022) Tian, H., Paschalidis, I., and Olshevsky, A. On the performance of temporal difference learning with neural networks. In The Eleventh International Conference on Learning Representations, 2022.
- Touati et al. (2018) Touati, A., Bacon, P.-L., Precup, D., and Vincent, P. Convergent tree backup and retrace with function approximation. In International Conference on Machine Learning, pp. 4955–4964. PMLR, 2018.
- Tsitsiklis & Van Roy (1996) Tsitsiklis, J. and Van Roy, B. Analysis of temporal-difference learning with function approximation. Advances in neural information processing systems, 9, 1996.
- Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
- Wu et al. (2019) Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
- Xu & Gu (2020) Xu, P. and Gu, Q. A finite-time analysis of q-learning with neural network function approximation. In International Conference on Machine Learning, pp. 10555–10565. PMLR, 2020.
- Zou et al. (2019) Zou, S., Xu, T., and Liang, Y. Finite-sample analysis for sarsa with linear function approximation. Advances in neural information processing systems, 32, 2019.
Appendix A Details of Section 3
A.1 Proof of (14)
A.2 Proof of Theorem 3.6
Proof.
Recall the definition of the semi-gradient in (6). We denote as its expectation. Let and also be the corresponding semi-gradients based on the linearized function , that is,
where
To simplify the notation, let and . Recall the definition of the range space and the kernel space . By Proposition 3.4, we know that for any vector thus for any feature map and parameter . Then we can decompose as
where (i) follows
and
Recall the stationarity condition (11), for any ,
where (i) is the same as the proof in Section A.1. Therefore,
(18)
where (i) follows for any .
Next, we analyze the upper bounds of , , and item by item. To simplify the notation, let be universal constants in this section. We set and . By Lemma D.5, we have
(19)
and
(20)
with probability at least , where (i) follows and
Thus (19) and (20) provide upper bounds on and , respectively. The next lemma provides an estimate of the Markov sampling error.
Lemma A.1.
Suppose the learning rate sequence is non-increasing. Under Assumption 3.2, it holds that
(21)
for any fixed , where
is the mixing time of the Markov chain .
Proof.
We adopt the proof framework of Lemma 6.2 in Xu & Gu (2020). However, differences in the neural network setting change the norms of the gradients and parameters, which leads to slight differences in the result. We therefore have
∎
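To make the role of the mixing time concrete, the following minimal sketch computes, for a small hand-made two-state chain, the first step at which the worst-case total-variation distance to the stationary distribution falls below a threshold. The chain `P`, the threshold `eps`, and all function names are illustrative assumptions; the mixing time in Lemma A.1 is defined through Assumption 3.2 and may use a different but related criterion.

```python
def step(dist, P):
    """Advance a distribution one step: row vector times transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv_distance(p, q):
    """Total-variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def stationary(P, iters=10_000):
    """Approximate the stationary distribution by power iteration (assumes ergodicity)."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        dist = step(dist, P)
    return dist


def mixing_time(P, eps, max_t=10_000):
    """Smallest t such that every start state is within eps (in TV) of stationarity."""
    n = len(P)
    mu = stationary(P)
    # One point-mass distribution per possible start state.
    dists = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    for t in range(max_t + 1):
        if max(tv_distance(d, mu) for d in dists) <= eps:
            return t
        dists = [step(d, P) for d in dists]
    raise RuntimeError("chain did not mix within max_t steps")


# Hypothetical two-state chain; its second eigenvalue is 0.7, so the
# distance to stationarity decays geometrically at rate 0.7.
P = [[0.9, 0.1],
     [0.2, 0.8]]
tau = mixing_time(P, eps=0.01)
```

Because the distance decays geometrically, shrinking `eps` increases the mixing time only logarithmically, which is why mixing-time factors typically enter finite-time bounds as logarithmic terms.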
Looking back at the definitions of and , and the discussion in Section 3.2, we derive Lemmas A.2 and A.3 to estimate .
Lemma A.2.
Let denote the minimum nonzero singular value of . For any , we have
Proof.
Define . To begin with, the Bellman operator is a -contraction in the -norm, since is the stationary distribution of corresponding to the policy . In detail, consider
(23)
where (i) follows from the fact that and have the same stationary distribution. To simplify the notation, we denote as in the proof of this lemma. Then we compute
(24)
where (i) follows from the Cauchy-Schwarz inequality, (ii) follows from (23), and (iii) follows from and Lemma A.2, which provides -strong convexity. This completes the proof of Lemma A.3. ∎
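The contraction property used in this proof can be checked numerically on a small example. The sketch below is illustrative only: it builds a random ergodic transition matrix over a handful of abstract state-action indices (standing in for the chain induced by the policy), approximates its stationary distribution by power iteration, and compares the weighted norm of the difference of two Bellman images with the discounted weighted norm of the original difference. The names `bellman`, `mu_norm`, and `contraction_check`, as well as the problem sizes, are assumptions made for this example.

```python
import math
import random


def stationary(P, iters=5000):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def bellman(Q, r, P, gamma):
    """Policy-evaluation Bellman operator: (TQ)(x) = r(x) + gamma * E[Q(x')]."""
    n = len(P)
    return [r[x] + gamma * sum(P[x][y] * Q[y] for y in range(n)) for x in range(n)]


def mu_norm(v, mu):
    """Weighted Euclidean norm with weights given by the distribution mu."""
    return math.sqrt(sum(m * x * x for m, x in zip(mu, v)))


def contraction_check(seed, n=4, gamma=0.9):
    """Return (||TQ1 - TQ2||_mu, gamma * ||Q1 - Q2||_mu) for a random instance."""
    rng = random.Random(seed)
    P = []
    for _ in range(n):
        row = [rng.uniform(0.1, 1.0) for _ in range(n)]  # strictly positive => ergodic
        total = sum(row)
        P.append([p / total for p in row])
    mu = stationary(P)
    r = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    Q1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    Q2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    TQ1, TQ2 = bellman(Q1, r, P, gamma), bellman(Q2, r, P, gamma)
    lhs = mu_norm([a - b for a, b in zip(TQ1, TQ2)], mu)
    rhs = gamma * mu_norm([a - b for a, b in zip(Q1, Q2)], mu)
    return lhs, rhs


lhs, rhs = contraction_check(seed=0)
```

The inequality `lhs <= rhs` relies on the weights being the stationary distribution of the same chain that appears inside the operator; with arbitrary weights it can fail, which is the reason the proof insists on matching stationary distributions.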
A.3 Proof of Theorem 3.7
Proof.
Let . To simplify the notation, we denote as in this subsection. Note that
By Lemma D.4, we have
(25)
with probability at least . Recall that is the fixed point of and is the fixed point of . We define the -norm . Thus
where (i) is due to the properties of the fixed point, and (ii) holds because is -contractive in the -norm. This further means that
(26)
Plugging (25) and (26) into (A.3) and using Theorem 3.6, we complete the proof. ∎
Appendix B Convergence Results of Neural Q-learning
B.1 Neural Q-Learning Algorithm
For neural Q-learning, let us redefine some of the above notation. Let the optimal Q-function be for all state-action pairs ; then the optimal sequence of actions that maximizes the expected cumulative reward follows . Therefore, to obtain a near-optimal policy, it is sufficient to find some that approximates well. Define the Bellman optimality operator as
for any . We retain the definition of the local linearization function class introduced in (7). Consider the MSPBE minimization problem with multi-layer neural network approximation:
Then the projected neural Q-learning algorithm can be written as follows:
(27)
where
(28)
The algorithm details can be described by Algorithm 2 as follows.
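As an illustration of the structure of a projected Q-learning step of the kind described by (27) and (28), the following minimal sketch performs one semi-gradient update followed by a Euclidean projection onto a ball around the initial parameters. It is a simplification under stated assumptions: the Q-function is linear in a hypothetical feature vector (standing in for the local linearization of the network), the constraint set is assumed to be a ball of radius `radius` centered at `theta0`, and all names are chosen for this example rather than taken from Algorithm 2.

```python
import math


def project_ball(theta, center, radius):
    """Euclidean projection of theta onto the ball {x : ||x - center|| <= radius}."""
    diff = [t - c for t, c in zip(theta, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(theta)
    scale = radius / norm
    return [c + scale * d for c, d in zip(center, diff)]


def q_value(theta, phi):
    """Linear stand-in for the Q-function: inner product of parameters and features."""
    return sum(t * p for t, p in zip(theta, phi))


def projected_q_step(theta, theta0, radius, lr, gamma, phi_sa, reward, next_phis):
    """One projected semi-gradient Q-learning step.

    next_phis holds one feature vector per action available in the next state;
    the max over them plays the role of the Bellman optimality operator.
    """
    target = reward + gamma * max(q_value(theta, p) for p in next_phis)
    td_error = q_value(theta, phi_sa) - target
    # Semi-gradient: the target is treated as a constant when differentiating.
    grad = [td_error * p for p in phi_sa]
    stepped = [t - lr * g for t, g in zip(theta, grad)]
    return project_ball(stepped, theta0, radius)


theta0 = [0.0, 0.0]
theta1 = projected_q_step(theta0, theta0, radius=0.5, lr=1.0, gamma=0.9,
                          phi_sa=[1.0, 0.0], reward=2.0,
                          next_phis=[[0.0, 1.0], [1.0, 0.0]])
```

In this toy call the unprojected step would move the first coordinate to 2, and the projection pulls the iterate back to the boundary of the ball, which is how the projection keeps the parameters close to their initialization.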
B.2 Global Convergence
Similar to Section 3, we define the function class as a collection of all local linearization of at the initial point :
Let , and has the same structure as except that the function is replaced by . The stationary point satisfies for neural Q-learning. We redefine by replacing the Bellman operator in Section 3 with the Bellman optimality operator . A point if and only if
The maximum operator introduced by the Bellman optimality operator significantly complicates the analysis. We retain the definition of in (12), and we define as follows:
(29)
where . To facilitate the analysis of neural Q-learning, we further assume the following regularity condition introduced by Xu & Gu (2020).
Assumption B.1.
such that for any and .
The original version of this assumption comes from Xu & Gu (2020) and requires a strict positive definiteness condition: . Under this additional assumption, Xu & Gu (2020) obtained an sample complexity for neural Q-learning. A similar complexity result was also derived by Cai et al. (2023) under a similar regularity condition on the learning policy . Here, we relax the condition to positive semi-definiteness () and provide a convergence result for neural Q-learning; see Theorem B.2.
Theorem B.2.
Suppose Assumptions 3.1, 3.2 and B.1 hold. We set and the learning rate . If the feature map for each state-action pair and the network width , then the output of the neural Q-learning algorithm (i.e., (27)) satisfies
with probability at least , where is the mixing time of Markov chain in Assumption 3.2, and are universal constants.
Proof.
With a slight abuse of notation, we redefine
where
Let and . Similarly, (18) can be derived in neural Q-learning. To estimate the terms , we can apply Lemmas D.5 and A.1. However, due to the utilization of the Bellman optimality operator in neural Q-learning, some modifications based on Lemma A.3 are required.
Proof.
To simplify the notation, we denote as in the proof of this lemma. Define . Then we have
For the second term of (B.2), we consider
(32)
where (i) and (iii) follow from the fact that is linear, and (ii) follows from Assumption B.1. Therefore,
where (i) follows from (32), and (ii) follows from and Lemma A.2, which provides -strong convexity. ∎
Now given , we can deduce that
with probability at least , where are universal constants in this subsection. Choosing yields results similar to (A.2). We can therefore apply the techniques of Section A.2 to complete the remaining part of the proof of Theorem B.2.
∎
Appendix C Details of Section 4
We formally describe the minimax neural Q-learning method in Algorithm 3.
C.1 Proof of Theorem 4.2
The proof of Theorem 4.2 is similar to Sections A.2 and B.2. However, due to the difference in Bellman operators, we still need to make some modifications to Lemma A.3 or Lemma B.3. See Lemma C.1.
Lemma C.1.
Under Assumption 4.1, we have that
(33)
Proof.
To simplify the notation, we denote as in the proof of this lemma. Define . Define the sets and . For each ,
and
In the same way, for each ,
Therefore,
(34)
By Assumption 4.1, we compute
where (i) is due to (34), and (ii) is due to Assumption 4.1. Similar to Lemma B.3, we can also obtain (B.2) in this lemma. By substituting (C.1) into (B.2), the proof can be completed.
∎
Now we are ready to prove Theorem 4.2. With a slight abuse of notation, we redefine
where
Let and . After redefining the corresponding notation, we can similarly derive (18) and adopt the associated lemmas. Due to the introduction of the additional Assumption 4.1, we provide Lemma C.1, ensuring that terms can be estimated. The remainder of the proof is entirely analogous to Sections A.2 and B.2. Thus, we conclude the proof.
Appendix D Supporting Lemmas for Multi-layer Neural Network
Recalling the definition of the parameterized Q-function, we present the following lemmas related to neural network functions, which play a crucial role in illustrating the main results of our paper. mentioned below are universal constants.
Lemma D.1.
For any , we have
Proof.
Lemma D.2.
For any , we have
Lemma D.3.
For any and , we have
Proof.
By Lemma D.2, we have . Recall the definition of the parameterized Q-function:
where each element of is generated from a uniform distribution over . For each , by Hoeffding inequality, we have
where (i) follows Lemma D.2. Substituting , we get
Now, by the union bound, if we set , then
which completes the proof. ∎
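The concentration step in this proof can be illustrated with a small simulation. The sketch below is a hedged example rather than a reproduction of the lemma: it assumes the random output weights are independent signs drawn uniformly from {-1, +1}, fixes a hypothetical vector `x` in place of the last hidden layer, and compares the empirical tail probability of the signed sum with the two-sided Hoeffding bound 2 exp(-t^2 / (2 ||x||^2)). The function names and the chosen sizes are assumptions for this example.

```python
import math
import random


def hoeffding_bound(x, t):
    """Two-sided Hoeffding bound for sum_r b_r * x_r with independent signs b_r."""
    return 2.0 * math.exp(-t * t / (2.0 * sum(v * v for v in x)))


def empirical_tail(x, t, trials, seed=0):
    """Fraction of trials in which |sum_r b_r * x_r| >= t under random signs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(v if rng.random() < 0.5 else -v for v in x)
        if abs(s) >= t:
            hits += 1
    return hits / trials


# Hypothetical hidden-layer output with 64 unit entries, so ||x||^2 = 64.
x = [1.0] * 64
t = 16.0
bound = hoeffding_bound(x, t)           # equals 2 * exp(-2)
freq = empirical_tail(x, t, trials=2000)
```

Applying such a bound to each state-action pair and then taking a union bound over all pairs is what converts the single-input tail estimate into a statement that holds uniformly, at the cost of a logarithmic factor in the number of pairs.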
Lemma D.4.
Denote as the Hessian matrix of . Then for all , we have that
and
Proof.
Lemma D.5.
Let with the radius satisfying . Then for all and in the neural temporal difference learning algorithm (Algorithm 1), it holds that
with probability at least over the randomness of the initial point, and holds with probability at least .
Proof.
Lemma D.5 provides the upper bounds on and in Section A.2. As discussed in Section 3.1, for a finite MDP, the Gram matrix of the L-layer neural network function is positive definite and has a minimum eigenvalue of when the network width is sufficiently large. This, in fact, serves as an upper bound for in Assumption 3.1. Further details are provided in Remark D.6.
Remark D.6.
In this special case, we assume that both the state space and the action space are finite. Let and represent the dimensions of the state space and action space, respectively. For simplicity of notation, we view as an column vector, with being a multi-index arranged in lexicographical order. Let and be an -dimensional diagonal matrix, whose -th diagonal entry is , and the order of in is the same as . Denote as the Jacobian matrix of and . Thus we can rewrite . Notice that differs from the Gram matrix (Jacot et al., 2018; Du et al., 2018, 2019; Cao & Gu, 2019; Allen-Zhu et al., 2019b) used for deep neural networks. To derive the -weighted Gram matrix , we provide the following definition.
Definition D.7.
There is a large body of work (Jacot et al., 2018; Du et al., 2018, 2019; Cao & Gu, 2019; Allen-Zhu et al., 2019b) exploring the positive definiteness of . Suppose that for all pairs , and . Theorem 1 and Proposition 2 in Jacot et al. (2018) show that for an -layer neural network with Gaussian initialization parameters,
That is, under the NTK regime, the -weighted Gram matrix converges to when is sufficiently large. Let . For any , there exists such that if , we have
This signifies that if the network width , then we have , thereby substantiating our claim.
Appendix E Additional Notes on the Experiments in Section 5
In this section, we further discuss the experimental setup introduced in Section 5. As mentioned there, our experiments mainly test the following two aspects: (1) how the network width m affects the final error of the algorithm (first two subfigures in Figure 1); (2) the minimum nonzero singular value in Assumption 3.3 (latter two subfigures in Figure 1).
For point (1), we first generate 2000 samples according to a given policy to imitate the Markov process of Algorithm 1. A two-layer neural network with ELU activation is introduced, and the parameters are initialized using Algorithm 1. We set the initial learning rate at 0.001 with linear decay (per epoch) and a batch size of 100. Notably, as the parameter increases, the TD algorithm demonstrates smaller final TD errors.
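The training schedule described above can be sketched as follows. This is a minimal illustration under assumptions: the text specifies 2000 samples, a batch size of 100, and an initial learning rate of 0.001 with linear decay per epoch, but it does not state the number of epochs or the exact form of the decay, so `num_epochs` and the formula in `linear_decay` are hypothetical choices made for this example.

```python
def linear_decay(lr0, epoch, num_epochs):
    """Learning rate for a given epoch, decaying linearly from lr0 towards zero.

    The exact decay rule is an assumption; the text only says "linear decay (per epoch)".
    """
    return lr0 * (1.0 - epoch / num_epochs)


def batch_ranges(num_samples, batch_size):
    """Index ranges (start, end) that split the sample trajectory into consecutive batches."""
    return [(start, min(start + batch_size, num_samples))
            for start in range(0, num_samples, batch_size)]


NUM_SAMPLES = 2000   # from the text
BATCH_SIZE = 100     # from the text
LR0 = 0.001          # from the text
NUM_EPOCHS = 10      # hypothetical; not stated in the text

ranges = batch_ranges(NUM_SAMPLES, BATCH_SIZE)
schedule = [linear_decay(LR0, e, NUM_EPOCHS) for e in range(NUM_EPOCHS)]
```

With these values each epoch consists of 20 consecutive batches, and keeping the batches in trajectory order preserves the Markovian dependence between samples that Algorithm 1 is meant to imitate.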
For point (2), our experiments are based on three main points. First, note that the norm of the feature map and the random initialization of the parameters affect the scaling of the gradient norm w.r.t. . Thus we employ the ratio to characterize the minimum nonzero singular value in order to eliminate the impact of numerical scaling. Second, we set varying network widths to verify the existence of . Finally, it is difficult to obtain the joint distribution of directly with a fixed learning policy. However, we have that . To avoid the effects of sampling randomness, we estimate from different samples. The experiments demonstrate that the minimum nonzero singular value converges to a constant as increases.