License: arXiv.org perpetual non-exclusive license
arXiv:2310.09714v2 [cs.RO] 07 Mar 2024

Enhancing Task Performance of Learned Simplified Models via Reinforcement Learning

* Toyota Research Institute provided funds to support this work.
¹ The authors are with the GRASP Laboratory, University of Pennsylvania, Philadelphia, PA 19104, USA {xuanhien, posa}@seas.upenn.edu

Hien Bui¹ and Michael Posa¹
Abstract

In contact-rich tasks, the hybrid, multi-modal nature of contact dynamics poses great challenges in model representation, planning, and control. Recent efforts have attempted to address these challenges via data-driven methods, learning dynamical models in combination with model predictive control. Those methods, while effective, rely solely on minimizing forward prediction error in the hope that it yields better task performance with MPC controllers. This weak correlation between prediction error and task performance can result in data inefficiency and can limit overall performance. In response, we propose a novel strategy: using a policy gradient algorithm to find a simplified dynamics model that explicitly maximizes task performance. Specifically, we parameterize the stochastic policy as the perturbed output of the MPC controller, so that the learned model representation is directly tied to the policy and hence to task performance. We apply the proposed method to contact-rich tasks where a three-fingered robotic hand manipulates previously unknown objects. Our method significantly enhances task success rate by up to 15% in manipulating diverse objects compared to the existing method while sustaining data efficiency. Our method can solve some tasks with success rates of 70% or higher using under 30 minutes of data. All videos and code are available at https://sites.google.com/view/lcs-rl.

I Introduction

Figure 1: Our proposed framework for learning simplified dynamic models to solve contact-rich manipulation tasks in low-data regimes. The framework is an iterative learning loop with two main components: a stochastic policy and a policy optimizer. Top panel: the stochastic policy is constructed from the learned dynamic model under an MPC scheme plus Gaussian noise. Bottom panel: the policy parameters are optimized on the collected on-policy data by combining the PPO loss and the prediction loss.

In many robotics tasks, such as dexterous manipulation and locomotion, robots frequently need to make and break contact with the environment. Yet, finding explicit models and policies that can exploit the hybrid complex interaction of the robot with its environment to solve the tasks remains a challenge. Some works in model-based control [1, 2, 3, 4] have attempted to explicitly identify contact modes and plan the contact sequences. However, these approaches face scalability challenges as the number of contact modes increases.

Recent data-driven methods have made significant advances in tackling that scalability issue, broadly offering two primary strategies. Modern model-free reinforcement learning (RL) directly parameterizes control policies with deep neural networks, then iteratively improves the policies through large-scale trial and error [5, 6, 7]. However, because these methods are data-inefficient, carrying out experiments on real robotic systems is resource-intensive and time-consuming. In contrast, a large portion of model-based RL [8, 9, 10, 11, 12, 13, 14] leverages the expressive power of deep neural networks to learn intricate dynamic models. The learned models are subsequently employed for trajectory planning or predictive control, typically via random shooting techniques. Despite improving data efficiency compared to model-free RL, model-based RL remains data-intensive because conventional methods for learning deep dynamic models struggle to capture stiff and multi-modal contact dynamics [15, 16]. On the other hand, some works [17, 18, 19] adopt simpler model representations, such as time-varying linear or Gaussian Process models, and demonstrate good performance with limited data.

The most recent work [20] shows great performance with only a few minutes of data by learning a linear complementarity system (LCS), a piecewise-affine and reduced-order representation of multi-contact dynamics, and combining it with an MPC planner. However, the two main building blocks of this work, fitting the dynamic model and planning with the learned model, appear as two decoupled optimization problems repeated over many cycles of learning. More specifically, better forward prediction capability of the learned dynamic model is necessary but might not be sufficient to achieve better task performance, which limits both data efficiency and task performance. This issue is known as objective mismatch [21].

This paper presents LCS-RL, a low-dimensional RL framework, to address the above issue, aiming to further enhance task performance and data efficiency. Particularly, the proposed framework directly bridges the dynamic model learning part to task performance optimization via a policy gradient algorithm.

I-A Contributions

  1. We present LCS-RL, a novel framework that leverages the combination of RL and simple multi-contact models for solving contact-rich tasks. Specifically, our framework applies a reinforcement learning algorithm to directly maximize the task performance of simplified models in combination with the MPC planner.

  2. We show that the proposed method consistently achieves higher task performance, up to 15% in three-fingered robot manipulation tasks with various objects compared to the prior methods [20, 10]. In addition, our method is data efficient as it can solve some dexterous manipulation tasks with 70% to 96% success rates using just under 30 minutes of data.

  3. We also demonstrate that the learned LCS model of one object can be transferred to other objects, drastically improving data efficiency.

II Related Work

II-A Differentiable MPC and Reinforcement Learning

Control policies formulated as differentiable MPC problems and optimized using backpropagation, either through imitation learning or RL loss, have been extensively studied [22, 23, 24, 25, 26, 27, 28, 29, 30]. Amos et al. [23], Xu et al. [30], and Jin et al. [26, 27] propose using either a differential set of Karush–Kuhn–Tucker (KKT) or Pontryagin’s maximum principle (PMP) conditions to ensure MPC problem differentiability, with limited experiments in low-dimensional tasks. Recent works [24, 25, 28] extend these methods to more complex robotics problems. Esfahani et al. [24] use a specialized Q-learning algorithm to learn MPC cost function parameters, effective in mobile robot tasks but critically requiring ground-truth dynamics. Wan et al. [28] focus on image-based tasks, introducing a differentiable sampling-based MPC policy to learn latent dynamics models, encoder, and Q-value predictors concurrently, but requiring substantial data. For data-efficient system identification and control in state-based contact-rich tasks, Saxena et al. [25] are the most relevant. They parameterize the dynamics model using a switching linear dynamical model with a contact-mode prediction function and construct a differentiable feedback controller (LQR), optimizing both its cost matrices and dynamics model parameters to match expert demonstrations.

II-B Main Baseline for Comparisons

Jin et al. [20] suggest that there exists a simple model that can adequately capture task-relevant contact dynamics, thereby enabling both high performance and real-time control for contact-rich manipulation. In particular, the authors propose representing the dynamics with a reduced-order hybrid model and planning with a model predictive controller. Since the model is far simpler than deep neural networks, much less data is required for model learning. Their framework achieves high task performance with under 5 minutes of data. In this paper, we compare the task performance and data efficiency of our proposed method against this baseline in some dexterous manipulation tasks.

III Backgrounds

III-A Linear Complementarity Systems

A discrete-time linear complementarity system (LCS) is a piecewise-affine system, where the state evolution is governed by linear dynamics in (1a) and a linear complementarity problem (LCP) in (1b).

$$\boldsymbol{x}_{t+1} = A\boldsymbol{x}_{t} + B\boldsymbol{u}_{t} + C\boldsymbol{\lambda}_{t} + \boldsymbol{d}, \tag{1a}$$

$$0 \leq \boldsymbol{\lambda}_{t} \perp D\boldsymbol{x}_{t} + E\boldsymbol{u}_{t} + F\boldsymbol{\lambda}_{t} + \boldsymbol{c} \geq 0. \tag{1b}$$

Here, $\boldsymbol{x}_{t} \in \mathbb{R}^{n_x}$, $\boldsymbol{u}_{t} \in \mathbb{R}^{n_u}$, and $\boldsymbol{\lambda}_{t} \in \mathbb{R}^{n_\lambda}$ are respectively the system state, action, and complementarity variable at time step $t$, and $\boldsymbol{x}_{t+1} \in \mathbb{R}^{n_x}$ is the system state at the next time step $t+1$. The symbol $\perp$ denotes zero inner product, i.e., orthogonality of the two vectors. The matrix $A \in \mathbb{R}^{n_x \times n_x}$ defines the autonomous dynamics and the matrix $B \in \mathbb{R}^{n_x \times n_u}$ captures the effect of actions on states. The matrix $C \in \mathbb{R}^{n_x \times n_\lambda}$ and the vector $\boldsymbol{d} \in \mathbb{R}^{n_x}$ describe the effect of the contact forces and the constant forces acting on the state, respectively. The other matrices $D \in \mathbb{R}^{n_\lambda \times n_x}$, $E \in \mathbb{R}^{n_\lambda \times n_u}$, $F \in \mathbb{R}^{n_\lambda \times n_\lambda}$ and the vector $\boldsymbol{c} \in \mathbb{R}^{n_\lambda}$ together capture the relationship between states, actions, and contact forces. LCS models are commonly used in modeling multi-contact robotics problems [31, 32].
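As a minimal sketch of how (1a) and (1b) interact, the snippet below simulates one LCS step in NumPy. The LCP solver is a plain projected Gauss-Seidel iteration chosen here only for brevity; it is an assumption of this example, not a method the paper specifies, and it converges only for restricted classes of $F$. The function names and the 1-D toy system are hypothetical.

```python
import numpy as np


def solve_lcp_pgs(F, q, iters=200):
    """Projected Gauss-Seidel for the LCP: 0 <= lam  perp  F @ lam + q >= 0.

    Assumes F has a strictly positive diagonal; convergence is only guaranteed
    for restricted classes of F (e.g. symmetric positive definite).
    """
    q = np.asarray(q, dtype=float)
    lam = np.zeros_like(q)
    for _ in range(iters):
        for i in range(q.size):
            # Row-i residual without the diagonal contribution of lam[i].
            r = q[i] + F[i] @ lam - F[i, i] * lam[i]
            lam[i] = max(0.0, -r / F[i, i])
    return lam


def lcs_step(x, u, A, B, C, d, D, E, F, c):
    """One step of (1): solve the LCP (1b) for lam, then apply the affine update (1a)."""
    lam = solve_lcp_pgs(F, D @ x + E @ u + c)
    x_next = A @ x + B @ u + C @ lam + d
    return x_next, lam


# Toy 1-D system (hypothetical): x is a height above the ground, u a commanded
# displacement, and lam the contact correction that keeps the next height >= 0.
one, zero = np.array([[1.0]]), np.array([0.0])
params = dict(A=one, B=one, C=one, d=zero, D=one, E=one, F=one, c=zero)

# Moving upward: the LCP returns lam = 0 (no contact), so the step is purely affine.
x_free, lam_free = lcs_step(np.array([0.5]), np.array([0.2]), **params)
# Commanding a move through the ground: lam > 0 switches the active affine piece.
x_hit, lam_hit = lcs_step(np.array([0.5]), np.array([-1.0]), **params)
```

The two calls land in different contact modes of the same parameter set, which is the piecewise-affine behavior described above.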

III-B Learning Linear Complementarity Systems

Given a data buffer $\mathcal{D}$ that contains state transitions $(\boldsymbol{x}_{t}, \boldsymbol{u}_{t}, \boldsymbol{x}_{t+1})$, we can learn all matrix and vector parameters of an LCS model $\boldsymbol{\Theta} = (A, B, C, \boldsymbol{d}, D, E, F, \boldsymbol{c})$ in (1) by gradient descent on the violation-based loss proposed by Jin et al. [33]:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}} = \min_{\boldsymbol{\lambda}_{t} \geq 0,\, \boldsymbol{\phi}_{t} \geq 0}\ & \frac{1}{2}\left\| A\boldsymbol{x}_{t} + B\boldsymbol{u}_{t} + C\boldsymbol{\lambda}_{t} + \boldsymbol{d} - \boldsymbol{x}_{t+1} \right\|^{2} \\
& + \frac{1}{\xi}\left( \boldsymbol{\lambda}_{t}^{\mathrm{T}} \boldsymbol{\phi}_{t} + \frac{1}{2\gamma}\left\| D\boldsymbol{x}_{t} + E\boldsymbol{u}_{t} + F\boldsymbol{\lambda}_{t} + \boldsymbol{c} - \boldsymbol{\phi}_{t} \right\|^{2} \right).
\end{aligned} \tag{2}
$$

Particularly, the loss $\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}$ is itself an optimization problem whose first and second terms specify the violation of the affine dynamics (1a) and the LCP constraint (1b), respectively. Under the condition $0 < \gamma \leq \sigma_{\min}(F + F^{\mathrm{T}})$, finding $\boldsymbol{\lambda}_{t}$ to minimize the second term is equivalent to directly solving the LCP in (1b) for $\boldsymbol{\lambda}_{t}$, but yields a better-conditioned landscape for $\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}$, thus enabling the identification of multi-modal and stiff dynamics. The hyper-parameter $\xi > 0$ balances the two terms of the loss $\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}$, and $\boldsymbol{\phi}_{t} \in \mathbb{R}^{n_\lambda}$ is a slack variable introduced for the complementarity equation. Full explanations of the loss formulation and its hyper-parameters can be found in [33].
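To make the two terms of (2) concrete, the following sketch evaluates the objective inside the minimization for a given pair $(\boldsymbol{\lambda}_t, \boldsymbol{\phi}_t)$. It does not perform the inner minimization itself. The function names and the 1-D toy values are assumptions made for illustration. When $\boldsymbol{\lambda}_t$ solves the LCP exactly and $\boldsymbol{\phi}_t$ equals the right-hand side of (1b), the second term vanishes.

```python
import numpy as np


def violation_objective(theta, x, u, x_next, lam, phi, xi, gamma):
    """Evaluate the two terms inside the min of (2) at a given (lam, phi)."""
    A, B, C, d, D, E, F, c = theta
    dyn_res = A @ x + B @ u + C @ lam + d - x_next   # residual of (1a)
    lcp_res = D @ x + E @ u + F @ lam + c - phi      # slack mismatch for (1b)
    dyn_term = 0.5 * dyn_res @ dyn_res
    lcp_term = (lam @ phi + lcp_res @ lcp_res / (2.0 * gamma)) / xi
    return dyn_term, lcp_term


def gamma_is_admissible(F, gamma):
    """Check the stated condition 0 < gamma <= sigma_min(F + F^T)."""
    sigma_min = np.linalg.svd(F + F.T, compute_uv=False).min()
    return 0.0 < gamma <= sigma_min


# Hypothetical 1-D LCS with all matrices equal to 1 and zero offsets.
one, zero = np.array([[1.0]]), np.array([0.0])
theta = (one, one, one, zero, one, one, one, zero)
x, u, x_next = np.array([0.5]), np.array([-1.0]), np.array([0.0])

# Exact LCP solution for this transition: lam = 0.5 and phi = x + u + lam = 0.
exact = violation_objective(theta, x, u, x_next,
                            np.array([0.5]), np.array([0.0]), xi=1.0, gamma=1.0)
# An inconsistent guess (lam = 0) leaves both terms strictly positive.
wrong = violation_objective(theta, x, u, x_next,
                            np.array([0.0]), np.array([0.0]), xi=1.0, gamma=1.0)
```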

As proven in [33], using the Envelope Theorem [34], we can analytically compute the gradient of the violation-based loss with respect to the LCS parameters, $\frac{d\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{d\boldsymbol{\Theta}}$, without differentiating through the solution of the optimization problem.
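As a worked illustration of this statement, suppose $(\boldsymbol{\lambda}_t^{*}, \boldsymbol{\phi}_t^{*})$ is the minimizer of the inner problem in (2), and treat $\gamma$ and $\xi$ as fixed constants (an assumption of this sketch; [33] gives the precise conditions). Because the constraints $\boldsymbol{\lambda}_t \geq 0$, $\boldsymbol{\phi}_t \geq 0$ do not depend on the parameters, the gradient is the partial derivative of the objective evaluated at the minimizer. The residual symbols $\boldsymbol{r}_{\mathrm{dyn}}$ and $\boldsymbol{r}_{\mathrm{lcp}}$ are introduced here only for readability.

```latex
\begin{aligned}
\boldsymbol{r}_{\mathrm{dyn}} &= A\boldsymbol{x}_t + B\boldsymbol{u}_t + C\boldsymbol{\lambda}_t^{*} + \boldsymbol{d} - \boldsymbol{x}_{t+1},
&\quad
\boldsymbol{r}_{\mathrm{lcp}} &= D\boldsymbol{x}_t + E\boldsymbol{u}_t + F\boldsymbol{\lambda}_t^{*} + \boldsymbol{c} - \boldsymbol{\phi}_t^{*}, \\[4pt]
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial A} &= \boldsymbol{r}_{\mathrm{dyn}}\,\boldsymbol{x}_t^{\mathrm{T}},
&\quad
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial D} &= \frac{1}{\xi\gamma}\,\boldsymbol{r}_{\mathrm{lcp}}\,\boldsymbol{x}_t^{\mathrm{T}}, \\
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial B} &= \boldsymbol{r}_{\mathrm{dyn}}\,\boldsymbol{u}_t^{\mathrm{T}},
&\quad
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial E} &= \frac{1}{\xi\gamma}\,\boldsymbol{r}_{\mathrm{lcp}}\,\boldsymbol{u}_t^{\mathrm{T}}, \\
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial C} &= \boldsymbol{r}_{\mathrm{dyn}}\,\boldsymbol{\lambda}_t^{*\mathrm{T}},
&\quad
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial F} &= \frac{1}{\xi\gamma}\,\boldsymbol{r}_{\mathrm{lcp}}\,\boldsymbol{\lambda}_t^{*\mathrm{T}}, \\
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial \boldsymbol{d}} &= \boldsymbol{r}_{\mathrm{dyn}},
&\quad
\frac{\partial \mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\Theta}}}{\partial \boldsymbol{c}} &= \frac{1}{\xi\gamma}\,\boldsymbol{r}_{\mathrm{lcp}}.
\end{aligned}
```

No derivative of $\boldsymbol{\lambda}_t^{*}$ or $\boldsymbol{\phi}_t^{*}$ with respect to the parameters appears, which is what makes the gradient cheap to compute.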

Parameterization: LCS-MPC policy parameters $\boldsymbol{\theta}$;
Hyper-parameters: the number of warm-up iterations $M$; the number of policy improvement steps $N_p$; and learning rate $\eta$;
Initialization: $\boldsymbol{\theta}_{0}$, empty data buffer $\mathcal{D}$;
for $k = 0, 1, \dots, M$ do
    Collect $N$ rollout trajectories by running the LCS-MPC stochastic policy $\pi_{\boldsymbol{\theta}_{k}}$ and add to $\mathcal{D}$
    for $i \leftarrow 0$ to $N_p$ do
        Using data in $\mathcal{D}$, compute the gradient $\frac{d\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\theta}_{i}}}{d\boldsymbol{\theta}_{i}}$
        Update $\boldsymbol{\theta}_{i+1} \leftarrow \boldsymbol{\theta}_{i} - \eta \frac{d\mathcal{L}_{\mathrm{vio}}^{\boldsymbol{\theta}_{i}}}{d\boldsymbol{\theta}_{i}}$
    end for
end for
Save the final parameters $\boldsymbol{\theta}_{M}$ for the main phase
Algorithm 1 Warm-up phase for optimizing LCS-MPC Policy

III-C Model Predictive Controller with LCS

Utilizing LCS to represent the dynamics model, one can construct a model predictive controller (MPC) as follows:

$$
\begin{aligned}
\min_{\boldsymbol{u}_{t}, \boldsymbol{u}_{t+1}, \dots, \boldsymbol{u}_{t+H-1}} \quad & \sum_{k=t}^{t+H-1} \mathcal{C}\left(\boldsymbol{x}_{k}, \boldsymbol{u}_{k}\right) + \mathcal{C}_{f}(\boldsymbol{x}_{t+H}) \\
\text{s.t.} \quad & \boldsymbol{x}_{k+1} = A\boldsymbol{x}_{k} + B\boldsymbol{u}_{k} + C\boldsymbol{\lambda}_{k} + \boldsymbol{d}, \\
& \mathbf{0} \leq \boldsymbol{\lambda}_{k} \perp D\boldsymbol{x}_{k} + E\boldsymbol{u}_{k} + F\boldsymbol{\lambda}_{k} + \boldsymbol{c} \geq \mathbf{0}, \\
& \boldsymbol{u}_{\min} \leq \boldsymbol{u}_{k} \leq \boldsymbol{u}_{\max}.
\end{aligned} \tag{3}
$$

where $H$ is the planning horizon, $\mathcal{C}$ and $\mathcal{C}_{f}$ are the path and terminal cost functions, and $\boldsymbol{u}_{\min}$ and $\boldsymbol{u}_{\max}$ are the lower and upper bounds on the actions.

Given any initial state $\boldsymbol{x}_{t}$, we solve the LCS-MPC in (3) to plan a sequence of optimal actions $\left[\boldsymbol{u}_{t}, \boldsymbol{u}_{t+1}, \dots, \boldsymbol{u}_{t+H-1}\right]$ that minimizes the total cost, then select the first action $\boldsymbol{u}_{t}$ to apply on the robot and repeat the process at every time step in a receding horizon manner. To efficiently solve the LCS-MPC, we employ the direct trajectory optimization method [35], which simultaneously searches over trajectories of $\boldsymbol{x}_{t:t+H}$, $\boldsymbol{u}_{t:t+H-1}$, and $\boldsymbol{\lambda}_{t:t+H-1}$, treating the LCS dynamics as a separate constraint for each time step. Also, we use the IPOPT solver [36] to solve such nonlinear problems.
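The receding horizon pattern described above can be sketched as follows. This is only an illustration of the plan, apply-first-action, replan loop: the planner here is a brute-force enumeration over a small discrete action set, swapped in for the direct trajectory optimization with IPOPT used in the paper, and the toy model has no contact. All function names and numeric values are hypothetical.

```python
import itertools

import numpy as np


def plan(x0, step, cost, terminal_cost, candidates, H):
    """Return the lowest-cost length-H action sequence found by enumeration."""
    best_seq, best_cost = None, np.inf
    for seq in itertools.product(candidates, repeat=H):
        x, total = x0, 0.0
        for u in seq:
            total += cost(x, u)   # path cost C(x_k, u_k)
            x = step(x, u)        # roll the model forward
        total += terminal_cost(x)  # terminal cost C_f(x_{t+H})
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq


def run_mpc(x0, step, cost, terminal_cost, candidates, H, n_steps):
    """Receding horizon loop: replan at every step, apply only the first action."""
    x, traj = x0, [x0]
    for _ in range(n_steps):
        u = plan(x, step, cost, terminal_cost, candidates, H)[0]
        x = step(x, u)
        traj.append(x)
    return traj


# Toy problem (assumed for illustration): drive a scalar state to 1.0.
def step(x, u):
    return x + u


def cost(x, u):
    return (x - 1.0) ** 2 + 0.01 * u ** 2


def terminal_cost(x):
    return 10.0 * (x - 1.0) ** 2


candidates = np.linspace(-0.5, 0.5, 5)  # bounded actions, u_min = -0.5, u_max = 0.5
traj = run_mpc(0.0, step, cost, terminal_cost, candidates, H=3, n_steps=5)
```

Replanning at every step lets the controller correct for model error, which matters when the planning model is a learned, simplified one.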

Parameterization: LCS-MPC policy parameters $\boldsymbol{\theta}$ and value function $\mathcal{V}_{\phi}$;
Hyper-parameters: total number of iterations $K$; the number of policy improvement steps $\bar{N_p}$; learning rate $\bar{\eta}$ for the policy optimization; loss weighting parameter $\beta$ in (7); discount factor $\gamma$ and parameter $\zeta$ for computing the advantage values;
Initialization: $\boldsymbol{\theta}_{M}$ obtained from the warm-up phase, $\phi_{0}$, and empty data buffer $\mathcal{D}$;
for $k = 0, 1, \dots, K$ do
    Empty buffer $\mathcal{D}$, collect $\bar{N}$ new trajectories by running the LCS-MPC policy $\pi_{\boldsymbol{\theta}_{k}}$, and add to $\mathcal{D}$
    For each trajectory, compute the generalized advantage value $\mathcal{A}_{t}$ as in (5), then the bootstrapped total reward $R_{t} = \mathcal{A}_{t} + \mathcal{V}_{\phi_{k}}(\boldsymbol{x}_{t})$
    for $i \leftarrow 0$ to $\bar{N_p}$ do
        Compute the combined loss gradient $\frac{d\mathcal{L}_{c}^{\boldsymbol{\theta}_{i}}}{d\boldsymbol{\theta}_{i}}$
        Update $\boldsymbol{\theta}_{i+1} \leftarrow \boldsymbol{\theta}_{i} - \bar{\eta} \frac{d\mathcal{L}_{c}^{\boldsymbol{\theta}_{i}}}{d\boldsymbol{\theta}_{i}}$
    end for
    Fit value function $\mathcal{V}_{\phi}$ by performing regression with mean-square error: $\phi_{k+1} = \arg\min_{\phi} \frac{1}{|\mathcal{D}| T} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \left( \mathcal{V}_{\phi}(\boldsymbol{x}_{t}) - R_{t} \right)^{2}$
end for
Algorithm 2 Main phase for optimizing LCS-MPC Policy using the PPO algorithm

III-D Proximal Policy Optimization

Proximal Policy Optimization (PPO) [5] is a policy gradient algorithm that focuses on determining how to make the most significant policy improvement using current data, all while avoiding excessive steps that could lead to performance collapse. Particularly, the PPO loss is defined as follows:

$$
\mathcal{L}^{\boldsymbol{\theta}}_{\text{PPO}} = -\frac{1}{|\mathcal{D}| T} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T}
\begin{cases}
\max\left(h_{t}^{\boldsymbol{\theta}},\, 1 - \epsilon\right) \mathcal{A}_{t} & \text{if } \mathcal{A}_{t} < 0 \\
\min\left(h_{t}^{\boldsymbol{\theta}},\, 1 + \epsilon\right) \mathcal{A}_{t} & \text{if } \mathcal{A}_{t} \geq 0,
\end{cases} \tag{4}
$$

where $\mathcal{D}$ is the data buffer, which consists of on-policy rollout trajectories $\tau$, $|\mathcal{D}|$ is its size, and $T$ is the length of the trajectories. There are two key quantities in (4): the ratio $h^{\boldsymbol{\theta}}_{t}=\frac{\pi_{\boldsymbol{\theta}}\left(\boldsymbol{u}_{t}\mid\boldsymbol{x}_{t}\right)}{\pi_{\boldsymbol{\theta}_{\text{old}}}\left(\boldsymbol{u}_{t}\mid\boldsymbol{x}_{t}\right)}$ and the truncated version of the generalized advantage function $\mathcal{A}_{t}$ [37]. Here, the ratio $h^{\boldsymbol{\theta}}_{t}$ indicates how much the new policy differs from the old one. The scalar $\epsilon$ defines the bounds of $h^{\boldsymbol{\theta}}_{t}$, which are often referred to as the trust region of policy improvements.
In addition, $\mathcal{A}_{t}$ guides the policy search by measuring whether a certain action is a good or bad decision in a given state. The detailed expression of $\mathcal{A}_{t}$ is given below:

$$
\begin{aligned}
&\mathcal{A}_{t}=\delta_{t}+(\gamma\zeta)\delta_{t+1}+(\gamma\zeta)^{2}\delta_{t+2}+\ldots+(\gamma\zeta)^{T-t-1}\delta_{T-1},\\
&\text{with }\delta_{t}=r_{t}+\gamma\mathcal{V}_{\phi}\left(\boldsymbol{x}_{t+1}\right)-\mathcal{V}_{\phi}(\boldsymbol{x}_{t}),
\end{aligned}
\tag{5}
$$

where $\gamma$ is the discount factor and $\zeta$ is a hyper-parameter that controls the bias-variance tradeoff of the estimate. The reward $r_{t}$ is obtained by executing action $\boldsymbol{u}_{t}$ at state $\boldsymbol{x}_{t}$. The learned value functions $\mathcal{V}_{\phi}(\boldsymbol{x}_{t})$ and $\mathcal{V}_{\phi}(\boldsymbol{x}_{t+1})$ estimate the expected total rewards obtained by following the current policy from states $\boldsymbol{x}_{t}$ and $\boldsymbol{x}_{t+1}$, respectively, until the end of the trajectory.
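The advantage estimate in (5) can be evaluated with a single backward pass over a trajectory, since each residual $\delta_{t+l}$ is weighted by $(\gamma\zeta)^{l}$. The following is a minimal sketch under stated assumptions: the function name `gae_advantages` and the default values of `gamma` and `zeta` are illustrative placeholders, not values taken from this work.

```python
def gae_advantages(rewards, values, gamma=0.99, zeta=0.95):
    """Truncated generalized advantage estimates for one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(x_0), ..., V(x_T)], one more entry than rewards
    gamma, zeta: placeholder defaults (assumptions for illustration)
    """
    T = len(rewards)
    if len(values) != T + 1:
        raise ValueError("values must contain T + 1 entries")
    advantages = [0.0] * T
    running = 0.0
    # Backward recursion: A_t = delta_t + (gamma * zeta) * A_{t+1}.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * zeta * running
        advantages[t] = running
    return advantages
```

Setting `zeta=0` reduces each estimate to the one-step residual $\delta_{t}$ (lower variance, more bias), while `zeta=1` sums all discounted residuals to the end of the trajectory.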

[Figure 2: panel (a); panel (b) with snapshots at Step 1, Step 5, Step 8, Step 13, Step 17, and Step 20]
Figure 2: TriFinger dexterous manipulation tasks. (a) shows the simulation environment, which is constructed using the MuJoCo physics engine [38]. In this task, the three fingers need to push the cube towards a random target pose, visualized by the red transparent cube. (b) shows an example rollout trajectory that demonstrates how the fingers approach the cube and make and break contact to reposition it.

IV Practical Algorithm

In this section, we introduce our framework, LCS-RL, which uses a reinforcement learning algorithm (here, PPO) to optimize an LCS dynamics model, in combination with model predictive control, for solving contact-rich tasks.

First, we formulate a stochastic policy, called the LCS-MPC stochastic policy, by adding Gaussian noise to the output of the LCS-MPC planner in (3). In other words, the LCS-MPC policy is directly parameterized by the LCS model. Then, we use the combined loss given in (7), which mixes the PPO loss and the violation-based loss, to improve the task performance of that policy.

We address the poor data efficiency of the PPO algorithm by leveraging data-efficient model learning at the start and then transitioning to PPO once the model is in a good neighborhood. Therefore, our framework consists of two phases: the warm-up phase and the main phase. In the warm-up phase, we follow the algorithm proposed in [20], solely employing the violation-based loss (2) to quickly learn LCS model parameters that achieve good task performance. Subsequently, we use the learned LCS model to accelerate the main phase, where we start using the PPO algorithm. In practice, we find that having the warm-up phase leads to more stable and progressive training than involving PPO right from the beginning. While we could, in theory, merge the two phases and switch the loss upon transition, it is simpler to keep them separate because each phase requires different hyper-parameters. Details of the warm-up phase and the main phase are provided in Algorithms 1 and 2, respectively.

IV-A LCS-MPC Stochastic Policy

The LCS-MPC policy is the probability density of an action distribution associated with the current state $\boldsymbol{x}_{t}$:

$$
\pi_{\boldsymbol{\theta}}(\boldsymbol{u}_{t}\mid\boldsymbol{x}_{t})=\frac{\exp\left(-\frac{1}{2}(\boldsymbol{u}_{t}-\mu_{\boldsymbol{\Theta}}(\boldsymbol{x}_{t}))^{T}\Sigma^{-1}(\boldsymbol{u}_{t}-\mu_{\boldsymbol{\Theta}}(\boldsymbol{x}_{t}))\right)}{(2\pi)^{n_{u}/2}\det(\Sigma)^{1/2}},
\tag{6}
$$

where $\mu_{\boldsymbol{\Theta}}\in\mathbb{R}^{n_{u}}$ is the optimal output of the LCS-MPC planner, and $\Sigma\in\mathbb{R}^{n_{u}\times n_{u}}$ is the covariance matrix that sets the noise magnitude. Here, $\boldsymbol{\Theta}=(A,B,C,\boldsymbol{d},D,E,F,\boldsymbol{c})$ denotes the LCS parameters, and $\boldsymbol{\theta}=[\boldsymbol{\Theta},\Sigma]$ denotes the joint vector of the policy's learnable parameters. The added noise encourages exploration, helping to avoid low-quality local minima. Typically, the noise magnitude is large initially and gradually decreases as the policy exploits acquired knowledge for better task performance.
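To make the density in (6) concrete, the sketch below samples an action around a given planner output and evaluates the log-density and the probability ratio used by PPO. It is a simplified illustration: the covariance is assumed to be diagonal (passed as a list of variances `sigma_diag`), the mean `mu` is supplied directly instead of being produced by an MPC solve, and all function names are hypothetical.

```python
import math
import random


def sample_action(mu, sigma_diag, rng=random):
    """Draw u ~ N(mu, diag(sigma_diag)); mu stands in for the planner output."""
    return [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma_diag)]


def log_prob(u, mu, sigma_diag):
    """Log of the Gaussian density, specialised to a diagonal covariance."""
    n_u = len(u)
    # Quadratic form (u - mu)^T Sigma^{-1} (u - mu) for diagonal Sigma.
    quad = sum((ui - mi) ** 2 / si for ui, mi, si in zip(u, mu, sigma_diag))
    log_det = sum(math.log(si) for si in sigma_diag)
    return -0.5 * (quad + log_det + n_u * math.log(2.0 * math.pi))


def prob_ratio(u, mu_new, sigma_new, mu_old, sigma_old):
    """Ratio pi_new(u | x) / pi_old(u | x), computed in log space for stability."""
    return math.exp(log_prob(u, mu_new, sigma_new) - log_prob(u, mu_old, sigma_old))
```

Working in log space avoids underflow when the density is very small, which is a common design choice in policy gradient implementations.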

Figure 3: Learning curves of the TriFinger Moving Cube task. The red, blue, orange, and green lines show the average task success rate of our proposed method, the prior method [20], a method that uses PPO without a warm-up phase, and PDDM [10], respectively. At the beginning of the training, our method and the prior method [20] share the same performance since the same algorithm is used. The transition occurs after collecting 6 minutes of data, when our method switches to fully employing the PPO algorithm. Shaded regions indicate normal t-score 95% confidence intervals.

IV-B Loss for Optimizing LCS Model

In our framework, we employ two types of loss: violation-based and PPO. The violation-based loss improves the forward prediction capability of the LCS model, while the PPO loss directly enhances the task performance of the LCS-MPC planner. Introducing a hyper-parameter $\beta\in[0,1]$ allows us to balance the contributions of these losses, resulting in the combined loss:

$$
\mathcal{L}^{\boldsymbol{\theta}}_{c}=\beta\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{PPO}}+(1-\beta)\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{vio}}.
\tag{7}
$$

To optimize the parameters of the LCS-MPC stochastic policy, one must compute the gradients of the PPO loss and the violation-based loss with respect to the policy parameters. Section III-B already describes the method for calculating $\frac{d\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{vio}}}{d\boldsymbol{\theta}}$. Next, we describe the process for computing $\frac{d\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{PPO}}}{d\boldsymbol{\theta}}$.
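Because (7) is a convex combination, the gradient of the combined loss is the same weighted sum of the two individual gradients. A minimal sketch follows, with hypothetical function names and gradients represented as plain lists of floats:

```python
def combined_loss(loss_ppo, loss_vio, beta):
    """Weighted sum of the PPO loss and the violation-based loss."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * loss_ppo + (1.0 - beta) * loss_vio


def combined_gradient(grad_ppo, grad_vio, beta):
    """Gradient of the combined loss, by linearity of differentiation."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * gp + (1.0 - beta) * gv for gp, gv in zip(grad_ppo, grad_vio)]
```

With `beta=1.0` only the PPO term remains, and with `beta=0.0` only the violation-based term remains.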

IV-C Gradient of PPO Loss

The original PPO algorithm parameterizes policies via deep neural networks [5], so the gradient of the PPO loss with respect to the policy parameters can be computed simply via automatic differentiation. This does not apply directly to our case because our policy is an MPC planner, and differentiating through the MPC requires special treatment.

Given the explicit form of the PPO loss in (4), the gradient with respect to $\boldsymbol{\theta}$ can be computed using the chain rule:

$$
\frac{d\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{PPO}}}{d\boldsymbol{\theta}}=\frac{-1}{|\mathcal{D}|T}\sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}
\begin{cases}
\frac{dh_{t}^{\boldsymbol{\theta}}}{d\pi_{\boldsymbol{\theta}}}\frac{d\pi_{\boldsymbol{\theta}}}{d\boldsymbol{\theta}}\mathcal{A}_{t} & \text{if }h_{t}^{\boldsymbol{\theta}}\geq 1-\epsilon\text{; }\mathcal{A}_{t}<0\\
0 & \text{if }h_{t}^{\boldsymbol{\theta}}<1-\epsilon\text{; }\mathcal{A}_{t}<0\\
\frac{dh_{t}^{\boldsymbol{\theta}}}{d\pi_{\boldsymbol{\theta}}}\frac{d\pi_{\boldsymbol{\theta}}}{d\boldsymbol{\theta}}\mathcal{A}_{t} & \text{if }h_{t}^{\boldsymbol{\theta}}\leq 1+\epsilon\text{; }\mathcal{A}_{t}\geq 0\\
0 & \text{if }h_{t}^{\boldsymbol{\theta}}>1+\epsilon\text{; }\mathcal{A}_{t}\geq 0.
\end{cases}
\tag{8}
$$

Due to the clipping effect of the PPO loss, the gradients are zero when the ratio $h^{\boldsymbol{\theta}}_{t}$ leaves the trust region $[1-\epsilon,1+\epsilon]$ in the direction favored by the advantage, i.e., below $1-\epsilon$ when $\mathcal{A}_{t}<0$ or above $1+\epsilon$ when $\mathcal{A}_{t}\geq 0$. Hence, we are left to compute the gradients in the remaining cases. To compute (8), one must compute $\frac{dh_{t}^{\boldsymbol{\theta}}}{d\pi_{\boldsymbol{\theta}}}$ and $\frac{d\pi_{\boldsymbol{\theta}}}{d\boldsymbol{\theta}}$.
While it is straightforward to evaluate $\frac{dh_{t}^{\boldsymbol{\theta}}}{d\pi_{\boldsymbol{\theta}}}$, computing $\frac{d\pi_{\boldsymbol{\theta}}}{d\boldsymbol{\theta}}$ requires differentiating the optimal action of the MPC, $\mu_{\boldsymbol{\Theta}}(\boldsymbol{x}_{t})$, with respect to its parameters $\boldsymbol{\Theta}$. We compute this derivative via perturbations of the KKT conditions [39], with details given in [40]. Also, note that the MPC problem is not always classically differentiable (e.g., when strict complementarity does not hold in the KKT conditions), but we have not found this problematic in practice.
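The case structure of (4) and (8) can be illustrated per sample as follows. This is a sketch under stated assumptions rather than the implementation used in this work: the function names are hypothetical, the default `eps=0.2` is a placeholder, the loss is averaged over the number of samples supplied, and the covariance is assumed diagonal. The last function gives the derivative of the ratio with respect to the policy mean, $h\,\Sigma^{-1}(\boldsymbol{u}-\mu)$, which follows from the Gaussian density; the remaining factor, the sensitivity of the MPC output to the LCS parameters, is not computed here.

```python
def ppo_term(h, adv, eps):
    """Per-sample clipped term inside the double sum of the PPO loss."""
    if adv < 0.0:
        return max(h, 1.0 - eps) * adv
    return min(h, 1.0 + eps) * adv


def ppo_term_dh(h, adv, eps):
    """Derivative of ppo_term with respect to the ratio h (zero when clipped)."""
    if adv < 0.0:
        return adv if h >= 1.0 - eps else 0.0
    return adv if h <= 1.0 + eps else 0.0


def ppo_loss(ratios, advantages, eps=0.2):
    """Negative mean of the clipped terms over the supplied samples."""
    n = len(ratios)
    return -sum(ppo_term(h, a, eps) for h, a in zip(ratios, advantages)) / n


def dratio_dmu(h, u, mu, sigma_diag):
    """d h / d mu for a Gaussian policy with diagonal covariance."""
    return [h * (ui - mi) / si for ui, mi, si in zip(u, mu, sigma_diag)]
```

Multiplying `ppo_term_dh` by `dratio_dmu` and then by the planner sensitivity would give one sample's contribution to the gradient with respect to the LCS parameters, up to the leading sign and normalization.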

| Object | Peak Success Rate (%): Only $\mathcal{L}_{\mathrm{vio}}^{\theta}$ | Peak Success Rate (%): LCS-RL (Ours) | Final Success Rate (%): Only $\mathcal{L}_{\mathrm{vio}}^{\theta}$ | Final Success Rate (%): LCS-RL (Ours) |
|---|---|---|---|---|
| Sugar Box | $87.5\pm 9.2$ | $\mathbf{95.9\pm 2.4}$ | $44.3\pm 26.2$ | $\mathbf{95.9\pm 2.4}$ |
| Fish Can | $59.1\pm 7.3$ | $\mathbf{69.9\pm 8.1}$ | $38.8\pm 13.1$ | $\mathbf{69.9\pm 8.1}$ |
| Mug | $32.5\pm 8.4$ | $\mathbf{44.6\pm 3.9}$ | $10.0\pm 6.3$ | $\mathbf{44.6\pm 3.9}$ |
| Wrench | $45.5\pm 8.7$ | $\mathbf{60.0\pm 6.5}$ | $11.5\pm 8.6$ | $\mathbf{59.5\pm 6.6}$ |
| Clamp | $24.7\pm 5.3$ | $\mathbf{39.3\pm 9.7}$ | $6.1\pm 5.6$ | $\mathbf{38.7\pm 10.2}$ |
| Banana | $28.4\pm 5.8$ | $\mathbf{35.5\pm 5.3}$ | $7.5\pm 6.6$ | $\mathbf{35.5\pm 5.3}$ |
TABLE I: Comparison of task success rates between our method and the prior method [20] on diverse objects.

V Experiments and Results

In this section, we verify our proposed framework on the three-fingered robotic hand manipulation task first proposed in [20] (see Fig. 2), which we call the TriFinger Moving Cube task. In the first experiment, we compare the task performance of the LCS model trained by our method with that of prior methods. Next, we replace the cube with other objects that have more complex shapes and repeat the same experiment. Lastly, we demonstrate that the data efficiency of our framework can be greatly improved via transfer learning. To obtain statistically meaningful results, we perform 10 runs with 10 random seeds for each experiment. We compute the task success rate by evaluating the learned models on 1000 random goal poses and aggregating the results. All videos and codes are available at https://sites.google.com/view/lcs-rl.
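One plausible way to aggregate such results is sketched below: a per-seed success rate over the evaluation episodes, followed by a mean and a Student-t 95% confidence half-width across seeds. This is an assumption about the aggregation, not a description of the exact procedure used here; the function names are hypothetical, and the constant 2.262 is the standard two-sided 95% critical value for 9 degrees of freedom (10 seeds).

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DOF9 = 2.262


def success_rate(outcomes):
    """Success rate in percent from a list of per-episode booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_and_ci(per_seed_rates, t_crit=T_CRIT_95_DOF9):
    """Mean across seeds and the half-width of the t-based confidence interval."""
    n = len(per_seed_rates)
    mean = statistics.mean(per_seed_rates)
    half_width = t_crit * statistics.stdev(per_seed_rates) / math.sqrt(n)
    return mean, half_width
```

For a different number of seeds, `t_crit` would need to be replaced by the critical value for the corresponding degrees of freedom.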

V-A TriFinger Moving Cube Task

In this task, a TriFinger robot aims to align a 6 cm-sized cube with random goal poses on a planar surface. Each episode comprises up to 20 steps, each lasting 0.1 seconds.

[Figure 4: panels (a) Sugar Box, (b) Fish Can, (c) Mug, (d) Wrench, (e) Clamp, (f) Banana]
Figure 4: Comparison of task performance between learning the LCS model from scratch and starting from pre-trained models (obtained from training on the TriFinger Moving Cube task) on the YCB objects using our LCS-RL framework.

V-A1 States and Actions

We define the system state

$$
\boldsymbol{x}=[\boldsymbol{p}_{\text{cube}},\;\alpha_{\text{cube}},\;\boldsymbol{p}_{\text{fingertips}}]\in\mathbb{R}^{9},
\tag{9}
$$

where $\boldsymbol{p}_{\text{cube}}\in\mathbb{R}^{2}$ is the xy position of the cube; $\alpha_{\text{cube}}\in\mathbb{R}$ is the rotation angle around the z (vertical) axis; and $\boldsymbol{p}_{\text{fingertips}}\in\mathbb{R}^{6}$ are the xy positions of the three fingertips. We define the actions as deviations from the current positions of the three fingertips in Cartesian space. In addition, we impose safety limits on actions (element-wise) to constrain how far the fingertips can move in one time step:

$$
\begin{aligned}
\boldsymbol{u}&=\Delta\boldsymbol{p}_{\text{fingertips}}\in\mathbb{R}^{6},\\
u_{i}&\in[-0.015,\;0.015]\;\text{m}.
\end{aligned}
\tag{10}
$$

We employ operational space control (OSC) [41] in the lower-level controller to map the action $\boldsymbol{u}$ to the joint torques of each finger. We also utilize the OSC controller to maintain the fingertips at a constant height, as this TriFinger task involves only planar manipulation.
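The element-wise action limit in (10) can be illustrated as below. This sketch assumes the limit is enforced by simple clipping before the fingertip targets are passed to the lower-level controller; the limits could equally be imposed as constraints inside the planner. The function names are hypothetical.

```python
U_MAX = 0.015  # metres, per-element action limit


def clip_action(u, u_max=U_MAX):
    """Clamp each action element to [-u_max, u_max]."""
    return [max(-u_max, min(u_max, ui)) for ui in u]


def fingertip_targets(p_fingertips, u):
    """Target xy positions: current fingertip positions plus the clipped deviation."""
    return [p + du for p, du in zip(p_fingertips, clip_action(u))]
```

For example, `clip_action([0.02, -0.02, 0.01, 0.0, 0.0, 0.0])` leaves the third element unchanged and saturates the first two at the limit.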

V-A2 Task Space

We use the same bounds for task space as in [20], from which the goal poses are uniformly sampled

$$
\begin{aligned}
[\,\cdots\,]^{T}\;\text{m}&\leq\boldsymbol{p}_{\text{goal}}\leq[0.06,\;0.06]^{T}\;\text{m},\\
-0.5\;\text{rad}&\leq\alpha_{\text{goal}}\leq 0.5\;\text{rad}.
\end{aligned}
\tag{11}
$$

V-A3 Task Success Criteria

When the cube pose is near the goal pose and within some tolerances, we can consider that the task is successfully completed. The goal tolerance values are selected to establish the right level of difficulty as too stringent tolerances make the task impossible to solve. We follow some previous works on TriFinger tasks [42, 43] to set the tolerances as follows:

$$
\begin{aligned}
\|\boldsymbol{p}_{\text{cube}}^{\ast}-\boldsymbol{p}_{\text{goal}}\|&\leq 0.02\;\text{m},\\
\|\alpha_{\text{cube}}^{\ast}-\alpha_{\text{goal}}\|&\leq 0.2\;\text{rad},
\end{aligned}
\tag{12}
$$

where $\boldsymbol{p}_{\text{cube}}^{\ast}$ and $\alpha_{\text{cube}}^{\ast}$ are the xy position and orientation of the cube at the last time step $T$; $\boldsymbol{p}_{\text{goal}}\in\mathbb{R}^{2}$ and $\alpha_{\text{goal}}\in\mathbb{R}$ together specify the goal pose.
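The success criteria in (12) amount to a simple check on the final cube pose. The sketch below assumes a Euclidean norm for the position error and a plain absolute difference for the angle (no wrap-around handling); the function name is hypothetical.

```python
import math

POS_TOL = 0.02  # metres
ANG_TOL = 0.2   # radians


def is_task_completed(p_cube, alpha_cube, p_goal, alpha_goal):
    """True if the final cube pose is within both tolerances of the goal pose."""
    pos_err = math.hypot(p_cube[0] - p_goal[0], p_cube[1] - p_goal[1])
    ang_err = abs(alpha_cube - alpha_goal)  # assumes no angle wrap-around
    return pos_err <= POS_TOL and ang_err <= ANG_TOL
```

Both conditions must hold at the same time, so a cube that reaches the goal position with a large orientation error still counts as a failure.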

V-A4 Cost function for the LCS-MPC

We utilize the same cost function for the LCS-MPC as in the prior work [20]:

$$
\begin{aligned}
\mathcal{J}&=\sum_{k=t}^{t+H-1}\mathcal{C}(\boldsymbol{x}_{k},\boldsymbol{u}_{k})+\mathcal{C}_{f}(\boldsymbol{x}_{t+H}),\\
\mathcal{C}&=10.0\left\|\boldsymbol{p}_{\text{fingertips}}-\boldsymbol{p}_{\text{cube}}\right\|^{2}+200.0\left\|\boldsymbol{p}_{\text{cube}}-\boldsymbol{p}_{\text{goal}}\right\|^{2}\\
&\quad+0.3\left(\alpha_{\text{cube}}-\alpha_{\text{goal}}\right)^{2}+200.0\left\|\boldsymbol{u}\right\|^{2},\\
\mathcal{C}_{f}&=6.0\left\|\boldsymbol{p}_{\text{fingertips}}-\boldsymbol{p}_{\text{cube}}\right\|^{2}+200.0\left\|\boldsymbol{p}_{\text{cube}}-\boldsymbol{p}_{\text{goal}}\right\|^{2}\\
&\quad+1.5\left(\alpha_{\text{cube}}-\alpha_{\text{goal}}\right)^{2}.
\end{aligned}
\tag{13}
$$
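The stage and terminal costs in (13) can be written out as follows, with the state laid out as in (9). One point is an assumption: since the fingertip vector has six entries and the cube position has two, the fingertip term is interpreted here as the sum of squared distances from each fingertip to the cube. The function names are hypothetical.

```python
def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(vi * vi for vi in v)


def fingertip_term(p_fingertips, p_cube):
    """Assumed form: sum of squared fingertip-to-cube distances in the xy plane."""
    total = 0.0
    for i in range(0, len(p_fingertips), 2):
        total += (p_fingertips[i] - p_cube[0]) ** 2 + (p_fingertips[i + 1] - p_cube[1]) ** 2
    return total


def split_state(x):
    """State layout: [p_cube (2), alpha_cube (1), p_fingertips (6)]."""
    return x[0:2], x[2], x[3:9]


def stage_cost(x, u, p_goal, alpha_goal):
    p_cube, alpha, p_fingertips = split_state(x)
    goal_err = [p_cube[0] - p_goal[0], p_cube[1] - p_goal[1]]
    return (10.0 * fingertip_term(p_fingertips, p_cube)
            + 200.0 * sq_norm(goal_err)
            + 0.3 * (alpha - alpha_goal) ** 2
            + 200.0 * sq_norm(u))


def terminal_cost(x, p_goal, alpha_goal):
    p_cube, alpha, p_fingertips = split_state(x)
    goal_err = [p_cube[0] - p_goal[0], p_cube[1] - p_goal[1]]
    return (6.0 * fingertip_term(p_fingertips, p_cube)
            + 200.0 * sq_norm(goal_err)
            + 1.5 * (alpha - alpha_goal) ** 2)
```

The terminal cost drops the action penalty and weights the orientation error more heavily than the stage cost does.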

V-A5 Reward function for PPO

We use both dense and sparse reward functions for the PPO algorithm. The dense reward function $r_{t}(\boldsymbol{x}_{t},\boldsymbol{u}_{t})$ is simply the negation of the cost function $\mathcal{C}$ in (13). This choice of reward function ensures that both PPO and the MPC planner of the stochastic policy align in the same direction toward task completion.

At the end of the rollout trajectory, we add a negative sparse reward to penalize for not completing the task:

$$r(\boldsymbol{x}_{T-1}, \boldsymbol{u}_{T-1}) = -10.0 \times (1 - \textit{is\_task\_completed}). \tag{14}$$

In practice, we find that the sparse reward significantly accelerates training.
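As an illustration of how the dense and sparse terms combine, the following is a minimal sketch rather than the authors' implementation. The function and argument names are hypothetical, the array shapes are assumed to broadcast, and the fingertip-to-cube weight `w_reach` is left as a required argument because its value is not restated in this section. The other weights are read from the running cost in (13).

```python
import numpy as np

def dense_reward(p_fingertips, p_cube, p_goal, alpha_cube, alpha_goal, u,
                 w_reach, w_pos=200.0, w_angle=0.3, w_u=200.0):
    """Dense reward: the negated running cost of (13).

    w_reach has no default on purpose (assumption: supply the value from (13)).
    """
    cost = (w_reach * np.sum((p_fingertips - p_cube) ** 2)
            + w_pos * np.sum((p_cube - p_goal) ** 2)
            + w_angle * (alpha_cube - alpha_goal) ** 2
            + w_u * np.sum(u ** 2))
    return -cost

def add_terminal_penalty(dense_rewards, is_task_completed, penalty=10.0):
    """Add the sparse term of (14) to the last step of a rollout."""
    rewards = np.asarray(dense_rewards, dtype=float).copy()
    # The penalty is applied only when the task was not completed.
    rewards[-1] += -penalty * (1.0 - float(is_task_completed))
    return rewards
```

With this layout, a failed rollout differs from a successful one only in its final reward, which is lowered by 10.0.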

V-B Results of the TriFinger Moving Cube Task

To demonstrate the effectiveness of LCS-RL, we compare against the prior method [20], which trains purely on $\mathcal{L}^{\boldsymbol{\theta}}_{\mathrm{vio}}$, and against PPO without a warm-up phase. We also compare against PDDM [10], a state-of-the-art model-based RL approach, to demonstrate the utility of simple, non-smooth models over deep neural networks for manipulation. Note that in the main phase of our method, we set $\beta = 1.0$ for the combined loss in (7), meaning that only the PPO loss is used. The results are shown in Fig. 3.

When utilizing only the violation-based loss, the mean success rate peaks at 55% after 7 minutes of data, then fluctuates and decreases as more data is collected, finally settling at approximately 30%. In contrast, starting with the same performance at 6 minutes of data, our method improves the task performance throughout training, reaching a 65% success rate after 25 minutes of data and 71.4% at the end of training. Because LCS models have limited expressive power, optimizing them for better forward prediction does not guarantee that this predictive capability is allocated to the regions of the state space where accuracy matters for task performance. As a result, the overall task performance might drop significantly. Our method does not suffer from this issue because the only objective of PPO is to encourage the policy to repeat good trajectories and avoid bad ones.
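To make the last point concrete, the sketch below shows the standard clipped surrogate objective from the PPO paper [5] in NumPy. It is an illustration under stated assumptions, not the training code used here: the function name is hypothetical, and `clip_eps=0.2` is a common default from [5], not a value reported in this section.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes the incentive to move the ratio
    # far outside [1 - clip_eps, 1 + clip_eps] in a single update.
    return np.mean(np.minimum(unclipped, clipped))
```

Actions with positive advantage raise the objective when their probability increases, and actions with negative advantage raise it when their probability decreases. Forward prediction accuracy does not appear in this objective at all.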

In addition, the PPO-only method and PDDM [10] have the lowest task performance throughout training, achieving final task success rates of merely 10% and 20%, respectively.

V-C TriFinger Manipulating Diverse Objects

We run a set of experiments on the TriFinger Moving Object task, which is similar to the TriFinger Moving Cube task, except that the cube is replaced by other objects with non-convex, highly intricate shapes. These objects, which include a sugar box, fish can, mug, wrench, clamp, and banana, are selected from the YCB object and model set [44]. As seen from Table I, our method consistently outperforms the prior method in [20] given the same amount of data, achieving task success rates that are between 8% (sugar box) and 15% (clamp) higher.

V-D Transfer Learning

To illustrate the transfer learning capabilities of our LCS-RL framework, we use the LCS model trained on the TriFinger Moving Cube task as the starting point for training on other objects. The results in Fig. 4 show that our LCS-RL framework is highly suitable for transfer learning. In particular, transfer learning significantly accelerates training and yields higher final task success rates than training from scratch for all objects except the sugar box.

VI Conclusions

In conclusion, we present LCS-RL, a novel approach that leverages a reinforcement learning algorithm to directly maximize the task performance of the LCS model in combination with the MPC planner. We demonstrate that the proposed method attains higher task performance and greater sample efficiency compared to prior methods in TriFinger robot tasks involving pushing and rotating various objects. In addition, we show that our method is highly suitable for transfer learning, which further helps to improve data efficiency.

Our framework is not limited to the PPO algorithm, since any RL algorithm can be incorporated. Thus, one direction for future work is to explore other RL algorithms within our framework. Off-policy RL algorithms such as TD3 [6] or SAC [7] are known for better data efficiency than on-policy algorithms [45]. Nevertheless, this advantage may not be as evident in situations with limited data.

Lastly, we observe the limitations of the LCS model in representing system dynamics, especially for objects with complex geometries such as the banana or the clamp. We aim to investigate alternative structured models that are simpler than neural networks yet include nonlinear components, albeit with a trade-off in data efficiency. One potential candidate is the Nonlinear Complementarity System (NCS) model.

References

  • [1] T. Marcucci and R. Tedrake, “Warm Start of Mixed-Integer Programs for Model Predictive Control of Hybrid Systems,” IEEE Transactions on Automatic Control, vol. 66, pp. 2433–2448, 2019.
  • [2] D. Frick, A. Georghiou, J. L. Jerez, A. Domahidi, and M. Morari, “Low-complexity method for hybrid MPC with local guarantees,” SIAM Journal on Control and Optimization, vol. 57, no. 4, pp. 2328–2361, 2019.
  • [3] A. Aydinoglu, A. Wei, and M. Posa, “Consensus Complementarity Control for Multi-Contact MPC,” arXiv preprint arXiv:2304.11259, 2023.
  • [4] A. Aydinoglu and M. Posa, “Real-Time Multi-Contact Model Predictive Control via ADMM,” in 2022 International Conference on Robotics and Automation (ICRA).   Philadelphia, PA, USA: IEEE, 2022, pp. 3414–3421.
  • [5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [6] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” in Proceedings of the 35th International Conference on Machine Learning.   PMLR, 2018, pp. 1587–1596.
  • [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80.   PMLR, 2018, pp. 1861–1870.
  • [8] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 7559–7566.
  • [9] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models,” in Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [10] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep Dynamics Models for Learning Dexterous Manipulation,” in Conference on Robot Learning.   PMLR, 2020, pp. 1101–1112.
  • [11] A. S. Morgan, D. Nandha, G. Chalvatzaki, C. D’Eramo, A. M. Dollar, and J. Peters, “Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 6672–6678.
  • [12] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control,” arXiv:1812.00568 [cs], 2018.
  • [13] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine, “SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97.   PMLR, 2019, pp. 7444–7453.
  • [14] R. Ghugare, H. Bharadhwaj, B. Eysenbach, S. Levine, and R. Salakhutdinov, “Simplifying model-based RL: Learning representations, latent-space models, and policies with one objective,” in The Eleventh International Conference on Learning Representations, 2023.
  • [15] M. Parmar, M. Halm, and M. Posa, “Fundamental Challenges in Deep Learning for Stiff Contact Dynamics,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   Prague, Czech Republic: IEEE, 2021, pp. 5181–5188.
  • [16] B. Bianchini, M. Halm, N. Matni, and M. Posa, “Generalization Bounded Implicit Learning of Nearly Discontinuous Functions,” in Proceedings of The 4th Annual Learning for Dynamics and Control Conference (L4DC), ser. Proceedings of Machine Learning Research, vol. 168, 2022, pp. 1112–1124.
  • [17] V. Kumar, E. Todorov, and S. Levine, “Optimal control with learned local models: Application to dexterous manipulation,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   Stockholm, Sweden: IEEE, 2016, pp. 378–383.
  • [18] S. Levine and V. Koltun, “Guided Policy Search,” in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28.   Atlanta, Georgia, USA: PMLR, 2013, pp. 1–9.
  • [19] M. P. Deisenroth and C. E. Rasmussen, “PILCO: A Model-Based and Data-Efficient Approach to Policy Search,” in Proceedings of the 28th International Conference on International Conference on Machine Learning, ser. ICML’11.   Madison, WI, USA: Omnipress, 2011, pp. 465–472.
  • [20] W. Jin and M. Posa, “Task-driven hybrid model reduction for dexterous manipulation,” IEEE Transactions on Robotics (TRO), vol. 40, pp. 1774–1794, Jan. 2024.
  • [21] N. Lambert, B. Amos, O. Yadan, and R. Calandra, “Objective Mismatch in Model-based Reinforcement Learning,” arXiv preprint arXiv:2002.04523, 2021.
  • [22] M. Okada, L. Rigazio, and T. Aoshima, “Path Integral Networks: End-to-End Differentiable Optimal Control,” arXiv preprint arXiv:1706.09597, 2017.
  • [23] B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable MPC for End-to-end Planning and Control,” in Advances in Neural Information Processing Systems, 2019.
  • [24] H. N. Esfahani, A. B. Kordabad, and S. Gros, “Approximate Robust NMPC using Reinforcement Learning,” in 2021 European Control Conference (ECC), Rotterdam, Netherlands, 2021.
  • [25] S. Saxena, A. LaGrassa, and O. Kroemer, “Learning reactive and predictive differentiable controllers for switching linear dynamical models,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 7563–7569.
  • [26] W. Jin, Z. Wang, Z. Yang, and S. Mou, “Pontryagin differentiable programming: An end-to-end learning and control framework,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 7979–7992.
  • [27] W. Jin, S. Mou, and G. J. Pappas, “Safe pontryagin differentiable programming,” in Advances in Neural Information Processing Systems, vol. 34.   Curran Associates, Inc., 2021, pp. 16 034–16 050.
  • [28] W. Wan, Y. Wang, Z. Erickson, and D. Held, “Difftop: Differentiable trajectory optimization for deep reinforcement and imitation learning,” arXiv preprint arXiv:2402.05421, 2024.
  • [29] L. Pineda, T. Fan, M. Monge, S. Venkataraman, P. Sodhi, R. T. Chen, J. Ortiz, D. DeTone, A. Wang, S. Anderson, J. Dong, B. Amos, and M. Mukadam, “Theseus: A Library for Differentiable Nonlinear Optimization,” Advances in Neural Information Processing Systems, 2022.
  • [30] M. Xu, T. L. Molloy, and S. Gould, “Revisiting implicit differentiation for learning problems in optimal control,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [31] D. Stewart and J. Trinkle, “An implicit time-stepping scheme for rigid body dynamics with Coulomb friction,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), vol. 1, 2000, pp. 162–169.
  • [32] A. Aydinoglu, P. Sieg, V. M. Preciado, and M. Posa, “Stabilization of Complementarity Systems via Contact-Aware Controllers,” IEEE Transactions on Robotics, vol. 38, no. 3, pp. 1735–1754, 2022.
  • [33] W. Jin, A. Aydinoglu, M. Halm, and M. Posa, “Learning Linear Complementarity Systems,” in Proceedings of The 4th Annual Learning for Dynamics and Control Conference (L4DC).   PMLR, 2022, p. 13.
  • [34] S. N. Afriat, “Theory of maxima and the method of lagrange,” SIAM Journal on Applied Mathematics, vol. 20, no. 3, pp. 343–357, 1971.
  • [35] M. Posa, C. Cantu, and R. Tedrake, “A direct method for trajectory optimization of rigid bodies through contact,” The International Journal of Robotics Research, vol. 33, no. 1, pp. 69–81, 2014.
  • [36] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
  • [37] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-Dimensional Continuous Control Using Generalized Advantage Estimation,” in 4th International Conference on Learning Representations (ICLR), 2016.
  • [38] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2012, pp. 5026–5033.
  • [39] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950.   Berkeley and Los Angeles: University of California Press, 1951, pp. 481–492.
  • [40] C. Büskens and H. Maurer, “Sensitivity analysis and real-time control of nonlinear optimal control systems via nonlinear programming methods,” in Variational Calculus, Optimal Control and Applications: International Conference in Honour of L. Bittner and R. Klötzler, Basel: Birkhäuser Basel, 1998.
  • [41] O. Khatib, “A unified approach for motion and force control of robot manipulators: The operational space formulation,” IEEE J. Robotics Autom., vol. 3, pp. 43–53, 1987.
  • [42] A. Allshire, M. Mittal, V. Lodaya, V. Makoviychuk, D. Makoviichuk, F. Widmaier, M. Wüthrich, S. Bauer, A. Handa, and A. Garg, “Transferring dexterous manipulation from GPU simulation to a remote real-world TriFinger,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 11 802–11 809.
  • [43] N. Funk, C. Schaff, R. Madan, T. Yoneda, J. U. De Jesus, J. Watson, E. K. Gordon, F. Widmaier, S. Bauer, S. S. Srinivasa, T. Bhattacharjee, M. R. Walter, and J. Peters, “Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 1, pp. 478–485, 2022.
  • [44] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The YCB object and Model set: Towards common benchmarks for manipulation research,” in 2015 International Conference on Advanced Robotics (ICAR).   Istanbul, Turkey: IEEE, 2015, pp. 510–517.
  • [45] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2019.