Scaling Manipulation Learning with Visual Kinematic Chain Prediction

Xinyu Zhang   Yuhan Liu   Haonan Chang   Abdeslam Boularias
{xz653, yl1834, hc856, ab1544}@rutgers.edu
Rutgers University
Abstract

Learning general-purpose models from diverse datasets has achieved great success in machine learning. In robotics, however, existing methods in multi-task learning are typically constrained to a single robot and workspace, while recent work such as RT-X requires a non-trivial action normalization procedure to manually bridge the gap between different action spaces in diverse environments. In this paper, we propose the visual kinematics chain as a precise and universal representation of quasi-static actions for robot learning over diverse environments, which requires no manual adjustment since the visual kinematic chains can be automatically obtained from the robot’s model and camera parameters. We propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that supports an arbitrary number of camera viewpoints, and that is trained with a single objective of forecasting kinematic structures through optimal point-set matching. We demonstrate the superior performance of VKT over BC transformers as a general agent on Calvin, RLBench, Open-X, and real robot manipulation tasks. Video demonstrations can be found at https://mlzxy.github.io/visual-kinetic-chain.

Keywords: Multi-Task Robot Learning, Manipulation

1 Introduction

There are numerous techniques in machine learning and computer vision that can successfully learn a single general-purpose model from multiple diverse datasets [1]. In robotics, however, despite the recent advances in multi-task learning that enable a single policy to perform various tasks through imitation learning with language instructions [2, 3], these methods are typically constrained to a single robot and workspace. Some recent work, such as RT-X [4, 5], leverages a large vision-language model (VLM) to directly train policies with Behavioral Cloning (BC) on the Open-X Embodiment [5], a collection of datasets crowd-sourced from various environments and robots. However, these techniques require a non-trivial action normalization procedure to bridge the gap between the different action spaces in diverse setups, such as end-effector poses, joint positions, and velocities in various world frames. This manual engineering procedure is currently custom-designed for each training dataset, which affects the generalization and interpretability of these models.

Therefore, a key question is: can we find an action representation that is precise and universal for various setups and robots? To this end, we propose the visual kinematic chain, which is the projection of the robot’s high-dimensional kinematic structure into the image plane as pixels. Instead of predicting the low-level robot actions, we propose to learn and visually forecast the movement of robots’ kinematic chains in the image plane. This visual approach provides a unified action representation for different robots, and requires no manual adjustment since the visual kinematic chains can be automatically obtained from the robot’s model and camera parameters.

To forecast the kinematic structures of various robots, we render the kinematic chains into point sets by sampling points along links, and perform optimal matching [6] to minimize the earth-moving distance [7, 8] between the predicted point sets and the ground-truth kinematic structures. Further, we propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that is solely based on the attention mechanism in the RGB space and supports an arbitrary number of camera viewpoints. Our VKT is trained with a single objective of forecasting kinematic structures through optimal point-set matching, without seeing any low-level robot actions. VKT is deployed to a specific environment by simply training a tiny head while freezing the backbone. VKT demonstrates superior performance over BC transformers as a general agent on Calvin (success length 1.08 vs 0.33), RLBench (success rate 61.7% vs 24.3%), Open-X, and real robot manipulation tasks.

To summarize, our contributions are threefold. (1) We propose the visual kinematics chain as a precise and universal representation of quasi-static actions for learning from diverse robot configurations. (2) We propose VKT, a convolution-free architecture that supports an arbitrary number of camera viewpoints and is trained with a single objective of forecasting kinematic structures through optimal point set matching. (3) We present a thorough empirical study of the performance of VKT on specialized and general agents over a diverse set of language-conditioned tasks and environments.

Figure 1: Overview of the proposed framework. We use the visual kinematic chain as a universal action representation across diverse robots and setups. We propose the Visual Kinematics Transformer (VKT), an architecture based solely on attention layers, which predicts the future movements of the visual kinematic chain in images from multiple viewpoints. Our VKT is trained with the earth-moving distance as the single learning objective without knowing any low-level robot states or actions. When deployed in a specific environment, we freeze the VKT as a backbone and attach a tiny head to project the VKT output to actual robot commands.

2 Related Work

Robot Learning from Diverse Environments. Learning a single universal multi-task policy on various robots, cameras, and task configurations is a challenging problem that is receiving increasing attention [9]. One recent strategy for solving this problem consists of training a network that takes as input a robot’s kinematic structure [10, 11]. For instance, MetaMorph [12] tokenizes the robot’s kinematic tree, which is then used as a transformer prompt to produce per-joint actions. However, these methods are only evaluated in simple walking environments of MuJoCo [13] without any variations in the task, the camera viewpoint, or the workspace. Several recent works have also looked into leveraging a large vision-language model (VLM) to directly train robots, through behavior cloning, on a collection of crowd-sourced datasets such as Open-X Embodiment [14, 15, 5]. Despite their impressive results, these methods require a non-trivial action normalization procedure to reduce the gap between action spaces such as end-effector poses, joint positions, and velocities in various world frames. This manual engineering procedure is custom-designed for each training dataset, which severely affects the generalization and interpretability of these models. In contrast, our method directly learns and visually predicts the movement of robots’ kinematic chains.

Transformers for Manipulation Learning. Transformers are widely used in manipulation learning [4, 16]. However, existing architectures require a fixed number of camera viewpoints, which is not guaranteed when learning is performed across various environments. Both RT-1/2 and RT-X select one canonical view from each Open-X dataset [5, 4, 17]. RVT [18] supports multiple but predefined viewpoints. PerAct [3] trains transformers over point-clouds and is not limited by camera viewpoints. But methods based on point clouds still suffer from the lack of datasets with depth information, as well as scale discrepancies [19, 20]. In contrast, our visual kinematics transformer (VKT) is a convolution-free architecture that is solely based on attention mechanisms in the RGB space, without the need for depth data. VKT also supports an arbitrary number of camera viewpoints.

Intermediate Action Representations. Instead of only using low-level robot commands, incorporating intermediate actions into control policies has been shown to increase data efficiency [21]. RT-H is a hierarchical policy that predicts language motions such as “move arm forward” [22]. SWIM uses affordance maps to unify grasping actions of humans and robots [21]. General Flow employs a labeling process to translate end-effector motions into point-cloud movements [23]. IMOP [24] uses invariant regions to describe key manipulation poses. RT-Trajectory predicts the future 2D trajectory of the end-effector’s center in a given image [25]. In contrast, our method forecasts the movements of the entire kinematic chain in multiple camera viewpoints. This representation provides a visually grounded action abstraction, while capturing detailed and small motions at the same time.

Figure 2: Rendering a kinematic chain as a set of pixels in an RGB image

3 Visual Kinematic Chain Forecasting

We consider the problem of learning a single universal multi-task policy from diverse robots and workspaces through a unified action space. Our key insight lies in the use of the visual kinematic chain, which is the projection of the robot’s high-dimensional kinematic structure into a set of pixels in the 2D image plane. The visual kinematic chain can be analytically generated from camera parameters and robot models without any manual engineering. We show that this simple representation allows us to train a single visual language-conditioned policy over multiple environments. The VLM policy is trained with a single objective of forecasting the future visual kinematic chain movements without knowing any low-level robot actions. Our proposed transformer architecture is built entirely with attention layers and can predict consistent multi-view chains from an arbitrary number of camera viewpoints. To deploy the multi-task policy in a specific environment, one simply needs to freeze the transformer and train a tiny head to project the actions into actual robot commands. An overview of our framework is illustrated in Figure 1.
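As an illustration of how a visual kinematic chain can be generated analytically, the following minimal sketch projects 3D joint positions into an image with a pinhole camera model and samples pixels uniformly along each link. It is not the authors' implementation: the function names `project_points` and `sample_chain`, the toy intrinsics `K`, the extrinsics `T_cam_world`, and the three-joint chain are all hypothetical, and the joint positions are assumed to have already been computed by forward kinematics on the robot model.

```python
import numpy as np

def project_points(points_world, K, T_cam_world):
    """Project (J, 3) world-frame points to (J, 2) pixel coordinates (pinhole model)."""
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])  # (J, 4)
    cam = (T_cam_world @ homo.T).T[:, :3]   # points expressed in the camera frame
    uvw = (K @ cam.T).T                     # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]         # perspective division

def sample_chain(joints_px, links, points_per_link=8):
    """Uniformly sample pixels along each link, given as a pair of joint indices."""
    t = np.linspace(0.0, 1.0, points_per_link)[:, None]
    segments = [(1.0 - t) * joints_px[a] + t * joints_px[b] for a, b in links]
    return np.vstack(segments)

# Toy setup (all values are assumptions chosen for illustration).
K = np.array([[500.0, 0.0, 112.0],
              [0.0, 500.0, 112.0],
              [0.0, 0.0, 1.0]])             # intrinsics for a 224x224 image
T_cam_world = np.eye(4)
T_cam_world[2, 3] = 2.0                     # world origin lies 2 m in front of the camera
joints_world = np.array([[0.0, 0.0, 0.0],
                         [0.0, -0.3, 0.0],
                         [0.2, -0.4, 0.0]]) # assumed output of forward kinematics
links = [(0, 1), (1, 2)]

joints_px = project_points(joints_world, K, T_cam_world)
chain_px = sample_chain(joints_px, links)   # point set describing the chain in the image
```

With a real robot, `joints_world` would come from the URDF and the recorded joint angles, and `K` and `T_cam_world` from camera calibration.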

Fitting any Structure with Point-Set Matching. The kinematic structures of various robots differ significantly. To forecast different structures, we propose to render them to point sets by uniformly sampling points along links, as illustrated in Figure 2. Let $P=\{p_i\}_{i=1}^{M}$ denote the point-set rendered from the ground-truth configuration of a kinematic chain, and $Q=\{q_i\}_{i=1}^{N}$ denote the point-set predicted by VKT, where $p_i\in\mathbb{R}^{2}, q_i\in\mathbb{R}^{2}$. We first compute the pairwise Euclidean distance matrix $C\in\mathbb{R}^{M\times N}$, where $C_{ij}=\| p_i-q_j\|$. We then minimize the earth-moving distance $\operatorname{EMD}(P,Q)=\sum_{i,j}\gamma^{*}_{ij}C_{ij}$, where $\gamma^{*}$ is a stochastic assignment matrix that matches points in $P$ and $Q$. We find $\gamma^{*}$ using the Sinkhorn-Knopp algorithm to solve the following optimization problem online for each mini-batch [26, 8].

$$\gamma^{*}=\arg\min_{\gamma\in\mathbb{R}_{+}^{M\times N}}\sum_{i,j}\gamma_{ij}C_{ij},\quad\text{s.t. }\gamma 1=1;\ \gamma^{T}1=1;\ \gamma\geq 0. \qquad (1)$$
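The matching in Equation 1 can be approximated with entropy-regularized Sinkhorn iterations. The sketch below is a plain NumPy illustration under stated assumptions, not the training code of the paper: point coordinates are assumed to be normalized to $[0,1]$ so that the kernel does not underflow, the marginals are taken to be uniform ($1/M$ and $1/N$) so that both sides carry the same total mass, and the regularization strength `eps` and the iteration count are hypothetical choices.

```python
import numpy as np

def sinkhorn_emd(P, Q, eps=0.05, n_iters=200):
    """Approximate EMD between point sets P (M, 2) and Q (N, 2) with Sinkhorn iterations."""
    M, N = len(P), len(Q)
    # Pairwise Euclidean distance matrix C, as defined in the text.
    C = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    a = np.full(M, 1.0 / M)          # assumed uniform mass on ground-truth points
    b = np.full(N, 1.0 / N)          # assumed uniform mass on predicted points
    kernel = np.exp(-C / eps)        # Gibbs kernel of the regularized problem
    u, v = np.ones(M), np.ones(N)
    for _ in range(n_iters):         # alternate row and column scaling
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    gamma = u[:, None] * kernel * v[None, :]   # soft assignment matrix
    return float((gamma * C).sum()), gamma

# Toy example: the prediction is the ground truth shifted by 0.1 along x.
P = np.array([[0.1, 0.1], [0.1, 0.7], [0.7, 0.1], [0.7, 0.7]])
Q = P + np.array([0.1, 0.0])
emd, gamma = sinkhorn_emd(P, Q)
```

In a training loop the same computation would be written in an automatic-differentiation framework so that gradients of the cost flow back to the predicted points; larger `eps` gives smoother but blurrier assignments.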
Figure 3: Overview of our proposed visual kinematics transformer (VKT). For each camera input, we encode the language instruction and RGB image as text and vision tokens with CLIP [27]. Then, we concatenate the text tokens and the kinematics tokens as query tokens. The kinematics tokens are learned parameters. Next, the query and vision tokens are interweaved through a sequence of our proposed multi-view dual attention blocks. In each block, the query tokens are first updated with self-attention (orange). Then, cross-attention is applied with the query tokens as queries and the vision tokens as keys and values (green). A cross-attention layer updates queries with keys and values; we use ➝ to denote queries and $\multimapdot$ to denote keys and values. Then, both the query tokens and the vision tokens are updated through cross-attention with tokens from other camera viewpoints (blue). Next, the vision tokens are updated by cross-attention with query tokens as keys and values (green). Finally, the $T$ kinematics tokens are projected into $T$ point sets through an MLP, representing the visual kinematic chain in the current and the future $T-1$ steps. The predicted point-sets are optimized through point-set matching with the ground-truth, as shown in Equation 1.

Visual Kinematics Transformer (VKT). Our VKT network accepts a language instruction and an RGB image as input and predicts a sequence of $T$ point-sets. The dimensions of the kinematics tokens are $T\times D$, where $D$ denotes the embedding size, and each token predicts one point-set. The first point-set in the predicted sequence describes the current state of the kinematic chain in the given RGB image. The remaining point-sets are predictions of the future states of the kinematic chain in the next $T-1$ time-steps, i.e., future robot movements. The network is trained to match each point-set to the ground-truth kinematic structure at the corresponding time-step by minimizing the earth-moving distance. The same process is applied to multiple RGB images taken from different viewpoints. The network is also trained to make the future point-sets predicted from different viewpoints consistent with each other. The kinematics tokens are concatenated with the text tokens as query tokens. The concatenated query and vision tokens are forwarded to our proposed multi-view dual attention block, to interweave the information between query and vision tokens and across multiple camera viewpoints.

The multi-view dual attention block performs dual attentions between query and vision tokens within each single view, indicated by the green blocks in Figure 3, and multi-view attentions for query and vision tokens independently but across multiple camera viewpoints, shown as the blue blocks in Figure 3. This dual attention enables the query token to learn the visual kinematic chain movements. The multi-view attention enables a spatially-consistent prediction across multiple viewpoints. The output kinematics tokens are projected to $T$ point-sets using a small MLP network. Each point-set contains $N$ 2D points on the input image. The VKT architecture is illustrated in Figure 3.
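The data flow of one multi-view dual attention block can be sketched as follows. This is a heavily simplified illustration under explicit assumptions, not the VKT implementation: attention is single-head with no learned projections, layer normalization and feed-forward layers are omitted, residual connections are assumed, and the multi-view step attends over the concatenation of tokens from all views (including the current one) so that the block also runs with a single camera. The names `attention` and `dual_attention_block` and all toy dimensions are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention; q: (Lq, D), k and v: (Lk, D)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dual_attention_block(queries, visions):
    """queries / visions: lists with one (L, D) token array per camera view."""
    # 1) Self-attention over the query tokens of each view.
    queries = [q + attention(q, q, q) for q in queries]
    # 2) Query tokens attend to the vision tokens of the same view.
    queries = [q + attention(q, x, x) for q, x in zip(queries, visions)]
    # 3) Multi-view attention, applied to query and vision tokens independently.
    all_q, all_x = np.concatenate(queries), np.concatenate(visions)
    queries = [q + attention(q, all_q, all_q) for q in queries]
    visions = [x + attention(x, all_x, all_x) for x in visions]
    # 4) Vision tokens attend to the query tokens of the same view.
    visions = [x + attention(x, q, q) for x, q in zip(visions, queries)]
    return queries, visions

# Toy example: 3 views, 25 query tokens and 196 vision tokens per view, D = 16.
rng = np.random.default_rng(0)
queries = [rng.normal(size=(25, 16)) for _ in range(3)]
visions = [rng.normal(size=(196, 16)) for _ in range(3)]
q_out, x_out = dual_attention_block(queries, visions)
```

Because every step operates on per-view token lists, the number of views is not baked into any parameter shape, which is what allows the same block to be applied to environments with different camera setups.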

Compared to existing robot learning transformers [18, 17], our proposed VKT is convolution-free and built solely with attention layers. Therefore, VKT supports an arbitrary number of camera viewpoints, which is an important advantage as camera setups are different among various environments.

Figure 4: Projecting the kinematics tokens into robot actions with 1D convolution.

Deploying to a Specific Environment. We freeze the trained VKT, drop the point-set prediction branch, and only use the kinematics tokens. We apply a 1D convolution to project the $T$ kinematics tokens into low-level robot actions for the next $T$ time-steps. Unlike the VKT backbone, which is trained from multiple environments, the 1D convolution head is trained separately for each environment. The corresponding proposed architecture is illustrated in Figure 4.
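A minimal sketch of such a head is given below: a 1D convolution slides over the time axis of the $T\times D$ kinematics tokens and outputs one action vector per time-step. The kernel size of 3, the zero padding, the action dimension of 7 (a 6-DoF pose plus a gripper state), and the function name `conv1d_head` are assumptions made for illustration; the paper does not specify these details here.

```python
import numpy as np

def conv1d_head(tokens, weight, bias):
    """Map (T, D) kinematics tokens to (T, A) actions with a 1D convolution over time.

    weight has shape (A, D, k) with odd kernel size k; zero padding keeps T unchanged.
    """
    T, _ = tokens.shape
    A, _, k = weight.shape
    pad = k // 2
    padded = np.pad(tokens, ((pad, pad), (0, 0)))
    actions = np.empty((T, A))
    for t in range(T):
        window = padded[t:t + k]                         # (k, D) temporal neighborhood
        actions[t] = np.einsum("kd,adk->a", window, weight) + bias
    return actions

# Toy example with T = 20 tokens of size D = 512 and a hypothetical 7-D action.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(20, 512))       # stand-in for frozen VKT output tokens
weight = rng.normal(size=(7, 512, 3)) * 0.01
bias = np.zeros(7)
actions = conv1d_head(tokens, weight, bias)
```

Only `weight` and `bias` would be trained per environment, while the backbone that produces `tokens` stays frozen.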

Table 1: Comparison of Behavioral Cloning Transformer (BCT) and our Visual Kinematics Transformer (VKT) on Calvin and RLBench. BCT and VKT share the same architecture except that BCT directly predicts the robot’s end-effector movements. VKT outperforms BCT in isolated environments (specialized) and retains a similar performance when trained in both environments (general).

| Environment | Metric | Specialized Agent: BCT | Specialized Agent: VKT (Ours) | General Agent: BCT | General Agent: VKT (Ours) |
|---|---|---|---|---|---|
| Calvin [28] | Success Length ↑ | 1.1 | 1.173 | 0.334 | 1.083 |
| RLBench [29] | Success Rate ↑ | 36.4% | 55.5% | 24.3% | 61.7% |

4 Experiments

We evaluate VKT on Calvin [28] and RLBench [29], two standard benchmarks in language-conditioned multi-task manipulation learning. Further, we evaluate VKT on a subset of Open-X Embodiment [5] and in real robot experiments. We aim to answer the following questions: (1) Can visual kinematics forecasting improve specialized agents in each environment? (2) Can VKT serve as a strong general agent in multiple environments? (3) Can VKT be efficiently trained with real-world demonstrations and solve manipulation tasks with real robots?

Setup. We adopt the setting of RT-X [5] for imitation learning from RGB images without depth. RLBench contains various task categories, such as pick-and-place, tool use, high-precision operations, screwing, tasks with visual occlusion, and long-term manipulation. RLBench provides five RGBD cameras. We use the RGB signal from the front, left, and right cameras. We choose 17 tasks based on camera visibility and the task categorization of Hiveformer [30]. We use 100 recorded trajectories per task for training, evaluate the trained agent on 25 independent trials, and report the average success rate. Calvin requires policies to accomplish multiple tasks sequentially in each episode. We follow the convention of [28] of training on the pre-collected demonstrations and reporting the average number of successful steps per episode (success length). Examples of Calvin and RLBench tasks are shown in Figure 5. Additional visualizations are provided in the supplementary material.

Implementation Details. We use CLIP ViT-B/16 [27] as text and vision encoders. We resize all RGB inputs to $224\times 224$. We use 8 multi-view dual attention blocks ($L=8$), predict visual kinematic chains for a horizon of 20 time-steps ($T=20$), and set the embedding size $D$ to $512$ in all experiments.

4.1 Learning from Multiple Environments

4.1.1 Simulation

Figure 5: Predicted visual kinematic chains returned by our VKT for different robots in Calvin (left) and RLBench (right). The predicted kinematic chain of the current frame is colored in red and the forecast one for the next time-step is in blue. Videos are included in the supplementary material.

Baselines. We compare the performance of our visual kinematics transformer (VKT) and a behavioral cloning transformer (BCT). The BCT shares the same neural architecture with VKT. The difference is that BCT is directly trained to predict 6-DoF end-effector poses and gripper states. In contrast, VKT is trained to forecast visual kinematic chains first, then project the kinematics tokens to robot actions using a convolution head with the backbone frozen, as shown in Figure 4. The specialized agent is trained on each environment separately. The general agent is trained using demonstrations from both environments. We use the 6-DoF end-effector pose and binary gripper states as the robot action space for both Calvin and RLBench. Note that RLBench requires an additional gripper state for collision avoidance. We use two parallel convolution heads for the BCT general agent to predict the low-level robot actions for each environment.

Table 2: Performance on RLBench. We report the success rates (%) on each task and the average overall success rate. Success rates are measured from five independent runs.

| Agent | Method | Avg. Success | Put In Drawer | Reach Drag | Turn Tap | Slide Blocks | Open Drawer | Money In Safe | Place Wine | Sweep To Pan |
|---|---|---|---|---|---|---|---|---|---|---|
| Specialized Agent | BCT | 36.4 | 1.6 | 77.6 | 63.2 | 12.0 | 20.8 | 44.8 | 44.0 | 5.6 |
| Specialized Agent | VKT (Ours) | 55.5 | 52.0 | 88.8 | 40.8 | 22.4 | 56.0 | 76.8 | 31.2 | 48.0 |
| General Agent | BCT | 24.3 | 0.0 | 28.4 | 68.8 | 8.0 | 8.6 | 8.0 | 4.0 | 0.0 |
| General Agent | VKT (Ours) | 61.7 | 48.8 | 77.6 | 45.6 | 12.8 | 73.6 | 81.6 | 89.6 | 50.4 |

| Agent | Method | Meat Off Grill | Phone On Base | Lid Off Pan | Close Microwave | Close Box | Ball In Hoop | Push Button | Lift Block | Close Jar |
|---|---|---|---|---|---|---|---|---|---|---|
| Specialized Agent | BCT | 32.0 | 37.6 | 47.2 | 93.6 | 84.0 | 33.6 | 21.6 | 0.0 | 0.0 |
| Specialized Agent | VKT (Ours) | 41.6 | 68.0 | 92.0 | 64.0 | 93.6 | 85.6 | 64.0 | 8.8 | 10.4 |
| General Agent | BCT | 21.6 | 0.0 | 52.0 | 80.0 | 76.8 | 36.4 | 20.0 | 0.0 | 0.0 |
| General Agent | VKT (Ours) | 33.6 | 72.8 | 98.4 | 92.0 | 84.8 | 93.6 | 69.6 | 11.2 | 12.8 |

Results. Table 1 summarizes the comparison between VKT and BCT. By forecasting the visual kinematic chains, VKT outperforms BCT in both specialized and general agents. This indicates that the kinematics tokens capture sufficient information about the robot’s action, despite not seeing any low-level robot actions. When trained in both environments, the general BCT agent suffers a catastrophic performance drop ($1.1\rightarrow 0.33$, $36.4\%\rightarrow 24.3\%$). In contrast, the general VKT agent only has a minor drop in Calvin ($1.173\rightarrow 1.083$) but its performance increases ($55.5\%\rightarrow 61.7\%$) in RLBench. This indicates that by learning actions in the image planes, training becomes more robust to distribution shifts. This is further verified by our real-robot experiments in Section 4.2. We list the per-task and per-step success rates of RLBench and Calvin in Tables 2 and 6, respectively. Overall, VKT outperforms BCT, except for a lower success rate in the “turn tap” task. This task requires some specific end-effector rotations which involve large variations in the Euler angles of the end-effector but small changes in the visual kinematic chain from all camera viewpoints. We discuss these limitations and future directions in Section 5.

4.1.2 Open-X Embodiment

Setup. Open-X Embodiment (Open-X) [5] is a collection of video demonstrations of real-world manipulation tasks crowd-sourced from diverse environments. Since Open-X does not provide robot models such as URDF, and camera parameters are not available in most datasets, we select four datasets from Open-X and manually annotate the robot joint positions from 80 demonstration videos with 1,1359 frames. Using the annotated videos, we train a single VKT for the four datasets. Then, we train the network to predict low-level robot actions as in Figure 4. However, instead of freezing the VKT, we continue finetuning the entire network with the trained VKT as initial weights. For the BC transformer, we use the same training setup but randomly initialize the network.

Figure 6: Visual kinematic chains of different robots predicted by our VKT in Open-X Embodiment (left) and a comparison of the relative precision of the behavior cloning transformer (BCT) and our visual kinematics transformer (VKT) on the four selected datasets.

Results. Figure 6 shows qualitative examples of the predicted kinematic chains (left) and compares the relative precision of BCT and VKT on the validation set (right). Let $\mathcal{L}_{vkt}$ and $\mathcal{L}_{bct}$ denote the L1 errors of VKT and BCT, respectively; the relative precision of VKT is computed as $|\mathcal{L}_{vkt}-\mathcal{L}_{bct}|/\mathcal{L}_{vkt}$. Figure 6 shows that our VKT learns to predict kinematic chains from diverse real-world data, and that visual kinematics forecasting can serve as a pre-training objective to improve imitation learning.
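For concreteness, the relative precision defined above can be computed as in the short sketch below. The error values are made up purely to illustrate the arithmetic and are not results from the paper; the function name `relative_precision` is likewise hypothetical.

```python
def relative_precision(l1_vkt, l1_bct):
    """Relative precision of VKT: |L_vkt - L_bct| / L_vkt."""
    return abs(l1_vkt - l1_bct) / l1_vkt

# Hypothetical validation L1 errors, chosen only to show the computation.
example = relative_precision(l1_vkt=0.08, l1_bct=0.10)
```

Because of the absolute value, the metric reports the size of the gap relative to the VKT error and not its sign, so which model has the lower error has to be read from the errors themselves.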

4.2 Real Robot Experiments

Figure 7: Visual chains predicted by VKT in a language-conditioned pick-and-place task with a real robot. The chains of pick and place actions are in red and blue, respectively.
Table 3: Success rates of VKT for solving the real-world language-conditioned pick-and-place task.

| Object | BCT | VKT (Ours) |
|---|---|---|
| Overall (%) | 41.5 | 69.2 |
| Crackers | 3/7 | 4/7 |
| Cup | 3/7 | 4/7 |
| Ball | 3/8 | 7/8 |
| Chips | 1/7 | 3/7 |
| Brick | 5/8 | 6/8 |
| Tomato Can | 4/7 | 6/7 |
| Mustard | 0/7 | 4/7 |
| Sugar Box | 4/7 | 6/7 |
| Orange | 4/7 | 4/7 |

Setup. We evaluate our VKT in a real-world language-conditioned pick-and-place task. The task input includes an RGB image and text instructions such as “put the cup into the blue box”. We adopt nine YCB objects such as balls, chips, and two boxes as place targets. Examples of this task and the forecasted kinematic chains are shown in Figure 7. We implement a scripted policy using a DINOv2-based object detector [31] to collect 90 demonstrations as training data, with only 5 demonstrations for each object-container pair. We automatically compute the ground-truth visual kinematic chain with the robot’s URDF and the calibrated camera parameters. We use a Kuka LBR iiwa robot. For the convolution head, we train the model to directly predict the pick and the place poses instead of predicting the full trajectory for simplicity, and use MoveIt [32] for path generation.

Results. Table 3 shows the success rates of VKT in the real-world task. The overall success rate is the average of 65 independent runs with different object layouts. The BC Transformer (BCT) shares the same training and evaluation setup with VKT, but VKT outperforms BCT by a significant margin. Further, we discover that visual kinematic forecasting enables the use of 2D image augmentation for manipulation learning because the action space also resides in image planes, which is important to prevent overfitting in our experiment due to the small training set. In comparison, existing manipulation learning augmentations are applied in the 3D space [33], which requires depth that is less available than RGB data, and has fewer categories than 2D image augmentations [34].

4.3 Ablation Studies

Table 4: General agent’s performance on forecasting the full kinematic chain vs. only the end-effector

| Configuration | Calvin Success Length ↑ | RLBench Success Rate ↑ |
|---|---|---|
| BCT | 0.334 | 24 |
| VKT (Ours), Full Chain | 1.083 | 55.5 |
| VKT (Ours), End-Effector | 1.06 | 54.1 |

Table 5: Effect of multiple viewpoints in RLBench

| Configuration | Success Rate |
|---|---|
| Single View | 26.8% |
| Multi View | 55.5% |

Table 6: Performance on Calvin. Calvin requires agents to accomplish 5 tasks sequentially at each episode. We measure the step-level success rate and the average number of successful steps (success length). Success rates are measured from 1000 episodes.

| Agent | Method | Success Length ↑ | Step 1 (%) | Step 2 (%) | Step 3 (%) | Step 4 (%) | Step 5 (%) |
|---|---|---|---|---|---|---|---|
| Specialized Agent | BCT | 1.1 | 59 | 31.7 | 12.7 | 5.2 | 1.4 |
| Specialized Agent | VKT (Ours) | 1.173 | 60.5 | 32.5 | 14.7 | 7.1 | 2.5 |
| General Agent | BCT | 0.334 | 26.8 | 5.3 | 1.2 | 0.1 | 0 |
| General Agent | VKT (Ours) | 1.083 | 58.3 | 29 | 13.5 | 5.6 | 1.9 |
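The two Calvin metrics in Table 6 are linked: if the step-$k$ success rate is read as the fraction of episodes in which at least $k$ consecutive tasks were completed, the success length is the sum of the five step-level rates. The sketch below checks this reading against the numbers reported in Table 6; the interpretation is our assumption and the helper name `success_length` is hypothetical.

```python
def success_length(step_rates_percent):
    """Average number of completed steps, given per-step success rates in percent."""
    return sum(step_rates_percent) / 100.0

# Step-level success rates (%) copied from Table 6.
table6 = {
    "Specialized BCT": [59, 31.7, 12.7, 5.2, 1.4],
    "Specialized VKT": [60.5, 32.5, 14.7, 7.1, 2.5],
    "General BCT": [26.8, 5.3, 1.2, 0.1, 0],
    "General VKT": [58.3, 29, 13.5, 5.6, 1.9],
}
lengths = {name: round(success_length(rates), 3) for name, rates in table6.items()}
```

The recomputed values agree with the success lengths listed in the table.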

Predicting the Full Chain versus Predicting Only the End-Effector. We compare the effects of forecasting the full chain and only the end-effector in Table 4. Table 4 shows that forecasting the full chain improves the performance on both Calvin and RLBench. However, despite its inability to capture rotations, the end-effector VKT also delivers stronger performance as a general agent than BCT. This indicates that a key to building general agents across diverse environments is to predict robot movements directly in the image plane. Furthermore, the end-effector predictions can be useful in special cases when the robot’s body is largely invisible due to occlusions, or when robot models are unavailable. This suggests a broader applicability for visual kinematics forecasting.

Multi-View versus Single-View Predictions. We study the impacts of multi-view forecasting in Table 5. We remove the multi-view attention layers to create the single-view VKT. We average the low-level robot actions predicted from each view as the final prediction. Table 5 shows that VKT improves its 3D understanding significantly with our multi-view dual attention block by predicting consistent kinematic chains in different camera viewpoints.

5 Discussion and Conclusion

We have shown that visual kinematics forecasting of quasi-static robot movements can be a building block for developing general agents across diverse robot learning environments. However, there are still some open questions that are worth discussing. For example, we did not utilize wrist cameras in VKT because they do not capture images of the robot. This can be resolved with an RGBD wrist camera and virtual view projection [18]. Also, some public real-robot datasets lack robot URDF files and reliable camera parameters, which creates several challenges for applying our method. Finally, it is worth noting that one can re-purpose VKT to forecast not only robot movements, but also human poses and movements in order to leverage abundant human demonstration videos for robot learning.

References

  • Bommasani et al. [2021] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Shridhar et al. [2022] M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022.
  • Shridhar et al. [2023] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
  • Brohan et al. [2023] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  • Padalkar et al. [2023] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. 2023.
  • Flamary et al. [2021] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
  • Rubner et al. [2000] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40:99–121, 2000.
  • Feydy [2020] J. Feydy. Geometric data analysis, beyond convolutions. Applied Mathematics, 2020.
  • Dasari et al. [2020] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. In L. P. Kaelbling, D. Kragic, and K. Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 885–897. PMLR, 30 Oct–01 Nov 2020. URL https://proceedings.mlr.press/v100/dasari20a.html.
  • Huang et al. [2020] W. Huang, I. Mordatch, and D. Pathak. One policy to control them all: shared modular policies for agent-agnostic control. In Proceedings of the 37th International Conference on Machine Learning, pages 4455–4464, 2020.
  • Kurin et al. [2020] V. Kurin, M. Igl, T. Rocktäschel, W. Boehmer, and S. Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. In International Conference on Learning Representations, 2020.
  • Gupta et al. [2021] A. Gupta, L. Fan, S. Ganguli, and L. Fei-Fei. Metamorph: Learning universal controllers with transformers. In International Conference on Learning Representations, 2021.
  • Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Reed et al. [2022] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  • Team et al. [2023] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, et al. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2023.
  • Chi et al. [2023] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
  • Brohan et al. [2022] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
  • Goyal et al. [2023] A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. arXiv preprint arXiv:2306.14896, 2023.
  • Xu et al. [2022] C. Xu, S. Yang, T. Galanti, B. Wu, X. Yue, B. Zhai, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka. Image2point: 3d point-cloud understanding with 2d image pretrained models. In European Conference on Computer Vision, pages 638–656. Springer, 2022.
  • Zhu et al. [2023] H. Zhu, H. Yang, X. Wu, D. Huang, S. Zhang, X. He, T. He, H. Zhao, C. Shen, Y. Qiao, et al. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm. arXiv preprint arXiv:2310.08586, 2023.
  • Mendonca et al. [2023] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023.
  • Belkhale et al. [2024] S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024.
  • Yuan et al. [2024] C. Yuan, C. Wen, T. Zhang, and Y. Gao. General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024.
  • Zhang and Boularias [2024] X. Zhang and A. Boularias. One-shot imitation learning with invariance matching for robotic manipulation. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.
  • Gu et al. [2023] J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023.
  • Cuturi [2013] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Mees et al. [2022] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022.
  • James et al. [2020] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
  • Guhur et al. [2023] P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pages 175–187. PMLR, 2023.
  • Oquab et al. [2023] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Coleman et al. [2014] D. Coleman, I. Sucan, S. Chitta, and N. Correll. Reducing the barrier to entry of complex robotic software: a moveit! case study. arXiv preprint arXiv:1404.3785, 2014.
  • Mitrano and Berenson [2022] P. Mitrano and D. Berenson. Data augmentation for manipulation. arXiv preprint arXiv:2205.02886, 2022.
  • Xu et al. [2023] M. Xu, S. Yoon, A. Fuentes, and D. S. Park. A comprehensive survey of image augmentation techniques for deep learning. Pattern Recognition, 137:109347, 2023.