Figure 1. A physically simulated character performing motions specified by language commands. Our framework is able to train a versatile language-directed controller on a large dataset containing thousands of motions.

SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation

Jordan Juravsky (ORCID 0000-0003-2080-7074; NVIDIA, Canada, and Stanford University, United States), Yunrong Guo (ORCID 0000-0001-7468-6162; NVIDIA, Canada), Sanja Fidler (ORCID 0000-0003-1040-3260; NVIDIA, Canada, and University of Toronto, Canada), and Xue Bin Peng (ORCID 0000-0002-3677-5655; NVIDIA, Canada, and Simon Fraser University, Canada)

(2024)

Abstract.

Physically-simulated models for human motion can generate high-quality responsive character animations, often in real-time. Natural language serves as a flexible interface for controlling these models, allowing expert and non-expert users to quickly create and edit their animations. Many recent physics-based animation methods, including those that use text interfaces, train control policies using reinforcement learning (RL). However, scaling these methods beyond several hundred motions has remained challenging. Meanwhile, kinematic animation models are able to successfully learn from thousands of diverse motions by leveraging supervised learning methods. Inspired by these successes, in this work we introduce SuperPADL, a scalable framework for physics-based text-to-motion that leverages both RL and supervised learning to train controllers on thousands of diverse motion clips. SuperPADL is trained in stages using progressive distillation, starting with a large number of specialized experts using RL. These experts are then iteratively distilled into larger, more robust policies using a combination of reinforcement learning and supervised learning. Our final SuperPADL controller is trained on a dataset containing over 5000 skills and runs in real time on a consumer GPU. Moreover, our policy can naturally transition between skills, allowing for users to interactively craft multi-stage animations. We experimentally demonstrate that SuperPADL significantly outperforms RL-based baselines at this large data scale.

character animation, language commands, reinforcement learning, adversarial imitation learning

Journal year: 2024. Copyright: ACM licensed. Conference: Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24 (SIGGRAPH Conference Papers ’24), July 27-August 1, 2024, Denver, CO, USA. DOI: 10.1145/3641519.3657492. ISBN: 979-8-4007-0525-0/24/07. CCS concepts: Computing methodologies → Procedural animation; Computing methodologies → Control methods; Computing methodologies → Adversarial learning.

1. Introduction

Physics-based character animation offers the potential of synthesizing life-like and responsive behaviors from first principles. The use of reinforcement learning techniques for character control has led to a rapid growth in the corpus of motor skills that can be replicated by simulated characters, ranging from common behaviors, such as locomotion (Peng et al., 2017; Xie et al., 2020; Winkler et al., 2022), to highly athletic skills, such as gymnastics and martial arts (Peng et al., 2018; Liu and Hodgins, 2018; Won et al., 2021; Xie et al., 2022). As the capabilities of simulated characters continue to improve, interfaces that enable users to direct and elicit the desired behaviors from a character will be vital for the viability of these systems for practical applications. The most common interfaces for directing simulated characters often leverage compact control abstractions, such as joystick commands or target waypoints (Treuille et al., 2007; Coros et al., 2009; de Lasa et al., 2010; Holden et al., 2017; Peng et al., 2017). These abstractions provide accessible interfaces that allow users to easily specify high-level commands for a character, but they tend to offer only limited control over the character’s behaviors. More versatile control interfaces, such as target trajectories and keyframes, can in principle allow users to specify any desired behaviors for a character (Peng et al., 2018; Wang et al., 2020; Won et al., 2020; Luo et al., 2023). However, authoring target trajectories can itself be a labour-intensive process, requiring significant domain expertise or specialized equipment (e.g. motion capture).

Natural language offers a promising interface that can be both accessible and versatile. Large language models have been shown to provide powerful interfaces for directing generative models in a large variety of domains (Devlin et al., 2018; Brown et al., 2020; Saharia et al., 2022; Poole et al., 2023). Efforts have also been made to incorporate language interfaces into motion synthesis models. However, the majority of this work has been focused on kinematic motion models (Petrovich et al., 2022; Dabral et al., 2023; Tevet et al., 2023; Jiang et al., 2023; Zhang et al., 2023b). Language-directed controllers for physics-based characters have yet to replicate comparable versatility, scalability, and motion quality of their kinematic counterparts (Juravsky et al., 2022; Ren et al., 2023; Sun et al., 2023). In this work, we aim to develop a scalable framework for training language-directed controllers for physically simulated characters, which is able to leverage large motion datasets to learn a single versatile controller capable of performing a vast repertoire of skills.

The central contribution of this work is a large-scale framework for training versatile language-directed controllers for physically simulated characters. To address scalability challenges of applying reinforcement learning to a large corpus of skills, we propose a progressive distillation framework, which starts from skill-specific controllers, and then progressively constructs more versatile controllers that are able to perform a larger and larger set of skills, ultimately yielding a single unified controller capable of reproducing behaviors from over 5000 motion capture sequences. Our model operates in real-time, and is able to respond interactively to changes in the user’s language commands.

2. Related Work

Physics-based models for character animation have had a long history in computer graphics, with a large body of work devoted to constructing controllers that enable simulated characters to reproduce a large repertoire of motor skills (Hodgins et al., 1995; da Silva et al., 2008; Wang et al., 2009; Lee et al., 2010a; Tan et al., 2014; Wang et al., 2012; Liu and Hodgins, 2018; Clegg et al., 2018). While these efforts have led to substantial improvements in the motor capabilities of simulated characters, a barrier that has precluded the wider adoption of these models for practical applications has been the lack of accessible interfaces for users to direct the behaviors of these models. The majority of these models provide users with simple control interfaces, such as joystick commands or target waypoints (Treuille et al., 2007; Coros et al., 2009; Lee et al., 2010b; Agrawal and van de Panne, 2016; Holden et al., 2017; Peng et al., 2018; Starke et al., 2019; Zhang et al., 2020; Ling et al., 2020; Peng et al., 2021; Lee et al., 2021b, a; Peng et al., 2022), which are easy to use but greatly restrict the control a user can exert over the character’s behaviors. Alternatively, motion imitation models provide an expressive interface, where a user can specify target reference motions for a simulated character to imitate (Peng et al., 2018; Bergamin et al., 2019; Won et al., 2020; Luo et al., 2023). While these models can provide users with versatile and granular control over a character’s behavior, they require users to construct reference motions for every desired motion, which can itself be a costly and labour-intensive process.

Text-Driven Generation:

Natural language offers the potential to develop accessible and expressive interfaces for directing the behaviors of simulated characters. Recent advances in large language models have led to a proliferation of text-driven interfaces for generative models in a wide variety of domains (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2019; Saharia et al., 2022; Poole et al., 2023). Similar techniques have also been adopted for motion synthesis, creating text2motion models that are able to generate motions according to natural language descriptions (Plappert et al., 2017; Lin et al., 2018; Ahuja and Morency, 2019; Tevet et al., 2022, 2023). These models have in part been made possible by the availability of large public datasets which contain tens of hours of text-labeled human motion data (Guo et al., 2022; Punnakkal et al., 2021). A large body of work has also adapted generative modeling techniques for text-directed motion synthesis, such as sequence prediction models with RNNs (Plappert et al., 2017; Lin et al., 2018), variational auto-encoders (Petrovich et al., 2022; Athanasiou et al., 2022), contrastive coding (Ahuja and Morency, 2019; Tevet et al., 2022), transformers (Jiang et al., 2023; Zhang et al., 2023a), and diffusion models (Dabral et al., 2023; Tevet et al., 2023). While these text2motion models have demonstrated promising capabilities, the vast majority of this work focuses on kinematic motion models.

Language-Directed Controllers:

In this work, we aim to develop a large-scale framework for training language-directed models for physically simulated characters. Prior efforts in this domain have largely been limited in terms of the variety of motions that can be reproduced by a model, and the quality of the generated motions. Juravsky et al. (2022) proposed an adversarial imitation learning framework for training language-directed controllers for a simulated humanoid character. However, their system was only effective in learning from a relatively small dataset of approximately 9 minutes of motion data. A common approach for developing more versatile controllers is to combine a kinematic text2motion model with a low-level motion tracking model (Ren et al., 2023; Luo et al., 2023). These methods first generate a kinematic reference motion using previously mentioned text2motion techniques, and the task of controlling a simulated character is then reduced to simple motion tracking. This approach is able to leverage the scalability of kinematic motion models that are trained on large motion datasets, but the capabilities of the simulated character are then greatly restricted to closely follow the behaviors dictated by the kinematic model. Furthermore, the motion quality of these composite models is still conspicuously lower compared to state-of-the-art physics-based character animation systems.

The goal of our work is to develop versatile end-to-end language-directed controllers for physically simulated characters that can be controlled using natural language to perform a large corpus of motor skills. Our model does not require an existing kinematic text2motion model, and instead directly maps language commands to motor actuations for driving a simulated character to perform the behaviors specified by the user. The key component of our system is a progressive supervised distillation framework, which gradually trains controllers on larger and larger datasets. Training specialized expert controllers has been a common approach for applying motion imitation methods to train general controllers from large datasets (Won et al., 2020; Luo et al., 2023). However, these prior systems combine individual experts into a general controller by constructing mixture-of-experts models, which requires retaining all experts in the runtime model. Our framework leverages a progressive distillation procedure to aggregate expert controllers trained on different motion datasets into a single general controller capable of reproducing behaviors from a large dataset of motion clips. Distillation has been applied in previous efforts to scale up reinforcement learning to train multi-task control policies (Merel et al., 2019, 2020). Wagener et al. (2023) used a single-stage distillation approach to train a general motion tracking controller capable of imitating approximately 3.5 hours of motion data. Our multi-stage progressive distillation approach enables our system to train versatile controllers on 8.5 hours of text-labeled motion clips, leading to a unified end-to-end controller that can be directed to perform a large variety of skills with simple text commands. Model-based RL has also shown promise for scaling to larger motion datasets (Won et al., 2022; Yao et al., 2022). However, these techniques have yet to be effectively demonstrated on large motion datasets. Our work shows that model-free RL combined with a progressive distillation approach can be an effective method for training a general unified controller from thousands of motion clips.

3. Overview

In this work, our goal is to train a single versatile control policy capable of responding to thousands of different text commands, while being able to naturally transition between motions. The approach we propose, which we call SuperPADL (the name is inspired by our combination of supervised learning and PADL-like adversarial RL objectives), is inspired by two observations:

• Existing reinforcement learning techniques for physics-based animation are able to train policies with many desirable properties, such as the ability to reproduce reference motions with high quality and naturally transition between skills. However, these approaches do not scale beyond at most several hundred motions.

• Kinematic motion models, such as motion diffusion models, are able to scale to datasets containing thousands of motions using supervised learning objectives.

In light of these observations, we present a method that combines RL and supervised objectives, centered around the progressive distillation of motion controllers. Initially, we seek to train a large number of highly specialized expert policies using RL. We then iteratively distill these experts together with supervised techniques, progressively training more general-purpose and capable models. Concretely, our method is composed of three training stages:

(1) We first train an independent expert tracking policy on every motion capture sequence in our dataset using DeepMimic (Peng et al., 2018). The purpose of this stage is to create high-quality reconstructions of our original data in the physical domain. These reconstructions provide us with a “dataset of actions” that we can use to train later networks with supervised losses.

(2) Next, we randomly partition our dataset into groups of 20 motions and train a controller policy on each group. These group controllers are trained using a hybrid objective that combines adversarial RL with a behaviour cloning (BC) loss on trajectories from the DeepMimic experts. Each group controller learns to naturally transition between the skills in its group, and does not require the phase variable used to train the DeepMimic experts. Note that these group controllers are not conditioned on language during training.

(3) Finally, we distill the group controllers into a single text-conditioned global policy. This final distilled controller is trained exclusively with a supervised imitation objective: for every motion in our dataset, we encourage the distilled controller to match the actions of the appropriate group controller.

Figure 2 provides an overview of our framework. In each stage of training, we progressively produce more general-purpose, capable controllers. Simultaneously, our approach progressively reduces the use of RL as the number of motions learned per policy grows: DeepMimic experts are trained purely with RL, group controllers are trained with a hybrid RL-BC objective, and the global controller is only trained with supervision. The key to our approach is to leverage RL methods only at the smaller data scales where these methods are effective, and then transfer the skills of many smaller networks into larger ones through distillation.

3.1. Training Per-Motion Expert Tracking Policies

Recent work has assembled large datasets $D=\{(m_{i},C_{i})\}$ of motion capture sequences $m_{i}$ annotated with a set of one or more natural language labels $C_{i}$ (Punnakkal et al., 2021; Guo et al., 2022). However, these motion capture recordings are kinematic, preventing the direct application of supervised losses when training physically-simulated control policies. In order to address this incompatibility, in the first stage of SuperPADL we translate our motion capture dataset into the physical domain by training an expert tracking policy on every motion in our dataset.

Our approach is inspired by the first stage of MoCapAct (Wagener et al., 2023), leveraging DeepMimic as our tracking method (Peng et al., 2018). We train a DeepMimic expert policy $\pi^{e}_{i}({\mathbf{a}}_{t}|{\mathbf{o}}_{t},\phi)$ on every motion capture sequence in our dataset, conditioned on the current state of character ${\mathbf{o}}_{t}$ as well as a phase variable $\phi\in[0,1]$ that synchronizes the policy to the reference motion.

We train each policy for a maximum of 3000 epochs, corresponding to approximately 200M frames of experience. To reduce overall compute usage, we monitor the cartesian pose error throughout training and stop training early when the error falls below 3cm. If the epoch limit is reached and the pose error remains above 5cm, we discard the motion from our dataset. By leveraging a GPU-accelerated implementation of DeepMimic in NVIDIA Isaac Gym (Peng et al., 2018; Makoviychuk et al., 2021), the majority of tracking policies complete training in under an hour, while only 5% of motions are discarded (see Section 4.1). The discarded items in our dataset often correspond to physically implausible motions, such as those that involve third-party objects that are not recorded in the motion capture sequence (e.g. climbing up nonexistent stairs, sitting on nonexistent chairs, etc.). Additionally, our tracking environment resets whenever a character’s body part, excluding the feet, touches the ground. This causes motions like crawling and lying prone to be discarded.
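The stopping and rejection rule described above can be sketched as follows. This is a minimal illustration rather than the training code used in this work: the function name, the `Outcome` labels, and the assumption that a pose error (in metres) is available after every epoch are all hypothetical. It also assumes that a policy which reaches the epoch limit with an error between the two thresholds is kept, which is one reading of the text.

```python
from enum import Enum


class Outcome(Enum):
    CONVERGED = "converged"   # stopped early; expert is kept
    KEPT = "kept"             # hit the epoch limit with an acceptable error
    DISCARDED = "discarded"   # hit the epoch limit with too large an error


def run_expert_training(pose_errors_m, max_epochs=3000,
                        stop_below_m=0.03, discard_above_m=0.05):
    """Apply the early-stop / discard rule to a stream of per-epoch pose errors.

    pose_errors_m: iterable yielding the pose error (metres) after each epoch
    (a hypothetical stand-in for running one epoch of tracking training).
    Returns (Outcome, number of epochs run).
    """
    epochs_run, last_error = 0, float("inf")
    for error in pose_errors_m:
        epochs_run += 1
        last_error = error
        if error < stop_below_m:          # early stop: error fell below 3 cm
            return Outcome.CONVERGED, epochs_run
        if epochs_run >= max_epochs:      # epoch budget exhausted
            break
    if last_error > discard_above_m:      # still above 5 cm: drop the motion
        return Outcome.DISCARDED, epochs_run
    return Outcome.KEPT, epochs_run
```

Because rejected motions always consume the full epoch budget under this rule, a small fraction of them can account for a large share of total compute.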

After training each tracking policy, we record a dataset of trajectories containing 10000 observation-action frames from each expert:

$$T_{i}=({\mathbf{o}}_{1}^{i},{\mathbf{a}}_{1}^{i},{\mathbf{o}}_{2}^{i},{\mathbf{a}}_{2}^{i},...). \tag{1}$$

In practice, in order to increase the diversity of states encountered during these rollouts, $T_{i}$ is created in chunks by initializing the character 100 different times using random frames from the reference motion and rolling out each initialization for 100 frames. Additionally, in 90% of rollouts, stochastic actions are sampled from the policy’s action distribution, while in the remaining 10% deterministic/greedy actions are taken from the mean of the action distribution. Note that while a combination of stochastic and deterministic actions are used to generate the rollouts, the actions recorded in the trajectory data always correspond to the deterministic action that the expert policy would have taken at each state. This is done to avoid adding noise to the trajectory action labels. We refer to this collection of trajectories $D_{T}=\{T_{i}\}$ as our trajectory dataset.
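The rollout procedure above can be sketched in a few lines. This is an illustrative sketch only: `policy_mean`, `reset_fn`, and `step_fn` are hypothetical callables standing in for the expert policy and the simulator, and a fixed Gaussian standard deviation is assumed for the action distribution. Whether a rollout is stochastic is drawn independently with probability 0.9, so the 90/10 split holds in expectation rather than exactly.

```python
import numpy as np


def collect_expert_frames(policy_mean, policy_std, reset_fn, step_fn, rng,
                          num_chunks=100, chunk_len=100, stochastic_frac=0.9):
    """Collect num_chunks * chunk_len (observation, action) frames from one expert.

    policy_mean(obs) -> mean action of the expert (hypothetical interface)
    reset_fn(rng)    -> observation at a random frame of the reference motion
    step_fn(obs, a)  -> next observation after executing action a
    """
    observations, actions = [], []
    for _ in range(num_chunks):
        obs = reset_fn(rng)
        stochastic = rng.random() < stochastic_frac
        for _ in range(chunk_len):
            mean_action = policy_mean(obs)
            # The stored label is always the deterministic action, even when
            # a noisy action is executed, to keep the labels noise-free.
            observations.append(obs)
            actions.append(mean_action)
            if stochastic:
                noise = policy_std * rng.standard_normal(mean_action.shape)
                executed = mean_action + noise
            else:
                executed = mean_action
            obs = step_fn(obs, executed)
    return np.stack(observations), np.stack(actions)
```

Executing noisy actions while recording clean labels widens the set of visited states without corrupting the supervision signal used in later stages.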

3.2. Training Group Controllers with PADL+BC

Our ultimate goal is to train a single model that can perform a wide range of behaviors in response to user text commands and seamlessly transition between different behaviors as the user’s command changes. However, at the end of the first stage of SuperPADL training, we are instead left with a collection of highly specialized expert tracking policies. Each expert can reproduce their corresponding motion capture clip with high quality, yet they cannot generate any other motion. Additionally, tracking policies lack the robustness to reliably recover from perturbations or initialization in out-of-distribution states, as might be necessary when transitioning between different motions. Therefore, the next step in our pipeline is to develop more general control policies that retain the motion quality of individual experts while also being much more robust.

Previous work has demonstrated that adversarial reinforcement learning can effectively train policies that exhibit these properties (Peng et al., 2021, 2022; Juravsky et al., 2022). However, as we experimentally demonstrate in Section 4.3, these RL techniques do not directly scale to thousands of motions. We address this limitation by employing a progressive distillation approach in the second and third stages of SuperPADL training. In the second stage of SuperPADL, we train controllers on small groups of motions using an objective combining adversarial RL and behaviour cloning, before performing a second, purely-supervised distillation stage to train a final global policy. In this first distillation stage, we randomly partition our dataset into groups of 20 motions:

$$P_{i}=\{(m_{20i+1},C_{20i+1}),(m_{20i+2},C_{20i+2}),...,(m_{20i+20},C_{20i+20})\} \tag{2}$$

and train a group controller $\pi^{g}_{i}({\mathbf{a}}_{t}|{\mathbf{o}}_{t},I)$ on each partition $P_{i}$, parameterized by the current character state ${\mathbf{o}}_{t}$ and the motion index $I\in\{20i+1,...,20i+20\}$. The index $I$ is encoded using a trainable, randomly-initialized embedding table.
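As a minimal sketch of the random partitioning step, the helper below shuffles a list of motion identifiers and slices it into groups of 20. The function name and the fixed seed are assumptions for illustration. The text does not say how a final, smaller group is handled when the dataset size is not a multiple of 20; here it is simply kept as a smaller group.

```python
import random


def partition_motions(motion_ids, group_size=20, seed=0):
    """Randomly partition motion identifiers into groups of group_size.

    The last group may be smaller if len(motion_ids) is not a multiple
    of group_size (an assumption; the text does not specify this case).
    """
    shuffled = list(motion_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[start:start + group_size]
            for start in range(0, len(shuffled), group_size)]
```

Each returned group then corresponds to one partition $P_{i}$ and one group controller.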

Our goal is for each group controller to:

•

Imitate the motions in its partition when conditioned on the corresponding motion.
•

Naturally transition between these motions when the input index changes.
•

Generally avoid falling over.

We optimize each group controller using PADL+BC, a novel objective that combines the adversarial RL setup of PADL with behaviour cloning:

$$\mathcal{L}=\mathcal{L}_{\text{PADL}}+0.01\mathcal{L}_{\text{BC}}. \tag{3}$$

PADL introduces a motion-conditioned discriminator network:

$$\text{Disc}(I,{\mathbf{s}},{\mathbf{s}}^{\prime})\to[0,1] \tag{4}$$

which is trained to distinguish between state transitions $({\mathbf{s}},{\mathbf{s}}^{\prime})$ from a reference motion $m_{I}$ and state transitions generated by the policy when conditioned on $I$. The policy is optimized with PPO (Schulman et al., 2017), using a reward that encourages the policy to “fool” the discriminator:

$$r_{t}=-\log\left(1-\text{Disc}(I,{\mathbf{s}}_{t-1},{\mathbf{s}}_{t})\right). \tag{5}$$
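To make the shape of this reward concrete, the sketch below evaluates it for a scalar discriminator output. The clamp with a small `eps` is an assumption added here to keep the logarithm finite when the discriminator outputs exactly 1; it is not described in the text.

```python
import math


def discriminator_reward(disc_output, eps=1e-6):
    """Reward r = -log(1 - D) for a discriminator output D in [0, 1].

    The output is clamped to [0, 1 - eps] (an assumption for numerical
    safety) so that the reward stays finite.
    """
    d = min(max(disc_output, 0.0), 1.0 - eps)
    return -math.log(1.0 - d)
```

The reward is zero when the discriminator is certain a transition came from the policy (D = 0) and grows without bound as D approaches 1, so transitions that look like the reference motion are rewarded most.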

In addition to this PPO loss, we introduce an additional behaviour cloning loss on the dataset of expert trajectories $D_{T}$. At each step of optimization, we sample (observation, action) pairs from the stored trajectories of our grouped motions and encourage the group controller to imitate the expert actions:

$$\mathcal{L}_{\text{BC}}=\mathop{\mathbb{E}}_{I\sim\{20i+1,...,20i+20\}}\mathop{\mathbb{E}}_{({\mathbf{o}},{\mathbf{a}})\sim T_{I}}||\pi^{g}_{i}({\mathbf{o}},I)-{\mathbf{a}}||_{2}^{2}. \tag{6}$$
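A minimal NumPy sketch of Equations 3 and 6 on a sampled batch is given below. The function names are hypothetical, the expectation is approximated by a batch mean, and the PADL (PPO) loss is treated as an already-computed scalar, since its computation depends on the RL implementation.

```python
import numpy as np


def bc_loss(predicted_actions, expert_actions):
    """Batch estimate of Eq. 6: mean squared L2 distance between the group
    controller's actions and the stored expert actions."""
    diff = np.asarray(predicted_actions, dtype=float) - np.asarray(expert_actions, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def padl_bc_loss(padl_loss, bc, bc_weight=0.01):
    """Eq. 3: PADL loss plus the behaviour cloning loss weighted by 0.01."""
    return padl_loss + bc_weight * bc
```

For example, a batch of two actions with squared errors 5 and 0 gives a BC loss of 2.5, which contributes 0.025 to the combined objective.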

While the observations for the tracking experts included a phase variable to synchronize the policy to the target motion, this phase variable is omitted from the observations given to the group controller. This is crucial since determining the correct phase observation to give to the group controller during inference is difficult: for example, when transitioning from one motion to another, it is unclear what the phase should be set to.

The motivation behind the added behaviour cloning loss is twofold. First, the supervised training signal allows us to significantly cut down on the training cost of a group controller relative to a pure-PADL policy. We train each group controller using a 2000-epoch warmup phase where only the BC loss is applied. These warmup epochs complete significantly faster than normal PPO epochs where trajectories must be rolled out in the environment. Following this warmup, we only train group controllers with PPO+BC for an additional 1B samples of experience, compared to the 7B samples reported in Juravsky et al. (2022) for training a PADL policy from scratch. Overall, training a group controller with PADL+BC completes in around 12 hours on a single A40 GPU, compared to almost three days of training for a pure-PADL controller. Additionally, we demonstrate in Section 4.4 that PADL+BC controllers, leveraging the experts’ accurate motion reconstructions, can generate motions with higher quality than pure PADL policies.

3.3. Distilling into a Global Text-Conditioned Policy

Group controllers mark a significant improvement in generalization over the tracking expert policies, enabling a single controller to reproduce multiple motions, naturally transition between motions, and operate without a phase variable. However, the group controllers are still constrained by the relatively small set of motions that each controller is trained on. As we demonstrate in Section 4.3, PADL+BC alone does not scale to the thousands of motions available in open-source motion capture datasets (Mahmood et al., 2019). Moreover, our group controllers are conditioned using simple motion indices and cannot follow commands in natural language.

To train a global, language-conditioned policy $\pi^{G}({\mathbf{a}}_{t}|{\mathbf{o}}_{t},c)$ (where $c$ denotes a text caption), we perform a second round of distillation, now leveraging the group controllers as teacher policies. Unlike PADL+BC, which combines RL and supervised training objectives, the training of the global policy is purely supervised, allowing us to scale to much larger datasets.

Training of the global policy begins similarly to group controllers with an offline, behaviour cloning warmup phase using the static trajectory dataset $D_{T}$ collected from the tracking policies. This initializes the state distribution of the global policy to be similar enough to those of the group controllers that they can provide effective teacher feedback.

Following warmup, the global policy is trained to convergence using online imitation learning similar to DAgger (Ross et al., 2011). Every epoch, trajectories are rolled out using the current global controller. Each observation in those trajectories is then annotated using the appropriate group controller (based on the motion that the policy is trying to imitate). The global controller is optimized to match these annotations. Unlike with group controllers, when conditioning the global controller to imitate motion $i$, we provide a natural language caption for the motion instead of simply an index variable. At every rollout, we sample a motion $(m_{i},C_{i})$ from $D$ and sample a caption $c$ from $C_{i}$. The caption is encoded using the CLIP text encoder, with the pooled encoder embedding provided to the main policy network $\pi^{G}$.
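The online distillation loop can be illustrated with a deliberately small toy problem. Everything in this sketch is an assumption made for illustration: the dynamics are a trivial integrator, the teachers are simple hand-written functions rather than trained group controllers, fixed one-hot vectors stand in for CLIP caption embeddings, the student is a linear map, and the student is refit by least squares on all data gathered so far rather than trained with gradient descent. Only the structure (student rolls out, teacher labels the student's states, student fits the labels) mirrors the procedure described above.

```python
import numpy as np


def rollout(student_w, caption_emb, rng, steps=20, dim=2):
    """Roll out the student in toy dynamics; return the features it visited."""
    obs = rng.standard_normal(dim)
    visited = []
    for _ in range(steps):
        features = np.concatenate([obs, caption_emb])
        visited.append(features)
        action = features @ student_w      # student chooses the action
        obs = obs + 0.1 * action           # toy dynamics (assumption)
    return visited


def distill(teachers, embeddings, rng, epochs=5, rollouts_per_epoch=8, dim=2):
    """DAgger-style loop: teachers annotate states visited by the student."""
    emb_dim = len(next(iter(embeddings.values())))
    student_w = np.zeros((dim + emb_dim, dim))
    motion_ids = sorted(teachers)
    feats, labels = [], []
    for _ in range(epochs):
        for _ in range(rollouts_per_epoch):
            m = motion_ids[rng.integers(len(motion_ids))]   # sample a motion
            for f in rollout(student_w, embeddings[m], rng, dim=dim):
                feats.append(f)
                labels.append(teachers[m](f[:dim]))          # teacher label
        # Refit the student on all annotated states collected so far.
        student_w, *_ = np.linalg.lstsq(np.stack(feats), np.stack(labels), rcond=None)
    return student_w


# Toy usage: each "teacher" drives the state toward a motion-specific target.
targets = {0: np.array([1.0, 0.0]), 1: np.array([0.0, -1.0])}
teachers = {m: (lambda obs, t=t: t - obs) for m, t in targets.items()}
embeddings = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
student_w = distill(teachers, embeddings, np.random.default_rng(0))
```

Because labels are queried on states the student itself reaches, the student receives corrective supervision for its own mistakes, which offline behaviour cloning on expert trajectories alone does not provide.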

3.4. Experimental Details

3.4.1. Dataset Curation

The motion capture data that we use to train SuperPADL is a filtered subset of AMASS, an open-source aggregation of smaller motion datasets (Mahmood et al., 2019). We first filter out any motion clips shorter than two seconds or longer than nine seconds. We also apply a series of filters that attempt to detect motions that are physically impossible, such as climbing a staircase or swinging from a bar (third party objects are not included in the motion capture recordings). The plausibility filters examine the heights of the character’s limbs and extremities to look for signs that a character has not touched the ground for a prolonged period of time, filtering the motion out of our dataset if such an event is detected. We train DeepMimic experts on a dataset of 5866 filtered motions.
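The two kinds of filters described above might look roughly like the following sketch. The duration bounds come from the text; everything else is hypothetical, including the function names, the use of the lowest body-point height per frame as the ground-contact signal, the 0.1 m ground tolerance, and the one-second airborne limit. The filters actually used may examine limb heights differently.

```python
def passes_duration_filter(num_frames, fps, min_s=2.0, max_s=9.0):
    """Keep clips whose duration lies between two and nine seconds."""
    duration_s = num_frames / fps
    return min_s <= duration_s <= max_s


def longest_airborne_run(lowest_body_height_m, ground_tol_m=0.1):
    """Longest run of consecutive frames in which the lowest body point
    stays above ground_tol_m (hypothetical contact heuristic)."""
    longest = current = 0
    for height in lowest_body_height_m:
        current = current + 1 if height > ground_tol_m else 0
        longest = max(longest, current)
    return longest


def passes_plausibility_filter(lowest_body_height_m, fps,
                               max_airborne_s=1.0, ground_tol_m=0.1):
    """Reject clips where the character appears not to touch the ground
    for longer than max_airborne_s seconds (threshold is an assumption)."""
    run = longest_airborne_run(lowest_body_height_m, ground_tol_m)
    return run / fps <= max_airborne_s
```

A clip is kept only if it passes both checks; stricter thresholds trade a smaller dataset for fewer wasted expert training runs.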

To augment this motion data with natural language annotations, we use the HumanML3D dataset of captions (Guo et al., 2022), which provides several captions for every motion in AMASS. To add additional diversity to this data, we use ChatGPT to generate paraphrases using the original set of annotations. When training our global controller on the 5587 motions that pass the expert tracking phase (totalling approximately 8.5 hours of data), there are a total of 48207 captions in the dataset.

3.4.2. Network Architectures

All policies (tracking experts, group controllers, and global controllers) are trained using simple MLP architectures. Figure 3 summarizes the inputs and outputs of each network. While several existing works in adversarial RL use only the character’s state at the current frame as model observations (Peng et al., 2021, 2022; Juravsky et al., 2022), we observe benefits when training group and global controllers that are conditioned on a longer history. We maintain a context window looking 40 frames back into the past, and generate inputs for actor and critic networks by selecting every eighth frame from the window, for a total of five frames of observations. This approach balances providing models with a longer history with restricting the total observation dimension, since excessively large observations can slow down and potentially destabilize training. More architectural details are given in Appendix C.
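The history subsampling can be sketched as a small buffer class. The class name is hypothetical, and two details are assumptions not fixed by the text: which five frames within the 40-frame window are selected (here the most recent frame and the frames 8, 16, 24, and 32 steps before it), and how the window is filled at the start of an episode (here by repeating the oldest available frame).

```python
from collections import deque

import numpy as np


class ObservationHistory:
    """Keeps the last `window` frames and exposes every `stride`-th one."""

    def __init__(self, window=40, stride=8):
        self.window = window
        self.stride = stride
        self.frames = deque(maxlen=window)

    def push(self, frame):
        self.frames.append(np.asarray(frame, dtype=float))

    def policy_input(self):
        """Concatenate the selected frames, newest first.

        Requires at least one pushed frame. Early in an episode the window
        is padded by repeating the oldest frame (an assumption).
        """
        frames = list(self.frames)
        frames = [frames[0]] * (self.window - len(frames)) + frames
        selected = frames[::-self.stride]   # offsets 0, 8, 16, 24, 32 back
        return np.concatenate(selected)
```

With a window of 40 and a stride of 8, the network input is five frames wide rather than forty, which keeps the observation dimension manageable while still covering the full history span.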

4. Results

We present examples of our global controller reproducing motions from its training data using text commands in Figure 7. We demonstrate that SuperPADL is able to generate an extremely diverse set of motions, ranging from basic locomotion skills and hand gestures to much more difficult martial arts and dancing behaviours. In contrast with kinematic motion diffusion models, where generating a single animation can take up to a minute (Tevet et al., 2023), we highlight that SuperPADL can generate motion in real time on a single consumer GPU, enabling interactive applications.

SuperPADL is also able to successfully transition between skills, with examples shown in Figure 8. These transition abilities were initially learned by each group controller using adversarial RL, and have been inherited by the global controller through distillation. Note that even though each group controller was only trained to transition between the 20 motions in its group, the global SuperPADL controller is able to transition between any two motions, regardless of their group assignments.

4.1. Training Tracking Experts

In Figure 4 we visualize the distribution of training times for the expert tracking policies detailed in Section 3.1. All policies were trained on individual NVIDIA A40 GPUs. We see that ending training early based on the most recent tracking error significantly reduces the total cost of compute required to train all experts. A majority of experts finish training in under an hour, and over 30% complete in less than 30 minutes. However, since policies that do not reach the target error threshold are trained until the epoch limit of 3000 epochs, we see that the 5% of rejected policies have an oversized impact on cumulative training cost. This highlights the importance of strong dataset filters that can identify physically-implausible motions before training begins.

4.2. Measuring Controller Quality with Thresholded Precision and Recall

Measuring tracking error is difficult for policies lacking a phase variable input that synchronizes them to a reference motion. In order to evaluate the quality of motions produced by non-tracking policies (i.e. group and global controllers), we introduce the metrics of thresholded precision and recall. These metrics are inspired by the thresholded coverage metric used in Juravsky et al. (2022), with our thresholded recall metric being almost identical to that work’s construction of thresholded coverage. Thresholded recall measures the fraction of a reference motion that is reproduced by a policy when conditioned to generate that motion. To calculate the recall of a policy $\pi$ on a motion sequence $\hat{{\mathbf{m}}}=(\hat{{\mathbf{s}}}_{0},\hat{{\mathbf{s}}}_{1},...,\hat{{\mathbf{s}}}_{n})$, we first roll out a (deterministic) trajectory $\tau=({\mathbf{s}}_{0},{\mathbf{s}}_{1},...,{\mathbf{s}}_{k})$ from $\pi$. We then consider all ten-frame-long sliding windows from $\hat{{\mathbf{m}}}$ and check whether any ten-frame window in $\tau$ is “sufficiently close”, as determined by some threshold ${\epsilon}$. Specifically, we define:

$$\text{Rec}(\tau,\hat{{\mathbf{m}}},{\epsilon})=\frac{1}{n-9}\sum_{i=0}^{n-10}\mathcal{I}\left(\left(\min_{j\in\{0,...,k-10\}}||\hat{{\mathbf{s}}}_{i:i+9}-{\mathbf{s}}_{j:j+9}||_{2}\right)\leq{\epsilon}\right) \tag{7}$$

where ${\mathbf{s}}_{x:y}$ denotes the concatenation of frames $({\mathbf{s}}_{x},{\mathbf{s}}_{x+1},...,{\mathbf{s}}_{y})$ , and $\mathcal{I}$ denotes an indicator variable. The key difference between this metric and thresholded coverage is that thresholded recall operates on windows of consecutive states, while thresholded coverage only considers individual frames. We choose to construct windows to better capture the temporal structure of the reference motion: for example, a hypothetical trajectory that perfectly imitated $\hat{{\mathbf{m}}}$ in reverse would always produce a perfect thresholded coverage score, but this is not true for thresholded recall.
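
The reversed-motion example can be checked numerically with a small NumPy sketch. The function below takes every full window of the arrays it is given (a boundary convention assumed here), and setting `length=1` is used as a stand-in for a frame-wise coverage score; the exact definition of thresholded coverage is in Juravsky et al. (2022) and may differ in detail. The 1-D ramp "motion" and the names `windows` and `thresholded_recall` are illustrative only.

```python
import numpy as np

def windows(frames: np.ndarray, length: int) -> np.ndarray:
    """Flatten every run of `length` consecutive frames into one row vector."""
    count = frames.shape[0] - length + 1
    if count <= 0:
        raise ValueError("sequence is shorter than one window")
    return np.stack([frames[i:i + length].ravel() for i in range(count)])

def thresholded_recall(rollout, reference, eps, length=10):
    """Fraction of reference windows within `eps` (L2) of some rollout window."""
    ref = windows(np.asarray(reference, dtype=float), length)
    pol = windows(np.asarray(rollout, dtype=float), length)
    # dists[i, j] = || reference window i - rollout window j ||_2
    dists = np.linalg.norm(ref[:, None, :] - pol[None, :, :], axis=-1)
    return float(np.mean(dists.min(axis=1) <= eps))

# Toy 1-D "motion": a ramp, and the same ramp played backwards.
reference = np.arange(30, dtype=float).reshape(-1, 1)
reversed_rollout = reference[::-1]

frame_wise = thresholded_recall(reversed_rollout, reference, eps=1e-6, length=1)
windowed = thresholded_recall(reversed_rollout, reference, eps=1e-6, length=10)
print(frame_wise, windowed)  # 1.0 0.0
```

Every individual frame of the ramp appears somewhere in the reversed rollout, so the frame-wise score is perfect, while no ten-frame ascending window has a descending counterpart.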

Complementing this thresholded recall metric is a thresholded precision metric, which considers all the windows in $\tau$ and measures whether any window in $\hat{{\mathbf{m}}}$ is sufficiently close:

$$\text{Prec}(\tau,\hat{{\mathbf{m}}},{\epsilon})=\frac{1}{k-9}\sum_{i=0}^{k-10}\mathcal{I}\left(\left(\min_{j\in\{0,...,n-10\}}||{\mathbf{s}}_{i:i+9}-\hat{{\mathbf{s}}}_{j:j+9}||_{2}\right)\leq{\epsilon}\right) \tag{8}$$

While thresholded recall measures the fraction of $\hat{{\mathbf{m}}}$ that the policy imitates, thresholded precision measures the fraction of $\tau$ that imitates a portion of $\hat{{\mathbf{m}}}$. For example, a trajectory that perfectly loops a subset of the reference motion would score very highly on precision, but low on recall. Conversely, a trajectory that perfectly imitated the entire reference clip, but then contained some bizarre additional motions, would have a very high recall score but a very low precision. In practice, we find that the two metrics are correlated. However, policies that mostly ignore their conditioning and focus on staying upright (such as the global controller baselines in Section 4.3) will often have a higher precision score than recall score.
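
The asymmetry between the two metrics can be illustrated with a toy example. In this sketch both metrics are the same window-matching routine with the roles of rollout and reference swapped; `match_fraction` and the 1-D ramp data are assumptions made for illustration, and a rollout that reproduces only the first half of the reference stands in for the "subset" case.

```python
import numpy as np

def match_fraction(query, target, eps, length=10):
    """Fraction of `length`-frame windows in `query` within eps (L2) of some window in `target`."""
    def windows(frames):
        count = len(frames) - length + 1
        return np.stack([frames[i:i + length].ravel() for i in range(count)])
    q = windows(np.asarray(query, dtype=float))
    t = windows(np.asarray(target, dtype=float))
    nearest = np.linalg.norm(q[:, None, :] - t[None, :, :], axis=-1).min(axis=1)
    return float(np.mean(nearest <= eps))

def precision(rollout, reference, eps):
    return match_fraction(rollout, reference, eps)   # windows of the rollout

def recall(rollout, reference, eps):
    return match_fraction(reference, rollout, eps)   # windows of the reference

reference = np.arange(40, dtype=float).reshape(-1, 1)   # 40-frame ramp
partial = reference[:20]                                # reproduces only a subset
extra = np.concatenate([reference, np.full((40, 1), 1000.0)])  # full clip, then unrelated frames

eps = 1e-6
print(precision(partial, reference, eps), recall(partial, reference, eps))
print(precision(extra, reference, eps), recall(extra, reference, eps))
```

The partial rollout has perfect precision but matches only 11 of the 31 reference windows, while the rollout with extra frames has perfect recall but only 31 of its 71 windows match the reference.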

To evaluate trained policies, we follow the procedure of Juravsky et al. (2022) and sweep over many values of ${\epsilon}$ when calculating thresholded precision and recall. We record trajectories and calculate metrics for all motions in the training dataset, and report the averaged results as a plot. Additionally, to summarize these plots with individual scalars, we report the area under each curve (AUC).
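
A minimal sketch of the summary step follows, assuming the per-motion scores have already been computed on a grid of thresholds. The threshold range, grid spacing, and integration rule are not specified in this section; the trapezoid rule and the toy numbers below are assumptions chosen for illustration.

```python
import numpy as np

def auc_over_thresholds(scores_by_motion: np.ndarray, thresholds: np.ndarray) -> float:
    """Average a metric over motions, then integrate the mean curve over the thresholds.

    scores_by_motion[m, t] is the precision (or recall) of motion m at thresholds[t].
    """
    mean_curve = scores_by_motion.mean(axis=0)
    widths = np.diff(thresholds)
    heights = 0.5 * (mean_curve[1:] + mean_curve[:-1])  # trapezoid rule
    return float(np.sum(widths * heights))

# Toy data: two motions evaluated at five thresholds (hypothetical values).
thresholds = np.linspace(0.0, 2.0, 5)
scores = np.array([
    [0.0, 0.5, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.5, 1.0, 1.0],
])
print(auc_over_thresholds(scores, thresholds))  # 1.25
```

Because each score lies in [0, 1], the AUC is bounded by the width of the threshold range, so AUC values are only comparable between methods evaluated on the same sweep.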

4.3. Evaluating Global Controllers

We use thresholded precision and recall metrics to compare our global SuperPADL controller against two baselines trained on the same dataset of 5587 motions. Our first baseline directly applies PADL on the full dataset. Additionally, we train a PADL+BC controller on the entire dataset, instead of on a group of 20 motions. We use the same language encoder architecture as SuperPADL for both baselines and focus on evaluating the motion quality of each method when training on thousands of motions. The policy network sizes are held constant across all three methods. For the PADL and PADL+BC runs, the critic and the discriminator are also appropriately scaled up in size.

We report our thresholded precision and recall metrics in Figure 5 and Table 1, observing that both baselines achieve lower precision and recall scores than SuperPADL. Qualitatively, these baseline networks are unable to do much more than stay upright and stumble around, appearing to respond very little to the user’s text command. The PADL+BC network will occasionally reproduce short snippets of simple motions such as jogging. We emphasize that these baselines attempt to apply adversarial reinforcement learning objectives at scale, while SuperPADL only trains small-scale policies with RL. Instead, SuperPADL relies exclusively on supervised learning (through DAGGER) to train the global controller (Ross et al., 2011).

We assess the ability of the SuperPADL global controller to transition between skills in Appendix A. The global controller can successfully transition (i.e. not fall) over 90% of the time, even when transitioning between two skills from different motion groups. Additionally, we evaluate SuperPADL’s ability to respond to language commands in Appendix B, showing that human raters are able to match animations to the appropriate caption a majority of the time.

Table 1. Measuring area-under-curve (AUC) motion quality metrics for different global controller objectives. Using adversarial RL on datasets containing thousands of motions is ineffective, leading to policies that are largely unresponsive to text commands.

Method	Precision AUC	Recall AUC
SuperPADL	1.18	1.11
PADL+BC	1.12	0.73
PADL	0.99	0.70

4.4. Evaluating Group Controllers

We also assess the motion quality of our PADL+BC group controllers when compared against a pure-PADL baseline. We randomly select four groups of 20 motions from our dataset and train a controller using both approaches on each group. Note that unlike the PADL models trained in Juravsky et al. (2022), we train pure-PADL group controllers without any language conditioning, instead using the same simple motion index embedding as our PADL+BC controllers. Additionally, we measure the training time required for each network when using a single A40 GPU.

We report our results in Figure 6 and Table 2. We observe that group controllers trained with PADL+BC are able to generate higher-quality motions than vanilla PADL controllers while simultaneously requiring much less GPU time. Qualitatively, we observe that PADL+BC models are less prone to imitating only a subsection of a reference motion than PADL policies. Additionally, we find that PADL+BC models seem to be more successful at looping their generations and avoiding getting stuck. Note that the GPU training time reported for PADL+BC models does not include the time required to train the 20 tracking policies that the group controller distills from. A proper accounting of end-to-end compute cost should include these prerequisite training steps as well. However, as shown in Figure 4, the mean tracking policy training time is under one hour. Even when the cost of training the tracking experts is included, the total cost of training one PADL+BC group controller is therefore lower than that of one pure-PADL group controller. Additionally, the training of 20 separate tracking policies can be more easily distributed across multiple GPUs than the training of a single group controller.
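
As a rough worked bound on this comparison, assume the 20 experts of a group are typical of the distribution in Figure 4, so that their mean training time $\bar{t}_{\text{expert}}$ is under one hour, and take the group controller times from Table 2. The symbols $C_{\text{PADL+BC}}$, $C_{\text{PADL}}$ and $\bar{t}_{\text{expert}}$ are introduced here only for this estimate:

$$C_{\text{PADL+BC}} \approx \underbrace{20\times\bar{t}_{\text{expert}}}_{<\,20\ \text{GPU-hours}} + \underbrace{12\ \text{GPU-hours}}_{\text{group controller}} < 32\ \text{GPU-hours} < 67\ \text{GPU-hours} = C_{\text{PADL}}$$

Under these assumptions the end-to-end cost of one PADL+BC group controller is less than half that of the pure-PADL baseline.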

Table 2. Measuring area-under-curve (AUC) motion quality metrics and policy training time for our work’s PADL+BC group controller and a pure-PADL group controller baseline. The supervision signal added to PADL+BC allows it to attain a higher motion quality while training in significantly less time than the baseline. Standard deviation is calculated across four trained policies, each trained on a distinct motion group.

Method	Precision AUC	Recall AUC	Training Time
PADL+BC	1.21 ± 0.03	1.21 ± 0.02	12h
PADL	1.02 ± 0.11	1.10 ± 0.05	67h

5. Discussion

In this work we presented SuperPADL, a framework for training physics-based text-conditioned animation models on large datasets. Our approach is predicated on the observations that kinematic motion models, using supervised learning objectives, are able to scale to datasets containing thousands of motions, while RL-based approaches can struggle beyond several hundred motions. In light of this, we employ a progressive distillation process, where we first train small expert policies using RL and then iteratively distill them into larger, more capable networks. Our final controller is able to reproduce skills from a dataset of over 5000 motions and naturally transition in response to changing user commands.

While SuperPADL is able to reproduce many motions with high quality, the network can still struggle with some highly dynamic motions, such as ballet dances or jumps. A limitation of our current approach is that any poorly-reproduced skills (or other flaws) in a group controller will cascade into the global controller during distillation. Additionally, the global controller can still fall over, particularly when asked to transition during a difficult motion. For example, the character will often lose its balance when it is asked to transition mid-kick when it is balancing on one leg with the other leg extended. Altering the timing of motion transitions to avoid these sensitive regions can make the transitions much more reliable.

We are excited about future work that continues to scale physics-based text-to-motion models to even larger datasets. SuperPADL only uses a fraction of the motions in AMASS and in particular does not train on very long motion capture sequences. While the PADL policies trained in Juravsky et al. (2022) were only conditioned on the current character state, SuperPADL is trained on a history of past states. Future architectures that are given an even wider context may be able to learn from longer reference motions. Additionally, we are interested in exploring alternative combinations of RL and supervised learning for physics-based animation. SuperPADL’s global controller is a relatively simple deterministic network, and a more sophisticated generative model setup might be better at modelling multi-modal motion distributions. Since the final distillation stage of SuperPADL is purely supervised, it should be possible to train the global controller as a diffusion model using a denoising objective on the target actions. This would give users access to many of the customization techniques that have been successful with text-to-image models, such as guidance and different noise schedulers (Ho and Salimans, 2022). Overall, we hope that our work contributes to the development of more capable physics-based animation models as well as more powerful, accessible animation tools.

References

Agrawal and van de Panne (2016) Shailen Agrawal and Michiel van de Panne. 2016. Task-based Locomotion. ACM Transactions on Graphics (Proc. SIGGRAPH 2016) 35, 4 (2016).
Ahuja and Morency (2019) C. Ahuja and L. Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. In 2019 International Conference on 3D Vision (3DV). IEEE Computer Society, Los Alamitos, CA, USA, 719–728. https://doi.org/10.1109/3DV.2019.00084
Athanasiou et al. (2022) Nikos Athanasiou, Mathis Petrovich, Michael J. Black, and Gül Varol. 2022. TEACH: Temporal Action Compositions for 3D Humans. In International Conference on 3D Vision (3DV).
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450 [stat.ML]
Bergamin et al. (2019) Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: Data-Driven Responsive Control of Physics-Based Characters. ACM Trans. Graph. 38, 6, Article 206 (Nov. 2019), 11 pages. https://doi.org/10.1145/3355089.3356536
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
Clegg et al. (2018) Alexander Clegg, Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. 2018. Learning to Dress: Synthesizing Human Dressing Motion via Deep Reinforcement Learning. ACM Trans. Graph. 37, 6, Article 179 (dec 2018), 10 pages. https://doi.org/10.1145/3272127.3275048
Coros et al. (2009) Stelian Coros, Philippe Beaudoin, and Michiel van de Panne. 2009. Robust Task-based Control Policies for Physics-based Characters. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 28, 5 (2009), Article 170.
da Silva et al. (2008) Marco da Silva, Yeuhi Abe, and Jovan Popović. 2008. Simulation of Human Motion Data using Short‐Horizon Model‐Predictive Control. Computer Graphics Forum 27 (2008).
Dabral et al. (2023) Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. 2023. MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis. In Computer Vision and Pattern Recognition (CVPR).
de Lasa et al. (2010) Martin de Lasa, Igor Mordatch, and Aaron Hertzmann. 2010. Feature-Based Locomotion Controllers. ACM Transactions on Graphics 29, 3 (2010).
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.48550/ARXIV.1810.04805
Guo et al. (2022) Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. 2022. Generating Diverse and Natural 3D Human Motions From Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5152–5161.
Ho and Salimans (2022) Jonathan Ho and Tim Salimans. 2022. Classifier-Free Diffusion Guidance. arXiv:2207.12598 [cs.LG]
Hodgins et al. (1995) Jessica K. Hodgins, Wayne L. Wooten, David C. Brogan, and James F. O’Brien. 1995. Animating Human Athletics. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’95). Association for Computing Machinery, New York, NY, USA, 71–78. https://doi.org/10.1145/218380.218414
Holden et al. (2017) Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-Functioned Neural Networks for Character Control. ACM Trans. Graph. 36, 4, Article 42 (jul 2017), 13 pages. https://doi.org/10.1145/3072959.3073663
Jiang et al. (2023) Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. 2023. MotionGPT: Human Motion as a Foreign Language. arXiv preprint arXiv:2306.14795 (2023).
Juravsky et al. (2022) Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. 2022. PADL: Language-Directed Physics-Based Character Control. In SIGGRAPH Asia 2022 Conference Papers (Daegu, Republic of Korea) (SA ’22). Association for Computing Machinery, New York, NY, USA, Article 19, 9 pages. https://doi.org/10.1145/3550469.3555391
Lee et al. (2021b) Kyungho Lee, Sehee Min, Sunmin Lee, and Jehee Lee. 2021b. Learning Time-Critical Responses for Interactive Character Control. ACM Trans. Graph. 40, 4, Article 147 (jul 2021), 11 pages. https://doi.org/10.1145/3450626.3459826
Lee et al. (2021a) Seyoung Lee, Sunmin Lee, Yongwoo Lee, and Jehee Lee. 2021a. Learning a family of motor skills from a single motion clip. ACM Trans. Graph. 40, 4, Article 93 (2021).
Lee et al. (2010a) Yoonsang Lee, Sungeun Kim, and Jehee Lee. 2010a. Data-Driven Biped Control. ACM Trans. Graph. 29, 4, Article 129 (July 2010), 8 pages. https://doi.org/10.1145/1778765.1781155
Lee et al. (2010b) Yongjoon Lee, Kevin Wampler, Gilbert Bernstein, Jovan Popović, and Zoran Popović. 2010b. Motion Fields for Interactive Character Locomotion. ACM Trans. Graph. 29, 6, Article 138 (dec 2010), 8 pages. https://doi.org/10.1145/1882261.1866160
Lin et al. (2018) Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, and Raymond J. Mooney. 2018. Generating Animated Videos of Human Activities from Natural Language Descriptions. In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2018. http://www.cs.utexas.edu/users/ai-labpub-view.php?PubID=127730
Ling et al. (2020) Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel van de Panne. 2020. Character Controllers Using Motion VAEs. ACM Trans. Graph. 39, 4 (2020).
Liu and Hodgins (2018) Libin Liu and Jessica Hodgins. 2018. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Trans. Graph. 37, 4, Article 142 (jul 2018), 14 pages. https://doi.org/10.1145/3197517.3201315
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/ARXIV.1907.11692
Luo et al. (2023) Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Kitani, and Weipeng Xu. 2023. Perpetual Humanoid Control for Real-time Simulated Avatars. In International Conference on Computer Vision (ICCV).
Mahmood et al. (2019) Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. 2019. AMASS: Archive of Motion Capture as Surface Shapes. arXiv:1904.03278 [cs.CV]
Makoviychuk et al. (2021) Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. 2021. Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. CoRR abs/2108.10470 (2021). arXiv:2108.10470 https://arxiv.org/abs/2108.10470
Merel et al. (2019) Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. 2019. Neural Probabilistic Motor Primitives for Humanoid Control. In International Conference on Learning Representations. https://openreview.net/forum?id=BJl6TjRcY7
Merel et al. (2020) Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. 2020. Catch & Carry: reusable neural controllers for vision-guided whole-body tasks. ACM Trans. Graph. 39, 4, Article 39 (aug 2020), 14 pages. https://doi.org/10.1145/3386569.3392474
Peng et al. (2018) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided Deep Reinforcement Learning of Physics-based Character Skills. ACM Trans. Graph. 37, 4, Article 143 (July 2018), 14 pages. https://doi.org/10.1145/3197517.3201311
Peng et al. (2017) Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Trans. Graph. 36, 4, Article 41 (July 2017), 13 pages. https://doi.org/10.1145/3072959.3073602
Peng et al. (2022) Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. 2022. ASE: Large-scale Reusable Adversarial Skill Embeddings for Physically Simulated Characters. ACM Trans. Graph. 41, 4, Article 94 (July 2022).
Peng et al. (2021) Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. 2021. AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. ACM Trans. Graph. 40, 4, Article 1 (July 2021), 15 pages. https://doi.org/10.1145/3450626.3459670
Petrovich et al. (2022) Mathis Petrovich, Michael J. Black, and Gül Varol. 2022. TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision (ECCV).
Plappert et al. (2017) Matthias Plappert, Christian Mandery, and Tamim Asfour. 2017. Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks. CoRR abs/1705.06400 (2017). arXiv:1705.06400 http://arxiv.org/abs/1705.06400
Poole et al. (2023) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=FjNys5c7VyY
Punnakkal et al. (2021) Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J. Black. 2021. BABEL: Bodies, Action and Behavior with English Labels. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 722–731.
Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://doi.org/10.48550/ARXIV.1910.10683
Ren et al. (2023) Jiawei Ren, Mingyuan Zhang, Cunjun Yu, Xiao Ma, Liang Pan, and Ziwei Liu. 2023. InsActor: Instruction-driven Physics-based Characters. NeurIPS (2023).
Ross et al. (2011) Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. arXiv:1011.0686 [cs.LG]
Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=08Yk-n5l2Al
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
Starke et al. (2019) Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural State Machine for Character-Scene Interactions. ACM Trans. Graph. 38, 6, Article 209 (nov 2019), 14 pages. https://doi.org/10.1145/3355089.3356505
Sun et al. (2023) Jingkai Sun, Qiang Zhang, Yiqun Duan, Xiaoyang Jiang, Chong Cheng, and Renjing Xu. 2023. Prompt, Plan, Perform: LLM-based Humanoid Control via Quantized Imitation Learning. arXiv:2309.11359 [cs.RO]
Tan et al. (2014) Jie Tan, Yuting Gu, C. Karen Liu, and Greg Turk. 2014. Learning Bicycle Stunts. ACM Trans. Graph. 33, 4, Article 50 (July 2014), 12 pages. https://doi.org/10.1145/2601097.2601121
Tevet et al. (2022) Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, and Daniel Cohen-Or. 2022. MotionCLIP: Exposing Human Motion Generation to CLIP Space. https://doi.org/10.48550/ARXIV.2203.08063
Tevet et al. (2023) Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. 2023. Human Motion Diffusion Model. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=SJ1kSyO2jwu
Treuille et al. (2007) Adrien Treuille, Yongjoon Lee, and Zoran Popović. 2007. Near-Optimal Character Animation with Continuous Control. In ACM SIGGRAPH 2007 Papers (San Diego, California) (SIGGRAPH ’07). Association for Computing Machinery, New York, NY, USA, 7–es. https://doi.org/10.1145/1275808.1276386
Wagener et al. (2023) Nolan Wagener, Andrey Kolobov, Felipe Vieira Frujeri, Ricky Loynd, Ching-An Cheng, and Matthew Hausknecht. 2023. MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control. arXiv:2208.07363 [cs.RO]
Wang et al. (2009) Jack M. Wang, David J. Fleet, and Aaron Hertzmann. 2009. Optimizing Walking Controllers. In ACM SIGGRAPH Asia 2009 Papers (Yokohama, Japan) (SIGGRAPH Asia ’09). Association for Computing Machinery, New York, NY, USA, Article 168, 8 pages. https://doi.org/10.1145/1661412.1618514
Wang et al. (2012) Jack M. Wang, Samuel R. Hamner, Scott L. Delp, and Vladlen Koltun. 2012. Optimizing Locomotion Controllers Using Biologically-Based Actuators and Objectives. ACM Trans. Graph. 31, 4, Article 25 (jul 2012), 11 pages. https://doi.org/10.1145/2185520.2185521
Wang et al. (2020) Tingwu Wang, Yunrong Guo, Maria Shugrina, and Sanja Fidler. 2020. UniCon: Universal Neural Controller For Physics-based Character Motion. arXiv:2011.15119 [cs.GR]
Winkler et al. (2022) Alexander Winkler, Jungdam Won, and Yuting Ye. 2022. QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars. In SIGGRAPH Asia 2022 Conference Papers (Daegu, Republic of Korea) (SA ’22). Association for Computing Machinery, New York, NY, USA, Article 2, 8 pages. https://doi.org/10.1145/3550469.3555411
Won et al. (2020) Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2020. A Scalable Approach to Control Diverse Behaviors for Physically Simulated Characters. ACM Trans. Graph. 39, 4, Article 33 (jul 2020), 12 pages. https://doi.org/10.1145/3386569.3392381
Won et al. (2021) Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2021. Control strategies for physically simulated characters performing two-player competitive sports. ACM Trans. Graph. 40, 4, Article 146 (jul 2021), 11 pages. https://doi.org/10.1145/3450626.3459761
Won et al. (2022) Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2022. Physics-based character controllers using conditional VAEs. ACM Trans. Graph. 41, 4, Article 96 (jul 2022), 12 pages. https://doi.org/10.1145/3528223.3530067
Xie et al. (2020) Zhaoming Xie, Hung Yu Ling, Nam Hee Kim, and Michiel van de Panne. 2020. ALLSTEPS: Curriculum-driven Learning of Stepping Stone Skills. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.
Xie et al. (2022) Zhaoming Xie, Sebastian Starke, Hung Yu Ling, and Michiel van de Panne. 2022. Learning Soccer Juggling Skills with Layer-wise Mixture-of-Experts. (2022).
Yao et al. (2022) Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. 2022. ControlVAE: Model-Based Learning of Generative Controllers for Physics-Based Characters. ACM Trans. Graph. 41, 6, Article 183 (nov 2022), 16 pages. https://doi.org/10.1145/3550454.3555434
Zhang et al. (2023b) Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. 2023b. T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang et al. (2023a) Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, and Wanli Ouyang. 2023a. MotionGPT: Finetuned LLMs are General-Purpose Motion Generators. arXiv preprint arXiv:2306.10900 (2023).
Zhang et al. (2020) Yunbo Zhang, Wenhao Yu, C. Karen Liu, Charlie Kemp, and Greg Turk. 2020. Learning to Manipulate Amorphous Materials. ACM Trans. Graph. 39, 6, Article 189 (nov 2020), 11 pages. https://doi.org/10.1145/3414685.3417868

Appendix A Evaluating Global Controller Transitions

We evaluate the ability of the SuperPADL global policy to transition between skills without falling over. We assess the policy by rolling out “transition trajectories” where the model is initially conditioned on one caption before the caption is changed to that of a different motion. We sample 4096 trajectories, each 10 seconds long with a transition point sampled uniformly between 3 and 7 seconds into the trajectory. In each trajectory, the character is initialized in a default standing position. We classify a transition as successful if the character does not fall over at any point in the trajectory.

We conduct two variations of this experiment. In the first version, we only sample pairs of captions that correspond to two motions from the same motion group (defined in Section 3.2). In the second version, the caption pair must come from motions in two different groups. By evaluating these cases separately, we can assess whether the global controller has learned general transition skills or whether it is limited by the group controllers it is distilled from (which are each only trained to transition between a small group of motions). Our results are summarized in Table 3. We see that the global controller succeeds at the transition over 90% of the time, regardless of whether caption pairs are sampled from the same group or not.

Table 3. Evaluating the fraction of successful skill transitions with SuperPADL (i.e. the fraction of transitions where the character does not fall over).

Transition Type	Successful (No Falls)	Fell Before Transition Point	Fell After Transition Point
Caption-Pair from Same Group	92.70%	3.00%	4.30%
Caption-Pair from Different Groups	90.92%	3.12%	5.96%
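
The sampling and classification protocol of this experiment can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: `groups` is a hypothetical mapping from a motion group to its captions, and `fall_time` is a hypothetical value reported by the simulator (`None` when the character never falls). Only the 3 to 7 second transition range and the three outcome categories come from the text.

```python
import random

def sample_caption_pair(groups, same_group, rng):
    """Pick (start_caption, end_caption) from one group or from two different groups."""
    if same_group:
        group = rng.choice([g for g in groups.values() if len(g) >= 2])
        first, second = rng.sample(group, 2)
    else:
        g1, g2 = rng.sample(list(groups), 2)
        first, second = rng.choice(groups[g1]), rng.choice(groups[g2])
    return first, second

def sample_transition_time(rng, low=3.0, high=7.0):
    """Transition point in seconds, sampled uniformly within the trajectory."""
    return rng.uniform(low, high)

def classify(fall_time, transition_time):
    """Map one rollout to a column of Table 3."""
    if fall_time is None:
        return "success"
    return "fell_before" if fall_time < transition_time else "fell_after"
```

Separating falls before and after the transition point helps distinguish failures caused by the first skill itself from failures caused by the caption change.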

Appendix B Evaluating Response to Language Commands

We assess the faithfulness of SuperPADL’s generated motions with respect to language commands using human evaluation (automated metrics such as FID and R-precision are difficult to apply to our method since they rely on motion encoders that use SMPL character models as input, which are distinct from our physics-based character). We generate evaluation questions by sampling 100 captions from our training dataset and recording a 24-second animation from the global controller for each caption. In each animation, we initialize the character in a default standing pose.

We present three human raters with a rendered video of each animation and ask them to select the most appropriate caption from four options, where one answer is the correct caption and the others are randomly sampled from other motions in the dataset. We also present raters with options for “Nothing applies” and “Multiple options apply”. The latter option is helpful to identify cases where the “incorrect” alternative captions come from similar motions, for example when both the true caption and an alternative option come from walking clips.

We summarize our results in Table 4, with raters discerning the correct caption from SuperPADL’s motion a majority of the time. When examining the evaluation results, we observed that many “Nothing applies” responses correspond to motions where the character struggles to leave the initial standing pose and instead remains mostly idle.

Table 4. Evaluating the ability of human raters to identify the caption that SuperPADL was conditioned on when given four possible options.

User Response	Average Selection Frequency
Correct Caption	57.33%
Incorrect Caption	19.33%
Multiple Applicable Options	5.00%
No Applicable Options	18.33%
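
The frequencies in Table 4 can be related to raw response counts with a short tally. The counts below are not reported in the paper: they are inferred under the assumption that all 3 × 100 = 300 rater responses are weighted equally, and they are simply one set of integers that reproduces the table's percentages.

```python
from collections import Counter

def selection_frequencies(responses):
    """Percentage of responses falling into each category."""
    counts = Counter(responses)
    total = len(responses)
    return {label: 100.0 * count / total for label, count in counts.items()}

# Inferred (not reported) counts for 3 raters x 100 animations.
responses = (
    ["correct"] * 172
    + ["incorrect"] * 58
    + ["multiple"] * 15
    + ["none"] * 55
)
for label, pct in selection_frequencies(responses).items():
    print(f"{label:10s} {pct:.2f}%")
```

With four captions per question, uniformly random guessing among the captions alone would select the correct one 25% of the time, which gives a reference point for the 57.33% figure.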

Appendix C Architecture and Training Details

C.1. Physics Environment

All of our experiments are run using the NVIDIA Isaac Gym simulator (Makoviychuk et al., 2021). Our character model and observation format matches that of Juravsky et al. (2022), except without a sword and shield. Our action space is 36-dimensional.

C.2. Expert Tracking Policies

Each expert tracking policy (and corresponding critic network) is an MLP with two hidden layers containing 1024 and 512 units, respectively. We use an ELU activation function. We encode the reference phase $\phi\in[0,1]$ using two scalars storing $[\sin(\phi),\cos(\phi)]$ . Each network is trained using a single A40 GPU.

C.3. Group Controllers

When training group controllers, the actors, critics, and discriminators are separate MLP networks with a ReLU activation function and three hidden layers containing [1024, 1024, 512] units. For the actor and critic networks, we provide five frames of character states as input by maintaining a buffer of the 40 most recent frames and sampling every eighth. For the discriminator, we follow the convention of Peng et al. (2022) and use the 10 most recent frames as observations. We encode the motion index using a 128-dimensional embedding table. Each network is trained using a single A40 GPU. We train PADL+BC policies for a total of 10000 epochs (including the 2000-epoch warmup period where we only apply the behaviour cloning loss). When training PADL group controller baselines, we train the policy for 54000 epochs, corresponding to approximately 7B frames.
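
The strided history input can be sketched with a fixed-size buffer. Two details below are assumptions, since the text does not specify them: the sampled frames are taken to end at the most recent frame (buffer indices 7, 15, 23, 31, 39), and a partially filled buffer is padded by repeating its oldest frame. The class name `StridedHistory` is hypothetical.

```python
from collections import deque
import numpy as np

class StridedHistory:
    """Keep the most recent `capacity` frames and expose every `stride`-th one."""

    def __init__(self, capacity: int = 40, stride: int = 8):
        self.capacity = capacity
        self.stride = stride
        self.frames = deque(maxlen=capacity)  # oldest frame is dropped automatically

    def push(self, frame) -> None:
        self.frames.append(np.asarray(frame, dtype=float))

    def observation(self) -> np.ndarray:
        if not self.frames:
            raise ValueError("no frames recorded yet")
        frames = list(self.frames)
        # Assumption: pad an under-full buffer by repeating the oldest frame.
        frames = [frames[0]] * (self.capacity - len(frames)) + frames
        # Assumption: the sampled frames end at the newest one (indices 7, 15, ..., 39).
        picked = frames[self.stride - 1 :: self.stride]
        return np.concatenate(picked)

history = StridedHistory()
for t in range(50):
    history.push([float(t)])
print(history.observation())  # [17. 25. 33. 41. 49.]
```

With a capacity of 40 and a stride of 8 this yields five frames, spanning the buffer while keeping the policy input five times the size of a single frame rather than forty.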

C.4. Global Controllers

All networks used when training global controllers are MLPs with hidden layers containing [3072, 3072, 3072, 2048] units. We provide the policies (and, when applicable, critic networks) with five frames of history using the same method as the group controllers. We train controllers and baselines using eight A40 GPUs.

We train the SuperPADL global controller for 7900 epochs, corresponding to approximately 380K optimization steps, 6B frames of RL training (i.e. 6B online collected samples), and 12 hours of training. As with the PADL+BC group controllers, we begin with a 2000-epoch BC-only warmup period. We use an ELU activation function and follow every activation layer with a LayerNorm (Ba et al., 2016).

For the PADL baseline, we train the model for 14600 epochs, corresponding to approximately 700K optimization steps, 15B frames, and 89 hours of training. For the PADL+BC baseline, we train the model for 14000 epochs, corresponding to approximately 670K optimization steps, 15B frames, and 98 hours of training. We do not use an initial BC-only warmup period (we do not observe it to have a significant impact). For both baselines, we use ReLU activations and no LayerNorm.
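
To put the three global-controller runs on a common footing, the approximate totals quoted in this subsection can be converted to per-epoch quantities. The sketch below only divides the reported figures; since those figures are themselves rounded, the derived numbers are rough, and the dictionary layout is an illustrative choice.

```python
# (epochs, optimization steps, frames, wall-clock hours), approximate values from the text
runs = {
    "SuperPADL": (7900, 380e3, 6e9, 12),
    "PADL": (14600, 700e3, 15e9, 89),
    "PADL+BC": (14000, 670e3, 15e9, 98),
}

def per_epoch(epochs, steps, frames, hours):
    """Derive per-epoch averages from run totals."""
    return {
        "steps_per_epoch": steps / epochs,
        "frames_per_epoch": frames / epochs,
        "minutes_per_epoch": 60.0 * hours / epochs,
    }

summary = {name: per_epoch(*totals) for name, totals in runs.items()}
for name, stats in summary.items():
    print(
        f"{name:10s} {stats['steps_per_epoch']:.1f} steps/epoch, "
        f"{stats['frames_per_epoch'] / 1e6:.2f}M frames/epoch, "
        f"{stats['minutes_per_epoch']:.2f} min/epoch"
    )
```

All three runs come out at roughly 48 optimization steps per epoch, while the baselines use more frames per epoch and several times more wall-clock time per epoch than SuperPADL.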