License: arXiv.org perpetual non-exclusive license
arXiv:2312.07538v1 [cs.GR] 12 Dec 2023

Anatomically Constrained Implicit Face Models

Prashanth Chandran
DisneyResearch|Studios
[email protected]
Gaspard Zoss
DisneyResearch|Studios
[email protected]
Abstract

Coordinate-based implicit neural representations have gained rapid popularity in recent years as they have been successfully used in image, geometry and scene modeling tasks. In this work, we present a novel use case for such implicit representations in the context of learning anatomically constrained face models. Actor-specific anatomically constrained face models are the state of the art in both facial performance capture and performance retargeting. Despite their practical success, these anatomical models are slow to evaluate and often require extensive data capture to be built. We propose the anatomical implicit face model: an ensemble of implicit neural networks that jointly learn to model the facial anatomy and the skin surface with high fidelity, and can readily be used as a drop-in replacement for conventional blendshape models. Given an arbitrary set of skin surface meshes of an actor and only a neutral shape with estimated skull and jaw bones, our method can recover a dense anatomical substructure which constrains every point on the facial surface. We demonstrate the usefulness of our approach in several tasks, including shape fitting, shape editing, and performance retargeting.

1 Introduction

Deformable face models are an important tool in the arsenal of visual effects artists dealing with facial animation. As they are ubiquitously used both in high-end production workflows and lightweight consumer applications, building expressive face models for various applications continues to remain an active area of research [17]. Face models today can range from simple linear global shape models [4, 27, 29] to highly complex local models that incorporate the underlying facial anatomy through physical simulation [48, 15, 44] or through anatomical constraints [47].

In this work, we concern ourselves primarily with the high-quality facial animation workflow where actor specific linear blendshape models [27] continue to remain the most commonly used tool for creating facial animations [33, 47, 10]. We propose a new class of actor specific shape models named the Anatomical Implicit face Model (AIM) which provides several unique advantages over the existing actor specific face models, and can be used as a drop-in replacement for traditional blendshape models.

An actor specific blendshape model is a collection of 3D shapes of the given actor performing a number of facial expressions, usually created by face scanning [2] or by an artist. While the user-friendliness of such actor specific blendshape models contributes to their wide adoption, it is a well known limitation that such models often require hundreds of shapes to accurately model complex facial deformation [27]. To address these shortcomings, local blendshape models [42, 47, 10] were proposed. By splitting the face into regions, and allowing the individual regions to deform independently, local shape models are able to capture complex deformations with a limited number of shapes.

While local models address the lack of expressivity in global shape models, state-of-the-art methods in facial performance capture [47] and retargeting [10] often incorporate anatomical constraints on the facial surface to plausibly restrict the range of the skin deformations. The anatomical constraints employed by these models [47, 10] provide a few hidden advantages that end up contributing towards their practical success. For example, in the context of facial performance capture, Wu et al. [47] demonstrated that including anatomical constraints derived from the relationship between the facial skin and underlying bones (skull and mandible) helps to separate the rigid and non-rigid components of facial deformation, leading to better face performance capture. In the context of facial performance retargeting, Chandran et al. [10] made use of such an anatomically constrained local face model to restrict a retargeted shape to lie within the space of anatomically plausible shapes of the target actor.

Despite their practical success, anatomical constraints are often formulated in practice as regularization terms that have to be satisfied as part of complex optimization problems involving several objectives. As a result, fitting these anatomical face models to a target scan or an image for instance, is a computationally intensive procedure taking several minutes per frame on a CPU, or requires hand crafted GPU solvers [20]. Furthermore anatomy constraints are enforced only in sparse regions of the face, whereas in reality the facial skin surface is more densely constrained by the underlying anatomy, and simulating this dense interaction between the anatomy and facial skin through physical simulation can be even more computationally intensive [39, 48].

In this paper, we propose the Anatomical Implicit face Model: a framework that allows for a holistic representation of both the facial anatomy and the skin surface using simple implicit neural networks and facilitates the learning of a continuous anatomical structure that densely constrains the skin surface. Our model formulation, inspired by the anatomical local model (ALM) of Wu et al. [47], can further disentangle deformation arising from rigid bone motion (jaw motion) and non-rigid deformations created by muscle activations. Our model also addresses the computational bottleneck of the ALM by explicitly deriving the skin surface from the anatomy, instead of formulating it as a constrained optimization problem. By ensuring that a point on the skin surface is always reconstructed through the underlying anatomy, our method provides several unique features in comparison to existing implicit face models, such as anatomy-based face manipulation (see Section 5). Before describing the details of our anatomical formulation in Section 3, we discuss related work in Section 2.

2 Related Work

3D Morphable Models

Facial models used in animation make up an extremely well-studied body of work, with the earliest works dating back to the late 1970s [18]. We therefore refer to the excellent survey of Egger et al. [17] for an in-depth review of the state-of-the-art methods, and provide only a concise summary in this section. Facial blendshapes [18, 27] have been conventionally used as a standard tool by artists to navigate the geometric space of human faces. The seminal 3D linear morphable model proposed by Blanz and Vetter [4] used principal component analysis to describe the variation in facial geometry and texture, which was later extended to multilinear models, jointly modeling identity and expression, by Vlasic et al. [43] and later by Cao et al. [7]. Today a very commonly used morphable face model is the FLAME model [29], which incorporates identity, expression and corrective blendshapes in addition to modeling bone motion with linear blend skinning. Due to its flexible nature, the FLAME model is widely used by face reconstruction algorithms today [19]. Finally, Chai et al. [8] recently created the HIFI3D++ morphable model, which is built from a union of scans from several previously proposed models.

In the past few years, numerous face models leveraging the power of deep neural networks to model the nonlinear deformation of the human face have also been proposed. While the initial work in this area by Ranjan et al. [38] focused on the use of specialized graph convolutional networks to operate on shapes, several later approaches proposed further modifications to the network architecture to improve the accuracy in shape representation [14, 55, 5, 22]. To make these deep morphable models intuitive to use, Chandran et al. [9] subsequently proposed the Semantic Deep Face Model which treats a collection of neural networks like a multilinear model to achieve identity-expression disentanglement. Extensions of such a semantically controllable model to deal with topology changes [12] and temporal sequences of geometry [11] have also been proposed. Deep neural models that jointly model the facial geometry and appearance with semantic controls have also been proposed [28].

Implicit Face Models

Owing to the massive success of coordinate-based neural networks in representing images [40, 30], 3D shapes [35] and arbitrary scenes [31], today's research on parametric face models primarily focuses on implicit representations. Yenamandra et al. [49] proposed i3DMM as an initial exploration of using coordinate-based networks for modeling full head geometries. This was followed by IMFace [51], which disentangled facial geometry into separate identity and expression embeddings with the help of individual deformation fields. More recently, Neural Parametric Head Models (NPHM) [21] proposed a method which improves the fidelity of neural implicit representations by jointly training an ensemble of local neural fields centered around anchor points. Implicit neural representations have also successfully been employed in learning an animatable avatar of a human face from only monocular video, as demonstrated by IMAvatar [52] and Point Avatar [53]. Wang et al. [45] also proposed MoRF, which is a Neural Radiance Field [31] conditioned on an identity code allowing for photorealistic free viewpoint rendering of the full head in a fixed expression. Recently Buhler et al. [6] also explored how such multi-identity radiance fields can be fit to sparse images to recover a volumetric head model. Finally, coordinate-based neural networks have also been successfully employed in creating animatable human body models [16, 34, 3, 23].

Anatomically Constrained Face Models

The anatomical local model proposed in the context of monocular facial performance capture by Wu et al. [47], first introduced the coupling of the anatomical bone structure to the skin surface and modeled the effect of skin patches sliding over the bone through soft anatomical constraints. This formulation was later adapted by Chandran et al. [10] for facial performance retargeting. Qiu et al. proposed SCULPTOR [37], a multi-identity joint morphable model of facial anatomy and skin learned from a database of computed tomography (CT) scans. Recently Choi et al. proposed Animatomy [15], a muscle fiber based anatomical basis for animator friendly face modeling applications. Lastly we recognize several physically based face models [48, 44, 41, 39] which inherently have the ability to model anatomy constraints through simulation.

We draw inspiration from the three classes of facial morphable models discussed above and propose the Anatomical Implicit face Model: a blendshape-based, implicit, anatomically constrained face model targeted towards high-quality actor-specific face modeling. Our method can be seen as a general extension of local blendshape models [10] to a continuously evaluable implicit function, and represents a set of actor blendshapes through a novel anatomical formulation. Unlike traditional patch-based models, our framework allows us to approximate complex shapes without requiring the user to specify patch layouts and other hyper-parameters. Our solution is based on simple coordinate-based MLPs enabling efficient training and inference, and provides computational benefits over previous anatomically formulated face models [47]. Finally, to the best of our knowledge, our method is the first to explore anatomical constraints inside an implicit facial blendshape model.

Figure 1: Our approach consists of a model learning stage (Section 4.1) and a model fitting stage (Section 4.2). In the model learning stage, a set of an actor's blendshapes is memorized by the ensemble of MLPs that makes up our Anatomical Implicit face Model (AIM). In the model fitting stage, the memorized model can be used as a powerful shape prior to fit the actor model to target shapes.

3 Anatomical Model Formulation

The core idea of our approach is to formulate a learning scheme for an implicit neural representation that can reproduce an actor blendshape model while automatically learning the underlying facial anatomy and constraining the skin surface to this learned anatomy. Crucial to our learning scheme is our anatomically constrained face model that geometrically couples the underlying facial anatomy to the enclosing skin surface which we describe next.

Figure 2: We show the breakdown of how we anatomically build up the facial skin surface. Starting from a learned anatomy surface (left) and learned anatomic properties such as the soft tissue thickness and anatomic surface normals, we reconstruct the neutral skin geometry. The neutral anatomy is skinned, and non-rigidly deformed with residual displacements to result in the final shape.

We assume that we are given a set of $N$ 3D scans $(S_0, S_1, S_2, \ldots, S_{N-1})$ of an actor represented as meshes. Without loss of generality, let $S_0$ be the shape with a neutral expression (or the rest pose). Each shape $S_i$ consists of $V$ vertices, and all shapes share the same vertex connectivity. For simplicity we exclude the index of the vertex in a shape in our notation and present our formulation as operating on surface points $\mathbf{s} \in \mathbf{R}^3$. Let $\mathbf{s}_0 \in \mathbf{R}^3$ and $\mathbf{s}_i \in \mathbf{R}^3$ be corresponding points on the skin surface for the neutral expression and expression $i$ respectively. In most previous methods for learning neural face models, a skin surface point $\mathbf{s}$ is learned as a displacement from a base face surface [9, 12, 21] or simply as points lying in an arbitrary 3D space [51, 52, 45]. Contrary to such approaches, we propose to learn the skin surface $\mathbf{s}$ using implicit neural representations that arrive at the facial skin surface through a formulation that combines anatomic constraints, linear blend skinning (LBS), and expression blendshapes into a single framework.

For our model formulation, we take inspiration from the anatomic constraints first proposed for non-neural face models [1, 47], particularly that of Wu et al. [47]. They establish a link between the skin surface and the anatomic bones by modeling the thickness $d_i \in \mathbf{R}$ of the soft tissue between a bone point $\mathbf{b}_i \in \mathbf{R}^3$ and the skin surface $\mathbf{s}_i$. These constraints are defined in sparse regions of the face where a skin point can be trusted to have bone underneath. We draw inspiration from their simple formulation and make some important deviations that enable us to jointly learn both the surface of the underlying skin anatomy and the enclosing skin surface for every point on the skin through end-to-end learning. Specifically, we arrive at a point on the skin surface as follows

$$\mathbf{s}_0 = \mathbf{b}_0 + d_0\,\mathbf{n}_0 \tag{1}$$

where $\mathbf{s}_0$ is the position of a surface point corresponding to $\mathbf{s}_i$ but on the neutral shape $S_0$, and $\mathbf{b}_0$, $d_0$, and $\mathbf{n}_0$ are the bone point, soft tissue thickness and the bone normal at $\mathbf{s}_0$. While Eq. 1 allows us to reconstruct points on the neutral face geometry, to adequately represent skin surfaces under arbitrary facial expressions, we need to account for surface deformation arising from the rigid motion of underlying facial bones (skull and mandible), and the non-rigid skin motion arising from muscle activations, skin sliding, and self collisions. To accommodate these additional degrees of freedom in skin deformation, we incorporate standard linear blend skinning, and expression blendshapes similar to the FLAME model [29]. Therefore given an anatomically reconstructed point on the neutral skin surface $\mathbf{s}_0$, we can now compute the position of the same point in an arbitrary expression $\mathbf{s}_i$ as

$$\mathbf{s}_i = \text{LBS}(\mathbf{s}_0, T_b, k) + \mathbf{e}_i \tag{2}$$

where LBS refers to the standard linear blend skinning operator that rigidly transforms the anatomically reconstructed neutral surface point $\mathbf{s}_0$ with a transformation $T_b$ and a skinning weight $k$, and $\mathbf{e}_i \in \mathbf{R}^3$ is the corrective displacement that is added on top of the skinned result to account for deformations that cannot be explained by skinning alone. A visual overview of our approach to anatomically build up the facial skin surface is shown in Fig. 2.
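Equations 1 and 2 can be traced numerically for a single point. The following is a minimal sketch rather than the authors' implementation: it assumes a single moving bone (the jaw) against a static skull, so that LBS reduces to blending the untransformed point with its rigidly transformed copy using the weight $k$. All helper names and sample values are hypothetical.

```python
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]


def add(a: Vec3, b: Vec3) -> Vec3:
    return [x + y for x, y in zip(a, b)]


def scale(a: Vec3, s: float) -> Vec3:
    return [x * s for x in a]


def mat_vec(R: Mat3, v: Vec3) -> Vec3:
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]


def neutral_skin_point(b0: Vec3, d0: float, n0: Vec3) -> Vec3:
    """Eq. 1: offset the anatomy point along its normal by the tissue thickness."""
    return add(b0, scale(n0, d0))


def lbs_single_bone(s0: Vec3, R: Mat3, t: Vec3, k: float) -> Vec3:
    """Assumed single-bone LBS: blend the static point with its jaw-transformed copy."""
    moved = add(mat_vec(R, s0), t)
    return add(scale(s0, 1.0 - k), scale(moved, k))


def expression_skin_point(b0: Vec3, d0: float, n0: Vec3,
                          R: Mat3, t: Vec3, k: float, e_i: Vec3) -> Vec3:
    """Eq. 2: skin the neutral point, then add the corrective displacement."""
    s0 = neutral_skin_point(b0, d0, n0)
    return add(lbs_single_bone(s0, R, t, k), e_i)


# Hypothetical sample values: a pure jaw translation with half skinning weight.
IDENTITY: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s_i = expression_skin_point(
    b0=[0.0, 0.0, 1.0], d0=0.5, n0=[0.0, 0.0, 1.0],
    R=IDENTITY, t=[0.0, -0.2, 0.0], k=0.5, e_i=[0.01, 0.0, 0.0],
)
```

With $k = 0$ the point stays attached to the static skull, and with $k = 1$ it follows the jaw rigidly; the displacement $\mathbf{e}_i$ is applied after skinning in both cases.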

At this point we have established how to arrive at points on the skin surface $\mathbf{s}_i$ for a shape in an arbitrary facial expression $S_i$ by starting from the underlying anatomy $\mathbf{b}_i$. It is important to note that the anatomic constraints as defined by Wu et al. [47] can only be computed on regions with an underlying bone, and thus, regions like the cheeks are not anatomically constrained in their approach. An essential feature of our approach that distinguishes it from all previous works is that we enforce anatomic constraints for every point on the skin surface, even in regions where there is no underlying biological bone structure. For this purpose we redefine the anatomy in our work as a rigidly deforming region underneath the skin surface that is not restricted to only the manifold of the skull and mandible bones. Since this structure does not exist in reality and is, therefore, not available for supervised learning, we formulate a learning framework where such a rigidly deforming surface can be learnt only from the sparse set of anatomic constraints that can be computed between the skin and the underlying bones. As we will see in Section 5, learning this anatomic surface from data leads to several interesting applications in shape manipulation and performance retargeting that were previously challenging to obtain without expensive physical simulation [48] or extensive volumetric data capture [37].

4 Anatomical Implicit Face Model

At a high level, our method comprises two stages: first, a model learning stage (Section 4.1) and second, a model fitting stage (Section 4.2). In the model learning stage, we bake a collection of expression blendshapes from an actor into an implicit neural network that uses the anatomical model formulation described in Section 3. Our model fitting stage uses this learned Anatomical Implicit face Model (AIM) and optimizes for coefficients that deform the model to match test time constraints like 3D shapes, 2D landmarks and so on. The overview of our approach is shown in Fig. 1.

Figure 3: a) We assume we are given the neutral geometry of an actor along with a rough estimate of the skull and jaw bone [56]. b) We additionally use a collection of $N$ 3D shapes of the actor performing expressions. Unlike Wu et al. [47], we do not require the tracked anatomy (skull [1], jaw [57]) for the expression shapes.

4.1 Model Learning

To learn our anatomical implicit face model, we assume we are given a template shape $C$ and a registered set of $N$ shapes $(S_0, S_1, S_2, \ldots, S_{N-1})$ of a single actor in the same topology as the canonical shape. Additionally we fit a template skull and jaw only to the neutral shape using the method of Zoss et al. [56]. The template shape $C$ can either be the neutral shape of the actor or a generic face shape, and the number of shapes provided can be arbitrary. We use a collection of 20 shapes in our work. A visual summary of our training data is shown in Fig. 3. Our objective in the learning stage is to use a coordinate-based neural network to memorize the given shapes through the anatomical formulation in Section 3. Given the high representation power of periodic implicit neural networks [40], we use the SIREN coordinate network, an MLP with sinusoidal activation functions, as our base architecture. An ablation study on alternate network choices is provided in Section 5.5.

Given a point $\mathbf{c} \in \mathbf{R}^3$ on the template shape $C$, we use three independent MLPs denoted by $\mathbf{B}$, $\mathbf{D}$, and $\mathbf{N}$ to predict the anatomy point $\widetilde{\mathbf{b}}_0 \in \mathbf{R}^3$, the soft tissue thickness $\widetilde{d}_0 \in \mathbf{R}$, and the anatomy normal $\widetilde{\mathbf{n}}_0 \in \mathbf{R}^3$. These predicted anatomic properties are then used to reconstruct the position of a point on the neutral skin surface $\widetilde{\mathbf{s}}_0$ as

$$\widetilde{\mathbf{b}}_0 = \mathbf{B}(\mathbf{c}) \tag{3}$$
$$\widetilde{d}_0 = \mathbf{D}(\mathbf{c}) \tag{4}$$
$$\widetilde{\mathbf{n}}_0 = \mathbf{N}(\mathbf{c}) \tag{5}$$
$$\widetilde{\mathbf{s}}_0 = \widetilde{\mathbf{b}}_0 + \widetilde{d}_0\,\widetilde{\mathbf{n}}_0. \tag{6}$$
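To make Eqs. 3 to 6 concrete, the sketch below evaluates three small sine-activated MLPs at a template point and combines their outputs. It is an illustration under stated assumptions, not the trained model: the layer sizes, the frequency scale `omega0`, the simplified initialization (loosely following the SIREN scheme), and the class name `SineMLP` are all hypothetical, and the weights are random rather than learned.

```python
import math
import random
from typing import List, Tuple


class SineMLP:
    """Tiny one-hidden-layer MLP with sinusoidal activations (SIREN-style sketch)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int,
                 seed: int, omega0: float = 30.0) -> None:
        rng = random.Random(seed)
        self.omega0 = omega0
        # First layer: uniform in [-1/in_dim, 1/in_dim].
        self.w1 = [[rng.uniform(-1.0 / in_dim, 1.0 / in_dim) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        # Output layer: bound scaled by the hidden width and omega0.
        bound = math.sqrt(6.0 / hidden_dim) / omega0
        self.w2 = [[rng.uniform(-bound, bound) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, x: List[float]) -> List[float]:
        hidden = [math.sin(self.omega0 * (sum(w * xi for w, xi in zip(row, x)) + b))
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


# Three independent networks, mirroring B, D and N in the text.
B = SineMLP(3, 16, 3, seed=0)   # anatomy point
D = SineMLP(3, 16, 1, seed=1)   # soft tissue thickness
N = SineMLP(3, 16, 3, seed=2)   # anatomy normal


def neutral_skin_from_template(c: List[float]) -> Tuple[List[float], float, List[float], List[float]]:
    """Eqs. 3-6: predict anatomy properties at c and rebuild the neutral skin point."""
    b0 = B(c)
    d0 = D(c)[0]
    n0 = N(c)
    s0 = [b + d0 * n for b, n in zip(b0, n0)]
    return b0, d0, n0, s0


b0, d0, n0, s0 = neutral_skin_from_template([0.1, -0.2, 0.3])
```

Because the skin point is computed from the three predictions rather than predicted directly, any supervision applied to the skin position flows back into the anatomy point, thickness and normal networks together.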

As discussed in Section 3, to further account for the rigid and non-rigid deformations of the skin surface, the anatomically constructed neutral skin point $\widetilde{\mathbf{s}}_0$ has to be skinned and further displaced with residual expression deformations. We therefore employ two additional MLPs $\mathbf{K}$ and $\mathbf{E}$ that predict the skinning weight $\widetilde{k} \in \mathbf{R}$ and the corrective displacement basis $\mathcal{B}_e \in \mathbf{R}^{(N-1) \times 3}$ respectively. Note here that, as an implementation detail, we predict the expression displacements for all $N-1$ blendshapes (excluding the neutral) at once from $\mathbf{E}$. The corrective expression displacement $\widetilde{\mathbf{e}}_i \in \mathbf{R}^3$ for shape $i$ can be extracted from this output by indexing $\mathcal{B}_e$ appropriately.

$$\widetilde{k} = \mathbf{K}(\mathbf{c}) \tag{7}$$
$$\mathcal{B}_e = \mathbf{E}(\mathbf{c}) \tag{8}$$
$$\widetilde{\mathbf{e}}_i = \mathcal{B}_e[i] \tag{9}$$
$$\widetilde{\mathbf{s}}_i = \text{LBS}\left(\widetilde{\mathbf{s}}_0, \widetilde{T_b}, \widetilde{k}\right) + \widetilde{\mathbf{e}}_i \tag{10}$$
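The indexing in Eqs. 8 and 9 can be sketched as follows. This assumes, for illustration only, that the network emits a flat vector of length $3(N-1)$ per query point and that expression $i \in \{1, \ldots, N-1\}$ maps to row $i-1$ of the basis; neither the memory layout nor the index convention is specified in the text, and the function names are hypothetical.

```python
from typing import List


def split_expression_basis(flat: List[float], num_shapes: int) -> List[List[float]]:
    """Reshape a flat output of length 3*(N-1) into N-1 displacement rows."""
    rows = num_shapes - 1  # the neutral shape S_0 has no corrective displacement
    if len(flat) != 3 * rows:
        raise ValueError("expected %d values, got %d" % (3 * rows, len(flat)))
    return [flat[3 * r: 3 * r + 3] for r in range(rows)]


def expression_displacement(basis: List[List[float]], i: int) -> List[float]:
    """Return the displacement for expression i (assumed 1-based, neutral excluded)."""
    if not 1 <= i <= len(basis):
        raise IndexError("expression index out of range")
    return basis[i - 1]


# Hypothetical network output for N = 4 shapes (3 expressions, 9 values).
flat_output = [0.1, 0.0, 0.0,
               0.0, 0.2, 0.0,
               0.0, 0.0, 0.3]
basis = split_expression_basis(flat_output, num_shapes=4)
e_2 = expression_displacement(basis, 2)
```

Predicting the whole basis in one forward pass means a single query per point yields the displacements for every training expression.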

Here $\widetilde{T_b} \in \mathbf{R}^9$ is a 6-DOF jaw bone transformation optimized along with the training of the MLPs to account for rigid motion of the mandible. We parameterize the jaw bone rotation in $\widetilde{T_b}$ following the continuous 6D representation [54].
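The continuous 6D rotation representation maps six unconstrained numbers to a rotation matrix by Gram-Schmidt orthogonalization of two 3D vectors. The sketch below shows that mapping and applies it to a point. The split of the nine parameters into six rotation values followed by three translation values is an assumption made for illustration (it is one layout consistent with $\mathbf{R}^9$ and 6 degrees of freedom), and the function names are hypothetical.

```python
import math
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Vec3, b: Vec3) -> Vec3:
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def rotation_from_6d(r6: List[float]) -> Mat3:
    """Gram-Schmidt on two 3-vectors; the columns of the result are b1, b2, b3."""
    a1, a2 = r6[0:3], r6[3:6]
    b1 = normalize(a1)
    proj = dot(b1, a2)
    b2 = normalize([a2[j] - proj * b1[j] for j in range(3)])
    b3 = cross(b1, b2)
    return [[b1[r], b2[r], b3[r]] for r in range(3)]


def apply_jaw_transform(params9: List[float], p: Vec3) -> Vec3:
    """Assumed layout: params9[0:6] is the 6D rotation, params9[6:9] the translation."""
    R = rotation_from_6d(params9[0:6])
    t = params9[6:9]
    return [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]


# A 90 degree rotation about the z axis followed by a small translation.
moved = apply_jaw_transform([0.0, 1.0, 0.0, -1.0, 0.0, 0.0, 0.1, 0.2, 0.3],
                            [1.0, 0.0, 0.0])
```

Since every non-degenerate pair of input vectors yields a valid rotation, the six values can be optimized with plain gradient descent without any re-projection step.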

4.1.1 Training Objectives

We next describe the training objectives to learn actor expression blendshapes along with the underlying anatomy structure for each skin surface point.
Skin Position Loss. The skin position loss penalizes the difference between the estimated skin point $\widetilde{\mathbf{s}}_i$ and the ground truth skin point $\mathbf{s}_i$.

$$\mathbf{L}_{\text{S}} = \lambda_{\text{S}} \, \|\widetilde{\mathbf{s}}_i - \mathbf{s}_i\|_2^2 \tag{11}$$

We set $\lambda_{\text{S}} = 1.0$ for all our experiments.
Anatomy Regularizer. Since we can roughly estimate the skull and jaw geometry on the neutral shape using the method of Zoss et al. [56], we compute sparse anatomic constraints [47] and loosely regularize the learned anatomic properties to stay close to these estimates only in regions where the constraints can be accurately computed (i.e. skin regions with an underlying bone).

$$\mathbf{L}_{\text{A}} = \lambda_b \, \|\widetilde{\mathbf{b}}_0 - \mathbf{b}_0\|_2^2 + \lambda_d \, \|\widetilde{d}_0 - d_0\|_2^2 + \lambda_n \, \|\widetilde{\mathbf{n}}_0 - \mathbf{n}_0\|_2^2 \tag{12}$$

We set $\lambda_b = \lambda_d = \lambda_n = 1.0$ for all our experiments, and observe that this constraint only regularizes 5-10% of all the vertices generated by the model on average (see Supplemental).
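Because the anatomy regularizer only applies where a bone estimate exists, it can be pictured as a masked loss. The sketch below is one possible reading, not the authors' code: the per-point tuple layout `(b, d, n)`, the boolean mask `has_bone`, and the averaging over constrained points are all assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple

Anatomy = Tuple[List[float], float, List[float]]  # (bone point, thickness, normal)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anatomy_regularizer(pred: List[Anatomy], target: List[Anatomy],
                        has_bone: List[bool],
                        lam_b: float = 1.0, lam_d: float = 1.0,
                        lam_n: float = 1.0) -> float:
    """Eq. 12 evaluated only on points flagged as having bone underneath."""
    total = 0.0
    count = 0
    for (pb, pd, pn), (tb, td, tn), constrained in zip(pred, target, has_bone):
        if not constrained:
            continue  # unconstrained points contribute nothing to this term
        total += lam_b * sq_dist(pb, tb) + lam_d * (pd - td) ** 2 + lam_n * sq_dist(pn, tn)
        count += 1
    # Averaging over constrained points is an assumption, not stated in the text.
    return total / count if count else 0.0


# Two hypothetical points: the first lies over bone, the second does not.
pred = [([0.0, 0.0, 0.0], 1.0, [0.0, 0.0, 1.0]),
        ([1.0, 0.0, 0.0], 2.0, [0.0, 1.0, 0.0])]
target = [([0.0, 0.0, 1.0], 0.5, [0.0, 0.0, 1.0]),
          ([5.0, 5.0, 5.0], 9.0, [1.0, 0.0, 0.0])]
has_bone = [True, False]
loss_a = anatomy_regularizer(pred, target, has_bone)
constrained_fraction = sum(has_bone) / len(has_bone)
```

In this toy case the second point's large mismatch is ignored entirely, which mirrors how the majority of the face is left to be shaped by the other objectives.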
Thickness Regularizer. We regularize the soft tissue thickness $\widetilde{d}_0$ predicted by the model in unconstrained regions to remain small unless dictated otherwise by the skin position loss.

$$\mathbf{L}_{\text{D}} = \lambda_D^{Reg} \, \|\widetilde{d}_0\|_2^2 \tag{13}$$

We set λDReg=7.5e4superscriptsubscript𝜆𝐷𝑅𝑒𝑔7.5e4\lambda_{D}^{Reg}=7.5\mathrm{e}{-4}italic_λ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R italic_e italic_g end_POSTSUPERSCRIPT = 7.5 roman_e - 4 for all our experiments.
Symmetry Regularizer. To exploit the symmetry of the face, we regularize the predictions of the anatomy MLP $\mathbf{B}$ to be symmetric. We achieve this by requiring that reflecting the input points $\mathbf{c}$ across the plane of symmetry provides the same result as reflecting the predicted anatomy points $\widetilde{\mathbf{a}}$.

$$\mathbf{L}_{\text{Sym}} = \lambda_{sym}\|\mathbf{B}(R(\mathbf{c}))-R(\mathbf{B}(\mathbf{c}))\|_{2}^{2} \tag{14}$$

where $R$ is an operator that reflects a point across the plane of symmetry. We set $\lambda_{sym}=1\mathrm{e}{-4}$ for all our experiments. Note that we do not regularize symmetry on the predicted thickness or anatomy normals, so the model can still represent asymmetric faces.
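The symmetry term of Eq. (14) can be sketched as follows. This is a minimal NumPy illustration, assuming the plane of symmetry is $x=0$ and that the anatomy MLP is available as a plain callable `anatomy_fn` mapping (V, 3) points to (V, 3) points; both are assumptions of the sketch, as is averaging over the query points.

```python
import numpy as np

def reflect(points, normal=(1.0, 0.0, 0.0), offset=0.0):
    """Reflect (V, 3) points across the plane {x : normal . x = offset}."""
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    dist = points @ normal - offset          # signed distance to the plane
    return points - 2.0 * dist[:, None] * normal

def symmetry_loss(anatomy_fn, c, lam_sym=1e-4):
    """lam_sym * mean ||B(R(c)) - R(B(c))||^2 over the query points c."""
    diff = anatomy_fn(reflect(c)) - reflect(anatomy_fn(c))
    return lam_sym * np.sum(diff ** 2, axis=1).mean()
```

A mapping that commutes with the reflection (for example a uniform scaling about the origin) gives zero loss, while a mapping that shifts every point sideways is penalized.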
Optional Skinning Weight Regularizer. Finally, inspired by [52], we use an optional loss that encourages the estimated skinning weights $\widetilde{k}$ to be zero in regions like the forehead that are guaranteed to be unaffected by the rigid deformation of the jaw bone.

$$\mathbf{L}_{\text{K}} = \lambda_{k}\|\mathbf{K}(\mathbf{c}^{*})\|_{2}^{2} \tag{15}$$

Here $\mathbf{c}^{*}$ refers to a small region on the canonical shape $C$ which includes the forehead. We set $\lambda_{k}=1\mathrm{e}{2}$ for all our experiments.

Our final model energy $\mathbf{L}_{\text{Model}}$ is the sum of the above losses and is minimized using gradient descent [26] to train our ensemble of coordinate MLPs end-to-end.

$$\mathbf{L}_{\text{Model}} = \mathbf{L}_{\text{S}} + \mathbf{L}_{\text{A}} + \mathbf{L}_{\text{D}} + \mathbf{L}_{\text{Sym}} + \mathbf{L}_{\text{K}} \tag{16}$$
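To make the composition in Eq. (16) concrete, the sketch below assembles the model energy from its terms. It assumes the skin, anatomy and symmetry losses have already been computed (and weighted) elsewhere and are passed in as scalars, and it evaluates the thickness and skinning weight regularizers with the weights quoted in the text. The function name `model_energy`, the argument names, and the use of a mean over points are hypothetical choices for illustration.

```python
import numpy as np

def model_energy(l_skin, l_anatomy, l_sym, d_pred, k_forehead,
                 lam_d_reg=7.5e-4, lam_k=1e2):
    """Sum of the five training terms.

    l_skin, l_anatomy, l_sym: precomputed, already weighted scalars.
    d_pred: (V,) predicted soft tissue thickness.
    k_forehead: (M,) skinning weights predicted in the forehead region c*.
    """
    l_thickness = lam_d_reg * np.mean(d_pred ** 2)   # keeps thickness small
    l_skinning = lam_k * np.mean(k_forehead ** 2)    # pushes forehead weights to 0
    return l_skin + l_anatomy + l_thickness + l_sym + l_skinning
```

With these weights, nonzero skinning weights in the forehead region are penalized far more strongly than soft tissue thickness.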

4.2 Model Fitting

While the aforementioned model can recover interesting anatomic properties of the face with only sparse supervision, it is not very useful unless it can be deformed to match user constraints and serve as a shape prior for an actor's facial geometry.

After training our anatomical implicit face model on a collection of $N$ shapes, the coefficients required to deform it are a jaw bone transformation $\mathbf{T_{b}}^{*}\in\mathbf{R}^{9}$, coefficients $\mathbf{w}^{*}\in\mathbf{R}^{N-1}$ that blend the corrective expression displacements $\mathcal{B}_{e}\in\mathbf{R}^{(N-1)\times 3}$, and an optional global head transformation $\mathbf{T_{g}}^{*}\in\mathbf{R}^{9}$. Following equation (10), we can therefore evaluate our anatomical implicit face model as

$$\mathbf{s}^{*} = \mathbf{T_{g}}^{*}\left(\text{LBS}\left(\widetilde{\mathbf{s}}_{0},\mathbf{T_{b}}^{*},\widetilde{k}\right)+\sum_{N-1}\mathbf{w}^{*}\mathcal{B}_{e}\right) \tag{17}$$

where $\mathbf{T_{g}}^{*}$, $\mathbf{T_{b}}^{*}$ and $\mathbf{w}^{*}$ are the only unknowns, and the rest can be queried from a pre-trained AIM. We consider two scenarios for model fitting: i) fitting our model to a sequence of 3D scans, e.g. from a facial performance, and ii) fitting our model to 2D landmarks detected on a video [13, 46].
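The evaluation in Eq. (17) can be sketched in NumPy as follows. Several simplifications are assumed here and are not taken from the paper: transformations are given as 4x4 homogeneous matrices (the layout of the 9 transformation parameters is not described in this section), linear blend skinning uses only two bones (a static skull and the jaw, with $\widetilde{k}$ as the jaw weight), and the displacements are stored per point as an array of shape (V, N-1, 3).

```python
import numpy as np

def apply_transform(T, pts):
    """Apply a 4x4 homogeneous transform to (V, 3) points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def evaluate_aim(s0, k, B_e, T_g, T_b, w):
    """Skin points for one frame.

    s0:  (V, 3) neutral skin points queried from the trained model.
    k:   (V,) jaw skinning weights in [0, 1].
    B_e: (V, N-1, 3) corrective expression displacements.
    T_g, T_b: 4x4 head and jaw transforms (assumed representation).
    w:   (V, N-1) spatially varying blend coefficients.
    """
    # Two-bone LBS: blend the static skull pose with the jaw-transformed pose.
    skinned = (1.0 - k)[:, None] * s0 + k[:, None] * apply_transform(T_b, s0)
    # Per-point weighted sum of the N-1 corrective displacements.
    corrective = np.einsum("vn,vnd->vd", w, B_e)
    return apply_transform(T_g, skinned + corrective)
```

Since `w` has one row per query point, the same expression can be evaluated at any subset of points on the canonical shape.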

For both scenarios, inspired by the state-of-the-art findings of Kim et al. [50], we employ neural reparameterized optimization [25] and solve for the weights of a simple MLP that predicts the unknown parameters instead of directly optimizing for them. Specifically, when given a sequence of $J$ frames with 3D/2D constraints, we optimize for $J$ frame codes $\mathbf{z_{j}}\in\mathbf{R}^{f}$ which, when fed as input to a simple 4-layer MLP $\mathbf{F}_{T}$ with GeLU [24] activations, predict the head $\mathbf{T_{g}}^{j}$ and jaw $\mathbf{T_{b}}^{j}$ poses for each frame. Additionally, as the coefficients $\mathbf{w}^{j}$ are local and spatially varying depending on the template query point $\mathbf{c}$, we use a separate 4-layer MLP $\mathbf{F}_{W}$ which predicts the coefficients $\mathbf{w}^{j}$ by taking both the frame code $\mathbf{z_{j}}$ and the query point $\mathbf{c}$ as input.

$$[\mathbf{T_{g}}^{j},\mathbf{T_{b}}^{j}] = \mathbf{F}_{T}(\mathbf{z_{j}}) \tag{18}$$
$$\mathbf{w}^{j} = \mathbf{F}_{W}(\mathbf{z_{j}},\mathbf{c}) \tag{19}$$

Unlike the method of Kim et al. [50], where the reparameterized optimization was used mainly for improved performance, this neural optimization is necessary in our case to restrict the number of optimized variables, as the number of spatially varying coefficients $\mathbf{w}^{*}$ used to evaluate our anatomical implicit face model can vary drastically depending on the number of constraint points (see Section 5).
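The reparameterization of Eqs. (18) and (19) can be illustrated with a forward pass in plain NumPy. Everything numeric here is an assumption for the sketch: the code size `f`, the hidden width of 64, the random initialization, and the use of GELU in `F_W` as well as `F_T`. In practice the frame codes and MLP weights would be optimized with an automatic differentiation framework; this sketch only shows how the unknowns are produced.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init_mlp(sizes, rng):
    """One (weight, bias) pair per linear layer."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = gelu(x @ W + b)
    W, b = params[-1]
    return x @ W + b            # linear output layer

rng = np.random.default_rng(0)
J, f, V, N = 5, 16, 100, 21     # frames, code size, query points, shapes (illustrative)
z = rng.normal(size=(J, f))     # per-frame codes z_j, optimized jointly with the MLPs
F_T = init_mlp([f, 64, 64, 64, 18], rng)         # 4 layers: head (9) + jaw (9) parameters
F_W = init_mlp([f + 3, 64, 64, 64, N - 1], rng)  # 4 layers: N-1 coefficients per point
c = rng.uniform(-1.0, 1.0, size=(V, 3))          # canonical query points

def predict_frame(j):
    """Return head pose, jaw pose and per-point coefficients for frame j."""
    T = mlp(F_T, z[j])
    T_g, T_b = T[:9], T[9:]
    zc = np.concatenate([np.broadcast_to(z[j], (V, f)), c], axis=1)
    w = mlp(F_W, zc)            # (V, N-1), one row per query point
    return T_g, T_b, w

# Optimized variables: frame codes plus MLP weights, independent of V.
n_vars = z.size + sum(W.size + b.size for W, b in F_T + F_W)
```

The number of optimized variables depends on `J`, `f` and the layer sizes only, so adding or removing constraint points changes the size of `w` without changing the size of the optimization problem.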

4.2.1 Fitting Objectives

3D Position Constraint. For fitting our trained model to 3D constraints coming from a facial performance of an actor, we minimize the Euclidean distance between the estimated skin point $\mathbf{s}^{*}$ and the ground truth skin point $\mathbf{s}^{GT}$.

$$\mathbf{L}_{\text{Pos}}^{3D} = \lambda_{3D}\|\mathbf{s}^{*}-\mathbf{s}^{GT}\|_{2}^{2} \tag{20}$$

2D Position Constraint. For fitting our model to 2D constraints such as facial landmarks estimated by a pre-trained landmark detector [13, 46], we project the estimated skin point $\mathbf{s}^{*}$ to screen space using known camera intrinsics $\psi$ and calculate the Euclidean distance in 2D between the projected point $\psi(\mathbf{s}^{*})$ and the corresponding landmark.

$$\mathbf{L}_{\text{Pos}}^{2D} = \lambda_{2D}\|\psi(\mathbf{s}^{*})-\mathbf{p}\|_{2}^{2} \tag{21}$$

Here $\mathbf{p}\in\mathbf{R}^{2}$ is a detected landmark corresponding to point $\mathbf{s}^{*}$.
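A minimal sketch of the 2D position constraint in Eq. (21) is given below. It assumes a standard pinhole camera with a 3x3 intrinsics matrix `K` and skin points already expressed in camera coordinates with positive depth; the weight `lam_2d=1.0` is a placeholder, since its value is not stated in this section.

```python
import numpy as np

def project(K, pts_cam):
    """Pinhole projection of (V, 3) camera-space points to (V, 2) pixels."""
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide by depth

def landmark_loss(K, s_star, p, lam_2d=1.0):
    """lam_2d * mean squared 2D distance between projections and landmarks."""
    diff = project(K, s_star) - p
    return lam_2d * np.sum(diff ** 2, axis=1).mean()
```

Because the residual is measured in pixels, depth is only weakly constrained by this term, which is one reason the regularizers on the coefficients and frame codes matter in the 2D scenario.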
Coefficient Regularizer. As the complexity of our implicit anatomical face model can be arbitrarily large, we regularize the estimated blending coefficients $\mathbf{w}^{*}$ to be small with a weak L2 regularizer.

$$\mathbf{L}_{W} = \lambda_{Reg}^{w}\|\mathbf{w}^{*}\|_{2}^{2} \tag{22}$$

We set $\lambda_{Reg}^{w}=0.75$ for all our experiments.
Temporal Regularizer. Finally, when optimizing for coefficients on sequential data, we regularize the optimized frame codes $\mathbf{z_{j}}$ to remain similar between adjacent frames.

$$\mathbf{L}_{T} = \lambda_{Reg}^{t}\|\mathbf{z_{j}}-\mathbf{z_{j-1}}\|_{2}^{2} \tag{23}$$

We set $\lambda_{Reg}^{t}=0.05$ for all our experiments.

Our final fitting energy $\mathbf{L}_{\text{Fitting}}$ is therefore

$$\mathbf{L}_{\text{Fitting}} = \mathbf{L}_{\text{Pos}}^{3D} + \mathbf{L}_{\text{Pos}}^{2D} + \mathbf{L}_{W} + \mathbf{L}_{T} \tag{24}$$

Depending on the 3D or 2D fitting scenario, we set $\lambda_{2D}$ or $\lambda_{3D}$ to 0 respectively.
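The fitting energy of Eq. (24) can be sketched for a whole sequence as follows. The function name, the `scenario` switch, the `project` callable, and the averaging over frames and points are assumptions of this sketch; `lam_pos=1.0` stands in for $\lambda_{3D}$ or $\lambda_{2D}$, whose values are not given in this section, while the two regularizer weights follow the text.

```python
import numpy as np

def fitting_energy(s_star, w, z, target, scenario, project=None,
                   lam_pos=1.0, lam_w=0.75, lam_t=0.05):
    """Position + coefficient + temporal terms for a J-frame sequence.

    s_star: (J, V, 3) predicted skin points, w: (J, V, N-1) coefficients,
    z: (J, f) frame codes. target is (J, V, 3) scan points for "3d" or
    (J, V, 2) landmarks for "2d"; project maps (..., 3) to (..., 2).
    """
    if scenario == "3d":            # equivalent to setting lambda_2D = 0
        residual = s_star - target
    elif scenario == "2d":          # equivalent to setting lambda_3D = 0
        residual = project(s_star) - target
    else:
        raise ValueError("scenario must be '3d' or '2d'")
    l_pos = lam_pos * np.mean(np.sum(residual ** 2, axis=-1))
    l_w = lam_w * np.mean(w ** 2)
    # Penalize changes between the codes of adjacent frames.
    l_t = lam_t * np.mean(np.sum((z[1:] - z[:-1]) ** 2, axis=-1)) if len(z) > 1 else 0.0
    return l_pos + l_w + l_t
```

Only one position term is active per scenario, which mirrors setting the weight of the other term to 0.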

4.3 Implementation Details

In the model learning stage, we optimize our implicit coordinate networks for $1\mathrm{e}{4}$ iterations with a learning rate of $2\mathrm{e}{-3}$. This takes approximately 10 minutes to converge on a single Nvidia RTX 3090 for an actor model with 40,000 vertices and 20 blendshapes.

In the model fitting stage, we use a learning rate of $1\mathrm{e}{-3}$ and optimize the fitting MLPs $\mathbf{F}_{T}$ and $\mathbf{F}_{W}$ for $1\mathrm{e}{4}$ iterations. This process takes 1 second per frame on a single Nvidia RTX 3090.

We implement all our MLPs in PyTorch [36]. In our supplementary material we discuss the performance implications of replacing our current Python backend with the well engineered fused MLP implementation [32].

5 Results

We now present several results, applications and evaluations of our Anatomical Implicit face Model (AIM).

Figure 4: We demonstrate the ability of our Anatomical Implicit face Model to recover plausible anatomic features of the face, while also modeling the skin surface with very high fidelity. A subset of 3 expressions from 2 different actor specific models are shown here. The errors are displayed on a scale from 0 mm to 5 mm.

5.1 Learning Actor Specific Anatomy

We begin by showing the reconstruction accuracy of our AIM on facial blendshapes of multiple actors. As seen in Fig. 4 on 2 different actors, our method can faithfully represent facial shapes with high fidelity while capturing both the low and high frequency features of facial shape and expression. We also show the anatomic features recovered by our new formulation, which include the dense underlying facial anatomy (shown in red), the soft tissue thickness at every point on the anatomy (visualized as a heatmap), and the optimized subject specific skinning weights. These results highlight the new abilities introduced by our method in recovering plausible anatomy features while jointly learning to model surface deformations.

5.2 Anatomy Manipulation

Our ability to estimate the underlying anatomy that densely constrains the skin surface opens up new, yet computationally inexpensive ways to edit facial geometry using our learned anatomic properties. For example, as illustrated in Fig. 5, by simply scaling the learned soft tissue thickness $d$ in desired regions of the face (denoted by the hand drawn mask), an artist can interactively sculpt/deform an actor's face shape to match their requirements.
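The thickness edit described above can be sketched in a few lines. The sketch assumes that a skin point is obtained by offsetting its anatomy point along the anatomy normal by the soft tissue thickness, and that the hand drawn mask is available as per-point weights in [0, 1]; the function and argument names are hypothetical.

```python
import numpy as np

def edit_thickness(b, n, d, mask, scale):
    """Rebuild skin points after scaling thickness inside a painted region.

    b: (V, 3) anatomy points, n: (V, 3) unit anatomy normals,
    d: (V,) learned soft tissue thickness,
    mask: (V,) painted weights in [0, 1], scale: thickness multiplier.
    """
    # Blend between the original thickness (mask = 0) and the scaled one (mask = 1).
    d_edit = d * (1.0 + mask * (scale - 1.0))
    return b + d_edit[:, None] * n
```

A soft mask with values between 0 and 1 gives a smooth falloff at the boundary of the edited region.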

Figure 5: Once the AIM is learned for an actor, it can be used to intuitively deform a face using the learned anatomic properties, as demonstrated here by scaling the soft tissue thickness in a hand painted cheek region, and by propagating the change to the skin surface thanks to our formulation.

5.3 Expression Reconstruction

We next evaluate the expressiveness of our model by fitting it to unseen 3D performances of multiple actors. Given a sequence of $J$ dynamic 3D shapes from a studio scanner [2], we first deform our template mesh $C$ to match the scanned shapes using standard mesh registration techniques, so that the dynamic 3D scans are in full vertex correspondence with our AIM. We then follow the fitting procedure described in Section 4.2 and obtain per-frame transformations $[\mathbf{T_{g}}^{j},\mathbf{T_{b}}^{j}]$ and shape coefficients $\mathbf{w}^{j}$ that explain the captured ground truth shape. For this experiment, we minimize the fitting energy of Eq. (24) with the 3D position constraint of Eq. (20) and set the 2D term $\mathbf{L}_{\text{Pos}}^{2D}$ to 0. We densely constrain the fitting procedure at every vertex of the ground truth shape. In Fig. 6 we provide both a qualitative and quantitative comparison of fitting to a novel performance from an actor against global blendshapes (GBS) [27], a patch blendshape model (PBS) [13], and the anatomical local model (ALM) [47]. In this experiment, we use 20 ground truth actor blendshapes to build the GBS, PBS, and ALM models, and the anatomically reconstructed blendshapes for our method. Even under this slight disadvantage, our method outperforms both GBS and PBS and provides visually comparable results to the ALM model. Table 1 shows the average fitting error of each method across 819 frames from 5 sequences of 5 different actors. Our method converges in a few seconds for each frame, while the ALM algorithm consistently requires several minutes per frame.
While the continuous nature of AIM enables us to evaluate it with coefficients of arbitrary locality, it could result in situations where our fitting is underconstrained in the absence of dense constraints, leading to broken shapes. To illustrate that this does not happen in our reparameterized optimization, we show the result of fitting the AIM to sparse constraints in Fig. 7. While increasing the density of constraints improves the fitting accuracy, fitting our model to sparse landmarks also provides plausible results. Note that we do not compare fitting accuracy against generic morphable models like FLAME [29] or NPHM [21], as ours is actor specific and therefore a quantitative comparison might be unfair to the other methods. However, we present some qualitative comparisons to generic models in our supplemental material.

Table 1: Average fitting error across 819 frames from 5 sequences of 5 different actors. See supplemental material for details.
Method   GBS [27]   PBS [13]   ALM [47]   Ours
Error    0.834 mm   0.51 mm    0.095 mm   0.312 mm
Figure 6: We show qualitative and quantitative comparisons of fitting 3D performances with various actor specific models. All the errors are displayed on a scale from 0 mm to 5 mm.
Figure 7: Our continuous anatomical face model can be fit to 3D scans with varying density of constraints and still provide valid results due to our fitting algorithm. All the errors are displayed on a scale from 0 mm to 5 mm.

5.4 3D Performance Retargeting

Another important application of our method is in the area of 3D performance retargeting, where the goal is to transfer a facial animation from a source to a target character while respecting the identity and anatomic characteristics of the target character. To accomplish this using our model, we learn two separate instances of our model for the source and target character respectively from a sparse set of 20 blendshapes in correspondence. We then fit our source model to the facial animation of the source character to obtain per-frame transformations $[\mathbf{T_{g}}^{j},\mathbf{T_{b}}^{j}]$ and shape coefficients $\mathbf{w}^{j}$. These coefficients can simply be played back on the target model to achieve facial performance retargeting. In Fig. 8, we provide a qualitative comparison to the state-of-the-art 3D retargeting algorithm of Chandran et al. [10] by retargeting the performance from a source to a target character. Our method provides results competitive with the state of the art, while also allowing users to disentangle the rigid jaw motion and the soft tissue deformations of the skin surface. Our method additionally provides a substantial runtime benefit here and retargets each frame in a few (2-3) seconds, while the method of Chandran et al. requires several minutes per frame due to a costly anatomic solve. Finally, unlike the approach of Chandran et al., our method provides all of the above benefits without having to manually choose design parameters such as the patch layout, number of overlaps, etc.

Figure 8: We show the result of facial performance transfer in 3D from an input actor (left) to a different actor as produced by our method (2nd column) and the local retargeting model of Chandran et al. [10]. While providing qualitatively similar results, our model implicitly disentangles the performance into rigid jaw motion (3rd column), and nonrigid soft tissue deformations (4th column).

5.5 Ablations

Table 2: Average error in mm on a sequence of 100 frames using different types of activation functions in our MLPs.
Activation   GeLU      ReLU      SIREN
Error        0.71 mm   0.62 mm   0.21 mm
Table 3: Average error in mm on a sequence of 80 frames using variations of our loss functions during the model learning stage.
Losses   no $\mathbf{L}_{\text{A}}$   no $\mathbf{L}_{\text{K}}$   no $\mathbf{L}_{\text{Sym}}$   no $\mathbf{L}_{\text{D}}$   $\mathbf{L}_{\text{Model}}$ (Ours)
Error    0.29 mm   0.22 mm   0.24 mm   0.21 mm   0.19 mm

Finally, Table 2 shows an ablation study on the choice of activation in our MLPs, and Table 3 shows an ablation on the effect of several of our loss functions on the recovered geometry. Additional ablations are provided in the supplemental material.

6 Conclusion

In this paper we propose a new anatomically constrained implicit face model which provides a holistic representation of both facial anatomy and the enclosing skin surface using an ensemble of coordinate neural networks. Given an arbitrary set of skin surface meshes and only a neutral shape with estimated skull and jaw bones, our method recovers a dense anatomical substructure to constrain each point on the skin surface, and can model complex skin deformations with high fidelity. While we have explored the use of such a model in the context of actor specific blendshape models, future work could analyze its implications as a generic morphable model, by extending our formulation to handle multiple identities at once. Our new Anatomical Implicit face Model (AIM) has applications in shape representation and manipulation, retargeting and more, and we hope that our method encourages exciting future research.

References

  • Beeler and Bradley [2014] Thabo Beeler and Derek Bradley. Rigid stabilization of facial expressions. ACM TOG, 33(4), 2014.
  • Beeler et al. [2011] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. High-quality passive facial performance capture using anchor frames. ACM Trans. Graphics Proc SIGGRAPH, 30, 2011.
  • Biswas et al. [2021] Sourav Biswas, Kangxue Yin, Maria Shugrina, Sanja Fidler, and Sameh Khamis. Hierarchical neural implicit pose network for animation and motion retargeting. CoRR, abs/2112.00958, 2021.
  • Blanz and Vetter [1999] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Siggraph, 1999.
  • Bouritsas et al. [2019] G. Bouritsas, S. Bokhnyak, S. Ploumpis, S. Zafeiriou, and M. Bronstein. Neural 3d morphable models: Spiral convolutional networks for 3d shape representation learning and generation. In ICCV, 2019.
  • Bühler et al. [2023] Marcel C Bühler, Kripasindhu Sarkar, Tanmay Shah, Gengyan Li, Daoye Wang, Leonhard Helminger, Sergio Orts-Escolano, Dmitry Lagun, Otmar Hilliges, Thabo Beeler, et al. Preface: A data-driven volumetric prior for few-shot ultra high-resolution face synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
  • Cao et al. [2014] Chen Cao, Yanlin Weng, Shun Zhou, Y. Tong, and Kun Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20, 2014.
  • Chai et al. [2022] Zenghao Chai, Haoxian Zhang, Jing Ren, Di Kang, Zhengzhuo Xu, Xuefei Zhe, Chun Yuan, and Linchao Bao. Realy: Rethinking the evaluation of 3d face reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • Chandran et al. [2020] Prashanth Chandran, Derek Bradley, Markus Gross, and Thabo Beeler. Semantic deep face models. In TDV, 2020.
  • Chandran et al. [2022a] Prashanth Chandran, Loïc Ciccone, Markus Gross, and Derek Bradley. Local anatomically-constrained facial performance retargeting. ACM Trans. Graph., 41(4), 2022a.
  • Chandran et al. [2022b] Prashanth Chandran, Gaspard Zoss, Markus Gross, Paulo Gotardo, and Derek Bradley. Facial Animation with Disentangled Identity and Motion using Transformers. Computer Graphics Forum, 2022b.
  • Chandran et al. [2022c] Prashanth Chandran, Gaspard Zoss, Markus Gross, Paulo Gotardo, and Derek Bradley. Shape transformers: Topology-independent 3d shape models using transformers. Computer Graphics Forum, 41(2), 2022c.
  • Chandran et al. [2023] P. Chandran, G. Zoss, P. Gotardo, and D. Bradley. Continuous landmark detection with 3d queries. In CVPR, Los Alamitos, CA, USA, 2023. IEEE Computer Society.
  • Chen and Kim [2021] Zhixiang Chen and Tae-Kyun Kim. Learning feature aggregation for deep 3d morphable models. In CVPR, 2021.
  • Choi et al. [2022] Byungkuk Choi, Haekwang Eom, Benjamin Mouscadet, Stephen Cullingford, Kurt Ma, Stefanie Gassel, Suzi Kim, Andrew Moffat, Millicent Maier, Marco Revelant, Joe Letteri, and Karan Singh. Animatomy: An animator-centric, anatomically inspired system for 3d facial modeling, animation and transfer. In SIGGRAPH Asia 2022 Conference Papers, 2022.
  • Deng et al. [2020] Boyang Deng, J. P. Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. Nasa neural articulated shape approximation. In ECCV, 2020.
  • Egger et al. [2020] Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Christian Theobalt, Volker Blanz, and Thomas Vetter. 3d morphable face models - past, present and future. ACM TOG, 39(5), 2020.
  • Ekman and Friesen [1978] Paul Ekman and Wallace V. Friesen. Facial action coding system: a technique for the measurement of facial movement. 1978.
  • Feng et al. [2021] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. TOG, 40(4), 2021.
  • Fratarcangeli et al. [2020] Marco Fratarcangeli, Derek Bradley, Aurel Gruber, Gaspard Zoss, and Thabo Beeler. Fast Nonlinear Least Squares Optimization of Large-Scale Semi-Sparse Problems. Computer Graphics Forum, 2020.
  • Giebenhain et al. [2023] Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Gong et al. [2019] S. Gong, L. Chen, M. Bronstein, and S. Zafeiriou. Spiralnet++: A fast and highly efficient mesh convolution operator. In ICCV, 2019.
  • Guo et al. [2023] Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
  • Hoyer et al. [2019] Stephan Hoyer, Jascha Sohl-Dickstein, and Sam Greydanus. Neural reparameterization improves structural optimization. CoRR, abs/1909.04240, 2019.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Lewis et al. [2014] J. P. Lewis, K. Anjyo, Taehyun Rhee, M. Zhang, Frédéric H. Pighin, and Z. Deng. Practice and theory of blendshape facial models. In Computer Graphics Forum (Proc. Eurographics), 2014.
  • Li et al. [2020] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. Learning formation of physically-based face attributes. CoRR, abs/2004.03458, 2020.
  • Li et al. [2017] Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM, 2017.
  • Martel et al. [2021] Julien N. P. Martel, David B. Lindell, Connor Z. Lin, Eric R. Chan, Marco Monteiro, and Gordon Wetzstein. ACORN: adaptive coordinate networks for neural scene representation. CoRR, abs/2105.02788, 2021.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • Müller [2021] Thomas Müller. tiny-cuda-nn, 2021.
  • Orvalho et al. [2012] Verónica Orvalho, Pedro Bastos, Frederic Parke, Bruno Oliveira, and Xenxo Alvarez. A Facial Rigging Survey. In Eurographics 2012 - State of the Art Reports. The Eurographics Association, 2012.
  • Palafox et al. [2021] Pablo R. Palafox, Aljaz Bozic, Justus Thies, Matthias Nießner, and Angela Dai. Npms: Neural parametric models for 3d deformable shapes. CoRR, abs/2104.00702, 2021.
  • Park et al. [2019] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 2019.
  • Qiu et al. [2022] Zesong Qiu, Yuwei Li, Dongming He, Qixuan Zhang, Longwen Zhang, Yinghao Zhang, Jingya Wang, Lan Xu, Xudong Wang, Yuyao Zhang, and Jingyi Yu. Sculptor: Skeleton-consistent face creation using a learned parametric generator. ACM Trans. Graph., 41(6), 2022.
  • Ranjan et al. [2018] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3d faces using convolutional mesh autoencoders. In ECCV, 2018.
  • Sifakis et al. [2006] Eftychios Sifakis, Andrew Selle, Avram Robinson-Mosher, and Ronald Fedkiw. Simulating speech with a physics-based facial muscle model. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Goslar, DEU, 2006. Eurographics Association.
  • Sitzmann et al. [2020] Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Proc. NeurIPS, 2020.
  • Srinivasan et al. [2021] Sangeetha Grama Srinivasan, Qisi Wang, Junior Rojas, Gergely Klár, Ladislav Kavan, and Eftychios Sifakis. Learning active quasistatic physics-based models from data. ACM Trans. Graph., 40(4), 2021.
  • Tena et al. [2011] J. Rafael Tena, Fernando De la Torre, and Iain Matthews. Interactive region-based linear 3d face models. ACM Trans. Graphics Proc SIGGRAPH, 30(4), 2011.
  • Vlasic et al. [2005] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. Face transfer with multilinear models. ACM TOG, 24(3), 2005.
  • Wagner et al. [2023] Nicolas Wagner, Mario Botsch, and Ulrich Schwanecke. Softdeca: Computationally efficient physics-based facial animations. In Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games, 2023.
  • Wang et al. [2022] Daoye Wang, Prashanth Chandran, Gaspard Zoss, Derek Bradley, and Paulo Gotardo. Morf: Morphable radiance fields for multiview neural head modeling. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
  • Wood et al. [2022] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Matthew Johnson, Jingjing Shen, Nikola Milosavljevic, Daniel Wilde, Stephan Garbin, Chirag Raman, Jamie Shotton, Toby Sharp, Ivan Stojiljkovic, Tom Cashman, and Julien Valentin. 3d face reconstruction with dense landmarks, 2022.
  • Wu et al. [2016] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-constrained local deformation model for monocular face capture. ACM TOG, 35(4), 2016.
  • Yang et al. [2022] Lingchen Yang, Byungsoo Kim, Gaspard Zoss, Baran Gözcü, Markus Gross, and Barbara Solenthaler. Implicit neural representation for physics-driven actuated soft bodies. ACM Trans. Graph., 41(4), 2022.
  • Yenamandra et al. [2021] Tarun Yenamandra, Ayush Tewari, Florian Bernard, Hans-Peter Seidel, Mohamed Elgharib, Daniel Cremers, and Christian Theobalt. i3dmm: Deep implicit 3d morphable model of human heads. In CVPR, 2021.
  • Youwang et al. [2023] Kim Youwang, Lee Hyun, Kim Sung-Bin, Suekyeong Nam, Janghoon Ju, and Tae-Hyun Oh. A large-scale 3d face mesh video dataset via neural re-parameterized optimization. arXiv preprint, arXiv:2310.03205, 2023.
  • Zheng et al. [2022a] Mingwu Zheng, Hongyu Yang, Di Huang, and Liming Chen. Imface: A nonlinear 3d morphable face model with implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022a.
  • Zheng et al. [2022b] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. Im avatar: Implicit morphable head avatars from videos. In Computer Vision and Pattern Recognition (CVPR), 2022b.
  • Zheng et al. [2023] Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J. Black, and Otmar Hilliges. Pointavatar: Deformable point-based head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Zhou et al. [2020] Yi Zhou, Chenglei Wu, Zimo Li, Chen Cao, Yuting Ye, Jason Saragih, Hao Li, and Yaser Sheikh. Fully convolutional mesh autoencoder using efficient spatially varying kernels. In NeurIPS, 2020.
  • Zoss et al. [2018] Gaspard Zoss, Derek Bradley, Pascal Bérard, and Thabo Beeler. An empirical rig for jaw animation. ACM TOG, 37(4), 2018.
  • Zoss et al. [2019] Gaspard Zoss, Thabo Beeler, Markus Gross, and Derek Bradley. Accurate markerless jaw tracking for facial performance capture. ACM TOG, 38(4), 2019.
Anatomically Constrained Implicit Face Models

Supplementary Material

Appendix A Additional Details

A.1 Anatomy Constraints

Figure 9: We show here the collection of 3D shapes used in our Model Learning stage. For all our experiments we used 1 neutral expression (or rest pose) and 19 expressions, all captured and reconstructed following the method of Beeler et al. [2].

We loosely regularize the skull and mandible geometries using sparse anatomical constraints. We compute these sparse constraints by fitting template skull and mandible meshes to the neutral geometry following the method of Zoss et al. [56]. For any given skin point inside a hand-painted trusted region of the bone fitting process, we trace a ray along the inverse direction of the skin normal and store the bone intersection point only if the bone faces the same direction as the skin. We then trace a second ray, this time along the bone normal, which intersects the skin again (potentially at a different point), and store the thickness and bone normal for the intersected skin point. Overall, our sparse anatomical constraints exist for only 5 to 10% of the skin query points. We then use those bone points and thicknesses in our losses $\mathbf{L}_{\text{A}}$ and $\mathbf{L}_{\text{D}}$, respectively. We show a visualization of the anatomical constraints, the learned anatomies, and the learned thicknesses in Fig. 10. A visual depiction of the full set of 20 shapes used in our work is shown in Fig. 9.
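The two-ray procedure described above can be sketched as follows. This is a minimal illustration rather than our production implementation: it assumes triangle meshes stored as NumPy arrays, unit-length normals, a brute-force Möller–Trumbore intersection test, and hypothetical function names (`ray_triangle`, `first_hit`, `sparse_constraint`). The hand-painted trusted region is assumed to have been applied beforehand when selecting skin points.

```python
import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance t > 0 along the ray, or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:                      # ray parallel to the triangle
        return None
    s = origin - v0
    u = (s @ p) / det
    if u < 0 or u > 1:
        return None
    q = np.cross(s, e1)
    v = (direction @ q) / det
    if v < 0 or u + v > 1:
        return None
    t = (e2 @ q) / det
    return t if t > eps else None

def first_hit(origin, direction, verts, faces):
    """Closest hit on a mesh: (distance, point, unit face normal) or None."""
    best = None
    for f in faces:
        v0, v1, v2 = verts[f]
        t = ray_triangle(origin, direction, v0, v1, v2)
        if t is not None and (best is None or t < best[0]):
            n = np.cross(v1 - v0, v2 - v0)
            best = (t, origin + t * direction, n / np.linalg.norm(n))
    return best

def sparse_constraint(skin_pt, skin_n, bone_v, bone_f, skin_v, skin_f):
    """Return (bone point, bone normal, thickness, skin hit) or None."""
    hit = first_hit(skin_pt, -skin_n, bone_v, bone_f)   # ray 1: skin -> bone
    if hit is None:
        return None
    _, bone_pt, bone_n = hit
    if bone_n @ skin_n <= 0:        # keep only bone facing the same way as skin
        return None
    back = first_hit(bone_pt, bone_n, skin_v, skin_f)   # ray 2: bone -> skin
    if back is None:
        return None
    thickness, skin_hit, _ = back   # distance along the unit bone normal
    return bone_pt, bone_n, thickness, skin_hit

# Toy check: flat bone at z=0 and flat skin at z=1, both facing +z.
quad = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
skin_v = quad + np.array([0.0, 0.0, 1.0])
result = sparse_constraint(np.array([0.5, 0.25, 1.0]), np.array([0.0, 0.0, 1.0]),
                           quad, faces, skin_v, faces)
```

A practical implementation would replace the brute-force loop with an acceleration structure such as a BVH, since the test is repeated for every candidate skin point.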

Figure 10: For two actors, we show on the left the input neutral geometry next to the fitted skull and mandible, overlaid with our computed sparse anatomical constraints. On the right, we show the reconstructed geometry, the learned anatomy (using those sparse anatomical constraints), and the learned thicknesses.

A.2 Network Architecture

In Fig. 11 and Fig. 12, we show a detailed breakdown of our memorization and fitting networks.

Figure 11: Starting from a query point $\mathbf{c}$ on the template shape, an ensemble of Siren MLPs [40] predicts the dense underlying anatomy $\widetilde{\mathbf{b}}_{0}$, the anatomy normals $\widetilde{\mathbf{n}}_{0}$, and the soft tissue thickness $\widetilde{d}_{0}$, from which a neutral shape $\widetilde{\mathbf{s}}_{0}$ of the actor is reconstructed. Then, using learned per-shape jaw transformations $\widetilde{T}_{i}$ and actor specific skinning weights $\widetilde{k}$, the neutral is skinned to account for the rigid jaw movement. Finally, expression specific deformations $\widetilde{\mathbf{e}}_{i}$ are added on top of the skinned mesh to reconstruct the given blendshapes.
Figure 12: Given a query point $\mathbf{c}$ and a learned code $\mathbf{z}_{j}$ for each target shape, we use small fitting MLPs to predict the jaw transformation $\widetilde{T}_{j}^{*}$ and the per-point coefficients $\widetilde{\mathbf{w}}_{j}^{*}$, with which the AIM model can be evaluated to produce the estimated shape. The fitting MLPs are trained to minimize the reconstruction error between the estimated and target shapes.
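The evaluation order described for Fig. 11 (neutral reconstruction, jaw skinning, then expression deformation) can be sketched in a few lines. The sketch assumes that the MLP outputs are already available as arrays, that the neutral skin is the anatomy point offset along its unit normal by the thickness, and that skinning is a per-point linear blend between the rest position and the rigidly transformed position. These formulas and the function names are illustrative assumptions, not a transcription of our implementation.

```python
import numpy as np

def reconstruct_neutral(b0, n0, d0):
    """Neutral skin: anatomy point offset along its unit normal by the thickness."""
    n0 = n0 / np.linalg.norm(n0, axis=-1, keepdims=True)
    return b0 + d0[:, None] * n0

def skin_with_jaw(s0, R, t, k):
    """Blend each point between its rest position and the jaw-transformed one."""
    moved = s0 @ R.T + t                      # rigid jaw transform (R, t)
    return (1.0 - k)[:, None] * s0 + k[:, None] * moved

def evaluate_shape(b0, n0, d0, R, t, k, e_i):
    """Neutral -> jaw skinning -> additive expression deformation."""
    s0 = reconstruct_neutral(b0, n0, d0)
    return skin_with_jaw(s0, R, t, k) + e_i

# Toy example with two query points (all values are made up).
b0 = np.zeros((2, 3))                          # anatomy points
n0 = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]])  # unnormalized normals
d0 = np.array([1.0, 2.0])                      # soft tissue thickness
R, t = np.eye(3), np.array([0.0, -1.0, 0.0])   # jaw moves down by 1
k = np.array([0.0, 1.0])                       # point 0 on skull, point 1 on jaw
e_i = np.zeros((2, 3))                         # no expression offset
shape = evaluate_shape(b0, n0, d0, R, t, k, e_i)
```

In this toy case the first point stays at its neutral position while the second follows the jaw translation.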

Appendix B Additional Results

B.1 Face Reconstruction from 2D Landmarks

In the main paper, we describe how to formulate a 2D position constraint to fit our anatomical implicit face model to landmarks obtained from a pre-trained landmark detector. In Fig. 13, we show qualitative results of fitting our trained anatomical implicit model to 10,000 dense landmarks predicted by a 2D landmark detector [13] on an input monocular video.
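As a rough illustration of such a 2D position constraint, the sketch below projects model points with a pinhole camera and measures the squared distance to detected landmarks. The camera model, the optional per-landmark confidence weighting, and the names `project` and `landmark_loss` are assumptions made for this example. Points are assumed to lie in front of the camera.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points @ R.T + t            # world -> camera coordinates
    uvw = cam @ K.T                   # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

def landmark_loss(points, landmarks_2d, K, R, t, conf=None):
    """Mean squared 2D distance between projected points and landmarks."""
    residual = project(points, K, R, t) - landmarks_2d
    sq = np.sum(residual**2, axis=-1)
    if conf is not None:              # optional per-landmark confidence
        sq = conf * sq
    return sq.mean()

# Toy camera: focal length 100 px, principal point (50, 50), 2 units away.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
uv = project(pts, K, R, t)
loss_exact = landmark_loss(pts, uv, K, R, t)
loss_shifted = landmark_loss(pts, uv + np.array([3.0, 4.0]), K, R, t)
```

In an actual fitting setup this residual would be minimized with respect to the model parameters and the head pose using automatic differentiation; the NumPy version only shows the quantity being minimized.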

Figure 13: We demonstrate a proof of concept of the application of our model in face reconstruction, where our AIM model can be fit to 2D landmarks obtained from a pre-trained landmark detector, capturing both the pose and expression of the person faithfully.

B.2 Learning Actor Specific Anatomical Properties

In Fig. 18, we show additional results of the recovered dense anatomical properties on a number of actors with varying face shapes spanning different ethnicities and age groups.

B.3 Runtime Analysis

Our model fitting stage, which involves the training of the fitting MLPs $F_{W}$ and $F_{T}$ (see the main text), takes at most a few seconds per frame to converge on an Nvidia RTX 3090. As an engineering update to our system, we experimented with the tiny-cuda-nn framework of Müller [32] and found that it provided a 2x performance improvement in model fitting, without any adverse effects on fitting accuracy. We leave a more thorough performance optimization of our pipeline to future work, which could also include exploring fused MLPs for the model learning stage.

B.4 3D Performance Retargeting

We refer the reader to our supplemental video for additional retargeting results and qualitative comparisons.

B.5 Ablations

Figure 14: 1st row: the effect of removing the thickness regularizer $L_{D}$, which encourages the soft tissue thickness to remain small in unconstrained areas. 2nd row: the effect of removing the anatomy loss $L_{A}$, which results in a collapse of the learned anatomy while still reconstructing the neutral in the first column. 3rd row: the effect of removing the optional skinning weight regularizer $L_{K}$, which does not adversely affect the learned skinning weights, as seen in the last column. 4th row: the effect of removing the symmetry regularizer on the anatomy, as a result of which the anatomy no longer remains symmetric. Last row: our $L_{\text{Model}}$ loss, which uses a weighted sum of all regularizers.

We provide visual results for the ablations we performed in our work: the effect of removing certain regularizers used during the model learning stage (see the main text) in Fig. 14, the effect of different activation functions in Fig. 15, and the effect of the size of the hidden layers used during model learning in Fig. 16.

Figure 15: Using GeLU and ReLU activations in our implicit MLPs results in oversmoothed anatomy and reconstructions lacking surface detail. Sine activations provided the best results.
Figure 16: While increasing the size of the hidden layers in our MLPs improves reconstruction performance, it comes at the cost of a larger network that is slower to evaluate. In our work, we used a hidden layer size of 256 neurons, which provided a good balance between accuracy and performance.
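To make the two ablated design choices concrete, the following sketch builds a small sine-activated MLP in the style of Sitzmann et al. [40] and counts its parameters for different hidden sizes. The number of hidden layers (three), the frequency factor of 30, the zero bias initialization, and the function names are assumptions for illustration only. The weight initialization bounds follow the scheme proposed for Siren networks.

```python
import numpy as np

def init_siren(sizes, w0=30.0, seed=0):
    """Siren-style initialization for layer widths given in `sizes`."""
    rng = np.random.default_rng(seed)
    params = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        # First layer: U(-1/n_in, 1/n_in); later layers: U(+-sqrt(6/n_in)/w0).
        bound = 1.0 / n_in if i == 0 else np.sqrt(6.0 / n_in) / w0
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        b = np.zeros(n_out)           # simplification: zero biases
        params.append((W, b))
    return params

def siren_forward(params, x, w0=30.0):
    """Sine activations on hidden layers, linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.sin(w0 * (h @ W + b))
    W, b = params[-1]
    return h @ W + b

def count_params(params):
    return sum(W.size + b.size for W, b in params)

# 3D query point in, 3D quantity out, three hidden layers (assumed depth).
net_256 = init_siren([3, 256, 256, 256, 3])
net_512 = init_siren([3, 512, 512, 512, 3])
out = siren_forward(net_256, np.zeros((5, 3)))
n_256, n_512 = count_params(net_256), count_params(net_512)
```

Under these assumptions, doubling the hidden size from 256 to 512 roughly quadruples the parameter count, which is consistent with the accuracy versus evaluation speed trade-off noted in Fig. 16.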

B.6 Generic Model Comparison

As discussed in the main text, a quantitative comparison of our actor specific model against a generic 3D morphable model (3DMM) would be unfair to generic 3DMMs, as they serve a more diverse purpose. However, in Fig. 17 we show a visual comparison of 2 expressions, fitted using 3D positions as constraints, with our model and the FLAME model [29] for 2 different actors.

Figure 17: We show 2 expressions of 2 different actors fitted by our model and the FLAME model [29]. A generic 3DMM is unable to faithfully capture a particular individual's shape that lies outside of its shape space.
Figure 18: We show the anatomical features recovered by our formulation across a wide variety of actors. From left to right, we show the ground truth neutral shape, the reconstructed neutral shape, our learned anatomy, our learned soft-tissue thickness, our learned anatomical normals, and our learned subject specific skinning weights.