License: arXiv.org perpetual non-exclusive license
arXiv:2402.18330v1 [cs.CV] 28 Feb 2024

Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

Taeho Kang
Seoul National University, South Korea
[email protected]
   Youngki Lee
Seoul National University, South Korea
[email protected]
Abstract

We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps, probabilistic 2D representations of the body pose, but heatmap-to-3D pose conversion remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. Then, the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscure joints. Our method significantly outperforms the previous state of the art both qualitatively and quantitatively, as demonstrated by a 23.9% reduction in MPJPE. Our source code is available on GitHub: https://github.com/tho-kn/EgoTAP.

1 Introduction

The increasing use of Virtual Reality (VR) and Augmented Reality (AR) applications has prompted efforts to perform various vision tasks with minimal wearable sensors. Specifically, head-mounted cameras in the egocentric setup (Fig. 1) have received increasing attention thanks to their accessibility. Here, accurate 3D pose estimation is a critical task for seamlessly integrating virtual selves into the real world. However, existing egocentric pose estimation methods still suffer from accuracy challenges [8].

Figure 1: The stereo egocentric input and a comparison of the pose estimated by the state-of-the-art method [8] and by ours. Blue denotes the ground truth and red denotes the respective method’s estimation.

Conventional 3D pose estimation methods typically derive 3D pose directly from 2D pose information [19, 12, 34]. However, this approach faces challenges in egocentric setups due to inaccuracies in 2D pose estimation resulting from limited camera views and self-occlusion. To address this, egocentric pose estimation methods use joint heatmaps—probabilistic 2D representations of joints [21]. These heatmaps employ probability distributions of likely joint positions rather than exact locations. Following this approach, methods generate heatmaps for key joints from egocentric camera input, consolidate them into a unified feature embedding vector, and perform full-body 3D pose estimation (Fig. 2). However, two critical problems in the heatmap-to-3D lifting process significantly impact position estimation accuracy.

Figure 2: The architecture of the common baseline heatmap-to-3D approach. This architecture is adopted by monocular $x$R-EgoPose [20] and stereo UnrealEgo [2] for 3D pose inference.

Figure 3: Comparison of the reconstructed heatmaps from the encoded heatmap features, with the frozen encoder from (c) CNN Encoder and (d) Grid ViT Encoder of the pose estimation model. Panels: (a) Input, (b) Estimated Heatmaps, (c) CNN Encoder, (d) Grid ViT Encoder.

Inefficiency in feature embedding. Obtaining an effective feature embedding from the heatmap poses a significant challenge. A robust embedding vector is crucial for accurately reconstructing the 3D pose, given the indirect mapping between the probabilistic, high-dimensional heatmaps and the 3D pose. However, the standard design, which uses a CNN (Convolutional Neural Network) encoder, proves inadequate for feature summarization. The CNN encoder fails to preserve the correspondence between specific heatmaps and joint poses, as features are merged into a single shared embedding. Furthermore, the spatial locality assumption of CNNs does not hold in an egocentric setup, where related joints may be distant in pixel space due to the proximity of egocentric cameras to body parts and biased positions. The 3D pose lifting employs a heatmap reconstruction loss [20, 32, 2, 8] to recover heatmap information, but full recovery becomes challenging once the embedding vector has lost significant information, as illustrated in Fig. 3.

Feature Importance-agnostic 3D Lifting. Secondly, there is a significant inaccuracy in estimating a full-body 3D pose without effectively distinguishing between important and unimportant features, as seen in the conventional pipeline using a Multi-Layer Perceptron (Fig. 2 (b)). The prior methods [20, 32, 2, 8] do not consider the certainty of joints or the physical relationships between them, relying solely on the motion distribution within the training data. As a result, obscure joint features may adversely affect joints that have clear visual cues in the camera or that are estimable from nearby joint information. The supplementary material highlights that body extremities with less visibility exhibit higher estimation errors.

To tackle these challenges, we introduce EgoTAP (Egocentric Transformer-Attention Propagation Network). EgoTAP incorporates two key techniques: Grid ViT (Vision Transformer) Heatmap Encoder and Propagation Network. We design the former to generate an effective feature embedding that (i) preserves the correspondence between heatmaps and feature embedding and (ii) captures meaningful relationships between distant pixels. The latter assigns weights to evident joint features with clearer visual cues and predicts the position of less visible joints using the skeletal information of body limbs. Through these techniques, we achieve a substantial improvement in pose error metrics, demonstrating a 23.9% reduction in MPJPE and a 17.7% decrease in PA-MPJPE compared to state-of-the-art methods.

Grid ViT Heatmap Encoder addresses the inefficiency of the CNN encoding process. The Grid ViT Heatmap Encoder consolidates all joint heatmaps into a single image and divides it into patches, with each patch belonging to a single heatmap. Subsequently, self-attention is applied across all patches, generating per-patch feature embeddings. The Grid ViT Heatmap Encoder offers two key advantages. Firstly, the per-patch embedding better preserves the position information of the original joint heatmaps. Secondly, self-attention facilitates the effective embedding of inter-joint relationships, which is particularly useful for joint features in distant areas.

Propagation Network propagates various features from the neck joint, which is likely to have evident features, to the body’s extremities with less visibility, following the body hierarchy. To enable propagation, we devise an LSTM [7]-inspired cell, the PU (Propagation Unit). The PU takes the parent joint’s feature and the relational (limb) features as a hidden state, and the child joint’s features as input, to predict the final 3D position. The PU has an additional gate to forget the parent and relational features when the child joint features are evident, limiting predictive estimation to obscure joints. This design explicitly leverages the physical relationships of joints rather than implicitly inferring them from the training data, thereby contributing to higher pose estimation accuracy.

In summary, our contributions are the following:

  • The first egocentric 3D pose estimation method using a vision transformer for efficient feature embedding.

  • The Propagation Network that enables the predictive estimation for obscure joints using skeletal hierarchy.

  • The Propagation Unit, to control the importance of the propagated features.

  • EgoTAP outperforms the state-of-the-art stereo egocentric pose estimation both qualitatively and quantitatively.

2 Related Works

2.1 Egocentric Pose Estimation

Egocentric pose estimation can be classified into two main categories. The first category focuses on estimating the pose of other people within the camera’s field of view, as in Ng et al. [15], while the second category estimates the user’s own pose [11]. Our work belongs to the second category, specifically with a downward-oriented egocentric camera.

EgoCap [17] showcased its potential using stereo cameras on a helmet-mounted stick. Mo$^2$Cap$^2$ [27] and $x$R-EgoPose [20] have introduced single-camera methods, which handle occlusion. The former proposes a two-branched heatmap, one for the lower body with a magnified view. The latter adds a heatmap reconstructor to preserve the probabilistic information of heatmaps. Recent methods utilize an external camera view to make a weakly labeled large-scale dataset [24] and a scene depth estimation model to estimate 3D pose with volumetric heatmaps [25]. These methods, however, require additional external cameras or depth datasets from specific views.

Recently, a stereo egocentric setup has gained attention for its wide-view stereo perspective. EgoGlass [32] introduces an unobtrusive eyeglass-mounted stereo camera setup. It incorporates an additional segmentation branch on the heatmap estimator module to improve the awareness of body parts and pixel correspondence. UnrealEgo [2] introduces a publicly available synthetic large-scale dataset based on the EgoGlass setup and proposes to share weights and merge features across the stereo view in the heatmap estimator. Ego3DPose [8] suggests making an independent estimate of the 3D orientation of each limb, using the concatenated orientation vector for the final decoder. We observed two problems in these prior works, namely information loss in feature embedding and data-dependent estimation of obscure joints, and propose two corresponding techniques to address them.

2.2 3D Human Pose Estimation with Transformer

The transformer-based architecture has been explored for the 3D pose estimation task. Epipolar Transformers [6] utilizes attention to match features along the epipolar line from the stereo view. Most methods focus on using transformers for 2D-to-3D pose lifting, spatially and temporally. PoseFormer [34] is the first transformer-based 2D-to-3D pose lifting method consisting of spatial and temporal transformer networks. MixSTE [31] and PoseFormerV2 [33] improved it with per-joint temporal characteristics and frequency-domain features. Unlike prior works, we exploit the transformer to effectively embed heatmap information for accurate heatmap-to-3D pose lifting.

2.3 Skeletal Network Models

Multiple works utilize skeletal hierarchy for vision tasks. For instance, Liu et al. [13] use a spatio-temporal LSTM to iterate through all joints for action recognition. Most recent efforts utilize a graph-based model to represent skeletal hierarchy. Graph Convolutional Networks [10] are widely utilized for activity recognition [4], while ST-GCN [28] models a dynamic skeletal graph in a spatiotemporal manner. The graph-based models are adapted for pose estimation [30, 29, 28], using dynamic skeletal graphs with action-specific edges or adopting adaptive ST-GCN [29, 28].

Our work is the first to leverage skeletal information in the egocentric setup. Specifically, we address the challenge of obscure features, particularly for body extremities, which impact the pose estimation of all body parts. By introducing a skeleton-aware uni-directional Propagation Network model, we leverage clear visual cues from camera-proximate joints to estimate the pose of body parts with obscure visual features.

3 Method

3.1 Overview

Figure 4: Overall network architecture of EgoTAP. EgoTAP takes heatmaps from pre-trained heatmap estimators taking stereo input images and lifts the heatmaps to the 3D pose with the Grid ViT Encoder, Propagation Network, and finally, a projection layer.

Overall Architecture. Fig. 4 illustrates the comprehensive architecture of EgoTAP. It comprises two essential components: the Grid ViT Heatmap Encoder and the Propagation Network. The Grid ViT Heatmap Encoder takes joint heatmaps as input and generates effective feature embeddings for each joint. The Propagation Network processes these embeddings with awareness of the skeletal structure to estimate the 3D pose accurately. Notably, the per-joint feature embedding is propagated through a skeletal hierarchy, represented as a tree structure with a root representing the head. In Fig. 4, a simplified skeleton is depicted, showcasing the propagation from the head to the hand, highlighted in red. The feature propagation utilizes the PU (Propagation Unit in Fig. 5), which calculates joint states based on the parent joint’s states along with other self-joint features. The hidden states of the last PU layer are concatenated with the joint features from the Grid ViT encoder and linearly projected to estimate the 3D pose of each joint.

Input and Output. Our method utilizes a pre-trained and frozen heatmap estimator that takes stereo RGB images $I \in \mathbb{R}^{2 \times 256 \times 256 \times 3}$ and estimates stereo heatmaps for $N_J$ joints, $\mathbf{H_J} \in \mathbb{R}^{2N_J \times 64 \times 64}$, and $N_L$ limbs, $\mathbf{H_L} \in \mathbb{R}^{2N_L \times 2 \times 64 \times 64}$. EgoTAP takes the heatmaps and reconstructs the 3D pose $P \in \mathbb{R}^{N'_J \times 3}$ of $N'_J$ joints relative to the user’s root defined in the dataset. Note that the number of estimation targets $N'_J$ can differ from the number of joints with heatmaps $N_J$, depending on the dataset.

Loss. We use the Euclidean distance and the cosine similarity-based loss between the ground-truth pose and the estimated pose to train the Attention-Propagation network. The loss formulation is in the supplementary material.
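Since the exact loss formulation is deferred to the supplementary material, the following is only a minimal sketch of how a Euclidean term and a cosine-similarity term could be combined. The function name `pose_loss`, the per-joint averaging, and the weights `w_dist` and `w_cos` are assumptions made for illustration, not the formulation used in the paper.

```python
import math

def pose_loss(pred, gt, w_dist=1.0, w_cos=1.0, eps=1e-8):
    """Hypothetical pose loss: mean joint distance plus mean (1 - cosine similarity).

    pred, gt: equal-length lists of (x, y, z) joint positions.
    w_dist, w_cos: illustrative weights, not values from the paper.
    """
    dist_sum = 0.0
    cos_sum = 0.0
    for p, g in zip(pred, gt):
        # Euclidean distance between the estimated and ground-truth joint.
        dist_sum += math.dist(p, g)
        # Cosine similarity between the two joint position vectors.
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.hypot(*p) * math.hypot(*g)
        cos_sum += 1.0 - dot / (norm + eps)
    n = len(gt)
    return w_dist * dist_sum / n + w_cos * cos_sum / n

gt_pose = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
loss_same = pose_loss(gt_pose, gt_pose)
loss_orthogonal = pose_loss([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)])
```

With the first term alone and unit weight, the value is the mean per-joint Euclidean error, which is the quantity that the MPJPE metric reports.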

Heatmaps. Two types of heatmaps, for joints and for limbs, are used. We follow the standard definition of the joint heatmap [21], where pixel values represent the probability that the joint is at that 2D coordinate. The limb heatmaps have two channels and are used to obtain relational features between two joints for the Propagation Network in Sec. 3.3. We use the limb heatmap suggested by Kang et al. [8], which represents 3D information along with limb visibility as a line connecting joints. In the following sections, we refer to these two types as joint heatmaps and limb heatmaps. For heatmap estimation, we use a pre-trained ResNet-18 [5] based U-Net [18] architecture with shared weights for the two input image encoders and a shared decoder, as suggested by Akada et al. [2].
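To make the joint heatmap definition concrete, the sketch below builds a $64 \times 64$ map whose values peak at a joint's 2D location. The isotropic Gaussian shape and the value of `sigma` are assumptions, following a common convention for heatmap targets; the helper names are hypothetical.

```python
import math

def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Return a size x size heatmap (list of rows) peaking at pixel (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def argmax_2d(heatmap):
    """Return the (x, y) pixel holding the largest value."""
    best = max(
        ((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]

hm = gaussian_heatmap(cx=20, cy=45)
peak = argmax_2d(hm)  # most likely joint location
```

Taking only the peak discards the spread of the distribution, which is the uncertainty information that heatmap-based lifting tries to retain.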

3.2 Grid ViT Heatmap Encoder

Our encoder, described in Fig. 4, combines all joint heatmaps into a large single grid image. The grid is split into patches, linearly projected to make the input embedding, and fed to a transformer [22] encoder architecture with multi-head attention. The transformer encoding process preserves the correspondence between a patch and the input feature embedding in the output. The output feature embeddings corresponding to individual input patches are concatenated and re-encoded to form a feature embedding vector for the heatmap.

Unlike the CNN encoder, where communication occurs only among nearby pixels of different heatmaps, the Grid ViT Heatmap Encoder allows communication between heatmap patches that are spatially distant. This allows features to be shared without downsampling, minimizing the loss of information. The efficiency of the encoder is demonstrated by the precisely reconstructed heatmaps from the embeddings in Fig. 3 and Table 3, and by the improved pose estimation accuracy.

To formulate the process, let $\{\mathbf{H_{J,i}} \in \mathbb{R}^{64 \times 64} \mid i = 1, 2, \ldots, 2N_J\}$ be the set of $2 \times N_J$ stereo joint heatmaps. The heatmaps are arranged into a single grid image. The image is subsequently split into a total of $4 \times 4 \times 2N_J$ patches $\{X_i \in \mathbb{R}^{16 \times 16} \mid i = 1, 2, \ldots, 32N_J\}$, where 16 patches correspond to one heatmap. For simplicity, we index the patches so that $X_{16(i-1)+1}$ to $X_{16i}$ correspond to the $i$-th heatmap.

Each patch $X_i$ is then projected to an input embedding space $\mathbb{R}^{1024}$ with a learnable projection matrix $W_z \in \mathbb{R}^{1024 \times 256}$. Additionally, learnable positional encodings $\mathbf{p}_i \in \mathbb{R}^{1024}$ are added, resulting in the transformer input embedding $z_i$. The projected embedding with positional encoding for each patch is:

$$z_i = W_z \cdot \mathrm{Flatten}(X_i) + \mathbf{p}_i \tag{1}$$
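The patch indexing and Eq. (1) can be sketched in plain Python as follows. This is an illustration under stated assumptions: the row-major patch order inside each heatmap, the random stand-ins for the learned $W_z$ and $\mathbf{p}_i$, and the reduced embedding size `d_model = 8` (the paper uses 1024) are choices made here to keep the example small.

```python
import random

PATCH = 16    # patch side length, from the text
HEATMAP = 64  # heatmap side length, from the text

def split_into_patches(heatmap):
    """Split one 64x64 heatmap into 16 flattened 16x16 patches (row-major order)."""
    patches = []
    for py in range(0, HEATMAP, PATCH):
        for px in range(0, HEATMAP, PATCH):
            patches.append([heatmap[py + dy][px + dx]
                            for dy in range(PATCH) for dx in range(PATCH)])
    return patches

def embed(patch, W_z, p_i):
    """Eq. (1): z_i = W_z . Flatten(X_i) + p_i for one flattened patch."""
    return [sum(w * x for w, x in zip(row, patch)) + p
            for row, p in zip(W_z, p_i)]

rng = random.Random(0)
n_heatmaps = 2  # stands in for 2 * N_J
d_model = 8     # reduced from 1024 for the demo (assumption)
heatmaps = [[[rng.random() for _ in range(HEATMAP)] for _ in range(HEATMAP)]
            for _ in range(n_heatmaps)]

# With 1-based indices, patches 16(i-1)+1 .. 16i belong to the i-th heatmap.
patches = [p for hm in heatmaps for p in split_into_patches(hm)]

# Random stand-ins for the learnable projection and positional encodings.
W_z = [[rng.gauss(0.0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(d_model)]
pos = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in patches]
z = [embed(x, W_z, p) for x, p in zip(patches, pos)]
```

Each flattened patch has $16 \times 16 = 256$ values, which matches the second dimension of $W_z$.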

$z = [z_1, z_2, \ldots, z_{32N_J}]$ is encoded by three ViT transformer encoder [3] layers with multi-head attention to output $z' = [z'_1, z'_2, \ldots, z'_{32N_J}]$. For the $j$-th heatmap, the corresponding output embeddings from 16 patches are concatenated to $\mathcal{Z}_j$ and then re-encoded to a smaller-dimensional feature embedding $k_j$ through multiple fully connected layers denoted as $E_K$. The process is formulated as follows:

$$z' = \textit{TransformerEncoder}(z) \tag{2}$$
$$\mathcal{Z}_j = [z'_{16(j-1)+1}, z'_{16(j-1)+2}, \ldots, z'_{16j}] \tag{3}$$
$$k_j = E_K(\mathcal{Z}_j) \tag{4}$$

A joint feature $\mathbf{F_{J,i}} \in \mathbb{R}^{256}$ that corresponds to a specific joint is obtained by concatenating the stereo heatmap features. Suppose the $(2i-1)$-th and $2i$-th heatmaps correspond to the $i$-th joint:

$$\mathbf{F_{J,i}} = [k_{2i-1}, k_{2i}], \text{ for } 1 \leq i \leq N_J \tag{5}$$
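The index bookkeeping of Eqs. (3) to (5) can be sketched as below. The transformer output is replaced by dummy vectors tagged with their heatmap index, and `toy_re_encoder` is a hypothetical placeholder (mean pooling) for the learned fully connected encoder $E_K$; the dimensions are reduced for readability.

```python
def group_patch_outputs(z_out, patches_per_heatmap=16):
    """Eq. (3): concatenate the outputs of the 16 patches of each heatmap."""
    groups = []
    for start in range(0, len(z_out), patches_per_heatmap):
        chunk = z_out[start:start + patches_per_heatmap]
        groups.append([v for vec in chunk for v in vec])
    return groups

def toy_re_encoder(vec, out_dim):
    """Placeholder for E_K in Eq. (4): mean-pool the vector into out_dim bins."""
    step = len(vec) // out_dim
    return [sum(vec[b * step:(b + 1) * step]) / step for b in range(out_dim)]

def joint_features(k):
    """Eq. (5): concatenate the two stereo heatmap features of each joint."""
    return [k[2 * i] + k[2 * i + 1] for i in range(len(k) // 2)]

n_joints, d_model, k_dim = 3, 8, 4  # reduced sizes (assumption)
# Dummy transformer output: every vector stores its 0-based heatmap index.
z_out = [[float(t // 16)] * d_model for t in range(32 * n_joints)]

Z = group_patch_outputs(z_out)
k = [toy_re_encoder(Z_j, k_dim) for Z_j in Z]
F_J = joint_features(k)
```

In this toy run the second joint's feature is built from heatmaps 2 and 3 (0-based), the left and right view of the same joint, so the per-joint correspondence survives the encoding.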

3.3 Propagation Network

Propagation Process. The Propagation Network estimates the joint positions using their parent joints’ positions and the relationships between the joints. The Propagation Network is inspired by the stereo setup’s capability to estimate 3D pose without the help of other joints and the general trend of higher visibility on joints closer to the camera in the egocentric setup. Sec. 4.3.2 shows that the Propagation Network effectively takes advantage of accurate estimation of the parent joint with a Propagation Potential and Propagation Effect metric.

The Propagation Network comprises a relational feature encoder and a two-layer PU that handles the propagation process. The relational feature encoder takes the estimated limb heatmaps and outputs the relational feature between joints. The PU takes the parent’s states together with the relational and joint features of the child joint as input and generates the child joint’s states. The states of joints are propagated through the tree hierarchy from the head, which is directly attached to the camera, to the extremities. During propagation, the PU flexibly determines how much of the parent joint information is reflected, based on the certainty of the parent and child joint features.

We leverage the limb heatmaps with 3D information embedded with a trigonometric function of the camera view angle [8] to provide information about the connection between the parent and child joints. An encoder with fully connected layers, $E_L$, encodes each limb heatmap $\mathbf{H_{L,i}} \in \mathbb{R}^{2 \times 64 \times 64}$ into a limb feature. Stereo limb features are concatenated to form the relational feature $\mathbf{F_R}$. Suppose $\mathbf{H_{L,2i-1}}$ and $\mathbf{H_{L,2i}}$ correspond to the limb that connects the $i$-th joint and its parent. The process is:

$$\mathbf{F_{R,i}} = [E_L(\mathbf{H_{L,2i-1}}), E_L(\mathbf{H_{L,2i}})], \text{ for } 1 \leq i \leq N_L \tag{6}$$

The Propagation Network consists of two layers of the Propagation Unit, described later. Consider a tree hierarchy where $\mathrm{parent}(i)$ denotes the index of the parent of joint $i$, and let $\textit{PropagationNet}((H, C), J, R)$ denote the Propagation Network, which takes the hidden and cell states of the two PU layers, $H = [h_1, h_2]$ and $C = [c_1, c_2]$, the joint feature $J$, and the relational feature $R$. The hidden and cell states of the $i$-th joint, $\mathbf{H}_i$ and $\mathbf{C}_i$, are computed as follows:

$$\mathbf{S}_i = (\mathbf{H}_i, \mathbf{C}_i) \tag{7}$$
$$\mathbf{H}_0 = \vec{0}, \quad \mathbf{C}_0 = \vec{0} \tag{8}$$
$$\mathbf{S}_i = \textit{PropagationNet}(\mathbf{S}_{\mathrm{parent}(i)}, \mathbf{F_{J,i}}, \mathbf{F_{R,i}}), \text{ for } 1 \leq i \leq N_J \tag{9}$$

The root joint, the head, is indexed 0 and initialized with a zero vector, as it is not visible from an egocentric view and thus does not have features. The $i$-th Propagated Feature $\mathbf{F_{P,i}} \in \mathbb{R}^{256}$ is the hidden state from the second layer of the Propagation Network, $\mathbf{h}_{2,i}$.
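The recursion in Eqs. (7) to (9) amounts to visiting every joint after its parent. The sketch below shows only this traversal: the seven-joint skeleton, the function name `propagate`, and the toy `step` function, which records the visited path instead of calling a learned network, are hypothetical.

```python
def propagate(parent, step, root_state):
    """Compute one state per joint, visiting each joint after its parent.

    parent: dict mapping child index -> parent index; index 0 is the root (head).
    step: function (parent_state, joint_index) -> child_state.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    states = {0: root_state}
    stack = [0]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            states[child] = step(states[node], child)
            stack.append(child)
    return states

# Hypothetical skeleton: 0 head, 1 neck, 2-4 left arm, 5-6 right arm.
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 1, 6: 5}
# Toy step: append the joint index, so each state lists the path from the root.
states = propagate(parent, lambda s, i: s + [i], root_state=[])
```

In the model, `step` would correspond to the Propagation Network applied to the parent's state and the child's joint and relational features.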

The output of the Propagation Network $\mathbf{F_{P,i}}$ and the transformer output joint features $\mathbf{F_{J,i}}$ for each joint are concatenated and projected to estimate the 3D position of each joint.

Figure 5: The Propagation Network with two layers of Propagation Unit.

Propagation Unit. We devise a Propagation Unit inspired by the LSTM cell for the above propagation process. Fig. 5 shows the internal structure of the Propagation Unit. The Propagation Unit weights the parent’s hidden state and the relational feature using the joint feature. Because the joint heatmaps from stereo views can be sufficient for precise 3D estimation, this weighting limits predictive estimation to obscure joints.

To formulate the Propagation Unit, we denote the weight matrices as $W$ and the bias vectors as $b$. The symbol $\odot$ represents element-wise multiplication, the $+$ sign represents element-wise addition, and $\sigma$ denotes the sigmoid activation.

$$f'_i = \sigma(W_{f'} \cdot \mathbf{F_{J,i}} + b_{f'}) \tag{10}$$
$$f''_i = \sigma(W_{f''} \cdot \mathbf{F_{J,i}} + b_{f''}) \tag{11}$$
$$h'_i = f'_i \odot h_{\mathrm{parent}(i)} \tag{12}$$
$$r'_i = f''_i \odot \mathbf{F_{R,i}} \tag{13}$$

Two additional forget gates, $f'_i$ and $f''_i$, are computed from the joint feature. They control the parent joint’s hidden state and the relational feature between the two joints, respectively, resulting in the modified hidden state $h'_i$ and the modified relational feature $r'_i$. These modified states, together with the joint feature treated as the input, are then used in the standard LSTM architecture, where they are weighted and passed through a non-linearity to compute the four gates: input, candidate cell state, forget, and output.
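The gating in Eqs. (10)-(13) can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function and variable names are hypothetical, the weights are random toy values, and the feature size is reduced for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extra_forget_gates(F_J_i, F_R_i, h_parent, W_f1, b_f1, W_f2, b_f2):
    """Apply Eqs. (10)-(13) for a single joint i."""
    f1 = sigmoid(W_f1 @ F_J_i + b_f1)  # Eq. (10): gate for the parent's hidden state
    f2 = sigmoid(W_f2 @ F_J_i + b_f2)  # Eq. (11): gate for the relational feature
    h_mod = f1 * h_parent              # Eq. (12): element-wise product
    r_mod = f2 * F_R_i                 # Eq. (13): element-wise product
    return h_mod, r_mod

rng = np.random.default_rng(0)
d = 8  # toy feature size, chosen only for this example
F_J_i, F_R_i, h_parent = rng.normal(size=(3, d))
W_f1, W_f2 = rng.normal(size=(2, d, d))
b_f1, b_f2 = np.zeros(d), np.zeros(d)

h_mod, r_mod = extra_forget_gates(F_J_i, F_R_i, h_parent, W_f1, b_f1, W_f2, b_f2)
```

Because both gates are sigmoid outputs in (0, 1), each element of the parent's hidden state and of the relational feature can only be attenuated, never amplified, before entering the LSTM.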

For the second layer of the Propagation Network, as there is only a hidden state from the previous layer without relational or joint feature distinction, the hidden state from the previous layer is used for forgetting the parent joint’s hidden state in the current layer.

4 Evaluation

4.1 Experiment Setup

4.1.1 Datasets

Overview. We used two datasets: UnrealEgo [2] and EgoCap [17] for the 3D pose estimation in the stereo egocentric camera setup. We conducted the within-dataset evaluation using each dataset’s train and test set split since the egocentric datasets have significantly different setups and resulting views.

UnrealEgo. UnrealEgo [2] is a synthetic dataset containing 450k frames with 17 characters. The dataset covers a variety of environments and motions that are challenging to capture in a real-world setup. There are a total of 16 joints to estimate. The dataset defines the target local 3D pose in a pelvis-relative coordinate system, as opposed to the camera coordinate system used in most datasets, and also has a head pose to estimate. The pelvis and head do not have corresponding heatmaps and features. We therefore added a learnable matrix for linear projection that takes all the final features $\mathbf{F_J}$ and $\mathbf{F_P}$ to estimate the offset for all joints and the head pose. We found that this simple change effectively deals with different pose definitions.

EgoCap. The EgoCap [17] dataset is captured with egocentric cameras attached to the end of a stick mounted on a helmet. It comprises 35k frames for training from six subjects and 1k frames for testing from one subject, with 3D pose annotations. Evaluation on this dataset showcases applicability to real-world textured images. There are a total of 17 joints to estimate.

4.1.2 Baselines

We compare against three baseline stereo egocentric pose estimation methods: EgoGlass [32], UnrealEgo [2], and Ego3DPose [8]. We use the official UnrealEgo [2] and Ego3DPose [8] implementations. The EgoGlass [32] implementation is taken from the latter, as no official source code is provided. For UnrealEgo [2] and Ego3DPose [8], we changed the embedding and pose decoder dimensions, which gives higher estimation accuracy than their original setups. The change does not affect EgoGlass [32], possibly due to its joint training of the heatmap and pose estimators.

4.1.3 Metrics

We use the MPJPE and PA-MPJPE metrics. MPJPE is the mean per-joint position error, measured as the 3D Euclidean distance. PA-MPJPE applies Procrustes analysis before computing the MPJPE to calculate a transform-invariant positional error.
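A minimal NumPy sketch of the two metrics follows. It assumes that a pose is stored as a (J, 3) array and that the Procrustes step is a similarity alignment (rotation, uniform scale, and translation), which is the common convention but is not spelled out in the text. The function names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint Euclidean distance for poses of shape (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so that R is a proper rotation.
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction."""
    return mpjpe(procrustes_align(pred, gt), gt)

# Toy check: a prediction that is a rotated, scaled, shifted copy of the target.
rng = np.random.default_rng(0)
gt = rng.normal(size=(16, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 1.2 * gt @ Rz + np.array([5.0, -2.0, 1.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

In this toy case the MPJPE is large because of the global offset, while the PA-MPJPE is numerically zero, which is the sense in which PA-MPJPE is transform-invariant.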

4.2 Overall Performance

Figure 6: Qualitative comparison of EgoTAP with state-of-the-art stereo egocentric pose estimation methods. The blue is the ground truth, and the red is the estimated pose.

4.2.1 Qualitative Results

Fig. 6 presents a qualitative comparison between our method and previous approaches on the UnrealEgo and EgoCap datasets. A more detailed qualitative comparison is available in the supplementary video. Our method demonstrates a significant improvement over baseline methods.

4.2.2 Evaluation on UnrealEgo

The second column of Table 1 presents the quantitative evaluation results on UnrealEgo [2] using MPJPE and PA-MPJPE metrics. Our method demonstrates superior performance compared to state-of-the-art methods, achieving a 23.9% reduction in MPJPE and a 17.7% decrease in PA-MPJPE. These improvements extend across all 31 activity categories detailed in the supplementary material, covering a range of movements from common actions like sitting and standing to less frequent crawling and crouching and more complex motion categories, including sports.
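The reported reductions can be reproduced from the UnrealEgo column of Table 1. The sketch below assumes the comparison is against Ego3DPose [8], the strongest prior result in that column; the helper name is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Table 1, UnrealEgo column: Ego3DPose [8] vs. Ours.
mpjpe_reduction = relative_reduction(53.99, 41.06)
pa_mpjpe_reduction = relative_reduction(43.02, 35.39)
print(f"{mpjpe_reduction:.1f}% {pa_mpjpe_reduction:.1f}%")  # prints "23.9% 17.7%"
```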

Noteworthy improvements are observed across various categories, with the most substantial enhancement in the “Crouching-Forward” category, boasting a 31.3% reduction in MPJPE. Conversely, the smallest improvement is noted in the “Crawling” activity, with an 8.8% decrease in MPJPE. It’s important to acknowledge that while our method relies on visual cues, the effectiveness varies based on the visibility of body parts. For instance, in activities like “Crouching-Forward,” where many body parts are partially visible, our method excels in improving accuracy. On the other hand, in activities like “Crawling,” where visible body features are significantly lacking, the challenge of enhancement is more pronounced.

| Method | UnrealEgo [2] | EgoCap [17] |
| --- | --- | --- |
| EgoGlass [32] | 81.55 (61.56) | 67.90 (-) |
| UnrealEgo [2] | 63.53 (47.76) | 70.77 (52.91) |
| Ego3DPose [8] | 53.99 (43.02) | 69.45 (49.98) |
| Ours | **41.06** (**35.39**) | **55.38** (**45.24**) |

Table 1: Evaluation results of state-of-the-art methods and ours on two datasets. The metric is MPJPE, and in the bracket is PA-MPJPE. The bold text indicates the best results.

4.2.3 Evaluation on EgoCap

The third column of Table 1 presents the quantitative results on the EgoCap dataset. Our method significantly outperforms prior methods, surpassing EgoGlass [32] by 22.6% in MPJPE and Ego3DPose [8] by 9.4% in PA-MPJPE. For EgoGlass [32], we report the MPJPE value from their paper, as they do not provide official code or network details, and the available replication [8] did not match the reported performance.

The relatively smaller improvement in PA-MPJPE, which discards the effect of the root’s transform, could be attributed to prior methods estimating the full body pose as a whole. Consequently, they might capture the relative pose between joints while the estimation is globally biased. Nevertheless, when integrating the output camera coordinate system pose with the 6-DoF pose of VR and AR devices, precise pose estimation in the correct coordinate frame is crucial for accurate body tracking in the global coordinate system.

We observed that the estimated limb heatmaps in the EgoCap dataset exhibit lower accuracy than those in the UnrealEgo dataset, as illustrated in the supplementary material. This discrepancy could be attributed to the limited volume and the small number of subjects in the EgoCap dataset. Despite these challenges, our Attention-Propagation network effectively lifts the 3D pose from heatmaps. However, Ego3DPose [8], which utilizes limb heatmaps, did not perform well. This could be attributed to their explicit inference of orientation for each limb. The final decoder, which takes independent information as an output orientation, struggles with inaccurate information.

4.3 Ablation Study

We performed ablation studies to showcase the effectiveness of each network component, as summarized in Table 2.

| Method | UnrealEgo [2] | EgoCap [17] |
| --- | --- | --- |
| Heatmap Encoder | | |
| CNN | 63.53 (47.76) | 70.77 (52.91) |
| Channel ViT | 61.62 (47.05) | 83.39 (56.29) |
| Grid ViT | 49.03 (41.03) | 63.97 (53.17) |
| Propagation Network | | |
| Grid ViT + RF | 48.12 (40.79) | 63.09 (52.60) |
| Grid ViT + LSTM | 49.43 (41.31) | 60.16 (49.18) |
| Grid ViT + LSTM RF Alter | 44.97 (38.99) | 62.60 (50.78) |
| Grid ViT + LSTM RF Concat | 44.77 (38.91) | 58.35 (47.06) |
| Ours (Grid ViT + PU) | **41.06** (**35.39**) | **55.38** (**45.24**) |

Table 2: Ablation results of our method for two main components on two datasets. The metric is MPJPE, and in the bracket is PA-MPJPE. The bold text for metrics indicates the best results.

4.3.1 Grid ViT Heatmap Encoder

Pose Estimation: We assess the impact of the Grid ViT Heatmap Encoder. “CNN” presents the results from UnrealEgo [2], utilizing a CNN. “Channel ViT” showcases the outcomes with a typical encoder with ViT, where heatmaps are concatenated along the channel axis before being split into patches, resulting in feature embeddings that do not align with the heatmaps. Simply adopting transformers [22] yields minimal improvement, i.e., a 3% reduction in MPJPE, compared to the CNN-based lifting for the UnrealEgo [2] baseline and dataset. However, this approach significantly degrades performance on EgoCap [17]. This observation underscores the importance of addressing the correspondence between feature embedding and heatmaps in the pose estimation process.

| Method | Heatmap Reconstruction Error ($10^{-4}$/Pixel) |
| --- | --- |
| Zeros | 5.45 |
| CNN Encoder | 4.84 |
| Grid ViT Heatmap Encoder | 1.68 |

Table 3: Reconstruction mean square error of the heatmaps from the features encoded with a different frozen encoder architecture, experimented in the UnrealEgo [2] dataset.

Heatmap Reconstruction: We conducted experiments to evaluate the heatmap encoder’s efficiency in encoding heatmap features. A simple decoder is appended to our encoder and baseline encoders to achieve this. The decoder is trained to reconstruct the estimated heatmaps from the feature embedding. Table 3 presents the reconstruction error of the heatmap in the test set. The “Zeros” row provides the error for a zero-only output for comparison. The results demonstrate that the Grid ViT Heatmap Encoder effectively extracts heatmap features, evidenced by the reconstructed fine details of the heatmap in Fig. 3. In contrast, the heatmaps were not recoverable from features encoded by CNN, highlighting its inefficiency.
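To make the role of the “Zeros” reference concrete, the sketch below computes a per-pixel mean square error for an all-zero output and for an imperfect reconstruction of a toy heatmap. The Gaussian heatmap and the half-amplitude reconstruction are hypothetical stand-ins, not data from the experiments.

```python
import numpy as np

def per_pixel_mse(reconstructed, target):
    """Mean square error averaged over all heatmap pixels."""
    return float(np.mean((reconstructed - target) ** 2))

# Hypothetical 64x64 heatmap with a single Gaussian peak.
ys, xs = np.mgrid[0:64, 0:64]
target = np.exp(-((xs - 40) ** 2 + (ys - 20) ** 2) / (2 * 2.0 ** 2))

zeros_error = per_pixel_mse(np.zeros_like(target), target)  # "Zeros" reference
half_error = per_pixel_mse(0.5 * target, target)            # imperfect decoder stand-in
print(zeros_error, half_error)
```

Since heatmaps are mostly empty, the zero output already has a small absolute error, so a decoder is only informative to the extent that its error falls clearly below this reference.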

4.3.2 Propagation Network

Pose Estimation: We investigate if including relational features alone can significantly enhance accuracy through “+ RF” when incorporated with our Grid ViT encoder. The relational features are concatenated to the joint features for the final projection layer without the involvement of a propagation network. This approach demonstrates marginal impact or even degrades the estimation accuracy. Additionally, we analyze the effect of the Propagation Network with LSTM [7]. In the case of “+ LSTM,” only joint features are utilized in the propagation, yielding a marginal effect.

Additional experiments investigate the impact of the Propagation Network without PU, denoted as “+ LSTM RF Alter” and “+ LSTM RF Concat.” Relational and joint features are alternately taken in the former, and the propagation feature is output in the joint feature step. The latter takes both as a concatenated vector. Both methods demonstrate improvements, with the latter achieving an 8.7% and 8.8% reduction in MPJPE for two datasets compared to the Grid ViT Heatmap Encoder-only approach. The final model, incorporating PU, maximizes the potential of the Propagation Network, showcasing a 16.3% and 13.4% improvement in MPJPE for the two datasets. This highlights the significance of balancing the role of predictive estimation using parent joints and direct estimation using self-joint features.

Figure 7: Hexagonal-grid density plot of the Propagation Potential and the Propagation Effect (mm) in our evaluation datasets: (a) UnrealEgo [2], (b) UnrealEgo (Camera-relative), (c) EgoCap [17]. The dark line shows linear regression results.

Propagation Potential and Effect: The Propagation Network leverages more evident parent joint features to improve the child joint’s pose estimation. The hexagonal-grid density plot in Fig. 7 illustrates its impact quantitatively. The $x$-axis represents the Propagation Potential ($\mathbf{PP}$). $\mathbf{PP}$ approximates the upper bound of the improvement attainable from the parent’s feature, defined as the difference between the child and parent joints’ pose estimation errors without propagation. On the $y$-axis, the Propagation Effect ($\mathbf{PE}$) is the improvement of the child joint’s pose error by the Propagation Network. Using $\Delta$ to denote the pose estimation error, subscripts to denote joints, and superscripts to denote the model ($\mathbf{NP}$ without propagation, $\mathbf{P}$ with propagation), we define these metrics as follows.

$$\mathbf{PP} = \Delta_{\text{child}}^{\mathbf{NP}} - \Delta_{\text{parent}}^{\mathbf{NP}} \tag{14}$$
$$\mathbf{PE} = \Delta_{\text{child}}^{\mathbf{NP}} - \Delta_{\text{child}}^{\mathbf{P}} \tag{15}$$

For all datasets, linear regression reveals a positive relationship between $\mathbf{PP}$ and $\mathbf{PE}$ with a p-value for the null hypothesis below $10^{-3}$, indicating that the Propagation Network is more effective when the parent joint has a more precise estimation, aligning with expectations. The average $\mathbf{PP}$ and $\mathbf{PE}$ were $16.97$ and $8.50$ for the UnrealEgo dataset [2] and $4.32$ and $9.39$ for the EgoCap [17] dataset. The UnrealEgo [2] dataset exhibits higher potential because its cameras are closer to the head, unlike the cameras around 20cm away from the head in the EgoCap dataset [17].

The effect is more pronounced for the UnrealEgo [2] dataset when the 3D pose is estimated in camera-relative coordinates. This eliminates the global offset (pelvis pose) bias from per-joint improvement. Fig. 7 (b) exhibits trends where $\mathbf{PE}$ is similar to $\mathbf{PP}$ or close to zero. When $\mathbf{PE}$ is similar to $\mathbf{PP}$, the child joint’s pose error is improved close to the parent joint’s error, meaning the effect of the Propagation Network is near the upper bound ($\mathbf{PP}$). The propagation cannot improve the child joint’s pose error in some cases, possibly due to the occlusion of limbs. Such cases exhibit near-zero $\mathbf{PE}$. $66.07$% of $\mathbf{PE}$ and $75.62$% of $\mathbf{PP}$ in the samples are positive, and $54.16$% of samples lie in the first quadrant. The average positive $\mathbf{PE}$ is $10.75$, while the average negative $\mathbf{PE}$ is only $-0.51$, demonstrating that many joints significantly benefit from the propagation.
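The quantities in Eqs. (14) and (15) and the summary statistics discussed above can be computed as sketched below. The per-joint errors are made-up toy values in mm, the function names are illustrative, and `np.polyfit` is used only to obtain the regression line; it does not produce the p-value reported in the text.

```python
import numpy as np

def propagation_metrics(err_child_np, err_parent_np, err_child_p):
    """Return (PP, PE) per sample following Eqs. (14) and (15)."""
    pp = err_child_np - err_parent_np  # Eq. (14)
    pe = err_child_np - err_child_p    # Eq. (15)
    return pp, pe

def summarize(pp, pe):
    """Regression line of PE on PP plus sign statistics."""
    slope, intercept = np.polyfit(pp, pe, deg=1)
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "pp_positive": float(np.mean(pp > 0)),
        "pe_positive": float(np.mean(pe > 0)),
        "first_quadrant": float(np.mean((pp > 0) & (pe > 0))),
    }

# Toy per-joint errors in mm (hypothetical values).
err_child_np = np.array([60.0, 45.0, 80.0, 30.0])   # child, without propagation
err_parent_np = np.array([40.0, 50.0, 50.0, 25.0])  # parent, without propagation
err_child_p = np.array([45.0, 46.0, 55.0, 28.0])    # child, with propagation

pp, pe = propagation_metrics(err_child_np, err_parent_np, err_child_p)
stats = summarize(pp, pe)
print(stats)
```

A sample in the first quadrant is one where the parent was estimated better than the child and propagation then reduced the child's error.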

5 Conclusion

In this study, we introduce a novel heatmap-to-3D lifting method tailored for the stereo egocentric setup, employing a transformer for efficient feature embedding and an attention-driven Propagation Network focused on evident features. We demonstrate effective heatmap feature extraction through the Grid ViT Heatmap Encoder, employing patch-wise communication with self-attention to preserve correspondence between the heatmap and the feature embedding. The Propagation Network utilizes visual cues from the proximate parent joint, leveraging joint relational information to predictively estimate less visible child joint poses. Our experiments highlight significant advancements over state-of-the-art stereo egocentric pose estimation methods, underscoring the efficacy of our proposed approach.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C3008495). This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218601).


Supplementary Material

Appendix A Overview

The supplementary material contains the following:

  • Dataset Processing

  • Implementation

  • Training

  • Experiment

  • Example Figure

  • Limitations and Future Works

Appendix B Dataset Processing

We explain the details of the train and test dataset we used in this section. Our method requires a 2D and 3D pose annotation and stereo input images. The 2D annotation is necessary for generating the heatmaps.

B.1 UnrealEgo

We utilize the full dataset, including metadata files and preprocessed pickles. The public Ego3DPose [8] code loads metadata and pickles. Their code adds 2D and 3D pose data in the camera coordinate system and their limb heatmap representation in the pickle files. Our method uses these final pickles.

B.2 EgoCap

We used the publicly available 2D pose annotations for the train set. Additionally, we obtained the full ground truth 3D poses for the train set of the EgoCap [17] dataset from the authors. In the fisheye views of the dataset, images are projected only onto a circular area due to strong distortion. Thus, the original images contain areas that do not have real views. Following Kang et al. [8], we cropped each image horizontally into a square area centered at the x-axis focal center ($\mathbf{fx}$) provided in the dataset calibration data. We then resized the images to 256 × 256 to fit our model.

The dataset has a train set and 2D and 3D validation sets. The 3D validation set contains ground truth 3D poses and is used for testing. The 2D validation set provides the 2D annotations for the images in the 3D validation set, which come from a subject labeled 7. The 3D pose is converted from mm to cm to scale the pose loss in accordance with the UnrealEgo dataset.
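The horizontal crop can be sketched as follows. This is an assumption-laden illustration: the function name, the clamping of the window to the image bounds, and the image size in the example are not taken from the dataset or the released code, and the subsequent resize is left to any image library.

```python
import numpy as np

def crop_square_at_fx(image, fx):
    """Crop an (H, W, C) image horizontally to H x H, centered at column fx.

    Assumes W >= H. The window is shifted if needed to stay inside the image.
    """
    h, w = image.shape[:2]
    left = int(round(fx - h / 2))
    left = max(0, min(left, w - h))  # assumption: clamp rather than pad
    return image[:, left:left + h]

# Hypothetical image size and focal center, chosen only for this example.
image = np.zeros((480, 640, 3), dtype=np.uint8)
crop = crop_square_at_fx(image, fx=330.0)
print(crop.shape)
```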

Appendix C Implementation

Figure 8: The ViT encoder architecture.

C.1 Grid ViT Heatmap Encoder

The $64 \times 64$ heatmaps are tiled into one image with resolution $384 \times 384$. The image comprises $36$ areas arranged as a $6 \times 6$ grid. The number of joint heatmaps is $30$ for the UnrealEgo [2] dataset and $34$ for the EgoCap [17] dataset. The heatmaps fill the grid in order. The areas that do not correspond to any heatmap are masked in the ViT encoder module and do not affect the output.
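The tiling can be illustrated with the NumPy sketch below. It assumes that "in order" means row-major placement and returns a boolean grid marking which areas hold a heatmap; how that grid is turned into an attention mask inside the ViT module is not shown. The function name is illustrative.

```python
import numpy as np

def tile_heatmaps(heatmaps, grid=6):
    """Place (N, H, W) heatmaps row by row on a grid x grid canvas."""
    n, h, w = heatmaps.shape
    canvas = np.zeros((grid * h, grid * w), dtype=heatmaps.dtype)
    used = np.zeros((grid, grid), dtype=bool)  # True where an area holds a heatmap
    for k in range(n):
        row, col = divmod(k, grid)  # assumption: row-major filling
        canvas[row * h:(row + 1) * h, col * w:(col + 1) * w] = heatmaps[k]
        used[row, col] = True
    return canvas, used

# 30 heatmaps as in the UnrealEgo setting, filled with random toy values.
heatmaps = np.random.default_rng(0).random((30, 64, 64))
canvas, used = tile_heatmaps(heatmaps)
print(canvas.shape, int(used.sum()))
```

With 30 heatmaps, the last row of the grid stays empty, and those six areas are the ones that would be masked.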

We adopt the ViT encoder [3] architecture. Our implementation uses the ViTModel class of the public Transformers [26] library for PyTorch [16]. We removed the $[CLS]$ token since we are not using the module for a classification task. Doing so improves pose estimation accuracy empirically. The module follows the standard ViT [3] encoder architecture shown in Fig. 8, which takes the input embedding $z$ and outputs the feature embedding $z'$.

The ViT encoder takes embeddings of size $1024$ for each of the $32N_J$ patches, $z = [z_1, z_2, \ldots, z_{32N_J}]$. The multi-head attention layer has $8$ heads. The intermediate layer size of the MLP is $4096$. The Grid ViT Heatmap Encoder uses three ViT encoder layers. For each heatmap, it outputs an embedding vector of total size $16384$ from $16$ patches. The embedding vector is then compressed with an MLP, denoted $E_{\mathbf{K}}$ in the paper. The MLP has ReLU [1] non-linearity for the intermediate layers. The hidden sizes of the MLP’s first two layers are $2048$ and $512$, and the last layer outputs a final embedding of size $128$.
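The sizes above fit together as shown in this small bookkeeping sketch. The patch size of 16 × 16 is an assumption inferred from the 16 patches per $64 \times 64$ heatmap, and the constant and function names are illustrative.

```python
HEATMAP_SIZE = 64
PATCH_SIZE = 16   # assumption: 16x16 patches give 4x4 = 16 patches per heatmap
EMBED_DIM = 1024  # embedding size per patch

def patches_per_heatmap():
    return (HEATMAP_SIZE // PATCH_SIZE) ** 2

def encoder_token_count(num_heatmaps):
    """Number of unmasked patch embeddings fed to the ViT encoder."""
    return patches_per_heatmap() * num_heatmaps

def per_heatmap_feature_size():
    """Size of the concatenated patch embeddings of one heatmap."""
    return patches_per_heatmap() * EMBED_DIM

# Layer sizes of the compressing MLP, from the flattened input to the output.
mlp_sizes = [per_heatmap_feature_size(), 2048, 512, 128]

print(patches_per_heatmap(), per_heatmap_feature_size())
print(encoder_token_count(30), encoder_token_count(34))
```

Under this reading, 30 heatmaps yield 480 patches and 34 heatmaps yield 544, which matches $32N_J$ when the stereo pair contributes $2N_J$ heatmaps.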

C.2 Propagation Network

In an extension of the typical LSTM [7], the Propagation Unit’s relational features, joint features, hidden and cell states, and gate outputs all have the same size. We chose $256$ for the size.

C.2.1 Limb Heatmap Encoder

The limb heatmap encoder $E_R$ extracts relational features. It has the same structure as the final MLP layers of the Grid ViT Heatmap Encoder, differing only in the input size. The two-channel limb heatmap [8] input has size $2 \times 64 \times 64$ and is flattened before being passed to the encoder. The encoder consists of three fully connected layers: the first two have output sizes $2048$ and $512$ with ReLU [1] activation, and the final layer outputs an embedding of size $128$.

C.2.2 Second Layer of the PU

The second layer of the PU does not take distinct relational and joint features. It takes the parent joint’s second-layer cell and hidden states together with the joint’s first-layer hidden state. Since hidden states from different layers are used in this section, we denote the $n$-th layer hidden state of the $i$-th joint as $h_{n,i}$. The additional forget gate in the second layer, $g_i$, controls the hidden state of the parent joint’s second PU layer, resulting in the modified hidden state $h'_{2,i}$. This is formulated as follows:

$$g_i = \sigma(W_g \cdot h_{1,i} + b_g) \tag{16}$$
$$h'_{2,i} = g_i \odot h_{2,parent(i)} \tag{17}$$

The modified parent hidden state and the joint’s first layer hidden state are input for the inner LSTM [7].

C.2.3 Internal LSTM of the PU

We explain the formulation of the LSTM [7] inside the PU in more detail here.

Formulation of typical LSTM. The LSTM is formulated as follows, where $h_{i-1}$ denotes the hidden state of the previous step, $c_{i-1}$ denotes the cell state of the previous step, and $x_i$ denotes the input. Here, $W$ and $b$ denote weights and biases for each gate. The symbol $\odot$ represents element-wise multiplication, and the $+$ sign represents element-wise addition. $\tanh$ and $\sigma$ denote the hyperbolic tangent and sigmoid activation.

$$f_i = \sigma(W_f \cdot [h_{i-1}, x_i] + b_f) \tag{18}$$
$$i_i = \sigma(W_i \cdot [h_{i-1}, x_i] + b_i) \tag{19}$$
$$o_i = \sigma(W_o \cdot [h_{i-1}, x_i] + b_o) \tag{20}$$
$$\tilde{c}_i = \tanh(W_c \cdot [h_{i-1}, x_i] + b_c) \tag{21}$$
$$c_i = f_i \odot c_{i-1} + i_i \odot \tilde{c}_i \tag{22}$$
$$h_i = o_i \odot \tanh(c_i) \tag{23}$$

$f_i$, $i_i$, and $o_i$ are the forget, input, and output gates. $\tilde{c}_i$ denotes the candidate cell value. $h_i$ and $c_i$ are the final hidden and cell states for step $i$.
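For reference, one step of Eqs. (18)-(23) can be written in NumPy as below. The dictionary layout of the parameters and the function names are illustrative choices, and the zero-valued parameters in the example are used only to make the result easy to check by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W and b map gate names 'f', 'i', 'o', 'c' to parameters."""
    z = np.concatenate([h_prev, x])         # [h_{i-1}, x_i]
    f = sigmoid(W["f"] @ z + b["f"])        # Eq. (18): forget gate
    i = sigmoid(W["i"] @ z + b["i"])        # Eq. (19): input gate
    o = sigmoid(W["o"] @ z + b["o"])        # Eq. (20): output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # Eq. (21): candidate cell value
    c = f * c_prev + i * c_tilde            # Eq. (22)
    h = o * np.tanh(c)                      # Eq. (23)
    return h, c

d_h, d_x = 4, 3  # toy sizes
W = {k: np.zeros((d_h, d_h + d_x)) for k in "fioc"}
b = {k: np.zeros(d_h) for k in "fioc"}
h, c = lstm_step(np.ones(d_x), np.zeros(d_h), np.ones(d_h), W, b)
print(h, c)
```

With all parameters at zero, every gate equals 0.5 and the candidate is 0, so the new cell state is half of the previous one.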

Formulation of internal LSTM. Unlike the LSTM taking the cell and hidden state, the internal LSTM of the first PU layer takes three states in addition to the input joint features. The three states are the modified parent’s hidden state $h'_i$, the modified relational feature of the joint $r'_i$, and the cell state of the parent $c_{parent(i)}$. The input is the joint feature $\mathbf{F_{J,i}}$.

This section explains the first and second layers together; thus, we denote the $n$-th layer of the $i$-th joint with an $n,i$ subscript, as in $h_{n,i}$ for the hidden state. In the computation of the forget, input, and output gates and the candidate cell value, a concatenated vector of the modified parent’s hidden state, the modified relational feature, and the joint feature, $[h'_{1,i}, r'_i, \mathbf{F_{J,i}}]$, replaces $[h_{i-1}, x_i]$.

$$f_{1,i} = \sigma(W_{1,f} \cdot [h'_{1,i}, r'_i, \mathbf{F_{J,i}}] + b_{1,f}) \tag{24}$$
$$i_{1,i} = \sigma(W_{1,i} \cdot [h'_{1,i}, r'_i, \mathbf{F_{J,i}}] + b_{1,i}) \tag{25}$$
$$o_{1,i} = \sigma(W_{1,o} \cdot [h'_{1,i}, r'_i, \mathbf{F_{J,i}}] + b_{1,o}) \tag{26}$$
$$\tilde{c}_{1,i} = \tanh(W_{1,c} \cdot [h'_{1,i}, r'_i, \mathbf{F_{J,i}}] + b_{1,c}) \tag{27}$$

For the second layer, the modified second-layer parent hidden state $h'_{2,i}$ from Sec. C.2.2 takes the place of $h_{i-1}$. The previous layer’s hidden state $h_{1,i}$ replaces the input $x_i$, analogous to the standard multi-layered LSTM.

$$f_{2,i}=\sigma(W_{2,f}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,f})\tag{28}$$
$$i_{2,i}=\sigma(W_{2,i}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,i})\tag{29}$$
$$o_{2,i}=\sigma(W_{2,o}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,o})\tag{30}$$
$$\tilde{c}_{2,i}=\tanh(W_{2,c}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,c})\tag{31}$$

The Propagation Unit (PU) takes features from the parent joint, not the previous index. In the computation of the final cell and hidden state, both layers of the PU take $c_{n,parent(i)}$ instead of $c_{i-1}$ in the formula. The hidden state is computed in the same way as in the standard LSTM.

$$c_{n,i}=f_{n,i}\odot c_{n,parent(i)}+i_{n,i}\odot\tilde{c}_{n,i}\tag{32}$$
$$h_{n,i}=o_{n,i}\odot\tanh(c_{n,i})\tag{33}$$
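The state update in Eqs. (32) and (33) can be illustrated with a minimal pure-Python sketch. It is not the released implementation: `gate_fn` is a hypothetical stand-in for the gate equations, and the zero initial state for the root joint and the parent-before-child joint ordering are assumptions made for this example.

```python
import math


def pu_state_update(f, i, o, c_tilde, c_parent):
    """Element-wise cell and hidden state update, Eqs. (32)-(33)."""
    c = [fk * cp + ik * ck for fk, ik, ck, cp in zip(f, i, c_tilde, c_parent)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c, h


def propagate(parents, gate_fn, state_dim):
    """Run the update over a skeleton tree.

    parents[j] is the parent index of joint j, or -1 for the root.
    Assumption: joints are ordered so that parents[j] < j.
    gate_fn(j, h_parent) -> (f, i, o, c_tilde) is a hypothetical callback
    standing in for the gate equations of one PU layer.
    """
    n = len(parents)
    zeros = [0.0] * state_dim  # assumed initial state for the root joint
    c = [None] * n
    h = [None] * n
    for j in range(n):
        p = parents[j]
        c_parent = c[p] if p >= 0 else zeros
        h_parent = h[p] if p >= 0 else zeros
        f, i, o, c_tilde = gate_fn(j, h_parent)
        # The cell state comes from the parent joint, not from joint j - 1.
        c[j], h[j] = pu_state_update(f, i, o, c_tilde, c_parent)
    return c, h
```

With `parents = [-1, 0, 0, 1]`, joints 1 and 2 both read the state of joint 0, which is the difference from a sequential LSTM, where joint 2 would read the state of joint 1.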

Appendix D Training

D.1 Hardware Setup

We trained and tested our method on a server with an NVIDIA RTX A6000 GPU and an AMD EPYC 7313 16-core CPU.

D.2 Heatmap Estimator

The heatmap estimator is trained using the UnrealEgo [2] code and scripts for the UnrealEgo dataset. The default configuration uses the Adam [9] optimizer with a learning rate of $10^{-3}$. It trains the network for 10 epochs with batch size 16, applying linear decay over the last 5 epochs. For the EgoCap dataset, we trained the heatmap estimators for 30 epochs with the same setup, scaling the linear decay proportionally to the last 15 epochs.
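As an illustration of the schedule described above, the following sketch keeps the learning rate constant and then decays it linearly over the final epochs. The function name and the exact end point of the decay (the rate would reach zero one epoch after training ends) are assumptions for this example, not details confirmed by the UnrealEgo scripts.

```python
def linear_decay_lr(epoch, base_lr=1e-3, total_epochs=10, decay_epochs=5):
    """Learning rate for a 0-indexed epoch: constant, then linear decay.

    Assumption: the decay is spread over decay_epochs + 1 steps so that
    the last training epoch still uses a non-zero learning rate.
    """
    constant_epochs = total_epochs - decay_epochs
    if epoch < constant_epochs:
        return base_lr
    remaining = total_epochs - epoch
    return base_lr * remaining / (decay_epochs + 1)


# UnrealEgo setting: 10 epochs, decay over the last 5.
unrealego_schedule = [linear_decay_lr(e) for e in range(10)]
# EgoCap setting: 30 epochs, decay over the last 15.
egocap_schedule = [linear_decay_lr(e, total_epochs=30, decay_epochs=15) for e in range(30)]
```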

When hasty convergence, where all heatmap values converge to 0, is detected, the training is automatically restarted, following the protocol of Kang et al. [8].

D.3 EgoTAP Network

The network is trained with the AdamW [14] optimizer, with the pretrained heatmap estimator weights frozen. A learning rate of $10^{-3}$ is used for 16 epochs with a cosine annealing scheduler and batch size 32. Early epochs use a linear warmup: one epoch for the UnrealEgo [2] dataset and two epochs for the EgoCap [17] dataset.
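The shape of such a schedule can be sketched as follows. The details here are assumptions for illustration only: the warmup is taken to start from zero, the cosine phase is taken to cover the epochs after warmup and to end at zero, and progress is measured in (possibly fractional) epochs.

```python
import math


def warmup_cosine_lr(progress, base_lr=1e-3, total_epochs=16, warmup_epochs=1):
    """Learning rate at a given training progress, measured in epochs.

    Assumptions: linear warmup from 0 to base_lr, then cosine annealing
    from base_lr to 0 over the remaining epochs.
    """
    if progress < warmup_epochs:
        return base_lr * progress / warmup_epochs
    t = (progress - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))


# One warmup epoch for UnrealEgo, two for EgoCap.
lr_unrealego_mid_warmup = warmup_cosine_lr(0.5, warmup_epochs=1)
lr_egocap_mid_warmup = warmup_cosine_lr(1.0, warmup_epochs=2)
```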

D.4 Loss

D.4.1 EgoTAP Network

EgoTAP has two loss terms: a pose error loss (i.e., the joints' average Euclidean distance) and a cosine-similarity loss [20] that focuses on estimating the correct 3D orientation of each limb.

The pose error loss is defined as follows. Given two 3D joint poses, the predicted pose $\mathbf{p}'_{i}$ and the ground truth pose $\mathbf{p}_{i}$, for $i=1,\ldots,J$, where $J$ is the total number of joints:

$$L_{p}=\frac{1}{J}\sum_{i=1}^{J}\|\mathbf{p}'_{i}-\mathbf{p}_{i}\|_{2}\tag{34}$$

The cosine similarity loss is then defined as follows. A limb pose vector for a particular joint is obtained by subtracting the pose of its parent joint from its own pose, i.e., $\mathbf{v}_{i}=\mathbf{p}_{i}-\mathbf{p}_{\text{parent}(i)}$ and $\mathbf{v}'_{i}=\mathbf{p}'_{i}-\mathbf{p}'_{\text{parent}(i)}$ for the ground truth and predicted poses, respectively. Given these vectors, the cosine similarity between two limb pose vectors is calculated using their inner product:

$$L_{c}=\frac{1}{J-1}\sum_{i=2}^{J}\frac{\mathbf{v}_{i}\cdot\mathbf{v}'_{i}}{\|\mathbf{v}_{i}\|_{2}\,\|\mathbf{v}'_{i}\|_{2}}\tag{35}$$

Note that the root joint is ignored since it does not have a parent joint.

The final loss is a weighted sum of the two losses, where we choose $w_{p}=0.1$ and $w_{c}=-0.01$. The cosine similarity loss weight has a negative sign because a higher cosine similarity indicates a more accurate pose.

$$L=w_{p}L_{p}+w_{c}L_{c}\tag{36}$$
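The two loss terms and their weighted sum, Eqs. (34) to (36), can be sketched in plain Python as below. This is an illustrative sketch rather than the released training code: the function names, the `parents` list encoding of the skeleton (with -1 marking the root), and the `eps` guard against zero-length limbs are assumptions added for this example.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def pose_loss(pred, gt):
    """Eq. (34): mean Euclidean distance between predicted and true joints."""
    dists = [_norm([a - b for a, b in zip(p, g)]) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def cosine_loss(pred, gt, parents, eps=1e-8):
    """Eq. (35): mean cosine similarity of limb vectors, skipping the root.

    eps is not part of Eq. (35); it only avoids division by zero.
    """
    total, count = 0.0, 0
    for j, par in enumerate(parents):
        if par < 0:  # the root joint has no parent, so it has no limb vector
            continue
        v_gt = [a - b for a, b in zip(gt[j], gt[par])]
        v_pred = [a - b for a, b in zip(pred[j], pred[par])]
        dot = sum(a * b for a, b in zip(v_gt, v_pred))
        total += dot / max(_norm(v_gt) * _norm(v_pred), eps)
        count += 1
    return total / count


def total_loss(pred, gt, parents, w_p=0.1, w_c=-0.01):
    """Eq. (36): weighted sum; w_c is negative so similarity is maximized."""
    return w_p * pose_loss(pred, gt) + w_c * cosine_loss(pred, gt, parents)
```

For a perfect prediction the pose term is 0 and every limb has cosine similarity 1, so the total loss reaches its minimum of $w_{c}=-0.01$ rather than 0.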

D.4.2 Heatmap Reconstruction Network for Ablation

We adopted the heatmap decoder proposed by Tome et al. [20] for heatmap reconstruction in the ablation study of the Grid ViT Heatmap Encoder. Trained with only the mean squared loss, the network suffers from early convergence to near-zero heatmaps. The problem is more severe than for the heatmap estimator network because the reconstruction network lacks a specialized architecture such as the U-Net [18], which aids heatmap generation through multi-resolution features.

Thus, an additional loss that matches the heatmap's minimum and maximum values is added whenever the mean squared loss exceeds a threshold, preventing the network in the ablation studies from converging to all-zero outputs. We set the threshold to a value found empirically to be sufficient to avoid this zero-only convergence. The loss term guides the network to output a peak in the heatmap, as joint heatmaps do.

The heatmap reconstruction objective is to minimize the mean squared loss between the predicted heatmap $H$ and the reconstructed heatmap $H'$. This is computed as follows:

$$L_{r}=\frac{1}{N}\sum_{i=1}^{N}(H_{i}-H^{\prime}_{i})^{2}\tag{37}$$

The min-max loss applies only if the mean squared loss is higher than the threshold $\theta$, which is set to $5.5\times10^{-4}$. The min-max loss is computed as follows:

$$L_{min}=\frac{1}{N}\sum_{i=1}^{N}|\min(H_{i})-\min(H^{\prime}_{i})|\tag{38}$$
$$L_{max}=\frac{1}{N}\sum_{i=1}^{N}|\max(H_{i})-\max(H^{\prime}_{i})|\tag{39}$$
$$L_{m}=\begin{cases}L_{min}+L_{max} & \text{if } L_{r}>\theta\\ 0 & \text{otherwise}\end{cases}\tag{40}$$

The weight for the reconstruction $w_{r}$ is set to $1$ and the weight for the min-max penalty $w_{m}$ is set to $10^{-3}$, resulting in the total loss:

$$L=w_{r}\cdot(L_{r}+w_{m}\cdot L_{m})\tag{41}$$
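The thresholded min-max penalty of Eqs. (37) to (41) can be sketched as below. The reduction is an assumption made for this example: the mean squared loss is averaged over all pixels of all heatmaps, while the min and max terms are averaged over heatmaps, each represented as a flat list of values. The function name is hypothetical.

```python
def reconstruction_loss(H, H_rec, theta=5.5e-4, w_r=1.0, w_m=1e-3):
    """Mean squared loss plus a min-max penalty that is active only above theta.

    H, H_rec: lists of heatmaps, each heatmap a flat list of pixel values.
    """
    n_pixels = sum(len(h) for h in H)
    l_r = sum(
        (a - b) ** 2 for h, h_rec in zip(H, H_rec) for a, b in zip(h, h_rec)
    ) / n_pixels

    l_m = 0.0
    if l_r > theta:  # Eq. (40): penalty applies only while reconstruction is poor
        n = len(H)
        l_min = sum(abs(min(h) - min(h_rec)) for h, h_rec in zip(H, H_rec)) / n
        l_max = sum(abs(max(h) - max(h_rec)) for h, h_rec in zip(H, H_rec)) / n
        l_m = l_min + l_max

    return w_r * (l_r + w_m * l_m)
```

An all-zero reconstruction of a heatmap with a peak of 1 incurs the full max-term penalty, which is the signal that pushes the network away from zero-only outputs; once the mean squared loss falls below $\theta$, the penalty vanishes.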

The loss term does not impact the final result once the network avoids the early convergence and stabilizes. The zero output convergence is still observed for the CNN encoder embedding, so the training was restarted until it did not converge to output only zeros. Such an additional loss term is necessary to get meaningful non-zero reconstruction from the output of the CNN heatmap encoder, showing the difficulty of heatmap information recovery from its embeddings.

Appendix E Experiment

MPJPE (PA-MPJPE)

Method | Jumping | Falling Down | Exercising | Pulling | Singing | Rolling | Crawling | Laying
EgoGlass [32] | 78.93(63.85) | 123.80(92.71) | 94.21(69.85) | 79.41(55.41) | 68.16(50.25) | 100.53(87.26) | 173.69(111.51) | 106.41(86.42)
UnrealEgo [2] | 61.66(49.46) | 108.73(78.02) | 77.14(58.87) | 57.01(43.51) | 52.61(37.58) | 73.38(64.56) | 162.90(102.15) | 82.60(67.47)
Ego3DPose [8] | 52.12(43.29) | 86.08(71.72) | 67.52(56.39) | 48.92(37.02) | 43.86(34.54) | 74.24(64.81) | 138.47(92.93) | 78.13(67.23)
Ours | 43.05(37.31) | 75.77(63.48) | 52.76(46.21) | 34.45(26.82) | 33.96(29.10) | 52.24(47.58) | 126.23(91.46) | 66.38(59.56)

Method | Sitting on the Ground | Crouching | Crouching and Turning | Crouching to Standing | Crouching-Forward | Crouching-Backward | Crouching-Sideways | Standing-Whole Body
EgoGlass | 204.35(147.99) | 121.76(100.12) | 130.24(104.28) | 84.31(59.67) | 82.84(66.06) | 90.36(76.83) | 101.37(78.81) | 69.78(52.59)
UnrealEgo | 190.26(144.27) | 96.69(79.29) | 116.59(99.94) | 66.20(44.92) | 56.10(46.62) | 62.54(46.21) | 72.35(55.87) | 52.91(39.14)
Ego3DPose | 143.44(122.93) | 82.01(67.94) | 104.24(83.98) | 58.74(41.34) | 48.60(38.81) | 47.36(36.05) | 57.85(48.83) | 45.69(35.47)
Ours | 121.24(103.27) | 67.27(60.07) | 89.97(70.78) | 41.91(28.68) | 33.47(29.52) | 34.08(28.30) | 40.21(38.51) | 32.77(28.27)

Method | Standing-Upper Body | Standing-Turning | Standing to Crouching | Standing-Forward | Standing-Backward | Standing-Sideways | Dancing | Boxing
EgoGlass | 69.24(49.36) | 77.77(60.27) | 83.86(81.65) | 76.75(63.23) | 78.40(59.83) | 82.71(66.46) | 82.84(65.59) | 66.98(49.13)
UnrealEgo | 50.97(34.86) | 60.42(46.27) | 48.09(40.5) | 56.23(47.90) | 57.14(44.90) | 63.10(50.86) | 64.73(51.79) | 52.13(38.36)
Ego3DPose | 44.14(32.99) | 51.45(41.91) | 55.66(45.08) | 49.48(44.65) | 45.35(36.52) | 52.25(44.97) | 55.30(46.47) | 41.55(32.14)
Ours | 31.29(25.51) | 41.24(35.55) | 37.07(30.28) | 39.64(38.06) | 32.43(30.25) | 39.34(37.94) | 42.14(38.39) | 29.97(25.69)

Method | Wrestling | Soccer | Baseball | Basketball | American Football | Golf
EgoGlass | 84.23(62.81) | 81.57(60.59) | 76.20(56.11) | 78.33(57.30) | 102.54(84.03) | 69.69(48.15)
UnrealEgo | 67.85(52.73) | 67.09(51.43) | 62.15(48.60) | 64.73(47.79) | 89.57(68.49) | 55.87(40.34)
Ego3DPose | 57.96(45.94) | 59.56(45.23) | 56.21(42.17) | 56.02(41.94) | 77.89(62.56) | 48.10(36.01)
Ours | 44.15(39.38) | 48.27(38.56) | 44.83(35.92) | 45.19(36.78) | 65.30(54.41) | 38.97(31.25)

Table 4: Quantitative evaluation results on the UnrealEgo dataset per category.

Appendix F Categorical Evaluation on the UnrealEgo dataset

Table 4 shows the results on the UnrealEgo [2] dataset per motion category. Metrics from all three baseline methods, EgoGlass [32], UnrealEgo [2], and Ego3DPose [8], are shown alongside our method. MPJPE values appear outside the parentheses, and PA-MPJPE values inside.

[Figure 9: CDF of errors for each joint in the UnrealEgo [2] dataset. Panels: (a) UnrealEgo [2] method; (b) Ours.]

F.1 Per Joint Error Distribution

Fig. 9 shows the CDF (Cumulative Distribution Function) of the error of each joint. Two results are shown, both evaluated on the UnrealEgo dataset: one for the UnrealEgo [2] method, the example baseline used in the introduction, and one for our method.

The thigh is directly attached to the pelvis, which is the origin of the local pose definition in the dataset. Thus, for all methods, the thigh has the lowest error among all joints. Lower-body joints (calf, foot, and ball) generally have significantly higher errors than other joints, except for the hands, which have larger errors than the calf. In our method, the errors of the hand and the lower arm get much closer to that of the upper arm, showing the benefit of the propagation.
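For reference, a per-joint error CDF of the kind discussed above can be computed from a list of per-sample errors for one joint. The sketch below is a generic empirical CDF, not the plotting code used for the figure; the function name and the sample values are hypothetical.

```python
import bisect


def empirical_cdf(errors, thresholds):
    """Fraction of samples whose error is at most each threshold."""
    sorted_errors = sorted(errors)
    n = len(sorted_errors)
    return [bisect.bisect_right(sorted_errors, t) / n for t in thresholds]


# Hypothetical per-sample errors for a single joint.
joint_errors = [10.0, 20.0, 30.0, 40.0]
cdf_values = empirical_cdf(joint_errors, [5.0, 20.0, 35.0, 100.0])
```

A curve that rises earlier indicates a joint whose errors are concentrated at lower values.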

[Figure 10: Estimated limb heatmaps on the test set of the EgoCap (Left) and UnrealEgo (Right). Panels: (a) EgoCap [17] dataset; (b) UnrealEgo [2] dataset.]
Input heatmap | UnrealEgo [2] | EgoCap [17]
Estimated Heatmap | 41.06 | 55.38
Ground Truth Heatmap | 6.63 | 26.63

Table 5: Comparison of pose estimation error (MPJPE) of our method, with estimated and ground truth heatmaps provided as input. Columns indicate the two datasets.

F.2 Impact of the Heatmap Estimation Accuracy

We evaluated our architecture's performance when ground truth heatmaps were provided instead of estimated heatmaps. Table 5 reveals that the limited 2D pose information from the view is a key bottleneck for egocentric pose estimation: the full 2D pose provided by the ground truth heatmaps reduces the error significantly. Despite the better view provided by its camera, which is attached far from the head, EgoCap has a higher estimation error than UnrealEgo with ground truth heatmaps. Its relatively small training set may also be a bottleneck.

F.3 Impact of ViT Backbone Size

Experiments revealed that the bottleneck of pose estimation accuracy is not the computational capacity of the backbone. We experimented with up to 12 ViT encoder layers and feature sizes up to 8 times larger. No notable improvement was observed over the smaller backbone we chose. Akada et al. [2] report a consistent result: larger ResNet backbones do not improve pose estimation accuracy.

Appendix G Example Figure

G.1 Limb Heatmaps

The main text mentions that limb heatmap estimation is less accurate on the EgoCap [17] dataset. The heatmap visualization in Fig. 10 shows noisy lines for limbs.

Appendix H Limitations and Future Works

EgoTAP is limited to single-frame input and relies fully on visual cues. On motions with severe occlusion, such as “Crawling” and “Sitting on the Ground”, EgoTAP has much higher error than on other motion categories, as shown in Table 4. Unlike in many recently proposed general pose estimation methods, the use of temporal context remains little explored in the egocentric setup. Because the egocentric view is limited, pose estimation for invisible joints can benefit significantly from temporal context. As one example in the egocentric setup, Wang et al. [23] applied temporal optimization using a variational autoencoder to improve pose estimation in global coordinates.

The method's applicability can be further tested on monocular and other potential egocentric camera setups. The Propagation Network is based on the stereo setup, which provides sufficient information for a 3D pose when a joint is visible from both views. Thus, the propagation scheme helps child joint pose estimation. While 3D pose estimation from a single heatmap is not feasible in the monocular setup, the pose space is highly constrained, so our method could potentially be applied with modification.

The Propagation Network applies to an egocentric view with a specific characteristic. The method itself lacks the dynamicity of GCN-based methods [30], which would make it applicable to many different situations. The tree hierarchy assumption still holds for arbitrary root joints in the skeletal hierarchy, leaving room for more dynamicity. A tree hierarchy-based network of this kind has potential for situations centered on a specific joint, such as collision. Such applications remain future work.

References

  • Agarap [2018] Abien Fred Agarap. Deep learning using rectified linear units (relu), 2018. arXiv:1803.08375.
  • Akada et al. [2022] Hiroyasu Akada, Jian Wang, Soshi Shimada, Masaki Takahashi, Christian Theobalt, and Vladislav Golyanik. Unrealego: A new dataset for robust egocentric 3d human motion capture. In European Conference on Computer Vision (ECCV), 2022.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • Feng and Meunier [2022] Miao Feng and Jean Meunier. Skeleton graph-neural-network-based human action recognition: A survey. Sensors, 22(6):2091, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016.
  • He et al. [2020] Y. He, R. Yan, K. Fragkiadaki, and S. Yu. Epipolar transformers. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7776–7785, Los Alamitos, CA, USA, 2020. IEEE Computer Society.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 1997.
  • Kang et al. [2023] Taeho Kang, Kyungjin Lee, Jinrui Zhang, and Youngki Lee. Ego3dpose: Capturing 3d cues from binocular egocentric views. In SIGGRAPH Asia 2023 Conference Papers, New York, NY, USA, 2023. Association for Computing Machinery.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • Li et al. [2023a] Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023a.
  • Li et al. [2023b] Wenhao Li, Hong Liu, Hao Tang, and Pichao Wang. Multi-hypothesis representation learning for transformer-based 3d human pose estimation. Pattern Recognition, page 109631, 2023b.
  • Liu et al. [2018] Jun Liu, Amir Shahroudy, Dong Xu, Alex C. Kot, and Gang Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell., 40(12):3007–3021, 2018.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Ng et al. [2020] Evonne Ng, Donglai Xiang, Hanbyul Joo, and Kristen Grauman. You2me: Inferring body pose in egocentric video via first and second person interactions. CVPR, 2020.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Rhodin et al. [2016] Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei Rezvani Nezhad, Hans-Peter Seidel, Bernt Schiele, and Christian Theobalt. Egocap: Egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics, 35, 2016.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015. arXiv:1505.04597.
  • Shan et al. [2023] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Zhao Wang, Kai Han, Shanshe Wang, Siwei Ma, and Wen Gao. Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579, 2023.
  • Tome et al. [2019] Denis Tome, Patrick Peluse, Lourdes Agapito, and Hernan Badino. xr-egopose: Egocentric 3d human pose from an hmd camera. In Proceedings of the IEEE International Conference on Computer Vision, pages 7728–7738, 2019.
  • Tompson et al. [2015] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656. IEEE Computer Society, 2015.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Wang et al. [2021] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, and Christian Theobalt. Estimating egocentric 3d human pose in global space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11500–11509, 2021.
  • Wang et al. [2022] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, Diogo Luvizon, and Christian Theobalt. Estimating egocentric 3d human pose in the wild with external weak supervision. CVPR, 2022.
  • Wang et al. [2023] Jian Wang, Diogo Luvizon, Weipeng Xu, Lingjie Liu, Kripasindhu Sarkar, and Christian Theobalt. Scene-aware egocentric 3d human pose estimation. CVPR, 2023.
  • Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
  • Xu et al. [2019] Weipeng Xu, Avishek Chatterjee, Michael Zollhoefer, Helge Rhodin, Pascal Fua, Hans-Peter Seidel, and Christian Theobalt. Mo$^{2}$Cap$^{2}$: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
  • Yan et al. [2018] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
  • Yu et al. [2023] Bruce X.B. Yu, Zhi Zhang, Yongxu Liu, Sheng-hua Zhong, Yan Liu, and Chang Wen Chen. Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8818–8829, 2023.
  • Zeng et al. [2021] Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, and Qiang Xu. Learning skeletal graph neural networks for hard 3d pose estimation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11416–11425, 2021.
  • Zhang et al. [2022] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13232–13242, 2022.
  • Zhao et al. [2021] Dongxu Zhao, Zhen Wei, Jisan Mahmud, and Jan-Michael Frahm. Egoglass: Egocentric-view human pose estimation from an eyeglass frame. In 2021 International Conference on 3D Vision (3DV), pages 32–41, 2021.
  • Zhao et al. [2023] Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, and Chen Chen. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8886, 2023.
  • Zheng et al. [2021] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.