Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

Taeho Kang
Seoul National University, South Korea
[email protected]

Youngki Lee
Seoul National University, South Korea
[email protected]

Abstract

We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps, probabilistic 2D representations of the body pose, but heatmap-to-3D pose conversion remains inaccurate. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. Then, the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscure joints. Our method significantly outperforms the previous state of the art both qualitatively and quantitatively, as demonstrated by a 23.9% reduction in MPJPE. Our source code is available on GitHub: https://github.com/tho-kn/EgoTAP.

1 Introduction

The increasing use of Virtual Reality (VR) and Augmented Reality (AR) applications has prompted efforts to perform various vision tasks with minimal wearable sensors. Specifically, head-mounted cameras in the egocentric setup (Fig. 1) have received increasing attention thanks to their accessibility. Here, accurate 3D pose estimation is a critical task for seamlessly integrating virtual selves into the real world. However, existing egocentric pose estimation methods still suffer from accuracy challenges [8].

Figure 1: The stereo egocentric input and a comparison of the poses estimated by the state-of-the-art method [8] and ours. Blue denotes the ground truth and red the respective method’s estimate.

Conventional 3D pose estimation methods typically derive 3D pose directly from 2D pose information [19, 12, 34]. However, this approach faces challenges in egocentric setups due to inaccuracies in 2D pose estimation resulting from limited camera views and self-occlusion. To address this, egocentric pose estimation methods use joint heatmaps, i.e., probabilistic 2D representations of joints [21]. These heatmaps encode probability distributions over likely joint positions rather than exact locations. Following this approach, existing methods generate heatmaps for key joints from egocentric camera input, consolidate them into a unified feature embedding vector, and perform full-body 3D pose estimation (Fig. 2). However, two critical problems in the heatmap-to-3D lifting process significantly impact position estimation accuracy.

Inefficiency in feature embedding. Obtaining an effective feature embedding from the heatmap poses a significant challenge. A robust embedding vector is crucial for accurately reconstructing the 3D pose, given the indirect mapping between the probabilistic, high-dimensional heatmaps and the 3D pose. However, the standard design, which uses a Convolutional Neural Network (CNN) encoder, proves inadequate for feature summarization. The CNN encoder fails to preserve correspondence between specific heatmaps and joint poses, as features are merged into a single shared embedding. Furthermore, the spatial locality assumption of CNNs does not hold in an egocentric setup, where related joints may be distant in pixel space due to the proximity of egocentric cameras to body parts and biased positions. The 3D pose lifting employs a heatmap reconstruction loss [20, 32, 2, 8] to recover heatmap information, but full recovery becomes challenging once the embedding vector has lost significant information, as illustrated in Fig. 3.

Feature Importance-agnostic 3D Lifting. Secondly, estimating a full-body 3D pose without effectively distinguishing between important and unimportant features introduces significant inaccuracy, as seen in the conventional pipeline using a Multi-Layer Perceptron (Fig. 2 (b)). The prior methods [20, 32, 2, 8] do not consider the certainty of joints or the physical relationships between them, relying solely on the motion distribution within the training data. As a result, obscure joint features may adversely affect joints that have clear visual cues in the camera or that are estimable from nearby joint information. The supplementary material highlights that body extremities with less visibility exhibit higher estimation errors.

To tackle these challenges, we introduce EgoTAP (Egocentric Transformer-Attention Propagation Network). EgoTAP incorporates two key techniques: Grid ViT (Vision Transformer) Heatmap Encoder and Propagation Network. We design the former to generate an effective feature embedding that (i) preserves the correspondence between heatmaps and feature embedding and (ii) captures meaningful relationships between distant pixels. The latter assigns weights to evident joint features with clearer visual cues and predicts the position of less visible joints using the skeletal information of body limbs. Through these techniques, we achieve a substantial improvement in pose error metrics, demonstrating a 23.9% reduction in MPJPE and a 17.7% decrease in PA-MPJPE compared to state-of-the-art methods.

Grid ViT Heatmap Encoder addresses the inefficiency of the CNN encoding process. The Grid ViT Heatmap Encoder consolidates all joint heatmaps into a single image and divides them into patches, with each patch corresponding to a heatmap. Subsequently, self-attention is applied across all patches, generating per-patch feature embeddings. The ViT Heatmap Encoder offers two key advantages. Firstly, the per-patch embedding better preserves the position information of the original joint heatmaps. Secondly, self-attention facilitates the effective embedding of inter-joint relationships, particularly useful for joint features in distant areas.

Propagation Network propagates various features from the neck joint, which is likely to have evident features, to the body’s less visible extremities, following the body hierarchy. To enable propagation, we devise an LSTM [7]-inspired cell, the PU (Propagation Unit). The PU takes the parent joint’s feature, the relational (limb) features as a hidden state, and the child joint’s features as input to predict the final 3D position. The PU has an additional gate to forget the parent and relational features when the child joint features are evident, limiting the role of the predictive estimation to obscure joints. This design explicitly leverages the physical relationships of joints rather than implicitly inferring them from the training data, thereby contributing to higher pose estimation accuracy.

In summary, our contributions are the following:

• The first egocentric 3D pose estimation method using a vision transformer for efficient feature embedding.
• The Propagation Network that enables the predictive estimation for obscure joints using skeletal hierarchy.
• The Propagation Unit, to control the importance of the propagated features.
• EgoTAP outperforms the state-of-the-art stereo egocentric pose estimation both qualitatively and quantitatively.

2 Related Works

2.1 Egocentric Pose Estimation

Egocentric pose estimation can be classified into two main categories. The first category focuses on estimating the pose of other people within the camera’s field of view, as in Ng et al. [15], while the second category estimates the pose of the camera wearer [11]. Our work belongs to the second category, specifically with a downward-oriented egocentric camera.

EgoCap [17] showcased its potential using stereo cameras on a helmet-mounted stick. Mo$^{2}$Cap$^{2}$ [27] and $x$R-EgoPose [20] have introduced single-camera methods, which handle occlusion. The former proposes a two-branched heatmap, one for the lower body with a magnified view. The latter adds a heatmap reconstructor to preserve the probabilistic information of heatmaps. Recent methods utilize an external camera view to make a weakly labeled large-scale dataset [24] and a scene depth estimation model to estimate 3D pose with volumetric heatmaps [25]. These methods, however, require additional external cameras or depth datasets from specific views.

Recently, the stereo egocentric setup has gained attention for its wide-view stereo perspective. EgoGlass [32] introduces an unobtrusive eyeglass-mounted stereo camera setup. It incorporates an additional segmentation branch on the heatmap estimator module to improve the awareness of body parts and pixel correspondence. UnrealEgo [2] introduces a publicly available synthetic large-scale dataset based on the EgoGlass setup and proposes to share weights and merge features across the stereo view in the heatmap estimator. Ego3DPose [8] suggests making an independent estimate of the 3D orientation of each limb, using the concatenated orientation vector for the final decoder. We observed two problems in these prior works, i.e., information loss in feature embedding and data-dependent estimation of obscure joints, and propose two corresponding techniques to address them.

2.2 3D Human Pose Estimation with Transformer

The transformer-based architecture has been explored for the 3D pose estimation task. Epipolar Transformers [6] utilizes attention to match features along the epipolar line from the stereo view. Most methods focused on using transformers for 2D to 3D pose lifting spatially and temporally. PoseFormer [34] is the first transformer-based 2D-to-3D pose lifting method consisting of spatial and temporal transformer networks. MixSTE [31] and PoseFormerV2 [33] improved it with the per joint temporal characteristics and frequency domain feature. Unlike prior works, we exploit the transformer to effectively embed heatmap information for accurate heatmap-to-3D pose lifting.

2.3 Skeletal Network Models

Multiple works utilize skeletal hierarchy for vision tasks. For instance, Liu et al. [13] use a spatio-temporal LSTM to iterate through all joints for action recognition. Most recent efforts utilize a graph-based model to represent skeletal hierarchy. Graph Convolutional Networks [10] are widely utilized for activity recognition [4], while ST-GCN [28] models a dynamic skeletal graph in a spatiotemporal manner. The graph-based models are adapted for pose estimation [30, 29, 28], using dynamic skeletal graphs with action-specific edges or adopting adaptive ST-GCN [29, 28].

Our work is the first to leverage skeletal information in the egocentric setup. Specifically, we address the challenge of obscure features, particularly for body extremities, which impact the pose estimation of all body parts. By introducing a skeleton-aware uni-directional Propagation Network model, we leverage clear visual cues from camera-proximate joints to estimate the pose of body parts with obscure visual features.

3 Method

3.1 Overview

Overall Architecture. Fig. 4 illustrates the comprehensive architecture of EgoTAP. It comprises two essential components: the Grid ViT Heatmap Encoder and the Propagation Network. The Grid ViT Heatmap Encoder takes joint heatmaps as input and generates effective feature embeddings for each joint. The Propagation Network processes these embeddings with awareness of the skeletal structure to estimate the 3D pose accurately. Notably, the per-joint feature embedding is propagated through a skeletal hierarchy, represented as a tree structure with a root representing the head. In Fig. 4, a simplified skeleton is depicted, showcasing the propagation from the head to the hand, highlighted in red. The feature propagation utilizes the PU (Propagation Unit in Fig. 5), which calculates joint states based on the parent joint’s states along with other self-joint features. The hidden states of the last PU layer are concatenated with the joint features from the Grid ViT encoder and linearly projected to estimate the 3D pose of each joint.

Input and Output. Our method utilizes a pre-trained and frozen heatmap estimator that takes stereo RGB images $I\in\mathbb{R}^{2\times 256\times 256\times 3}$ and estimates stereo heatmaps for $N_{J}$ joints $\mathbf{H_{J}}\in\mathbb{R}^{2N_{J}\times 64\times 64}$ and $N_{L}$ limbs $\mathbf{H_{L}}\in\mathbb{R}^{2N_{L}\times 2\times 64\times 64}$ . EgoTAP takes the heatmaps and reconstructs the 3D pose $P\in\mathbb{R}^{N^{\prime}_{J}\times 3}$ of $N^{\prime}_{J}$ joints relative to the user’s root defined in the dataset. Note that the number of estimation targets $N^{\prime}_{J}$ can differ from the number of joints with heatmap $N_{J}$ depending on the dataset.

Loss. We use the Euclidean distance and the cosine similarity-based loss between the ground-truth pose and the estimated pose to train the Attention-Propagation network. The loss formulation is in the supplementary material.

Heatmaps. Two types of heatmaps, for joints and for limbs, are used. We follow the standard definition of the joint heatmap [21], where pixel values represent the probability that the joint is at that 2D coordinate. The limb heatmaps have two channels and are used to obtain relational features between two joints for the Propagation Network in Sec. 3.3. We use the limb heatmap suggested by Kang et al. [8], which represents 3D information along with limb visibility as a line connecting joints. In the following sections, we refer to these two types as joint heatmaps and limb heatmaps. For heatmap estimation, we use a pre-trained ResNet-18 [5] based U-Net [18] architecture with shared weights for the two input image encoders and a shared decoder, as suggested by Akada et al. [2].

3.2 Grid ViT Heatmap Encoder

Our encoder, described in Fig. 4, combines all joint heatmaps into a large single grid image. The grid is split into patches, linearly projected to make the input embedding, and fed to a transformer [22] encoder architecture with multi-head attention. The transformer encoding process preserves the correspondence between a patch and the input feature embedding in the output. The output feature embeddings corresponding to individual input patches are concatenated and re-encoded to form a feature embedding vector for the heatmap.

Unlike the CNN encoder, where the communication occurs within the nearby pixels of different heatmaps, the Grid ViT Heatmap Encoder allows communication between heatmap patches that are far spatially. This allows features to be shared without downsampling, minimizing the loss of information. The efficiency of the encoder is demonstrated by the precisely reconstructed heatmaps from the embeddings in Fig. 3 and Table 3, and improved pose estimation accuracy.
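The long-range communication described above comes from self-attention, in which every patch token is compared against every other token regardless of spatial distance. The following is a minimal sketch of single-head scaled dot-product self-attention. It is not the encoder itself, which uses multi-head attention inside full transformer layers; the token count, the dimensions, and the weight matrices `W_q`, `W_k`, `W_v` are illustrative assumptions.

```python
import numpy as np

def self_attention(z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over tokens z of shape (T, d)."""
    q, k, v = z @ W_q, z @ W_k, z @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (T, T): every patch vs. every patch
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 32, 16                        # toy sizes (assumption), not the paper's
tokens = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, attn = self_attention(tokens, W_q, W_k, W_v)
print(out.shape, attn.shape)         # (32, 16) (32, 32)
```

Each row of `attn` is a distribution over all tokens, so a patch from one heatmap can draw on a patch from any other heatmap in a single layer.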

To formulate the process, let $\{\mathbf{H_{J,i}}\in\mathbb{R}^{64\times 64}|i=1,2,\ldots,2N_{J}\}$ be the set of $2\times N_{J}$ stereo joint heatmaps. The heatmaps are arranged into a single grid image, which is subsequently split into a total of $4\times 4\times 2N_{J}$ patches $\{X_{i}\in\mathbb{R}^{16\times 16}|i=1,2,\ldots,32N_{J}\}$, where 16 patches correspond to each heatmap. For simplicity, we index the patches so that $X_{16(i-1)+1}$ to $X_{16i}$ correspond to the $i$-th heatmap.
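The patch indexing can be illustrated with a short NumPy sketch. It splits each $64\times 64$ heatmap into its 16 patches directly, which yields the same patches as splitting the grid image as long as patch boundaries align with heatmap boundaries. The function name `heatmaps_to_patches`, the row-major patch order within a heatmap, and the value of `n_joints` are assumptions made for illustration.

```python
import numpy as np

def heatmaps_to_patches(heatmaps: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split (M, 64, 64) heatmaps into (M * 16, 16, 16) patches.

    With 0-based heatmap index i0, patches [16 * i0 : 16 * (i0 + 1)]
    all come from heatmap i0, in row-major order (assumed).
    """
    m, h, w = heatmaps.shape
    gh, gw = h // patch, w // patch           # 4 x 4 patches per heatmap
    x = heatmaps.reshape(m, gh, patch, gw, patch)
    x = x.transpose(0, 1, 3, 2, 4)            # (M, gh, gw, patch, patch)
    return x.reshape(m * gh * gw, patch, patch)

n_joints = 15                                 # hypothetical N_J
H = np.random.rand(2 * n_joints, 64, 64)      # stand-in stereo joint heatmaps
X = heatmaps_to_patches(H)
print(X.shape)                                # (480, 16, 16), i.e. 32 * N_J patches
```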

Each patch $X_{i}$ is then projected to an input embedding space $\mathbb{R}^{1024}$ with a learnable projection matrix $W_{z}\in\mathbb{R}^{1024\times 256}$ . Additionally, learnable positional encodings $\mathbf{p}_{i}\in\mathbb{R}^{1024}$ are added, resulting in the transformer input embedding $z_{i}$ . The projected embedding with positional encoding for each patch is:

$$z_{i}=W_{z}\cdot\textit{Flatten}(X_{i})+\mathbf{p}_{i}\tag{1}$$

$z=[z_{1},z_{2},\ldots,z_{32N_{J}}]$ is encoded by three ViT transformer encoder [3] layers with multi-head attention to output $z^{\prime}=[z^{\prime}_{1},z^{\prime}_{2},\ldots,z^{\prime}_{32N_{J}}]$ . For the $j$ -th heatmap, the corresponding output embeddings from 16 patches are concatenated to $\mathcal{Z}_{j}$ and then re-encoded to smaller dimensional feature embedding $k_{j}$ through multiple fully connected layers denoted as $E_{K}$ . The process is formulated as follows:

$$z^{\prime}=\textit{TransformerEncoder}(z)\tag{2}$$

$$\mathcal{Z}_{j}=[z^{\prime}_{16(j-1)+1},z^{\prime}_{16(j-1)+2},\ldots,z^{\prime}_{16j}]\tag{3}$$

$$k_{j}=E_{K}(\mathcal{Z}_{j})\tag{4}$$

A joint feature $\mathbf{F_{J,i}}\in\mathbb{R}^{256}$ that corresponds to a specific joint is obtained by concatenating the features of its two stereo heatmaps. Without loss of generality, let the $(2i-1)$-th and $2i$-th heatmaps correspond to the $i$-th joint.

$$\mathbf{F_{J,i}}=[k_{2i-1},k_{2i}],\text{ for }1\leq i\leq N_{J}\tag{5}$$
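The shape bookkeeping of Eqs. (1) to (5) can be traced with a small NumPy sketch. This is not the trained encoder: the three transformer layers are replaced by an identity placeholder, $E_{K}$ is reduced to a single linear layer, all weights are random, and `n_joints` is hypothetical. The per-heatmap width of 128 is inferred from $\mathbf{F_{J,i}}\in\mathbb{R}^{256}$ being a concatenation of two such vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_joints = 15                        # hypothetical N_J
d_model, d_k = 1024, 128             # 1024 from the text; 128 so that [k, k] is 256-d

# Patches X_i, already split from the grid image: (32 * N_J, 16, 16).
X = rng.random((32 * n_joints, 16, 16))

# Eq. (1): z_i = W_z . Flatten(X_i) + p_i
W_z = rng.standard_normal((d_model, 256)) * 0.02
p = rng.standard_normal((32 * n_joints, d_model)) * 0.02   # learnable positional encodings
z = X.reshape(len(X), -1) @ W_z.T + p                      # (32 N_J, 1024)

# Eq. (2): identity placeholder standing in for the transformer encoder layers.
z_prime = z

# Eq. (3): concatenate the 16 patch embeddings that belong to each heatmap.
Z = z_prime.reshape(2 * n_joints, 16 * d_model)            # (2 N_J, 16384)

# Eq. (4): E_K, reduced to one linear layer here (assumption).
W_k = rng.standard_normal((16 * d_model, d_k)) * 0.01
k = Z @ W_k                                                # (2 N_J, 128)

# Eq. (5): F_{J,i} = [k_{2i-1}, k_{2i}] pairs the two stereo views of joint i.
F_J = k.reshape(n_joints, 2 * d_k)                         # (N_J, 256)
print(z.shape, Z.shape, F_J.shape)
```

Because every output token stays tied to its input patch, grouping by index in Eq. (3) is enough to recover a per-heatmap feature.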

3.3 Propagation Network

Propagation Process. The Propagation Network estimates the joint positions using their parent joints’ positions and the relationships between the joints. The Propagation Network is inspired by the stereo setup’s capability to estimate 3D pose without the help of other joints and the general trend of higher visibility on joints closer to the camera in the egocentric setup. Sec. 4.3.2 shows that the Propagation Network effectively takes advantage of accurate estimation of the parent joint with a Propagation Potential and Propagation Effect metric.

The Propagation Network comprises a relational feature encoder and the 2-layered PU that handles the propagation process. The relational feature encoder takes the estimated limb heatmaps to output the relational feature between joints. The PU handles the propagation process, which takes the parent states, relational and joint features of the child joint as input and generates the child joint’s states. The states of joints are propagated through the tree hierarchy from the head directly attached to the camera to the extremities. During propagation, the reflection of the parent joint information is flexibly determined based on the certainty of the parent and child joint features by the PU.

We leverage the limb heatmaps with 3D information embedded with a trigonometric function of camera view angle [8] to provide information about the connection between the parent and child joint. An encoder with fully connected layers $E_{R}$ encodes limb heatmaps $\mathbf{H_{L,i}}\in\mathbb{R}^{2\times 64\times 64}$ into a limb feature. Stereo limb features are concatenated to form the relational feature $\mathbf{F_{R}}$. Let $\mathbf{H_{L,2i-1}}$ and $\mathbf{H_{L,2i}}$ correspond to the limb that connects the $i$-th joint and its parent. The process is:

$$\mathbf{F_{R,i}}=[E_{R}(\mathbf{H_{L,2i-1}}),E_{R}(\mathbf{H_{L,2i}})],\text{ for }1\leq i\leq N_{L}\tag{6}$$

The Propagation Network consists of two layers of the Propagation Unit, described later. For a tree hierarchy where $parent(i)$ denotes a parent joint’s index, and $\textit{PropagationNet}((H,C),R,J)$ denotes the Propagation Network, which takes hidden and cell states for two PU layers $H=[h_{1},h_{2}]$ , $C=[c_{1},c_{2}]$ , relational feature $R$ and joint feature $J$ , the hidden and cell state for $i$ -th joint $\mathbf{H}_{i},\mathbf{C}_{i}$ is computed as follows:

$$\mathbf{S}_{i}=(\mathbf{H}_{i},\mathbf{C}_{i})\tag{7}$$

$$\mathbf{H}_{0}=\vec{0},\quad\mathbf{C}_{0}=\vec{0}\tag{8}$$

$$\mathbf{S}_{i}=\textit{PropagationNet}(\mathbf{S}_{\text{parent}(i)},\mathbf{F_{R,i}},\mathbf{F_{J,i}}),\text{ for }1\leq i\leq N_{J}\tag{9}$$

The root joint, the head, is indexed 0 and initialized with zero vectors, as it is not visible from an egocentric view and thus has no features. The $i$-th Propagated Feature $\mathbf{F_{P,i}}\in\mathbb{R}^{256}$ is the hidden state $\mathbf{h}_{2,i}$ from the second layer of the Propagation Network.

The output of the Propagation Network $\mathbf{F_{P,i}}$ and transformer output joint features $\mathbf{F_{J,i}}$ for each joint are concatenated and projected to estimate the 3D position of each joint.
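The recursion over the skeleton tree can be sketched in plain Python. The key constraint is that a joint's state is computed only after its parent's state exists. The skeleton, the feature values, and `toy_cell` below are hypothetical stand-ins; the real cell is the Propagation Unit, and the argument order follows the $\textit{PropagationNet}((H,C),R,J)$ definition.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[List[float], List[float]]        # (hidden, cell)

def propagate(parent: Dict[int, int],
              F_R: Dict[int, List[float]],
              F_J: Dict[int, List[float]],
              cell: Callable[[State, List[float], List[float]], State],
              dim: int) -> Dict[int, State]:
    """Compute S_i = cell(S_parent(i), F_R[i], F_J[i]) from the root outwards."""
    states: Dict[int, State] = {0: ([0.0] * dim, [0.0] * dim)}   # root state is zero
    pending = sorted(parent)
    while pending:
        remaining = []
        for i in pending:
            if parent[i] in states:            # parent already processed
                states[i] = cell(states[parent[i]], F_R[i], F_J[i])
            else:
                remaining.append(i)
        if len(remaining) == len(pending):
            raise ValueError("parent map does not form a tree rooted at joint 0")
        pending = remaining
    return states

def toy_cell(state: State, f_r: List[float], f_j: List[float]) -> State:
    """Stand-in for the PU: accumulates features along the path from the root."""
    h, c = state
    return [a + b + d for a, b, d in zip(h, f_r, f_j)], c

# Hypothetical chain 0 (head) -> 1 -> 2 -> 3 -> 4, plus a branch 1 -> 5.
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 1}
F_J = {i: [1.0] for i in parent}
F_R = {i: [0.5] for i in parent}
states = propagate(parent, F_R, F_J, toy_cell, dim=1)
print(states[4][0], states[5][0])              # [6.0] [3.0]
```

With the toy cell, each joint's hidden value counts the features gathered on its path from the head, which makes the traversal order easy to check.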

Propagation Unit. We devise a Propagation Unit inspired by the LSTM cell for the above propagation process. Fig. 5 shows the internal structure of the Propagation Unit. The Propagation Unit weights the parent’s hidden state and the relational feature with the joint feature. The joint heatmap from stereo views can be sufficient for precise 3D estimation, and this weighting limits the role of the predictive estimation for obscure joints.

To formulate the Propagation Unit, we denote the weight matrix as $W$ and bias vectors as $b$ . The symbol $\odot$ represents element-wise multiplication. The $+$ sign represents element-wise addition. $\sigma$ denotes the sigmoid activation.

$$f^{\prime}_{i}=\sigma(W_{f^{\prime}}\cdot\mathbf{F_{J,i}}+b_{f^{\prime}})\tag{10}$$

$$f^{\prime\prime}_{i}=\sigma(W_{f^{\prime\prime}}\cdot\mathbf{F_{J,i}}+b_{f^{\prime\prime}})\tag{11}$$

$$h^{\prime}_{i}=f^{\prime}_{i}\odot h_{\text{parent}(i)}\tag{12}$$

$$r^{\prime}_{i}=f^{\prime\prime}_{i}\odot\mathbf{F_{R,i}}\tag{13}$$

Two additional forget gates, $f^{\prime}_{i}$ and $f^{\prime\prime}_{i}$, are computed from the joint feature. They control the parent joint’s hidden state and the relational feature between the two joints, respectively, resulting in the modified hidden state $h^{\prime}_{i}$ and the modified relational feature $r^{\prime}_{i}$. Subsequently, these modified states, together with the joint feature treated as input, enter the standard LSTM architecture, where they are weighted and passed through non-linearities to form the four gates: input, candidate cell state, forget, and output.

For the second layer of the Propagation Network, as there is only a hidden state from the previous layer without relational or joint feature distinction, the hidden state from the previous layer is used for forgetting the parent joint’s hidden state in the current layer.
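A first-layer Propagation Unit step might look like the NumPy sketch below. Eqs. (10) to (13) are taken from the text; how $h^{\prime}_{i}$, $r^{\prime}_{i}$, and $\mathbf{F_{J,i}}$ are wired into the LSTM gates is not spelled out here, so the concatenation `[r', F_J, h']` feeding one stacked gate matrix is an assumption, as are the class name, the toy dimensions, and the random weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PropagationUnit:
    """First-layer PU sketch: Eqs. (10)-(13) followed by a standard LSTM step."""

    def __init__(self, d_joint, d_rel, d_hidden, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.standard_normal(shape) * 0.1

        self.W_f1, self.b_f1 = init(d_hidden, d_joint), np.zeros(d_hidden)   # gate f'
        self.W_f2, self.b_f2 = init(d_rel, d_joint), np.zeros(d_rel)         # gate f''
        d_in = d_rel + d_joint + d_hidden          # [r', F_J, h'] (assumed wiring)
        self.W, self.b = init(4 * d_hidden, d_in), np.zeros(4 * d_hidden)

    def __call__(self, h_parent, c_parent, F_R, F_J):
        f1 = sigmoid(self.W_f1 @ F_J + self.b_f1)  # Eq. (10)
        f2 = sigmoid(self.W_f2 @ F_J + self.b_f2)  # Eq. (11)
        h_mod = f1 * h_parent                      # Eq. (12)
        r_mod = f2 * F_R                           # Eq. (13)
        gates = self.W @ np.concatenate([r_mod, F_J, h_mod]) + self.b
        i, g, f, o = np.split(gates, 4)            # input, candidate, forget, output
        c = sigmoid(f) * c_parent + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

rng = np.random.default_rng(1)
pu = PropagationUnit(d_joint=8, d_rel=6, d_hidden=4)   # toy sizes (assumption)
F_J, F_R = rng.random(8), rng.random(6)
h, c = pu(np.zeros(4), np.zeros(4), F_R, F_J)          # child of the zero-state root
print(h.shape, c.shape)                                # (4,) (4,)
```

When both extra gates saturate near zero, the output no longer depends on the parent's hidden state or the relational feature, which is the behavior intended for joints whose own features are evident.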

4 Evaluation

4.1 Experiment Setup

4.1.1 Datasets

Overview. We used two datasets: UnrealEgo [2] and EgoCap [17] for the 3D pose estimation in the stereo egocentric camera setup. We conducted the within-dataset evaluation using each dataset’s train and test set split since the egocentric datasets have significantly different setups and resulting views.

UnrealEgo. UnrealEgo [2] is a synthetic dataset containing 450k frames with 17 characters. The dataset covers a variety of environments and motions that are challenging to capture in a real-world setup. There are a total of 16 joints to estimate. The dataset defines the target local 3D pose in a pelvis-relative coordinate system, as opposed to the camera coordinate system used in most datasets, and includes a head pose to estimate. The pelvis and head do not have corresponding heatmaps and features. We added a learnable matrix for linear projection that takes all the final features $\mathbf{F_{J}}$ and $\mathbf{F_{P}}$ to estimate an offset for all joints and the head pose. We found that this simple change effectively deals with different pose definitions.

EgoCap. The EgoCap [17] dataset is captured with egocentric cameras attached to the end of a stick mounted on a helmet. It comprises 35k frames for training from six subjects and 1k for testing from one subject, with 3D pose annotation. Evaluation on this dataset showcases applicability to real-world textured images. There are a total of 17 joints to estimate.

4.1.2 Baselines

We experiment with three baseline stereo egocentric pose estimation methods: EgoGlass [32], UnrealEgo [2], and Ego3DPose [8]. We use the official UnrealEgo [2] and Ego3DPose [8] implementations. EgoGlass [32] implementation is taken from the latter as no official source code is provided. For the UnrealEgo [2] and Ego3DPose [8], we changed the embedding and pose decoder dimension, which gives higher estimation accuracy than their original setups. The change does not impact the EgoGlass [32], possibly due to the joint training of the heatmap and pose estimator.

4.1.3 Metrics

The MPJPE and PA-MPJPE metrics are used. MPJPE is the mean per-joint position error, measured as a 3D Euclidean distance. PA-MPJPE applies Procrustes analysis before computing the MPJPE to calculate a transform-invariant positional error.
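Both metrics can be computed in a few lines of NumPy. The sketch below uses the common similarity-transform formulation of Procrustes alignment (rotation, uniform scale, and translation, solved with an SVD); this is the usual convention for PA-MPJPE, but it is an assumption that it matches the exact evaluation code used here. Function names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)          # SVD of the 3x3 cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))         # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                             # rotation acting on row vectors
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)

rng = np.random.default_rng(0)
gt = rng.standard_normal((17, 3))              # toy pose with 17 joints
pred = gt + np.array([3.0, 4.0, 0.0])          # same pose, globally shifted
print(round(mpjpe(pred, gt), 6), round(pa_mpjpe(pred, gt), 6))
```

In the toy example the prediction is a pure global shift, so MPJPE equals the shift length (about 5) while PA-MPJPE is essentially zero, illustrating what the alignment discards.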

4.2 Overall Performance

4.2.1 Qualitative Results

Fig. 6 presents a qualitative comparison between our method and previous approaches on the UnrealEgo and EgoCap datasets. A more detailed qualitative comparison is available in the supplementary video. Our method demonstrates a significant improvement over baseline methods.

4.2.2 Evaluation on UnrealEgo

The second column of Table 1 presents the quantitative evaluation results on UnrealEgo [2] using MPJPE and PA-MPJPE metrics. Our method demonstrates superior performance compared to state-of-the-art methods, achieving a 23.9% reduction in MPJPE and a 17.7% decrease in PA-MPJPE. These improvements extend across all 31 activity categories detailed in the supplementary material, covering a range of movements from common actions like sitting and standing to less frequent crawling and crouching and more complex motion categories, including sports.

Noteworthy improvements are observed across various categories, with the most substantial enhancement in the “Crouching-Forward” category, boasting a 31.3% reduction in MPJPE. Conversely, the smallest improvement is noted in the “Crawling” activity, with an 8.8% decrease in MPJPE. It’s important to acknowledge that while our method relies on visual cues, the effectiveness varies based on the visibility of body parts. For instance, in activities like “Crouching-Forward,” where many body parts are partially visible, our method excels in improving accuracy. On the other hand, in activities like “Crawling,” where visible body features are significantly lacking, the challenge of enhancement is more pronounced.

Method	UnrealEgo [2]	EgoCap [17]
EgoGlass [32]	81.55 (61.56)	67.90 (-)
UnrealEgo [2]	63.53 (47.76)	70.77 (52.91)
Ego3DPose [8]	53.99 (43.02)	69.45 (49.98)
Ours	41.06 (35.39)	55.38 (45.24)

Table 1: Evaluation results of state-of-the-art methods and ours on two datasets. The metric is MPJPE, and in the bracket is PA-MPJPE. The bold text indicates the best results.
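The UnrealEgo percentages quoted in this section follow from Table 1 as relative error reductions against the strongest baseline, Ego3DPose [8]. A small sketch of the arithmetic (the helper name is illustrative):

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` lowers the error of `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# UnrealEgo column of Table 1: Ego3DPose [8] vs. ours.
mpjpe_gain = relative_reduction(53.99, 41.06)
pa_mpjpe_gain = relative_reduction(43.02, 35.39)
print(f"{mpjpe_gain:.1f}% {pa_mpjpe_gain:.1f}%")   # 23.9% 17.7%
```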

4.2.3 Evaluation on EgoCap

The third column of Table 1 presents the quantitative results on the EgoCap dataset. Our method clearly outperforms the baselines, surpassing EgoGlass [32] by 22.6% in MPJPE and Ego3DPose [8] by 9.4% in PA-MPJPE. For EgoGlass [32], we report the MPJPE value from their paper, as they do not provide official code or network details, and the available replication [8] did not match the reported performance.

The relatively smaller improvement in PA-MPJPE, which discards the effect of the root’s transform, could be attributed to prior methods estimating the full body pose as a whole. Consequently, they might capture the relative pose between joints while the estimation is globally biased. Nevertheless, when integrating the output camera coordinate system pose with the 6-DoF pose of VR and AR devices, precise pose estimation in the correct coordinate frame is crucial for accurate body tracking in the global coordinate system.

We observed that the estimated limb heatmaps in the EgoCap dataset exhibit lower accuracy than those in the UnrealEgo dataset, as illustrated in the supplementary material. This discrepancy could be attributed to the limited volume and the small number of subjects in the EgoCap dataset. Despite these challenges, our Attention-Propagation network effectively lifts the 3D pose from heatmaps. However, Ego3DPose [8], which utilizes limb heatmaps, did not perform well. This could be attributed to their explicit inference of orientation for each limb. The final decoder, which takes independent information as an output orientation, struggles with inaccurate information.

4.3 Ablation Study

We performed ablation studies to showcase the effectiveness of each network component, as summarized in Table 2.

Method	UnrealEgo [2]	EgoCap [17]
Heatmap Encoder
CNN	63.53 (47.76)	70.77 (52.91)
Channel ViT	61.62 (47.05)	83.39 (56.29)
Grid ViT	49.03 (41.03)	63.97 (53.17)
Propagation Network
Grid ViT + RF	48.12 (40.79)	63.09 (52.60)
Grid ViT + LSTM	49.43 (41.31)	60.16 (49.18)
Grid ViT + LSTM RF Alter	44.97 (38.99)	62.60 (50.78)
Grid ViT + LSTM RF Concat	44.77 (38.91)	58.35 (47.06)
Ours (Grid ViT + PU)	41.06 (35.39)	55.38 (45.24)

Table 2: Ablation results of our method for two main components on two datasets. The metric is MPJPE, and in the bracket is PA-MPJPE. The bold text for metrics indicates the best results.

4.3.1 Grid ViT Heatmap Encoder

Pose Estimation: We assess the impact of the Grid ViT Heatmap Encoder. “CNN” presents the results from UnrealEgo [2], utilizing a CNN. “Channel ViT” showcases the outcomes with a typical encoder with ViT, where heatmaps are concatenated along the channel axis before being split into patches, resulting in feature embeddings that do not align with the heatmaps. Simply adopting transformers [22] yields minimal improvement, i.e., a 3% reduction in MPJPE, compared to the CNN-based lifting for the UnrealEgo [2] baseline and dataset. However, this approach significantly degrades performance on EgoCap [17]. This observation underscores the importance of addressing the correspondence between feature embedding and heatmaps in the pose estimation process.

Heatmap Reconstruction Error	$10^{-4}$ /Pixel
Zeros	5.45
CNN Encoder	4.84
Grid ViT Heatmap Encoder	1.68

Table 3: Reconstruction mean square error of the heatmaps from the features encoded with a different frozen encoder architecture, experimented in the UnrealEgo [2] dataset.

Heatmap Reconstruction: We conducted experiments to evaluate the heatmap encoder’s efficiency in encoding heatmap features. A simple decoder is appended to our encoder and baseline encoders to achieve this. The decoder is trained to reconstruct the estimated heatmaps from the feature embedding. Table 3 presents the reconstruction error of the heatmap in the test set. The “Zeros” row provides the error for a zero-only output for comparison. The results demonstrate that the Grid ViT Heatmap Encoder effectively extracts heatmap features, evidenced by the reconstructed fine details of the heatmap in Fig. 3. In contrast, the heatmaps were not recoverable from features encoded by CNN, highlighting its inefficiency.

4.3.2 Propagation Network

Pose Estimation: We investigate if including relational features alone can significantly enhance accuracy through “+ RF” when incorporated with our Grid ViT encoder. The relational features are concatenated to the joint features for the final projection layer without the involvement of a propagation network. This approach demonstrates marginal impact or even degrades the estimation accuracy. Additionally, we analyze the effect of the Propagation Network with LSTM [7]. In the case of “+ LSTM,” only joint features are utilized in the propagation, yielding a marginal effect.

Additional experiments investigate the impact of the Propagation Network without PU, denoted as “+ LSTM RF Alter” and “+ LSTM RF Concat.” Relational and joint features are alternately taken in the former, and the propagation feature is output in the joint feature step. The latter takes both as a concatenated vector. Both methods demonstrate improvements, with the latter achieving an 8.7% and 8.8% reduction in MPJPE for two datasets compared to the Grid ViT Heatmap Encoder-only approach. The final model, incorporating PU, maximizes the potential of the Propagation Network, showcasing a 16.3% and 13.4% improvement in MPJPE for the two datasets. This highlights the significance of balancing the role of predictive estimation using parent joints and direct estimation using self-joint features.

Propagation Potential and Effect: The Propagation Network leverages more evident parent joint features to improve the child joint’s pose estimation. The hexagonal-grid density plot in Fig. 7 illustrates its impact quantitatively. The $x$ -axis represents the Propagation Potential ( $\mathbf{PP}$ ). $\mathbf{PP}$ approximates the upper bound of the improvement using the parent’s feature, with a difference between the parent and child joint’s pose estimation error. On the $y$ -axis, the Propagation Effect ( $\mathbf{PE}$ ) is the improvement of the child joint’s pose error by the Propagation Network. Using $\Delta$ to denote the pose estimation error, subscripts to denote joints, and superscripts to denote the model ( $\mathbf{NP}$ without propagation, $\mathbf{P}$ with propagation), we define these metrics as follows.

\mathbf{PP}=\Delta_{\text{child}}^{\mathbf{NP}}-\Delta_{\text{parent}}^{\mathbf{NP}}

(14)

\mathbf{PE}=\Delta_{\text{child}}^{\mathbf{NP}}-\Delta_{\text{child}}^{\mathbf{P}}

(15)
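As a minimal illustration of Eqs. (14) and (15), the following sketch computes $\mathbf{PP}$ and $\mathbf{PE}$ for a few child-joint samples and the share of samples in the first quadrant. The error values and the helper name `propagation_metrics` are hypothetical and are not taken from our experiments.

```python
from statistics import mean

def propagation_metrics(err_child_np, err_parent_np, err_child_p):
    """Return (PP, PE) for one child-joint sample, following Eqs. (14) and (15)."""
    pp = err_child_np - err_parent_np  # Eq. (14): Propagation Potential
    pe = err_child_np - err_child_p    # Eq. (15): Propagation Effect
    return pp, pe

# Hypothetical per-sample errors:
# (child without propagation, parent without propagation, child with propagation)
samples = [
    (60.0, 40.0, 45.0),
    (30.0, 35.0, 31.0),
    (80.0, 50.0, 78.0),
]

metrics = [propagation_metrics(*s) for s in samples]
mean_pp = mean(pp for pp, _ in metrics)
mean_pe = mean(pe for _, pe in metrics)
# Share of samples with both positive potential and positive effect
first_quadrant = sum(1 for pp, pe in metrics if pp > 0 and pe > 0) / len(metrics)
```

A sample with $\mathbf{PE}$ close to $\mathbf{PP}$ (the first one here) corresponds to a child joint whose error has been brought close to that of its parent.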

For all datasets, linear regression reveals a positive relationship between $\mathbf{PP}$ and $\mathbf{PE}$ with a p-value $<10^{-3}$ for the null hypothesis, indicating that the Propagation Network is more effective when the parent joint has a more precise estimation, aligning with expectations. The average $\mathbf{PP}$ and $\mathbf{PE}$ were $16.97$ and $8.50$ for the UnrealEgo dataset [2] and $4.32$ and $9.39$ for the EgoCap [17] dataset. The UnrealEgo [2] dataset exhibits higher potential because its cameras are closer to the head, whereas the cameras in the EgoCap dataset [17] are around 20 cm away from the head.

The effect is more pronounced for the UnrealEgo [2] dataset when the 3D pose is estimated in camera-relative coordinates. This eliminates the global offset (pelvis pose) bias from per-joint improvement. Fig. 7 (b) exhibits trends where $\mathbf{PE}$ is either similar to $\mathbf{PP}$ or close to zero. When $\mathbf{PE}$ is similar to $\mathbf{PP}$, the child joint’s pose error is reduced close to the parent joint’s error, meaning the effect of the Propagation Network is near the upper bound ($\mathbf{PP}$). In some cases, the propagation cannot improve the child joint’s pose error, possibly due to the occlusion of limbs; such cases exhibit near-zero $\mathbf{PE}$. In the samples, $66.07\%$ of $\mathbf{PE}$ values and $75.62\%$ of $\mathbf{PP}$ values are positive, and $54.16\%$ of samples lie in the first quadrant. The average positive $\mathbf{PE}$ is $10.75$, while the average negative $\mathbf{PE}$ is only $-0.51$, demonstrating that many joints significantly benefit from the propagation.

5 Conclusion

In this study, we introduce a novel heatmap-to-3D lifting method tailored for the stereo egocentric setup, employing a transformer for efficient feature embedding and an attention-driven Propagation Network focused on evident features. We demonstrate effective heatmap feature extraction through the Grid ViT Heatmap Encoder, employing patch-wise communication with self-attention to preserve correspondence between the heatmap and the feature embedding. The Propagation Network utilizes visual cues from the proximate parent joint, leveraging joint relational information to predictively estimate less visible child joint poses. Our experiments highlight significant advancements over state-of-the-art stereo egocentric pose estimation methods, underscoring the efficacy of our proposed approach.

Acknowledgement

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. 2022R1A2C3008495 and No. RS-2023-00218601).

Supplementary Material

Appendix A Overview

The supplementary material contains the following:

• Dataset Processing
• Implementation
• Training
• Experiment
• Example Figure
• Limitations and Future Works

Appendix B Dataset Processing

This section details the train and test datasets we used. Our method requires 2D and 3D pose annotations and stereo input images. The 2D annotation is necessary for generating the heatmaps.

B.1 UnrealEgo

We utilize the full dataset, including metadata files and preprocessed pickles. The public Ego3DPose [8] code loads metadata and pickles. Their code adds 2D and 3D pose data in the camera coordinate system and their limb heatmap representation in the pickle files. Our method uses these final pickles.

B.2 EgoCap

We used the publicly available 2D pose annotations for the train set. Additionally, we obtained the full ground truth 3D poses for the train set of the EgoCap [17] dataset from the authors. In the fisheye views of the dataset, images are projected only onto a circular area due to strong distortion. Thus, the original images contain areas that do not have real views. Following Kang et al. [8], we cropped each image horizontally into a square area centered at the x-axis focal center ($\mathbf{fx}$) provided in the dataset calibration data. We then resized the images to $256\times 256$ to fit our model.

The dataset has a train set and 2D and 3D validation sets. The 3D validation set contains ground truth 3D poses and is used for testing. The 2D validation set provides the 2D annotation for the images in the 3D validation set from a subject labeled 7. The 3D pose is converted from millimeters to centimeters to scale the pose loss in accordance with the UnrealEgo dataset.

Appendix C Implementation

C.1 Grid ViT Heatmap Encoder

The $64\times 64$ heatmaps are tiled into one image with resolution $384\times 384$. The image comprises $36$ areas arranged as a $6\times 6$ grid. The number of joint heatmaps is $30$ for the UnrealEgo [2] dataset and $34$ for the EgoCap [17] dataset. The heatmaps fill the grid in order. The areas that do not correspond to any heatmap are masked in the ViT encoder module and do not impact the output.
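The tiling described above can be sketched as follows. This is an illustrative sketch rather than our released implementation: it assumes row-major filling of the grid (the text only states that heatmaps fill the grid in order), and the names `cell_origin` and `build_canvas` are hypothetical.

```python
GRID = 6              # 6 x 6 grid of cells
CELL = 64             # each heatmap is 64 x 64
CANVAS = GRID * CELL  # 384

def cell_origin(k):
    """Top-left (row, col) pixel of the k-th heatmap, assuming row-major filling."""
    row, col = divmod(k, GRID)
    return row * CELL, col * CELL

def build_canvas(heatmaps):
    """Tile 64x64 heatmaps (lists of rows) into a 384x384 canvas.

    Returns the canvas and a per-cell mask that is True for cells holding a heatmap.
    """
    canvas = [[0.0] * CANVAS for _ in range(CANVAS)]
    mask = [False] * (GRID * GRID)
    for k, hm in enumerate(heatmaps):
        r0, c0 = cell_origin(k)
        for r in range(CELL):
            canvas[r0 + r][c0:c0 + CELL] = hm[r]
        mask[k] = True
    return canvas, mask

# 30 dummy heatmaps (the UnrealEgo count); heatmap k is filled with the value k
heatmaps = [[[float(k)] * CELL for _ in range(CELL)] for k in range(30)]
canvas, mask = build_canvas(heatmaps)
```

With $30$ heatmaps, the last six cells stay empty and their mask entries stay `False`; these are the areas that would be masked in the encoder.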

We adopt the ViT encoder [3] architecture. Our implementation uses the ViTModel class of the public Transformers [26] library for PyTorch [16]. We removed the $[CLS]$ token since we are not using the module for a classification task. Doing so improves pose estimation accuracy empirically. The module follows the standard ViT [3] encoder architecture shown in Fig. 8, which takes the input embedding $z$ and outputs the feature embedding $z^{\prime}$.

The ViT encoder takes embeddings of size $1024$ for each of the $32N_{J}$ patches, $z=[z_{1},z_{2},\ldots,z_{32N_{J}}]$. The multi-head attention layer has $8$ heads. The intermediate layer size of the MLP is $4096$. The Grid ViT Heatmap Encoder uses three ViT encoder layers. For each heatmap, it outputs an embedding vector of total size $16384$ from the heatmap’s $16$ patches. The embedding vector is then compressed with the MLP denoted $E_{\mathbf{K}}$ in the paper. The MLP has ReLU [1] non-linearity for the intermediate layers. The hidden sizes of the MLP’s first two layers are $2048$ and $512$, and the last layer outputs a final embedding of size $128$.
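The sizes quoted above can be checked with a short calculation. The sketch assumes square $16\times 16$ patches, which is what $16$ patches per $64\times 64$ heatmap implies, and it reads $N_{J}$ as the number of joints per view, so that the $30$ stereo heatmaps of UnrealEgo correspond to $N_{J}=15$; both readings are inferences from the stated counts rather than values given explicitly.

```python
HEATMAP_SIDE = 64
PATCH_SIDE = 16             # assumption: square patches, 16 per heatmap
EMBED_DIM = 1024            # per-patch embedding size
MLP_SIZES = [2048, 512, 128]

patches_per_heatmap = (HEATMAP_SIDE // PATCH_SIDE) ** 2

def active_patches(num_heatmaps):
    """Number of unmasked patches entering the ViT encoder."""
    return patches_per_heatmap * num_heatmaps

# Flattened per-heatmap embedding that the MLP compresses
per_heatmap_embedding = patches_per_heatmap * EMBED_DIM

# (input size, output size) of each MLP layer
layer_shapes = list(zip([per_heatmap_embedding] + MLP_SIZES[:-1], MLP_SIZES))
```

Under these assumptions, `active_patches(30)` equals $480=32\times 15$ and `active_patches(34)` equals $544=32\times 17$, matching the $32N_{J}$ patch count.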

C.2 Propagation Network

In an extension of the typical LSTM [7], the Propagation Unit’s relational features, joint features, hidden and cell states, and gate outputs all have the same size. We chose $256$ for the size.

C.2.1 Limb Heatmap Encoder

The limb heatmap encoder $E_{R}$ extracts relational features. It has the same structure as the final MLP layers of the Grid ViT Heatmap Encoder, differing only in input size. The input two-channel limb heatmap [8] has size $2\times 64\times 64$ and is flattened before being passed to the encoder. The encoder consists of three fully connected layers: the first two have output sizes $2048$ and $512$ with ReLU [1] activation, and the final layer outputs an embedding of size $128$.

C.2.2 Second Layer of the PU

The second layer of the PU does not take distinct relational and joint features. Instead, it takes the parent joint’s second-layer cell and hidden states together with the joint’s own first-layer hidden state. Since hidden states from different layers are used in this section, we denote the $n$-th layer hidden state of the $i$-th joint as $h_{n,i}$. The additional forget gate $g_{i}$ in the second layer controls the hidden state of the parent joint’s second PU layer, resulting in the modified hidden state $h^{\prime}_{2,i}$. This is formulated as follows:

g_{i}=\sigma(W_{g}\cdot h_{1,i}+b_{g})

(16)

h^{\prime}_{2,i}=g_{i}\odot h_{2,{parent(i)}}

(17)

The modified parent hidden state and the joint’s first layer hidden state are input for the inner LSTM [7].

C.2.3 Internal LSTM of the PU

We explain the formulation of the LSTM [7] inside the PU in more detail here.

Formulation of typical LSTM. The LSTM is formulated as follows, where $h_{i-1}$ denotes the hidden state of the previous step, $c_{i-1}$ denotes the cell state of the previous step, and $x_{i}$ denotes the input. Here, $W$ and $b$ denote weights and biases for each gate. The symbol $\odot$ represents element-wise multiplication, and the $+$ sign represents element-wise addition. $\tanh$ and $\sigma$ denote the hyperbolic tangent and sigmoid activation.

f_{i}=\sigma(W_{f}\cdot[h_{i-1},x_{i}]+b_{f})

(18)

i_{i}=\sigma(W_{i}\cdot[h_{i-1},x_{i}]+b_{i})

(19)

o_{i}=\sigma(W_{o}\cdot[h_{i-1},x_{i}]+b_{o})

(20)

\tilde{c}_{i}=\tanh(W_{c}\cdot[h_{i-1},x_{i}]+b_{c})

(21)

c_{i}=f_{i}\odot c_{i-1}+i_{i}\odot\tilde{c}_{i}

(22)

h_{i}=o_{i}\odot\tanh(c_{i})

(23)

The $f_{i}$ , $i_{i}$ , and $o_{i}$ are forget, input, and output gates. $\tilde{c}_{i}$ denotes the candidate cell value. $h_{i}$ and $c_{i}$ are the final hidden and cell state for step $i$ .
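For reference, Eqs. (18) to (23) translate directly into the following plain-Python step. It is a didactic sketch with vectors as lists and hypothetical helper names (`affine`, `lstm_step`), not the PyTorch implementation used in our experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM step following Eqs. (18)-(23); params maps a gate name to (W, b)."""
    hx = h_prev + x  # concatenation [h_{i-1}, x_i]
    f = [sigmoid(a) for a in affine(*params["f"], hx)]          # Eq. (18)
    i = [sigmoid(a) for a in affine(*params["i"], hx)]          # Eq. (19)
    o = [sigmoid(a) for a in affine(*params["o"], hx)]          # Eq. (20)
    c_tilde = [math.tanh(a) for a in affine(*params["c"], hx)]  # Eq. (21)
    c = [fk * ck + ik * tk for fk, ck, ik, tk in zip(f, c_prev, i, c_tilde)]  # Eq. (22)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]            # Eq. (23)
    return h, c

# Toy setting: hidden size 2, input size 1, all-zero weights and biases,
# so every gate equals 0.5 and the candidate cell value equals 0.
zero = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
params = {"f": zero, "i": zero, "o": zero, "c": zero}
h, c = lstm_step(params, [0.0, 0.0], [1.0, -2.0], [0.3])
```

In this toy setting the new cell state is simply half of the previous one, which makes the role of the forget gate in Eq. (22) easy to see.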

Formulation of internal LSTM. Unlike the LSTM taking the cell and hidden state, the internal LSTM of the first PU layer takes three states in addition to input joint features. The three states are the modified parent’s hidden state $h^{\prime}_{i}$ , the modified relational feature of the joint $r^{\prime}_{i}$ , and the cell state of the parent $c_{parent(i)}$ . The input is joint features $\mathbf{F_{J,i}}$ .

This section explains the first and second layers together; thus, we denote the $n$-th layer of the $i$-th joint with an $n,i$ subscript, as in $h_{n,i}$ for the hidden state. In the computation of the forget, input, and output gates and the candidate cell value, the concatenated vector $[h^{\prime}_{1,i},r^{\prime}_{i},\mathbf{F_{J,i}}]$ of the modified parent’s hidden state, the modified relational feature, and the joint features replaces $[h_{i-1},x_{i}]$.

f_{1,i}=\sigma(W_{1,f}\cdot[h^{\prime}_{1,i},r^{\prime}_{i},\mathbf{F_{J,i}}]+b_{1,f})

(24)

i_{1,i}=\sigma(W_{1,i}\cdot[h^{\prime}_{1,i},r^{\prime}_{i},\mathbf{F_{J,i}}]+b_{1,i})

(25)

o_{1,i}=\sigma(W_{1,o}\cdot[h^{\prime}_{1,i},r^{\prime}_{i},\mathbf{F_{J,i}}]+b_{1,o})

(26)

\tilde{c}_{1,i}=\tanh(W_{1,c}\cdot[h^{\prime}_{1,i},r^{\prime}_{i},\mathbf{F_{J,i}}]+b_{1,c})

(27)

For the second layer, the modified second-layer parent hidden state $h^{\prime}_{2,i}$ from Sec. C.2.2 takes the place of $h_{i-1}$. The previous layer’s hidden state $h_{1,i}$ replaces the input $x_{i}$, analogous to the standard multi-layered LSTM.

f_{2,i}=\sigma(W_{2,f}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,f})

(28)

i_{2,i}=\sigma(W_{2,i}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,i})

(29)

o_{2,i}=\sigma(W_{2,o}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,o})

(30)

\tilde{c}_{2,i}=\tanh(W_{2,c}\cdot[h^{\prime}_{2,i},h_{1,i}]+b_{2,c})

(31)

The Propagation Unit takes features from the parent joint, not the previous index. In the computation of the final cell and hidden state, both layers of PU take $c_{n,parent(i)}$ instead of $c_{i-1}$ in the formula. The hidden state is computed in the same way.

c_{n,i}=f_{n,i}\odot c_{n,parent(i)}+i_{n,i}\odot\tilde{c}_{n,i}

(32)

h_{n,i}=o_{n,i}\odot\tanh(c_{n,i})

(33)
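The difference from a sequential LSTM, namely indexing the parent joint rather than the previous step, can be sketched as below. The sketch uses scalar states and precomputed gate values for brevity; the zero initial state for the root and the ordering of joints (parents before children) are assumptions made for the illustration, and `propagate` is a hypothetical name.

```python
import math

def propagate(parents, gates, c_root=0.0):
    """Apply Eqs. (32)-(33) over a skeleton tree with scalar states.

    parents[j] is the parent index of joint j (-1 for the root); joints are
    assumed to be ordered so that a parent always precedes its children.
    gates[j] = (f, i, o, c_tilde) holds precomputed gate values for joint j.
    """
    c, h = [], []
    for j, p in enumerate(parents):
        f, i, o, c_tilde = gates[j]
        c_parent = c_root if p < 0 else c[p]  # parent state, not state j-1
        c_j = f * c_parent + i * c_tilde      # Eq. (32)
        c.append(c_j)
        h.append(o * math.tanh(c_j))          # Eq. (33)
    return h, c

# Hypothetical 4-joint tree: joint 0 is the root, joints 1 and 3 are its
# children, and joint 2 is a child of joint 1.
parents = [-1, 0, 1, 0]
gates = [(1.0, 1.0, 1.0, 0.5)] * 4
h, c = propagate(parents, gates)
```

Joint 3 receives the cell state of joint 0 rather than that of joint 2, so the two branches of the tree do not influence each other.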

Appendix D Training

D.1 Hardware Setup

We trained and tested our method on a server with an NVIDIA RTX A6000 GPU and an AMD EPYC 7313 16-core CPU.

D.2 Heatmap Estimator

The heatmap estimator is trained using the UnrealEgo [2] code and scripts for the UnrealEgo dataset. The default configuration uses the Adam [9] optimizer with a learning rate of $10^{-3}$. The network is trained for $10$ epochs with batch size $16$, with linear decay applied over the last $5$ epochs. For the EgoCap dataset, we trained the heatmap estimators for $30$ epochs with the same setup, proportionally applying linear decay over the last $15$ epochs.

When hasty convergence, where all heatmap values converge to 0, is detected, the training is automatically restarted, following the protocol of Kang et al. [8].
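A per-epoch view of the heatmap estimator's learning-rate schedule in this subsection might look as follows. This is only a sketch: the exact decay rule is defined by the UnrealEgo training scripts, and the choice here, a decay that would reach zero one step after the final epoch, is an assumption, as is the function name `linear_decay_lr`.

```python
def linear_decay_lr(epoch, total_epochs, base_lr=1e-3):
    """Learning rate for a 0-indexed epoch.

    The rate is constant for the first half of training and decays linearly
    over the second half. Assumption: the decay would reach zero one step
    after the final epoch, so the last epoch still uses a positive rate.
    """
    decay_epochs = total_epochs // 2
    constant_epochs = total_epochs - decay_epochs
    if epoch < constant_epochs:
        return base_lr
    return base_lr * (total_epochs - epoch) / (decay_epochs + 1)

# 10 epochs for UnrealEgo (decay over the last 5),
# 30 epochs for EgoCap (decay over the last 15)
unrealego_schedule = [linear_decay_lr(e, 10) for e in range(10)]
egocap_schedule = [linear_decay_lr(e, 30) for e in range(30)]
```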

D.3 EgoTAP Network

The network is trained with the AdamW [14] optimizer, using pretrained and frozen heatmap estimator weights. We train for $16$ epochs with a learning rate of $10^{-3}$, a cosine annealing scheduler, and batch size $32$. The early epochs use a linear warmup: one epoch for the UnrealEgo [2] dataset and two epochs for the EgoCap [17] dataset.
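The warmup and cosine annealing can be combined into a single step-wise rule, sketched below. The number of steps per epoch, annealing down to zero, and a warmup that starts from a small positive rate are assumptions for the illustration; `warmup_cosine_lr` is a hypothetical name.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=1e-3):
    """Linear warmup followed by cosine annealing (assumed to anneal to zero)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

STEPS_PER_EPOCH = 100                # hypothetical; depends on dataset size
total_steps = 16 * STEPS_PER_EPOCH   # 16 training epochs
warmup_steps = 1 * STEPS_PER_EPOCH   # one warmup epoch, as for UnrealEgo
schedule = [warmup_cosine_lr(s, total_steps, warmup_steps) for s in range(total_steps)]
```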

D.4 Loss

D.4.1 EgoTAP Network

EgoTAP has two loss terms: a pose error loss (i.e., the joints’ average Euclidean distance) and a cosine-similarity loss [20] that focuses on estimating the correct 3D orientation of each limb.

The pose error loss is defined as follows. Given two 3D joint poses: the predicted pose $\mathbf{p\prime}_{i}$ and the ground truth pose $\mathbf{p}_{i}$ , for $i=1,\ldots,J$ , where $J$ is the total number of joints:

{L_{p}}=\frac{1}{J}\sum_{i=1}^{J}\|\mathbf{p\prime}_{i}-\mathbf{p}_{i}\|_{2}

(34)

The cosine similarity loss is then defined as follows. A limb pose vector for a particular joint is obtained by subtracting the pose of its parent joint from its pose, i.e., $\mathbf{v}_{i}=\mathbf{p}_{i}-\mathbf{p}_{\text{parent}(i)}$ and $\mathbf{v\prime}_{i}=\mathbf{p\prime}_{i}-\mathbf{p\prime}_{\text{parent}(i)}$ for the ground truth and predicted poses, respectively. Given these vectors, the cosine similarity between two limb pose vectors is calculated using their inner product:

{L_{c}}=\frac{1}{J-1}\sum_{i=2}^{J}\frac{\mathbf{v}_{i}\cdot\mathbf{v\prime}_{i}}{\|\mathbf{v}_{i}\|_{2}\|\mathbf{v\prime}_{i}\|_{2}}

(35)

Note that the root joint is ignored since it does not have a parent joint.

The final loss term is a weighted sum of two losses, where we choose $w_{p}=0.1$ and $w_{c}=-0.01$ . The cosine similarity loss weight has a negative sign because higher cosine similarity indicates a more accurate pose.

L=w_{p}L_{p}+w_{c}L_{c}

(36)
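Eqs. (34) to (36) can be written out for a single pose as in the sketch below. It assumes that the root joint is stored at index 0 and omits any safeguard against zero-length limb vectors; the function name `egotap_loss` and the toy skeleton are hypothetical.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def egotap_loss(pred, gt, parents, w_p=0.1, w_c=-0.01):
    """Return (L, L_p, L_c) for one pose; joint 0 is assumed to be the root."""
    J = len(gt)
    # Eq. (34): mean Euclidean distance over joints
    l_p = sum(norm(sub(p, g)) for p, g in zip(pred, gt)) / J
    # Eq. (35): mean cosine similarity of limb vectors, skipping the root
    cos = []
    for i in range(1, J):
        v = sub(gt[i], gt[parents[i]])
        v_pred = sub(pred[i], pred[parents[i]])
        cos.append(dot(v, v_pred) / (norm(v) * norm(v_pred)))
    l_c = sum(cos) / (J - 1)
    # Eq. (36): weighted sum; w_c is negative so higher similarity lowers the loss
    return w_p * l_p + w_c * l_c, l_p, l_c

# Toy 3-joint chain (cm); the prediction is the ground truth shifted by (3, 4, 0)
gt = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 20.0, 5.0]]
parents = [-1, 0, 1]
pred = [[x + 3.0, y + 4.0, z] for x, y, z in gt]
total, l_p, l_c = egotap_loss(pred, gt, parents)
```

A pure translation leaves all limb orientations unchanged, so only the pose error term contributes beyond the constant $w_{c}$ offset.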

D.4.2 Heatmap Reconstruction Network for Ablation

We adopted the heatmap decoder proposed by Tome et al. [20] for the heatmap reconstruction in the ablation study of the Grid ViT Heatmap Encoder. When trained with only the mean squared loss, the network suffers from early convergence to near-zero heatmap outputs. The problem is more severe than for the heatmap estimator network since the reconstruction network does not contain a specialized architecture like the U-Net [18], which helps heatmap generation through multi-resolution features.

Thus, additional loss to match the heatmap’s minimum and maximum values is added if the mean squared loss is higher than the threshold to prevent the network training in the ablation studies from converging to outputting only zeros. We set the threshold to a value empirically found sufficient to ensure avoidance of the zero-only convergence. The loss term guides the network to output a peak in the heatmap, as joint heatmaps do.

The heatmap reconstruction’s target is to minimize the mean squared loss between the predicted heatmap $H$ and the reconstructed heatmap $H^{\prime}$. This is computed as follows:

L_{r}=\frac{1}{N}\sum_{i=1}^{N}(H_{i}-H^{\prime}_{i})^{2}

(37)

The min-max loss $L_{m}$ applies only if the mean squared loss is higher than the threshold $\theta=5.5\times 10^{-4}$. It is computed as follows:

L_{min}=\frac{1}{N}\sum_{i=1}^{N}|\min(H_{i})-\min(H^{\prime}_{i})|

(38)

L_{max}=\frac{1}{N}\sum_{i=1}^{N}|\max(H_{i})-\max(H^{\prime}_{i})|

(39)

L_{m}=\left\{\begin{aligned} &L_{min}+L_{max}&\text{if }L_{r}>\theta\\ &0&\text{otherwise}\end{aligned}\right.

(40)

The weight for the reconstruction $w_{r}$ is set to $1$ and the weight for the min-max penalty $w_{m}$ is set to $10^{-3}$ , resulting in the total loss:

L=w_{r}\cdot(L_{r}+w_{m}\cdot L_{m})

(41)

The loss term does not impact the final result once the network avoids the early convergence and stabilizes. The zero output convergence is still observed for the CNN encoder embedding, so the training was restarted until it did not converge to output only zeros. Such an additional loss term is necessary to get meaningful non-zero reconstruction from the output of the CNN heatmap encoder, showing the difficulty of heatmap information recovery from its embeddings.
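The thresholded min-max penalty of Eqs. (37) to (41) can be sketched as follows. The sketch treats $H$ as a list of $N$ flattened heatmaps and averages the squared error over all pixels, which is one possible reading of Eq. (37); the function name `reconstruction_loss` and the toy heatmaps are hypothetical.

```python
def reconstruction_loss(H, H_rec, theta=5.5e-4, w_r=1.0, w_m=1e-3):
    """Total reconstruction loss; H and H_rec are lists of N flattened heatmaps."""
    n = len(H)
    num_pixels = sum(len(h) for h in H)
    # Eq. (37): mean squared error (assumption: averaged over all pixels)
    l_r = sum((a - b) ** 2 for h, g in zip(H, H_rec) for a, b in zip(h, g)) / num_pixels
    # Eqs. (38)-(39): per-heatmap differences of minimum and maximum values
    l_min = sum(abs(min(h) - min(g)) for h, g in zip(H, H_rec)) / n
    l_max = sum(abs(max(h) - max(g)) for h, g in zip(H, H_rec)) / n
    # Eq. (40): the penalty is active only while the MSE exceeds the threshold
    l_m = l_min + l_max if l_r > theta else 0.0
    # Eq. (41)
    return w_r * (l_r + w_m * l_m)

target = [[0.0, 0.0, 1.0, 0.0]]            # one tiny 4-pixel "heatmap" with a peak
collapsed = [[0.0, 0.0, 0.0, 0.0]]         # all-zero output: penalty active
near_perfect = [[0.0, 0.0, 0.99, 0.0]]     # small error: penalty inactive

loss_collapsed = reconstruction_loss(target, collapsed)
loss_near_perfect = reconstruction_loss(target, near_perfect)
```

For the all-zero output the missing peak adds the $L_{max}$ term, whereas the near-perfect reconstruction falls below $\theta$ and is scored by the mean squared loss alone.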

Appendix E Experiment

MPJPE (PA-MPJPE)
Method	Jumping	Falling Down	Exercising	Pulling	Singing	Rolling	Crawling	Laying
EgoGlass [32]	78.93(63.85)	123.80(92.71)	94.21(69.85)	79.41(55.41)	68.16(50.25)	100.53(87.26)	173.69(111.51)	106.41(86.42)
UnrealEgo [2]	61.66(49.46)	108.73(78.02)	77.14(58.87)	57.01(43.51)	52.61(37.58)	73.38(64.56)	162.90(102.15)	82.60(67.47)
Ego3DPose [8]	52.12(43.29)	86.08(71.72)	67.52(56.39)	48.92(37.02)	43.86(34.54)	74.24(64.81)	138.47(92.93)	78.13(67.23)
Ours	43.05(37.31)	75.77(63.48)	52.76(46.21)	34.45(26.82)	33.96(29.10)	52.24(47.58)	126.23(91.46)	66.38(59.56)
Method	Sitting on the Ground	Crouching	Crouching and Turning	Crouching to Standing	Crouching-Forward	Crouching-Backward	Crouching-Sideways	Standing-Whole Body
EgoGlass	204.35(147.99)	121.76(100.12)	130.24(104.28)	84.31(59.67)	82.84(66.06)	90.36(76.83)	101.37(78.81)	69.78(52.59)
UnrealEgo	190.26(144.27)	96.69(79.29)	116.59(99.94)	66.20(44.92)	56.10(46.62)	62.54(46.21)	72.35(55.87)	52.91(39.14)
Ego3DPose	143.44(122.93)	82.01(67.94)	104.24(83.98)	58.74(41.34)	48.60(38.81)	47.36(36.05)	57.85(48.83)	45.69(35.47)
Ours	121.24(103.27)	67.27(60.07)	89.97(70.78)	41.91(28.68)	33.47(29.52)	34.08(28.30)	40.21(38.51)	32.77(28.27)
Method	Standing-Upper Body	Standing-Turning	Standing to Crouching	Standing-Forward	Standing-Backward	Standing-Sideways	Dancing	Boxing
EgoGlass	69.24(49.36)	77.77(60.27)	83.86(81.65)	76.75(63.23)	78.40(59.83)	82.71(66.46)	82.84(65.59)	66.98(49.13)
UnrealEgo	50.97(34.86)	60.42(46.27)	48.09(40.5)	56.23(47.90)	57.14(44.90)	63.10(50.86)	64.73(51.79)	52.13(38.36)
Ego3DPose	44.14(32.99)	51.45(41.91)	55.66(45.08)	49.48(44.65)	45.35(36.52)	52.25(44.97)	55.30(46.47)	41.55(32.14)
Ours	31.29(25.51)	41.24(35.55)	37.07(30.28)	39.64(38.06)	32.43(30.25)	39.34(37.94)	42.14(38.39)	29.97(25.69)
Method	Wrestling	Soccer	Baseball	Basketball	American Football	Golf
EgoGlass	84.23(62.81)	81.57(60.59)	76.20(56.11)	78.33(57.30)	102.54(84.03)	69.69(48.15)
UnrealEgo	67.85(52.73)	67.09(51.43)	62.15(48.60)	64.73(47.79)	89.57(68.49)	55.87(40.34)
Ego3DPose	57.96(45.94)	59.56(45.23)	56.21(42.17)	56.02(41.94)	77.89(62.56)	48.10(36.01)
Ours	44.15(39.38)	48.27(38.56)	44.83(35.92)	45.19(36.78)	65.30(54.41)	38.97(31.25)

Table 4: Quantitative evaluation results on the UnrealEgo dataset per category.

Appendix F Categorical Evaluation on the UnrealEgo dataset

Table 4 shows the results on the UnrealEgo [2] dataset per motion category. Metrics from all three baseline methods, EgoGlass [32], UnrealEgo [2], and Ego3DPose [8], are shown alongside our method. The MPJPE values are outside the parentheses, and the PA-MPJPE values are inside the parentheses.

F.1 Per Joint Error Distribution

Fig. 9 shows the CDF (cumulative distribution function) of the error of each joint. Two results are shown, both evaluated on the UnrealEgo dataset: one for the UnrealEgo [2] method, as an example of the baseline discussed in the introduction, and one for our method.

The thigh is directly attached to the pelvis, which is the origin of the local pose definition in the dataset. Thus, for all methods, the thigh has the lowest errors among all joints. Lower body joints, calf, foot, and balls generally have significantly higher errors than other joints, except for the hands, which have larger errors than the calf. The error of the hand and the lower arm gets much closer to the upper arm in our method, showing the benefit of the propagation.

	UnrealEgo [2]	EgoCap [17]
Estimated Heatmap	41.06	55.38
Ground Truth Heatmap	6.63	26.63

Table 5: Comparison of the pose estimation error (MPJPE) of our method with estimated and ground truth heatmaps provided as input. Columns indicate the two datasets.

F.2 Impact of the Heatmap Estimation Accuracy

We experimented with our architecture’s performance when the ground truth heatmaps were provided instead of the estimated heatmaps. Table 5 reveals that the limited 2D pose information from the view is a key bottleneck for egocentric pose estimation. The full 2D pose provided by the ground truth heatmap reduces error significantly. Despite the better view provided by the camera attached far from the head, the EgoCap has a higher estimation error with ground truth heatmaps. A relatively small dataset volume for training can also be a bottleneck.

F.3 Impact of ViT Backbone Size

Experiments revealed that the bottleneck of the pose estimation accuracy is not in the computational capacity of the backbone. We experimented with up to 12 layers of the ViT encoders and 8 times larger feature sizes in the ViT encoder. No notable improvement was observed compared to the smaller backbone we chose. The UnrealEgo [2] shows consistent experimental results that the larger ResNet backbones do not improve the pose estimation accuracy.

Appendix G Example Figure

G.1 Limb Heatmaps

The main text mentions that the limb heatmap estimation is less accurate on the EgoCap [17]. The heatmap visualization in Fig. 10 shows noisy lines for limbs.

Appendix H Limitations and Future Works

EgoTAP is limited to single-frame input and relies fully on visual cues. As shown in Table 4, its error on motions with severe occlusion, such as “Crawling” and “Sitting on the Ground”, is very high compared to other motion categories. Unlike in many recently proposed general pose estimation methods, the use of temporal context remains little explored in the egocentric setup. Because the egocentric view is limited, the pose estimation of invisible joints can benefit significantly from temporal context. As one example in the egocentric setup, Wang et al. [23] applied temporal optimization using a variational autoencoder to improve pose estimation in global coordinates.

The method’s applicability can further be tested on monocular and different potential egocentric camera setups. The Propagation Network is based on the stereo setup, which provides sufficient information for a 3D pose when the joint is visible from both views. Thus, the propagation scheme helps child joint pose estimation. While the 3D pose estimation from the single heatmap is not feasible in the monocular setup, pose space is highly constrained, and our method can also be applicable potentially with modification.

The Propagation Network applies to an egocentric view with a specific characteristic. The method itself lacks the dynamicity of GCN-based methods [30], which would make it applicable to many different situations. The tree hierarchy assumption still holds for arbitrary root joints in the skeletal hierarchy, giving room for more dynamicity. Applying such a tree hierarchy-based network has potential for specific joint-related situations, such as collision. Such applications remain future work.

References

Agarap [2018] Abien Fred Agarap. Deep learning using rectified linear units (relu), 2018. arXiv:1803.08375.
Akada et al. [2022] Hiroyasu Akada, Jian Wang, Soshi Shimada, Masaki Takahashi, Christian Theobalt, and Vladislav Golyanik. Unrealego: A new dataset for robust egocentric 3d human motion capture. In European Conference on Computer Vision (ECCV), 2022.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
Feng and Meunier [2022] Miao Feng and Jean Meunier. Skeleton graph-neural-network-based human action recognition: A survey. Sensors, 22(6):2091, 2022.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016.
He et al. [2020] Y. He, R. Yan, K. Fragkiadaki, and S. Yu. Epipolar transformers. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7776–7785, Los Alamitos, CA, USA, 2020. IEEE Computer Society.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 1997.
Kang et al. [2023] Taeho Kang, Kyungjin Lee, Jinrui Zhang, and Youngki Lee. Ego3dpose: Capturing 3d cues from binocular egocentric views. In SIGGRAPH Asia 2023 Conference Papers, New York, NY, USA, 2023. Association for Computing Machinery.
Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
Li et al. [2023a] Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023a.
Li et al. [2023b] Wenhao Li, Hong Liu, Hao Tang, and Pichao Wang. Multi-hypothesis representation learning for transformer-based 3d human pose estimation. Pattern Recognition, page 109631, 2023b.
Liu et al. [2018] Jun Liu, Amir Shahroudy, Dong Xu, Alex C. Kot, and Gang Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell., 40(12):3007–3021, 2018.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Ng et al. [2020] Evonne Ng, Donglai Xiang, Hanbyul Joo, and Kristen Grauman. You2me: Inferring body pose in egocentric video via first and second person interactions. CVPR, 2020.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
Rhodin et al. [2016] Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei Rezvani Nezhad, Hans-Peter Seidel, Bernt Schiele, and Christian Theobalt. Egocap: Egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics, 35, 2016.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015. arXiv:1505.04597.
Shan et al. [2023] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Zhao Wang, Kai Han, Shanshe Wang, Siwei Ma, and Wen Gao. Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579, 2023.
Tome et al. [2019] Denis Tome, Patrick Peluse, Lourdes Agapito, and Hernan Badino. xr-egopose: Egocentric 3d human pose from an hmd camera. In Proceedings of the IEEE International Conference on Computer Vision, pages 7728–7738, 2019.
Tompson et al. [2015] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656. IEEE Computer Society, 2015.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
Wang et al. [2021] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, and Christian Theobalt. Estimating egocentric 3d human pose in global space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11500–11509, 2021.
Wang et al. [2022] Jian Wang, Lingjie Liu, Weipeng Xu, Kripasindhu Sarkar, Diogo Luvizon, and Christian Theobalt. Estimating egocentric 3d human pose in the wild with external weak supervision. CVPR, 2022.
Wang et al. [2023] Jian Wang, Diogo Luvizon, Weipeng Xu, Lingjie Liu, Kripasindhu Sarkar, and Christian Theobalt. Scene-aware egocentric 3d human pose estimation. CVPR, 2023.
Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics.
Xu et al. [2019] Weipeng Xu, Avishek Chatterjee, Michael Zollhoefer, Helge Rhodin, Pascal Fua, Hans-Peter Seidel, and Christian Theobalt. Mo$^{2}$Cap$^{2}$: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
Yan et al. [2018] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
Yu et al. [2023] Bruce X.B. Yu, Zhi Zhang, Yongxu Liu, Sheng-hua Zhong, Yan Liu, and Chang Wen Chen. Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8818–8829, 2023.
Zeng et al. [2021] Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, and Qiang Xu. Learning skeletal graph neural networks for hard 3d pose estimation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11416–11425, 2021.
Zhang et al. [2022] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13232–13242, 2022.
Zhao et al. [2021] Dongxu Zhao, Zhen Wei, Jisan Mahmud, and Jan-Michael Frahm. Egoglass: Egocentric-view human pose estimation from an eyeglass frame. In 2021 International Conference on 3D Vision (3DV), pages 32–41, 2021.
Zhao et al. [2023] Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, and Chen Chen. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8886, 2023.
Zheng et al. [2021] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.