Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Chen, Yuxiao; Zhao, Long; Yuan, Jianbo; Tian, Yu; Xia, Zhaoyang; Geng, Shijie; Han, Ligong; Metaxas, Dimitris N.

arXiv:2207.09644 (cs)

[Submitted on 20 Jul 2022 (v1), last revised 27 Mar 2023 (this version, v3)]

Abstract: Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Unlike such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability for different downstream tasks.
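
The frame, clip, and video levels named in the abstract can be pictured as three nested token groupings. The following is a minimal sketch of that grouping only, not the authors' implementation: the helper names `split_into_clips` and `hierarchy_token_counts`, the sliding-window clip construction, and the `clip_len` and `stride` parameters are all assumptions made for illustration, and no Transformer layers are included.

```python
from typing import Dict, List, Tuple

# A frame is a list of joints; a joint is a list of coordinates.
Frame = List[List[float]]


def split_into_clips(frames: List[Frame], clip_len: int, stride: int) -> List[List[Frame]]:
    """Slide a fixed-length window over the frame axis (assumed clip construction).

    A trailing window shorter than clip_len is dropped.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    last_start = len(frames) - clip_len
    return [frames[s:s + clip_len] for s in range(0, last_start + 1, stride)]


def hierarchy_token_counts(frames: List[Frame], clip_len: int, stride: int) -> Dict[str, Tuple[int, int]]:
    """Return (number of sequences, tokens per sequence) at each level.

    frame level: one sequence per frame, tokens are joints (spatial).
    clip level:  one sequence per clip, tokens are frames (short-term temporal).
    video level: one sequence overall, tokens are clips (long-term temporal).
    """
    clips = split_into_clips(frames, clip_len, stride)
    num_joints = len(frames[0]) if frames else 0
    return {
        "frame": (len(frames), num_joints),
        "clip": (len(clips), clip_len),
        "video": (1, len(clips)),
    }


if __name__ == "__main__":
    # Hypothetical toy input: 8 frames, 3 joints, 2D coordinates.
    toy = [[[float(t), float(j)] for j in range(3)] for t in range(8)]
    print(hierarchy_token_counts(toy, clip_len=4, stride=2))
```

In this sketch each level's sequences would be fed to its own encoder, with the outputs of one level serving as the tokens of the next; how Hi-TRS actually builds and embeds clips is described in the paper itself.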

Comments:	Accepted to ECCV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2207.09644 [cs.CV]
	(or arXiv:2207.09644v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.09644

Submission history

From: Yuxiao Chen
[v1] Wed, 20 Jul 2022 04:21:05 UTC (17,214 KB)
[v2] Tue, 2 Aug 2022 20:09:22 UTC (21,826 KB)
[v3] Mon, 27 Mar 2023 10:35:11 UTC (21,826 KB)
