Interacting hand reconstruction presents significant opportunities in various applications. However, it currently faces challenges such as the difficulty in distinguishing the features of both hands, misalignment of hand meshes with input images, and modeling the complex spatial relationships between interacting hands. In this paper, we propose a multilevel feature fusion interactive network for hand reconstruction (HandFI). Within this network, the hand feature separation module utilizes attentional mechanisms and positional coding to distinguish between left-hand and right-hand features while maintaining the spatial relationship of the features. The hand fusion and attention module promotes the alignment of hand vertices with the image by integrating multi-scale hand features while introducing cross-attention to help determine the complex spatial relationships between interacting hands, thereby enhancing the accuracy of two-hand reconstruction. We evaluated our method with existing approaches using the InterHand 2.6M, RGB2Hands, and EgoHands datasets. Extensive experimental results demonstrated that our method outperformed other representative methods, with performance metrics of 9.38 mm for the MPJPE and 9.61 mm for the MPVPE. Additionally, the results obtained in real-world scenes further validated the generalization capability of our method.
Keywords: MANO; feature fusion; interacting hand reconstruction.