OffsetNet: Towards Efficient Multiple Object Tracking, Detection, and Segmentation

IEEE Trans Pattern Anal Mach Intell. 2024 Nov 4:PP. doi: 10.1109/TPAMI.2024.3485644. Online ahead of print.

Abstract

Offset-based representations have emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network that extends the offset-based approach to multi-object tracking and segmentation (MOTS). Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It does so by formulating all three tasks within a unified pixel-offset-based representation, which yields high efficiency and encourages mutual collaboration among the tasks. OffsetNet has several notable properties: first, its encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block that efficiently aggregates spatial-temporal features; second, the three tasks are decoupled evenly across three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offset prediction module enhances tracking robustness under occlusion. With these merits, OffsetNet achieves 76.83% HOTA on the KITTI MOTS benchmark, the best result among methods that do not rely on 3D detection. Furthermore, it achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, running nearly 3.3 times faster than CenterTrack while delivering better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.
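
To illustrate the general idea behind offset-based tracking association (in the spirit of CenterTrack-style trackers, not OffsetNet's actual implementation, which is described in the full paper), the sketch below matches detections in frame t to tracks from frame t-1 by projecting each detected center backward along its predicted cross-frame offset and greedily pairing it with the nearest previous track center. All names (associate_by_offsets, max_dist, and so on) are hypothetical, and the Euclidean cost and greedy matching strategy are simplifying assumptions.

    import numpy as np

    def associate_by_offsets(curr_centers, pred_offsets, prev_centers, prev_ids,
                             max_dist=32.0):
        """Illustrative offset-based association (hypothetical sketch).

        curr_centers: (N, 2) object centers detected in frame t
        pred_offsets: (N, 2) predicted displacements from frame t back to frame t-1
        prev_centers: (M, 2) track centers from frame t-1
        prev_ids:     (M,)  track identities from frame t-1
        Returns a list of length N with the matched track id, or -1 if unmatched.
        """
        # Project each current detection to its estimated position in frame t-1.
        projected = curr_centers + pred_offsets
        assigned_ids = [-1] * len(curr_centers)
        used = set()
        # Cost: Euclidean distance between projected positions and previous tracks.
        dists = np.linalg.norm(projected[:, None, :] - prev_centers[None, :, :],
                               axis=-1)
        # Greedily commit the lowest-cost pairs first, within a distance gate.
        pairs = ((i, j) for i in range(dists.shape[0]) for j in range(dists.shape[1]))
        for i, j in sorted(pairs, key=lambda p: dists[p]):
            if assigned_ids[i] == -1 and j not in used and dists[i, j] < max_dist:
                assigned_ids[i] = int(prev_ids[j])
                used.add(j)
        return assigned_ids

In practice, trackers of this family typically replace the greedy loop with Hungarian matching and add birth/death handling for unmatched detections and tracks; occlusion robustness, as the abstract notes, comes from predicting offsets across longer frame gaps rather than only between adjacent frames.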