Temporal Keypoint Matching and Refinement Network for Pose Estimation and Tracking
Multi-person pose estimation and tracking in realistic videos is challenging due to factors such as occlusion, fast motion, and pose variation. Top-down approaches are commonly used for this task and involve three stages: person detection, single-person pose estimation, and pose association across time. Recently, significant progress has been made in person detection and single-person pose estimation. In this paper, we focus on improving pose association and pose estimation in video to build a strong pose estimator and tracker. To this end, we propose a novel temporal keypoint matching and refinement network. Specifically, we propose two network modules, temporal keypoint matching and temporal keypoint refinement, which are incorporated into a single-person pose estimation network. The temporal keypoint matching module learns a similarity metric for matching keypoints across frames. Pose matching is then performed by aggregating keypoint similarities between poses in adjacent frames. The temporal keypoint refinement module corrects individual poses by using their associated poses in neighboring frames as temporal context. We validate the effectiveness of the proposed network on two benchmark datasets, PoseTrack 2017 and PoseTrack 2018. Experimental results show that our approach achieves state-of-the-art performance on both datasets.
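
To make the pose matching step concrete, the following is a minimal sketch of matching poses between adjacent frames by aggregating per-keypoint similarities. It is an illustration under stated assumptions, not the network described above: cosine similarity between keypoint feature vectors stands in for the learned similarity metric, the aggregation is a plain mean over keypoints present in both poses, and the assignment is greedy. The function names and the `min_sim` threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors.

    Assumption: this is a stand-in for the learned keypoint similarity metric.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pose_similarity(pose_a, pose_b):
    """Aggregate keypoint similarities into a single pose similarity.

    Each pose is a list of per-keypoint feature vectors, with None marking a
    missing keypoint. Assumption: aggregation is a mean over keypoints that
    are present in both poses.
    """
    sims = [
        cosine(fa, fb)
        for fa, fb in zip(pose_a, pose_b)
        if fa is not None and fb is not None
    ]
    return sum(sims) / len(sims) if sims else 0.0


def match_poses(prev_poses, curr_poses, min_sim=0.5):
    """Greedy one-to-one matching between poses in adjacent frames.

    Returns a dict mapping each matched current-frame pose index to a
    previous-frame pose index. Pairs scoring below min_sim are left unmatched.
    """
    pairs = sorted(
        (
            (pose_similarity(p, c), i, j)
            for i, p in enumerate(prev_poses)
            for j, c in enumerate(curr_poses)
        ),
        reverse=True,
    )
    used_prev, matches = set(), {}
    for sim, i, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if i in used_prev or j in matches:
            continue  # each pose can be matched at most once
        used_prev.add(i)
        matches[j] = i
    return matches


# Toy example: two people whose order swaps between frames.
prev_frame = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
curr_frame = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
matches = match_poses(prev_frame, curr_frame)
```

In this toy example, current pose 0 is matched to previous pose 1 and current pose 1 to previous pose 0, so identities follow the features and not the detection order. The greedy step could be swapped for an optimal assignment solver such as `scipy.optimize.linear_sum_assignment`; that choice is likewise an assumption and is not stated in the abstract.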