Delving into Details: Synopsis-to-Detail Networks for Video Recognition
"In this paper, we explore the details in video recognition with the aim to improve the accuracy. It is observed that most failure cases in recent works fall on the mis-classifications among very similar actions (such as high kick vs. side kick) that need a capturing of fine-grained discriminative details. To solve this problem, we propose synopsis-to-detail networks for video action recognition. Firstly, a synopsis network is introduced to predict the top-k likely actions and generate the synopsis (location & scale of details and contextual features). Secondly, according to the synopsis, a detail network is applied to extract the discriminative details in the input and infer the final action prediction. The proposed synopsis-to-detail networks enable us to train models directly from scratch in an end-to-end manner and to investigate various architectures for synopsis/detail recognition. Extensive experiments on benchmark datasets, including Kinetics-400, Mini-Kinetics and Something-Something V1 & V2, show that our method is more effective and efficient than the competitive baselines. Code is available at: https://github.com/liang4sx/S2DNet."