Self-Supervised Multi-Task Procedure Learning from Instructional Videos
We address the problem of unsupervised procedure learning from instructional videos of multiple tasks using Deep Neural Networks (DNNs). Unlike existing works, we assume that training videos come from multiple tasks without key-step annotations or grammars, and the goals are to classify a test video into its underlying task and to localize its key-steps. Our DNN learns task-dependent attention features from informative regions of each frame without ground-truth bounding boxes, and learns to discover and localize key-steps without key-step annotations by using an unsupervised subset selection module as a teacher. It also learns to classify an input video from the discovered key-steps through a learnable key-step feature pooling mechanism that extracts and combines key-step-based features for task recognition. Through experiments on two instructional video datasets, we show the effectiveness of our method for unsupervised localization of procedure steps and for video classification.
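To make the idea of key-step feature pooling concrete, the following is a minimal NumPy sketch and not the architecture used in this work. It assumes that each frame is softly assigned to one of K key-steps by similarity to a prototype vector, that each key-step feature is the assignment-weighted mean of the frame features, and that a learnable weight vector mixes the K pooled features before a linear classifier. The names `keystep_pool`, `classify`, `prototypes`, `combine_w`, and `clf_w` are hypothetical and introduced only for illustration.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def keystep_pool(frames, prototypes):
    """Pool frame features into one feature per key-step (illustrative only).

    frames:     (T, d) per-frame features
    prototypes: (K, d) hypothetical key-step prototypes
    returns:    (K, d) pooled key-step features
    """
    # Soft assignment of each frame to the K key-steps (rows sum to 1).
    assign = softmax(frames @ prototypes.T, axis=1)        # (T, K)
    # Normalize over time so each key-step feature is a weighted mean of frames.
    weights = assign / assign.sum(axis=0, keepdims=True)   # (T, K)
    return weights.T @ frames                              # (K, d)

def classify(frames, prototypes, combine_w, clf_w):
    """Combine pooled key-step features and predict task probabilities.

    combine_w: (K,) learnable mixing logits over key-steps (assumed)
    clf_w:     (num_tasks, d) linear classifier weights (assumed)
    """
    pooled = keystep_pool(frames, prototypes)   # (K, d)
    mix = softmax(combine_w, axis=0)            # (K,) convex mixing weights
    video_feat = mix @ pooled                   # (d,)
    return softmax(clf_w @ video_feat, axis=0)  # (num_tasks,)
```

In this sketch the mixing weights and the classifier weights would be the trainable parameters; how the key-step assignments are supervised (here, by a teacher such as the subset selection module) is outside its scope.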