CoTeRe-Net: Discovering Collaborative Ternary Relations in Videos
Modeling relations is crucial to understanding videos for action and behavior recognition. Current relation models mainly reason about relations of invisibly implicit cues, while important relations of visually explicit cues are rarely considered, and the collaboration between them is usually ignored. In this paper, we propose a novel relation model that discovers relations of both implicit and explicit cues, as well as their collaboration, in videos. Our model concerns Collaborative Ternary Relations (CoTeRe), where the ternary relation involves channel (C, for implicit), temporal (T, for implicit), and spatial (S, for explicit) relations (R). We devise a flexible and effective CTSR module that makes the ternary relations collaborate in 3D-CNNs, and then construct CoTeRe-Nets for action recognition. Extensive experiments, covering both ablation studies and performance evaluation, demonstrate that our CTSR module is significantly effective, with gains of approximately 3%, and that our CoTeRe-Nets outperform state-of-the-art approaches on three popular benchmarks. Boost analysis and relation visualization also validate that our method effectively discovers relations of both implicit and explicit cues. Our code is available at https://github.com/zhenglab/cotere-net.