Dual Perspective Network for Audio-Visual Event Localization
Varshanth Rao, Md Ibrahim Khalil, Haoda Li, Peng Dai, Juwei Lu
The Audio-Visual Event Localization (AVEL) problem involves tackling three core sub-tasks: the creation of efficient audio-visual representations using cross-modal guidance, the formation of short-term temporal feature aggregations, and their accumulation to achieve long-term dependency resolution. These sub-tasks are often performed by tailored modules, where limited inter-module interaction restricts feature learning to a serialized manner. Past works have traditionally viewed videos as temporally sequenced multi-modal streams. We improve and extend this view by proposing a novel architecture, the Dual Perspective Network (DPNet), that (1) additionally operates on an intuitive graph perspective of a video to simultaneously facilitate cross-modal guidance and short-term temporal aggregation using a Graph Neural Network (GNN), (2) deploys a Temporal Convolutional Network (TCN) to achieve long-term dependency resolution, and (3) encourages interactive feature learning through an acyclic feature refinement process that alternates between the GNN and TCN. Further, we introduce the Relational Graph Convolutional Transformer, a novel GNN integrated into the DPNet, to express and attend to each segment node's relational representation across its different relational neighborhoods. Lastly, we diversify the input to the DPNet through a new video augmentation technique called Replicate and Link, which outputs semantically identical video blends whose graph representations can be linked to those of the source videos. Experiments reveal that our DPNet framework outperforms prior state-of-the-art methods by large margins on the AVEL task on the public AVE dataset, while extensive ablation studies corroborate the efficacy of each proposed method.
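The alternating refinement described above can be illustrated with a minimal sketch. Note this is a toy illustration under stated assumptions, not the paper's implementation: `gnn_step` and `tcn_step` are hypothetical placeholders standing in for the graph-based cross-modal aggregation and the temporal convolution, and the feature shapes and number of refinement rounds are illustrative choices only.

```python
import numpy as np

def gnn_step(audio, visual):
    # Placeholder for the graph-perspective step (cross-modal guidance +
    # short-term temporal aggregation): here, a simple average fusion of
    # the two modality streams. Purely illustrative.
    fused = 0.5 * (audio + visual)
    return fused, fused

def tcn_step(features):
    # Placeholder for long-term dependency resolution via temporal
    # convolution: a moving average along the segment (time) axis.
    kernel = np.ones(3) / 3.0
    return np.stack(
        [np.convolve(f, kernel, mode="same") for f in features.T]
    ).T

def dpnet_refine(audio, visual, rounds=2):
    """Sketch of refinement alternating between GNN and TCN views."""
    for _ in range(rounds):
        audio, visual = gnn_step(audio, visual)  # graph perspective
        audio = tcn_step(audio)                  # stream perspective
        visual = tcn_step(visual)
    return audio, visual

T, D = 10, 4  # e.g. 10 one-second segments with 4-dim features (toy sizes)
a, v = np.random.rand(T, D), np.random.rand(T, D)
ra, rv = dpnet_refine(a, v)
print(ra.shape, rv.shape)  # feature shapes are preserved across rounds
```

The point of the sketch is the control flow: each round first mixes the modalities over the video's graph view, then smooths each refined stream over time, so the two perspectives repeatedly inform one another rather than running once in series.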