Multiview Detection with Feature Perspective Transformation
Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview detection system, we need to answer two important questions. First, how should we aggregate cues from multiple views? Second, how should we aggregate information from spatially neighboring locations? To address these questions, we introduce a novel multiview detector, MVDet. During multiview aggregation, for each location on the ground, existing methods use multiview anchor box features as representation, which potentially limits performance as pre-defined anchor boxes can be inaccurate. In contrast, via feature map perspective transformation, MVDet employs anchor-free representations with feature vectors directly sampled from corresponding pixels in multiple views. For spatial aggregation, different from previous methods that require design and operations outside of neural networks, MVDet takes a fully convolutional approach with large convolutional kernels on the multiview aggregated feature map. The proposed model is end-to-end learnable and achieves 88.2% MODA on Wildtrack dataset, outperforming the state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and MultiviewX dataset are available at https://github.com/hou-yz/MVDet/."