Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection
"LiDAR and Camera sensors have complementary properties: LiDAR senses accurate positioning, while camera provides rich texture and color information. Fusing these two modalities can intuitively improve the performance of 3D detection. Most multi-modal fusion methods use networks to extract features of LiDAR and camera modality respectively, then simply add or concancate them together. We argue that these two kinds of signals are completely different, so it is not proper to combine these two heterogeneous features directly. In this paper, we propose EMMF-Det to do multi-modal fusion leveraging range and camera images. EMMF-Det uses self-attention mechanism to do feature re-weighting on these two modalities interactively, which can enchance the features with color, texture and localiztion information provided by LiDAR and camera signals. On the Waymo Open Dataset, EMMF-Det acheives the state-of-the-art performance. Besides this, evaluation on self-built dataset further proves the effectiveness of our method."