ECVA | European Computer Vision Association

ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, Leonidas Guibas ;

Abstract

In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a extit{fine-grained} object class and the underlying scene contains extit{multiple} object instances of that class. Due to the scarcity and unsuitability of existent 3D-oriented linguistic resources for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) extbf{ extit{Sr3D}}, which contains 83.5K template-based utterances leveraging extit{spatial relations} with other fine-grained object classes to localize a referred object in a given scene, and ii) extbf{ extit{Nr3D}} which contains 41.5K extit{natural, free-form}, utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances of either datasets, human listeners can recognize the referred object with high ($>$86\%, 92\% resp.) accuracy. By tapping on this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object extit{directly} in a 3D scene. Our key technical contribution is designing an approach for combining linguistic and geometric information (in the form of 3D point-clouds) and creating multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that language-assisted 3D object identification outperforms language-agnostic object classifiers."

Related Material

[pdf]