Bounding-box Channels for Visual Relationship Detection
Recognizing the relationship between multiple objects in an image is essential for a deeper understanding of the meaning of the image. However, current visual recognition methods are still far from reaching human-level accuracy. Recent approaches have tackled this task by combining image features with semantic and spatial features, but the way they relate them to each other is weak, mostly because the spatial context in the image feature is lost. In this paper, we propose the bounding-box channels, a novel architecture capable of relating the semantic, spatial, and image features strongly. Our network learns bounding-box channels, which are initialized according to the position of the objects and the label of objects, and concatenated to the image features extracted from the objects. Then, they are input together to the relationship estimator. Our method can retain the spatial information in the image features, and strongly associate them with the semantic and spatial features. This way, our method is capable of effectively emphasizing the features in the object area for a better modeling of the relationships within objects. In addition, we experimentally show that our bounding-box channels have a high generalization ability. Our evaluation results show the efficacy of our architecture outperforming previous works in visual relationship detection. "