Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers
"We study recognizing attributes for objects in visual scenes. We consider attributes to be any phrases that describe an object’s physical and semantic properties, and its relationships with other objects. Existing work studies attribute prediction in a closed setting with a fixed set of attributes, and implements a model that uses limited context. We propose TAP, a new Transformer-based model that can utilize context and predict attributes for multiple objects in a scene in a single forward pass, and a training scheme that allows this model to learn attribute prediction from image-text datasets. Experiments on the large closed attribute benchmark VAW show that TAP outperforms the SOTA by 5.1% mAP. In addition, by utilizing pretrained text embeddings, we extend our model to OpenTAP which can recognize novel attributes not seen during training. In a large-scale setting, we further show that OpenTAP can predict a large number of seen and unseen attributes that outperforms large-scale vision-text model CLIP by a decisive margin."