Panoptic Scene Graph Generation
Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, Ziwei Liu
"Existing research addresses scene graph generation (SGG), a critical technology for scene understanding in images, from the detection perspective: objects are detected with bounding boxes, and their pairwise relationships are then predicted. We argue that this paradigm causes several problems that impede the progress of the field. For instance, bounding-box-based labels in current datasets usually contain redundant information, such as hair, and miss background information that is crucial to understanding context. In this work, we introduce panoptic scene graph generation (the PSG task), a new problem that requires the model to generate more comprehensive scene graph representations based on panoptic segmentations rather than rigid bounding boxes. We create a high-quality PSG dataset containing 51k well-annotated images from COCO and Visual Genome so that the community can track progress. For benchmarking, we build three two-stage models, adapted from current state-of-the-art SGG methods, and a one-stage model called PSGTR, which is based on DETR, an efficient Transformer-based detector. We further propose PSGFormer, which achieves significant improvements over PSGTR through two novel extensions: 1) separate modeling of objects and relations as queries in two Transformer decoders, and 2) a prompting-like interaction mechanism. Finally, we share insights on open challenges and future directions."
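To make the task definition concrete: a panoptic scene graph pairs a panoptic segmentation (every pixel assigned to exactly one segment, including "stuff" like background regions) with relation triplets over those segments. The sketch below is a minimal illustration of this output structure; the class names and fields are hypothetical, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    # One panoptic segment: a category plus the set of pixels it owns.
    category: str
    pixels: set  # set of (row, col) coordinates

@dataclass
class Relation:
    # A relation triplet referencing segments by index.
    subject: int
    predicate: str
    object: int

# Toy 2x2 image: every pixel belongs to exactly one segment, so
# background "stuff" such as grass is covered, unlike bounding boxes.
segments = [
    Segment("person", {(0, 0), (0, 1)}),
    Segment("grass",  {(1, 0), (1, 1)}),
]
relations = [Relation(subject=0, predicate="standing on", object=1)]

def triplets(segments, relations):
    """Render relations as human-readable (subject, predicate, object)."""
    return [(segments[r.subject].category, r.predicate,
             segments[r.object].category) for r in relations]

print(triplets(segments, relations))  # [('person', 'standing on', 'grass')]
```

Note how the segments partition the image: this pixel-level coverage is what lets a PSG express relations involving background regions that box-based SGG labels leave out.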