Referring Object Manipulation of Natural Images with Conditional Classifier-Free Guidance
"We introduce the problem of referring object manipulation (ROM), which aims to generate photo-realistic image edits from two textual descriptions: 1) a text referring to an object in the input image and 2) a text describing how to manipulate the referred object. A successful ROM model would let users manipulate images with natural language alone, removing the need to learn sophisticated image editing software. We present one of the first approaches to this challenging multi-modal problem, combining a referring image segmentation method with a text-guided diffusion model. Specifically, we propose a conditional classifier-free guidance scheme that better guides the diffusion process along the direction from the referring expression to the target prompt. In addition, we provide a new localized ranking method and further improvements that make the generated edits more robust. Experimental results show that the proposed framework can serve as a simple but strong baseline for referring object manipulation. Comparisons with several baseline text-guided diffusion models further demonstrate the effectiveness of our conditional classifier-free guidance technique."
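The abstract does not state the guidance formula, so the following is only a minimal sketch of one plausible reading. Standard classifier-free guidance extrapolates from an unconditional noise prediction toward a conditional one. Here the unconditional prediction is assumed to be replaced by the prediction conditioned on the referring expression, so that guidance moves "along the direction from the referring expression to the target prompt". The names `guided_noise`, `masked_blend`, and `scale` are hypothetical, and restricting the edit to the segmented object with a mask blend is likewise an assumption, not the authors' confirmed procedure.

```python
from typing import List

Vector = List[float]  # a flattened noise prediction or image, for illustration only


def guided_noise(eps_ref: Vector, eps_target: Vector, scale: float) -> Vector:
    """Assumed form of conditional classifier-free guidance.

    eps_ref:    noise predicted when conditioning on the referring expression
    eps_target: noise predicted when conditioning on the target prompt
    scale:      guidance strength (0 keeps eps_ref, 1 returns eps_target)
    """
    # Start at the referring-expression prediction and step along the
    # direction that points toward the target-prompt prediction.
    return [r + scale * (t - r) for r, t in zip(eps_ref, eps_target)]


def masked_blend(edited: Vector, original: Vector, mask: Vector) -> Vector:
    """Assumed localization step: keep edited values only inside the mask.

    mask holds 1.0 for pixels of the referred object and 0.0 elsewhere, as a
    referring image segmentation model might produce.
    """
    return [m * e + (1.0 - m) * o for e, o, m in zip(edited, original, mask)]


if __name__ == "__main__":
    eps_ref = [0.0, 0.0, 1.0]
    eps_target = [1.0, 2.0, 1.0]
    print(guided_noise(eps_ref, eps_target, scale=3.0))
    print(masked_blend([9.0, 9.0], [1.0, 1.0], [1.0, 0.0]))
```

With `scale` above 1 the prediction is pushed past the target-conditioned estimate, mirroring how standard classifier-free guidance amplifies the conditional direction. Components where the two predictions agree are left unchanged.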