ECVA | European Computer Vision Association

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff

Abstract

"Image generation and manipulation requires technical expertise to use, inhibiting adoption. Current methods rely heavily on training to a specific domain (e.g., only faces), manual work or algorithm tuning to latent vector discovery, and manual effort in mask selection to alter only a part of an image. We address all of these usability constraints while producing images of high visual and semantic quality through a unique combination of OpenAI’s CLIP (Radford et al., 2021), VQGAN (Esser et al., 2021), and a generation augmentation strategy to produce VQGAN-CLIP. This allows generation and manipulation of images using natural language text, without further training on any domain datasets. We demonstrate on a variety of tasks how VQGAN-CLIP produces higher visual quality outputs than prior, less flexible approaches like minDALL-E (Kakaobrain, 2021) and Open-Edit (Liu, 2020), despite not being trained for the tasks presented."
