Learning What and Where to Draw
Introduction
Generative Adversarial Networks (GANs) have been successfully used to synthesize compelling real-world images. In what follows we outline an enhanced GAN called the Generative Adversarial What-Where Network (GAWWN). In addition to accepting a noise vector as input, this network also accepts instructions describing what content to draw and where to draw it. Traditionally, these models use simple conditioning variables such as a class label or a non-localized caption. The authors of 'Learning What and Where to Draw' argue that image synthesis is drastically enhanced by incorporating a notion of localized objects.
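To make this interface concrete, the following is a minimal sketch (not the paper's architecture) of a generator that conditions on both a "what" vector (a text embedding) and a "where" vector (here a normalized bounding box). The module names, layer sizes, and use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy generator that concatenates noise, a text embedding ("what"),
    and a location vector ("where") before mapping to image pixels."""
    def __init__(self, z_dim=100, text_dim=128, where_dim=4, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + text_dim + where_dim, 512),
            nn.ReLU(),
            nn.Linear(512, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, text_emb, where):
        # A traditional conditional GAN would concatenate only z with a class
        # label or caption embedding; here a "where" vector is supplied as well.
        return self.net(torch.cat([z, text_emb, where], dim=1))

G = ConditionalGenerator()
fake = G(torch.randn(8, 100), torch.randn(8, 128), torch.rand(8, 4))
print(fake.shape)  # torch.Size([8, 12288])
```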
The main goal in constructing the GAWWN network is to separate the questions of 'what' and 'where' to modify the image at each step of the computational process. Prior to elaborating on the experimental results, the authors note that this model benefits from greater parameter efficiency and produces more interpretable sample images. The proposed model learns to perform location- and content-controllable image synthesis on the Caltech-UCSD Birds (CUB) dataset and the MPII Human Pose (MHP) dataset.
A highlight of this work is that the authors demonstrate two ways to encode spatial constraints into the GAN. First, they provide an implementation showing how to condition on the coarse location of a bird by incorporating spatial masking and cropping modules into a text-conditional Generative Adversarial Network; this technique is implemented using spatial transformers. Second, they demonstrate how to condition on part locations of birds and humans, given as a set of normalized (x,y) coordinates.
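The sketch below (an illustration, not the authors' implementation) shows one way such spatial constraints can be turned into tensors that a convolutional generator or discriminator could consume: a coarse bounding box becomes a binary mask over an M x M feature grid, and part locations become per-keypoint spatial maps. The helper names, the grid size, and the use of NumPy are assumptions.

```python
import numpy as np

def bbox_to_mask(bbox, M=16):
    """Turn a coarse bounding box (x, y, w, h), normalized to [0, 1],
    into a binary M x M spatial mask that can gate feature maps."""
    x, y, w, h = bbox
    mask = np.zeros((M, M), dtype=np.float32)
    x0, y0 = int(x * M), int(y * M)
    x1, y1 = int(np.ceil((x + w) * M)), int(np.ceil((y + h) * M))
    mask[y0:y1, x0:x1] = 1.0
    return mask

def keypoints_to_maps(keypoints, M=16):
    """Encode K part locations, given as normalized (x, y) pairs (or None if
    unobserved), as a K x M x M tensor with one active cell per visible part."""
    maps = np.zeros((len(keypoints), M, M), dtype=np.float32)
    for k, pt in enumerate(keypoints):
        if pt is not None:
            x, y = pt
            maps[k, min(int(y * M), M - 1), min(int(x * M), M - 1)] = 1.0
    return maps

# Example: a bird roughly centered in the image, with two of three parts annotated.
box_mask = bbox_to_mask((0.3, 0.35, 0.4, 0.3))
part_maps = keypoints_to_maps([(0.55, 0.4), None, (0.3, 0.6)])
print(box_mask.shape, part_maps.shape)  # (16, 16) (3, 16, 16)
```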
Related Work
This is not the first paper to show how deep convolutional networks can be used to generate synthetic images. Other notable works include:
- Dosovitskiy et al. (2015) trained a deconvolutional network to generate 3D chair renderings conditioned on a set of graphics codes indicating shape, position, and lighting
- Yang et al. (2015) followed with a recurrent convolutional encoder-decoder that learned to apply incremental 3D rotations to generate sequences of rotated chair and face images
- Reed et al. (2015) trained a network to generate images that solved visual analogy problems
Summary of Contributions
- Novel architecture for text- and location-controllable image synthesis, which yields more realistic and high-resolution Caltech-UCSD bird samples
- A text-conditional object part completion model enabling a streamlined user interface for specifying part locations
- Exploratory results and a new dataset for pose-conditional text-to-human image synthesis