Conditional Image Synthesis with Auxiliary Classifier GANs
Abstract: "In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data." (Odena et al., 2016)
Introduction
Motivation
The authors introduce a GAN architecture for generating high resolution images from the ImageNet dataset. They show that this architecture makes it possible to split the generation process into many sub-models. They further suggest that GANs have trouble generating globally coherent images, and that this architecture is responsible for the coherence of their sampled images. They also experimentally demonstrate that generating higher resolution images allow the model to encode more class-specific information, making them more visually discriminable than lower resolution images even after they have been resized to the same resolution.
The second half of the paper introduces metrics for assessing visual discriminability and diversity of synthesized images. The discussion of image diversity in particular is important due to the tendency for GANs to 'collapse' to only produce one image that best fools the discriminator (Goodfellow et al., 2014).
Previous Work
Of all image synthesis methods (e.g. variational autoencoders, autoregressive models, invertible density estimators), GANs have become one of the most popular and successful due to their flexibility and the ease with which they can be sampled from. A standard GAN framework pits a generative model $G$ against a discriminative adversary $D$. The goal of $G$ is to learn a mapping from a latent space $Z$ to a real space $X$ to produce examples (generally images) indistinguishable from training data. The goal of the $D$ is to iteratively learn to predict when a given input image is from the training set or a synthesized image from $G$. Jointly the models are trained to solve the game-theoretical minimax problem defined by: $$\underset{G}{min}\underset{D}{max}V(G,D)=E_{X\sim p_{data}(x)}[log(D(X))]+E_{z\sim p_{Z}(z)}[log(1-D(G(Z)))]$$
Contributions
Model
The authors propose a conditional GAN that both takes the class to be synthesized as input to the, and includes a classification accuracy term in the loss function of the discriminator. They also split the generation process into many class-specific submodels.
Measurement Methods
The authors propose two measurement methods to assess the discriminability and diversity of the generated images.
Experimental Results on Image Resolution
Results
Critique
Model
Not very different from other GANs. Some unsupported claims about stabilizing training etc.
Metrics
Experiments
Discussion of overfitting says b/c nearest neighbours under L1 measure in pixel space are not similar looking it doesn't overfit.
Conclusion
References
- Odena, A., Olah, C., & Shlens, J. (2016). Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).