Wasserstein Auto-encoders
Introduction
Early successes in the field of representation learning were based on supervised approaches, which used large labelled datasets to achieve impressive results. On the other hand, popular unsupervised generative modeling methods mainly consisted of probabilistic approaches focusing on low dimensional data. In recent years, there have been models proposed which try to combine these two approaches. One such popular method is called variational auto-encoders (VAEs). VAEs are theoretically elegant but have a major drawback of generating blurry sample images when used for modeling natural images. In comparison, generative adversarial networks (GANs) produce much sharper sample images but have their own list of problems which include lack of encoder, harder to train, and "mode collapse" problem. Mode collpase problem refers to the inability of the model to capture all the variability in the true data distribution. Currently, there has been a lot of activity around finding and evaluating numerous GANs architectures and combining VAEs and GANs but a model which combines the best of both GANs and VAEs is yet to be discovered.
The work done in this paper builds up on the theoretical work done in [4]. The authors tackle generative modeling using optimal transport (OT). The OT cost is defined as measure of distance between probability distributions. One of the feature of OT cost which is beneficial is that it provides much weaker topology when compared to other costs including f-divergences which are associated with the original GAN algorithms. The problem with stronger notions of distances such f-divergences is that they often max out and provide no useful gradients for training. In comparison, the OT cost has been claimed to behave much more nicely [5, 8]. Despite the preceding claim, the implementation, which is similar to GANs, still requires addition of a constraint or a regularization term into the the objective function.
Original Contributions
Let [math]\displaystyle{ P_X }[/math] be the true but unknown data distribution, [math]\displaystyle{ P_G }[/math] be the latent variable model specified by the prior distribution [math]\displaystyle{ P_Z }[/math] of latent codes [math]\displaystyle{ Z \in \mathcal{Z} }[/math] and the generative model [math]\displaystyle{ P_G(X|Z) }[/math] of the data points [math]\displaystyle{ X \in \mathcal{X} }[/math] given [math]\displaystyle{ Z }[/math]. The goal in this paper is to minimize [math]\displaystyle{ OT\ W_c(P_X, P_G) }[/math].
The main contributions are given below:
- A new class of auto-encoders called Wasserstein Auto-Encoders (WAE). WAE minimize the optimal transport [math]\displaystyle{ W_c(P_X, P_G) }[/math] for any cost function [math]\displaystyle{ c }[/math]. As is the case with VAEs, WAE objective function is also made of two terms: the c-reconstruction cost and a regularizer term which penalizes the discrepancy between two distributions in [math]\displaystyle{ \mathcal{Z} }[/math]. Note that when [math]\displaystyle{ c }[/math] is the squared cost and the regularizer term is the GAN objective, WAE is equivalent to the adversarial auto-encoders described in [2].
- Experimental results of using WAE on MNIST and CelebA datasets with squared cost [math]\displaystyle{ c(x, y) = ||x - y||_2^2 }[/math]. The results of these experiments show that WAEs have the good features of VAEs such as stable training, encoder-decoder architecture, and a nice latent manifold structure while simultaneously improving the quality of the generated samples.
- Two different regularizers. One based on GANs and adversarial training in the latent space [math]\displaystyle{ \mathcal{Z} }[/math]. The other one is based on something called "Maximum Mean Discrepancy" which known to have high performance when matching high dimensional standard normal distributions. The second regularizer also makes the problem fully adversary-free min-min optimization problem.
- The final contribution is the mathematical analysis used to derive the WAE objective function. In particular, the mathematical analysis shows that in the case of generative models, the primal form of [math]\displaystyle{ W_c(P_X, P_G) }[/math] is equivalent to a problem which deals with the optimization of a probabilistic encoder [math]\displaystyle{ Q(Z|X) }[/math]