Difference between revisions of "From Variational to Deterministic Autoencoders"


Revision as of 04:57, 1 November 2020

Presented by

Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf


This paper presents Regularized Autoencoders (RAEs), a deterministic alternative to Variational Autoencoders (VAEs) for generative modelling. The authors investigate how forcing an arbitrary prior [math]p(z) [/math] within VAEs could be replaced by a regularization scheme added to the loss function. Furthermore, they propose a generative mechanism for RAEs based on an ex-post density estimation step, which can also be applied to existing VAEs. Finally, they conduct an empirical comparison between VAEs and RAEs to demonstrate that the latter generate samples of comparable or better quality on images and structured objects.


The authors point to several drawbacks currently associated with VAEs, including:

  • over-regularisation induced by the KL divergence term within the objective (Tolstikhin et al., 2017)
  • posterior collapse in conjunction with powerful decoders (van den Oord et al., 2017)
  • increased variance of gradients caused by approximating expectations through sampling (Burda et al., 2015; Tucker et al., 2017)

These issues motivate their consideration of alternatives to the variational framework adopted by VAEs.

Furthermore, the authors consider the VAE's introduction of random noise within the reparameterization [math] z = \mu(x) +\sigma(x)\epsilon [/math] as having a regularization effect, whereby it promotes the learning of a smoother latent space. This motivates their exploration of regularization schemes within an autoencoder's loss that could be substituted in place of the VAE's random noise injection. This would allow the variational framework to be eliminated and its associated drawbacks to be circumvented.

Removing random noise injection from VAEs eliminates the ability to sample from [math]p(z)[/math] and in turn to produce generated samples. This motivates the authors to fit a density estimate of the latent space after training so that the sampling mechanism can be reclaimed.

Related Work

The authors point to similarities between their framework and Wasserstein Autoencoders (WAEs) (Tolstikhin et al., 2017), for which a deterministic version can be trained. However, RAEs utilize a different loss function and differ in their implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) (van den Oord et al., 2017; Razavi et al., 2019) can be viewed as deterministic. VQ-VAEs also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.

Framework Architecture


The Regularized Autoencoder proposes three modifications to the existing VAE framework. Firstly, it eliminates the injection of random noise [math]\epsilon[/math] from the reparameterization of the latent variable [math] z [/math]. Secondly, it proposes a redesigned loss function [math]\mathcal{L}_{RAE}[/math]. Finally, it proposes an ex-post density estimation procedure for generating samples from the RAE.

Eliminating Random Noise

The authors propose eliminating the injection of random noise [math]\epsilon[/math] from the reparameterization of the latent variable [math] z = \mu(x) +\sigma(x)\epsilon [/math], resulting in an encoder [math]E_{\phi} [/math] that deterministically maps a data point [math] x [/math] to a latent variable [math] z [/math].
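The difference between the two encodings can be sketched in a few lines of NumPy. This is only an illustration: the linear maps `W_mu` and `W_logvar` are hypothetical stand-ins for trained encoder networks and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder heads (fixed linear maps standing in for trained networks).
W_mu = rng.normal(size=(2, 4))
W_logvar = rng.normal(size=(2, 4))

def vae_encode(x, rng):
    """Stochastic VAE encoding: z = mu(x) + sigma(x) * eps with eps ~ N(0, I)."""
    mu = W_mu @ x
    sigma = np.exp(0.5 * (W_logvar @ x))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def rae_encode(x):
    """Deterministic RAE encoding: the noise term is dropped, so z = E_phi(x)."""
    return W_mu @ x

x = rng.normal(size=4)
z_a = vae_encode(x, rng)      # two calls on the same x give different codes
z_b = vae_encode(x, rng)
z_det_a = rae_encode(x)       # two calls on the same x give the same code
z_det_b = rae_encode(x)
```

Encoding the same input twice yields two different codes under the VAE scheme but a single fixed code under the deterministic scheme.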

The current variational framework of VAEs enforces regularization on the encoder posterior through the KL-divergence term of its training loss function, which is minimized during training: \begin{align} \mathcal{L}_{ELBO} = -\mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) \,\|\, p(z)) \end{align}
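For the common case of a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form. The sketch below computes it together with a single-sample estimate of the loss. It assumes a unit-variance Gaussian likelihood, so the negative log-likelihood reduces to half the squared error up to a constant. The function names are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def negative_elbo(x, x_mean, mu, log_var):
    """Single-sample estimate of the VAE training loss.

    Assumption: p(x|z) is Gaussian with unit variance and mean x_mean,
    so -log p(x|z) = 0.5 * ||x - x_mean||^2 + const (the constant is dropped).
    """
    rec = 0.5 * np.sum((x - x_mean) ** 2)
    return rec + gaussian_kl_to_standard_normal(mu, log_var)

# The KL term vanishes exactly when the posterior equals the prior.
kl_at_prior = gaussian_kl_to_standard_normal(np.zeros(3), np.zeros(3))
# Shifting the posterior mean by one unit in one dimension costs 0.5 nats.
kl_shifted = gaussian_kl_to_standard_normal(np.array([1.0, 0.0, 0.0]), np.zeros(3))
```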

In eliminating the random noise within [math]z[/math], the authors suggest substituting the loss's KL-divergence term with a form of explicit regularization. This makes sense because [math]z[/math] is no longer a random variable: the encoder posterior [math]q_{\phi}(z|x)[/math] collapses to a point mass that is zero almost everywhere, so the KL divergence to [math]p(z)[/math] is no longer well defined. Also, since the KL-divergence term previously regularized the encoder posterior, it is plausible that an alternative regularization scheme could affect the quality of generated samples. This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.

Redesigned Training Loss Function

The redesigned loss function [math]\mathcal{L}_{RAE}[/math] is defined as: \begin{align} \mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\ \text{where }\lambda\text{ and }\beta\text{ are hyperparameters} \end{align}

The first term [math]\mathcal{L}_{REC}[/math] is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions [math]\mu_{\theta}[/math] by a decoder that is deterministic. In the paper it is formally defined as: \begin{align} \mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2 \end{align} However, as the decoder [math]D_{\theta}[/math] is deterministic the reconstruction loss is equivalent to: \begin{align} \mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2 \end{align}

The second term [math]\mathcal{L}^{RAE}_Z[/math] is defined as: \begin{align} \mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2 \end{align} This is equivalent to constraining the size of the learned latent space, which prevents the latent codes from growing without bound during optimization.
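The first two terms can be sketched for a single sample as follows. The linear encoder `E`, the linear decoder `D` and the value of `beta` are hypothetical placeholders chosen only to make the example runnable.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """L_REC = ||x - D_theta(E_phi(x))||_2^2 for one sample."""
    return np.sum((x - x_hat) ** 2)

def latent_loss(z):
    """L_Z^RAE = 0.5 * ||z||_2^2 for one latent code."""
    return 0.5 * np.sum(z ** 2)

rng = np.random.default_rng(0)
E = rng.normal(size=(2, 5))   # hypothetical linear encoder E_phi
D = rng.normal(size=(5, 2))   # hypothetical linear decoder D_theta
beta = 1e-2                   # hypothetical weight on the latent term

x = rng.normal(size=5)
z = E @ x                     # deterministic latent code
x_hat = D @ z                 # deterministic reconstruction

# Loss without the decoder regularizer L_REG.
partial_loss = reconstruction_loss(x, x_hat) + beta * latent_loss(z)
```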

The third term [math]\mathcal{L}_{REG}[/math] acts as the explicit regularizer to the decoder. The authors consider the following possible formulations for [math]\mathcal{L}_{REG}[/math]:

Tikhonov regularization (Tikhonov & Arsenin, 1977):

\begin{align} \mathcal{L}_{REG} = ||\theta||_2^2 \end{align}

This in effect applies weight decay to the decoder parameters [math]\theta[/math]
Gradient Penalty:

\begin{align} \mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2 \end{align}

This would bound the gradient norm of the decoder with respect to its input
Spectral Normalization:
The authors also consider using Spectral Normalization in place of [math]\mathcal{L}_{REG}[/math], whereby each weight matrix [math]\theta_{\ell}[/math] in the decoder network is normalized by an estimate of its largest singular value [math]s(\theta_{\ell})[/math]. Formally this is defined as:

\begin{align} \theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}) \end{align}
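The three decoder regularizers above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the gradient penalty uses a finite-difference Jacobian in place of automatic differentiation, and the largest singular value is estimated by running power iteration to convergence on a fixed matrix. All function names are hypothetical.

```python
import numpy as np

def tikhonov(weights):
    """L_REG = ||theta||_2^2, summed over all decoder weight arrays."""
    return sum(np.sum(w ** 2) for w in weights)

def gradient_penalty(decoder, z, h=1e-5):
    """L_REG = ||grad_z D_theta(z)||_2^2 using a central finite-difference Jacobian."""
    cols = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        cols.append((decoder(z + step) - decoder(z - step)) / (2.0 * h))
    jac = np.stack(cols, axis=1)          # shape: (output_dim, latent_dim)
    return np.sum(jac ** 2)

def spectral_normalize(w, n_iter=100, seed=0):
    """Divide w by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    s = u @ w @ v                         # estimate of the largest singular value
    return w / s

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 3))               # hypothetical single-layer linear decoder
linear_decoder = lambda z: W @ z
z = rng.normal(size=3)

l2 = tikhonov([W])
gp = gradient_penalty(linear_decoder, z)  # for a linear decoder the Jacobian is W itself
W_sn = spectral_normalize(W)
```

For a purely linear decoder the Jacobian equals the weight matrix, so the gradient penalty and the Tikhonov term coincide. The two only differ once the decoder contains nonlinearities.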

Ex-Post Density Estimation

Removing the stochasticity from RAEs loses the ability to control the distribution of the latent space and to sample from it to produce varying generations. The authors overcome this issue by proposing ex-post density estimation over the trained RAE's latent space. In this process a density estimator [math]q_{\delta}(\mathbf{z})[/math] is fit over the latent points [math]\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} [/math]. Samples drawn from the estimated density can then be passed through the decoder to produce new data. The authors note that the choice of density estimator needs to balance expressiveness against simplicity, so that a good fit of the latent points is produced while still allowing generalisation to unseen points.
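The procedure can be sketched with the simplest possible estimator, a single full-covariance Gaussian. The linear `encode` and `decode` functions below are hypothetical stand-ins for a trained RAE, and the random training data is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a trained RAE (fixed linear encoder and decoder).
E = rng.normal(size=(2, 5))
D = rng.normal(size=(5, 2))
encode = lambda x: x @ E.T
decode = lambda z: z @ D.T

X_train = rng.normal(size=(1000, 5))     # placeholder training set

# 1) Collect the latent codes of the training data.
Z = encode(X_train)

# 2) Fit the density estimator q_delta(z): here a single Gaussian.
mean = Z.mean(axis=0)
cov = np.cov(Z, rowvar=False)

# 3) Sample new latent codes from the fitted density and decode them.
z_new = rng.multivariate_normal(mean, cov, size=16)
x_new = decode(z_new)
```

A more expressive estimator, such as a mixture of Gaussians, would replace step 2 while leaving the collection and decoding steps unchanged.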

Empirical Evaluations

Image Modeling:

Models Evaluated:

The authors evaluate regularization schemes using Tikhonov Regularization, Gradient Penalty, and Spectral Normalization. These correspond to the models (RAE-L2), (RAE-GP) and (RAE-SN) respectively, as seen in figure 1. Additionally they consider a model (RAE) where [math]\mathcal{L}_{REG} [/math] is excluded from the loss and a model (AE) where both [math]\mathcal{L}_{REG} [/math] and [math]\mathcal{L}^{RAE}_{Z} [/math] are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussian VAE (CV-VAE), a Wasserstein Auto-Encoder (WAE) with MMD loss and a 2-stage VAE (2sVAE).

Metrics of Evaluation:

Each model was evaluated on the following metrics:

  • Rec: Test sample reconstruction, where the Fréchet Inception Distance (FID) is computed between a held-out test sample and the network's reconstruction of it.
  • [math]\mathcal{N}[/math]: FID calculated between test data and random samples from a single Gaussian that is either the isotropic [math]p(z)[/math] fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to [math]q_{\delta}(z)[/math] for CV-VAEs and RAEs.
  • GMM: FID calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.
  • Interp: Mid-point interpolation between random pairs of test reconstructions.
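The FID underlying these metrics is the Fréchet distance between two Gaussians fitted to feature vectors of the two image sets. A minimal sketch of that distance is given below. It assumes the features have already been extracted, and the Inception network itself is omitted. The function name and the toy features are illustrative only.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of the square roots of the eigenvalues
    # of C_a C_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))                  # placeholder "real" features
fake = real + np.array([1.0, 2.0, 0.0, 0.0])      # same covariance, shifted mean

d_same = frechet_distance(real, real)             # identical sets: distance is 0 up to rounding
d_shift = frechet_distance(real, fake)            # pure mean shift: ||shift||^2 = 5 up to rounding
```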


Each model was trained and evaluated on the MNIST, CIFAR, and CELEBA datasets. Their performance across each metric and each dataset can be seen in figure 1. For the GMM metric, all RAE variants with regularization schemes outperform the baseline models on every dataset. Furthermore, for [math]\mathcal{N}[/math] the regularized RAE variants outperform the baseline models on the CIFAR and CELEBA datasets. This suggests that RAEs can achieve competitive generated image quality when compared to existing VAE architectures.

Modeling Structured Objects


The authors evaluate the RAE's ability to model complex structured objects, namely molecules and arithmetic expressions. They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE) (Kusner et al., 2017) and replace its variational framework with that of an RAE utilizing Tikhonov regularization; the resulting model is referred to as the GRAE.

Metrics of Evaluation

In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider [math]\log(1 + MSE)[/math] between generated expressions and the true data. To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient [math]\log(P)[/math], where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule. They compare the GRAE's performance on these metrics to those of the GVAE, the constant-variance GVAE (GCVVAE), and the CharacterVAE (CVAE) (Gomez-Bombarelli et al., 2018), as seen in figure 2. Additionally, to assess the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax within the generated samples.
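The expression score can be sketched as follows. The evaluation grid and both expressions are illustrative assumptions rather than the exact experimental setup. Since the score is a transformed error, a lower value means a closer match.

```python
import numpy as np

def expression_score(candidate, target, inputs):
    """log(1 + MSE) between a candidate expression and the target on given inputs."""
    mse = np.mean((candidate(inputs) - target(inputs)) ** 2)
    return np.log1p(mse)

inputs = np.linspace(-10.0, 10.0, 1000)               # hypothetical evaluation grid
target = lambda x: 1.0 / 3.0 + x + np.sin(x * x)      # illustrative target expression
candidate = lambda x: x + np.sin(x * x)               # candidate missing the constant 1/3

# The two expressions differ by the constant 1/3 everywhere, so MSE = 1/9.
score = expression_score(candidate, target, inputs)
perfect = expression_score(target, target, inputs)    # exact match gives a score of 0
```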


Their results displayed in figure 2 show that the GRAE is competitive in its ability to generate samples of structured objects and even outperforms the other models with respect to average score for generated expressions. It is notable that, although it ranks second in average score for generating molecules, it produces the highest percentage of syntactically valid molecules.


The authors provide empirical evidence that a deterministic autoencoder is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the variational framework. By comparing the performance of VAEs and RAEs across the tasks of image and structured object sample generation, the authors have demonstrated that RAEs are capable of producing comparable or better sample results.