From Variational to Deterministic Autoencoders


Presented by

Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf

Introduction

This paper presents Regularized Autoencoders (RAEs), an alternative, deterministic framework for generative modelling. The authors investigate how the stochasticity of VAEs can be replaced with implicit and explicit regularization schemes. Furthermore, they present a generative mechanism for deterministic auto-encoders based on an ex-post density estimation step, which can also be applied to existing VAEs to improve their sample quality. Finally, they conduct an empirical comparison between VAEs and deterministic regularized auto-encoders and show that the latter generate samples that are comparable or better on images and structured data.

Previous Work

The proposed method modifies the architecture of the existing Variational Autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014).

Motivation

The authors point to several drawbacks currently associated with VAEs, including:

  • over-regularisation induced by the KL divergence term within the objective (Tolstikhin et al., 2017)
  • posterior collapse in conjunction with powerful decoders (van den Oord et al., 2017)
  • increased variance of gradients caused by approximating expectations through sampling (Burda et al., 2015; Tucker et al., 2017)

These issues motivate their consideration of alternatives to the variational framework adopted by VAEs.

Furthermore, the authors view the random noise that VAEs inject through the reparameterization [math]\displaystyle{ z = \mu(x) +\sigma(x)\epsilon }[/math] as having a regularization effect that promotes the learning of a smoother latent space. This motivates their exploration of alternative regularization schemes for auto-encoders that could replace the VAE's random noise injection while producing equivalent or better generated samples. This would allow the variational framework, and its associated drawbacks, to be eliminated.

Framework Architecture

Overview

The Regularized Autoencoder proposes three modifications to the existing VAE framework. Firstly, it eliminates the injection of random noise [math]\displaystyle{ \epsilon }[/math] from the reparameterization of the latent variable [math]\displaystyle{ z }[/math]. Secondly, it proposes a redesigned loss function [math]\displaystyle{ \mathcal{L}_{RAE} }[/math]. Finally, it proposes an ex-post density estimation procedure for generating samples from the RAE.


Eliminating Random Noise

The authors propose eliminating the injection of random noise [math]\displaystyle{ \epsilon }[/math] from the reparameterization of the latent variable [math]\displaystyle{ z = \mu(x) +\sigma(x)\epsilon }[/math], resulting in an encoder [math]\displaystyle{ E_{\phi} }[/math] that deterministically maps a data point [math]\displaystyle{ x }[/math] to a latent variable [math]\displaystyle{ z }[/math].
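The contrast between the two encoders can be illustrated with a small NumPy sketch. The linear maps `W_mu` and `W_logvar` are hypothetical stand-ins for an encoder network and are not taken from the paper; the point is only that the deterministic encoding keeps the mean and drops the noise term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder: two linear heads standing in for a neural network.
W_mu = rng.normal(size=(2, 4))       # maps a 4-d input to a 2-d mean
W_logvar = rng.normal(size=(2, 4))   # maps a 4-d input to a 2-d log-variance


def encode_vae(x, rng):
    """Stochastic VAE encoding: z = mu(x) + sigma(x) * eps, eps ~ N(0, I)."""
    mu = W_mu @ x
    sigma = np.exp(0.5 * (W_logvar @ x))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def encode_rae(x):
    """Deterministic encoding: the noise term is dropped, so z = E(x)."""
    return W_mu @ x


x = np.ones(4)
z_vae_1 = encode_vae(x, rng)   # two calls on the same x give different codes
z_vae_2 = encode_vae(x, rng)
z_rae_1 = encode_rae(x)        # two calls on the same x give the same code
z_rae_2 = encode_rae(x)
```

Encoding the same input twice yields two different codes with `encode_vae` but identical codes with `encode_rae`.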

The current variational framework of VAEs enforces regularization on the encoder posterior through the KL-divergence term of its loss function: \begin{align} \mathcal{L}_{ELBO} = -\mathbb{E}_{z \sim q_{\phi}(z|x)} \log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) \,\|\, p(z)) \end{align}

In eliminating the random noise within [math]\displaystyle{ z }[/math], the authors suggest substituting the loss's KL-divergence term with a form of explicit regularization. This makes sense because [math]\displaystyle{ z }[/math] is no longer random: the encoder posterior [math]\displaystyle{ q_{\phi}(z|x) }[/math] degenerates to a point mass, so its density is zero almost everywhere and the KL term is no longer well defined. Also, since the KL-divergence term previously enforced regularization on the encoder posterior, it is plausible that an alternative regularization scheme could affect the quality of the generated samples. This substitution of the KL-divergence term leads to the redesign of the loss function used in training the RAE.
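To see numerically why the KL term cannot simply be kept for a deterministic encoder, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior as the posterior standard deviation shrinks. The Gaussian posterior and standard normal prior are assumptions made for this illustration (they are the usual VAE choices); the function name is hypothetical.

```python
import numpy as np


def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1), with log(sigma^2) = 2 log(sigma)
    return 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)


mu = np.zeros(2)
# Shrinking sigma mimics an encoder that becomes more and more deterministic.
kl_values = [
    kl_diag_gaussian_to_standard_normal(mu, np.full(2, s))
    for s in (1.0, 1e-1, 1e-3, 1e-6)
]
```

The values in `kl_values` start at zero (posterior equal to the prior) and grow without bound as the standard deviation approaches zero, which is the deterministic limit.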

Redesigned Loss Function

The redesigned loss function [math]\displaystyle{ \mathcal{L}_{RAE} }[/math] is defined as: \begin{align} \mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG} \end{align} where [math]\displaystyle{ \lambda }[/math] and [math]\displaystyle{ \beta }[/math] are hyperparameters.

The first term [math]\displaystyle{ \mathcal{L}_{REC} }[/math] is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions [math]\displaystyle{ \mu_{\theta} }[/math] by a decoder that is deterministic. In the paper it is formally defined as: \begin{align} \mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2 \end{align} However, as the decoder [math]\displaystyle{ D_{\theta} }[/math] is deterministic the reconstruction loss is equivalent to: \begin{align} \mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2 \end{align}

The second term [math]\displaystyle{ \mathcal{L}^{RAE}_Z }[/math] is defined as: \begin{align} \mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2 \end{align} where [math]\displaystyle{ \mathbf{z} = E_{\phi}(\mathbf{x}) }[/math]. This is equivalent to constraining the size of the learned latent space, which prevents the latent codes from growing without bound during optimization.
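The first two terms of the loss can be computed as in the following minimal sketch for a single data point. The linear `encode` and `decode` functions and the matrix `A` are hypothetical placeholders chosen so the numbers are easy to check by hand; the decoder regulariser [math]\displaystyle{ \lambda \mathcal{L}_{REG} }[/math] is left out here.

```python
import numpy as np


def rae_losses(x, encode, decode, beta):
    """Return (L_REC, L_Z, L_REC + beta * L_Z) for a single data point x."""
    z = encode(x)
    x_hat = decode(z)
    l_rec = np.sum((x - x_hat) ** 2)   # ||x - D(E(x))||_2^2
    l_z = 0.5 * np.sum(z ** 2)         # (1/2) ||z||_2^2
    return l_rec, l_z, l_rec + beta * l_z


# Hypothetical linear encoder/decoder pair, used only to exercise the function.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])        # 3-d input -> 2-d latent
encode = lambda x: A @ x
decode = lambda z: A.T @ z

x = np.array([3.0, 4.0, 1.0])
l_rec, l_z, total = rae_losses(x, encode, decode, beta=0.1)
```

Here the latent code is (3, 4) and the reconstruction is (3, 4, 0), so the reconstruction loss is 1, the latent penalty is 12.5, and the combined value with `beta=0.1` is 2.25.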

The third term [math]\displaystyle{ \mathcal{L}_{REG} }[/math] acts as the explicit regularizer of the decoder. The authors consider the following possible formulations for [math]\displaystyle{ \mathcal{L}_{REG} }[/math]:

Tikhonov regularization (Tikhonov & Arsenin, 1977):

\begin{align} \mathcal{L}_{REG} = ||\theta||_2^2 \end{align}

This in effect applies weight decay to the decoder parameters [math]\displaystyle{ \theta }[/math].
Gradient Penalty:

\begin{align} \mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2 \end{align}

This penalises, and so softly bounds, the gradient norm of the decoder with respect to its input.
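As a rough illustration of the gradient penalty, the sketch below computes it for a hypothetical one-layer decoder [math]\displaystyle{ D(z) = \tanh(Wz + b) }[/math], whose Jacobian is available in closed form. Treating the norm as the squared Frobenius norm of the Jacobian is an assumption of this example, as are the decoder and all names; a finite-difference version is included only as a sanity check.

```python
import numpy as np


def decode(z, W, b):
    """Hypothetical one-layer decoder D(z) = tanh(W z + b)."""
    return np.tanh(W @ z + b)


def gradient_penalty(z, W, b):
    """Squared Frobenius norm of the Jacobian dD/dz at the point z."""
    h = np.tanh(W @ z + b)
    jacobian = (1.0 - h**2)[:, None] * W   # chain rule: diag(1 - tanh^2) @ W
    return np.sum(jacobian**2)


def gradient_penalty_fd(z, W, b, step=1e-6):
    """Same quantity via central finite differences, as a sanity check."""
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = step
        cols.append((decode(z + dz, W, b) - decode(z - dz, W, b)) / (2 * step))
    return np.sum(np.stack(cols, axis=1) ** 2)


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
b = rng.normal(size=5)
z = rng.normal(size=2)

gp_analytic = gradient_penalty(z, W, b)
gp_numeric = gradient_penalty_fd(z, W, b)
```

For a deep decoder the Jacobian would normally be obtained by automatic differentiation rather than by hand.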
Spectral Normalization:
The authors also consider using Spectral Normalization in place of [math]\displaystyle{ \mathcal{L}_{REG} }[/math], whereby each weight matrix [math]\displaystyle{ \theta_{\ell} }[/math] in the decoder network is normalized by an estimate of its largest singular value. Formally this is defined as:

\begin{align} \theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}) \end{align}

where [math]\displaystyle{ s(\theta_{\ell}) }[/math] is the spectral norm estimate.
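One common way to obtain the spectral norm estimate is power iteration; the sketch below assumes that estimator, which this summary does not specify. The function names, the iteration count, and the test matrix (built to have a known largest singular value of 3) are all illustrative choices.

```python
import numpy as np


def spectral_norm_estimate(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)


def spectral_normalize(W, n_iters=50):
    """Return W / s(W), so that the result has spectral norm close to 1."""
    return W / spectral_norm_estimate(W, n_iters)


# Build a 6x4 matrix with known singular values (3, 1.5, 1, 0.5) so that the
# estimate can be checked against the true largest singular value.
rng = np.random.default_rng(1)
Q_left, _ = np.linalg.qr(rng.normal(size=(6, 4)))    # orthonormal columns
Q_right, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthogonal matrix
W = Q_left @ np.diag([3.0, 1.5, 1.0, 0.5]) @ Q_right.T

s_est = spectral_norm_estimate(W)
W_sn = spectral_normalize(W)
```

After normalization the largest singular value of `W_sn` is approximately 1, which limits how much that layer can stretch its input.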

Ex-Post Density Estimation

Removing the stochasticity from RAEs removes the ability to control the distribution of the latent space and to sample from it to produce varied generations. The authors overcome this issue by proposing ex-post density estimation over the trained RAE's latent space. In this process a density estimator [math]\displaystyle{ q_{\delta}(\mathbf{z}) }[/math] is fit over the latent points [math]\displaystyle{ \{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \mathcal{X}\} }[/math]. Latent codes sampled from the estimated density are then passed through the decoder to produce new samples. The authors note that the choice of density estimator needs to balance expressiveness against simplicity, so that it fits the latent points well but still generalises to points not seen during training.
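The procedure can be sketched in three steps: encode the training data, fit a density to the resulting latent points, then sample and decode. In the sketch below a single full-covariance Gaussian is used as one simple choice of estimator, and the linear `encode` and `decode` functions are hypothetical stand-ins for a trained RAE; neither choice is prescribed by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a trained RAE: a fixed linear encoder and decoder.
W_enc = rng.normal(size=(2, 5))
W_dec = rng.normal(size=(5, 2))
encode = lambda x: W_enc @ x
decode = lambda z: W_dec @ z

# Step 1: encode the training set to get the latent points {z = E(x)}.
X_train = rng.normal(size=(500, 5))
Z = np.stack([encode(x) for x in X_train])

# Step 2: fit the density estimator q(z); here a single full-covariance Gaussian.
mean = Z.mean(axis=0)
cov = np.cov(Z, rowvar=False)

# Step 3: sample latent codes from q(z) and decode them into new data points.
z_new = rng.multivariate_normal(mean, cov, size=10)
x_new = np.stack([decode(z) for z in z_new])
```

A more expressive estimator, such as a mixture model, could be swapped in at step 2 without changing the other steps, which is where the trade-off between expressiveness and simplicity comes in.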

Experiment Results

Conclusion

Critiques

References