Introduction
Generative adversarial networks (GANs) are among the most important generative models: a discriminator and a generator compete with each other to solve a minimax game. According to the original GAN paper, when training is finished and a Nash equilibrium is reached, the discriminator reduces to a constant function that assigns a score of 0.5 everywhere. In this setting, the discriminator is therefore nothing more than a tool for training the generator. Furthermore, the generator in a traditional GAN models the data density only implicitly, while some applications require an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game, similar to the one in GANs. An advantage of this EBM view is that, unlike in the original GAN formulation, the discriminator itself is an explicit density model of the data.
Building on these remarks, the authors show that an energy-based model can be trained using a minimax formulation similar to the one used for GANs. After training the energy-based model, they use the Fisher score and Fisher information (which are computed from the derivatives of the generative model with respect to its parameters) to evaluate how well the discriminator models the data distribution. More precisely, they compute normalized Fisher vectors and a Fisher distance measure from the derivatives of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs); a small illustrative sketch of this computation is given after the list below. The Fisher vector is a powerful representation that can be computed from an EBM precisely because, in this EBM view, the discriminator itself is an explicit density model of the data. Fisher vectors can also be used for set representation, which is a challenging problem: as we will see, the Fisher kernel can be used to compute a distance between two sets of images, which is not a trivial task. The authors show that AFVs are useful as pre-trained features for the following tasks:
- Achieving state-of-the-art performance on unsupervised feature extraction and linear classification tasks
- Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments
- Improving GAN training through monitoring (with AFV metrics) and stabilization (with MCMC updates), which is difficult in general
- Estimating distances between sets with AFVs, which allows the training process to be monitored. More precisely, they show that the Fisher distance between the set of validation examples and the set of generated examples can effectively capture overfitting.
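As a rough illustration of how such a representation might be computed, the sketch below follows the idea described above under simplifying assumptions (it is not the paper's implementation): the discriminator's logit D plays the role of the unnormalized log-density, a batch of generator samples stands in for samples from the model when approximating the log-partition gradient, and the Fisher information is approximated by its diagonal. The function names are hypothetical.

<pre>
import torch

def flat_grad(scalar, params):
    # Gradient of a scalar w.r.t. all parameters, flattened into a single vector.
    grads = torch.autograd.grad(scalar, params)
    return torch.cat([g.reshape(-1) for g in grads])

def adversarial_fisher_vector(x, D, model_samples, eps=1e-6):
    """Sketch of an AFV for a single example x of shape (1, d).
    D is the discriminator logit (i.e. -E(x)); `model_samples` are generator
    samples used in place of samples from the model distribution."""
    params = [p for p in D.parameters() if p.requires_grad]
    # Fisher score: grad_theta log p(x) ~ grad_theta D(x) - E_model[grad_theta D(x')]
    mean_model_grad = flat_grad(D(model_samples).mean(), params)
    fisher_score = flat_grad(D(x).sum(), params) - mean_model_grad
    # Diagonal Fisher information estimated from per-sample model scores.
    per_sample_scores = torch.stack([
        flat_grad(D(model_samples[i:i + 1]).sum(), params) - mean_model_grad
        for i in range(model_samples.size(0))
    ])
    fisher_diag = per_sample_scores.pow(2).mean(dim=0) + eps
    return fisher_score / fisher_diag.sqrt()  # normalized Fisher vector (AFV)

def fisher_distance(v1, v2):
    # Euclidean distance between normalized Fisher vectors.
    return torch.norm(v1 - v2)
</pre>

A distance between two sets of images can then be obtained from these vectors, for example by comparing the mean AFVs of the two sets (one simple choice for illustration).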
Background
Generative Adversarial Networks
The weights of the generator and the discriminator are updated by solving the following optimization problem: \begin{equation} \underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-\log D(\mathbf{x})] + E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))] \label{1} \end{equation}
where [math]\displaystyle{ p_{data}(\mathbf{x}) }[/math], [math]\displaystyle{ D(\mathbf{x}) }[/math], and [math]\displaystyle{ G(\mathbf{z}) }[/math] are the data distribution, the discriminator, and the generator, respectively. To solve this problem, in the inner loop [math]\displaystyle{ D }[/math] is trained until convergence given [math]\displaystyle{ G }[/math], and in the outer loop [math]\displaystyle{ G }[/math] is updated one step given [math]\displaystyle{ D }[/math].
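To make the alternating optimization concrete, here is a minimal PyTorch-style sketch. The toy fully connected networks on 2-D data, the learning rates, and the number of inner discriminator steps are illustrative choices, not the paper's settings.

<pre>
import torch
import torch.nn as nn

# Toy generator and discriminator; placeholders, not the paper's architectures.
latent_dim = 16
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def training_step(x_real, d_steps=5, eps=1e-6):
    n = x_real.size(0)
    # Inner loop: several D updates (approximating "until convergence") given G.
    for _ in range(d_steps):
        x_fake = G(torch.randn(n, latent_dim)).detach()
        # Minimize E_x[-log D(x)] + E_z[-log(1 - D(G(z)))] over D.
        loss_D = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Outer loop: one G step, maximizing E_z[-log(1 - D(G(z)))], i.e. minimizing
    # E_z[log(1 - D(G(z)))]. (In practice the non-saturating loss
    # E_z[-log D(G(z))] is often substituted.)
    loss_G = torch.log(1.0 - D(G(torch.randn(n, latent_dim))) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
</pre>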
GANs as variational training of deep EBMs
Assume an energy-based model defines a density function [math]\displaystyle{ p_{E}(\mathbf{x}) }[/math] as [math]\displaystyle{ \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } }[/math]. Then the negative log-likelihood (NLL) of [math]\displaystyle{ p_{E}(\mathbf{x}) }[/math] can be written as
\begin{equation} \begin{split} E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} &= E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}\left[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\right] \\ &\geq E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}\left[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\right] \\ &= E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q) \end{split} \end{equation}
where [math]\displaystyle{ q(\mathbf{x}) }[/math] is an auxiliary distribution, called the variational distribution, and [math]\displaystyle{ H(q) }[/math] is its entropy. Here Jensen's inequality was used to obtain a variational lower bound on the NLL. This bound is tight if [math]\displaystyle{ q(\mathbf{x}) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x} }[/math], which means [math]\displaystyle{ q(\mathbf{x}) = p_{E}(\mathbf{x}) }[/math]. In this case, we can set [math]\displaystyle{ D(\mathbf{x})= -E(\mathbf{x}) }[/math].
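As a toy illustration of this bound, the following sketch estimates [math]\displaystyle{ E_{\mathbf{x} \sim p_{data}}[E(\mathbf{x})] - E_{\mathbf{x} \sim q}[E(\mathbf{x})] + H(q) }[/math] by Monte Carlo, assuming a small energy network and a diagonal Gaussian [math]\displaystyle{ q }[/math] so that [math]\displaystyle{ H(q) }[/math] has a closed form. In the GAN view above, [math]\displaystyle{ q }[/math] is instead represented implicitly by the generator (with [math]\displaystyle{ D(\mathbf{x}) = -E(\mathbf{x}) }[/math]), and the entropy term requires special treatment; all names and architectures below are illustrative.

<pre>
import math
import torch
import torch.nn as nn

# Toy energy network E(x) on 2-D data; a placeholder, not the paper's architecture.
energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def variational_nll_bound(x_data, q_mean, q_log_std, n_samples=256):
    """Monte Carlo estimate of the lower bound
       E_data[E(x)] - E_q[E(x)] + H(q)
    with q taken to be a diagonal Gaussian (so H(q) is analytic)."""
    # E_data[E(x)]
    data_term = energy(x_data).mean()
    # E_q[E(x)] with reparameterized samples x = mu + sigma * eps
    eps = torch.randn(n_samples, q_mean.size(-1))
    x_q = q_mean + q_log_std.exp() * eps
    model_term = energy(x_q).mean()
    # Entropy of a diagonal Gaussian: sum_i [ log sigma_i + 0.5 * log(2*pi*e) ]
    entropy = (q_log_std + 0.5 * math.log(2 * math.pi * math.e)).sum()
    return data_term - model_term + entropy
</pre>

Training then takes the form of a minimax game: [math]\displaystyle{ q }[/math] is updated to maximize the bound (tightening it toward the NLL), while the energy [math]\displaystyle{ E }[/math] is updated to minimize it.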