# stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT

## Contents

## Introduction

## GANs and Optimal Transport

### Generative Adversarial Nets

Original GAN was firstly reviewed. The objective function of the GAN:

The goal of GANs is to train the generator g and the discriminator d finding a pair of (g,d) to achieve Nash equilibrium. However, it could cause failure of converging since the generator and the discriminator are trained based on gradient descent techniques.

### Wasserstein distance (Earth-Mover Distance)

In order to solve the problem of convergence failure, Arjovsky et. al. (2017) suggested Wasserstein distance (Earth-Mover distance) based on the optimal transport theory.

where [math] \prod (p,g) [/math] is the set of all joint distributions [math] \gamma (x,y) [/math] with marginals [math] p(x) [/math] (real data), [math] g(y) [/math] (generated data) and [math] c(x,y) [/math] is a cost function which takes to be the Euclidean distance.

The Earth-Mover Distance can be considered as moving the minimum amount of points between distribution [math] g(y) [/math] and [math] p(x) [/math] such that the generator distribution [math] g(y) [/math] is similar to the real data distribution [math] p(x) [/math].

Consider that solving the Wasserstein distance is usually not possible, the proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.

W-GAN solves the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.

The training process becomes traceable and a fully connected multi-layer network is sufficient for the training. However, the calculation of the Wasserstein distance is still an approximation and we can not have a fine-optimized critic.

### Sinklhorn Distance

Genevay et al. (2017) proposed to use the primal formulation of optimal transport instead of the dual formulation to generative modeling. They introduced Sinkhorn distance which is a smoothed generalization of the Earth Mover distance.

It introduced a constant [math] \beta [/math] to restricted the entropy of the joint distribution [math] \prod_{\beta} (p,g) [/math]. This distanc could generalize to approximate the mini-batches of data [math] X ,Y[/math] with [math] K [/math] vectors of [math] x, y[/math]. The [math] i, j [/math] th entry of the cost matrix [math] C [/math] can be interpreted as the cost it needs to transport the [math] x_i [/math] in mini-batch X to the [math] y_i [/math] in mini-batch [math]Y [/math]. The resulting distance will be:

where [math] M [/math] is a [math] K \times K [/math] matrix, each row of [math] M [/math] is a joint distribution of [math] \gamma (x,y) [/math] with positive entries. The summmation of rows or columns of [math] M [/math] is always equal to 1. Sinkhorn AutoDiff is the methods that minimiztion over M proposed by Genevay et al. (2017b)

This mini-batch Sinkhorn distance is not only fully tractable but also solve the instability problem of GANs. However, for the fixed mini-batch size, the gradients in this equation are biased.

### Energy Distance (Cramer Distance)

In order to solve the above problem, Energy Distance is proposed by Bellemare et al. (2017):

where [math] x, x' [/math] are independent samples from data distribution [math] p [/math] and [math] y, y'[/math] are independent samples from the generator distribution [math] g [/math]. Cramer GAN is the method that by minimizing the ED distance metric to training the generator.

## MINI-BATCH ENERGY DISTANCE

Salimans et al. (2016) mentioned that comparing to use distributions over individual images, mini-batch GAN is more powerful when use the distributions over mini-batches [math] g(X), p(X) [/math]. The distance measure is generated for mini-batches.

### GENERALIZED ENERGY DISTANCE

The generalized energy distance allowed to use non-Euclidean distance functions d. It is also valid for mini-batches.

where [math] X, X' [/math] are independent samples from data distribution [math] p [/math] and [math] Y, Y'[/math] are independent samples from the generator distribution [math] g [/math]. As long as [math] d [/math] is a metric, [math] D_{GED}(p,g) [/math] is a metric. Thus, by the defination of metric, [math] D(p,g) \geq 0,[/math] and [math] D(p,g)=0 [/math] iff [math] p=g [/math].

### MINI-BATCH ENERGY DISTANCE

Authors proposed a new metric, Mini-batch Energy Distance. It is a distance metric that combines the energy distance with primal form of optimal tranport over mini-batch distributions [math] g(Y) [/math] and [math] p(X) [/math]. Authors used entropy-regularized Sinkhorn distance as the metric d. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between each mini-batches.

where [math] X, X' [/math] are independent sampled mini-batches from data distribution [math] p [/math] and [math] Y, Y'[/math]are independent mini-batches from the generator distribution [math] g [/math]. By adding the [math] - \mathcal{W}_c (Y,Y')[/math] [math] \mathcal{W}_c (X,Y)[/math] to equation (5) and using enregy distance, the objective becomes statistically consistent and mini-batch gradients are unbiased.

## OPTIMAL TRANSPORT GAN (OT-GAN)

A well definded cost function could effect the statistical efficency. A commen choice of c, such as Euclidean distance, shows a generally pooly performance in highly dimensions. Authors suggested to use cosine distance between vectors [math] v_\eta (x) [/math] and [math] v_\eta (y) [/math]. The [math] v_\eta [/math] is a deep neural network which maps the mini- batch images into a learned latent space.