# Multi-View Data Generation Without View Supervision

This page summarizes the paper "Multi-View Data Generation Without View Supervision" by Mickael Chen, Ludovic Denoyer, and Thierry Artieres. It was published as a poster at the International Conference on Learning Representations (ICLR) in 2018.

## Introduction

### Motivation

High-dimensional generative models have seen a surge of interest of late with the introduction of variational auto-encoders and generative adversarial networks. This paper focuses on a particular problem: generating samples that correspond to a number of objects under various views. The data distribution is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles from which the same object is seen). Using this disentanglement of the latent space, the paper proposes two models: a generative model and a conditional variant of it.

### Related Work

The problem of handling multi-view inputs has mainly been studied from a predictive point of view, where one wants, for example, to learn a model able to predict or classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variation from multi-view datasets. The underlying idea is that a particular data sample may be thought of as a mix of content information (e.g. related to its class label, such as a given person in a face dataset) and side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with or without glasses). All samples of the same class therefore share the same content but differ in view. A number of approaches have been proposed to disentangle the content from the view, also referred to as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). According to the paper, these earlier approaches share two common limitations: (i) they usually consider discrete views characterized by a domain or a set of discrete (binary or categorical) attributes (e.g. face with or without glasses, hair color) and do not easily scale to a large number of attributes or to continuous views; (ii) most models are trained using view supervision (e.g. the view attributes), which greatly helps learning but prevents their use on the many datasets where this information is not available.

### Contributions

The authors claim the following contributions: (i) a new generative model able to generate data with varied content and high view diversity, using supervision on the content information only; (ii) an extension of this generative model to a conditional model that can generate new views of any input sample.

## Paper Overview

### Background

The paper builds on the popular generative adversarial network (GAN) framework proposed by Goodfellow et al. (2014).

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). GANs were introduced in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”

Let $X$ denote an input space composed of multidimensional samples $x$, e.g. vectors, matrices, or tensors. Given a latent space $R^n$ and a prior distribution $p_z(z)$ over this latent space, any generator function $G : R^n \to X$ defines a distribution $p_G$ on $X$, namely the distribution of samples $G(z)$ where $z \sim p_z$. In addition to $G$, a GAN defines a discriminator function $D : X \to [0, 1]$ that aims to distinguish real inputs sampled from the training set from fake inputs sampled from $p_G$, while the generator is trained to fool the discriminator $D$. Usually both $G$ and $D$ are implemented as neural networks. The objective function is based on the following adversarial criterion:

$\underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x)] + E_{p_z}[\log(1 - D(G(z)))]$

where $p_x$ is the empirical data distribution on $X$. Goodfellow et al. (2014) showed that if $G^*$ and $D^*$ are optimal for the above criterion, the Jensen-Shannon divergence between $p_{G^*}$ and the empirical data distribution $p_x$ is minimized, which makes GANs able to estimate complex continuous data distributions.
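The link between the adversarial criterion and the Jensen-Shannon divergence can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: the three-point distributions `p_x` and `p_g` and all function names are hypothetical, and the expectation over $p_z$ is rewritten as an expectation over the generated distribution $p_G$. It relies on the optimal discriminator $D^*(x) = p_x(x) / (p_x(x) + p_G(x))$ from Goodfellow et al. (2014), for which the criterion equals $2 \cdot JSD(p_x, p_G) - \log 4$.

```python
import math

def gan_value(p_x, p_g, d):
    """Adversarial criterion on a finite sample space:
    E_{p_x}[log D(x)] + E_{p_G}[log(1 - D(x))]."""
    return sum(px * math.log(dx) + pg * math.log(1.0 - dx)
               for px, pg, dx in zip(p_x, p_g, d))

def optimal_discriminator(p_x, p_g):
    """D*(x) = p_x(x) / (p_x(x) + p_G(x)) for a fixed generator."""
    return [px / (px + pg) for px, pg in zip(p_x, p_g)]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical data and generator distributions over three outcomes.
p_x = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

d_star = optimal_discriminator(p_x, p_g)
v_star = gan_value(p_x, p_g, d_star)

# With the optimal discriminator, the criterion is 2 * JSD - log 4.
gap = abs(v_star - (2 * jsd(p_x, p_g) - math.log(4)))

# A generator that matches the data exactly reaches the minimum, -log 4.
v_matched = gan_value(p_x, p_x, optimal_discriminator(p_x, p_x))
```

Here `gap` is zero up to floating-point error, and `v_matched` equals $-\log 4$, the smallest value the generator can reach against an optimal discriminator.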

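In the conditional setting, both the generator and the discriminator additionally receive a conditioning variable $y$, and the criterion becomes:

$\underset{G}{\min} \ \underset{D}{\max} \ E_{p_{x,y}}[\log D(x,y)] + E_{p_z, p_y}[\log(1 - D(G(y,z), y))]$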
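A Monte Carlo estimate of the conditional criterion may help show why conditioning matters. The following is a minimal sketch, not the paper's implementation: the binary label `y`, the Gaussian toy data, the hand-written `generator` and `discriminator`, and every constant are hypothetical. It also assumes, as in the standard conditional GAN formulation, that the discriminator sees the condition $y$ together with the generated sample.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy data: y is a binary label and real x | y ~ N(MEANS[y], 1).
MEANS = {0: -2.0, 1: 2.0}

def sample_real(rng):
    y = rng.choice([0, 1])
    return rng.gauss(MEANS[y], 1.0), y

def generator(y, z, shift):
    """Toy G(y, z): shifts the noise z by a per-label offset."""
    return shift[y] + z

def discriminator(x, y):
    """Toy D(x, y) in (0, 1): high when x lies near the mean for label y."""
    return sigmoid(2.0 - abs(x - MEANS[y]))

def conditional_value(shift, n=5000, seed=0):
    """Monte Carlo estimate of E[log D(x, y)] + E[log(1 - D(G(y, z), y))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = sample_real(rng)
        total += math.log(discriminator(x, y))
        y_fake = rng.choice([0, 1])
        z = rng.gauss(0.0, 1.0)
        x_fake = generator(y_fake, z, shift)
        total += math.log(1.0 - discriminator(x_fake, y_fake))
    return total / n

# Generator whose samples match the data for each label.
good = conditional_value({0: -2.0, 1: 2.0})
# Generator that produces plausible x but for the wrong label.
bad = conditional_value({0: 2.0, 1: -2.0})
```

Because the discriminator judges the pair $(x, y)$ rather than $x$ alone, the label-swapping generator is easy to reject and yields a higher criterion value (`bad`) than the label-consistent one (`good`). Since the generator minimizes the criterion, it is pushed toward samples that agree with the condition.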