MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION: Difference between revisions

(Created page with "This page contains a summary of the paper "[https://openreview.net/pdf?id=ryRh0bb0Z MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION]" by Mickael Chen, Ludovic Denoyer, Thi...")
 
 
(100 intermediate revisions by 19 users not shown)
Line 1: Line 1:
This page contains a summary of the paper "[https://openreview.net/forum?id=ryRh0bb0Z Multi-View Data Generation Without View Supervision]" by Mickael Chen, Ludovic Denoyer, and Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. An implementation of the models presented in this paper is available [https://github.com/mickaelChen/GMV here].


==Introduction==
The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both ''what'' and ''how'' to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent forms into observation-action pairs and then distills into a policy for performing the task. In the case of demonstrations for a robot, this heavily supervised process is tedious and unsustainable, especially since each new task needs a new set of demonstrations for the robot to learn from.


===Paper Overview===
===Motivation===
''Observational Learning'' (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates ''what'' needs to be done (as opposed to ''how'' something is to be done) by providing observations of the desired world states via video or sequential images. This is the proposition of the paper and while this is a harder learning problem, it is possibly more useful because the expert can now distill a large number of tasks easily (and quickly) to the agent.
We are interested in learning generative models that build and use a disentangled latent space in which the content and the view are encoded separately. We propose an original approach that learns such models from multi-view datasets where (i) samples are labeled based on their content, without any view information, and (ii) the generated views are not restricted to a predefined subset of possible views. High-dimensional generative models have seen a surge of interest with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims to generate samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of the latent space: a generative model and a conditional variant of it. The authors claim that, unlike many multi-view approaches, the proposed model does not need any supervision on the views but only on the content.


[[File:1-GSP.png | 650px|thumb|center|Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.]]
===Related Work===
 
The problem of handling multi-view inputs has mainly been studied from the predictive point of view, where one wants, for example, to learn a model able to predict or classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variation from multi-view datasets. The underlying idea is that a particular data sample may be thought of as a mix of content information (e.g. related to its class label, like a given person in a face dataset) and side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with or without glasses). All samples of the same class therefore contain the same content but different views. A number of approaches (i.e. methods based on unlabeled samples) have been proposed to disentangle the content from the view, which is also referred to as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The paper identifies two common limitations of these earlier approaches: (i) they usually consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with or without glasses, the color of the hair, etc.) and could not easily scale to a large number of attributes or to continuous views; (ii) most models are trained using view supervision (e.g. the view attributes), which greatly helps in learning such models, yet prevents their use on many datasets where this information is not available.
 
Attempts have recently been made to learn such models without supervision, but they cannot disentangle high-level concepts, as only simple features can be reliably captured without any guidance.


This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a ''goal-conditioned skill policy (GSP)'' and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.
===Contributions===


A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the ''forward-consistent loss'' which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are ''consistent'' with the ground-truth action but not exactly the same.
The contributions that the authors claim are the following: (i) a new generative model able to generate data with various content and high view diversity, using supervision on the content information only; (ii) an extension of the generative model to a conditional model that allows generating new views of any input sample; (iii) experimental results on four different image datasets showing that the models can generate realistic samples and capture (and generate with) the diversity of views.


As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as ''consistent'' with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.
Precisely, two models are proposed:
# a generative model ('''GMV - Generative Multi-view Model''') that generates objects under various views (multiview generation),  
# and a conditional extension, '''conditional GMV (C-GMV)''' of this model that generates a large number of views of any input object (conditional multi-view generation).  


Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.
Both models are based on the adversarial training scheme of Generative Adversarial Networks (GAN) proposed in Goodfellow et al. (2014). The simple but strong idea is to focus on distributions over pairs of examples (e.g. images representing the same object in different views) rather than on distributions over single examples.
The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in ''VizDoom''. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.


===Related Work===
==Paper Overview==
Some key ideas related to this paper are '''imitation learning''', '''visual demonstration''', '''forward/inverse dynamics and consistency''' and finally, '''goal conditioning'''. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.
 
===Background===
 
The paper uses the concept of the popular GAN (Generative Adversarial Networks) proposed by Goodfellow et al.(2014).
 
GENERATIVE ADVERSARIAL NETWORK:
 
Generative adversarial networks (GANs) are deep neural net architectures composed of two nets, pitted one against the other (thus the "adversarial"). GANs were introduced in 2014 in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio. Referring to GANs, Facebook's AI research director Yann LeCun called adversarial training "the most interesting idea in the last 10 years in ML."
 
Let us denote by <math>X</math> an input space composed of multidimensional samples <math>x</math>, e.g. vectors, matrices or tensors. Given a latent space <math>R^n</math> and a prior distribution <math>p_z(z)</math> over this latent space, any generator function <math>G : R^n \to X</math> defines a distribution <math>p_G</math> on <math>X</math>, which is the distribution of samples <math>G(z)</math> where <math>z \sim p_z</math>. A GAN defines, in addition to <math>G</math>, a discriminator function <math>D : X \to [0, 1]</math> which aims at differentiating between real inputs sampled from the training set and fake inputs sampled from <math>p_G</math>, while the generator learns to fool the discriminator <math>D</math>. Usually both <math>G</math> and <math>D</math> are implemented with neural networks. The objective function is based on the following adversarial criterion:

<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x)] + E_{p_z}[\log(1 - D(G(z)))]</math></div>

where <math>p_x</math> is the empirical data distribution on <math>X</math>.
It has been shown in Goodfellow et al. (2014) that if <math>G^*</math> and <math>D^*</math> are optimal for the above criterion, the Jensen-Shannon divergence between <math>p_{G^*}</math> and the empirical distribution of the data <math>p_x</math> in the dataset is minimized, making GANs able to estimate complex continuous data distributions.
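As a small illustration of the adversarial criterion, the sketch below estimates its value from discriminator outputs alone. It is not taken from the paper's implementation: the function name <code>gan_value</code> and the example numbers are hypothetical, and the expectations are replaced by plain averages over a handful of values.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere,
# which gives the value -2 log 2.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value toward 0 (made-up outputs).
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this value while the generator tries to decrease it, which is the min-max structure of the criterion.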


==Learning to Imitate Without Expert Supervision==
CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:


In this section (and the included subsections) the methods for learning the GSP, ''forward consistency loss'' and ''goal recognizer'' network are described.  
In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristic (such as a label associated with an image or a more detailed tag) rather than a generic sample from an unknown noise distribution. The conditioning is obtained by defining a generator function <math>G</math> that takes a noise vector <math>z</math> and a condition <math>y</math> as inputs. To add such a condition to both the generator and the discriminator, the vector <math>y</math> is simply fed into both networks, giving a generator <math>G(y, z)</math> and a discriminator <math>D(x, y)</math>. A target <math>x</math> for a given input <math>y</math> can be obtained by first sampling the latent vector <math>z \sim p_z</math>, then computing <math>G(y, z)</math>. The discriminator takes both the condition <math>y</math> and the datapoint <math>x</math> as inputs.


Let <math display="inline">S : \{x_1, a_1, x_2, a_2, ..., x_T\}</math> be the sequence of observation-action pairs generated by the agent as it explores the environment using the policy <math display="inline">a = π_E(s)</math>. This exploration data is used to learn the GSP.
Now, the objective function of CGAN is:


<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x,y)] + E_{p_z}[\log(1 - D(G(y,z),y))]</math></div>


<div style="text-align: center;"><math>\overrightarrow{a}_τ =π (x_i, x_g; θ_π)</math></div>
The paper also notes that, according to many studies, CGAN tends to collapse the modes of the data distribution when dealing with high-dimensional input spaces: it mostly ignores the latent factor <math>z</math> and generates <math>x</math> based only on the condition <math>y</math>, exhibiting an almost deterministic behavior. In that regime, the CGAN fails to produce a satisfying amount of diversity in the generated samples.


===Generative Multi-View Model===


The above equation represents the learned GSP. <math display="inline">π</math> takes as input a pair of observations <math display="inline">(x_i, x_g)</math> and outputs the sequence of required actions <math display="inline">(\overrightarrow{a}_τ : a_1, a_2, ..., a_K)</math> to reach the goal observation <math display="inline">x_g</math> from the current observation <math display="inline">x_i</math>. The states <math display="inline">x_i</math> and <math display="inline">x_g</math> are sampled from <math display="inline">S</math> and the number of actions <math display="inline">K</math> is inferred by the model. <math display="inline">π</math> can be thought of as a general deep network with parameters <math display="inline">θ_π</math>. It is good to note that <math display="inline">x_g</math> could be an intermediate subtask of the overall goal. So in essence, subtasks can be strung together to achieve an overall goal (i.e. go to position 1, then go to position 2, then go to final destination).
'''Objective and Notations:''' The distribution of the data <math>x \in X</math> is assumed to be driven by two latent factors: a content factor denoted c, which corresponds to the invariant properties of the object, and a view factor denoted v, which corresponds to the factors of variation. Typically, if X is the space of people's faces, c stands for the intrinsic features of a person's face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure and additional elements like a hat, glasses, etc. These two factors c and v are assumed to be independent, and they are the factors the model needs to learn.


Let the sequence of images <math display="inline">D: \{x_1^d, x_2^d, ..., x_N^d\}</math> be the task  to be imitated which is captured when the expert demonstrates the task. The sequence has at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses the learned GSP <math display="inline">π</math> to start from initial state <math display="inline">x_0</math> and follow the actions predicted by <math display="inline">π(x_0, x_1^d; θ_π)</math> to imitate the observations in <math display="inline">D</math>.
The paper defines two tasks:
# '''Multi-View Generation''': we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling is possible if we are able to estimate p(x|c, v) from a training set.
# '''Conditional Multi-View Generation''': the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling is achieved by learning the probability p(c|x) in addition to p(x|c, v).
Learning generative models that generate from a disentangled latent space allows the sampling to be controlled along two different axes, the content and the view. The authors claim that the originality of the work is to learn such generative models without using any view labeling information.


A separate ''goal recognizer'' network is needed to ascertain whether the current observation is close to the goal. This is because multiple actions might be required to get close to <math display="inline">x_1^d</math>. Knowing this, let <math display="inline">x_0^\prime</math> be the observation after executing the predicted action. The goal recognizer evaluates whether <math display="inline">x_0^\prime</math> is sufficiently close to the goal and, if not, the agent executes <math display="inline">a = π(x_0^\prime, x_1^d; θ_π)</math>. This process is executed repeatedly for each image in <math display="inline">D</math> until the final goal is reached. This approach never requires the expert to convey to the agent what actions it performed.
The paper introduces the vectors '''c''' and '''v''' to represent latent vectors in <math>R^C</math> and <math>R^V</math>.
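The imitation procedure described above (predict an action toward the next demonstration image, execute it, ask the goal recognizer whether the goal is reached, repeat) can be sketched as a simple loop. This is only an illustration under stated assumptions: the one-dimensional integer world, the <code>toy_*</code> functions and the <code>max_steps_per_goal</code> cap are hypothetical stand-ins, not the learned networks or the setup from the paper.

```python
def imitate(policy, goal_reached, step, x0, demo, max_steps_per_goal=20):
    """Follow a demonstration given as a list of goal observations.

    policy(x, goal) plays the role of the GSP, goal_reached(x, goal) the role
    of the goal recognizer, and step(x, a) the role of the environment.
    """
    x, actions = x0, []
    for goal in demo:
        for _ in range(max_steps_per_goal):  # hypothetical safety cap
            if goal_reached(x, goal):
                break
            a = policy(x, goal)
            x = step(x, a)
            actions.append(a)
    return x, actions

# Toy 1-D world: states are integers and actions are +1 or -1.
def toy_policy(x, goal):
    return 1 if goal > x else -1

def toy_goal_reached(x, goal):
    return x == goal

def toy_step(x, a):
    return x + a

final_state, actions = imitate(toy_policy, toy_goal_reached, toy_step, 0, [3, 1])
```

The expert only supplies the goal observations (here 3, then 1); the actions are chosen entirely by the policy.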


===Learning the Goal-Conditioned Skill Policy (GSP)===


It is easiest to first describe the one-step version of the GSP and then extend it to a multi-step version. A one-step trajectory can be written as <math display="inline">(x_t, a_t, x_{t+1})</math> with GSP <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math>, which is trained with the standard cross-entropy loss given below with respect to the GSP parameters <math display="inline">θ_π</math>:
''' Generative Multi-view Model: '''


Consider two prior distributions over the content and view factors, denoted <math>p_c</math> and <math>p_v</math>. Moreover, consider a generator G that implements a distribution over samples x, denoted <math>p_G</math>, by computing G(c, v) with <math>c \sim p_c</math> and <math>v \sim p_v</math>. The objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).


<div style="text-align: center;"><math>L(a_t, \hat{a}_t) = p(a_t \mid x_t, x_{t+1}) \log(\hat{a}_t)</math></div>
The key idea that the authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available, the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When any two samples of the same object are chosen at random from the dataset, they most likely show two different views. The paper explains that there are three goals here:
# As in a regular GAN, each sample generated by G needs to look realistic.
# As real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors <math>v_1</math> and <math>v_2</math> are different, the only way this can be achieved is by encoding the invariant content in the vector c.
# The discriminator is expected to easily distinguish a pair of samples of the same object under different views from a pair of samples of the same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors <math>v_1</math> and <math>v_2</math> to produce diversity in the generated pair.


Now, the objective function of GMV Model is:


<math display="inline">p</math> and <math display="inline">\hat{a}_t</math> are the ground-truth and predicted action distributions respectively. <math display="inline">p</math> is not readily available, so it is empirically approximated using samples <math display="inline">a_t</math> from the distribution. In a standard deep learning problem it is common to assume that <math display="inline">p</math> is a delta function at <math display="inline">a_t</math>, but this assumption is violated if <math display="inline">p</math> is multi-modal and high-dimensional. That is, the same inputs would be presented with different targets, leading to high variance in the gradients. This makes learning challenging and motivates the further developments presented in sections 2.2, 2.3 and 2.4.
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{x_1,x_2}[\log D(x_1,x_2)] + E_{v_1,v_2}[\log(1 - D(G(c,v_1),G(c,v_2)))]</math></div>


===Forward Consistency Loss===
Once the model is learned, the generator G generates single samples by first sampling c and v following <math>p_c</math> and <math>p_v</math>, then computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also interpolate between two given views over a particular content, or between two contents using a particular view.
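The pair construction and the sampling with a frozen content vector can be sketched as follows. This is a minimal illustration, not the authors' code: <code>toy_generator</code> is a hypothetical stand-in for the learned network G (it simply concatenates its inputs), the standard normal priors and the tiny dataset are assumptions, and all function names are invented for the example.

```python
import random

def sample_prior(dim, rng):
    # Assumption: standard normal priors for both the content and the view.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def toy_generator(c, v):
    # Hypothetical stand-in for the learned G(c, v); a real G is a neural network.
    return tuple(c) + tuple(v)

def real_pair(dataset_by_object, rng):
    # Positive pair: two samples of the same object; no view label is consulted.
    obj = rng.choice(sorted(dataset_by_object))
    x1, x2 = rng.sample(dataset_by_object[obj], 2)
    return x1, x2

def generated_pair(generator, content_dim, view_dim, rng):
    # Negative pair: one shared content vector, two independent view vectors.
    c = sample_prior(content_dim, rng)
    v1 = sample_prior(view_dim, rng)
    v2 = sample_prior(view_dim, rng)
    return generator(c, v1), generator(c, v2)

rng = random.Random(0)
dataset = {"chair_a": ["a0", "a1", "a2"], "chair_b": ["b0", "b1"]}  # made-up sample ids
x1, x2 = real_pair(dataset, rng)
g1, g2 = generated_pair(toy_generator, content_dim=3, view_dim=2, rng=rng)

# Freezing c and resampling v yields several views of one content.
c_fixed = sample_prior(3, rng)
views = [toy_generator(c_fixed, sample_prior(2, rng)) for _ in range(4)]
```

In the real model the discriminator receives such pairs, so the only way for generated pairs to resemble real ones is for the shared input c to carry the object identity while the two view vectors carry the differences between the two samples.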


To deal with multi-modality, this paper proposes the ''forward consistency loss'' where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation <math display="inline">\hat{x}_{t+1}</math> (prediction from executing <math display="inline">\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math> ) and the observation <math display="inline">x_{t+1}</math> (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as ''forward consistency loss''.
<div style="text-align: center;font-size:100%">[[File:GMV.png]]</div>


To operationalize the forward consistency loss, a model of the forward dynamics is needed. The forward dynamics <math display="inline">f</math> is learned from the data and is defined as <math display="inline">\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math>. Since <math display="inline">f</math> is not analytic, there is no guarantee that <math display="inline">\widetilde{x}_{t+1} = \hat{x}_{t+1} </math>, so an additional term is added to the loss: <math display="inline">||x_{t+1} - \hat{x}_{t+1}||_2^2 </math>. The parameters <math display="inline">θ_f</math> are inferred by minimizing <math display="inline">||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 </math> where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground-truth transitions while the second term ensures consistency. In summary, the loss function is given below:
===Conditional Generative Model (C-GMV)===


C-GMV is proposed by the authors to change the view of a given object that is provided as an input to the model. This model extends the generative model with the ability to extract the content factor from any given input and to use this extracted content to generate new views of the corresponding object. To achieve this, an encoder function <math>E : X \to R^C</math> is added to the generative model; it maps any input in X to the content space <math>R^C</math>.


<div style="text-align: center;font-size:100%"><math>\underset{θ_π θ_f}{min}  \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)</math>, such that</div>
An input sample x is encoded in the content space using an encoder function, denoted E (implemented as a neural network).
<div style="text-align: center;font-size:80%"><math>\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)</math></div>
This encoder serves to generate a content vector c = E(x) that is combined with a randomly sampled view <math>v \sim p_v</math> to generate an artificial example. The artificial sample is then combined with the original input x to form a negative pair. The issue with this approach is that CGAN is known to easily miss modes of the underlying distribution: the generator enters a state where it ignores the noisy component v. To overcome this phenomenon, the same idea as in GMV is used. Negative pairs <math>(G(c, v_1), G(c, v_2))</math> are built by randomly sampling two views <math>v_1</math> and <math>v_2</math> that are combined with a single content c, where c is computed from a sample x using the encoder E, i.e. c = E(x). By doing so, the ability of the approach to generate pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model (<math>v_1</math> and <math>v_2</math>), this encourages G(c, v) to generate samples containing both the content information c and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.
<div style="text-align: center;font-size:80%"><math>\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π(x_t, x_{t+1}; θ_π)</math></div>
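To make the three terms of the one-step loss concrete, the sketch below evaluates them in a toy setting where two different actions lead to the same next state. Everything here is a hypothetical stand-in: the forward model and policy are hand-written functions rather than learned networks, a 0/1 mismatch replaces the cross-entropy action loss, and the value of <code>lam</code> is arbitrary.

```python
def squared_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def forward_consistent_terms(x_t, a_t, x_next, policy, forward_model, action_loss, lam):
    """Return the three terms of the one-step loss separately."""
    a_hat = policy(x_t, x_next)          # action predicted by the skill policy
    x_tilde = forward_model(x_t, a_t)    # prediction under the ground-truth action
    x_hat = forward_model(x_t, a_hat)    # prediction under the predicted action
    dynamics = squared_l2(x_next, x_tilde)
    consistency = lam * squared_l2(x_next, x_hat)
    action = action_loss(a_t, a_hat)
    return dynamics, consistency, action

# Toy world: going around an obstacle on the left or on the right both advance by one.
def toy_forward_model(x, a):
    step = 1.0 if a in ("left", "right") else 0.0
    return [x[0] + step]

def toy_policy(x_t, x_next):
    return "left"  # differs from the demonstrated action below

def mismatch_loss(a_true, a_pred):
    # Stand-in for the cross-entropy action loss.
    return 0.0 if a_true == a_pred else 1.0

dynamics, consistency, action = forward_consistent_terms(
    [0.0], "right", [1.0], toy_policy, toy_forward_model, mismatch_loss, lam=0.5
)
```

Here the action term penalizes the policy for choosing "left" instead of "right", while the consistency term is zero because both actions reach the same next state.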


Past works have shown that learning forward dynamics in the feature space, as opposed to the raw observation space, is more robust, so this paper follows suit and extends the GSP to make predictions on feature representations denoted <math>\phi(x_t), \phi(x_{t+1})</math>. The forward consistency loss is then computed on predictions in the feature space <math>\phi</math>. Learning <math>θ_π,θ_f</math> from scratch can cause noisier gradient updates for <math>π</math>. This is addressed by pre-training the forward model with the first term and the GSP separately, blocking gradient flow between them. Fine-tuning is then done with <math>θ_π,θ_f</math> jointly.
The Objective function for C-GMV will be:


The generalization to multi-step GSP <math>π_m</math> is shown below where <math>\phi</math> refers to the feature space rather than observation space which was used in the single-step case:
<div style="text-align: center;font-size:100%"><math>\underset{G}{\min} \ \underset{D}{\max} \ E_{x_1,x_2 \sim p_x \mid l(x_1)=l(x_2)}[\log D(x_1,x_2)] + E_{v_1,v_2 \sim p_v, x \sim p_x}[\log(1 - D(G(E(x),v_1),G(E(x),v_2)))] + E_{v \sim p_v, x \sim p_x}[\log(1 - D(G(E(x), v), x))]</math></div>
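The three expectations in the C-GMV criterion can be estimated from discriminator outputs on three kinds of pairs. The sketch below shows only that bookkeeping; it is not the authors' implementation, the function names are hypothetical, and the discriminator outputs are made-up numbers rather than the result of running any network.

```python
import math

def mean_log(values):
    return sum(math.log(v) for v in values) / len(values)

def cgmv_value(d_real_pairs, d_generated_pairs, d_input_pairs):
    """Estimate the three-term C-GMV criterion from discriminator outputs.

    d_real_pairs:      D(x1, x2) on real pairs of the same object.
    d_generated_pairs: D(G(E(x), v1), G(E(x), v2)) on pairs of generated views.
    d_input_pairs:     D(G(E(x), v), x) on pairs of a generated view and its input.
    All outputs are assumed to lie strictly between 0 and 1.
    """
    real = mean_log(d_real_pairs)
    generated = mean_log([1.0 - d for d in d_generated_pairs])
    with_input = mean_log([1.0 - d for d in d_input_pairs])
    return real + generated + with_input

# An undecided discriminator (0.5 everywhere) gives 3 * log(0.5).
undecided = cgmv_value([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])

# A discriminator that separates real pairs from both kinds of negatives scores higher.
separating = cgmv_value([0.9, 0.95], [0.1, 0.05], [0.2, 0.1])
```

Compared with the GMV criterion, the only structural change is the third term, which ties the generated view back to the input sample x.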


<div style="text-align: center;font-size:100%"><math>\underset{θ_π, θ_f, θ_{\phi}}{min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)</math>, such that</div>
<div style="text-align: center;font-size:100%">[[File:CGMV.png]]</div>


<div style="text-align: center;font-size:80%"><math>\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)</math></div>
<div style="text-align: center;font-size:80%"><math>\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)</math></div>


At inference time, as with the GMV model, we are interested in getting the encoder E and the generator G. These models may be used for generating new views of any object which is observed as an input sample x by computing its content vector E(x), then sampling <math>v \sim p_v</math> and finally computing the output G(E(x), v).


The forward consistency loss is computed at each time step t and jointly optimized with the action prediction loss over the whole trajectory. <math>\phi(.)</math> is represented by a CNN with parameters <math>θ_{\phi}</math>. The multi-step ''forward consistent'' GSP <math> \pi_m</math> is implemented via a recurrent network that takes as inputs the current state, the goal state, the action at the previous time step and the internal hidden representation <math> h_{t-1}</math>, and outputs the actions to take.
==Experiments and Results==


===Goal Recognizer===
The authors report experiments on four image datasets, covering both the generative and the conditional model.


The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not.
Datasets: The two models were evaluated in experiments on four image datasets from various domains. Note that when supervision on the views is available (as in CelebA, for example, where images are labeled with attributes), it is not used for learning the models. The only supervision used is whether two samples correspond to the same object or not.


<div style="text-align: center;font-size:100%">[[File:table_data.png]]</div>


The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered positive examples (close to the goal). The goal classifier was trained using the standard cross-entropy loss.
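The construction of training examples for the goal recognizer can be sketched as below. This is a hedged illustration only: the function name, the <code>window</code> parameter and the string observations are hypothetical, and labeling every observation outside the window as negative is an assumption, since the summary only specifies how positives are chosen.

```python
import random

def goal_recognizer_examples(trajectory, num_goals, window, rng):
    """Build (observation, goal, label) triples from one exploration trajectory."""
    examples = []
    for _ in range(num_goals):
        g = rng.randrange(len(trajectory))  # index of a pseudo-goal among visited states
        for t, obs in enumerate(trajectory):
            # Positive if within `window` timesteps of the pseudo-goal.
            # Assumption: everything else is treated as a negative.
            label = 1 if abs(t - g) <= window else 0
            examples.append((obs, trajectory[g], label))
    return examples

trajectory = ["x%d" % t for t in range(10)]  # stand-ins for image observations
examples = goal_recognizer_examples(trajectory, num_goals=3, window=2, rng=random.Random(0))
```

A binary classifier trained on such triples with the cross-entropy loss then answers the question "is this observation close to this goal?".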


The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.
Model Architecture: The same architectures are used for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow the DCGAN implementation proposed in Radford et al. (2015). The encoder E is similar to D, the only differences being the batch normalization in the first layer and a last layer without a non-linearity. The Adam optimizer was used, with a batch size of 128. The learning rates for G and D were set to 1×10<sup>-3</sup> and 2×10<sup>-4</sup> respectively for the GMV experiments. In the C-GMV experiments, learning rates of 5×10<sup>-5</sup> were used. Alternating gradient descent was used to optimize the different objectives of the network components (generator, encoder and discriminator).


===Ablations and Baselines===
Baselines: Most existing methods are learned on datasets with view labeling. To compare fairly with alternative models, the authors built baselines that work under the same conditions as the models in this paper. In addition, the models are compared with the model from Mathieu et al. (2016). Results obtained with two implementations are reported: the first is based on the implementation provided by its authors (denoted Mathieu et al. (2016)), and the second (denoted Mathieu et al. (2016) (DCGAN)) implements the same model using architectures inspired by DCGAN (Radford et al. (2015)), which is more stable and was tuned to allow a fair comparison with the proposed approach. For the pure multi-view generative setting, the generative model (GMV) is compared with standard GANs that are learned to approximate the joint generation of multiple samples: DCGANx2 is learned to output pairs of views over the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views.


To summarize, the GSP formulation is composed of (a) a recurrent variable-length skill policy network, (b) explicit encoding of the previous action in the recurrence, (c) a goal recognizer, (d) a forward consistency loss function, and (e) learning forward dynamics in the feature space instead of the raw observation space.
===Generating Multiple Contents and Views===


To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:
Figure 1 shows examples of generated images by our model and Figure 4 shows images sampled by the DCGAN based models (DCGANx2, DCGANx4, and DCGANx8) on 3DChairs and CelebA datasets.


# Classical methods: In visual navigation, the paper attempts to compare against the state-of-the-art ORB-SLAM2 and Open-SFM.
<div style="text-align: center;font-size:100%">[[File:fig1_gmv.png]]</div>
# Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
# '''GSP-NoPrevAction-NoFwdConst''' refers to the paper's recurrent GSP without previous action history and without the forward consistency loss.
# '''GSP-NoFwdConst''' refers to the recurrent GSP with previous action history, but without the forward consistency objective.
# '''GSP-FwdRegularizer''' refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
# '''GSP''' refers to the complete method with all the components.


==Experiments==
<div style="text-align: center;font-size:100%">[[File:fig4_gmv.png]]</div>


The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.


===Rope Manipulation===
Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets. In the left hand block of Figure 5, each row shows different views generated given the same content.


Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.
<div style="text-align: center;font-size:100%">[[File:fig5_gmv.png]]</div>


In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form <math>(x_t, a_t, x_{t+1})</math>. These were used to train the GSP proposed in this paper.  
Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). Again, in the left and right hand blocks of Figure 6, each row shows different views generated given the same content. This gives a better idea of the underlying view/content structure captured by GMV. The approach is able to move smoothly from one content/view to another content/view while keeping the other factor constant. This also illustrates that the content and view factors are handled independently by the generator, i.e. changing the view
does not modify the content and vice versa.


For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.


The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.
<div style="text-align: center;font-size:100%">[[File:fig6_gmv.png]]</div>


The approach of this paper is compared to (Nair et al., 2017) where they did similar experiments using an inverse model. The results in Figure 3 show that for the 'S' shape construction, zero-shot visual imitation achieves a success rate of 60% versus the 36% baseline from the inverse model.
===Generating Multiple Views of a Given Object===


[[File:2-Rope_manip.png | 650px|thumb|center|Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The
The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample and to use this content to generate multiple views of the same object. Figures 7 and 8 illustrate the diversity of views in samples generated by the model and compare the results with those obtained with the CGAN model and with the models from Mathieu et al. (2016). For each row, the input sample is shown in the left column. New views are generated from that input and shown to the right, with those generated from C-GMV in the centre, and those generated from CGAN on the far right.
robotics system setup. (b) The sequence of human demonstration images provided by the human
during inference for the task of knot-tying (top row), and the sequences of observation states reached
by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human
demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’
shape. Our agent is able to successfully imitate the demonstration.]]


[[File:3-GSP_graph.png | 650px|thumb|center|Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation]]
<div style="text-align: center;font-size:100%">[[File:fig7_gmv.png]]</div>


===Navigation in Indoor Office Environments===
In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions <math>(x_t, a_t, x_{t+1})</math> (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.


The collected data was used to train a ''recurrent forward-consistent GSP''. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.
<div style="text-align: center;font-size:100%">[[File:fig8_gmv.png]]</div>


Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.
=== Evaluation of the Quality of Generated Samples ===


[[File:4-TurtleBot_visualization.png | 650px|thumb|center|Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image
There are usually several metrics to evaluate generative models. Some of them are:  
(top-left). Since the initial and goal image has no overlap, the robot first explores the environment
<ol>
by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42
  <li>Inception Score: In a general sense, the Inception Score is a metric used to quantify the “realness” of a set of generated images. It is calculated across a set of generated images using a pretrained classifier, and considers two criteria. First, each image should be assigned to a class with high confidence, so images of the same class should look similar (low in-class variance). Second, the distribution of predicted classes should not be dominated by any particular class. The better these criteria are met, the higher the Inception Score.</li>
onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and
  <li>Latent Space Interpolation</li>
such exploratory behavior naturally emerged from the self-supervised learning.]]
  <li>log-likelihood (LL) score</li>
  <li> minimum description length (MDL) score</li>
  <li>minimum message length (MML) score</li>
  <li>Akaike Information Criterion (AIC) score</li>
  <li>Bayesian Information Criterion (BIC) score</li>
</ol>
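
As a rough illustration of the first metric in the list, the Inception Score is commonly computed as the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over the whole set. The sketch below assumes the class probabilities have already been produced by a pretrained classifier; the function name <code>inception_score</code> and the hand-written probability rows are hypothetical and are not taken from the paper.

```python
import math

def inception_score(probs):
    """Inception Score from per-image class probabilities.

    probs: list of rows, each row a probability vector p(y | x) for one
    generated image (assumed to come from a pretrained classifier).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y) over the whole generated set.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute 0.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(row) if pj > 0.0
        )
    return math.exp(total_kl / n)

# Hypothetical predictions: confident and evenly spread over two classes.
sharp_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical predictions: the classifier is unsure about every image.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

score_high = inception_score(sharp_and_diverse)
score_low = inception_score(uninformative)
```

With two classes the score ranges from 1 (every prediction equals the marginal) to 2 (confident predictions spread evenly over both classes), which matches the two criteria described in the list.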


[[File:5-Table1.png | 650px|thumb|center|Table 1: Quantitative evaluation of various methods on the task of navigating using a single image
of goal in an unseen environment. Each column represents a different run of our system for a
different initial/goal image pair. The full GSP model takes longer to reach the goal on average given
a successful run but reaches the goal successfully at a much higher rate.]]


Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. Note that zero-shot visual imitation is robust to a changing environment, where not every frame needs to match the demonstrated frame. This is achieved by providing sparse landmarks.


[[File:6-Turtlebot_visual_2.png | 650px|thumb|center|Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of
images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration
has no overlap with its current observation. Even under this condition, the robot is able to move closer
to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration
until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint
matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from
WayPoint-1.]]


[[File:5-Table2.png | 650px |thumb|center|Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in
two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three
runs of two different demonstrations. Results show that our method outperforms the baselines. Note
that 3 more trials of the loop demonstration were tested under significantly different lighting conditions
and neither model succeeded. Detailed results are available in the supplementary materials.]]


===3D Navigation in VizDoom===
The authors ran a set of experiments aimed at evaluating the quality of the generated samples. These were carried out on the CelebA dataset and evaluate the ability of the models (i) to preserve the identity of a person across multiple generated views, (ii) to generate realistic samples, (iii) to preserve diversity in the generated views and (iv) to capture the view distribution of the original dataset.


To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based Reinforcement Learning testbed. It allows agents to play Doom using only a screen buffer. It is a 3D simulation environment that is traditionally considered to be harder than 2D domains such as Atari. The goal was to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference between modeling the forward consistency loss in feature space and in raw visual space.
<div style="text-align: center;font-size:100%">[[File:tab3.png]]</div>


Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.


Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.
<div style="text-align: center;font-size:100%">[[File:tab4.png]]</div>


[[File:8-Table3.png | 550px |thumb|center| Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual
demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for
demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.]]


==Discussion==
<div style="text-align: center;font-size:100%">[[File:table.png]]</div>


This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.
==Conclusion==


A limitation of the work in this paper is that the method requires first-person view demonstrations. Extending to the third-person may yield a learning of a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.  
The paper proposed a generative model that can be learnt from multi-view data without any view supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. The experiments showed that the model can capture content and view factors. Here, the paper showed that the application of architecture search to dense image prediction was achieved through a) the construction of a recursive search space leveraging innovation in the dense prediction literature and b) the construction of a fast proxy predictive of a large task. The learned architecture was shown to surpass human-invented architectures across three dense image prediction tasks, i.e. scene parsing, person part segmentation and semantic segmentation. In the future, the authors plan to use the method of this paper for data augmentation, which can enrich the training dataset.


The expert demonstrations are also purely imitated; that is, the agent does not learn from the demonstrations. Future work could look into learning from the demonstrations so as to enrich the agent's exploration techniques.
==Future Work==
The authors of the paper mentioned that they plan to explore using their model for data augmentation, as it can produce other data views for training, in both semi-supervised and one-shot/few-shot learning settings.


This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work.
==Critique==


==References==
The main idea is to train the model with pairs of images with different views. It is not entirely clear what defines a view in particular. The algorithms are largely based on the earlier concepts of GAN and CGAN. The authors reference the previous papers tackling the same problem and clearly state that the novelty of this approach is that it does not make use of view labels. The authors give a very thorough list of experiments which clearly establish the superiority of the proposed models over the baselines.


[1] D.Pathak, P.Mahmoudieh, G.Luo, P.Agrawal, D.Chen, Y.Shentu, E.Shelhamer, J.Malik, A.A.Efros, and T. Darrell. Zero-shot Visual Imitation. In ICLR, 2018.
However, this paper only tested the model on rather constrained examples. As was observed in the results the proposed approach seems to have a high sample complexity relying on training samples covering the full range of variations for both specified and unspecified variations. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.


[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning
The method that the paper presented is novel and the paper is easy to follow. However, the authors only show a comparison between the proposed method and several baselines (DCGAN and CGAN) and do not compare with the methods from Mathieu et al. 2016. In addition, the experimental results are purely empirical; the performance of this method in real-world practice is unknown.
from demonstration. Robotics and autonomous systems, 2009.


[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood
==References==
Cliffs, NJ, 1977.


[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke
[1] Mickael Chen, Ludovic Denoyer, Thierry Artieres. MULTI-VIEW DATA GENERATION WITHOUT VIEW SUPERVISION. Published as a conference paper at ICLR 2018
by poking: Experiential learning of intuitive physics. NIPS, 2016.


[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination
[2] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.
for robotic grasping with large-scale data collection. In ISER, 2016.


[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and
[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.
700 robot hours. ICRA, 2016.


[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
Levine. Combining self-supervised learning and imitation for vision-based rope manipulation.
ICRA, 2017.


[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration
[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
by self-supervised prediction. In ICML, 2017.

Latest revision as of 23:34, 13 December 2018

This page contains a summary of the paper "Multi-View Data Generation without Supervision" by Mickael Chen, Ludovic Denoyer, Thierry Artieres. It was published at the International Conference on Learning Representations (ICLR) in 2018. An implementation of the models presented in this paper is available here[1]

Introduction

Motivation

We are interested in learning generative models that build and make use of a disentangled latent space where the content and the view are encoded separately. We propose to take an original approach by learning such models from multi-view datasets, where (i) samples are labeled based on their content, and without any view information, and (ii) where the generated views are not restricted to be one view in a subset of possible views. High Dimensional Generative models have seen a surge of interest of late with the introduction of Variational Auto-Encoders and Generative Adversarial Networks. This paper focuses on a particular problem where one aims at generating samples corresponding to a number of objects under various views. The distribution of the data is assumed to be driven by two independent latent factors: the content, which represents the intrinsic features of an object, and the view, which stands for the settings of a particular observation of that object (for example, the different angles of the same object). The paper proposes two models using this disentanglement of latent space - a generative model and a conditional variant of the same. The authors claim that unlike many multi-view approaches, the proposed model doesn’t need any supervision on the views but only on the content.

Related Work

The problem of handling multi-view inputs has mainly been studied from the predictive point of view where one wants, for example, to learn a model able to predict/classify over multiple views of the same object (Su et al. (2015); Qi et al. (2016)). These approaches generally involve (early or late) fusion of the different views at a particular level of a deep architecture. Recent studies have focused on identifying factors of variation from multi-view datasets. The underlying idea is to consider that a particular data sample may be thought of as a mix of content information (e.g. related to its class label, like a given person in a face dataset) and of side information, the view, which accounts for factors of variability (e.g. exposure, viewpoint, with/without glasses...). So, all the samples of the same class contain the same content but different views. A number of approaches have been proposed to disentangle the content from the view (i.e. methods based on unlabeled samples), also referred to as the style in some papers (Mathieu et al. (2016); Denton & Birodkar (2017)). The two common limitations of the earlier approaches, as claimed by the paper, are that (i) they usually consider discrete views that are characterized by a domain or a set of discrete (binary/categorical) attributes (e.g. face with/without glasses, the color of the hair, etc.) and cannot easily scale to a large number of attributes or to continuous views, and (ii) most models are trained using view supervision (e.g. the view attributes), which of course greatly helps in learning such models, yet prevents their use on many datasets where this information is not available.

Recently, attempts have been made to learn such models without supervision, but they cannot disentangle high-level concepts, as only simple features can be reliably captured without any guidance.

Contributions

The authors claim the following contributions: (i) A new generative model able to generate data with various content and high view diversity using supervision on the content information only. (ii) An extension of the generative model to a conditional model that allows generating new views over any input sample. (iii) Experimental results on four different image datasets showing that the models can generate realistic samples and capture (and generate with) the diversity of views.

Precisely, two models have been proposed:

  1. a generative model (GMV - Generative Multi-view Model) that generates objects under various views (multiview generation),
  2. and a conditional extension, conditional GMV (C-GMV) of this model that generates a large number of views of any input object (conditional multi-view generation).

Both models are based on the adversarial training schema of Generative Adversarial Networks (GAN) proposed in Goodfellow et al. (2014). The simple but strong idea is to focus on distributions over pairs of examples (e.g. images representing the same object in different views) rather than distributions over single examples.

Paper Overview

Background

The paper uses the concept of the popular GAN (Generative Adversarial Networks) proposed by Goodfellow et al.(2014).

GENERATIVE ADVERSARIAL NETWORK:

Generative adversarial networks (GANs) are deep neural network architectures composed of two networks pitted against each other (hence “adversarial”). GANs were introduced in 2014 in a paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio. Referring to GANs, Facebook’s AI research director Yann LeCun called adversarial training “the most interesting idea in the last 10 years in ML.”

Let us denote [math]\displaystyle{ X }[/math] an input space composed of multidimensional samples [math]\displaystyle{ x }[/math] e.g. vector, matrix or tensor. Given a latent space [math]\displaystyle{ R^n }[/math] and a prior distribution [math]\displaystyle{ p_z(z) }[/math] over this latent space, any generator function [math]\displaystyle{ G : R^n → X }[/math] defines a distribution [math]\displaystyle{ p_G }[/math] on [math]\displaystyle{ X }[/math] which is the distribution of samples [math]\displaystyle{ G(z) }[/math] where [math]\displaystyle{ z ∼ p_z }[/math]. A GAN defines, in addition to [math]\displaystyle{ G }[/math], a discriminator function [math]\displaystyle{ D : X → [0; 1] }[/math] which aims at differentiating between real inputs sampled from the training set and fake inputs sampled from [math]\displaystyle{ p_G }[/math], while the generator learns to fool the discriminator [math]\displaystyle{ D }[/math]. Usually both [math]\displaystyle{ G }[/math] and [math]\displaystyle{ D }[/math] are implemented with neural networks. The objective function is based on the following adversarial criterion:

[math]\displaystyle{ \underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x)] + E_{p_z}[\log(1 - D(G(z)))] }[/math]

where [math]\displaystyle{ p_x }[/math] is the empirical data distribution on [math]\displaystyle{ X }[/math] . It has been shown in Goodfellow et al. (2014) that if G∗ and D∗ are optimal for the above criterion, the Jensen-Shannon divergence between [math]\displaystyle{ p_{G∗} }[/math] and the empirical distribution of the data [math]\displaystyle{ p_x }[/math] in the dataset is minimized, making GAN able to estimate complex continuous data distributions.
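
To make the adversarial criterion concrete, the following minimal sketch estimates its value by Monte Carlo for fixed G and D. The one-dimensional logistic discriminator, the affine generator, and the Gaussian data are toy assumptions chosen only so the example runs without any deep learning library; they are not the DCGAN-style networks used in the paper.

```python
import math
import random

def discriminator(x, w=2.0, b=-1.0):
    """Toy logistic discriminator D: R -> (0, 1) (hypothetical parameters)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def generator(z, scale=0.5, shift=0.0):
    """Toy affine generator G: R -> R (hypothetical parameters)."""
    return scale * z + shift

def gan_value(real_samples, latent_samples, D=discriminator, G=generator):
    """Monte Carlo estimate of E_{p_x}[log D(x)] + E_{p_z}[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in latent_samples) / len(latent_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]    # stand-in for the dataset
latent = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # z ~ p_z

value = gan_value(real, latent)
# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives log(0.5) + log(0.5) = -log(4).
blind_value = gan_value(real, latent, D=lambda x: 0.5)
```

The discriminator is trained to increase this quantity and the generator to decrease it. The value -log 4 obtained with the constant discriminator is the value reached at the optimum described above, when the generated distribution matches the data distribution.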

CONDITIONAL GENERATIVE ADVERSARIAL NETWORK:

In the Conditional GAN (CGAN), the generator learns to generate a fake sample with a specific condition or characteristic (such as a label associated with an image or a more detailed tag) rather than a generic sample from an unknown noise distribution. The conditionality of a CGAN is obtained by defining a generator function [math]\displaystyle{ G }[/math] which takes a noise vector [math]\displaystyle{ z }[/math] and a condition [math]\displaystyle{ y }[/math] as inputs. To add such a condition to both generator and discriminator, the vector [math]\displaystyle{ y }[/math] is simply fed into both networks. Hence, both the discriminator [math]\displaystyle{ D(x,y) }[/math] and the generator [math]\displaystyle{ G(y,z) }[/math] are conditioned on [math]\displaystyle{ y }[/math]. A target [math]\displaystyle{ x }[/math] for a given input [math]\displaystyle{ y }[/math] can be obtained by first sampling the latent vector [math]\displaystyle{ z ∼ p_z }[/math], then computing [math]\displaystyle{ G(y, z) }[/math]. The discriminator takes both the condition [math]\displaystyle{ y }[/math] and the datapoint [math]\displaystyle{ x }[/math] as inputs.

Now, the objective function of CGAN is:

[math]\displaystyle{ \underset{G}{\min} \ \underset{D}{\max} \ E_{p_x}[\log D(x,y)] + E_{p_z}[\log(1 - D(G(y,z),y))] }[/math]

The paper also suggests that many studies have reported that when dealing with high-dimensional input spaces, CGAN tends to collapse the modes of the data distribution, mostly ignoring the latent factor [math]\displaystyle{ z }[/math] and generating [math]\displaystyle{ x }[/math] only based on the condition [math]\displaystyle{ y }[/math], exhibiting an almost deterministic behavior. At this point, the CGAN also fails to produce a satisfying amount of diversity in generated samples.

Generative Multi-View Model

Objective and Notations: The distribution of the data x ∈ X is assumed to be driven by two latent factors: a content factor denoted c, which corresponds to the invariant properties of the object, and a view factor denoted v, which corresponds to the factors of variation. Typically, if X is the space of people’s faces, c stands for the intrinsic features of a person’s face while v stands for the transient features and the viewpoint of a particular photo of the face, including the photo exposure and additional elements like a hat, glasses, etc. These two factors c and v are assumed to be independent, and they are the factors that need to be learned.

The paper defines two tasks here to be done: (i) Multi View Generation: we want to be able to sample over X by controlling the two factors c and v. Given two priors, p(c) and p(v), this sampling will be possible if we are able to estimate p(x|c, v) from a training set. (ii) Conditional Multi-View Generation: the second objective is to be able to sample different views of a given object. Given a prior p(v), this sampling will be achieved by learning the probability p(c|x), in addition to p(x|c, v). Ability to learn generative models able to generate from a disentangled latent space would allow controlling the sampling on the two different axes, the content and the view. The authors claim the originality of work is to learn such generative models without using any view labeling information.

The paper introduces the vectors c and v to represent latent vectors in [math]\displaystyle{ R^C }[/math] and [math]\displaystyle{ R^V }[/math].


Generative Multi-view Model:

Consider two prior distributions denoted [math]\displaystyle{ p_c }[/math] and [math]\displaystyle{ p_v }[/math], corresponding to the prior distributions over the content and view factors. Moreover, consider a generator G that implements a distribution over samples x, denoted [math]\displaystyle{ p_G }[/math], by computing G(c, v) with [math]\displaystyle{ c ∼ p_c }[/math] and [math]\displaystyle{ v ∼ p_v }[/math]. The objective is to learn this generator so that its first input c corresponds to the content of the generated sample while its second input v captures the underlying view of the sample. Doing so would allow one to control the output sample of the generator by tuning its content or its view (i.e. c and v).

The key idea that the authors propose is to focus on the distribution of pairs of inputs rather than on the distribution over individual samples. When no view supervision is available, the only valuable pairs of samples that one may build from the dataset consist of two samples of a given object under two different views. When any two samples of the same object are chosen randomly from the dataset, it is most likely that they show two different views. The paper explains that there are three goals here. (i) As in a regular GAN, each sample generated by G needs to look realistic. (ii) As real pairs are composed of two views of the same object, the generator should generate pairs of the same object. Since the two sampled view factors v1 and v2 are different, the only way this can be achieved is by encoding the object in the content vector c, which is shared by the pair. (iii) It is expected that the discriminator should easily distinguish a pair of samples corresponding to the same object under different views from a pair of samples corresponding to the same object under the same view. Because the pair shares the same content factor c, this should force the generator to use the view factors v1 and v2 to produce diversity in the generated pair.

The objective function of the GMV model is:

[math]\displaystyle{ \min_{G} \max_{D} \; E_{x_1,x_2}[\log D(x_1,x_2)] + E_{v_1,v_2}[\log(1 - D(G(c,v_1),G(c,v_2)))] }[/math]
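As a rough illustration of how this value function can be estimated on a mini-batch, the sketch below averages the two log terms over discriminator outputs. The function name `gmv_value`, the `eps` stabilizer, and the toy probabilities are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math

def gmv_value(d_real_pairs, d_fake_pairs, eps=1e-12):
    """Monte Carlo estimate of the GMV value function.

    d_real_pairs: D(x1, x2) for real pairs (two views of one object).
    d_fake_pairs: D(G(c, v1), G(c, v2)) for generated pairs sharing c.
    Both arguments are lists of probabilities in (0, 1).
    """
    real_term = sum(math.log(d + eps) for d in d_real_pairs) / len(d_real_pairs)
    fake_term = sum(math.log(1.0 - d + eps) for d in d_fake_pairs) / len(d_fake_pairs)
    return real_term + fake_term

# Hypothetical discriminator outputs, for illustration only.
confident = gmv_value([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
fooled = gmv_value([0.5, 0.5], [0.5, 0.5])     # D cannot tell them apart
```

The discriminator tries to increase this value while the generator tries to decrease it, so `confident` is larger than `fooled`.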

Once the model is learned, the generator G can produce single samples by first sampling c and v from [math]\displaystyle{ p_c }[/math] and [math]\displaystyle{ p_v }[/math], then computing G(c, v). By freezing c or v, one may then generate samples corresponding to multiple views of any particular content, or corresponding to many contents under a particular view. One can also interpolate between two given views over a particular content, or between two contents using a particular view.
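The following minimal sketch illustrates freezing the content vector while moving along a straight line between two view vectors. Everything here is hypothetical: `toy_generator` is only a placeholder for a trained G(c, v), the standard normal prior and the latent dimensions are assumptions, and `interpolate` expects at least two steps.

```python
import random

def sample_prior(dim, rng):
    """Draw a latent vector from a standard normal prior (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def interpolate(a, b, steps):
    """Linear interpolation from a to b, endpoints included (steps >= 2)."""
    ts = [i / (steps - 1) for i in range(steps)]
    return [[(1 - t) * ai + t * bi for ai, bi in zip(a, b)] for t in ts]

def toy_generator(c, v):
    """Placeholder for a trained G(c, v); it simply concatenates its inputs."""
    return list(c) + list(v)

rng = random.Random(0)
c = sample_prior(4, rng)        # frozen content vector
v_start = sample_prior(2, rng)  # first view vector
v_end = sample_prior(2, rng)    # second view vector

# Same content, five views along the path from v_start to v_end.
views_of_c = [toy_generator(c, v) for v in interpolate(v_start, v_end, 5)]
```

Swapping the roles of c and v in the last line gives the other case, many contents under one fixed view.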

Conditional Generative Model (C-GMV)

C-GMV is proposed by the authors to change the view of a given object that is provided as an input to the model. This model extends the generative model with the ability to extract the content factor from any given input and to use this extracted content to generate new views of the corresponding object. To achieve this, an encoder function [math]\displaystyle{ E : X → R^C }[/math] is added to the generative model; it maps any input in X to the content space [math]\displaystyle{ R^C }[/math].

An input sample x is encoded in the content space using the encoder function E (implemented as a neural network). This encoder produces a content vector c = E(x) that is combined with a randomly sampled view [math]\displaystyle{ v ∼ p_v }[/math] to generate an artificial example. The artificial sample is then paired with the original input x to form a negative pair. The issue with this approach is that CGAN is known to easily miss modes of the underlying distribution: the generator can enter a state where it ignores the noise component v. To overcome this, the authors use the same idea as in GMV. Negative pairs [math]\displaystyle{ (G(c, v_1), G(c, v_2)) }[/math] are built by randomly sampling two views [math]\displaystyle{ v_1 }[/math] and [math]\displaystyle{ v_2 }[/math] that are combined with a single content vector c, computed from a sample x using the encoder E, i.e. c = E(x). By doing so, the ability of the approach to generate pairs with view diversity is preserved. Since this diversity can only be captured by taking into account the two different view vectors provided to the model ([math]\displaystyle{ v_1 }[/math] and [math]\displaystyle{ v_2 }[/math]), this encourages G(c, v) to generate samples that reflect both the content information c and the view v. Positive pairs are sampled from the training set and correspond to two views of a given object.

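The objective function for C-GMV, where the condition [math]\displaystyle{ l(x_1)=l(x_2) }[/math] restricts real pairs to two samples of the same object, is: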

[math]\displaystyle{ \min_{G} \max_{D} \; E_{x_1,x_2 \sim p_x \mid l(x_1)=l(x_2)}[\log D(x_1,x_2)] + E_{v_1,v_2 \sim p_v,\, x \sim p_x}[\log(1 - D(G(E(x),v_1),G(E(x),v_2)))] + E_{v \sim p_v,\, x \sim p_x}[\log(1 - D(G(E(x),v),x))] }[/math]
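The two kinds of negative pairs in this objective can be sketched as follows. This is only an illustration under stated assumptions: `cgmv_negative_pairs`, `cgmv_value`, and the toy encoder, generator and view sampler are hypothetical stand-ins, not the authors' code, and a "sample" is modeled as a simple (content, view) tuple.

```python
import math
import random

def cgmv_negative_pairs(x, encoder, generator, sample_view):
    """Build both kinds of negative pairs for one input sample x."""
    c = encoder(x)  # content vector c = E(x)
    v1, v2, v = sample_view(), sample_view(), sample_view()
    generated_pair = (generator(c, v1), generator(c, v2))  # second term
    mixed_pair = (generator(c, v), x)                      # third term
    return generated_pair, mixed_pair

def cgmv_value(d_real, d_generated, d_mixed, eps=1e-12):
    """Monte Carlo estimate of the three-term C-GMV value function."""
    def mean(values):
        return sum(values) / len(values)
    return (mean([math.log(d + eps) for d in d_real])
            + mean([math.log(1.0 - d + eps) for d in d_generated])
            + mean([math.log(1.0 - d + eps) for d in d_mixed]))

# Toy stand-ins (hypothetical): a sample is a (content, view) tuple.
rng = random.Random(0)

def toy_encoder(x):
    return x[0]

def toy_generator(c, v):
    return (c, v)

def toy_sample_view():
    return rng.random()

x = ("chair_17", 0.25)
generated_pair, mixed_pair = cgmv_negative_pairs(
    x, toy_encoder, toy_generator, toy_sample_view)
```

Both elements of `generated_pair` share the content extracted from x but carry different views, while `mixed_pair` couples one generated view with the original input.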


At inference time, as with the GMV model, only the encoder E and the generator G are needed. These models may be used to generate new views of any object observed as an input sample x by computing its content vector E(x), then sampling [math]\displaystyle{ v ∼ p_v }[/math], and finally computing the output G(E(x), v).

Experiments and Results

The authors report an extensive set of experiments and results.

Datasets: The two models were evaluated by performing experiments over four image datasets from various domains. Note that when supervision is available on the views (as in CelebA, for example, where images are labeled with attributes), it is not used for learning the models. The only supervision used is whether or not two samples correspond to the same object.
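To make the form of supervision concrete, the sketch below builds positive training pairs using only object identities and no view labels. The function name `sample_positive_pairs` and the toy sample names are assumptions for illustration; the authors' data pipeline may differ.

```python
import random
from collections import defaultdict

def sample_positive_pairs(samples, object_ids, n_pairs, rng):
    """Sample pairs (x1, x2) of distinct samples that show the same object."""
    by_object = defaultdict(list)
    for sample, obj in zip(samples, object_ids):
        by_object[obj].append(sample)
    # Only objects with at least two samples can form a pair.
    eligible = [items for items in by_object.values() if len(items) >= 2]
    pairs = []
    for _ in range(n_pairs):
        items = rng.choice(eligible)
        x1, x2 = rng.sample(items, 2)  # two distinct samples of one object
        pairs.append((x1, x2))
    return pairs

# Toy dataset: the first letter of each name identifies the object.
samples = ["a0", "a1", "a2", "b0", "b1", "c0"]
object_ids = ["a", "a", "a", "b", "b", "c"]
pairs = sample_positive_pairs(samples, object_ids, 10, random.Random(0))
```

Object "c" has a single sample here, so it never appears in a pair; the views of the two paired samples are never inspected.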


Model Architecture: The same architectures were used for every dataset. The images were rescaled to 3×64×64 tensors. The generator G and the discriminator D follow the DCGAN implementation proposed in Radford et al. (2015). The encoder E is similar to D, the only differences being the batch normalization in the first layer and the last layer, which has no non-linearity. The Adam optimizer was used, with a batch size of 128. The learning rates for G and D were set to [math]\displaystyle{ 1 \times 10^{-3} }[/math] and [math]\displaystyle{ 2 \times 10^{-4} }[/math] respectively for the GMV experiments. In the C-GMV experiments, learning rates of [math]\displaystyle{ 5 \times 10^{-5} }[/math] were used. Alternating gradient descent was used to optimize the different objectives of the network components (generator, encoder and discriminator).

Baselines: Most existing methods are learned on datasets with view labeling. To compare fairly with alternative models, the authors built baselines that work under the same conditions as the models in this paper. In addition, the models are compared with the model from Mathieu et al. (2016). Results obtained with two implementations are reported: the first is based on the implementation provided by its authors (denoted Mathieu et al. (2016)), and the second (denoted Mathieu et al. (2016) (DCGAN)) implements the same model using architectures inspired by DCGAN (Radford et al., 2015), which is more stable and was tuned to allow a fair comparison with the proposed approach. For the pure multi-view generative setting, the generative model (GMV) is compared with standard GANs that are trained to approximate the joint generation of multiple samples: DCGANx2 is trained to output pairs of views of the same object, DCGANx4 is trained on quadruplets, and DCGANx8 on eight different views.

Generating Multiple Contents and Views

Figure 1 shows examples of images generated by the GMV model, and Figure 4 shows images sampled by the DCGAN-based models (DCGANx2, DCGANx4, and DCGANx8) on the 3DChairs and CelebA datasets.


Figure 5 shows additional results, using the same presentation, for the GMV model only on two other datasets. In the left hand block of Figure 5, each row shows different views generated given the same content.

Figure 6 shows generated samples obtained by interpolation between two different view factors (left) or two content factors (right). Again, in the left and right hand blocks of Figure 6, each row shows different views generated given the same content. This gives a better idea of the underlying view/content structure captured by GMV. The approach is able to move smoothly from one content or view to another while keeping the other factor constant. This also illustrates that the content and view factors are handled independently by the generator, i.e. changing the view does not modify the content and vice versa.


Generating Multiple Views of a Given Object

The second set of experiments evaluates the ability of C-GMV to capture a particular content from an input sample and to use this content to generate multiple views of the same object. Figures 7 and 8 illustrate the diversity of views in samples generated by the model and compare the results with those obtained with the CGAN model and with the models from Mathieu et al. (2016). For each row, the input sample is shown in the left column. New views are generated from that input and shown to the right, with those generated by C-GMV in the centre and those generated by CGAN on the far right.


Evaluation of the Quality of Generated Samples

Several metrics are commonly used to evaluate generative models, including:

  1. Inception Score: In a general sense, the Inception Score is a metric used to quantify the “realness” of a generated image. It is calculated across a set of generated images and considers two criteria. First, all images of the same class should be similar (low in-class variance). Second, the distribution of classes should not be dominated by any particular class. The better these criteria are met, the higher the Inception Score.
  2. Latent Space Interpolation
  3. log-likelihood (LL) score
  4. minimum description length (MDL) score
  5. minimum message length (MML) score
  6. Akaike Information Criterion (AIC) score
  7. Bayesian Information Criterion (BIC) score
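As a hedged illustration of the first metric, the sketch below computes the Inception Score in its commonly cited form, the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal class distribution p(y). In practice the class probabilities come from a pretrained Inception classifier; here they are hypothetical hand-written vectors, and the function name and `eps` stabilizer are assumptions.

```python
import math

def inception_score(class_probs, eps=1e-12):
    """Compute exp(mean over images of KL(p(y|x) || p(y))).

    class_probs: one probability vector p(y|x) per generated image.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    mean_kl = sum(
        sum(pj * (math.log(pj + eps) - math.log(mj + eps))
            for pj, mj in zip(p, marginal))
        for p in class_probs
    ) / n
    return math.exp(mean_kl)

# Confident predictions spread evenly over two classes: highest score.
sharp_and_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions: lowest score.
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

With K classes the score ranges from 1 (uninformative predictions) to K (confident predictions spread evenly over all classes).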



The authors ran several sets of experiments aimed at evaluating the quality of the generated samples. These were performed on the CelebA dataset and evaluate the ability of the models (i) to preserve the identity of a person across multiple generated views, (ii) to generate realistic samples, (iii) to preserve the diversity in the generated views, and (iv) to capture the view distributions of the original dataset.



Conclusion

The paper proposed a generative model that can be learned from multi-view data without any view supervision. Moreover, it introduced a conditional version that allows generating new views of an input image. The experiments indicate that the model can capture content and view factors. In the future, the authors plan to use the method for data augmentation, which can enrich training datasets.

Future Work

The authors of the paper mention that they plan to explore using their model for data augmentation, as it can produce additional views of the data for training, in both semi-supervised and one-shot/few-shot learning settings.

Critique

The main idea is to train the model with pairs of images with different views. It is not entirely clear what defines a view in particular. The algorithms are largely based on the earlier concepts of GAN and CGAN. The authors reference previous papers tackling the same problem and clearly state that the novelty of this approach is that it does not make use of view labels. The authors give a very thorough list of experiments, which indicate that the proposed models outperform the baselines.

However, this paper only tested the model on rather constrained examples. As was observed in the results, the proposed approach seems to have a high sample complexity, relying on training samples that cover the full range of variations for both specified and unspecified factors. Also, the proposed model does not attempt to disentangle variations within the specified and unspecified components.

The method that the paper presents is novel and the paper is easy to follow. However, the comparison is limited to a small number of baselines: DCGAN variants, CGAN, and the model from Mathieu et al. (2016). In addition, the evaluation is purely empirical and restricted to benchmark datasets, so the performance of this method in real-world applications is unknown.

References

[1] Mickael Chen, Ludovic Denoyer, and Thierry Artieres. Multi-view data generation without view supervision. In International Conference on Learning Representations (ICLR), 2018.

[2] Michael F Mathieu, Junbo Jake Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048, 2016.

[3] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

[5] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.