'''Synthesizing Programs for Images using Reinforced Adversarial Learning:''' Summary of the ICML 2018 paper http://proceedings.mlr.press/v80/ganin18a.html

== Presented by ==

1. Nekoei, Hadi [Quest ID: 20727088]

== Motivation ==

Conventional neural generative models have two major problems:

* First, it is not clear how to inject prior knowledge about the data into the model.

* Second, the latent space is not easily interpretable.

The solution proposed in this paper is to generate programs that can be executed by existing tools, e.g. graphics editors, illustration software, or CAD packages, thereby '''creating a more meaningful API (a sequence of complex drawing actions rather than raw pixels)'''.

== Introduction ==

Humans frequently recover structured representations from raw sensory input in order to understand their environment. Decomposing a picture of a hand-written character into strokes, or inferring the layout of a building from an image, are examples of this ability, and studying it can shed light on how the brain processes visual scenes.

To address these problems, a new approach is presented for interpreting and generating images using deep reinforced adversarial learning, targeting both the need for large amounts of supervision and scalability to larger real-world datasets. In this approach, an adversarially trained agent ('''SPIRAL''') generates a program which is executed by a graphics engine to produce images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.

[[File:Fig1 SPIRAL.PNG | 400px|thumb|center|Figure 1: SPIRAL]]

== Related Work ==

Related work in this field can be summarized as follows:

* There is a large body of work on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015).

* Inferring motor programs for the reconstruction of MNIST digits (Nair & Hinton, 2006).

* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015).

* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).

'''However, these methods have limitations such as:'''

* Difficulty scaling to larger real-world datasets;

* Requiring hand-crafted parses or supervision in the form of sketches and corresponding images;

* Lacking the ability to infer structured representations of images.

== The SPIRAL Agent ==

=== Overview ===

The paper aims to construct a generative model <math>\mathbf{G}</math> that produces samples from a distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator <math>R</math> that accepts a sequence of commands from the agent and maps them into the domain of interest; e.g., <math>R</math> could be a CAD program rendering descriptions of primitives into 3D scenes.

To train the policy network <math>\pi</math>, the paper uses the generative adversarial framework: the generator tries to fool a discriminator network that is trained to distinguish between real and fake samples. As training progresses, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_{d}</math>.

=== Objectives ===

The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.

'''Discriminator:''' Following Gulrajani et al. (2017), the objective for <math>\mathbf{D}</math> is defined as:

<math>\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R </math>

where <math>R</math> is a regularization term that softly constrains <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).

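As a concrete illustration, below is a minimal PyTorch-style sketch of this objective, assuming the regularizer <math>R</math> is the gradient penalty of Gulrajani et al. (2017); the names (<code>critic</code>, <code>gp_weight</code>) are illustrative and not taken from the paper's code.

<syntaxhighlight lang="python">
import torch

def discriminator_loss(critic, real, fake, gp_weight=10.0):
    """Sketch of L_D = -E[D(x_real)] + E[D(x_fake)] + R (WGAN with gradient penalty)."""
    wasserstein_term = -critic(real).mean() + critic(fake).mean()

    # R: softly constrain D to be Lipschitz by penalizing gradient norms on
    # random interpolates between real and generated images (batches of shape [B, C, H, W]).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return wasserstein_term + gp_weight * penalty
</syntaxhighlight>
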
'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE algorithm (Williams, 1992), advantage actor-critic (A2C), is employed:

<math>\mathcal{L}_G = -\sum_{t} \log\pi(a_t|s_t;\theta)\,[R_t - V^{\pi}(s_t)]</math>

where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{t'=t}^{N}r_{t'}</math> is a 1-sample Monte-Carlo estimate of the return. The rewards are set to:

<math>
r_t = \left\{
\begin{array}{ll}
0, & t < N \\
D(R(a_1, a_2, \ldots, a_N)), & t = N
\end{array}\right.
</math>

One interesting aspect of this formulation is that the search can also be biased by introducing intermediate rewards, which may depend not only on the output of <math>R</math> but also on the commands used to generate that output.

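The sketch below shows how this objective and the reward could be computed for a single trajectory (PyTorch, illustrative names). The value baseline is detached so that it is treated as independent of <math>\theta</math>, as stated above; in practice <math>V^{\pi}</math> is itself learned and an entropy bonus is added to <math>\mathcal{L}_G</math>, as noted in the Distributed Learning section below.

<syntaxhighlight lang="python">
import torch

def spiral_rewards(discriminator, final_render, num_steps):
    """r_t = 0 for t < N, and r_N = D(R(a_1, ..., a_N)) at the final step."""
    rewards = torch.zeros(num_steps)
    rewards[-1] = discriminator(final_render).squeeze()  # scalar score of the final render
    return rewards

def generator_loss(log_probs, values, rewards):
    """L_G = -sum_t log pi(a_t | s_t; theta) * [R_t - V(s_t)] for one trajectory.

    log_probs: [N] log-probabilities of the chosen actions under pi
    values:    [N] value estimates V(s_t) (the baseline)
    rewards:   [N] per-step rewards (all zeros except the last entry)
    """
    # R_t = sum_{t' >= t} r_{t'}: an undiscounted return via a reversed cumulative sum.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    advantages = returns - values.detach()               # baseline independent of theta
    return -(log_probs * advantages).sum()
</syntaxhighlight>
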
=== Conditional Generation ===

In some cases, such as reproducing a given image <math>x_{target}</math>, it is useful to condition the model on auxiliary inputs. This is done by feeding <math>x_{target}</math> to both the policy and the discriminator networks. The generator distribution <math>p_g</math> then becomes the distribution of renderings <math>R(a)</math> with actions sampled from <math>\pi(a|x_{target})</math>, while <math>p_{d}</math> becomes a Dirac delta function centered at <math>x_{target}</math>. It can be shown that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>\ell_2</math>-distance is an optimal discriminator.

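In this conditional setting the terminal reward reduces to a similarity score between the final render and <math>x_{target}</math>. A minimal sketch, assuming images stored as NumPy arrays and using the negative <math>\ell_2</math> distance mentioned above (the function name is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def conditional_reward(final_render, x_target):
    """Reward for reproducing x_target: the negative l2 distance between the
    final render and the target image (larger is better, 0 is a perfect match)."""
    diff = final_render.astype(np.float64) - x_target.astype(np.float64)
    return -np.linalg.norm(diff)
</syntaxhighlight>
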
=== Distributed Learning ===

The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). Training uses three kinds of workers:

* '''Actors''' are responsible for generating training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}, a_{t}) \,|\, 1 \leq t \leq N)</math> as well as all intermediate renderings produced by <math>R</math>.

* A '''policy learner''' receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math>. Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty that encourages exploration.

* In contrast to the base '''IMPALA''' setup, an additional '''discriminator learner''' is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math>.

[[File:Fig2 SPIRAL Architecture.png | 700px|thumb|center|Figure 2: The SPIRAL architecture]]

'''Note:''' No trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the latter to optimize <math>D</math> at a higher rate than the policy network is trained, which is possible due to the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Although sampling from a replay buffer inevitably results in a smoothed version of <math>p_{g}</math>, this setup is found to work well in practice.

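The actor / learner / replay-buffer split described above can be sketched as follows (pure Python pseudo-structure; <code>policy.rollout</code> and <code>discriminator.update</code> are placeholder methods, and the real system runs these loops as distributed IMPALA workers rather than in a single process):

<syntaxhighlight lang="python">
import random
from collections import deque

class ReplayBuffer:
    """Communication layer between the actors and the discriminator learner."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def add(self, final_render):
        self.storage.append(final_render)

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

def actor_step(policy, renderer, trajectory_queue, replay_buffer):
    # Actor: roll out the policy in the rendering simulator, send the full
    # trajectory to the policy learner and the final render to D's buffer.
    trajectory = policy.rollout(renderer)            # placeholder rollout call
    trajectory_queue.put(trajectory)                 # consumed by the policy learner
    replay_buffer.add(trajectory.final_render)

def discriminator_learner_step(discriminator, replay_buffer, real_batch, batch_size=64):
    # Discriminator learner: because D is a plain CNN, it can be updated at a
    # higher rate than the multi-step RNN policy, using replayed fake renders.
    fake_batch = replay_buffer.sample(batch_size)
    discriminator.update(real_batch, fake_batch)     # placeholder: one SGD step on L_D
</syntaxhighlight>
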
== Experiments ==

=== Datasets ===

Experiments are conducted on four datasets: MNIST, OMNIGLOT and CELEBA for painting with a brush, and MUJOCO SCENES for assembling simple 3D scenes.

=== Environments ===

Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library libmypaint (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment is comprised of the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1, a_t^2, \ldots, a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.

[[File:Fig3_agent_action_space.PNG | 500px|thumb|center|Figure 3: The agent's action space]]

The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color, characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_t</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump directly to <math>l_{t+1}</math>. A sketch of this action structure is given below.

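A minimal sketch of one such 8-component action as a data structure (the field names are illustrative and not the environment's actual API; the bin counts follow the description above):

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class PaintAction:
    """One agent action a_t: a tuple of 8 discrete decisions."""
    control_point: int   # p_c, index of a point on the canvas grid
    end_point: int       # l_{t+1}, the next brush location
    pressure: int        # brush pressure, 10 levels
    brush_size: int      # discrete brush size
    red: int             # 20 bins
    green: int           # 20 bins
    blue: int            # 20 bins
    is_stroke: bool      # True: draw the stroke, False: just jump to l_{t+1}
</syntaxhighlight>
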
In the MUJOCO SCENES experiment, images are rendered using a MuJoCo-based environment (Todorov et al., 2012). At each time step, the agent has to decide on the object type (4 options), its location on a 16 <math>\times</math> 16 grid, its size (3 options) and its color (3 color components with 4 bins each). The resulting tuple is sent to the environment, which adds an object to the scene according to the specification.

=== MNIST ===

For the MNIST dataset, two sets of experiments are conducted:

1. In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, a small negative reward is given for each continuous sequence of strokes, encouraging the agent to draw a digit in a single continuous motion (a sketch of this reward shaping is given at the end of this section). Examples of such generations are shown in Figure 4a.

2. In the second experiment, an agent is trained to reproduce a given digit. Several examples of conditionally generated digits are shown in Figure 4b.

[[File:Fig4a MNIST.png | 500px|thumb|center|Figure 4: MNIST digits generated by the SPIRAL agent]]

The results are shown in Figure 8a.

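A possible sketch of the reward shaping from the first experiment (the penalty value and the rule for detecting a new stroke sequence are assumptions made for illustration; the summary only states that each continuous sequence of strokes incurs a small negative reward):

<syntaxhighlight lang="python">
def shaped_rewards(discriminator, final_render, stroke_flags, stroke_penalty=0.01):
    """Terminal discriminator reward plus a small negative reward for each
    continuous sequence of strokes (stroke_flags[t] is True when step t drew
    rather than jumped); the penalty value here is a placeholder."""
    rewards = [0.0] * len(stroke_flags)
    for t, is_stroke in enumerate(stroke_flags):
        starts_new_sequence = is_stroke and (t == 0 or not stroke_flags[t - 1])
        if starts_new_sequence:
            rewards[t] -= stroke_penalty
    rewards[-1] += float(discriminator(final_render))
    return rewards
</syntaxhighlight>
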
=== OMNIGLOT ===

The trained agents are now tested in a similar but more challenging setting: handwritten characters. As can be seen in Figure 5a, unconditional generation has a lower quality compared to the digits in the previous dataset. The conditional agents, on the other hand, reach a convincing quality (Figure 5b).

Since OMNIGLOT contains a highly diverse set of symbols, over the course of training the model could learn a general notion of image reproduction rather than simply memorizing dataset-specific strokes. To test this, a trained agent is fed previously unseen line drawings. The resulting reconstructions are shown in Figure 6. The agent handles out-of-domain images well, although it is slightly better at reconstructing the OMNIGLOT test set.

For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>\ell_2</math>-distance, have been compared. The discriminator-based approach has a significantly lower training time and a lower final <math>\ell_2</math> error. Following Sharma et al. (2017), a "blind" version of the agent is also trained, without feeding any intermediate canvas states as an input to <math>\pi</math>. The training curve for this experiment is reported in Figure 8a (dotted blue line), along with the results of training agents with the discriminator-based and <math>\ell_2</math>-distance rewards.

=== CELEBA ===

Since the libmypaint environment is also capable of producing complex color paintings, this direction is explored by training a conditional agent on the CELEBA dataset. In this experiment, the agent does not receive any intermediate rewards. In addition to the reconstruction reward (either <math>\ell_2</math> or discriminator-based), a penalty is placed on the earth mover's distance between the color histograms of the model's output and <math>x_{target}</math>.

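Since each color histogram is one-dimensional, the earth mover's distance reduces to the area between the cumulative histograms. A minimal NumPy sketch of such a penalty under that assumption (the function names and the [0, 1] pixel range are illustrative; the 20-bin choice follows the action space described earlier):

<syntaxhighlight lang="python">
import numpy as np

def emd_1d(hist_p, hist_q):
    """Earth mover's distance between two normalized 1-D histograms
    with identical bins: the L1 distance between their cumulative sums."""
    return np.abs(np.cumsum(hist_p) - np.cumsum(hist_q)).sum()

def color_histogram_penalty(output, target, bins=20):
    """Penalty on the EMD between per-channel color histograms of the model's
    output and x_target; images are H x W x 3 arrays with values in [0, 1]."""
    penalty = 0.0
    for channel in range(3):
        h_out, _ = np.histogram(output[..., channel], bins=bins, range=(0.0, 1.0))
        h_tgt, _ = np.histogram(target[..., channel], bins=bins, range=(0.0, 1.0))
        penalty += emd_1d(h_out / max(h_out.sum(), 1), h_tgt / max(h_tgt.sum(), 1))
    return penalty
</syntaxhighlight>
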
Although blurry, the model's reconstructions closely match the high-level structure of each image, for instance the background color, the position of the face and the color of the person's hair. In some cases, shadows around the eyes and the nose are also visible.

=== MUJOCO SCENES ===

For the MUJOCO SCENES dataset, the agent is used to construct simple CAD programs that best explain input images. Only the case of conditional generation is considered here. As before, the reward function for the generator can be either the <math>\ell_2</math> score or the discriminator output.

As shown in Figure 8b, the agent trained to directly minimize <math>\ell_2</math> is unable to solve the task and has a significantly higher pixel-wise error. In comparison, the discriminator-based variant solves the task and produces near-perfect reconstructions on a holdout set (Figure 10).

As in the OMNIGLOT experiment, the <math>\ell_2</math>-based agent demonstrates some improvement over the random policy but gets stuck and, as a result, fails to learn sensible reconstructions (Figure 8b).

== Discussion ==

Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent that employs black-box rendering simulators. The results indicate that using the Wasserstein discriminator's output as a reward function together with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used by the agent is entropy-based; future work should address this limitation by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search could be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also serve this purpose.

== Future Work ==

Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, since it makes straight lines hard to draw. Actions could also directly parameterize 3D surfaces, planes and learned texture models in order to invert richer visual scenes. On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016), in which case the policy can be viewed as an encoder and the renderer as a decoder, could result in a more meaningful learning signal, since <math>D</math> would be forced to focus on the semantics of the image.