'''Synthesizing Programs for Images using Reinforced Adversarial Learning:''' Summary of the ICML 2018 paper http://proceedings.mlr.press/v80/ganin18a.html

== Presented by ==

1. Nekoei, Hadi [Quest ID: 20727088]

== Motivation ==

Conventional neural generative models have two major problems:

* First, it is not clear how to inject prior knowledge about the data into the model.

* Second, the latent space is not easily interpretable.

The solution proposed in this paper is to generate programs that can be executed by existing tools, e.g. graphics editors, illustration software, or CAD packages, thereby '''creating a more meaningful API (a sequence of complex drawing actions rather than raw pixels)'''.

== Introduction ==

Humans frequently recover structured representations from raw sensory input in order to understand their environment. Decomposing a picture of a hand-written character into strokes, or inferring the layout of a building from an image, are examples of this ability, and studying it can shed light on how the brain processes visual scenes.

To address these problems, a new approach is presented for interpreting and generating images using deep reinforced adversarial learning, targeting both the need for large amounts of supervision and scalability to larger real-world datasets. In this approach, an adversarially trained agent ('''SPIRAL''') generates a program which is executed by a graphics engine to produce images, either conditioned on data or unconditionally. The agent is rewarded for fooling a discriminator network and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between generated and real images.

[[File:Fig1 SPIRAL.PNG | 400px|thumb|center|Figure 1: SPIRAL]]

== Related Work ==

Related work in this field can be summarized as follows:

* There is a large body of work on inverting simulators to interpret images (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015).

* Inferring motor programs for the reconstruction of MNIST digits (Nair & Hinton, 2006).

* Visual program induction in the context of hand-written characters on the OMNIGLOT dataset (Lake et al., 2015).

* Inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015).

'''However, these methods have limitations such as:'''

* Difficulty scaling to larger real-world datasets;

* Requiring hand-crafted parses or supervision in the form of sketches and corresponding images;

* Lacking the ability to infer structured representations of images.

== The SPIRAL Agent ==

=== Overview ===

The paper aims to construct a generative model <math>\mathbf{G}</math> that produces samples from a distribution <math>p_{d}</math>. The generative model consists of a recurrent network <math>\pi</math> (called the policy network or agent) and an external rendering simulator <math>R</math> that accepts a sequence of commands from the agent and maps them into the domain of interest; e.g., <math>R</math> could be a CAD program rendering descriptions of primitives into 3D scenes.

To train the policy network <math>\pi</math>, the paper uses the generative adversarial framework: the generator tries to fool a discriminator network that is trained to distinguish between real and fake samples. As training progresses, the distribution generated by <math>\mathbf{G}</math> approaches <math>p_{d}</math>.

=== Objectives ===

The authors give the training objectives for <math>\mathbf{G}</math> and <math>\mathbf{D}</math> as follows.

'''Discriminator:''' Following Gulrajani et al. (2017), the objective for <math>\mathbf{D}</math> is defined as:

<math>\mathcal{L}_D = -\mathbb{E}_{x\sim p_d}[D(x)] + \mathbb{E}_{x\sim p_g}[D(x)] + R </math>

where <math>R</math> is a regularization term that softly constrains <math>\mathbf{D}</math> to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant).

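As a concrete illustration, below is a minimal PyTorch-style sketch of this objective, assuming the regularizer <math>R</math> is the gradient penalty of Gulrajani et al. (2017); the names (<code>critic</code>, <code>gp_weight</code>) are illustrative and not taken from the paper's code.

<syntaxhighlight lang="python">
import torch

def discriminator_loss(critic, real, fake, gp_weight=10.0):
    """Sketch of L_D = -E[D(x_real)] + E[D(x_fake)] + R (WGAN with gradient penalty)."""
    wasserstein_term = -critic(real).mean() + critic(fake).mean()

    # R: softly constrain D to be Lipschitz by penalizing gradient norms on
    # random interpolates between real and generated images (batches of shape [B, C, H, W]).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return wasserstein_term + gp_weight * penalty
</syntaxhighlight>
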
'''Generator:''' To define the objective for <math>\mathbf{G}</math>, a variant of the REINFORCE algorithm (Williams, 1992), advantage actor-critic (A2C), is employed:

<math>\mathcal{L}_G = -\sum_{t} \log\pi(a_t|s_t;\theta)\,[R_t - V^{\pi}(s_t)]</math>

where <math>V^{\pi}</math> is an approximation to the value function, which is treated as independent of <math>\theta</math>, and <math>R_{t} = \sum_{t'=t}^{N}r_{t'}</math> is a 1-sample Monte-Carlo estimate of the return. The rewards are set to:

<math>
r_t = \left\{
\begin{array}{ll}
0, & t < N \\
D(R(a_1, a_2, \ldots, a_N)), & t = N
\end{array}\right.
</math>

One interesting aspect of this formulation is that the search can also be biased by introducing intermediate rewards, which may depend not only on the output of <math>R</math> but also on the commands used to generate that output.

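The sketch below shows how this objective and the reward could be computed for a single trajectory (PyTorch, illustrative names). The value baseline is detached so that it is treated as independent of <math>\theta</math>, as stated above; in practice <math>V^{\pi}</math> is itself learned and an entropy bonus is added to <math>\mathcal{L}_G</math>, as noted in the Distributed Learning section below.

<syntaxhighlight lang="python">
import torch

def spiral_rewards(discriminator, final_render, num_steps):
    """r_t = 0 for t < N, and r_N = D(R(a_1, ..., a_N)) at the final step."""
    rewards = torch.zeros(num_steps)
    rewards[-1] = discriminator(final_render).squeeze()  # scalar score of the final render
    return rewards

def generator_loss(log_probs, values, rewards):
    """L_G = -sum_t log pi(a_t | s_t; theta) * [R_t - V(s_t)] for one trajectory.

    log_probs: [N] log-probabilities of the chosen actions under pi
    values:    [N] value estimates V(s_t) (the baseline)
    rewards:   [N] per-step rewards (all zeros except the last entry)
    """
    # R_t = sum_{t' >= t} r_{t'}: an undiscounted return via a reversed cumulative sum.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    advantages = returns - values.detach()               # baseline independent of theta
    return -(log_probs * advantages).sum()
</syntaxhighlight>
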
=== Conditional Generation ===

In some cases, such as reproducing a given image <math>x_{target}</math>, it is useful to condition the model on auxiliary inputs. This is done by feeding <math>x_{target}</math> to both the policy and the discriminator networks. The generator distribution <math>p_g</math> then becomes the distribution of renderings <math>R(a)</math> with actions sampled from <math>\pi(a|x_{target})</math>, while <math>p_{d}</math> becomes a Dirac delta function centered at <math>x_{target}</math>. It can be shown that for this particular setting of <math>p_{g}</math> and <math>p_{d}</math>, the <math>\ell_2</math>-distance is an optimal discriminator.

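In this conditional setting the terminal reward reduces to a similarity score between the final render and <math>x_{target}</math>. A minimal sketch, assuming images stored as NumPy arrays and using the negative <math>\ell_2</math> distance mentioned above (the function name is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def conditional_reward(final_render, x_target):
    """Reward for reproducing x_target: the negative l2 distance between the
    final render and the target image (larger is better, 0 is a perfect match)."""
    diff = final_render.astype(np.float64) - x_target.astype(np.float64)
    return -np.linalg.norm(diff)
</syntaxhighlight>
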
=== Distributed Learning ===

The training pipeline is outlined in Figure 2b. It is an extension of the recently proposed '''IMPALA''' architecture (Espeholt et al., 2018). Training uses three kinds of workers:

* '''Actors''' are responsible for generating training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence <math>((\pi_{t}, a_{t}) \,|\, 1 \leq t \leq N)</math> as well as all intermediate renderings produced by <math>R</math>.

* A '''policy learner''' receives trajectories from the actors, combines them into a batch and updates <math>\pi</math> by performing an '''SGD''' step on <math>\mathcal{L}_G</math>. Following common practice (Mnih et al., 2016), <math>\mathcal{L}_G</math> is augmented with an entropy penalty that encourages exploration.

* In contrast to the base '''IMPALA''' setup, an additional '''discriminator learner''' is defined. This worker consumes random examples from <math>p_{d}</math>, as well as generated data (final renders) coming from the actor workers, and optimizes <math>\mathcal{L}_D</math>.

[[File:Fig2 SPIRAL Architecture.png | 700px|thumb|center|Figure 2: The SPIRAL architecture]]

'''Note:''' No trajectories are omitted in the policy learner. Instead, the <math>D</math> updates are decoupled from the <math>\pi</math> updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. This allows the latter to optimize <math>D</math> at a higher rate than the policy network is trained, which is possible due to the difference in network sizes (<math>\pi</math> is a multi-step RNN, while <math>D</math> is a plain '''CNN'''). Although sampling from a replay buffer inevitably results in a smoothed version of <math>p_{g}</math>, this setup is found to work well in practice.

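The actor / learner / replay-buffer split described above can be sketched as follows (pure Python pseudo-structure; <code>policy.rollout</code> and <code>discriminator.update</code> are placeholder methods, and the real system runs these loops as distributed IMPALA workers rather than in a single process):

<syntaxhighlight lang="python">
import random
from collections import deque

class ReplayBuffer:
    """Communication layer between the actors and the discriminator learner."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def add(self, final_render):
        self.storage.append(final_render)

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

def actor_step(policy, renderer, trajectory_queue, replay_buffer):
    # Actor: roll out the policy in the rendering simulator, send the full
    # trajectory to the policy learner and the final render to D's buffer.
    trajectory = policy.rollout(renderer)            # placeholder rollout call
    trajectory_queue.put(trajectory)                 # consumed by the policy learner
    replay_buffer.add(trajectory.final_render)

def discriminator_learner_step(discriminator, replay_buffer, real_batch, batch_size=64):
    # Discriminator learner: because D is a plain CNN, it can be updated at a
    # higher rate than the multi-step RNN policy, using replayed fake renders.
    fake_batch = replay_buffer.sample(batch_size)
    discriminator.update(real_batch, fake_batch)     # placeholder: one SGD step on L_D
</syntaxhighlight>
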
== Experiments ==

=== Datasets ===

Experiments are conducted on four datasets: MNIST, OMNIGLOT and CELEBA for painting with a brush, and MUJOCO SCENES for assembling simple 3D scenes.

=== Environments ===

Two rendering environments are introduced. For MNIST, OMNIGLOT and CELEBA generation, the open-source painting library libmypaint (libmypaint contributors, 2018) is used. The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas <math>C</math>. The state of the environment is comprised of the contents of <math>C</math> as well as the current brush location <math>l_{t}</math>. Each action <math>a_{t}</math> is a tuple of 8 discrete decisions <math>(a_t^1, a_t^2, \ldots, a_t^8)</math> (see Figure 3). The first two components are the control point <math>p_{c}</math> and the endpoint <math>l_{t+1}</math> of the stroke.

[[File:Fig3_agent_action_space.PNG | 500px|thumb|center|Figure 3: The agent's action space]]

The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color, characterized by a mixture of red, green and blue (20 bins for each color component). The last element of <math>a_t</math> is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump directly to <math>l_{t+1}</math>. A sketch of this action structure is given below.

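A minimal sketch of one such 8-component action as a data structure (the field names are illustrative and not the environment's actual API; the bin counts follow the description above):

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class PaintAction:
    """One agent action a_t: a tuple of 8 discrete decisions."""
    control_point: int   # p_c, index of a point on the canvas grid
    end_point: int       # l_{t+1}, the next brush location
    pressure: int        # brush pressure, 10 levels
    brush_size: int      # discrete brush size
    red: int             # 20 bins
    green: int           # 20 bins
    blue: int            # 20 bins
    is_stroke: bool      # True: draw the stroke, False: just jump to l_{t+1}
</syntaxhighlight>
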
In the MUJOCO SCENES experiment, images are rendered using a MuJoCo-based environment (Todorov et al., 2012). At each time step, the agent has to decide on the object type (4 options), its location on a 16 <math>\times</math> 16 grid, its size (3 options) and its color (3 color components with 4 bins each). The resulting tuple is sent to the environment, which adds an object to the scene according to the specification.

=== MNIST ===

For the MNIST dataset, two sets of experiments are conducted:

1. In the first experiment, an unconditional agent is trained to model the data distribution. Along with the reward provided by the discriminator, a small negative reward is given for each continuous sequence of strokes, encouraging the agent to draw a digit in a single continuous motion (a sketch of this reward shaping is given at the end of this section). Examples of such generations are shown in Figure 4a.

2. In the second experiment, an agent is trained to reproduce a given digit. Several examples of conditionally generated digits are shown in Figure 4b.

[[File:Fig4a MNIST.png | 500px|thumb|center|Figure 4: MNIST digits generated by the SPIRAL agent]]

The results are shown in Figure 8a.

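A possible sketch of the reward shaping from the first experiment (the penalty value and the rule for detecting a new stroke sequence are assumptions made for illustration; the summary only states that each continuous sequence of strokes incurs a small negative reward):

<syntaxhighlight lang="python">
def shaped_rewards(discriminator, final_render, stroke_flags, stroke_penalty=0.01):
    """Terminal discriminator reward plus a small negative reward for each
    continuous sequence of strokes (stroke_flags[t] is True when step t drew
    rather than jumped); the penalty value here is a placeholder."""
    rewards = [0.0] * len(stroke_flags)
    for t, is_stroke in enumerate(stroke_flags):
        starts_new_sequence = is_stroke and (t == 0 or not stroke_flags[t - 1])
        if starts_new_sequence:
            rewards[t] -= stroke_penalty
    rewards[-1] += float(discriminator(final_render))
    return rewards
</syntaxhighlight>
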
=== OMNIGLOT ===

The trained agents are now tested in a similar but more challenging setting: handwritten characters. As can be seen in Figure 5a, unconditional generation has a lower quality compared to the digits in the previous dataset. The conditional agents, on the other hand, reach a convincing quality (Figure 5b).

Since OMNIGLOT contains a highly diverse set of symbols, over the course of training the model could learn a general notion of image reproduction rather than simply memorizing dataset-specific strokes. To test this, a trained agent is fed previously unseen line drawings. The resulting reconstructions are shown in Figure 6. The agent handles out-of-domain images well, although it is slightly better at reconstructing the OMNIGLOT test set.

For the MNIST dataset, two kinds of rewards, the discriminator score and the <math>\ell_2</math>-distance, have been compared. The discriminator-based approach has a significantly lower training time and a lower final <math>\ell_2</math> error. Following Sharma et al. (2017), a "blind" version of the agent is also trained, without feeding any intermediate canvas states as an input to <math>\pi</math>. The training curve for this experiment is reported in Figure 8a (dotted blue line), along with the results of training agents with the discriminator-based and <math>\ell_2</math>-distance rewards.

=== CELEBA ===

Since the libmypaint environment is also capable of producing complex color paintings, this direction is explored by training a conditional agent on the CELEBA dataset. In this experiment, the agent does not receive any intermediate rewards. In addition to the reconstruction reward (either <math>\ell_2</math> or discriminator-based), a penalty is placed on the earth mover's distance between the color histograms of the model's output and <math>x_{target}</math>.

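Since each color histogram is one-dimensional, the earth mover's distance reduces to the area between the cumulative histograms. A minimal NumPy sketch of such a penalty under that assumption (the function names and the [0, 1] pixel range are illustrative; the 20-bin choice follows the action space described earlier):

<syntaxhighlight lang="python">
import numpy as np

def emd_1d(hist_p, hist_q):
    """Earth mover's distance between two normalized 1-D histograms
    with identical bins: the L1 distance between their cumulative sums."""
    return np.abs(np.cumsum(hist_p) - np.cumsum(hist_q)).sum()

def color_histogram_penalty(output, target, bins=20):
    """Penalty on the EMD between per-channel color histograms of the model's
    output and x_target; images are H x W x 3 arrays with values in [0, 1]."""
    penalty = 0.0
    for channel in range(3):
        h_out, _ = np.histogram(output[..., channel], bins=bins, range=(0.0, 1.0))
        h_tgt, _ = np.histogram(target[..., channel], bins=bins, range=(0.0, 1.0))
        penalty += emd_1d(h_out / max(h_out.sum(), 1), h_tgt / max(h_tgt.sum(), 1))
    return penalty
</syntaxhighlight>
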
Although blurry, the model's reconstructions closely match the high-level structure of each image, for instance the background color, the position of the face and the color of the person's hair. In some cases, shadows around the eyes and the nose are also visible.

=== MUJOCO SCENES ===

For the MUJOCO SCENES dataset, the agent is used to construct simple CAD programs that best explain input images. Only the case of conditional generation is considered here. As before, the reward function for the generator can be either the <math>\ell_2</math> score or the discriminator output.

As shown in Figure 8b, the agent trained to directly minimize <math>\ell_2</math> is unable to solve the task and has a significantly higher pixel-wise error. In comparison, the discriminator-based variant solves the task and produces near-perfect reconstructions on a holdout set (Figure 10).

As in the OMNIGLOT experiment, the <math>\ell_2</math>-based agent demonstrates some improvement over the random policy but gets stuck and, as a result, fails to learn sensible reconstructions (Figure 8b).

== Discussion ==

Scaling visual program synthesis to real-world and combinatorial datasets has been a challenge. The paper shows that it is possible to train an adversarial generative agent that employs black-box rendering simulators. The results indicate that using the Wasserstein discriminator's output as a reward function together with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used by the agent is entropy-based; future work should address this limitation by employing more sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search could be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also serve this purpose.

== Future Work ==

Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, since it makes straight lines hard to draw. Actions could also directly parameterize 3D surfaces, planes and learned texture models in order to invert richer visual scenes. On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016), in which case the policy can be viewed as an encoder and the renderer as a decoder, could result in a more meaningful learning signal, since <math>D</math> would be forced to focus on the semantics of the image.