A Neural Representation of Sketch Drawings

== Introduction ==
In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.


==Motivation==
Neural networks have been heavily used as image generation tools, for example in Generative Adversarial Networks, Variational Inference, and autoregressive models. Most of these models are designed to generate images as a collection of pixels. People, however, learn to draw using sequences of strokes as opposed to the simultaneous generation of pixels. The authors propose a new generative model that creates vector images, so that it might generalize abstract concepts in a manner more similar to how humans do.


The model is trained with hand-drawn sketches as input sequences and produces sketches in vector format. In the conditional generation model, the authors also explore the latent space representation of vector images and discuss a few future applications of this model. The model and dataset are now available as an open source project ([https://magenta.tensorflow.org/sketch_rnn link]).


=== Terminology ===
Pixel images, also referred to as raster or bitmap images, are files that encode image data as a set of pixels. These are the most common image type, with extensions such as .png, .jpg, and .bmp.


Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images.  


For a visual comparison of raster and vector images, see this [https://www.youtube.com/watch?v=-Fs2t6P5AjY video]. As mentioned, vector images are generally simpler and more abstract, whereas raster images are generally used to store detailed images.


For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.


== Related Work ==
Some earlier works have used a similar approach to generate images, such as Portrait Drawing by Paul the Robot [26, 28] and reinforcement learning approaches [28] that discover a set of paint brush strokes that can best represent a given input photograph. These works mimic digitized photographs rather than develop generative models of vector images. There are also some neural network based approaches, but they mostly deal with pixel images, and little work has been done on vector image generation. Some models use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modelling Chinese characters as a sequence of pen stroke actions), or vectorized Kanji characters [9, 29].


Neural network based approaches are able to generate a latent space representation of vector images that follows a Gaussian distribution; the output of these networks is trained to match this distribution by minimizing a given loss function. Using this idea, previous works built sequence-to-sequence models with a Variational Autoencoder to model sentences in a latent space, and used probabilistic program induction to model the Omniglot dataset. Variational Autoencoders differ from regular autoencoders in that there is an intermediary "sampling step" between the encoder and decoder. Simply connecting the two would not guarantee that the encoded parameters can be viewed as the parameters of a normal distribution representing a latent space. In a VAE, the output of the encoder is explicitly treated as the parameters of a normal distribution and used to draw a sample; in this way, the encoding is penalized as if it were the parameters of a normal distribution.


One of the limiting factors that the authors mention in the field of generative vector drawings is the lack of publicly available datasets. Previous datasets such as the Sketch dataset, with 20k vector sketches, were explored for feature extraction techniques. The Sketchy dataset, consisting of 70k vector sketches along with pixel images, was used for large-scale exploration of human sketches. The ShadowDraw system, which used 30k raster images along with extracted vectorized features, is an interactive system that predicts what a finished drawing looks like based on the set of incomplete brush strokes the user has drawn so far. In all these cases, the datasets are comparatively small. The dataset proposed in this work is much larger, has been made publicly available, and is one of the major contributions of this paper.


== Major Contributions ==
This paper makes the following major contributions: the authors outline a framework for both unconditional and conditional generation of vector images composed of a sequence of lines. The recurrent neural network based generative model is capable of producing sketches of common objects in a vector format. The paper develops a training procedure unique to vector images to make the training more robust. The authors also make available a large dataset of hand-drawn vector images to encourage further development of generative modeling for vector images, and release an implementation of their model as an open source project.


== Methodology ==
=== Dataset ===
QuickDraw is a dataset of 50 million vector drawings collected through the online game [https://quickdraw.withgoogle.com/# Quick, Draw!], in which players are required to draw objects belonging to a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples, and 2.5k test samples.


The data format of each sample is a sequence of pen stroke action events. The origin is the initial coordinate of the drawing. A sketch is a list of points, and each point consists of 5 elements <math> (\Delta x, \Delta y, p_{1}, p_{2}, p_{3})</math>, where <math>\Delta x</math> and <math>\Delta y</math> are the offset distances from the previous point in the x and y directions. The parameters <math>p_{1}, p_{2}, p_{3}</math> represent three possible states in a binary one-hot representation: <math>p_{1}</math> indicates the pen is touching the paper, <math>p_{2}</math> indicates the pen will be lifted from this point, and <math>p_{3}</math> indicates the drawing has ended.
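Below is a minimal numpy sketch (not the paper's code) of how a drawing given as absolute pen coordinates could be converted into this stroke-5 format; the helper name and the assumed input layout, a list of per-stroke arrays of (x, y) points, are illustrative.

<pre>
import numpy as np

def to_stroke5(strokes):
    """Convert a list of strokes (each an array of absolute (x, y) points) into the
    stroke-5 format (dx, dy, p1, p2, p3) described above.
    p1: pen touching the paper, p2: pen lifted after this point, p3: end of drawing."""
    points = []
    prev = np.zeros(2)                          # the origin is the initial coordinate
    for s_idx, stroke in enumerate(strokes):
        for p_idx, (x, y) in enumerate(stroke):
            dx, dy = x - prev[0], y - prev[1]
            prev = np.array([x, y])
            last_point = (p_idx == len(stroke) - 1)
            last_stroke = (s_idx == len(strokes) - 1)
            if last_point and last_stroke:
                pen = [0, 0, 1]                 # the drawing ends here
            elif last_point:
                pen = [0, 1, 0]                 # the pen is lifted after this point
            else:
                pen = [1, 0, 0]                 # the pen stays on the paper
            points.append([dx, dy] + pen)
    return np.array(points, dtype=np.float32)

# Example: two short strokes starting at the origin.
sketch = [np.array([[0, 0], [5, 0], [5, 5]]), np.array([[10, 5], [10, 0]])]
print(to_stroke5(sketch))
</pre>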


=== Sketch-RNN ===
[[File:sketchfig2.png|700px|center]]


The model is a sequence-to-sequence Variational Autoencoder (VAE).


==== Encoder ====
The encoder is a bidirectional RNN. The input is a sketch sequence denoted by <math>S =\{S_0, S_1, ... S_{N_{s}}\}</math> and a reversed sketch sequence denoted by <math>S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\}</math>. The final hidden states of the two encoding directions, <math>(h_{\rightarrow}, h_{\leftarrow})</math>, are concatenated to form a vector <math>h</math>:


\begin{align*}
h_{\rightarrow} &= \text{encode}_{\rightarrow}(S), \\
h_{\leftarrow} &= \text{encode}_{\leftarrow}(S_{reverse}), \\
h &= [h_{\rightarrow}; h_{\leftarrow}].
\end{align*}


The authors then project <math>h</math> into two vectors <math>\mu</math> and <math>\hat{\sigma}</math>, each of size <math>N_{z}</math>, using a fully connected layer. These two vectors are the parameters of the latent space Gaussian distribution that will estimate the distribution of the input data. Because standard deviations cannot be negative, an exponential function is used to convert <math>\hat{\sigma}</math> to a positive value. Next, a random vector <math>z</math> with mean <math>\mu</math> and standard deviation <math>\sigma</math> is constructed by scaling a normalized IID Gaussian sample, <math>\mathcal{N}(0,I)</math>:


\begin{align*}
\mu &= W_\mu h + b_\mu, \\
\hat{\sigma} &= W_\sigma h + b_\sigma, \\
\sigma &= \exp\left(\frac{\hat{\sigma}}{2}\right), \\
z &= \mu + \sigma \odot \mathcal{N}(0,I).
\end{align*}
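The following numpy snippet illustrates the projection and reparameterization step above, with randomly initialized weights standing in for the learned parameters <math>W_\mu, W_\sigma, b_\mu, b_\sigma</math>; the hidden size and <math>N_z</math> values are illustrative choices, not necessarily those used in the paper.

<pre>
import numpy as np

rng = np.random.default_rng(0)

N_z = 128                                  # latent vector size (illustrative choice)
h = rng.standard_normal(2 * 256)           # stands in for the concatenated [h_forward; h_backward]

# Fully connected projections; in the real model W_mu, W_sigma, b_mu, b_sigma are learned.
W_mu, b_mu = 0.01 * rng.standard_normal((N_z, h.size)), np.zeros(N_z)
W_sigma, b_sigma = 0.01 * rng.standard_normal((N_z, h.size)), np.zeros(N_z)

mu = W_mu @ h + b_mu
sigma_hat = W_sigma @ h + b_sigma
sigma = np.exp(sigma_hat / 2.0)            # exponentiation keeps the standard deviation positive

# Reparameterization: z is random, but conditioned on the input sketch through h.
z = mu + sigma * rng.standard_normal(N_z)
</pre>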




Note that <math>z</math> is not deterministic but a random vector that can be conditioned on an input sketch sequence.


==== Decoder ====
The decoder is an autoregressive RNN. The initial hidden and cell states are generated using <math>[h_0;c_0] = \tanh(W_z z + b_z)</math>, where <math>c_0</math> is used only if applicable (e.g. if an LSTM decoder is used). <math>S_0</math> is defined as <math>(0,0,1,0,0)</math> (the pen is touching the paper at location (0, 0)).


For each step <math>i</math> in the decoder, the input <math>x_i</math> is the concatenation of the previous point <math>S_{i-1}</math> and the latent vector <math>z</math>. The outputs of the RNN decoder <math>y_i</math> are parameters for a probability distribution that will generate the next point <math>S_i</math>.


The authors model <math>(\Delta x,\Delta y)</math> as a Gaussian mixture model (GMM) with <math>M</math> bivariate normal distributions, and model the ground truth pen states <math>(p_1, p_2, p_3)</math> as a categorical distribution <math>(q_1, q_2, q_3)</math>, where <math>q_1, q_2\ \text{and}\ q_3</math> sum to 1:


\begin{align*}
p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \, \mathcal{N}(\Delta x,\Delta y \mid \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho_{xy,j}), \quad \text{where} \quad \sum_{j=1}^{M}\Pi_j = 1.
\end{align*}


Here <math>\mathcal{N}(\Delta x,\Delta y \mid \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho_{xy,j})</math> is a bivariate normal distribution with means <math>\mu_x, \mu_y</math>, standard deviations <math>\sigma_x, \sigma_y</math>, and correlation parameter <math>\rho_{xy}</math>. There are <math>M</math> such distributions, and <math>\Pi</math> is a categorical distribution vector of length <math>M</math>; collectively, its entries form the mixture weights of the Gaussian mixture model.
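As an illustration, a single offset <math>(\Delta x, \Delta y)</math> could be sampled from such a mixture as follows, assuming the per-component parameters have already been produced by the decoder; the toy parameter values below are placeholders.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sample_offset(pi, mu_x, mu_y, sigma_x, sigma_y, rho_xy):
    """Sample (dx, dy) from the mixture of bivariate normals described above;
    all arguments are length-M arrays of mixture parameters."""
    j = rng.choice(len(pi), p=pi)                           # pick a mixture component
    cov_xy = rho_xy[j] * sigma_x[j] * sigma_y[j]
    cov = [[sigma_x[j] ** 2, cov_xy], [cov_xy, sigma_y[j] ** 2]]
    return rng.multivariate_normal([mu_x[j], mu_y[j]], cov)

# Toy parameters for M = 2 components (placeholder values, not real model output).
pi = np.array([0.7, 0.3])
dx, dy = sample_offset(pi,
                       mu_x=np.array([0.0, 1.0]), mu_y=np.array([0.0, -1.0]),
                       sigma_x=np.array([0.5, 0.2]), sigma_y=np.array([0.5, 0.2]),
                       rho_xy=np.array([0.0, 0.3]))
</pre>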


The output vector <math>y_i</math> is generated by applying a fully connected layer to the hidden state of the RNN.


\begin{align*}
x_i &= [S_{i-1}; z], \\
[h_i; c_i] &= \text{forward}(x_i, [h_{i-1}; c_{i-1}]), \\
y_i &= W_y h_i + b_y, \\
y_i &\in \mathbb{R}^{6M+3}.
\end{align*}


The output consists of the parameters of the probability distribution over the next data point:


\begin{align*}
y_i = [(\hat{\Pi}\ \mu_x\ \mu_y\ \hat{\sigma}_x\ \hat{\sigma}_y\ \hat{\rho}_{xy})_1\ (\hat{\Pi}\ \mu_x\ \mu_y\ \hat{\sigma}_x\ \hat{\sigma}_y\ \hat{\rho}_{xy})_2\ \dots\ (\hat{\Pi}\ \mu_x\ \mu_y\ \hat{\sigma}_x\ \hat{\sigma}_y\ \hat{\rho}_{xy})_M\ (\hat{q}_1\ \hat{q}_2\ \hat{q}_3)].
\end{align*}


<math>\exp</math> and <math>\tanh</math> operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.


\begin{align*}
\sigma_x = \exp (\hat \sigma_x),\
\sigma_y = \exp (\hat \sigma_y),\
\rho_{xy} = \tanh(\hat \rho_{xy}).  
\end{align*}


The categorical distribution probabilities <math>(q_1, q_2, q_3)</math> for the pen states <math>(p_1, p_2, p_3)</math>, as well as the mixture weights <math>\Pi_k</math>, are obtained with a softmax:


\begin{align*}
q_k =  \frac{\exp{(\hat q_k)}}{  \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}},
k \in \left\{1,2,3\right\},  
\Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}},
k \in \left\{1,...,M\right\}.
\end{align*}
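The following illustrative snippet splits a single decoder output vector <math>y_i</math> of size <math>6M+3</math> into its mixture and pen-state parameters and applies the transformations above; the function and variable names are assumptions, not taken from the released implementation.

<pre>
import numpy as np

def split_decoder_output(y, M):
    """Split a 6M+3 decoder output into GMM and pen-state parameters, applying
    the softmax / exp / tanh transformations described above."""
    params, pen_logits = y[:6 * M].reshape(M, 6), y[6 * M:]
    pi_hat, mu_x, mu_y, sig_x_hat, sig_y_hat, rho_hat = params.T

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    pi = softmax(pi_hat)            # mixture weights, sum to 1
    sigma_x = np.exp(sig_x_hat)     # standard deviations must be positive
    sigma_y = np.exp(sig_y_hat)
    rho_xy = np.tanh(rho_hat)       # correlation constrained to (-1, 1)
    q = softmax(pen_logits)         # categorical pen-state probabilities
    return pi, mu_x, mu_y, sigma_x, sigma_y, rho_xy, q

# Example with M = 20 mixture components and a random decoder output.
M = 20
y_i = np.random.default_rng(0).standard_normal(6 * M + 3)
pi, mu_x, mu_y, sigma_x, sigma_y, rho_xy, q = split_decoder_output(y_i, M)
</pre>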


It is hard for the model to decide when to stop drawing because the probabilities of the three pen events <math>(p_1, p_2, p_3)</math> are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define a hyperparameter <math>N_{max}</math>, the length of the longest sketch in the training set, and set <math>S_i</math> to <math>(0, 0, 0, 0, 1)</math> for <math>i > N_s</math>.


During sampling, an outcome <math>S_i^{'}</math> is generated at each time step and fed as the input for the next time step. The process stops when <math>p_3 = 1</math> or <math>i = N_{max}</math>. The output is not deterministic but a conditioned random sequence, and the level of randomness can be controlled using a temperature parameter <math>\tau</math>:


\begin{align*}
\hat q_k  \rightarrow \frac{\hat q_k}{\tau},
\hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau},  
\sigma_x^2 \rightarrow \sigma_x^2\tau, 
\sigma_y^2 \rightarrow \sigma_y^2\tau.
\end{align*}


The softmax parameters of the categorical distribution and the <math>\sigma</math> parameters of the bivariate normal distributions are scaled by the temperature parameter <math>\tau</math>, which controls the level of randomness in the samples. <math>\tau</math> ranges from 0 to 1; as <math>\tau \to 0</math>, the output becomes deterministic, since samples concentrate at the peak of the probability density function.
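A small sketch of how this temperature adjustment might be applied to the decoder outputs before sampling (names are illustrative):

<pre>
import numpy as np

def apply_temperature(pi_hat, q_hat, sigma_x, sigma_y, tau):
    """Adjust the sampling distributions with temperature tau in (0, 1], following the
    scaling rules above: logits are divided by tau and variances are scaled by tau."""
    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    pi = softmax(pi_hat / tau)                 # sharper mixture weights for small tau
    q = softmax(q_hat / tau)                   # sharper pen-state probabilities
    sigma_x_t = np.sqrt(sigma_x ** 2 * tau)    # variance sigma_x^2 scaled by tau
    sigma_y_t = np.sqrt(sigma_y ** 2 * tau)
    return pi, q, sigma_x_t, sigma_y_t
</pre>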


=== Unconditional Generation ===
As a special case, only the decoder RNN module can be trained, in which case it works as a standalone autoregressive model without latent variables. The initial states are then 0, and the input <math>x_i</math> is only <math>S_{i-1}</math> or <math>S_{i-1}^{'}</math>. Figure 3 shows sketches generated unconditionally, with the temperature parameter increasing from <math>\tau = 0.2</math> at the top (in blue) to <math>\tau = 0.9</math> at the bottom (in red).


[[File:sketchfig3.png|700px|center]]


=== Training ===
The training process is the same as for a Variational Autoencoder: the loss function is the sum of the Reconstruction Loss <math>L_R</math> and the Kullback-Leibler Divergence Loss <math>L_{KL}</math>. The reconstruction loss <math>L_R</math> is computed from the generated distribution parameters and the training data <math>S</math>. It is the sum of <math>L_s</math> and <math>L_p</math>, the log losses of the offsets <math>(\Delta x, \Delta y)</math> and the pen states <math>(p_1, p_2, p_3)</math>, respectively:


\begin{align*}
L_s &= - \frac{1}{N_{max}} \sum_{i = 1}^{N_s} \log\left(\sum_{j = 1}^{M} \Pi_{j,i} \, \mathcal{N}(\Delta x_i,\Delta y_i \mid \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho_{xy,j,i})\right), \\
L_p &= - \frac{1}{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), \\
L_R &= L_s + L_p.
\end{align*}




Both terms are normalized by <math>N_{max}</math>.
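For illustration, a numpy sketch of computing <math>L_R</math> for a single sketch, assuming the decoder outputs have already been converted to per-step distribution parameters; the argument names, shapes, and the small constant added inside the logarithms are assumptions made for this example.

<pre>
import numpy as np

def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Density of a bivariate normal at (dx, dy); arguments may be arrays of mixture params."""
    zx = (dx - mu_x) / sigma_x
    zy = (dy - mu_y) / sigma_y
    z = zx ** 2 + zy ** 2 - 2 * rho * zx * zy
    denom = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho ** 2)
    return np.exp(-z / (2 * (1 - rho ** 2))) / denom

def reconstruction_loss(offsets, pen_states, gmm_params, q, N_max):
    """L_R = L_s + L_p for one sketch.
    offsets: (N_s, 2) ground-truth (dx, dy); pen_states: (N_max, 3) one-hot pen data;
    gmm_params: per-step tuples (pi, mu_x, mu_y, sigma_x, sigma_y, rho) of length-M arrays;
    q: (N_max, 3) predicted pen-state probabilities."""
    N_s = len(offsets)
    L_s = 0.0
    for i in range(N_s):
        pi, mu_x, mu_y, sigma_x, sigma_y, rho = gmm_params[i]
        mixture = np.sum(pi * bivariate_normal_pdf(offsets[i, 0], offsets[i, 1],
                                                   mu_x, mu_y, sigma_x, sigma_y, rho))
        L_s += -np.log(mixture + 1e-8)          # small constant for numerical stability
    L_p = -np.sum(pen_states * np.log(q + 1e-8))
    return (L_s + L_p) / N_max                  # both terms normalized by N_max
</pre>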


<math>L_{KL}</math> measures the difference between the distribution of the latent vector <math>z</math> and an i.i.d. Gaussian vector with zero mean and unit variance.


\begin{align*}
L_{KL} = - \frac{1}{2 N_z} \left(1 + \hat{\sigma} - \mu^2 - \exp(\hat{\sigma})\right).
\end{align*}


The overall loss is weighted as:


\begin{align*}
Loss = L_R + w_{KL} L_{KL}
\end{align*}
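A minimal numpy sketch of the KL term and this weighted objective; the default <math>w_{KL}</math> value is illustrative rather than the paper's setting.

<pre>
import numpy as np

def kl_loss(mu, sigma_hat):
    """L_KL between the encoder's N(mu, sigma) and a standard Gaussian,
    normalized by the number of latent dimensions N_z as in the formula above."""
    N_z = mu.size
    return -0.5 / N_z * np.sum(1.0 + sigma_hat - mu ** 2 - np.exp(sigma_hat))

def total_loss(L_R, mu, sigma_hat, w_KL=0.5):
    """Weighted training objective Loss = L_R + w_KL * L_KL (w_KL value is illustrative)."""
    return L_R + w_KL * kl_loss(mu, sigma_hat)
</pre>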


When <math>w_{KL} = 0</math>, the model becomes a standalone unconditional generator: there is no <math>L_{KL}</math> term and we only optimize <math>L_{R}</math>. By removing the <math>L_{KL}</math> term, the model approaches a pure autoencoder, sacrificing the ability to enforce a prior over the latent space in exchange for a better reconstruction loss.


While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.


<center><math>
\eta_{step} = 1 - (1 - \eta_{min})R^{step}
</math></center>


<center><math>
Loss_{train} = L_R + w_{KL} \, \eta_{step} \, \max(L_{KL}, KL_{min})
</math></center>
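An illustrative implementation of this annealed objective is shown below; the default values of <math>\eta_{min}</math>, <math>R</math>, <math>w_{KL}</math>, and <math>KL_{min}</math> are placeholders rather than the paper's exact hyperparameters.

<pre>
def annealed_kl_weight(step, eta_min=0.01, R=0.99995):
    """eta_step = 1 - (1 - eta_min) * R ** step: ramps the KL term in gradually
    (eta_min and R defaults are illustrative, not necessarily the paper's values)."""
    return 1.0 - (1.0 - eta_min) * R ** step

def annealed_train_loss(L_R, L_KL, step, w_KL=0.5, KL_min=0.2):
    """Loss_train = L_R + w_KL * eta_step * max(L_KL, KL_min); hyperparameters illustrative."""
    return L_R + w_KL * annealed_kl_weight(step) * max(L_KL, KL_min)
</pre>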


As shown in Figure 4, the <math>L_{R}</math> metric for the standalone decoder-only model is effectively an upper bound for the various models that use a latent vector. The reason is that the unconditional model does not have access to the entire sketch it needs to generate.


[[File:s.png|600px|thumb|center|Figure 4. Tradeoff between <math>L_{R} </math> and <math>L_{KL} </math> for two models trained on single-class datasets (left). Validation loss curves for models trained on the Yoga dataset using various <math>w_{KL} </math> values (right).]]


== Experiments ==
The authors experiment with the sketch-rnn model using different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks; its ability to spontaneously augment its own weights enables it to adapt to many different regimes in a large, diverse dataset. They also conduct experiments on multi-class datasets. The results are as follows.


[[File:sketchtable1.png|700px|center]]


We can clearly see the trade-off between <math>L_R</math> and <math>L_{KL}</math> in this table. Furthermore, <math>L_R</math> decreases as <math>w_{KL}</math> is halved.


=== Conditional Reconstruction ===
The authors assess sketches reconstructed from a given input sketch at different <math>\tau</math> values. With higher <math>\tau</math> values (toward the right), the reconstructed sketches become more random. The reconstructed sketches have similar properties to the input image, occasionally adding or removing a few minor details.


[[File:sketchfig5.png|700px|center]]


They also experiment with inputting a sketch from a different class; the output still keeps some features of the class that the model was trained on.


=== Latent Space Interpolation ===
The authors visualize reconstructed sketches while interpolating between latent vectors, for models trained with different <math>w_{KL}</math> values. As a Gaussian prior is enforced on the latent space, fewer gaps are expected in the latent space between two encoded vectors. A model trained with a higher <math>w_{KL}</math> is expected to produce images that are closer to the data manifold. To show this, the authors trained several models using various values of <math>w_{KL}</math> and showed experimentally that with high <math>w_{KL}</math> values, the generated images are more coherently interpolated.


[[File:sketchfig6.png|700px|center]]
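For illustration, interpolation between two latent vectors can be done linearly or spherically (the latter following White, 2016, listed in the references); the snippet below uses random vectors as stand-ins for latent codes produced by the encoder.

<pre>
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors (White, 2016);
    linear interpolation (1 - t) * z0 + t * z1 is a simpler alternative."""
    cos_omega = np.dot(z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Random stand-ins for two latent codes the encoder would produce (e.g. a cat and a pig).
rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(128), rng.standard_normal(128)

# Each interpolated vector would be passed to the decoder to produce an intermediate sketch.
frames = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 10)]
</pre>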


=== Sketch Drawing Analogies ===
Since the latent vector <math>z</math> encodes conceptual features of a sketch, those features can also be used to augment other sketches that do not have them. This is possible when models are trained with low <math>L_{KL}</math> values. Given the smoothness of the latent space, where any interpolated vector between two latent vectors results in a coherent sketch, we can perform vector arithmetic on the latent vectors encoded from different sketches and explore how the model organizes the latent space to represent different concepts in the manifold of generated sketches. For instance, we can subtract the latent vector of an encoded pig head from the latent vector of a full pig to arrive at a vector that represents a body; adding this difference to the latent vector of a cat head results in a full cat (i.e. cat head + body = full cat), as sketched below.
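A minimal sketch of this latent-space arithmetic, using random vectors as stand-ins for the latent codes the encoder would produce for the three sketches:

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for latent vectors the encoder would produce for three sketches
# (in practice these come from encoding real drawings; random here so the snippet runs).
z_full_pig = rng.standard_normal(128)
z_pig_head = rng.standard_normal(128)
z_cat_head = rng.standard_normal(128)

z_body = z_full_pig - z_pig_head   # direction that roughly encodes "a body"
z_full_cat = z_cat_head + z_body   # decoding this vector should yield a full cat
</pre>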


=== Predicting Different Endings of Incomplete Sketches ===
The model is able to complete an incomplete sketch by encoding the incomplete sketch into a hidden state <math>h</math> using the decoder, and then using <math>h</math> as the initial hidden state to generate the remaining sketch. The authors train decoder-only models on individual classes and set <math>\tau = 0.8</math> to complete the samples. Figure 7 shows the results.


[[File:sketchfig7.png|700px|center]]


== Limitations ==


Although sketch-rnn can model a large variety of sketch drawings, there are several limitations in the current approach. For most single-class datasets, sketch-rnn is capable of modeling sketches of up to around 300 data points; the model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.
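For reference, a minimal numpy implementation of Ramer-Douglas-Peucker simplification applied to a single stroke of absolute (x, y) points; the tolerance value in the example is illustrative.

<pre>
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification: recursively keep only points whose
    perpendicular distance from the chord between the endpoints exceeds epsilon."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.linalg.norm(chord)
    if chord_len == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # perpendicular distance of each point from the line through start and end
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp(points[:idx + 1], epsilon)   # simplify each half recursively
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

# Example: a noisy stroke reduced to its corner points (epsilon chosen for illustration).
stroke = np.array([[0, 0], [1, 0.05], [2, -0.02], [3, 4], [4, 4.1], [5, 3.95]])
print(rdp(stroke, epsilon=0.5))
</pre>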


For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good as for simpler classes such as ants, faces, or firetrucks. The models trained on these more challenging classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness produced by a Variational Autoencoder trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.
 
While both the conditional and unconditional models are capable of training on datasets containing several classes, sketch-rnn is ineffective at modeling a large number of classes simultaneously: the generated samples become incoherent, with features of different classes appearing in the same sketch.
 
== Applications and Future Work ==
The authors believe this model can assist artists by suggesting how to finish a sketch, helping them find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use case, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also come up with abstract designs that resonate more with their target audience.
 
This model may also find a place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors report having become much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments.
When the model is trained with a high <math>w_{KL}</math> and sampled with a low <math>\tau</math>, it may help turn a poor sketch into a more aesthetically pleasing one. Latent vector augmentation could also help create better drawings by incorporating user-rating data during the training process.
 
The authors conclude by suggesting the following future directions for this work:
# Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.
# Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.
 
It would be exciting to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches.
 
The authors also mention that the opposite direction, converting a photograph of an object into an unrealistic but similar-looking sketch composed of a minimal number of lines, is an even more interesting problem.
 
Moreover, it would be interesting to see how varying the loss function is reflected in the drawings; an exotic form of loss function may change the way the network behaves, which could lead to various applications.
 
== Conclusion ==
The paper presents a methodology for modeling sketch drawings using recurrent neural networks. It introduces the sketch-rnn model, which can encode and decode sketches and generate and complete unfinished sketches. In addition, the authors demonstrate how to interpolate between latent vectors from different classes and how to use them to augment sketches or generate similar-looking sketches. Furthermore, the importance of enforcing a prior distribution on the latent vector for coherent interpolated sketch generation is shown. Finally, a large dataset of sketch drawings is released for future research.
 
== Critique ==
This paper presents both a novel large dataset of sketches and a new RNN architecture to generate new sketches. Although it is exciting to read about, many improvements could be made.
 
* The performance of the decoder model can hardly be evaluated. The authors present the performance of the decoder by showing generated sketches, which is clear and straightforward, but not very efficient. It would be better if the authors presented a metric to evaluate how well the sketches are generated, rather than printing them out and relying on human judgment. The authors do not present a quantitative evaluation of the algorithm either; they provide <math>L_R</math> and <math>L_{KL}</math> for reference, but a lower loss does not necessarily represent better performance, and training loss alone likely does not capture the quality of a sketch.
 
* The authors do not mention training details such as learning rate, training time, or parameter count.
 
* The approach presented in the paper is innovative and makes clever use of a significantly large training database. The same framework could be used to assist a wide range of professionals through a semi-automatic support system that augments human capabilities for tasks such as graphic design, report preparation, or even journalism.
 
* The algorithm lacks comparison to the prior state of the art on standard metrics, which leaves the novelty unclear. Using strokes as inputs is a novel and innovative move, but the paper does not provide a baseline or any comparison with other methods or algorithms. Some other research using similar and smaller datasets is mentioned in the paper; it would be better if the authors compared the new algorithm against some basic or existing methods as baselines.
 
* Besides the comparison with other algorithms, it would also be helpful if the authors removed or replaced some components of the model to show whether each part is necessary, or what led them to include a specific component.
 
* The authors do not present a complexity analysis or deeper mathematical analysis of the algorithm, nor do they compare against previous results using more standard metrics, so the algorithmic contribution is limited. It would be better to include a more formal analysis of the algorithmic side.
 
* The authors propose a few future applications for the model, but the current output does not seem very close to their descriptions. Still, this is a very good beginning; with the release of the sketch dataset, it should attract more researchers to build on and improve this work.
 
* As the authors note, the model becomes increasingly difficult to train as the sketch length increases.
 
== References ==
# Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.
# Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.
# Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
# H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.
# David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.
# Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
# I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.
# Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
# David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.
# David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
# Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
# P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.
# Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. https://quickdraw.withgoogle.com/, 2016. URL https: //quickdraw.withgoogle.com/.
# C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
# T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.
# D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
# Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
# Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
# Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.
# Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.
# M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.
# S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.
# Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.
# Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
# Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.
# Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Comput. Graph.,37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.
# T. White. Sampling Generative Networks. [https://arxiv.org/abs/1609.04468 ArXiv e-prints], September 2016.
#Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.
# Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.


Introduction

In this paper, the authors present a recurrent neural network, sketch-rnn, that can be used to construct stroke-based drawings. Besides new robust training methods, they also outline a framework for conditional and unconditional sketch generation.

Neural networks have been heavily used as image generation tools. For example, Generative Adversarial Networks, Variational Inference, and Autoregressive models have been used. Most of those models are designed to generate pixels to construct images. People, however, learn to draw using sequences of strokes as opposed to the simultaneous generation of pixels. The authors propose a new generative model that creates vector images so that it might generalize abstract concepts in a manner more similar to how humans do.

The model is trained on hand-drawn sketches represented as input sequences and produces sketches in vector format. For the conditional generation model, the authors also explore the latent space representation of vector images and discuss a few future applications. The model and dataset are available as an open source project (https://magenta.tensorflow.org/sketch_rnn).

Terminology

Pixel images, also referred to as raster or bitmap images, are files that encode image data as a grid of pixels. They are the most common image type, with extensions such as .png, .jpg, and .bmp.

Vector images are files that encode image data as paths between points. SVG and EPS file types are used to store vector images.

For a visual comparison of raster and vector images, see this video: https://www.youtube.com/watch?v=-Fs2t6P5AjY. As mentioned, vector images are generally simpler and more abstract, whereas raster images are generally used to store detailed images.

For this paper, the important distinction between the two is that the encoding of images in the model will be inherently more abstract because of the vector representation. The intuition is that generating abstract representations is more effective using a vector representation.

Related Work

Earlier works have taken related approaches to image generation, such as Portrait Drawing by Paul the Robot [26] and reinforcement learning methods that discover a set of paint brush strokes that best represent a given input photograph [28]. These systems mimic digitized photographs rather than learn generative models of vector images. There are also neural network-based approaches, but they mostly deal with pixel images; little work has been done on vector image generation. Existing models use Hidden Markov Models [25] or Mixture Density Networks [2] to generate human sketches, continuous data points (modelling Chinese characters as a sequence of pen stroke actions), or vectorized Kanji characters [9, 29].

Neural network-based approaches can generate a latent space representation of vector images that follows a Gaussian distribution; the network is trained to match this distribution by minimizing a suitable loss function. Using this idea, previous work has combined sequence-to-sequence models with a Variational Autoencoder to model sentences in a latent space, and has used probabilistic program induction to model the Omniglot dataset. Variational Autoencoders differ from regular autoencoders in that there is an intermediary sampling step between the encoder and decoder. Simply connecting the two would not guarantee that the encoded values can be interpreted as the parameters of a normal distribution over a latent space. In a VAE, the encoder output is treated as the mean and variance of a Gaussian from which a latent sample is drawn, so the encoding is penalized as if it were the parameters of a normal distribution.

One limiting factor the authors mention for generative vector drawing is the lack of publicly available datasets. Previous datasets, such as the Sketch dataset of 20k vector sketches, were explored for feature extraction techniques. The Sketchy dataset, consisting of 70k vector sketches paired with pixel images, was used for large-scale exploration of human sketches. The ShadowDraw system, which used 30k raster images together with extracted vectorized features, is an interactive system that predicts what a finished drawing will look like based on the incomplete brush strokes a user has drawn so far. In all of these cases the datasets are comparatively small. The much larger dataset introduced and publicly released in this work is one of the paper's major contributions.

Major Contributions

This paper makes the following major contributions. The authors outline a framework for both unconditional and conditional generation of vector images composed of sequences of lines. Their recurrent neural network-based generative model is capable of producing sketches of common objects in vector format. They develop a training procedure unique to vector images that makes training more robust. They also release a large dataset of hand-drawn vector images to encourage further development of generative modeling for vector images, together with an open source implementation of the model.

Methodology

Dataset

QuickDraw is a dataset of 50 million vector drawings collected from the online game Quick, Draw!, where players are asked to draw objects from a particular object class in less than 20 seconds. It contains hundreds of classes; each class has 70k training samples, 2.5k validation samples, and 2.5k test samples.

Each sample is stored as a sequence of pen stroke events. The origin is the initial coordinate of the drawing, and the sketch is a list of points. Each point consists of 5 elements [math]\displaystyle{ (\Delta x, \Delta y, p_{1}, p_{2}, p_{3}) }[/math], where [math]\displaystyle{ \Delta x }[/math] and [math]\displaystyle{ \Delta y }[/math] are the offset distances in the x and y directions from the previous point. The parameters [math]\displaystyle{ p_{1}, p_{2}, p_{3} }[/math] form a binary one-hot representation of three possible pen states: [math]\displaystyle{ p_{1} }[/math] indicates the pen is touching the paper, [math]\displaystyle{ p_{2} }[/math] indicates the pen will be lifted after this point, and [math]\displaystyle{ p_{3} }[/math] indicates the drawing has ended.
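To make the format concrete, the following Python sketch (a minimal illustration, not the authors' preprocessing code; the helper name and input layout are assumptions) converts a drawing given as absolute-coordinate strokes into this offset-plus-pen-state representation:

    import numpy as np

    def to_stroke_format(strokes):
        """Convert strokes given as arrays of absolute (x, y) points into
        a sequence of (dx, dy, p1, p2, p3) pen events."""
        points, prev = [], np.zeros(2)
        for s, stroke in enumerate(strokes):
            for i, xy in enumerate(stroke):
                dx, dy = np.asarray(xy, dtype=float) - prev
                end_of_stroke = (i == len(stroke) - 1)
                end_of_sketch = end_of_stroke and (s == len(strokes) - 1)
                if end_of_sketch:
                    pen = (0, 0, 1)      # drawing has ended
                elif end_of_stroke:
                    pen = (0, 1, 0)      # pen will be lifted after this point
                else:
                    pen = (1, 0, 0)      # pen stays on the paper
                points.append((dx, dy) + pen)
                prev = np.asarray(xy, dtype=float)
        return np.array(points)

    # Two strokes of a tiny sketch, in absolute coordinates relative to the origin.
    sketch = [np.array([[0, 0], [10, 0], [10, 10]]), np.array([[20, 10], [30, 20]])]
    print(to_stroke_format(sketch))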

Sketch-RNN

The model is a sequence-to-sequence Variational Autoencoder (VAE).

Encoder

The encoder is a bidirectional RNN. The input is a sketch sequence denoted by [math]\displaystyle{ S =\{S_0, S_1, ... S_{N_{s}}\} }[/math] and a reversed sketch sequence denoted by [math]\displaystyle{ S_{reverse} = \{S_{N_{s}},S_{N_{s}-1}, ... S_0\} }[/math]. The final hidden layer representations of the two encoded sequences [math]\displaystyle{ (h_{ \rightarrow}, h_{ \leftarrow}) }[/math] are concatenated to form a latent vector, [math]\displaystyle{ h }[/math], of size [math]\displaystyle{ N_{z} }[/math],

\begin{split} &h_{ \rightarrow} = encode_{ \rightarrow }(S), \\ &h_{ \leftarrow} = encode_{ \leftarrow }(S_{reverse}), \\ &h = [h_{\rightarrow}; h_{\leftarrow}]. \end{split}

The authors then project [math]\displaystyle{ h }[/math] into two vectors [math]\displaystyle{ \mu }[/math] and [math]\displaystyle{ \hat{\sigma} }[/math] of size [math]\displaystyle{ N_{z} }[/math] using a fully connected layer. These two vectors parameterize the latent space Gaussian distribution that estimates the distribution of the input data. Because standard deviations cannot be negative, an exponential operation is applied to [math]\displaystyle{ \hat{\sigma} }[/math] to obtain a positive value. Finally, a random vector with mean [math]\displaystyle{ \mu }[/math] and standard deviation [math]\displaystyle{ \sigma }[/math] is constructed by scaling a standard IID Gaussian sample, [math]\displaystyle{ \mathcal{N}(0,I) }[/math],

\begin{split} & \mu = W_\mu h + b_\mu, \\ & \hat \sigma = W_\sigma h + b_\sigma, \\ & \sigma = exp( \frac{\hat \sigma}{2}), \\ & z = \mu + \sigma \odot \mathcal{N}(0,I). \end{split}


Note that [math]\displaystyle{ z }[/math] is not deterministic but a random vector that can be conditioned on an input sketch sequence.
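As a concrete illustration of the projection and reparameterization step, here is a minimal NumPy sketch; the sizes and weight matrices are random placeholders rather than trained parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    N_enc, N_z = 512, 128                       # assumed sizes: [h_fwd; h_bwd] and latent vector

    h = rng.normal(size=N_enc)                  # concatenated bidirectional encoder state
    W_mu = rng.normal(scale=0.01, size=(N_z, N_enc)); b_mu = np.zeros(N_z)
    W_sigma = rng.normal(scale=0.01, size=(N_z, N_enc)); b_sigma = np.zeros(N_z)

    mu = W_mu @ h + b_mu
    sigma_hat = W_sigma @ h + b_sigma
    sigma = np.exp(sigma_hat / 2.0)             # exponentiation keeps the std. deviation positive
    z = mu + sigma * rng.normal(size=N_z)       # z ~ N(mu, sigma^2), via the reparameterization trick
    print(z.shape)                              # (128,)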

Decoder

The decoder is an autoregressive RNN. The initial hidden and cell states are generated using [math]\displaystyle{ [h_0;c_0] = \tanh(W_z z + b_z) }[/math], where [math]\displaystyle{ c_0 }[/math] is used only if applicable (e.g., when an LSTM decoder is used). [math]\displaystyle{ S_0 }[/math] is defined as [math]\displaystyle{ (0,0,1,0,0) }[/math] (the pen starts on the paper at the origin).

For each step [math]\displaystyle{ i }[/math] in the decoder, the input [math]\displaystyle{ x_i }[/math] is the concatenation of the previous point [math]\displaystyle{ S_{i-1} }[/math] and the latent vector [math]\displaystyle{ z }[/math]. The outputs of the RNN decoder [math]\displaystyle{ y_i }[/math] are parameters for a probability distribution that will generate the next point [math]\displaystyle{ S_i }[/math].

The authors model [math]\displaystyle{ (\Delta x,\Delta y) }[/math] as a Gaussian mixture model (GMM) with [math]\displaystyle{ M }[/math] normal distributions and model the ground truth data [math]\displaystyle{ (p_1, p_2, p_3) }[/math] as a categorical distribution [math]\displaystyle{ (q_1, q_2, q_3) }[/math] where [math]\displaystyle{ q_1, q_2\ \text{and}\ q_3 }[/math] sum up to 1,

\begin{align*} p(\Delta x, \Delta y) = \sum_{j=1}^{M} \Pi_j \mathcal{N}(\Delta x,\Delta y \mid \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}), \quad \text{where} \quad \sum_{j=1}^{M}\Pi_j = 1. \end{align*}

Here [math]\displaystyle{ \mathcal{N}(\Delta x,\Delta y \mid \mu_{x,j}, \mu_{y,j}, \sigma_{x,j},\sigma_{y,j}, \rho _{xy,j}) }[/math] is a bivariate normal distribution with means [math]\displaystyle{ \mu_x, \mu_y }[/math], standard deviations [math]\displaystyle{ \sigma_x, \sigma_y }[/math], and correlation parameter [math]\displaystyle{ \rho_{xy} }[/math]. There are [math]\displaystyle{ M }[/math] such distributions, and [math]\displaystyle{ \Pi }[/math] is a categorical distribution vector of length [math]\displaystyle{ M }[/math] whose entries are the mixture weights of the Gaussian mixture model.

The output vector [math]\displaystyle{ y_i }[/math] is generated by applying a fully connected layer to the hidden state of the RNN.

\begin{split} &x_i = [S_{i-1}; z], \\ &[h_i; c_i] = forward(x_i,[h_{i-1}; c_{i-1}]), \\ &y_i = W_y h_i + b_y, \\ &y_i \in \mathbb{R}^{6M+3}. \\ \end{split}

The output contains the parameters of the probability distribution of the next data point.

\begin{align*} [(\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_1\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_2\ ...\ (\hat\Pi\ \mu_x\ \mu_y\ \hat\sigma_x\ \hat\sigma_y\ \hat\rho_{xy})_M\ (\hat{q_1}\ \hat{q_2}\ \hat{q_3})] = y_i \end{align*}

[math]\displaystyle{ \exp }[/math] and [math]\displaystyle{ \tanh }[/math] operations are applied to ensure that the standard deviations are non-negative and the correlation value is between -1 and 1.

\begin{align*} \sigma_x = \exp (\hat \sigma_x),\ \sigma_y = \exp (\hat \sigma_y),\ \rho_{xy} = \tanh(\hat \rho_{xy}). \end{align*}

The categorical distribution probabilities [math]\displaystyle{ (q_1, q_2, q_3) }[/math] for the pen states [math]\displaystyle{ (p_1, p_2, p_3) }[/math] are obtained as:

\begin{align*} q_k = \frac{\exp{(\hat q_k)}}{ \sum\nolimits_{j = 1}^{3} \exp {(\hat q_j)}}, k \in \left\{1,2,3\right\}, \Pi _k = \frac{\exp{(\hat \Pi_k)}}{ \sum\nolimits_{j = 1}^{M} \exp {(\hat \Pi_j)}}, k \in \left\{1,...,M\right\}. \end{align*}
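The following NumPy sketch (a hypothetical helper, with an assumed ordering of the six per-mixture parameters) shows how a decoder output vector of length [math]\displaystyle{ 6M+3 }[/math] can be split and passed through the exp, tanh, and softmax transformations above:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def split_decoder_output(y, M):
        """Split y (length 6M + 3) into GMM parameters and pen-state probabilities."""
        params, q_hat = y[:6 * M].reshape(M, 6), y[6 * M:]
        pi_hat, mu_x, mu_y, sx_hat, sy_hat, rho_hat = params.T
        return {"pi": softmax(pi_hat),           # mixture weights, sum to 1
                "mu_x": mu_x, "mu_y": mu_y,
                "sigma_x": np.exp(sx_hat),       # positive standard deviations
                "sigma_y": np.exp(sy_hat),
                "rho": np.tanh(rho_hat),         # correlation in (-1, 1)
                "q": softmax(q_hat)}             # pen-state probabilities (q1, q2, q3)

    M = 20
    out = split_decoder_output(np.random.default_rng(0).normal(size=6 * M + 3), M)
    print(out["pi"].sum(), out["q"])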

It is hard for the model to decide when to stop drawing because the probabilities of the three pen events [math]\displaystyle{ (p_1, p_2, p_3) }[/math] are very unbalanced. Researchers in the past have used different weights for each pen event probability, but the authors found this approach inelegant and inadequate. Instead, they define [math]\displaystyle{ N_{max} }[/math], the length of the longest sketch in the training set, as a hyperparameter and set [math]\displaystyle{ S_i }[/math] to [math]\displaystyle{ (0, 0, 0, 0, 1) }[/math] for [math]\displaystyle{ i \gt N_s }[/math].

During sampling, an output [math]\displaystyle{ S_i^{'} }[/math] is generated at each time step and fed back as the input for the next time step. The process stops when [math]\displaystyle{ p_3 = 1 }[/math] or [math]\displaystyle{ i = N_{max} }[/math]. The output is therefore not deterministic but a conditioned random sequence, and the level of randomness can be controlled using a temperature parameter [math]\displaystyle{ \tau }[/math].

\begin{align*} \hat q_k \rightarrow \frac{\hat q_k}{\tau}, \hat \Pi_k \rightarrow \frac{\hat \Pi_k}{\tau}, \sigma_x^2 \rightarrow \sigma_x^2\tau, \sigma_y^2 \rightarrow \sigma_y^2\tau. \end{align*}

The temperature parameter [math]\displaystyle{ \tau }[/math] scales the softmax logits of the categorical distributions and the [math]\displaystyle{ \sigma }[/math] parameters of the bivariate normal distributions, which controls the level of randomness in the samples. [math]\displaystyle{ \tau }[/math] ranges from 0 to 1; as [math]\displaystyle{ \tau \rightarrow 0 }[/math], the output becomes deterministic, since the samples collapse onto the peaks of the probability density functions.
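A minimal sampling sketch (assuming the parameter dictionary produced by the helper above): scale the weights and variances by the temperature, pick a mixture component, and draw the next offset and pen state.

    import numpy as np

    def sample_next_point(params, tau=0.5, rng=np.random.default_rng()):
        """params: output of split_decoder_output above (pre-temperature)."""
        # Raising probabilities to the power 1/tau and renormalizing is
        # equivalent to dividing the softmax logits by tau.
        pi = params["pi"] ** (1.0 / tau); pi /= pi.sum()
        q = params["q"] ** (1.0 / tau); q /= q.sum()
        j = rng.choice(len(pi), p=pi)                       # pick a mixture component
        sx = params["sigma_x"][j] * np.sqrt(tau)            # sigma^2 -> sigma^2 * tau
        sy = params["sigma_y"][j] * np.sqrt(tau)
        rho = params["rho"][j]
        cov = [[sx * sx, rho * sx * sy], [rho * sx * sy, sy * sy]]
        dx, dy = rng.multivariate_normal([params["mu_x"][j], params["mu_y"][j]], cov)
        pen = np.zeros(3); pen[rng.choice(3, p=q)] = 1.0    # one-hot pen state
        return np.concatenate([[dx, dy], pen])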

Unconditional Generation

In a special case, only the decoder RNN module is trained, so it works as a standalone autoregressive model without latent variables. In this case the initial states are 0 and the input [math]\displaystyle{ x_i }[/math] is only [math]\displaystyle{ S_{i-1} }[/math] or [math]\displaystyle{ S_{i-1}^{'} }[/math]. Figure 3 shows sketches generated unconditionally, with the temperature parameter ranging from [math]\displaystyle{ \tau = 0.2 }[/math] at the top (blue) to [math]\displaystyle{ \tau = 0.9 }[/math] at the bottom (red).

Training

The training process is the same as for a Variational Autoencoder: the loss function is the sum of the reconstruction loss [math]\displaystyle{ L_R }[/math] and the Kullback-Leibler divergence loss [math]\displaystyle{ L_{KL} }[/math]. The reconstruction loss [math]\displaystyle{ L_R }[/math] is computed from the generated pdf parameters and the training data [math]\displaystyle{ S }[/math]. It is the sum of [math]\displaystyle{ L_s }[/math] and [math]\displaystyle{ L_p }[/math], the log losses of the offsets [math]\displaystyle{ (\Delta x, \Delta y) }[/math] and the pen states [math]\displaystyle{ (p_1, p_2, p_3) }[/math] respectively.

\begin{align*} L_s = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_s} \log\Big(\sum_{j = 1}^{M} \Pi_{j,i} \mathcal{N}(\Delta x_i,\Delta y_i \mid \mu_{x,j,i}, \mu_{y,j,i}, \sigma_{x,j,i},\sigma_{y,j,i}, \rho _{xy,j,i})\Big), \end{align*} \begin{align*} L_p = - \frac{1 }{N_{max}} \sum_{i = 1}^{N_{max}} \sum_{k = 1}^{3} p_{k,i} \log (q_{k,i}), \qquad L_R = L_s + L_p. \end{align*}


Both terms are normalized by [math]\displaystyle{ N_{max} }[/math].
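To make the reconstruction loss concrete, the following NumPy sketch (an illustration of the formulas above, not the authors' implementation) evaluates [math]\displaystyle{ L_s }[/math] and [math]\displaystyle{ L_p }[/math] for one sketch given per-step mixture parameters:

    import numpy as np

    def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sx, sy, rho):
        zx, zy = (dx - mu_x) / sx, (dy - mu_y) / sy
        z = zx**2 + zy**2 - 2.0 * rho * zx * zy
        norm = 2.0 * np.pi * sx * sy * np.sqrt(1.0 - rho**2)
        return np.exp(-z / (2.0 * (1.0 - rho**2))) / norm

    def reconstruction_loss(offsets, pen_states, mix, q, N_max):
        """offsets: (N_s, 2) ground-truth (dx, dy); pen_states: (N_max, 3) one-hot targets;
        mix: dict of per-step arrays of shape (N_max, M); q: (N_max, 3) predicted pen probs."""
        N_s = offsets.shape[0]
        L_s = 0.0
        for i in range(N_s):                                 # offset loss runs only up to N_s
            pdf = bivariate_normal_pdf(offsets[i, 0], offsets[i, 1],
                                       mix["mu_x"][i], mix["mu_y"][i],
                                       mix["sigma_x"][i], mix["sigma_y"][i], mix["rho"][i])
            L_s -= np.log(np.sum(mix["pi"][i] * pdf) + 1e-8)
        L_p = -np.sum(pen_states * np.log(q + 1e-8))         # pen loss runs over all N_max steps
        return L_s / N_max, L_p / N_max, (L_s + L_p) / N_max  # L_s, L_p, L_R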

[math]\displaystyle{ L_{KL} }[/math] measures the difference between the distribution of the latent vector [math]\displaystyle{ z }[/math] and an i.i.d. Gaussian vector with zero mean and unit variance.

\begin{align*} L_{KL} = - \frac{1}{2 N_z} (1+\hat \sigma - \mu^2 - \exp(\hat \sigma)) \end{align*}

The overall loss is weighted as:

\begin{align*} Loss = L_R + w_{KL} L_{KL} \end{align*}

When [math]\displaystyle{ w_{KL} = 0 }[/math], the model becomes a standalone unconditional generator. Specifically, there is no [math]\displaystyle{ L_{KL} }[/math] term and we only optimize for [math]\displaystyle{ L_{R} }[/math]. By removing the [math]\displaystyle{ L_{KL} }[/math] term the model approaches a pure autoencoder: it sacrifices the ability to enforce a prior over the latent space but achieves a better reconstruction loss.

While the aforementioned loss function could be used, it was found that annealing the KL term (as shown below) in the loss function produces better results.

[math]\displaystyle{ \eta_{step} = 1 - (1 - \eta_{min})R^{step} }[/math]
[math]\displaystyle{ Loss_{train} = L_R + w_{KL} \eta_{step} max(L_{KL}, KL_{min}) }[/math]
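A small sketch of the KL term and the annealed training loss; the hyperparameter values below ([math]\displaystyle{ \eta_{min} }[/math], [math]\displaystyle{ R }[/math], [math]\displaystyle{ w_{KL} }[/math], [math]\displaystyle{ KL_{min} }[/math]) are illustrative placeholders, not the authors' settings:

    import numpy as np

    def kl_loss(mu, sigma_hat):
        # L_KL for a diagonal Gaussian posterior against the N(0, I) prior, averaged over N_z
        return -0.5 * np.mean(1.0 + sigma_hat - mu**2 - np.exp(sigma_hat))

    def annealed_loss(L_R, L_KL, step, w_KL=0.5, eta_min=0.01, R=0.9999, KL_min=0.2):
        eta_step = 1.0 - (1.0 - eta_min) * R**step      # ramps from eta_min toward 1 during training
        return L_R + w_KL * eta_step * max(L_KL, KL_min)

    mu, sigma_hat = np.zeros(128), np.zeros(128)
    print(kl_loss(mu, sigma_hat), annealed_loss(L_R=1.0, L_KL=0.3, step=10000))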

As shown in Figure 4, the [math]\displaystyle{ L_{R} }[/math] metric for the standalone decoder model is an upper bound for the models that use a latent vector, because the unconditional model does not have access to the sketch it is asked to generate.

Figure 4. Left: tradeoff between [math]\displaystyle{ L_{R} }[/math] and [math]\displaystyle{ L_{KL} }[/math] for two models trained on single-class datasets. Right: validation loss for models trained on the Yoga dataset using various [math]\displaystyle{ w_{KL} }[/math].

Experiments

The authors experiment with the sketch-rnn model using different settings and record both losses. They use a Long Short-Term Memory (LSTM) model as the encoder and a HyperLSTM as the decoder. HyperLSTM is a type of RNN cell that excels at sequence generation tasks; its ability to spontaneously augment its own weights enables it to adapt to the many different regimes in a large, diverse dataset. They also train on multi-class datasets. The results are as follows.

The table clearly shows the trade-off between [math]\displaystyle{ L_R }[/math] and [math]\displaystyle{ L_{KL} }[/math]: [math]\displaystyle{ L_R }[/math] decreases as [math]\displaystyle{ w_{KL} }[/math] is halved.

Conditional Reconstruction

The authors assess sketches reconstructed from a given input sketch at different [math]\displaystyle{ \tau }[/math] values. With higher [math]\displaystyle{ \tau }[/math] values (shown on the right), the reconstructed sketches are more random. The reconstructed sketches have properties similar to the input image, occasionally adding or removing minor details.

They also experiment with inputting a sketch from a different class; the output still keeps some features of the class the model was trained on.

Latent Space Interpolation

The authors visualize reconstructed sketches while interpolating between latent vectors, using models trained with different [math]\displaystyle{ w_{KL} }[/math] values. Since a Gaussian prior is enforced on the latent space, fewer gaps are expected between two encoded vectors, and a model trained with a higher [math]\displaystyle{ w_{KL} }[/math] is expected to produce images closer to the data manifold. To show this, the authors trained several models with various [math]\displaystyle{ w_{KL} }[/math] values and demonstrated experimentally that with high [math]\displaystyle{ w_{KL} }[/math] the generated images interpolate more coherently.
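The interpolation itself is straightforward; here is a minimal sketch of both linear interpolation and spherical linear interpolation (slerp, the scheme described in the cited Sampling Generative Networks work [27]) between two latent vectors:

    import numpy as np

    def lerp(z0, z1, t):
        return (1.0 - t) * z0 + t * z1

    def slerp(z0, z1, t):
        # Spherical interpolation; falls back to lerp for (nearly) parallel vectors.
        cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
        omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
        if np.isclose(np.sin(omega), 0.0):
            return lerp(z0, z1, t)
        return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

    z_a, z_b = np.random.default_rng(0).normal(size=(2, 128))        # stand-ins for two encoded sketches
    frames = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 10)]  # decode each frame to visualize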

Sketch Drawing Analogies

Since the latent vector [math]\displaystyle{ z }[/math] encodes conceptual features of a sketch, those features can also be used to augment other sketches that lack them. This is possible when models are trained with low [math]\displaystyle{ L_{KL} }[/math] values. Given the smoothness of the latent space, where any vector interpolated between two latent vectors results in a coherent sketch, we can perform vector arithmetic on latent vectors encoded from different sketches and explore how the model organizes the latent space to represent different concepts in the manifold of generated sketches. For instance, subtracting the latent vector of an encoded pig head from the latent vector of a full pig yields a vector that represents a body; adding this difference to the latent vector of a cat head results in a full cat (i.e. cat head + body = full cat), as sketched in the code below.
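Expressed as code, the analogy is plain vector arithmetic on latent vectors; the arrays below stand in for outputs of the trained encoder:

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-ins for latent vectors produced by the trained encoder.
    z_full_pig, z_pig_head, z_cat_head = rng.normal(size=(3, 128))

    z_body = z_full_pig - z_pig_head      # isolate the "body" concept in latent space
    z_full_cat = z_cat_head + z_body      # cat head + body = full cat
    # z_full_cat would then be fed to the decoder to produce the completed cat sketch.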

Predicting Different Endings of Incomplete Sketches

The model is also able to predict different endings for an incomplete sketch: the incomplete sketch is run through the decoder to obtain a hidden state [math]\displaystyle{ h }[/math], and [math]\displaystyle{ h }[/math] is then used as the initial hidden state from which the rest of the sketch is generated. The authors train decoder-only models on individual classes and set [math]\displaystyle{ τ = 0.8 }[/math] to complete the samples. Figure 7 shows the results, and a toy illustration of the procedure follows below.
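A toy illustration of the warm-up-then-continue procedure, with a randomly initialized cell standing in for the trained decoder (in real use, each next point would be sampled from the predicted mixture as shown earlier):

    import numpy as np

    rng = np.random.default_rng(0)
    H, D = 64, 5                                        # hidden size (assumed) and point dimension
    W_h = rng.normal(scale=0.1, size=(H, H))
    W_x = rng.normal(scale=0.1, size=(H, D))
    W_y = rng.normal(scale=0.1, size=(D, H))

    def step(x, h):
        # Toy stand-in for one step of the trained autoregressive decoder cell.
        h = np.tanh(W_h @ h + W_x @ x)
        return W_y @ h, h

    # 1) Warm up: run the strokes of the incomplete sketch through the decoder to obtain h.
    partial_sketch = rng.normal(size=(30, D))
    h = np.zeros(H)
    for x in partial_sketch:
        _, h = step(x, h)

    # 2) Continue: generate the remaining points autoregressively from that hidden state.
    x, completion = partial_sketch[-1], []
    for _ in range(50):
        y, h = step(x, h)
        x = y               # in the real model, x would be sampled from the predicted distribution
        completion.append(x)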

Limitations

Although sketch-rnn can model a large variety of sketch drawings, there are several limitations to the current approach. For most single-class datasets, sketch-rnn is capable of modeling sketches of up to around 300 data points; the model becomes increasingly difficult to train beyond this length. For the authors' dataset, the Ramer-Douglas-Peucker algorithm is used to simplify the strokes of the sketch data to fewer than 200 data points.

For more complicated classes of images, such as mermaids or lobsters, the reconstruction loss metrics are not as good compared to simpler classes such as ants, faces or firetrucks. The models trained on these more challenging image classes tend to draw smoother, more circular line segments that do not resemble individual sketches, but rather resemble an averaging of many sketches in the training set. This smoothness may be analogous to the blurriness effect produced by a Variational Autoencoder that is trained on pixel images. Depending on the use case of the model, smooth circular lines can be viewed as aesthetically pleasing and a desirable property.

While both the conditional and unconditional models can be trained on datasets containing several classes, sketch-rnn is ineffective at modeling a large number of classes simultaneously: the generated samples become incoherent, with features of different classes appearing in the same sketch.

Applications and Future Work

The authors believe this model can assist artists by suggesting how to finish a sketch, helping them find interesting intersections between different drawings or objects, or generating many similar but different designs. In the simplest use, pattern designers can apply sketch-rnn to generate a large number of similar but unique designs for textile or wallpaper prints. Creative designers can also explore abstract designs that resonate more with their target audience.

This model may also find its place in teaching students how to draw. Even with the simple sketches in QuickDraw, the authors of this work became much more proficient at drawing animals, insects, and various sea creatures after conducting these experiments. When the model is trained with a high [math]\displaystyle{ w_{KL} }[/math] and sampled with a low [math]\displaystyle{ \tau }[/math], it may help turn a poor sketch into a more aesthetically pleasing one. Augmenting the latent vector with user-rating data collected during training could also help produce better drawings.

The authors conclude by providing the following future directions to this work:

  1. Investigate using user-rating data to augment the latent vector in the direction that maximizes the aesthetics of the drawing.
  2. Look into combining variations of sequence-generation models with unsupervised, cross-domain pixel image generation models.

It would be exciting to combine this model with other unsupervised, cross-domain pixel image generation models to create photorealistic images from sketches.

The authors also mention the opposite direction, converting a photograph of an object into a simplified but similar-looking sketch composed of a minimal number of lines, as an even more interesting problem.

Moreover, it would be interesting to see how varying the loss function is reflected in the resulting drawings. Exotic forms of loss function may change the way the network behaves, which could lead to various applications.

Conclusion

The paper presents a methodology for modeling sketch drawings using recurrent neural networks. It introduces sketch-rnn, a model that can encode and decode sketches as well as generate and complete unfinished sketches. The authors also demonstrate how to interpolate between latent vectors from different classes and how to use them to augment sketches or generate similar-looking ones. Furthermore, they show the importance of enforcing a prior distribution on the latent vector for producing coherent interpolated sketches. Finally, they release a large dataset of sketch drawings for future research.

Critique

This paper presents both a novel large dataset of sketches and a new RNN architecture for generating sketches. Although it is exciting to read about, many improvements could be made.

  • The performance of the decoder model is hard to evaluate. The authors demonstrate it by showing generated sketches, which is clear and straightforward but not very efficient. It would be helpful to have a metric for how well sketches are generated rather than printing them out and judging them by eye. The authors also do not present a quantitative evaluation of the algorithm: they provide [math]\displaystyle{ L_R }[/math] and [math]\displaystyle{ L_{KL} }[/math] for reference, but a lower loss does not necessarily mean better performance, and training loss alone likely does not capture the quality of a sketch.
  • The authors do not report training details such as learning rate, training time, or parameter count.
  • The approach presented in the paper is innovative and makes clever use of a very large training database. The same framework could assist a wide range of professionals as a semi-automatic support system that augments human capabilities for tasks such as graphic design, report preparation, or even journalism.
  • The algorithm lacks comparison to the prior state of the art on standard metrics, which makes the novelty unclear. Using strokes as inputs is a novel and innovative move, but the paper does not provide a baseline or any comparison with other methods. Other research using similar, smaller datasets is mentioned in the paper; it would be better if the authors compared the new algorithm against some basic or existing methods as baselines.
  • Besides comparisons with other algorithms, it would also be valuable to remove or replace individual components of the model (an ablation study) to show whether each part is necessary and what motivated its inclusion.
  • The authors do not present a complexity analysis or a deeper mathematical analysis of the algorithm, nor a comparison with previous results using standard metrics, so the algorithmic contribution is limited. More formal analysis on the algorithmic side would strengthen the paper.
  • The authors propose a few future applications for the model, but the current output still seems far from those descriptions. Nevertheless, this is a very good beginning, and the release of the sketch dataset should attract more researchers to build on and improve it.
  • As the authors note, the model becomes increasingly difficult to train as the sketch length grows.

References

  1. Jimmy L. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. Layer normalization. NIPS, 2016.
  2. Christopher M. Bishop. Mixture density networks. Technical Report, 1994. URL http://publications.aston.ac.uk/373/.
  3. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
  4. H. Dong, P. Neekhara, C. Wu, and Y. Guo. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. ArXiv e-prints, January 2017.
  5. David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, October 1973. doi: 10.3138/fm57-6770-u75u-7727. URL http://dx.doi.org/10.3138/fm57-6770-u75u-7727.
  6. Mathias Eitz, James Hays, and Marc Alexa. How Do Humans Sketch Objects? ACM Trans. Graph.(Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
  7. I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv e-prints, December 2016.
  8. Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  9. David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.
  10. David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
  11. Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  12. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. ArXiv e-prints, November 2016.
  13. Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! - A.I. Experiment. 2016. URL https://quickdraw.withgoogle.com/.
  14. C. Kaae Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
  15. T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to Discover cross-domain Relations with Generative Adversarial Networks. ArXiv e-prints, March 2017.
  16. D. P Kingma and M. Welling. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
  17. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  18. Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
  19. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. ISSN 1095-9203. doi: 10.1126/science.aab3050. URL http://dx.doi.org/10.1126/science.aab3050.
  20. Yong Jae Lee, C. Lawrence Zitnick, and Michael F. Cohen. Shadowdraw: Real-time user guidance for freehand drawing. In ACM SIGGRAPH 2011 Papers, SIGGRAPH ’11, pp. 27:1–27:10, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0943-1. doi: 10.1145/1964921.1964922. URL http://doi.acm.org/10.1145/1964921.1964922.
  21. M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised Image-to-Image Translation Networks. ArXiv e-prints, March 2017.
  22. S. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez Colmenarejo, Z. Wang, D. Belov, and N. de Freitas. Parallel Multiscale Autoregressive Density Estimation. ArXiv e-prints, March 2017.
  23. Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016. ISSN 0730-0301. doi: 10.1145/2897824.2925954. URL http://doi.acm.org/10.1145/2897824.2925954.
  24. Mike Schuster, Kuldip K. Paliwal, and A. General. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
  25. Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Proceedings of the Fifteenth Eurographics Conference on Rendering Techniques, EGSR’04, pp. 23–32, Aire-la-Ville, Switzerland, Switzerland, 2004. Eurographics Association. ISBN 3-905673-12-6. doi: 10.2312/EGWR/EGSR04/023-032. URL http://dx.doi.org/10.2312/EGWR/EGSR04/023-032.
  26. Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by Paul the robot. Comput. Graph., 37(5):348–363, August 2013. ISSN 0097-8493. doi: 10.1016/j.cag.2013.01.012. URL http://dx.doi.org/10.1016/j.cag.2013.01.012.
  27. T. White. Sampling Generative Networks. ArXiv e-prints, September 2016.
  28. Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In ICML. icml.cc / Omnipress, 2012. URL http://dblp.uni-trier.de/db/conf/icml/icml2012.html#XieHS12.
  29. Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016. URL http://arxiv.org/abs/1606.06539.