Summary - A Neural Representation of Sketch Drawings

This paper discusses sketch-rnn, a sequence-to-sequence Variational Autoencoder for generating new vector images from a given set of hand-drawn vector images. The focus on vector images is based on the reasoning that they are a much better representation of how human beings create drawings than the traditional pixel approach. Vector images created by sketch-rnn can be conditioned on a given image, or generated unconditionally by sampling from a learned distribution. Creating new sketches in this way opens up many possible applications for neural networks like sketch-rnn: they could be used to teach children to draw, or to extend the capacity of an artist by generating many possible next steps for a given sketch.

Related Work

Most work in the area of image-generating neural networks has dealt with pixel-based images rather than vector-based images. There has been some work with a similar aim to this paper, attempting to generate handwriting using Recurrent Neural Networks (Graves, 2013), and other investigations into creating vectorized Kanji characters (Ha, 2015; Zhang et al., 2016). The combination of a sequence-to-sequence architecture with a Variational Autoencoder had previously been applied to modelling natural language (Bowman et al., 2015), but sketch-rnn takes a new step by applying the same combination to vector images.

Background: The Building Blocks of sketch-rnn

Variational Autoencoders (VAEs)

A variational autoencoder (VAE) resembles a classical autoencoder: it is a neural network consisting of an encoder, a decoder and a loss function. VAEs can be used for image generation and reinforcement learning, as they let us design generative models of data and fit them to large data sets.

Figure 1: Examples of generated images of faces

Put simply, the first part of a VAE is the encoder, which takes the input and converts it into a latent vector; as in a standard autoencoder, training reduces the mean squared error between the input and the output. To make the VAE a generative model, the generated latent vectors should also roughly follow a Gaussian distribution, as shown in Fig. 2. This allows the user to generate an output similar to the data the VAE was trained on by feeding a latent vector straight into the decoder.

Figure 2: Structure of a VAE

Given data in the form of [math] X [/math]’s which are encoded to [math] z [/math]’s, our goal is to maximize the expectation of generating a real data point [math] X [/math] given an encoding [math] z [/math] and parameters: $$ \int P(X|z, \theta)P(z)dz = E_{z \sim P(z)}[P(X|z, \theta)] $$

Here [math] \theta [/math] is the parameter being optimized, so that when [math] z [/math] is sampled from [math] P(z) [/math], the output [math] f(z;\theta) [/math], or equivalently [math] P(X|z, \theta) [/math], is with high probability close to one of the [math] X [/math]’s in our dataset. However, a given [math] z [/math] from [math] P(z) [/math] is unlikely to produce a reasonable [math] X [/math]. We would rather sample from the posterior distribution [math] P(z|X) [/math], because it conditions [math] z [/math] on real data, making a latent vector sampled from it more likely to produce realistic results. Unfortunately, we can’t sample from the posterior. To fix this, we sample instead from a distribution [math] Q(z|X) \sim N(\mu, \Sigma) [/math] that approximates [math] P(z|X) [/math] and is easy to sample from.

Now there are two requirements to take into account: [math] Q(z|X) [/math] must approximate [math] P(z|X) [/math] reasonably well, and [math] P(X|\theta) [/math] must be maximized. Then our objective has the form: $$ \log P(X|\theta) - D_{KL}[Q(z|X,\theta)||P(z|X,\theta)] = E_{z\sim Q(z|X,\theta)}[\log P(X|z,\theta)] - D_{KL}[Q(z|X,\theta)||P(z|\theta)] $$ The right-hand side can be computed, and maximizing it raises [math] \log P(X|\theta) [/math] while pushing [math] Q(z|X) [/math] toward the posterior.

It is important to note that using [math] Q(z|X)\sim N(\mu, \Sigma) [/math] allows the encoder to output the parameters necessary to create [math] z [/math] with the desired distribution by defining [math] z [/math] as [math] z = \mu + \Sigma^{1/2}\epsilon [/math], where [math] \epsilon [/math] is sampled from [math] N(0, 1) [/math]. This also allows the VAE to be trained using backpropagation, since the random part of [math] z [/math] (i.e. [math] \epsilon [/math]) is sampled separately from the flow of gradients in the encoder/decoder.
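The sampling step [math] z = \mu + \Sigma^{1/2}\epsilon [/math] can be illustrated with a minimal NumPy sketch. It assumes the Cholesky factor of [math] \Sigma [/math] is used as the matrix square root (one valid choice among several), and the values of `mu` and `cov` are made up purely for illustration.

```python
import numpy as np

def sample_latent(mu, cov, rng):
    """Draw z ~ N(mu, cov) as z = mu + A @ eps, where A @ A.T == cov."""
    # Assumption: the Cholesky factor plays the role of Sigma^(1/2).
    A = np.linalg.cholesky(cov)
    # eps carries all the randomness and does not depend on mu or cov,
    # so gradients could flow through mu and A during training.
    eps = rng.standard_normal(mu.shape[0])
    return mu + A @ eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])            # illustrative mean
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])          # illustrative covariance
samples = np.stack([sample_latent(mu, cov, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

With enough samples, `sample_mean` and `sample_cov` should come out close to `mu` and `cov`, which is a quick way to check that the transformation produces the intended distribution.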

Recurrent Neural Networks (RNNs)

RNNs are closely related to Feed Forward Neural Networks, but have a few properties that make them suited to modeling sequential data. Consider a sequence of vectors [math] x_1, x_2, x_3...x_n [/math] representing the individual strokes of a drawing similar to the data used by sketch-rnn. It is important to take into account previous strokes if we are to learn patterns in drawings. An RNN satisfies this requirement by retaining a “memory” of the hidden states of previous strokes and uses it to inform output. RNNs are highly effective in modelling many types of sequential data such as natural language and drawings.

Figure 3: An RNN and its unfolded counterpart

Figure 3 shows the basic structure of an RNN on the left. On the right is its “unrolled” equivalent, very similar to an FFNN, but with an extra input at the hidden layer. For example, the output of the hidden layer after the input of [math] x_t [/math] is [math] s_t=f(Ux_t+Ws_{t-1}) [/math]. Here, [math] f [/math] is the activation function applied to every element of the vector [math] Ux_t+Ws_{t-1} [/math], [math] x_t [/math] is the input, [math] U [/math] and [math] W [/math] are the matrices of weights and [math] s_{t-1} [/math] is the hidden state after the previous input [math] x_{t-1} [/math].

RNNs can be trained using backpropagation, and since the weights in [math] U [/math], [math] W [/math] and [math] V [/math] (the weights mapping the hidden state to the output) are shared by every unrolled layer, there are far fewer parameters to train than in an FFNN with the same number of layers.
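The recurrence [math] s_t=f(Ux_t+Ws_{t-1}) [/math] can be sketched in a few lines of NumPy. This is only an illustration: the choice of `tanh` for [math] f [/math], the zero initial state and the layer sizes are assumptions, not details taken from the paper.

```python
import numpy as np

def rnn_forward(xs, U, W, s0, f=np.tanh):
    """Return the hidden states s_1..s_n for the inputs x_1..x_n."""
    states = []
    s_prev = s0
    for x_t in xs:
        # The same U and W are reused at every time step, which is why
        # the unrolled network has so few parameters.
        s_t = f(U @ x_t + W @ s_prev)
        states.append(s_t)
        s_prev = s_t
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, n_steps = 5, 8, 4   # illustrative sizes
U = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
xs = rng.standard_normal((n_steps, input_dim))
states = rnn_forward(xs, U, W, s0=np.zeros(hidden_dim))
```

Each row of `states` depends on all earlier inputs through `s_prev`, which is the “memory” described above.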

QuickDraw Data Set

The sketch-rnn model was trained and tested using the QuickDraw data set. This data set consists of drawings obtained from the Quick, Draw! game, where players are asked to draw a picture of a given object. Drawn objects are stored in a format that captures the pen stroke actions. Each action is given as a vector of 5 items, [math](\Delta x, \Delta y, p_1, p_2, p_3)[/math]. [math](\Delta x,\Delta y)[/math] gives the offset of the pen from the previous point. [math](p_1, p_2, p_3)[/math] is a binary one-hot vector of the possible pen states. Respectively, its entries indicate that the pen is touching the paper, that the pen is about to be lifted from the paper so that no line will be drawn next, and that the drawing is finished.
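To make the 5-item format concrete, here is a minimal sketch that converts strokes given as absolute points into rows of offsets and pen states. The function name `to_stroke5`, the input layout (a list of strokes, each a list of absolute points) and the assumption that the pen starts at the origin are all hypothetical and chosen only for illustration.

```python
import numpy as np

def to_stroke5(strokes):
    """Convert strokes of absolute (x, y) points into rows
    (dx, dy, p1, p2, p3)."""
    rows = []
    prev = np.zeros(2)  # assumption: the pen starts at the origin
    for stroke in strokes:
        pts = np.asarray(stroke, dtype=float)
        for k, pt in enumerate(pts):
            dx, dy = pt - prev
            if k == len(pts) - 1:
                rows.append([dx, dy, 0, 1, 0])  # p2: pen lifts after this point
            else:
                rows.append([dx, dy, 1, 0, 0])  # p1: pen stays on the paper
            prev = pt
    rows.append([0, 0, 0, 0, 1])                # p3: drawing is finished
    return np.array(rows, dtype=float)

two_strokes = [
    [(0, 0), (2, 0), (2, 2)],   # first stroke: an "L" shape
    [(5, 5), (6, 5)],           # second stroke: a short line
]
seq = to_stroke5(two_strokes)
```

Taking the cumulative sum of the first two columns of `seq` recovers the absolute pen positions, and every row has exactly one pen state set.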

75 classes were used for training and testing sketch-rnn. The data was prepared by simplifying the strokes using the Ramer-Douglas-Peucker algorithm in order to reduce the complexity of the images. The data was then standardized by scaling the stroke offsets so that the offsets in the training set had standard deviation 1. Means were not normalized.
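The standardization step can be sketched as follows. The sketch assumes a single scale factor computed by pooling all [math]\Delta x[/math] and [math]\Delta y[/math] offsets of the training set; the function names and the random toy data are hypothetical.

```python
import numpy as np

def offset_scale(sketches):
    """Standard deviation of all dx and dy offsets pooled over the set."""
    offsets = np.concatenate([s[:, :2].ravel() for s in sketches])
    return offsets.std()

def normalize(sketch, scale):
    """Divide the offsets by the scale; pen states are left untouched
    and the mean is not subtracted."""
    out = sketch.astype(float).copy()
    out[:, :2] /= scale
    return out

# Toy training set: three sketches of different lengths (made-up data).
rng = np.random.default_rng(0)
train = []
for n in (10, 15, 7):
    s = np.zeros((n, 5))
    s[:, :2] = rng.normal(loc=1.0, scale=20.0, size=(n, 2))
    s[:, 2] = 1
    train.append(s)

scale = offset_scale(train)
train_norm = [normalize(s, scale) for s in train]
```

After this step the pooled offsets of `train_norm` have standard deviation 1, while their mean is generally still non-zero.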

Methodology and Experiment

Figure 4: Schematic diagram of sketch-rnn architecture


The purpose of the encoder is to ingest a sketch [math]x[/math] and produce vectors [math]\mu[/math] and [math]\sigma[/math], whose [math]N_z[/math] entries parameterize the [math]N_z[/math] univariate Gaussian distributions from which the [math]N_z[/math] entries of the latent vector [math]z[/math] are generated. The latent vector [math]z[/math] is the [math]N_z[/math]-dimensional embedding of the sketch [math]x[/math]. It plays the same role as the latent vector of a standard variational autoencoder: it resembles the code of a standard autoencoder, except that it is not a deterministic function of [math]x[/math].

That is, [math]\mu[/math] and [math]\sigma[/math] parameterize [math]Q(z|x)[/math] as described in the background. In this model, [math]Q(z|x)[/math] is a multivariate Gaussian with mean [math]\mu[/math] and a diagonal covariance matrix: all covariances are 0 and the variances [math]\sigma^2[/math] lie along the diagonal.

Intuitively, the mean vector [math]\mu[/math] estimates the location of the sketches in latent space. If one were to consider all latent vectors corresponding to a single class in the training data, ideally they would be close together.

The encoder takes the form of a bidirectional RNN consisting of two standard RNNs which are run along a sketch sequence in temporal and reverse temporal order, respectively, whose output vectors are concatenated to form a vector [math]h[/math]. The vector [math]h[/math] is then projected to [math]\mu[/math] and [math]\sigma[/math] via the following transformations:

[math]\mu = W_\mu h + b_\mu[/math]

[math]\sigma = \exp\left(\frac{W_\sigma h + b_\sigma}{2}\right)[/math]

Where the exponential operation is required to make [math]\sigma[/math] positive and hence a valid standard deviation. The latent vector is then generated as [math]z = \mu + \sigma \odot \mathcal{N}(0, 1)[/math], where [math]\odot[/math] is the entrywise product of vectors and the latter term is an [math]N_z[/math]-length vector each of whose entries is generated from [math]N(0, 1)[/math]. This is important because when the encoder is trained along with the decoder by end-to-end backpropagation, the sampled noise can be treated as a constant. Then [math]z[/math] is an affine function of [math]\mu[/math] and [math]\sigma[/math], so the derivative of the loss can be taken with respect to both of these parameters.
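The projection from [math]h[/math] to [math]\mu[/math], [math]\sigma[/math] and then [math]z[/math] can be sketched as below. The bidirectional RNN itself is not shown: `h` is a random stand-in for its concatenated output, and the sizes, weight values and the name `encode_to_latent` are hypothetical.

```python
import numpy as np

def encode_to_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng):
    """Project h to (mu, sigma) and draw z = mu + sigma * eps."""
    mu = W_mu @ h + b_mu
    sigma_hat = W_sigma @ h + b_sigma
    sigma = np.exp(sigma_hat / 2.0)        # always positive
    eps = rng.standard_normal(mu.shape)    # the only source of randomness
    z = mu + sigma * eps                   # entrywise product
    return z, mu, sigma

h_dim, n_z = 16, 4                          # illustrative sizes
weights_rng = np.random.default_rng(0)
noise_rng = np.random.default_rng(1)
h = weights_rng.standard_normal(h_dim)      # stand-in for the RNN output
W_mu = weights_rng.standard_normal((n_z, h_dim)) * 0.1
W_sigma = weights_rng.standard_normal((n_z, h_dim)) * 0.1
b_mu = np.zeros(n_z)
b_sigma = np.zeros(n_z)

z, mu, sigma = encode_to_latent(h, W_mu, b_mu, W_sigma, b_sigma, noise_rng)
```

Because `eps` is drawn independently of the weights, `z` is a deterministic, differentiable function of `mu` and `sigma` once `eps` is fixed.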


The purpose of the decoder is to parameterize the distribution [math]P(x|z)[/math]. It takes a latent vector [math]z[/math] and generates a sketch. It is modelled by another RNN, which at each timestep [math]i[/math] takes as input the point output by the previous step, [math]S_{i-1}[/math] (where [math]S_0 = (0, 0, 1, 0, 0)[/math]), and the latent vector [math]z[/math], in addition to passing the hidden state [math]h_i[/math] between steps. The output of node [math]i[/math] consists of the parameters of a Gaussian mixture model for the offset [math](\Delta x, \Delta y)[/math] of point [math]i[/math] from point [math]i-1[/math] of the sketch, together with the pen-state probabilities [math](q_1, q_2, q_3)[/math]. The Gaussian mixture model is a weighted sum of [math]M[/math] bivariate normal distributions, where [math]M[/math] is a hyperparameter, and is written as

[math]p(\Delta x, \Delta y) = \sum_{j=1}^M \Pi_j \mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x,j}, \sigma_{y, j}, \rho_{xy, j})[/math]

Where [math]\sum_{j=1}^M \Pi_j=1 [/math]. Thus a given node of the RNN must return [math]6M+3[/math] parameters in total, where the first [math]6M[/math] parameterize the mixture model and the remaining 3 are [math](\hat{q_1}, \hat{q_2}, \hat{q_3})[/math]. This means each RNN cell must output

[math]y_i = [(\hat{\Pi},\mu_x,\mu_y, \hat{\sigma_x},\hat{\sigma_y},\hat{\rho})_1, ..., (\hat{\Pi},\mu_x,\mu_y, \hat{\sigma_x},\hat{\sigma_y},\hat{\rho})_M, (\hat{q_1}, \hat{q_2}, \hat{q_3})] [/math]

which is computed by

[math]y_i = W_y h_i + b_y[/math]

This produces a [math]6M+3[/math]-length vector [math]y_i[/math]. In order to make the parameters valid, the following conditions must be met:

[math]\sum_{j=1}^M \Pi_j = 1[/math]

[math]q_1 + q_2 + q_3 = 1 [/math]

[math]\sigma_{x, j}, \sigma_{y, j}\gt 0 [/math]

[math]-1 \lt \rho_{xy, j} \lt 1 [/math].

To meet these, we set

[math]\Pi_j = \frac{e^{\hat{\Pi}_j}}{\sum_{k=1}^M e^{\hat{\Pi}_k}}[/math]

[math]q_j = \frac{e^{\hat{q}_j}}{\sum_{k=1}^3 e^{\hat{q}_k}}[/math]

[math]\sigma_{x, j} = e^{\hat{\sigma}_{x, j}}[/math]

[math]\sigma_{y, j} = e^{\hat{\sigma}_{y, j}}[/math]

[math]\rho_{xy, j} = \tanh(\hat{\rho}_{xy, j})[/math]
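The mapping from the raw output vector [math]y_i[/math] to valid parameters can be sketched as follows: a softmax for the mixture weights and pen states, an exponential for the standard deviations and a tanh for the correlations. The function names are hypothetical, and the assumed layout of [math]y_i[/math] is the one written above, with [math]M[/math] groups of six values followed by the three pen logits.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # subtract the max for numerical stability
    return e / e.sum()

def split_decoder_output(y, M):
    """Map a raw vector of length 6M+3 to valid mixture and pen parameters."""
    assert y.shape == (6 * M + 3,)
    # One row per mixture component:
    # (Pi_hat, mu_x, mu_y, sigma_x_hat, sigma_y_hat, rho_hat)
    mix = y[:6 * M].reshape(M, 6)
    q_hat = y[6 * M:]
    return {
        "pi": softmax(mix[:, 0]),        # weights sum to 1
        "mu_x": mix[:, 1],               # means are unconstrained
        "mu_y": mix[:, 2],
        "sigma_x": np.exp(mix[:, 3]),    # strictly positive
        "sigma_y": np.exp(mix[:, 4]),
        "rho": np.tanh(mix[:, 5]),       # strictly between -1 and 1
        "q": softmax(q_hat),             # pen-state probabilities sum to 1
    }

M = 3                                        # illustrative number of components
rng = np.random.default_rng(0)
y = rng.standard_normal(6 * M + 3)           # stand-in for W_y h_i + b_y
params = split_decoder_output(y, M)
```

Whatever real values the RNN produces, the returned parameters satisfy all four validity conditions listed above.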

We then draw [math]S_i[/math] from the resulting distribution, and the sketch [math]x[/math] is assembled from all such points [math]S_i[/math].
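Drawing one point [math]S_i[/math] from the mixture could look like the sketch below: pick a component according to the mixture weights, sample the offset from that component's bivariate normal, and sample the pen state from the three probabilities. The parameter values are made up for illustration and the function name `sample_point` is hypothetical.

```python
import numpy as np

def sample_point(pi, mu_x, mu_y, sigma_x, sigma_y, rho, q, rng):
    """Draw one point (dx, dy, p1, p2, p3) from the mixture and pen states."""
    j = rng.choice(len(pi), p=pi)                 # choose a mixture component
    mean = np.array([mu_x[j], mu_y[j]])
    cov_xy = rho[j] * sigma_x[j] * sigma_y[j]
    cov = np.array([[sigma_x[j] ** 2, cov_xy],
                    [cov_xy, sigma_y[j] ** 2]])
    dx, dy = rng.multivariate_normal(mean, cov)   # offset from the previous point
    pen = np.zeros(3)
    pen[rng.choice(3, p=q)] = 1.0                 # one-hot pen state
    return np.concatenate([[dx, dy], pen])

rng = np.random.default_rng(0)
point = sample_point(
    pi=np.array([0.2, 0.5, 0.3]),
    mu_x=np.array([0.0, 1.0, -1.0]),
    mu_y=np.array([0.0, 0.5, 2.0]),
    sigma_x=np.array([1.0, 0.5, 0.2]),
    sigma_y=np.array([1.0, 0.3, 0.4]),
    rho=np.array([0.0, 0.7, -0.4]),
    q=np.array([0.8, 0.15, 0.05]),
    rng=rng,
)
```

Repeating this step, feeding each sampled point back into the decoder as the next input until the third pen state is drawn, would yield a complete sketch.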

Unconditional Generation

As an interesting aside, the decoder can be trained without the encoder to generate sketches unconditionally, with no latent vector [math]z[/math]. The decoder is simply adjusted so that no latent vector is concatenated to the input at each time step, and is then trained using only the reconstruction loss (a term that becomes somewhat of a misnomer here, as there is no input to be reconstructed in this process).


The training procedure for sketch-rnn uses a combination of two loss functions, the reconstruction loss ([math]L_R[/math]) and the Kullback-Leibler divergence loss ([math]L_{KL}[/math]), given by

[math] L_{R} = - \frac{1}{N_{max}} \sum_{i=1}^{N_s} \log(p(\Delta x_i, \Delta y_i)) - \frac{1}{N_{max}} \sum_{i=1}^{N_{max}} \sum_{k=1}^3 p_{k,i} \log(q_{k,i}) [/math]

[math] L_{KL} = - \frac{1}{2N_z} \sum_{j=1}^{N_z} (1 + \hat{\sigma}_j - \mu_j^2 - e^{\hat{\sigma}_j}) [/math]

Here [math]N_s[/math] is the number of points in the sketch, [math]N_{max}[/math] is the length of the longest sketch in the training set, and [math]\hat{\sigma} = W_\sigma h + b_\sigma[/math] is the encoder output before the exponential, so that [math]\sigma = \exp(\hat{\sigma}/2)[/math].

The former maximizes the log-likelihood that the generated probability distribution assigns to the training data, and the latter penalizes the divergence between the distribution of the latent vector and an independent and identically distributed standard normal vector. The objective function is a weighted combination of these two loss functions,

[math]Loss = L_R+\omega_{KL} L_{KL}[/math].

[math]\omega_{KL}[/math] is a hyper-parameter that controls the behavior of the model. For [math]\omega_{KL} \rightarrow 0[/math], the model approaches a pure autoencoder: the reconstruction loss decreases, but the model sacrifices the ability to enforce a prior over the latent space. For training, the loss function was further modified by gradually growing the weight on the Kullback-Leibler part of the loss function.

[math]Loss=L_R+\omega_{KL}(1-(1-\eta_{min})R^{step}) max\{L_{KL},KL_{min}\}, \qquad \eta_{min},R \lt 1[/math]

The contribution of the Kullback-Leibler term to the loss function therefore starts small, at a fraction [math]\eta_{min}[/math] of its full weight, and approaches [math]\omega_{KL} L_{KL}[/math] as the step count grows (slowly if [math]R[/math] is close to 1). This was added to allow the optimizer to first focus on reconstruction, which is optimized by the [math]L_R[/math] loss, and later optimize for the [math]L_{KL}[/math] term. Additionally, a minimum value of the [math]L_{KL}[/math] term was enforced by taking [math]max\{L_{KL},KL_{min}\}[/math] for some floor [math]KL_{min}[/math], as it was found that decreasing the [math]L_{KL}[/math] loss beyond a certain value did not lead to further improvements in the decoder. The floor [math]KL_{min}[/math] therefore encourages the optimizer to focus on minimizing the [math]L_R[/math] term and improving reconstruction once the [math]L_{KL}[/math] term is low enough.
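The KL term, the growing weight and the floor can be put together in a short sketch. It takes [math]\hat{\sigma}[/math] to be the encoder output before the exponential, and all numerical values (`w_kl`, `eta_min`, `R`, `kl_min`, the reconstruction loss of 1.0) are illustrative choices rather than settings reported in this summary.

```python
import numpy as np

def kl_loss(mu, sigma_hat):
    """L_KL = -1/(2 N_z) * sum(1 + sigma_hat - mu^2 - exp(sigma_hat))."""
    return -0.5 * np.mean(1.0 + sigma_hat - mu ** 2 - np.exp(sigma_hat))

def kl_weight(step, w_kl, eta_min, R):
    """Weight on the KL term: starts at w_kl * eta_min, grows toward w_kl."""
    return w_kl * (1.0 - (1.0 - eta_min) * R ** step)

def training_loss(l_r, l_kl, step, w_kl, eta_min, R, kl_min):
    """L_R plus the annealed KL term, with the KL term floored at kl_min."""
    return l_r + kl_weight(step, w_kl, eta_min, R) * max(l_kl, kl_min)

mu = np.array([0.5, -0.3])           # illustrative encoder outputs
sigma_hat = np.array([0.1, -0.2])
l_kl = kl_loss(mu, sigma_hat)

settings = dict(w_kl=0.5, eta_min=0.01, R=0.99, kl_min=0.2)  # illustrative
early = training_loss(l_r=1.0, l_kl=l_kl, step=0, **settings)
late = training_loss(l_r=1.0, l_kl=l_kl, step=5000, **settings)
```

In this toy example `l_kl` is below the floor, so both losses use `kl_min`, and `late` exceeds `early` only because the KL weight has grown from `w_kl * eta_min` to nearly `w_kl`.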


To analyse sketch-rnn, we perform experiments for both conditional and unconditional image generation. We train several models and then analyse the uses of these trained models. We use Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) as the encoder RNN and we use HyperLSTM for the decoder RNN.

Using the QuickDraw dataset, we train various models individually on the classes cat, pig, face, firetruck, garden, owl, mosquito and yoga, and then train 2 models on multiple classes: (cat, pig) and (crab, face, pig, rabbit). In training, various values of [math] w_{KL} [/math] were used and the breakdown of the losses [math]L_{R}[/math] and [math]L_{KL}[/math] was recorded. The results of the experiments are shown below:

Figure 5: Loss Figures ([math]L_{R}[/math] and [math]L_{KL}[/math]) for various [math]w_{KL}[/math] settings.

From the table, we see that as we relax (i.e. decrease) [math]w_{KL}[/math] (the weight for the KL loss term), [math]L_R[/math] (the reconstruction loss) decreases and [math]L_{KL}[/math] (the KL loss) increases. This matches the trade-off described in the discussion of the training loss above.

After training sketch-rnn models, we can use them in 4 ways:

  • Conditional Reconstruction
  • Latent Space Interpolation
  • Sketch Drawing Analogies
  • Predicting Different Endings of Incomplete Sketches

Conditional Reconstruction

Conditional Reconstruction refers to using a sketch-rnn model to reconstruct an image based on a given input. For example, using a model trained only on the cat class, we can reconstruct a cat sketch from a given input. This is shown in the figure below, with the blue-to-pink color scheme representing levels of temperature [math]\tau[/math] from 0.01 to 1.

Figure 6: Conditional generation of cats (left) and pigs (right).

The first thing we notice is that for lower levels of [math]\tau[/math], the model produces results that closely mirror the input. We also look at the reconstruction of inputs that do not have the typical features of a cat (i.e. noisy input). When the input is a cat with 3 eyes, the reconstructed cat has only 2 eyes, and when the input is a toothbrush, the model keeps the shape of the toothbrush but gives it cat features. The above figure also shows similar results for a pig-only model.
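This summary does not spell out how the temperature [math]\tau[/math] enters the sampling step. One way to realize it, which to our understanding follows the original paper but should be treated here as an assumption, is to divide the mixture and pen-state logits by [math]\tau[/math] before the softmax and to multiply the variances by [math]\tau[/math]. A minimal sketch with made-up values:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

def apply_temperature(pi_hat, q_hat, sigma_x, sigma_y, tau):
    """Sharpen (tau < 1) the categorical distributions and shrink the
    Gaussian spread. Variances scale by tau, so standard deviations
    scale by sqrt(tau)."""
    return (softmax(pi_hat / tau),
            softmax(q_hat / tau),
            sigma_x * np.sqrt(tau),
            sigma_y * np.sqrt(tau))

pi_hat = np.array([1.0, 2.0, 0.5])    # illustrative mixture logits
q_hat = np.array([2.0, 0.0, -1.0])    # illustrative pen-state logits
sigma_x = np.array([1.0, 0.5, 0.2])
sigma_y = np.array([1.0, 0.3, 0.4])

pi_cold, q_cold, sx_cold, sy_cold = apply_temperature(
    pi_hat, q_hat, sigma_x, sigma_y, tau=0.01)
pi_warm, q_warm, sx_warm, sy_warm = apply_temperature(
    pi_hat, q_hat, sigma_x, sigma_y, tau=1.0)
```

At [math]\tau = 0.01[/math] almost all probability mass falls on the most likely component and pen state and the spread is small, which is consistent with outputs that closely mirror the input; at [math]\tau = 1[/math] the distributions are unchanged.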

Latent Space Interpolation

In this application, we use a model trained on 2 classes (in this case, cat and pig) and view how one class morphs into the other class. The blue-to-pink color scheme represents stages as the model morphs the input. In the image below, we view the reconstructions of interpolations from cat to pig, using models trained with various levels of [math]w_{KL}[/math].

Figure 7: Latent space interpolation between cat and pig using various [math]w_{KL}[/math] settings.

Notice that with [math]w_{KL}=0.25[/math] the interpolation is poor and the output stays close to a replica of the original input, whereas with [math]w_{KL}=1[/math] the interpolation is coherent and we have a distinct image of a pig at the end.

Recall that as [math]w_{KL}[/math] decreases, [math]L_R[/math] decreases and [math]L_{KL}[/math] increases. With a lower [math]L_{KL}[/math] we see that the model will generate coherent images regardless of how noisy the input is. This is because, with lower [math]L_{KL}[/math], the encoded latent vectors contain conceptual features of the input sketch, while with higher [math]L_{KL}[/math], the encoded latent vectors only contain information on the specific line segments. This suggests that when training a model, we must always investigate the trade-off between the two loss terms, as this will affect the quality of the reproduced sketch.
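The interpolation itself happens between two latent vectors, one encoded from a cat and one from a pig. The sketch below assumes spherical interpolation, with plain linear interpolation as a fallback when the two vectors point in the same or opposite directions. The encoder and decoder are not shown, and the two latent vectors are toy values.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between z0 (t = 0) and z1 (t = 1)."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))   # angle between vectors
    if np.isclose(np.sin(omega), 0.0):
        # Parallel or antiparallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0
            + np.sin(t * omega) * z1) / np.sin(omega)

z_cat = np.array([1.0, 0.0])   # toy stand-in for an encoded cat
z_pig = np.array([0.0, 1.0])   # toy stand-in for an encoded pig
path = [slerp(z_cat, z_pig, t) for t in np.linspace(0.0, 1.0, 5)]
```

Each vector in `path` would then be passed to the decoder to produce one stage of the morph from cat to pig.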

Sketch Drawing Analogies

In this application, using an input sketch, we augment features of a sketch. This means that we take an input sketch of a class and then add features from another class to that sketch (i.e. we investigate the features in the model’s latent space).

As mentioned above, the latent vectors contain information on the features of the input sketch. The latent vectors of models with low [math]L_{KL}[/math] contain conceptual features of a sketch, and hence we can use the latent vectors of these models to augment sketches. This can be seen below.

Figure 8: Sketch Drawing Analogies

In the above image, in the first row, we first subtract the latent vector of an encoded pig head from the latent vector of a full pig to produce a vector that represents a “body”. We then add this “body” to a cat head to produce a full cat. In the second row, we do the opposite: we subtract the latent vector of an encoded full cat from the latent vector of a cat head to represent the action “subtract body”. We then add this to a full pig sketch to produce a sketch of a pig head.

Predicting Different Endings of Incomplete Sketches

We can give a decoder-only model trained on a single class an incomplete sketch as an input (i.e. a series of lines) and then predict various possible endings for that given input. This is shown below with [math]\tau=0.8[/math].

Figure 9: sketch-rnn predicting possible endings of various incomplete sketches (the red lines)

Using this type of model, we first encode the incomplete sketch into a hidden state [math]h[/math] and then generate the remaining points of the sketch conditioned on [math]h[/math].

Application and Future Work

The sketch-rnn model can assist artists through the creative process, whether by drawing many unique images for a design or by creating uncommon abstract art. With a high [math]w_{KL}[/math] and a low temperature, the model will transform a poorly sketched drawing into a more coherent version of itself.

The work could be extended by combining the sketch-rnn model with unsupervised, cross-domain pixel generation models. The idea would be to take a photograph and decompose it into a hand-drawn image. The challenge would be to limit the number of lines in the sketch while still producing a coherent image.


Ha and Eck have done a great job of explaining the details of the model within the paper. The paper is extremely approachable even to someone with minimal background. Their decision to leave some of the details to supplementary material also allows the reader to grasp the general outline without being overwhelmed. However, the authors’ decision to include only successful results in the main paper, and to leave the limitations of the model to the supplementary material, harms the overall credibility of the paper. There is also no mention in the main paper of any inaccurate or bad drawings generated by the model, which seems odd given that this is a new methodology. This gives the impression that only the models selected for the paper perform well.

The paper also makes good use of VAE models with RNN encoders and decoders. It makes a wise choice in using a bidirectional RNN to encode the data, which allows the model to consider the whole sketch when processing a given point. The use of an RNN as the decoder also enables the model to mimic how a human might proceed when sketching a given topic. The 4 experiments done on the model were very informative about its adaptability to different applications. Each one highlighted the potential the algorithm has in a different area of image reconstruction and helps the reader obtain better insight into the model.

However, the model has problems scaling to larger images. For the current applications of sketch-rnn, the images drawn only require 200-300 timesteps. As mentioned in the paper’s supplementary material, the model has difficulty training on more than 300 data points. This could imply that the model is difficult to scale to larger problems or drawings. If one were trying to produce a much more detailed drawing, this model would have difficulty dealing with the extra data points.

Another limitation is the scalability of the model to multiple classes. As mentioned in the supplementary material, the model tends to start combining features from different classes when asked to construct a sketch if it has been trained on 4 or more concatenated classes. This is a limitation, since multiple models are then required in order to generate sketches for multiple classes. Although GPU manufacturers such as Nvidia have made an effort to optimize the training algorithms, RNNs remain computationally expensive to train. These limitations will hinder the practicality of the model. This is particularly concerning when using the model on mobile devices, which would be a very useful application for sketch-rnn. The authors of the paper don't mention a method of scaling this to mobile devices.


Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.

David Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, 2015.

Ha, D., & Eck, D. A Neural Representation of Sketch Drawings. (2018)

Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 1997.

Doersch, Carl. "Tutorial on variational autoencoders." arXiv preprint arXiv:1606.05908 (2016).

Kurita, Keita. “An Intuitive Explanation of Variational Autoencoders (VAEs Part 1).” Machine Learning Explained, 2 Mar. 2018.

Kurita, Keita. “An Introduction to the Math of Variational Autoencoders (VAEs Part 2).” Machine Learning Explained, 2 Mar. 2018.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. CoRR, abs/1511.06349, 2015.

Xu-Yao Zhang, Fei Yin, Yan-Ming Zhang, Cheng-Lin Liu, and Yoshua Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016.

Denny Britz. Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs.

Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. "An empirical exploration of recurrent network architectures." International Conference on Machine Learning. 2015.

Jeremy Appleyard. “Optimizing Recurrent Neural Networks in cuDNN 5”, Apr. 6 2016.