A Neural Representation of Sketch Drawings

Revision as of 18:04, 5 March 2018 by Tadenoud (talk | contribs) (Add experiments summary)


There have been many recent advances in neural generative models for low-resolution pixel-based images. Humans, however, do not see the world as a grid of pixels; we more typically communicate drawings of the things we see using a series of pen strokes that represent components of objects. These pen strokes are similar to the way vector-based images store data. This paper proposes a new method, based on recurrent neural networks (RNNs), for building conditional and unconditional generative models of these kinds of vector sketch drawings. The paper explores many applications of these models, especially creative applications, and makes the authors' dataset of vector images publicly available.

Related Work

Previous work related to sketch drawing generation includes methods that focussed primarily on converting input photographs into equivalent vector line drawings. Image-generating models using neural networks also exist, but these focussed more on generating pixel-based imagery. Some recent work has focussed on handwritten character generation using RNNs and Mixture Density Networks to generate continuous data points. This work has more recently been extended to conditionally and unconditionally generate handwritten vectorized Chinese Kanji characters by modelling them as a series of pen strokes. Furthermore, this paper builds on work that employed Sequence-to-Sequence models with Variational Autoencoders to model English sentences in a latent vector space.

One of the limiting factors for creating models that operate on vector datasets has been the dearth of publicly available data. Previously available datasets include: Sketch, a set of 20K vector drawings; Sketchy, a set of 70K vector drawings; and ShadowDraw, a set of 30K raster images with extracted vector drawings.



The “QuickDraw” dataset used in this research was assembled from 75K user drawings per class, extracted from the game “Quick, Draw!”, where users drew objects from one of hundreds of classes in 20 seconds or less. Each class is split into 70K training samples, 2.5K validation samples, and 2.5K test samples, and each sketch is represented as a sequence of “pen stroke actions”. Each action is provided as a vector in the form [math](\Delta x, \Delta y, p_{1}, p_{2}, p_{3})[/math]. For each vector, [math]\Delta x[/math] and [math]\Delta y[/math] give the movement of the pen from the previous point, with the initial location being the origin. The last three vector elements are a one-hot representation of the pen state: [math]p_{1}[/math] indicates that the pen is down and a line should be drawn between the current point and the next point, [math]p_{2}[/math] indicates that the pen is up and no line should be drawn between the current point and the next point, and [math]p_{3}[/math] indicates that the drawing is finished and that neither the current point nor any subsequent point should be drawn.
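As an illustration of this representation, the following minimal sketch converts a drawing given as lists of absolute points into rows of [math](\Delta x, \Delta y, p_{1}, p_{2}, p_{3})[/math]. The function name `to_stroke5`, the input layout (one list of points per stroke), and the choice to close the sequence with a single all-zero-offset end row are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def to_stroke5(strokes):
    """Convert strokes (lists of absolute (x, y) points) to (dx, dy, p1, p2, p3) rows.

    Hypothetical helper for illustration only.
    """
    rows = []
    prev_x, prev_y = 0.0, 0.0  # the pen starts at the origin
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            last_in_stroke = (i == len(stroke) - 1)
            # p1 = pen down (draw a line to the next point),
            # p2 = pen up (no line to the next point).
            p1, p2 = (0, 1) if last_in_stroke else (1, 0)
            rows.append([x - prev_x, y - prev_y, p1, p2, 0])
            prev_x, prev_y = x, y
    rows.append([0.0, 0.0, 0, 0, 1])  # p3 = drawing finished (assumed end marker)
    return np.array(rows, dtype=np.float32)

# Two strokes: an "L" shape, then a short horizontal line elsewhere.
sketch = [[(0, 0), (3, 0), (3, 4)], [(10, 10), (12, 10)]]
s5 = to_stroke5(sketch)
```

Summing the first two columns cumulatively recovers the absolute pen positions, which is a quick way to check a conversion like this.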



The model is a Sequence-to-Sequence Variational Autoencoder (VAE). The encoder is a symmetric, parallel pair of RNNs that process the sketch drawing in forward and reverse order, respectively. The hidden states produced by the two encoder RNNs are then concatenated into a single hidden state [math]h[/math].

The concatenated hidden state [math]h[/math] is then projected into two vectors [math]\mu[/math] and [math]\hat{\sigma}[/math], each of size [math]N_{z}[/math], using a fully connected layer. [math]\hat{\sigma}[/math] is then converted into a non-negative standard deviation parameter [math]\sigma[/math] using an exponential operator. These two parameters [math]\mu[/math] and [math]\sigma[/math] are then used along with an IID Gaussian vector distributed as [math]\mathcal{N}(0, I)[/math] of size [math]N_{z}[/math] to construct a random vector [math]z \in \mathbb{R}^{N_{z}}[/math], similar to the method used for VAE: \begin{align} \mu = W_{\mu}h + b_{\mu}\textrm{, }\hat{\sigma} = W_{\sigma}h + b_{\sigma}\textrm{, }\sigma = \exp\left(\frac{\hat{\sigma}}{2}\right)\textrm{, }z = \mu + \sigma \odot \mathcal{N}(0,I) \end{align}
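The construction of [math]z[/math] can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes ([math]h[/math] of size 8, [math]N_{z} = 4[/math]) with randomly initialised weights; the function name `sample_latent` is hypothetical.

```python
import numpy as np

def sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng):
    """Project h to (mu, sigma_hat) and draw z = mu + sigma * eps."""
    mu = W_mu @ h + b_mu
    sigma_hat = W_sigma @ h + b_sigma
    sigma = np.exp(sigma_hat / 2.0)        # always positive
    eps = rng.standard_normal(mu.shape)    # IID N(0, I) noise
    z = mu + sigma * eps                   # element-wise product
    return z, mu, sigma_hat

# Assumed sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
N_h, N_z = 8, 4
h = rng.standard_normal(N_h)
W_mu = 0.1 * rng.standard_normal((N_z, N_h))
W_sigma = 0.1 * rng.standard_normal((N_z, N_h))
b_mu = np.zeros(N_z)
b_sigma = np.zeros(N_z)

z, mu, sigma_hat = sample_latent(h, W_mu, b_mu, W_sigma, b_sigma, rng)
```

Because the randomness enters only through the noise vector, [math]z[/math] remains a differentiable function of [math]\mu[/math] and [math]\hat{\sigma}[/math], which is what allows a VAE to be trained by backpropagation.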

The decoder model is another RNN that samples output sketches conditioned on the latent vector [math]z[/math]. The initial hidden and cell states of the decoder RNN are determined using [math][h_{0}, c_{0}] = \tanh(W_{z}z + b_{z})[/math]. Each step of the decoder RNN accepts the previous point [math]S_{i-1}[/math] and the latent vector [math]z[/math] as concatenated input. The initial point given is the origin with the pen state down. The outputs at each step are the parameters of a probability distribution over the next point [math]S_{i}[/math]. Outputs [math]\Delta x[/math] and [math]\Delta y[/math] are modelled using a Gaussian Mixture Model (GMM) with M bivariate normal distributions, and the output pen states [math](q_{1}, q_{2}, q_{3})[/math] are modelled as a categorical distribution over the three one-hot pen states. \begin{align} P(\Delta x, \Delta y) = \sum_{j=1}^{M}\Pi_{j}\mathcal{N}(\Delta x, \Delta y | \mu_{x, j}, \mu_{y, j}, \sigma_{x, j}, \sigma_{y, j}, \rho_{xy, j})\textrm{, where }\sum_{j=1}^{M}\Pi_{j} = 1 \end{align}

For each of the M distributions in the GMM, parameters [math]\mu[/math] and [math]\sigma[/math] are output for both the x and y offsets, signifying the mean location of the next point and its standard deviation, respectively. Also output for each mixture component is a parameter [math]\rho_{xy}[/math] giving the correlation of that bivariate normal distribution. An additional vector [math]\Pi[/math] is output giving the mixture weights for the GMM. The mixture weights [math]\Pi[/math] and the pen state probabilities [math]q[/math] are obtained by applying a softmax to the corresponding network outputs, and the output [math]S_{i}[/math] is then sampled from the resulting mixture and categorical distributions.
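A single decoder output step can be sketched as follows, assuming the network has already produced unnormalised mixture and pen-state scores together with the Gaussian parameters. All names (`sample_next_point`, `pi_logits`, `q_logits`) and the example values are assumptions for illustration; this is not the authors' code.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()

def sample_next_point(pi_logits, mu_x, mu_y, sigma_x, sigma_y, rho, q_logits, rng):
    """Sample (dx, dy, p1, p2, p3) from a GMM plus a categorical pen state."""
    pi = softmax(pi_logits)              # mixture weights, sum to 1
    q = softmax(q_logits)                # pen-state probabilities
    j = rng.choice(len(pi), p=pi)        # pick one mixture component
    cov_xy = rho[j] * sigma_x[j] * sigma_y[j]
    cov = [[sigma_x[j] ** 2, cov_xy],
           [cov_xy, sigma_y[j] ** 2]]
    dx, dy = rng.multivariate_normal([mu_x[j], mu_y[j]], cov)
    pen = np.zeros(3)
    pen[rng.choice(3, p=q)] = 1.0        # one-hot pen state
    return np.concatenate([[dx, dy], pen])

# Example with M = 3 components and made-up parameters.
rng = np.random.default_rng(0)
point = sample_next_point(
    pi_logits=np.array([0.5, 1.0, -1.0]),
    mu_x=np.array([0.0, 2.0, -1.0]), mu_y=np.array([0.0, 1.0, 3.0]),
    sigma_x=np.array([1.0, 0.5, 2.0]), sigma_y=np.array([1.0, 0.5, 2.0]),
    rho=np.array([0.0, 0.3, -0.2]),
    q_logits=np.array([2.0, 0.5, -3.0]),
    rng=rng,
)
```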

One of the key difficulties in training this model is the highly imbalanced class distribution of pen states. In particular, the state that signifies a drawing is complete appears only once per sketch and is difficult to incorporate into the model. In order to have the model stop drawing, the authors introduce a hyperparameter that limits the number of points per drawing to no more than [math]N_{max}[/math], after which all output states from the model are set to (0, 0, 0, 0, 1) to force the drawing to stop.

To sample from the model, the parameters required by the GMM and categorical distributions are generated at each time step, and the model is sampled until a “stop drawing” state appears or the time step reaches [math]N_{max}[/math]. The authors also introduce a “temperature” parameter [math]\tau[/math] that controls the randomness of the drawings by modifying the pen states, model standard deviations, and mixture weights as follows:

\begin{align} \hat{q}_{k} \rightarrow \frac{\hat{q}_{k}}{\tau}\textrm{, }\hat{\Pi}_{k} \rightarrow \frac{\hat{\Pi}_{k}}{\tau}\textrm{, }\sigma^{2}_{x} \rightarrow \sigma^{2}_{x}\tau\textrm{, }\sigma^{2}_{y} \rightarrow \sigma^{2}_{y}\tau \end{align}

This parameter [math]\tau[/math] lies in the range (0, 1]. As the parameter approaches 0, the model becomes more deterministic and always produces the point locations with the maximum likelihood for a given timestep.
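The effect of the temperature can be illustrated with a small sketch. It assumes that [math]\hat{q}[/math] and [math]\hat{\Pi}[/math] are the unnormalised scores that are passed through a softmax, and it scales the standard deviations by [math]\sqrt{\tau}[/math], which is equivalent to scaling the variances by [math]\tau[/math]. The function name `apply_temperature` and the example values are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def apply_temperature(q_logits, pi_logits, sigma_x, sigma_y, tau):
    """Return temperature-adjusted pen probabilities, mixture weights and std devs."""
    q = softmax(q_logits / tau)      # sharper categorical as tau -> 0
    pi = softmax(pi_logits / tau)    # sharper mixture weights as tau -> 0
    scale = np.sqrt(tau)             # sigma^2 -> sigma^2 * tau
    return q, pi, sigma_x * scale, sigma_y * scale

q_logits = np.array([2.0, 1.0, 0.0])
pi_logits = np.array([0.5, 0.2])
sigma_x = np.array([1.0, 2.0])
sigma_y = np.array([1.0, 2.0])

q_warm, pi_warm, sx_warm, sy_warm = apply_temperature(q_logits, pi_logits, sigma_x, sigma_y, tau=1.0)
q_cold, pi_cold, sx_cold, sy_cold = apply_temperature(q_logits, pi_logits, sigma_x, sigma_y, tau=0.1)
```

With the lower temperature, the largest pen-state probability moves closer to 1 and the Gaussians become narrower, so samples concentrate around the most likely point.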

Unconditional Generation

The authors also explored unconditional generation of sketch drawings by training only the decoder RNN module. To do this, the initial hidden states of the RNN were set to 0, and only vectors from the drawing are used as input, without any conditional latent variable [math]z[/math]. Different sketches are sampled from the network by varying only the temperature parameter [math]\tau[/math] between 0.2 and 0.9.


The training procedure follows the same approach as training for VAE and uses a loss function that consists of the sum of Reconstruction Loss [math]L_{R}[/math] and KL Divergence Loss [math]L_{KL}[/math]. The reconstruction loss term is composed of two terms: [math]L_{s}[/math], which tries to maximize the log-likelihood of the generated probability distribution explaining the training data [math]S[/math], and [math]L_{p}[/math], which is the log loss of the pen state terms. \begin{align} L_{s} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{S}}\log\left(\sum_{j=1}^{M}\Pi_{j,i}\mathcal{N}(\Delta x_{i},\Delta y_{i} | \mu_{x,j,i},\mu_{y,j,i},\sigma_{x,j,i},\sigma_{y,j,i},\rho_{xy,j,i})\right) \end{align} \begin{align} L_{p} = -\frac{1}{N_{max}}\sum_{i=1}^{N_{max}} \sum_{k=1}^{3}p_{k,i}\log(q_{k,i}) \end{align} \begin{align} L_{R} = L_{s} + L_{p} \end{align}
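The two reconstruction terms can be sketched in NumPy as below. The array layout (one row per time step, one column per mixture component), the small constant `eps` added inside the logarithms for numerical safety, and the function names are assumptions made for this example.

```python
import numpy as np

def bivariate_normal_pdf(dx, dy, mu_x, mu_y, sx, sy, rho):
    """Density of a bivariate normal with correlation rho."""
    zx = (dx - mu_x) / sx
    zy = (dy - mu_y) / sy
    z = zx ** 2 + zy ** 2 - 2.0 * rho * zx * zy
    one_minus_rho2 = 1.0 - rho ** 2
    norm = 2.0 * np.pi * sx * sy * np.sqrt(one_minus_rho2)
    return np.exp(-z / (2.0 * one_minus_rho2)) / norm

def reconstruction_loss(target, pi, mu_x, mu_y, sx, sy, rho, q, n_s, eps=1e-8):
    """target: (N_max, 5); GMM parameters: (N_max, M); q: (N_max, 3)."""
    n_max = target.shape[0]
    dx, dy = target[:, 0:1], target[:, 1:2]          # (N_max, 1), broadcast over M
    pdf = bivariate_normal_pdf(dx, dy, mu_x, mu_y, sx, sy, rho)
    log_lik = np.log((pi * pdf).sum(axis=1) + eps)   # per-step mixture log-likelihood
    L_s = -log_lik[:n_s].sum() / n_max               # offsets: only the first N_S steps
    L_p = -(target[:, 2:5] * np.log(q + eps)).sum() / n_max  # pen states: all N_max steps
    return L_s + L_p, L_s, L_p

# Tiny example: N_max = 4, actual length N_S = 2, one standard-normal component.
target = np.array([[0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 1]], dtype=float)
ones, zeros = np.ones((4, 1)), np.zeros((4, 1))
L_R, L_s, L_p = reconstruction_loss(target, ones, zeros, zeros, ones, ones, zeros,
                                    q=target[:, 2:5], n_s=2)
```

In this example the pen states are predicted perfectly, so [math]L_{p}[/math] is approximately zero and [math]L_{R}[/math] is dominated by the offset term.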

The KL divergence loss [math]L_{KL}[/math] measures the difference between the distribution of the latent vector [math]z[/math] and an IID Gaussian distribution with zero mean and unit variance. This term, normalized by the number of dimensions [math]N_{z}[/math], is calculated as: \begin{align} L_{KL} = -\frac{1}{2N_{z}}\left(1 + \hat{\sigma} - \mu^{2} - \exp(\hat{\sigma})\right) \end{align}

The loss for the entire model is thus the weighted sum: \begin{align} Loss = L_{R} + w_{KL}L_{KL} \end{align}
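A minimal numerical sketch of the KL term and the weighted total follows. It reads the expression for [math]L_{KL}[/math] as a sum over the [math]N_{z}[/math] latent dimensions divided by [math]N_{z}[/math] (that is, a mean), which is an interpretation rather than something stated explicitly here; the function names and example values are hypothetical.

```python
import numpy as np

def kl_loss(mu, sigma_hat):
    """KL term averaged over latent dimensions (assumed reading of the formula)."""
    return -0.5 * np.mean(1.0 + sigma_hat - mu ** 2 - np.exp(sigma_hat))

def total_loss(L_R, mu, sigma_hat, w_kl):
    """Weighted sum of reconstruction loss and KL loss."""
    return L_R + w_kl * kl_loss(mu, sigma_hat)

mu = np.array([1.0, 1.0])
sigma_hat = np.zeros(2)

print(kl_loss(np.zeros(2), np.zeros(2)))             # 0.0: posterior equals the prior
print(kl_loss(mu, sigma_hat))                        # 0.5
print(total_loss(2.0, mu, sigma_hat, w_kl=0.5))      # 2.25
```

The KL term is zero only when [math]\mu = 0[/math] and [math]\hat{\sigma} = 0[/math], so a larger weight pushes the encoder output toward the standard normal prior.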

The weight parameter [math]w_{KL}[/math] controls how strongly the prior is enforced: as [math]w_{KL} \rightarrow 0[/math], the model loses its ability to enforce a prior over the latent space and assumes the form of a pure autoencoder.


The authors trained multiple conditional and unconditional models using varying values of [math]w_{KL}[/math] and recorded the different [math]L_{R}[/math] and [math]L_{KL}[/math] values at convergence. The network used LSTM as its encoder RNN and HyperLSTM as the decoder network. The HyperLSTM model was used for decoding because it has a history of being useful in sequence generation tasks.

Conditional Reconstruction

$$ INSERT FIGURE CONDITIONAL RECONSTRUCTION $$ The authors qualitatively assessed the reconstructed images [math]S’[/math] given input sketch [math]S[/math] using different values for the temperature hyperparameter [math]\tau[/math]. The figure above shows the results for different values of [math]\tau[/math], starting with 0.01 at the far left and increasing to 1.0 at the far right. Interestingly, a sketch with extra features, such as a cat with three eyes, is reproduced as a sketch of a cat with two eyes, and a sketch of an object from a different class, such as a toothbrush, is reproduced as a sketch of a cat that retains several features of the input toothbrush sketch.

Latent Space Interpolation

$$ INSERT IMAGE LATENT SPACE INTERPOLATION $$ The latent space vectors [math]z[/math] have few “gaps” between encoded latent space vectors due to the enforcement of a Gaussian prior. This allowed the authors to do simple arithmetic on the latent vectors from different sketches and produce logical resulting images, in the same style as latent space arithmetic on Word2Vec vectors.
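Interpolating between two latent vectors can be sketched as follows. The summary above does not say which interpolation scheme is used; spherical interpolation is shown here as one common choice for Gaussian latent spaces, and the function name `slerp` and the example vectors are assumptions.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between z0 (t = 0) and z1 (t = 1)."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))   # angle between the vectors
    if np.isclose(np.sin(omega), 0.0):
        return (1.0 - t) * z0 + t * z1                 # fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Hypothetical latent vectors for two encoded sketches.
z_a = np.array([1.0, 0.0])
z_b = np.array([0.0, 1.0])
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

Each vector along `path` would then be passed to the decoder to produce one intermediate sketch.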

Sketch Drawing Analogies

Given the latent space arithmetic possible, it was found that features of a sketch could be added after some sketch input was encoded. For example, a drawing of a cat with a body could be produced by providing the network with a drawing of a cat’s head and then adding to its latent encoding a vector that represents “body”. As an example, this “body” vector might be produced by taking the latent vector of a drawing of a pig with a body and subtracting the latent vector of a drawing of the pig’s head.
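The analogy described above amounts to simple vector arithmetic on encoded sketches. In the sketch below, the three latent vectors are made-up placeholders standing in for encoder outputs, and the function name `add_body` is hypothetical.

```python
import numpy as np

def add_body(z_cat_head, z_pig_full, z_pig_head):
    """Add an estimated 'body' direction to the latent vector of a cat's head."""
    z_body = z_pig_full - z_pig_head   # what changes when a body is added
    return z_cat_head + z_body

# Placeholder latent vectors; real ones would come from the encoder.
z_cat_head = np.array([0.2, -0.5, 1.0])
z_pig_head = np.array([0.1, 0.4, 0.9])
z_pig_full = np.array([0.6, 0.4, -0.1])

z_cat_full = add_body(z_cat_head, z_pig_full, z_pig_head)
```

Decoding `z_cat_full` would then be expected to yield a cat drawn with a body, provided the “body” direction is roughly consistent across classes.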

Predicting Different Endings of Incomplete Sketches

$$ INSERT FIGURE $$ Using the decoder RNN only, it is possible to finish sketches by conditioning future vector line predictions on the previous points. To do this, the decoder RNN is first used to encode the existing points into the hidden state of the decoder network and is then used to generate the remaining points of the sketch.

Applications and Future Work




Ha, D., & Eck, D. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (2018).