STAT946F17/ Automated Curriculum Learning for Neural Networks
<hr />
<div>= Introduction =<br />
<br />
Humans and animals learn much better when examples are not presented in random order but organized in a meaningful order that gradually introduces more concepts, and progressively more complex ones. Such training strategies can be formalized in the context of machine learning under the name “curriculum learning”. The idea of training a learning machine with a curriculum can be traced back<br />
at least to Elman (1993). The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level. <br />
<br />
However, curriculum learning has only recently become prevalent in the field (e.g., Bengio et al., 2009), due in part to the greater complexity of problems now being considered. In particular,<br />
recent work on learning programs with neural networks has relied on curricula to scale up to longer or more complicated tasks (Reed and de Freitas, 2015, Gui et al. 2017). We expect this trend<br />
to continue as the scope of neural networks widens, with deep reinforcement learning providing fertile ground for structured learning.<br />
<br />
One reason for the slow adoption of curriculum learning is that its effectiveness is highly sensitive to the mode of progression through the tasks. One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting (Zaremba and Sutskever, 2014). However, as well as introducing hard-to-tune parameters, this poses problems for curricula where appropriate thresholds may be unknown or variable across tasks.<br />
<br />
The main contribution of the paper is a stochastic policy that is continuously adapted to optimize learning progress. Given a progress signal that can be evaluated for each<br />
training example, a multi-armed bandit algorithm is used to find a stochastic policy over the tasks that maximizes overall progress. The bandit is non-stationary because the network's behaviour changes as it learns, and hence the optimal policy evolves during training. Moreover, variants of prediction gain, as well as a novel class of progress signals referred to as complexity gain, are considered in this paper.<br />
<br />
= Model =<br />
A task is a distribution $D$ over sequences from $\mathcal{X}$. A curriculum is an ensemble of tasks $D_1, \ldots, D_N$, a sample is an example drawn from one of the tasks of the curriculum,<br />
and a syllabus is a time-varying sequence of distributions over tasks. A neural network is considered as a probabilistic model $p_\theta$ over $\mathcal{X}$, whose parameters are denoted $\theta$.<br />
<br />
The expected loss of the network on the $k$-th task is <br />
\[<br />
\mathcal{L}_k( \theta) := \mathbb{E}_{\mathbf{x} \sim D_k} L(\mathbf{x}, \theta),<br />
\]<br />
where $L(\mathbf{x}, \theta):= -\log {p_\theta}(\mathbf{x})$ is the sample loss on $\mathbf{x}$. <br />
<br />
They consider two related settings. <br />
* In the multiple tasks setting, the goal is to perform as well as possible on all tasks in ${D_k}$; this is captured by the following objective function:<br />
$$<br />
\mathcal{L}_{MT} := \frac{1}{N} \sum_{k=1}^{N} \mathcal{L}_k<br />
$$<br />
*In the target task setting, we are only interested in minimizing the loss on the final task $D_N$. The other tasks then act as a series of stepping stones to the real problem. The objective function in this setting is simply $\mathcal{L}_{TT} := \mathcal{L}_N$.<br />
A curriculum containing $N$ tasks is treated as an $N$-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit. In the bandit setting, an agent selects a sequence of arms (actions) $a_1,\ldots, a_T$ over $T$ rounds of play. After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed.<br />
===Adversarial Multi-Armed Bandits===<br />
The classic algorithm for adversarial bandits is Exp3, which minimizes regret with respect to the single best arm evaluated over the whole history. However, when training a neural network, one arm is typically optimal for a portion of the history, then another arm, and so on; the best strategy is then piecewise stationary. This is generally the case in this paper, as the expected reward for each task changes as the model learns. The Fixed Share method addresses this issue by mixing the weights across arms; combined with Exp3's $\epsilon$-greedy exploration, this yields the Exp3.S algorithm. <br />
<br />
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$ . This policy is defined by a set of weights $w_t$,<br />
\[<br />
\pi_t(i) := (1-\epsilon)\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}+\frac{\epsilon}{N} <br />
\]<br />
\[<br />
w_{t,i}:= \log \big[ (1-\alpha_t)\exp\{ w_{t-1,i} +\eta \bar{r}_{t-1,i}^\beta \} +\frac{\alpha_t}{N-1}\sum_{j \ne i} \exp\{ w_{t-1,j} +\eta \bar{r}_{t-1,j}^\beta \} \big]<br />
\]<br />
\[<br />
w_{1,i} = 0, \quad \alpha_t = t^{-1} , \quad \bar{r}_{s,i}^\beta = \frac{r_s \mathbb{I}_{[a_s = i]}+ \beta}{ \pi_s(i) }<br />
\]<br />
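<br />
Below is a minimal NumPy sketch of the Exp3.S syllabus described by the equations above (the class and variable names are hypothetical, and the reward passed to <code>update</code> is assumed to be already rescaled to $[-1,1]$ as described in the next subsection):<br />
<pre>
import numpy as np

class Exp3S:
    """Sketch of the Exp3.S bandit used to pick tasks (the syllabus)."""
    def __init__(self, n_tasks, eta=1e-3, beta=0.0, eps=0.05):
        self.n, self.eta, self.beta, self.eps = n_tasks, eta, beta, eps
        self.w = np.zeros(n_tasks)          # w_{1,i} = 0
        self.t = 1

    def policy(self):
        # pi_t(i) = (1 - eps) * softmax(w)_i + eps / N
        p = np.exp(self.w - self.w.max())
        p /= p.sum()
        return (1 - self.eps) * p + self.eps / self.n

    def sample_task(self):
        return np.random.choice(self.n, p=self.policy())

    def update(self, arm, reward):
        pi = self.policy()
        # importance-sampled reward \bar{r}_{t,i}^beta
        r_bar = np.full(self.n, self.beta)
        r_bar[arm] += reward
        r_bar /= pi
        self.t += 1
        alpha = 1.0 / self.t                # alpha_t = t^{-1}
        exp_w = np.exp(self.w + self.eta * r_bar)
        others = (exp_w.sum() - exp_w) / (self.n - 1)
        self.w = np.log((1 - alpha) * exp_w + alpha * others)
</pre>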
<br />
===Reward Scaling===<br />
The appropriate step size $\eta$ depends on the magnitudes of the rewards, which may not be known in advance. The reward magnitude depends strongly on the gain signal used to measure<br />
learning progress, and it also varies over time as the model learns. To address this issue, all rewards are adaptively rescaled to $[-1,1]$ by<br />
\[<br />
r_t = \begin{cases}<br />
-1 &\quad \text{if } \hat{r}_t < q^{l}_t\\<br />
1 &\quad \text{if } \hat{r}_t > q^{h}_t\\<br />
\frac{2(\hat{r}_t-q^l_t)}{q^h_t-q^l_t} -1 , &\quad \text{otherwise.}<br />
\end{cases}<br />
\]<br />
where $q^l_t$ and $q^h_t$ are quantiles of the history of unscaled rewards up to time $t$; the authors chose the $20$th and $80$th percentiles respectively.<br />
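<br />
A small sketch of this adaptive rescaling (a hypothetical helper; whether the new reward is appended to the history before or after computing the quantiles is an implementation detail):<br />
<pre>
import numpy as np

def rescale_reward(raw_reward, history, lo_pct=20, hi_pct=80):
    """Adaptively map an unscaled progress signal into [-1, 1]."""
    if history:
        q_lo, q_hi = np.percentile(history, [lo_pct, hi_pct])   # q^l_t, q^h_t
    else:
        q_lo = q_hi = raw_reward
    history.append(raw_reward)
    if q_hi <= q_lo:                     # degenerate history: no spread yet
        return 0.0
    if raw_reward < q_lo:
        return -1.0
    if raw_reward > q_hi:
        return 1.0
    return 2.0 * (raw_reward - q_lo) / (q_hi - q_lo) - 1.0
</pre>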
<br />
===Algorithm===<br />
The automated curriculum learning procedure is summarized in Fig. 1 below,<br />
<br />
where $\tau(\mathbf{x})$ is the length of the input sequence; the raw reward is divided by $\tau(\mathbf{x})$ because the processing time of a sample may differ across tasks.<br />
<center><br />
[[File:alg.png| thumb | center | 450px |Fig 1. Automated Curriculum Learning Algorithm ]]<br />
</center><br />
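<br />
In pseudo-Python, the loop in Fig. 1 can be paraphrased roughly as follows (a schematic sketch: <code>Exp3S</code> and <code>rescale_reward</code> are the sketches above, while <code>sample_example</code>, <code>train_step</code> and <code>tau</code> are hypothetical stand-ins for drawing a sample from $D_k$, taking one gradient step and returning the chosen progress signal, and measuring $\tau(\mathbf{x})$):<br />
<pre>
def automated_curriculum(model, tasks, n_steps):
    bandit = Exp3S(n_tasks=len(tasks))
    reward_history = []
    for _ in range(n_steps):
        k = bandit.sample_task()              # pick a task from the syllabus
        x = sample_example(tasks[k])          # draw a training example from D_k
        raw_gain = train_step(model, x)       # gradient step; returns a progress signal
        raw_gain /= tau(x)                    # normalize by processing time tau(x)
        bandit.update(k, rescale_reward(raw_gain, reward_history))
    return model
</pre>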
<br />
= Learning Progress Signals =<br />
Learning progress measures the effect of a training sample on the target objective. The true progress is usually prohibitively expensive, or even impossible, to compute directly. Therefore the authors consider a range of signals derived from two distinct indicators of learning progress: 1) loss-driven signals, which equate progress with a decrease in some loss; and 2) complexity-driven signals, which equate progress with an increase in model complexity.<br />
<br />
=== Loss-driven Progress===<br />
The loss-driven progress signals compare the predictions made by the model before and after training on some sample $\mathbf{x}$. <br />
<br />
'''Prediction gain (PG)'''<br />
Prediction gain is defined as the instantaneous change in loss for a sample $\mathbf{x}$, where $\theta$ and $\theta'$ denote the network parameters before and after training on $\mathbf{x}$:<br />
\[<br />
v_{PG}:=L(\mathbf{x},\theta)-L(\mathbf{x},\theta')<br />
\]<br />
<br />
'''Gradient prediction gain (GPG)'''<br />
This measures the magnitude of the gradient vector, which has been used as an indicator of salience in the active learning literature:<br />
\[<br />
v_{GPG}:= || \triangledown L(\mathbf{x},\theta)||^2_2<br />
\]<br />
<br />
The above signals are instantaneous in the sense that they depend only on $\mathbf{x}$. Such signals are appealing because they are typically cheaper to evaluate and are agnostic about the overall goal of the curriculum. The remaining three signals more directly measure the effect of training on the desired objective, but require an additional sample $\mathbf{x}'$.<br />
<br />
'''Self prediction gain (SPG)'''<br />
Self prediction gain samples a second example from the same task to address the bias of PG, which is evaluated on the very sample used for training:<br />
\[<br />
v_{SPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k<br />
\]<br />
<br />
<br />
'''Target prediction gain (TPG)'''<br />
\[<br />
v_{TPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_N<br />
\]<br />
Although this estimator might seem like the most accurate measure so far, it tends to suffer from high variance. <br />
<br />
'''Mean prediction gain (MPG)'''<br />
\[<br />
v_{MPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k, k \sim U_N,<br />
\]<br />
where $U_N$ denotes the uniform distribution on $\{1,\ldots,N\}$.<br />
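<br />
To make the loss-driven signals concrete, here is a small, library-agnostic sketch (the helper names are hypothetical: <code>loss_fn(model, x)</code> is assumed to return the scalar sample loss $L(\mathbf{x},\theta)$ and <code>sgd_step(model, x)</code> to update the parameters from $\theta$ to $\theta'$ by training on $\mathbf{x}$):<br />
<pre>
def prediction_gain(model, x, loss_fn, sgd_step):
    """v_PG: change in loss on the training sample itself (a biased estimate)."""
    loss_before = loss_fn(model, x)
    sgd_step(model, x)                        # theta -> theta'
    return loss_before - loss_fn(model, x)

def self_prediction_gain(model, x, x_prime, loss_fn, sgd_step):
    """v_SPG: change in loss on an independent sample x' from the same task.

    Drawing x' from the target task D_N instead gives TPG; drawing it from a
    uniformly chosen task gives MPG.
    """
    loss_before = loss_fn(model, x_prime)
    sgd_step(model, x)                        # train on x, evaluate on x'
    return loss_before - loss_fn(model, x_prime)
</pre>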
<br />
===Complexity-driven Progress===<br />
The intuition for complexity gains is derived from the Minimum Description Length principle (Grunwald, 2007): to generalize well from a particular dataset, one should minimize both the number of bits required to describe the model parameters and the number of bits required to describe the data given the model. An increase in model complexity is therefore worthwhile only if it strongly reduces the data cost. In the case of neural networks, MDL training is done via stochastic variational inference (Hinton and Van Camp, 1993; Graves, 2011; Kingma et al., 2015; Blundell et al., 2015).<br />
<br />
In the stochastic variational inference framework, a variational posterior $P_\phi(\theta)$ over the network weights is maintained during training, with a single weight sample drawn for each training example. An adaptive prior $Q_\psi(\theta)$ is reused for every network weight. In this paper, both $P$ and $Q$ are set to diagonal Gaussian distributions, so that the complexity cost can be computed analytically per weight: <br />
\[<br />
KL(P_\phi|| Q_\psi) = \frac{(\mu_\phi-\mu_\psi)^2+\sigma^2_\phi-\sigma^2_\psi}{2\sigma^2_\psi}+\ln\Big( \frac{\sigma_\psi}{\sigma_\phi} \Big)<br />
\]<br />
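<br />
This per-weight cost transcribes directly (a sketch; summing it over all network weights gives the total complexity cost whose change defines the gains below):<br />
<pre>
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P_phi || Q_psi) between two univariate Gaussians, per weight."""
    return ((mu_p - mu_q) ** 2 + sigma_p ** 2 - sigma_q ** 2) / (2 * sigma_q ** 2) \
           + math.log(sigma_q / sigma_p)
</pre>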
<br />
'''Variational complexity gain (VCG)'''<br />
\[v_{VCG}:= KL(P_{\phi'}|| Q_{\psi'}) - KL(P_\phi|| Q_\psi)\]<br />
<br />
'''Gradient variational complexity gain (GVCG)'''<br />
\[<br />
v_{GVCG}:= [\triangledown_{\phi,\psi} KL(P_\phi|| Q_\psi)]^T \triangledown_\phi \mathbb{E}_{\theta \sim P_\phi} L(\mathbf{x},\theta)<br />
\]<br />
<br />
'''L2 gain (L2G) and gradient L2 gain (GL2G)'''<br />
\[<br />
v_{L2G}:=|| \theta' ||^2_2 -|| \theta ||^2_2, \quad v_{GL2G}:=\theta^T [\triangledown_\theta L(\mathbf{x}, \theta)]<br />
\]<br />
<br />
= Experiments =<br />
To test the proposed approach, the authors applied all the gains to three task suites: $n$-gram models, repeat copy, and the bAbI tasks.<br />
<br />
A unidirectional LSTM architecture was used for all experiments, with cross-entropy as the loss function. The network was optimized by RMSProp with momentum of $0.9$<br />
and a learning rate of $10^{-5}$. The parameters for the Exp3.S algorithm were $\eta = 10^{-3}, \beta = 0, \epsilon = 0.05$. All experiments were repeated $10$ times with different random initializations of the network weights. Two performance benchmarks were used: 1) a fixed uniform policy over all the tasks and 2) directly training on the target task (where applicable).<br />
<br />
===N-Gram Language Modelling===<br />
The first experiment uses character-level Kneser-Ney $n$-gram models (Kneser and Ney, 1995) on the King James Bible data from the Canterbury corpus, with the maximum depth parameter $n$ ranging<br />
from $0$ to $10$. Note that the amount of linguistic structure increases monotonically with $n$. <br />
<br />
Fig. 2 shows that most of the complexity-based gain signals (L2G, GL2G, GVCG) progress rapidly through the curriculum before focusing strongly on the $10$-gram task. The loss-driven progress signals<br />
(PG, GPG, SPG, TPG) also tend to move towards higher $n$, although more slowly and with less certainty. The authors note that the curriculum is unnecessary for this particular experiment: the $10$-gram source was best learnt by training directly on it. The loss-based methods also favoured high $n$-grams for training, but took longer to do so.<br />
<br />
<center><br />
[[File:ngram.png | frame | center |Fig 2. N-gram policies for different gain signals, truncated at $2 \times 10^8$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===Repeat Copy===<br />
In the repeat copy task (Graves et al., 2014), the network is asked to repeat a random sequence a given number of times. Fig. 3 shows that GVCG solves the target task about twice<br />
as fast as uniform sampling for VI training, and that the PG, SPG and TPG gains are somewhat faster than uniform, especially in the early stages.<br />
<br />
<center><br />
[[File:rcode.png | frame | center |Fig 3. Target task loss (per output), truncated at $1.1 \times 10^9$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===bAbI===<br />
The bAbI dataset (Weston et al., 2015) includes $20$ synthetic question-answering problems designed to test the basic reasoning capabilities of machine learning models. bAbI was not specifically designed for curriculum learning, but some of the tasks follow a natural ordering, such as ‘Two Arg Relations’ and ‘Three Arg Relations’. The authors hoped that an efficient syllabus could be<br />
found for learning the whole set.<br />
<br />
Fig. 4 shows that prediction gain (PG) clearly improved on uniform sampling in terms of both learning speed and number of tasks completed; for SPG the same benefits were visible, though less pronounced. The other gains were either roughly equal to or worse than uniform.<br />
<br />
<center><br />
[[File:babi.png | frame | center |Fig 4. Completion curves for the bAbI curriculum, truncated at $3.5\times10^8$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
= Conclusions =<br />
1. The experiments suggest that a stochastic syllabus can yield significant gains over uniform sampling when a suitable progress signal is used.<br />
<br />
2. The uniform random policy is a surprisingly strong benchmark. This is because learning is dominated by gradients from the tasks on which the network is making the fastest progress, simulating a kind of implicit curriculum, albeit one with many wasted samples.<br />
<br />
3. Learning progress is best evaluated on a local, rather than global, basis. In maximum likelihood training, prediction gain is the most consistent signal, while in variational inference<br />
training, gradient variational complexity gain performed best.<br />
<br />
= Critique =<br />
In curriculum learning, a popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks to prevent forgetting. It would be interesting to compare the performance of this hand-designed approach with that of the proposed automated curriculum learning methods.<br />
<br />
Here the assumption is that there is a pre-existing database of learning exemplars to sample from, but what if this is not available? Forestier et al. attempt to address that problem.<br />
<br />
= Sources =<br />
* Graves, Alex, et al. "Automated Curriculum Learning for Neural Networks." arXiv preprint arXiv:1704.03003 (2017).<br />
* Elman, Jeffrey L. "Learning and development in neural networks: The importance of starting small." Cognition 48.1 (1993): 71-99.<br />
* Bengio, Yoshua, et al. "Curriculum learning." Proceedings of the 26th annual international conference on machine learning. ACM, 2009.<br />
* Reed, Scott, and Nando De Freitas. "Neural programmer-interpreters." arXiv preprint arXiv:1511.06279 (2015).<br />
* Gui, Liangke, Tadas Baltrušaitis, and Louis-Philippe Morency. "Curriculum Learning for Facial Expression Recognition." Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on. IEEE, 2017.<br />
* Zaremba, Wojciech, and Ilya Sutskever. "Learning to execute." arXiv preprint arXiv:1410.4615 (2014).<br />
* Kneser, Reinhard, and Hermann Ney. "Improved backing-off for m-gram language modeling." Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.<br />
* Weston, Jason, et al. "Towards ai-complete question answering: A set of prerequisite toy tasks." arXiv preprint arXiv:1502.05698 (2015).<br />
* Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).<br />
* Grunwald, P. D. (2007). The minimum description length principle. The MIT Press.<br />
* Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM<br />
* Forestier, Sébastien, and Yoan Mollard. "Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning." arXiv preprint (2017).<br />
* Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613–1622<br />
* Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.</div>
Hierarchical Question-Image Co-Attention for Visual Question Answering
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of a VQA system whereby a machine learning algorithm answers a visual question asked by a user about a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the <br />
question, the machine needs to understand or "attend" to not only<br />
regions in the image but also parts of the question. In this paper, the authors propose a novel ''co-attention''<br />
technique that combines "where to look" or visual attention with "what words<br />
to listen to" or question attention, allowing their model to jointly reason about the image and the question, thus improving <br />
upon existing state-of-the-art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in the<br />
context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to put this section in to give a very informal and brief<br />
introduction to the concept of the "attention" mechanism, especially visual "attention"; <br />
however, it can be extended to any other type of "attention".<br />
<br />
Visual attention in CNNs is inspired by the biological visual system. As humans,<br />
we have the ability to focus our cognitive processing on a subset of the<br />
environment that is more relevant for the given situation. Imagine you witness<br />
a bank robbery where robbers are trying to escape in a car; as a good citizen,<br />
you will immediately focus your attention on the number plate and other physical<br />
features of the car and robbers in order to give your testimony later; however, you may not remember things which would otherwise interest you more. <br />
Such selective visual attention for a given context (the robbery in the above example) can also be implemented in<br />
traditional CNNs. This allows CNNs to be more robust and superior for certain tasks, and it even helps the <br />
algorithm designer visualize which spatial features (regions within the image) were more important than others. Attention-guided<br />
deep learning is particularly helpful for image captioning and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not a part of the actual paper being summarized; however, it gives an overview<br />
of how visual attention can be incorporated in training a network for VQA tasks, eventually helping <br />
readers absorb and understand the actual proposed ideas from the paper more effortlessly. Das et al. [5] provided a research study on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images, compared with deep models. The concept of "visual attention" has also been implemented in VQA tasks, as explored in [6].<br />
<br />
Generally, to implement attention, the network tries to learn the conditional <br />
distribution $P(L_i|c),\ i \in [1,n]$, representing the individual importance of the features <br />
extracted from each of the $n$ discrete locations within the image, <br />
conditioned on some context vector $c$. In other words, given $n$ features <br />
$L_1, ..., L_n$ from $n$ different spatial regions within the image (top-left, top-middle, top-right, and so on), <br />
the "attention" module learns a parametric function $F(c;\theta)$ that outputs an importance mapping <br />
over these individual features for a given context vector $c$, i.e. a discrete probability distribution <br />
of size $n$, which can be achieved by a softmax over the $n$ locations. <br />
<br />
In order to incorporate visual attention in the VQA task, one can define the context vector $c$ <br />
as a representation of the visual question asked by a user (obtained using an RNN, e.g. an LSTM). The context $c$ can then be used to generate an <br />
attention map over the corresponding image locations (as shown in Figure 2; a code sketch follows the figure), further improving accuracy in the final end-to-end training. <br />
Most existing work on visual attention in VQA tasks is generally a further <br />
specialization of similar ideas.<br />
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with the most "attention" or importance are whitened; the machine learning model has learned to steer its attention based on the given question.]]<br />
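<br />
As a rough illustration (not taken from the paper), here is a minimal PyTorch-style sketch of such a question-conditioned visual attention module; all names are hypothetical, <code>img_feats</code> plays the role of the location features $L_1,\ldots,L_n$ and <code>c</code> that of the question context vector:<br />
<pre>
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Scores each image location against a context vector and pools the features."""
    def __init__(self, feat_dim, ctx_dim, hidden_dim):
        super().__init__()
        self.proj_img = nn.Linear(feat_dim, hidden_dim)
        self.proj_ctx = nn.Linear(ctx_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, img_feats, c):
        # img_feats: (n_locations, feat_dim), c: (ctx_dim,)
        h = torch.tanh(self.proj_img(img_feats) + self.proj_ctx(c))   # (n, hidden)
        attn = torch.softmax(self.score(h).squeeze(-1), dim=0)        # P(L_i | c)
        return attn @ img_feats, attn   # attended feature vector and the attention map
</pre>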
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, the paper addresses the<br />
problem of question attention. The main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. <br />
* A novel convolution-pooling strategy at the phrase level to adaptively select the phrase sizes whose representations are passed to the question-level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explanation'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. The paper uses three different representations of the visual question, one for each level of the hierarchy, as follows: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
$Q^{w,p,s}$ each have exactly $T$ embeddings (sequential data with a temporal dimension), regardless of the position in the hierarchy, i.e. word, phrase or question. <br />
|-<br />
|$V = {v_1,..,v_N}$<br />
|$V$ represents feature vectors from $N$ different locations within the given image; $v_n$ is the feature vector at location $n$. $V$ collectively covers the entire spatial extent of the image. One can extract these location-sensitive features from a convolutional layer of a CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of the image and question at each level in the hierarchy, where $r \in \{w,p,s\}$. Basically, each is a weighted sum of $Q$ or $V$, with weights given by the attention maps $a^q$ or $a^v$, at each level of the hierarchy. <br />
For example, at the "word" level, $a^q_w$ and $a^v_w$ are probability distributions representing the importance of each word in the visual question and of each location within the image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are the final feature vectors for the given question and image with the attention maps ($a^q_w$ and $a^v_w$) applied at the "word" level; similarly for the "phrase" and "question" levels.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents the learnable weights and biases are not used within the equations for simplicity (reader must assume it to exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularities for their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note, each level depends on the previous one, so, phrase level representations are extracted from word level and question level representations come from phrase level as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
The 1-hot encodings of the question's words $Q = \{q_1,..,q_T\}$ are transformed into a vector space (learned end-to-end), giving the word-level embeddings of the visual question, i.e. $Q^w = \{q^w_1,...,q^w_T\}$. The paper learns this transformation end-to-end rather than using pretrained embeddings such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase level embedding vectors are calculated by using 1-D convolutions on the word level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three <br />
window sizes: unigram, bigram and trigram are computed as illustrated by Figure 4. For the ''t-th'' word, <br />
the output from convolution for window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
where $W_c^s$ are the weight parameters. The features from the three n-grams are combined using a ''maxpool'' operator to obtain the phrase-level embedding vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
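<br />
A hedged PyTorch-style sketch of this phrase-level encoding (hypothetical names; padding is chosen so that every window size produces $T$ outputs, matching the equations above):<br />
<pre>
import torch
import torch.nn as nn

class PhraseLevelEncoder(nn.Module):
    """Unigram/bigram/trigram 1-D convolutions followed by a max over window sizes."""
    def __init__(self, d):
        super().__init__()
        # padding s-1 keeps T outputs for each window size s
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=s, padding=s - 1) for s in (1, 2, 3)])

    def forward(self, q_word):
        # q_word: (batch, d, T) word-level embeddings
        outs = [torch.tanh(conv(q_word))[..., s - 1:]      # windows q^w_{t:t+s-1}
                for s, conv in zip((1, 2, 3), self.convs)]
        return torch.stack(outs).max(dim=0).values         # q^p_t = max over s
</pre>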
<br />
=== Question Level ===<br />
For the question-level representation, an LSTM is used to encode the sequence $q_t^p$ after max-pooling. The question-level feature at time ''t'', $q_t^s$, is the <br />
LSTM hidden vector $h_t$ at time ''t''.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
<br />
== Co-Attention Mechanism ==<br />
Paper has proposed two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy, yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ denotes the level in the hierarchy, i.e. $r \in \{w,p,s\}$ (refer to the Notations section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously as shown in Figure 5. In the paper, an "affinity matrix" is used to calculate the<br />
"attention" or affinity for every pair of image location and question part, for each level in the hierarchy (word, phrase and question). Recall that there are $N$ image locations and $T$ <br />
question parts, so the affinity matrix is in $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize the affinity over the locations<br />
of the other modality, i.e. $a_v[n] = \underset{i}{max}(C_{i,n})$ and $a_q[t] = \underset{j}{max}(C_{t,j})$, where $a_v[n]$ is the maximum over the rows $i$ of column $n$ of $C$, and $a_q[t]$ is the maximum over the columns $j$ of row $t$. Instead of choosing the max activation, the paper treats the affinity matrix as a feature and learns to predict the image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV )C^T )\\<br />
a_v = softmax(w_{hv}^T H_v), \quad a_q = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind the above equations is that the image/question attention maps should be functions of the question and image features jointly; therefore, the authors<br />
introduce two intermediate parametric functions $H_v$ and $H_q$ that take the affinity matrix $C$, the image features $V$ and the question features $Q$ as input. The affinity matrix $C$ <br />
transforms question attention space to image attention space (and vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sum of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The reason they use $tanh$ <br />
for $H_q$ and $H_v$ is not specified in the paper, but my assumption is that they want to allow negative contributions from certain unfavourable pairs of image location and question fragment. Unlike $ReLU$ or $sigmoid$, $tanh$ ranges over $[-1, 1]$ and is thus an appropriate choice.<br />
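<br />
The parallel co-attention equations can be transcribed roughly as follows (a single-example sketch without batching; weight shapes follow the equations, while initialization and biases are simplified):<br />
<pre>
import torch
import torch.nn as nn

class ParallelCoAttention(nn.Module):
    """Affinity matrix C plus question- and image-attention maps, per the equations above."""
    def __init__(self, d, k):
        super().__init__()
        self.W_b = nn.Parameter(0.01 * torch.randn(d, d))
        self.W_v = nn.Parameter(0.01 * torch.randn(k, d))
        self.W_q = nn.Parameter(0.01 * torch.randn(k, d))
        self.w_hv = nn.Parameter(0.01 * torch.randn(k))
        self.w_hq = nn.Parameter(0.01 * torch.randn(k))

    def forward(self, V, Q):
        # V: (d, N) image features, Q: (d, T) question features
        C = torch.tanh(Q.t() @ self.W_b @ V)                      # (T, N) affinity
        H_v = torch.tanh(self.W_v @ V + (self.W_q @ Q) @ C)       # (k, N)
        H_q = torch.tanh(self.W_q @ Q + (self.W_v @ V) @ C.t())   # (k, T)
        a_v = torch.softmax(self.w_hv @ H_v, dim=-1)              # (N,) image attention
        a_q = torch.softmax(self.w_hq @ H_q, dim=-1)              # (T,) question attention
        return V @ a_v, Q @ a_q, a_v, a_q                         # v_hat, q_hat, maps
</pre>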
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, authors sequentially alternate between generating image and question attention as shown in Figure 6. <br />
Briefly, this consists of three steps<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)𝟙^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $𝟙$ is a vector with all elements equal to 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Briefly,<br />
* In the first step of alternating co-attention, $X = Q$ and $g = 0$. <br />
* In the second step, $X = V$, where $V$ contains the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step.<br />
* Finally, the attended image feature $\hat{v}$ is used as the guidance to attend to the question again, i.e., $X = Q$ and $g = \hat{v}$. <br />
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
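<br />
A sketch of the attention operation $\hat{x} = \mathcal{A}(X, g)$ and the three alternating steps (for simplicity a single shared attention module is reused at every step, whereas the paper uses separate parameters per step and per level; all names are hypothetical):<br />
<pre>
import torch
import torch.nn as nn

class AttentionOp(nn.Module):
    """The operation x_hat = A(X, g): attend over the rows of X, guided by g."""
    def __init__(self, d, k):
        super().__init__()
        self.W_x = nn.Linear(d, k, bias=False)
        self.W_g = nn.Linear(d, k, bias=False)
        self.w_hx = nn.Linear(k, 1, bias=False)

    def forward(self, X, g):
        # X: (M, d) features (question words or image locations), g: (d,) guidance
        H = torch.tanh(self.W_x(X) + self.W_g(g))          # g broadcast over the M rows
        a = torch.softmax(self.w_hx(H).squeeze(-1), dim=0)
        return a @ X                                        # attended feature vector

def alternating_coattention(attend, Q, V):
    """Step 1: summarise the question; step 2: attend the image; step 3: re-attend the question."""
    s_hat = attend(Q, torch.zeros(Q.size(-1)))   # g = 0
    v_hat = attend(V, s_hat)                     # image guided by question summary
    q_hat = attend(Q, v_hat)                     # question guided by attended image
    return v_hat, q_hat
</pre>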
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicting answers (source: Figure 3 (b) of original paper on page #5)]]<br />
The paper treats predicting the final answer as a classification task. This was surprising to me because I had assumed the answer would be a sequence; however, since an MLP with a softmax output is used, the answer must be a single word (or a short answer from a fixed vocabulary). Co-attended image and question features from all three levels are combined for the final prediction, see Figure 7. Basically, a multi-layer perceptron (MLP) is deployed to recursively encode the attention features as follows (a code sketch is given after the notation list below).<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh_s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
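<br />
A hedged sketch of this recursive answer encoder (hypothetical dimensions; in practice the answer vocabulary is a fixed set of frequent answers):<br />
<pre>
import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Recursively combines word-, phrase- and question-level co-attended features."""
    def __init__(self, d, hidden, n_answers):
        super().__init__()
        self.W_w = nn.Linear(d, hidden)
        self.W_p = nn.Linear(d + hidden, hidden)
        self.W_s = nn.Linear(d + hidden, hidden)
        self.W_h = nn.Linear(hidden, n_answers)

    def forward(self, q_w, v_w, q_p, v_p, q_s, v_s):
        h_w = torch.tanh(self.W_w(q_w + v_w))
        h_p = torch.tanh(self.W_p(torch.cat([q_p + v_p, h_w], dim=-1)))
        h_s = torch.tanh(self.W_s(torch.cat([q_s + v_s, h_p], dim=-1)))
        return torch.softmax(self.W_h(h_s), dim=-1)        # p: distribution over answers
</pre>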
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach seems to outperform most of the state-of-the-art techniques, as shown in Tables 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the result is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the result is not available (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantify the importance of individual components of the architecture by re-training the model with components ablated. The detailed settings are listed as follows.<br />
* Image Attention alone(to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question Attention alone<br />
* W/O Conv(replace convolution and pooling by stacking another word embedding layer on the top of word level outputs)<br />
* W/O W-Atten (replace the word-level attention with a uniform distribution)<br />
* W/O P-Atten(no phrase level co-attention is performed, and the phrase level attention is set to be uniform. Word and question level co-attentions are still modeled)<br />
* W/O Q-Atten(no question level co-attention is performed while word and phrase level co-attentions are still modeled)<br />
<br />
The results of these ablation experiments are shown in Table 3. It should be noted that "attention" at the top of the hierarchy, i.e. the question and phrase levels, matters the most, as seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, the ablated models generally under-perform. However, it is interesting that in some settings the full model does not outperform the ablated model.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by their method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* The model attends mostly to the object regions in an image, and to object words in the question as well, e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** The different attention patterns are perhaps caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level and then combines it to predict the answer. However, it is somewhat unintuitive to visualize the <br />
phrase- and question-level attention maps applied directly to the words of the question, since phrase- <br />
and question-level features are compound features built from multiple words, so their attention contribution to the <br />
actual words of the question cannot be clearly understood. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Co-attention allows the model to attend to different regions of the image as well as different fragments of the question. <br />
* Question is hierarchically represented at three levels to capture information from different granularities. <br />
* Visualization shows model co-attends to interpretable regions of images and questions for predicting the answer. <br />
* Though their model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is a very intuitively relevant idea that closely resembles the way human brains tackle VQA tasks. Therefore this could be developed more into delivering sequence based answers and sentence generation. Therefore, the authors could have used a more powerful, more scalable word-encoding technique such as Glove or Bag-of-words which result in smaller dimensional vectors, thereby opening doors for more learning techniques like sentence-answer-generation. Since word-encoding is treated as a separate task here, Bag-of-words could work, but if we need a more temporal technique, we could use the Position Encoding mechanism [3] which accounts for the position of the word in the sequence itself. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question itself is too long, will such side attention be misleading? A further study could be: compared to a simple question, whether a long and complex question will influence the performance of the model.<br />
<br />
* The idea of the paper seems great, but 0.2% improvement over the state-of-the-art performance on VQA dataset isn't significant. It would have been good to show some incorrect samples to indicate why the error was still so high. In fact there is already a new paper [4] that won the 2017 VQA challenge and it significantly outperforms all the previous methods on VQA dataset giving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.<br />
# Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering", Computer Vision and Pattern Recognition, 2015.</div>
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of VQA system whereby machine learning algorithm answers a visual question asked by an user for a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the <br />
question, machine not only needs to understand or "attend"<br />
regions in the image but also the parts of question as well. In this paper, authors have proposed a novel ''co-attention''<br />
technique to combine "where to look" or visual-attention along with "what words<br />
to listen to" or question-attention VQA allowing their model to jointly reasons about image and question thus improving <br />
upon existing state of the art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in<br />
context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to put this section to give very informal and brief<br />
introduction to the concept of the "attention" mechanism specially visual "attention", <br />
however, it can be expanded to any other type of "attention".<br />
<br />
Visual attention in CNN is inspired by the biological visual system. As humans,<br />
we have ability to focus our cognitive processing onto a subset of the<br />
environment that is more relevant for the given situation. Imagine, you witness<br />
a bank robbery where robbers are trying to escape on a car, as a good citizen,<br />
you will immediately focus your attention on number plate and other physical<br />
features of the car and robbers in order to give your testimony later, however, you may not remember things which otherwise interests you more. <br />
Such selective visual attention for a given context (robbery in above example) can also be implemented in<br />
traditional CNNs as well. This allows CNNs to be more robust and superior for certain tasks and it even helps <br />
algorithm designer to visualize what spacial features (regions within image) were more important than others. Attention guided<br />
deep learning is particularly very helpful for image caption and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not a part of the actual paper that is been summarized, however, it gives an overview<br />
of how visual attention can be incorporated in training of a network for VQA tasks, eventually, helping <br />
readers to absorb and understand actual proposed ideas from the paper more effortlessly. Das et al. [5] provided a research study on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images compared with deep models. The concept of "visual-attention" has also been implemented in VQA tasks, which is exploed in [6].<br />
<br />
Generally for implementing attention, network tries to learn the conditional <br />
distribution $P_{i \in [1,n]}(Li|c)$ representing individual importance for all the features <br />
extracted from each of the dsicrete $n$ locations within the image <br />
conditioned on some context vector $c$. In order words, given $n$ features <br />
$L_i = [L_1, ..., L_n]$ from $n$ different spacial regions within the image (top-left, top-middle, top-right, and so on), <br />
then "attention" module learns a parameteric function $F(c;\theta)$ that outputs an importance mapping <br />
of each of these individual feature for a given context vector $c$ or a discrete probability distribution <br />
of size $n$, can be achived by $softmax(n)$. <br />
<br />
In order to incorporate the visual attention in VQA task, one can define context vector $c$ <br />
as a representation of the visual question asked by an user (using RNN perhaps LSTM). The context $c$ can then be used to generate an <br />
attention map for corresponding image locations (as shown in Figure 2) further improving the accuracy on final end-to-end training. <br />
Most work that exists in literature regarding use of visual-attention in VQA tasks are generally further <br />
specialization of the similar ideas.<br />
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with most "attention" or importance is whitened, machine learning model has learned to steer its attention based on the given question.]]<br />
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, paper has addressed the<br />
problem of question attention. Basically, main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. <br />
* A novel convolution-pooling strategy at phase-level to adaptively select the phrase sizes whose representations are passed to the question level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explaination'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. Paper uses three different representation of visual question, one for each level of hierarchy, they are as follows: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
$Q^{w,p,s}$ has exactly $T$ number of embeddings in it (sequential data with temporal dimension), regardless of its position in the hierarchy i.e. word, phrase or question. <br />
|-<br />
|$V = {v_1,..,v_N}$<br />
|$V$ represented various vectors from $N$ different locations within the given image. Therefore, $v_n$ is feature vector from the image at location $n$. $V$ collectively covers the entire spatial reachings of the image. One can extract these location sensitive features from convolution layer of CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of image and question at each level in the hierarchy where $r \in \{w,p,s\}$. Basically, its a sum of $Q$ or $V$ after the dot product with attention $a^q$ or $a^v$ at each level of hierarchy. <br />
For example, at "word" level, $a^q_w$ and $a^v_w$ is a probability distribution representing importance of each words in visual question and each location within image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are final features vectors for the given question and image with attention maps ($a^q_w$ and $a^v_w$ applied) at the "word" level, and similarly for "phrase" and "question" level as well.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents the learnable weights and biases are not used within the equations for simplicity (reader must assume it to exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularities for their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note, each level depends on the previous one, so, phrase level representations are extracted from word level and question level representations come from phrase level as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
1-hot encoding of question's words $Q = \{q_1,..q_T\}$ are transformed into vector space (learned end-to-end) which represents word level embeddings of a visual question i.e. $Q^w = \{q^w_1,...q^w_T\}$. Paper has learned this transformation end-to-end instead of some pretrained models such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase-level embedding vectors are computed by applying 1-D convolutions to the word-level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three <br />
window sizes (unigram, bigram and trigram) is computed, as illustrated in Figure 4. For the ''t''-th word, <br />
the output of the convolution with window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
where $W_c^s$ denotes the weight parameters. The features from the three n-grams are combined using a ''max-pooling'' operator to obtain the phrase-level embedding vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
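<br />
As a concrete illustration of the unigram/bigram/trigram convolutions and the max-pooling above, here is a minimal PyTorch-style sketch. The sizes $d$ and $T$ are hypothetical, and the padding scheme (chosen so that all three window sizes produce $T$ outputs) is one reasonable choice rather than the paper's exact implementation detail.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T = 512, 8                                   # hypothetical embedding size and question length
Q_w = torch.randn(1, d, T)                      # word-level embeddings as (batch, channels, T)

# One 1-D convolution per window size (unigram, bigram, trigram).
conv1 = nn.Conv1d(d, d, kernel_size=1)
conv2 = nn.Conv1d(d, d, kernel_size=2)
conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)

q1 = torch.tanh(conv1(Q_w))                     # (1, d, T)
q2 = torch.tanh(conv2(F.pad(Q_w, (0, 1))))      # pad right so the output length stays T
q3 = torch.tanh(conv3(Q_w))                     # padding=1 keeps the length T

# Element-wise max over the three n-gram features gives the phrase-level embeddings.
Q_p = torch.max(torch.stack([q1, q2, q3], dim=0), dim=0).values   # (1, d, T)
</pre>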
<br />
=== Question Level ===<br />
For the question-level representation, an LSTM is used to encode the sequence $q_t^p$ after max-pooling. The corresponding question-level feature $q_t^s$ at time ''t'' is the <br />
LSTM hidden state $h_t$ at time ''t''.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
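<br />
A minimal sketch of the question-level encoder: an LSTM runs over the phrase-level sequence and its hidden states serve as $q_t^s$. The sizes are hypothetical placeholders.<br />
<pre>
import torch
import torch.nn as nn

d, T = 512, 8                                  # hypothetical feature size and question length
Q_p = torch.randn(1, T, d)                     # phrase-level embeddings, (batch, T, d)

lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
Q_s, _ = lstm(Q_p)                             # Q_s[:, t, :] is the hidden state h_t, i.e. q_t^s
</pre>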
<br />
== Co-Attention Mechanism ==<br />
The paper proposes two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy, yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ indexes the level in the hierarchy, i.e. $r \in \{w,p,s\}$ (refer to the [[:Notations]] section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously, as shown in Figure 5. The paper introduces an "affinity matrix" to compute the<br />
attention, or affinity, for every pair of image location and question part at each level of the hierarchy (word, phrase and question). Since there are $N$ image locations and $T$ <br />
question parts, the affinity matrix lies in $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$ and question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize the affinity over the locations<br />
of the other modality, i.e. $a_v[n] = \underset{i}{max}(C_{i,n})$ and $a_q[t] = \underset{j}{max}(C_{t,j})$. In other words, $a_v[n]$ is the largest affinity in column $n$ of $C$ (over all question parts), and $a_q[t]$ is the largest affinity in row $t$ (over all image locations). Instead of choosing the max activation, the paper treats the affinity matrix as a feature and learns to predict image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV)C^T)\\<br />
a_v = softmax(w_{hv}^T H_v), \quad a_q = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind the above equations is that the image and question attention maps should be a function of the question and image features jointly; therefore, the authors<br />
introduce two intermediate parametric representations $H_v$ and $H_q$ that take the affinity matrix $C$, the image features $V$ and the question features $Q$ as input. The affinity matrix $C$ <br />
transforms question attention space into image attention space (and vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sum of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The reason for using $tanh$ <br />
in $H_q$ and $H_v$ is not specified in the paper. My assumption is that it allows unfavourable pairs of image location and question fragment to contribute negatively: unlike $ReLU$ or sigmoid, $tanh$ ranges over $[-1, 1]$, which makes it an appropriate choice.<br />
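<br />
To make the parallel co-attention equations concrete, here is a minimal PyTorch-style sketch at a single level of the hierarchy. The dimensions $d$, $k$, $N$ and $T$ are hypothetical, random tensors stand in for real features, and the weights are drawn at random rather than learned; this is a sketch of the equations above, not the authors' released implementation.<br />
<pre>
import torch

d, k, N, T = 512, 256, 196, 8                 # hypothetical feature, hidden, region and word counts
V = torch.randn(d, N)                         # image features, one column per location
Q = torch.randn(d, T)                         # question features, one column per word

# Stand-ins for learned parameters W_b, W_v, W_q, w_hv, w_hq.
W_b = torch.randn(d, d)
W_v, W_q = torch.randn(k, d), torch.randn(k, d)
w_hv, w_hq = torch.randn(k), torch.randn(k)

C = torch.tanh(Q.t() @ W_b @ V)               # affinity matrix, (T, N)

H_v = torch.tanh(W_v @ V + (W_q @ Q) @ C)     # (k, N)
H_q = torch.tanh(W_q @ Q + (W_v @ V) @ C.t()) # (k, T)

a_v = torch.softmax(w_hv @ H_v, dim=0)        # attention over the N image locations
a_q = torch.softmax(w_hq @ H_q, dim=0)        # attention over the T question words

v_hat = V @ a_v                               # attended image feature, (d,)
q_hat = Q @ a_q                               # attended question feature, (d,)
</pre>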
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, authors sequentially alternate between generating image and question attention as shown in Figure 6. <br />
Briefly, this consists of three steps<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, the paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)\mathbb{1}^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum_i{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $\mathbb{1}$ is a vector with all elements equal to 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Briefly:<br />
* At the first step of alternating co-attention, $X = Q$ and $g$ is $0$. <br />
* At the second step, $X = V$, where $V$ is the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step.<br />
* Finally, the attended image feature $\hat{v}$ is used as the guidance to attend to the question again, i.e., $X = Q$ and $g = \hat{v}$. <br />
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
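<br />
The attention operator $\mathcal{A}(X, g)$ and the three alternating steps can be sketched as follows (PyTorch-style, hypothetical sizes, random weights standing in for learned parameters). For brevity a single weight set is reused across the three steps, whereas an actual implementation may keep separate parameters per step.<br />
<pre>
import torch

def attend(X, g, W_x, W_g, w_hx):
    """Attention operation x_hat = A(X, g): X is a (d, M) feature matrix, g a (d,) guidance vector."""
    M = X.shape[1]
    H = torch.tanh(W_x @ X + (W_g @ g).unsqueeze(1).expand(-1, M))  # (k, M); the (W_g g) 1^T term
    a = torch.softmax(w_hx @ H, dim=0)                              # attention weights over M items
    return X @ a                                                    # attended feature, (d,)

d, k, N, T = 512, 256, 196, 8
V = torch.randn(d, N)                         # image features
Q = torch.randn(d, T)                         # question features
W_x, W_g = torch.randn(k, d), torch.randn(k, d)
w_hx = torch.randn(k)

# Step 1: summarize the question with zero guidance.
s_hat = attend(Q, torch.zeros(d), W_x, W_g, w_hx)
# Step 2: attend to the image, guided by the question summary.
v_hat = attend(V, s_hat, W_x, W_g, w_hx)
# Step 3: attend to the question again, guided by the attended image feature.
q_hat = attend(Q, v_hat, W_x, W_g, w_hx)
</pre>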
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicting answers (source: Figure 3 (b) of original paper on page #5)]]<br />
The paper treats predicting the final answer as a classification task. This may be surprising at first; one might expect the answer to be a sequence, but since an MLP produces the prediction, the answer is drawn from a fixed set of candidate answers rather than generated word by word. Co-attended image and question features from all three levels are combined for the final prediction (see Figure 7). Specifically, a multi-layer perceptron (MLP) recursively encodes the attention features as follows.<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh_s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
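<br />
A minimal sketch of the recursive answer encoder described above; the feature size and the number of candidate answers are hypothetical, and random vectors stand in for the co-attended features $\hat{q}^r$ and $\hat{v}^r$.<br />
<pre>
import torch
import torch.nn as nn

d, num_answers = 512, 1000                     # hypothetical feature size and answer-vocabulary size

W_w = nn.Linear(d, d, bias=False)              # biases omitted, mirroring the paper's equations
W_p = nn.Linear(2 * d, d, bias=False)
W_s = nn.Linear(2 * d, d, bias=False)
W_h = nn.Linear(d, num_answers, bias=False)

# Placeholder co-attended features at the word, phrase and question levels.
q_w, v_w = torch.randn(d), torch.randn(d)
q_p, v_p = torch.randn(d), torch.randn(d)
q_s, v_s = torch.randn(d), torch.randn(d)

h_w = torch.tanh(W_w(q_w + v_w))
h_p = torch.tanh(W_p(torch.cat([q_p + v_p, h_w])))
h_s = torch.tanh(W_s(torch.cat([q_s + v_s, h_p])))
p = torch.softmax(W_h(h_s), dim=0)             # probability distribution over candidate answers
</pre>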
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach outperforms most of the state-of-the-art techniques, as shown in Tables 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the result is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the result is not available. (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantify the importance of individual components of the architecture by re-training the model with components ablated. The detailed settings are listed as follows.<br />
* Image attention alone (to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question attention alone<br />
* W/O Conv (replace convolution and pooling by stacking another word embedding layer on top of the word-level outputs)<br />
* W/O W-Atten (replace the word-level attention with a uniform distribution)<br />
* W/O P-Atten (no phrase-level co-attention is performed, and the phrase-level attention is set to be uniform; word and question level co-attentions are still modeled)<br />
* W/O Q-Atten (no question-level co-attention is performed, while word and phrase level co-attentions are still modeled)<br />
<br />
The results of these ablation experiments can be seen in Table 3. It should be noted that attention at the top of the hierarchy, i.e. the question and phrase levels, matters the most, as seen in Table 3.<br />
[[File:ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
The ablated models generally under-perform relative to the full model. Interestingly, however, in some settings the full model does not surpass the ablated model.<br />
<br />
= Qualitative Results =<br />
Figure 8 visualizes some of the co-attention maps generated by the method. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* The model attends mostly to object regions in the image, and to object words in the question as well, e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** The different attention patterns are perhaps caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level and combines it to predict the answer. However, it is somewhat unintuitive to visualize the <br />
phrase and question level attention maps applied directly to the words of the question: since phrase <br />
and question level features are compound features built from multiple words, their attention contribution to the <br />
actual words of the question cannot be clearly interpreted. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Co-attention allows the model to attend to different regions of the image as well as different fragments of the question. <br />
* The question is hierarchically represented at three levels to capture information at different granularities. <br />
* Visualizations show that the model co-attends to interpretable regions of images and phrases of questions when predicting the answer. <br />
* Though the model was evaluated on visual question answering, it can potentially be applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is an intuitively appealing idea that closely resembles the way human brains tackle VQA tasks, so it could be developed further toward sequence-based answers and sentence generation. To that end, the authors could have used a more powerful, more scalable word-encoding technique such as GloVe or bag-of-words, which result in lower-dimensional vectors, thereby opening the door to techniques like sentence-answer generation. Since word encoding is treated as a separate task here, bag-of-words could work, but if a more temporal technique is needed, the Position Encoding mechanism [3], which accounts for the position of a word in the sequence itself, could be used. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attention and question attention can jointly guide each other makes sense. However, if the image is complex or the question itself is very long, could such mutual attention be misleading? A further study could examine whether a long, complex question degrades the model's performance compared to a simple one.<br />
<br />
* The idea of the paper seems great, but a 0.2% improvement over the state-of-the-art performance on the VQA dataset isn't significant. It would have been good to show some incorrectly answered samples to indicate why the error was still so high. In fact, there is already a new paper [4] that won the 2017 VQA challenge and significantly outperforms all previous methods on the VQA dataset, achieving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.</div>
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of VQA system whereby machine learning algorithm answers a visual question asked by an user for a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the <br />
question, machine not only needs to understand or "attend"<br />
regions in the image but also the parts of question as well. In this paper, authors have proposed a novel ''co-attention''<br />
technique to combine "where to look" or visual-attention along with "what words<br />
to listen to" or question-attention VQA allowing their model to jointly reasons about image and question thus improving <br />
upon existing state of the art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in<br />
context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to put this section to give very informal and brief<br />
introduction to the concept of the "attention" mechanism specially visual "attention", <br />
however, it can be expanded to any other type of "attention".<br />
<br />
Visual attention in CNN is inspired by the biological visual system. As humans,<br />
we have ability to focus our cognitive processing onto a subset of the<br />
environment that is more relevant for the given situation. Imagine, you witness<br />
a bank robbery where robbers are trying to escape on a car, as a good citizen,<br />
you will immediately focus your attention on number plate and other physical<br />
features of the car and robbers in order to give your testimony later, however, you may not remember things which otherwise interests you more. <br />
Such selective visual attention for a given context (robbery in above example) can also be implemented in<br />
traditional CNNs as well. This allows CNNs to be more robust and superior for certain tasks and it even helps <br />
algorithm designer to visualize what spacial features (regions within image) were more important than others. Attention guided<br />
deep learning is particularly very helpful for image caption and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not a part of the actual paper that is been summarized, however, it gives an overview<br />
of how visual attention can be incorporated in training of a network for VQA tasks, eventually, helping <br />
readers to absorb and understand actual proposed ideas from the paper more effortlessly. Das et al. [5] provided a research study on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images compared with deep models. The concept of "visual-attention" has also been implemented in VQA tasks, which is exploed in [6, 7]<br />
<br />
Generally for implementing attention, network tries to learn the conditional <br />
distribution $P_{i \in [1,n]}(Li|c)$ representing individual importance for all the features <br />
extracted from each of the dsicrete $n$ locations within the image <br />
conditioned on some context vector $c$. In order words, given $n$ features <br />
$L_i = [L_1, ..., L_n]$ from $n$ different spacial regions within the image (top-left, top-middle, top-right, and so on), <br />
then "attention" module learns a parameteric function $F(c;\theta)$ that outputs an importance mapping <br />
of each of these individual feature for a given context vector $c$ or a discrete probability distribution <br />
of size $n$, can be achived by $softmax(n)$. <br />
<br />
In order to incorporate the visual attention in VQA task, one can define context vector $c$ <br />
as a representation of the visual question asked by an user (using RNN perhaps LSTM). The context $c$ can then be used to generate an <br />
attention map for corresponding image locations (as shown in Figure 2) further improving the accuracy on final end-to-end training. <br />
Most work that exists in literature regarding use of visual-attention in VQA tasks are generally further <br />
specialization of the similar ideas.<br />
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with most "attention" or importance is whitened, machine learning model has learned to steer its attention based on the given question.]]<br />
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, paper has addressed the<br />
problem of question attention. Basically, main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. <br />
* A novel convolution-pooling strategy at phase-level to adaptively select the phrase sizes whose representations are passed to the question level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explaination'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. Paper uses three different representation of visual question, one for each level of hierarchy, they are as follows: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
$Q^{w,p,s}$ has exactly $T$ number of embeddings in it (sequential data with temporal dimension), regardless of its position in the hierarchy i.e. word, phrase or question. <br />
|-<br />
|$V = {v_1,..,v_N}$<br />
|$V$ represented various vectors from $N$ different locations within the given image. Therefore, $v_n$ is feature vector from the image at location $n$. $V$ collectively covers the entire spatial reachings of the image. One can extract these location sensitive features from convolution layer of CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of image and question at each level in the hierarchy where $r \in \{w,p,s\}$. Basically, its a sum of $Q$ or $V$ after the dot product with attention $a^q$ or $a^v$ at each level of hierarchy. <br />
For example, at "word" level, $a^q_w$ and $a^v_w$ is a probability distribution representing importance of each words in visual question and each location within image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are final features vectors for the given question and image with attention maps ($a^q_w$ and $a^v_w$ applied) at the "word" level, and similarly for "phrase" and "question" level as well.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents the learnable weights and biases are not used within the equations for simplicity (reader must assume it to exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularities for their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note, each level depends on the previous one, so, phrase level representations are extracted from word level and question level representations come from phrase level as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
1-hot encoding of question's words $Q = \{q_1,..q_T\}$ are transformed into vector space (learned end-to-end) which represents word level embeddings of a visual question i.e. $Q^w = \{q^w_1,...q^w_T\}$. Paper has learned this transformation end-to-end instead of some pretrained models such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase level embedding vectors are calculated by using 1-D convolutions on the word level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three <br />
window sizes: unigram, bigram and trigram are computed as illustrated by Figure 4. For the ''t-th'' word, <br />
the output from convolution for window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
Where $W_c^s$ is the weight parameters. The features from three n-grams are combined together using ''maxpool'' operator to obtain the phrase-level embeddings vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
<br />
=== Question Level ===<br />
For question level representation, LSTM is used to encode the sequence $q_t^p$ after max-pooling. The corresponding question-level feature at time ''t'' $q_t^s$ is the <br />
LSTM hidden vector at time ''t'' $h_t$.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
<br />
== Co-Attention Mechanism ==<br />
Paper has proposed two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ is levels in hierarchy i.e. $r \in \{w,p,s\}$ (refer to [[:Notations]] section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously as shown in Figure 5. In the paper, "affinity matrix" has been mentioned as the way to calculate the<br />
"attention" or affinity for every pair of image location and question part for each level in the hierarchy (word, phrase and question). Remember, there are $N$ image locations and $T$ <br />
question parts, thus affinity matrix is $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize out the affinity over the locations<br />
of other modality, i.e. $a_v[n] = \underset{i}{max}(C_{i,n})$ and $a_q[t] = \underset{j}{max}(C_{t,j})$. Their notation here is not rigorous. $a_v[n]$ is actually row number $\underset{i}{argmax}(C_{i,n})$ of matrix $C$, and $a_q[t]$ is column number $\underset{j}{argmax}(C_{t,j})$ of that matrix. Instead of choosing the max activation, paper has considered the affinity matrix as a feature and learn to predict image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV )C^T )\\<br />
a_v = softmax(w_{hv}^T Hv), \quad aq = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind above equation is that, image/question attention maps should be the function of question and image features jointly, therefore, authors have<br />
developed two intermediate parametric functions $H_v$ and $H_q$ that takes affinity matrix $C$, image features $V$ and question features $Q$ as input. The affinity matrix $C$ <br />
transforms question attention space to image attention space (vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sum of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The reason they are using $tanh$ <br />
for $H_q$ and $H_v$is not specified in the paper. But my assumption is that they want to have negative impacts for certain unfavorable pair of image location and question fragment. Unlike $RELU$ or $Sigmoid$, $tanh$ can be between $[-1, 1]$ thus appropriate choice.<br />
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, authors sequentially alternate between generating image and question attention as shown in Figure 6. <br />
Briefly, this consists of three steps<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)𝟙^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $𝟙$ is a vector with all elements to be 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Breifly,<br />
* At the first step of alternating coattention, $X = Q$, and $g$ is $0$. <br />
* At the second step, $X = V$ where $V$ is the image features, and the guidance $g$ is intermediate attended question feature $\hat{s}$ from the first step<br />
* Finally, we use the attended image feature $\hat{v}$ as the guidance to attend the question again, i.e., $X = Q$ and $g = \hat{v}$. <br />
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicitng answers (source: Figure 3 (b) of original paper on page #5)]]<br />
Paper treats predicting final answer as a classification task. It was surprising because I always thought answer would be a sequence, however, by using MLP it is apparent that answer must be a single word. Co-attended image and question features from all three levels are combined together for the final prediction, see Figure 7. Basically, a multi-layer perceptron (MLP) is deployed to recursively encode the attention features as follows.<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh^s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach seems to outperform most of the state-of-art techniques as shown in Table 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the results is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the results is not available (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantified the importance of individual components in the infrastructure. The idea is re-training the model with ablated components. The detailed settings are listed as follows.<br />
* Image Attention alone(to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question Attention alone<br />
* W/O Conv(replace convolution and pooling by stacking another word embedding layer on the top of word level outputs)<br />
* W/OW-Atten(replace the word level attention with a uniform distribution)<br />
* W/O P-Atten(no phrase level co-attention is performed, and the phrase level attention is set to be uniform. Word and question level co-attentions are still modeled)<br />
* W/O Q-Atten(no question level co-attention is performed while word and phrase level co-attentions are still modeled)<br />
<br />
The results of such ablation experiments can be seen in Table 3. It should be noted that "attention" at top of the hierarchy i.e question level or phrase level matters the most as seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, it is clear that the ablated model under-performs generally. However, it is interesting to see in some settings, the full model does not excel the ablated model.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by their method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* Model attends mostly to the object regions in an image, and objects at questions as well e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** Reason for different attention could be perhaphs caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level, and then combines them to predict the answer. However, it some what un-intuitive to visualize the <br />
phrase and question level attention mapping applied diretly to the words of the question, since phrase <br />
and question level features are compound features from multiple words, thus their attention contribution on the <br />
actual words from the question cannot be clearly understood. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Coattention allows model to attend to different regions of the image as well as different fragments of the question. <br />
* Question is hierarchically represented at three levels to capture information from different granularities. <br />
* Visualization shows model co-attends to interpretable regions of images and questions for predicting the answer. <br />
* Though their model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is a very intuitively relevant idea that closely resembles the way human brains tackle VQA tasks. Therefore this could be developed more into delivering sequence based answers and sentence generation. Therefore, the authors could have used a more powerful, more scalable word-encoding technique such as Glove or Bag-of-words which result in smaller dimensional vectors, thereby opening doors for more learning techniques like sentence-answer-generation. Since word-encoding is treated as a separate task here, Bag-of-words could work, but if we need a more temporal technique, we could use the Position Encoding mechanism [3] which accounts for the position of the word in the sequence itself. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question itself is too long, will such side attention be misleading? A further study could be: compared to a simple question, whether a long and complex question will influence the performance of the model.<br />
<br />
* The idea of the paper seems great, but 0.2% improvement over the state-of-the-art performance on VQA dataset isn't significant. It would have been good to show some incorrect samples to indicate why the error was still so high. In fact there is already a new paper [4] that won the 2017 VQA challenge and it significantly outperforms all the previous methods on VQA dataset giving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Question-Image_Co-Attention_for_Visual_Question_Answering&diff=31579Hierarchical Question-Image Co-Attention for Visual Question Answering2017-11-28T06:15:37Z<p>Asriram: /* Ablation Study */</p>
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of VQA system whereby machine learning algorithm answers a visual question asked by an user for a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the <br />
question, machine not only needs to understand or "attend"<br />
regions in the image but also the parts of question as well. In this paper, authors have proposed a novel ''co-attention''<br />
technique to combine "where to look" or visual-attention along with "what words<br />
to listen to" or question-attention VQA allowing their model to jointly reasons about image and question thus improving <br />
upon existing state of the art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in<br />
context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to put this section to give very informal and brief<br />
introduction to the concept of the "attention" mechanism specially visual "attention", <br />
however, it can be expanded to any other type of "attention".<br />
<br />
Visual attention in CNN is inspired by the biological visual system. As humans,<br />
we have ability to focus our cognitive processing onto a subset of the<br />
environment that is more relevant for the given situation. Imagine, you witness<br />
a bank robbery where robbers are trying to escape on a car, as a good citizen,<br />
you will immediately focus your attention on number plate and other physical<br />
features of the car and robbers in order to give your testimony later, however, you may not remember things which otherwise interests you more. <br />
Such selective visual attention for a given context (robbery in above example) can also be implemented in<br />
traditional CNNs as well. This allows CNNs to be more robust and superior for certain tasks and it even helps <br />
algorithm designer to visualize what spacial features (regions within image) were more important than others. Attention guided<br />
deep learning is particularly very helpful for image caption and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not a part of the actual paper that is been summarized, however, it gives an overview<br />
of how visual attention can be incorporated in training of a network for VQA tasks, eventually, helping <br />
readers to absorb and understand actual proposed ideas from the paper more effortlessly. Das et al. [5] provided a research study on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images compared with deep models.<br />
<br />
Generally for implementing attention, network tries to learn the conditional <br />
distribution $P_{i \in [1,n]}(Li|c)$ representing individual importance for all the features <br />
extracted from each of the dsicrete $n$ locations within the image <br />
conditioned on some context vector $c$. In order words, given $n$ features <br />
$L_i = [L_1, ..., L_n]$ from $n$ different spacial regions within the image (top-left, top-middle, top-right, and so on), <br />
then "attention" module learns a parameteric function $F(c;\theta)$ that outputs an importance mapping <br />
of each of these individual feature for a given context vector $c$ or a discrete probability distribution <br />
of size $n$, can be achived by $softmax(n)$. <br />
<br />
In order to incorporate the visual attention in VQA task, one can define context vector $c$ <br />
as a representation of the visual question asked by an user (using RNN perhaps LSTM). The context $c$ can then be used to generate an <br />
attention map for corresponding image locations (as shown in Figure 2) further improving the accuracy on final end-to-end training. <br />
Most work that exists in literature regarding use of visual-attention in VQA tasks are generally further <br />
specialization of the similar ideas.<br />
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with most "attention" or importance is whitened, machine learning model has learned to steer its attention based on the given question.]]<br />
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, paper has addressed the<br />
problem of question attention. Basically, main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. <br />
* A novel convolution-pooling strategy at phase-level to adaptively select the phrase sizes whose representations are passed to the question level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explaination'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. Paper uses three different representation of visual question, one for each level of hierarchy, they are as follows: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
$Q^{w,p,s}$ has exactly $T$ number of embeddings in it (sequential data with temporal dimension), regardless of its position in the hierarchy i.e. word, phrase or question. <br />
|-<br />
|$V = {v_1,..,v_N}$<br />
|$V$ represented various vectors from $N$ different locations within the given image. Therefore, $v_n$ is feature vector from the image at location $n$. $V$ collectively covers the entire spatial reachings of the image. One can extract these location sensitive features from convolution layer of CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of image and question at each level in the hierarchy where $r \in \{w,p,s\}$. Basically, its a sum of $Q$ or $V$ after the dot product with attention $a^q$ or $a^v$ at each level of hierarchy. <br />
For example, at "word" level, $a^q_w$ and $a^v_w$ is a probability distribution representing importance of each words in visual question and each location within image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are final features vectors for the given question and image with attention maps ($a^q_w$ and $a^v_w$ applied) at the "word" level, and similarly for "phrase" and "question" level as well.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents the learnable weights and biases are not used within the equations for simplicity (reader must assume it to exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularities for their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note, each level depends on the previous one, so, phrase level representations are extracted from word level and question level representations come from phrase level as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
1-hot encoding of question's words $Q = \{q_1,..q_T\}$ are transformed into vector space (learned end-to-end) which represents word level embeddings of a visual question i.e. $Q^w = \{q^w_1,...q^w_T\}$. Paper has learned this transformation end-to-end instead of some pretrained models such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase level embedding vectors are calculated by using 1-D convolutions on the word level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three <br />
window sizes: unigram, bigram and trigram are computed as illustrated by Figure 4. For the ''t-th'' word, <br />
the output from convolution for window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
Where $W_c^s$ is the weight parameters. The features from three n-grams are combined together using ''maxpool'' operator to obtain the phrase-level embeddings vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
<br />
=== Question Level ===<br />
For question level representation, LSTM is used to encode the sequence $q_t^p$ after max-pooling. The corresponding question-level feature at time ''t'' $q_t^s$ is the <br />
LSTM hidden vector at time ''t'' $h_t$.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
<br />
== Co-Attention Mechanism ==<br />
Paper has proposed two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ is levels in hierarchy i.e. $r \in \{w,p,s\}$ (refer to [[:Notations]] section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously as shown in Figure 5. In the paper, "affinity matrix" has been mentioned as the way to calculate the<br />
"attention" or affinity for every pair of image location and question part for each level in the hierarchy (word, phrase and question). Remember, there are $N$ image locations and $T$ <br />
question parts, thus affinity matrix is $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize out the affinity over the locations<br />
of the other modality, i.e. $a_v[n] = \underset{i}{\max}(C_{i,n})$ and $a_q[t] = \underset{j}{\max}(C_{t,j})$. In other words, $a_v[n]$ is the largest entry of column $n$ of $C$ (attained at row $\underset{i}{\operatorname{argmax}}\,C_{i,n}$), and $a_q[t]$ is the largest entry of row $t$ (attained at column $\underset{j}{\operatorname{argmax}}\,C_{t,j}$). Instead of choosing the maximum activation, the paper treats the affinity matrix as a feature and learns to predict the image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV)C^T)\\<br />
a_v = softmax(w_{hv}^T H_v), \quad a_q = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind the above equations is that the image and question attention maps should be functions of the question and image features jointly; therefore, the authors<br />
introduce two intermediate parametric maps $H_v$ and $H_q$ that take the affinity matrix $C$, the image features $V$ and the question features $Q$ as input. The affinity matrix $C$ <br />
transforms the question attention space into the image attention space ($C^T$ does the reverse). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sums of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The reason for using $tanh$ <br />
in $H_q$ and $H_v$ is not specified in the paper, but my assumption is that the authors want unfavourable pairs of image location and question fragment to contribute negatively. Unlike $ReLU$ or the sigmoid, $tanh$ ranges over $[-1, 1]$, which makes it an appropriate choice.<br />
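<br />
As a concrete illustration, here is a minimal PyTorch sketch of parallel co-attention at a single level of the hierarchy; the sizes are assumptions, and random tensors stand in for the learned parameters $W_b, W_v, W_q, w_{hv}, w_{hq}$, so this is not the authors' implementation.<br />
<pre>
import torch
import torch.nn.functional as F

d, k, N, T = 512, 256, 196, 10        # assumed: feature dim, hidden dim, #image regions, #words
V = torch.randn(d, N)                 # image feature map
Q = torch.randn(d, T)                 # question representation at this level

W_b = torch.randn(d, d)
W_v, W_q = torch.randn(k, d), torch.randn(k, d)
w_hv, w_hq = torch.randn(k), torch.randn(k)

C   = torch.tanh(Q.t() @ W_b @ V)               # affinity matrix, (T, N)
H_v = torch.tanh(W_v @ V + (W_q @ Q) @ C)       # (k, N)
H_q = torch.tanh(W_q @ Q + (W_v @ V) @ C.t())   # (k, T)
a_v = F.softmax(w_hv @ H_v, dim=0)              # attention over the N image regions
a_q = F.softmax(w_hq @ H_q, dim=0)              # attention over the T question parts

v_hat = V @ a_v                                 # attended image feature, (d,)
q_hat = Q @ a_q                                 # attended question feature, (d,)
</pre>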
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, the authors sequentially alternate between generating image and question attention as shown in Figure 6. <br />
Briefly, this consists of three steps:<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, the paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)\mathbb{1}^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $\mathbb{1}$ is a vector with all elements equal to 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Briefly,<br />
* At the first step of alternating co-attention, $X = Q$ and $g$ is $0$. <br />
* At the second step, $X = V$, where $V$ is the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step. <br />
* Finally, the attended image feature $\hat{v}$ is used as the guidance to attend to the question again, i.e., $X = Q$ and $g = \hat{v}$. <br />
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
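<br />
The three steps above can be sketched in a few lines of PyTorch; as before, the sizes and the random weight tensors ($W_x, W_g, w_{hx}$) are assumptions for illustration only.<br />
<pre>
import torch
import torch.nn.functional as F

d, k, N, T = 512, 256, 196, 10
V, Q = torch.randn(d, N), torch.randn(d, T)                           # toy image and question features
W_x, W_g, w_hx = torch.randn(k, d), torch.randn(k, d), torch.randn(k)

def attend(X, g):
    """x_hat = A(X, g): X is (d, M), g is a (d,) guidance vector (zeros on the first step)."""
    H = torch.tanh(W_x @ X + (W_g @ g).unsqueeze(1))   # broadcasting plays the role of the all-ones vector
    a_x = F.softmax(w_hx @ H, dim=0)                   # attention weights over the M columns of X
    return X @ a_x                                     # attended feature vector, (d,)

s_hat = attend(Q, torch.zeros(d))   # step 1: summarize the question (g = 0)
v_hat = attend(V, s_hat)            # step 2: attend to the image, guided by s_hat
q_hat = attend(Q, v_hat)            # step 3: re-attend to the question, guided by v_hat
</pre>
Whether the three steps share weights is an implementation detail not spelled out here; a single shared set is used in the sketch only to keep it short.<br />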
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicting answers (source: Figure 3 (b) of original paper on page #5)]]<br />
The paper treats predicting the final answer as a classification task. This was surprising to me because I expected the answer to be a sequence; however, since an MLP produces the prediction, the answer must be a single word (or one of a fixed set of candidate answers). The co-attended image and question features from all three levels are combined for the final prediction, see Figure 7. Basically, a multi-layer perceptron (MLP) is deployed to recursively encode the attention features as follows.<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh_s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
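<br />
A minimal PyTorch sketch of this recursive encoding is given below; the feature width, the number of candidate answers, and the random stand-ins for $W_w, W_p, W_s, W_h$ and for the co-attended features are all assumptions.<br />
<pre>
import torch
import torch.nn.functional as F

d, n_answers = 512, 1000                       # assumed sizes
# co-attended question/image features from the three levels (toy values)
q_hat_w, v_hat_w = torch.randn(d), torch.randn(d)
q_hat_p, v_hat_p = torch.randn(d), torch.randn(d)
q_hat_s, v_hat_s = torch.randn(d), torch.randn(d)

W_w = torch.randn(d, d)
W_p = torch.randn(d, 2 * d)                    # acts on the concatenation [(q^p + v^p), h_w]
W_s = torch.randn(d, 2 * d)
W_h = torch.randn(n_answers, d)

h_w = torch.tanh(W_w @ (q_hat_w + v_hat_w))
h_p = torch.tanh(W_p @ torch.cat([q_hat_p + v_hat_p, h_w]))
h_s = torch.tanh(W_s @ torch.cat([q_hat_s + v_hat_s, h_p]))
p   = F.softmax(W_h @ h_s, dim=0)              # distribution over the candidate answers
</pre>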
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach appears to outperform most of the state-of-the-art techniques, as shown in Tables 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the result is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the result is not available. (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantify the importance of individual components in the architecture. The idea is to re-train the model with individual components ablated. The settings are as follows.<br />
* Image attention alone (to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question attention alone<br />
* W/O Conv (replace convolution and pooling by stacking another word embedding layer on top of the word-level outputs)<br />
* W/O W-Atten (replace the word-level attention with a uniform distribution)<br />
* W/O P-Atten (no phrase-level co-attention is performed, and the phrase-level attention is set to be uniform; word- and question-level co-attentions are still modeled)<br />
* W/O Q-Atten (no question-level co-attention is performed, while word- and phrase-level co-attentions are still modeled)<br />
<br />
The results of these ablation experiments are shown in Table 3. It should be noted that attention at the top of the hierarchy, i.e. the question and phrase levels, matters the most, as seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, the ablated models generally under-perform. However, it is interesting that in some settings the full model does not outperform the ablated one.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by their method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* The model attends mostly to object regions in the image, and to object words in the question as well, e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention shows different patterns across images. <br />
** For the first two images, the attention shifts from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** The differing attention patterns are perhaps caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level, and then combines them to predict the answer. However, it is somewhat unintuitive to visualize the <br />
phrase- and question-level attention maps applied directly to the words of the question, since phrase- <br />
and question-level features are compound features built from multiple words, so their attention contribution to the <br />
actual words of the question cannot be clearly read off. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Co-attention allows the model to attend to different regions of the image as well as different fragments of the question. <br />
* The question is hierarchically represented at three levels to capture information at different granularities. <br />
* Visualizations show the model co-attends to interpretable regions of images and questions when predicting the answer. <br />
* Though the model was evaluated on visual question answering, it can potentially be applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is a very intuitive idea that closely resembles the way human brains tackle VQA tasks, so it could be developed further towards sequence-based answers and sentence generation. To that end, the authors could have used a more powerful, more scalable word-encoding technique such as GloVe or bag-of-words, which produce lower-dimensional vectors, thereby opening the door to techniques such as sentence-answer generation. Since word encoding is treated as a separate task here, bag-of-words could work, but if a more temporal technique is needed, the Position Encoding mechanism [3], which accounts for the position of a word in the sequence, could be used. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question is very long, could such mutual guidance become misleading? A further study could compare simple questions against long, complex ones to see whether question complexity influences the performance of the model.<br />
<br />
* The idea of the paper seems great, but a 0.2% improvement over the state-of-the-art performance on the VQA dataset is not significant. It would have been good to show some incorrect samples to indicate why the error was still so high. In fact, a newer paper [4] that won the 2017 VQA challenge significantly outperforms all previous methods on the VQA dataset, giving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.</div>Asriram
* W/O P-Atten(no phrase level co-attention is performed, and the phrase level attention is set to be uniform. Word and question level co-attentions are still modeled)<br />
* W/O Q-Atten(no question level co-attention is performed while word and phrase level co-attentions are still modeled)<br />
<br />
The results of such ablation experiments can be seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, it is clear that the ablated model under-performs generally. However, it is interesting to see in some settings, the full model does not excel the ablated model.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by their method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* Model attends mostly to the object regions in an image, and objects at questions as well e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** Reason for different attention could be perhaphs caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level, and then combines them to predict the answer. However, it some what un-intuitive to visualize the <br />
phrase and question level attention mapping applied diretly to the words of the question, since phrase <br />
and question level features are compound features from multiple words, thus their attention contribution on the <br />
actual words from the question cannot be clearly understood. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Coattention allows model to attend to different regions of the image as well as different fragments of the question. <br />
* Question is hierarchically represented at three levels to capture information from different granularities. <br />
* Visualization shows model co-attends to interpretable regions of images and questions for predicting the answer. <br />
* Though their model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is a very intuitively relevant idea that closely resembles the way human brains tackle VQA tasks. Therefore this could be developed more into delivering sequence based answers and sentence generation. Therefore, the authors could have used a more powerful, more scalable word-encoding technique such as Glove or Bag-of-words which result in smaller dimensional vectors, thereby opening doors for more learning techniques like sentence-answer-generation. Since word-encoding is treated as a separate task here, Bag-of-words could work, but if we need a more temporal technique, we could use the Position Encoding mechanism [3] which accounts for the position of the word in the sequence itself. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question itself is too long, will such side attention be misleading? A further study could be: compared to a simple question, whether a long and complex question will influence the performance of the model.<br />
<br />
* The idea of the paper seems great, but 0.2% improvement over the state-of-the-art performance on VQA dataset isn't significant. It would have been good to show some incorrect samples to indicate why the error was still so high. In fact there is already a new paper [4] that won the 2017 VQA challenge and it significantly outperforms all the previous methods on VQA dataset giving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Question-Image_Co-Attention_for_Visual_Question_Answering&diff=31576Hierarchical Question-Image Co-Attention for Visual Question Answering2017-11-28T06:14:11Z<p>Asriram: /* Role of Visual Attention in VQA */</p>
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of a VQA system whereby a machine learning algorithm answers a visual question asked by a user for a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the<br />
question, the machine needs to understand, or "attend" to, not only regions in the image but also parts of the question. In this paper, the authors propose a novel ''co-attention''<br />
technique that combines "where to look" (visual attention) with "what words<br />
to listen to" (question attention), allowing their model to jointly reason about the image and the question, thereby improving<br />
upon existing state-of-the-art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in the<br />
context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to include this section to give a very informal and brief<br />
introduction to the concept of the "attention" mechanism, especially visual "attention";<br />
however, the idea can be extended to any other type of "attention".<br />
<br />
Visual attention in CNNs is inspired by the biological visual system. As humans,<br />
we have the ability to focus our cognitive processing on a subset of the<br />
environment that is most relevant for the given situation. Imagine you witness<br />
a bank robbery where the robbers are trying to escape in a car; as a good citizen,<br />
you will immediately focus your attention on the number plate and other physical<br />
features of the car and the robbers in order to give your testimony later; however, you may not remember things which would otherwise interest you more.<br />
Such selective visual attention for a given context (the robbery in the above example) can also be implemented in<br />
traditional CNNs. This makes CNNs more robust and superior for certain tasks, and it even helps the<br />
algorithm designer visualize which spatial features (regions within the image) were more important than others. Attention-guided<br />
deep learning is particularly helpful for image captioning and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not part of the actual paper being summarized; however, it gives an overview<br />
of how visual attention can be incorporated into training a network for VQA tasks, helping<br />
readers absorb and understand the ideas proposed in the paper more easily. Das et al. [5] provided a research study on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images compared with deep models.<br />
<br />
Generally, to implement attention, the network tries to learn the conditional<br />
distribution $P(L_i|c), \ i \in [1,n]$, representing the individual importance of each of the features<br />
extracted from the $n$ discrete locations within the image,<br />
conditioned on some context vector $c$. In other words, given $n$ features<br />
$L = [L_1, ..., L_n]$ from $n$ different spatial regions within the image (top-left, top-middle, top-right, and so on),<br />
the "attention" module learns a parametric function $F(c;\theta)$ that outputs an importance mapping<br />
of each of these individual features for a given context vector $c$, i.e. a discrete probability distribution<br />
of size $n$, which can be achieved with a softmax over the $n$ locations.<br />
<br />
In order to incorporate visual attention into the VQA task, one can define the context vector $c$<br />
as a representation of the visual question asked by a user (using an RNN, e.g. an LSTM). The context $c$ can then be used to generate an<br />
attention map over the corresponding image locations (as shown in Figure 2), further improving the accuracy of the final end-to-end training.<br />
Most of the work that exists in the literature regarding the use of visual attention in VQA tasks is generally a further<br />
specialization of similar ideas.<br />
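<br />
To make the idea concrete, below is a minimal PyTorch sketch (not from the paper) of soft visual attention: region features are scored against a question context vector $c$, a softmax over the $n$ locations gives the attention distribution, and the attended feature is the weighted sum. All layer sizes, tensor names and the particular scoring function are illustrative assumptions.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVisualAttention(nn.Module):
    """Scores N region features against a context vector and returns
    the attention distribution plus the attended image feature."""
    def __init__(self, feat_dim, ctx_dim, hidden_dim=256):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)   # project region features
        self.proj_c = nn.Linear(ctx_dim, hidden_dim)    # project context vector
        self.score = nn.Linear(hidden_dim, 1)           # scalar score per region

    def forward(self, V, c):
        # V: (batch, N, feat_dim) region features, c: (batch, ctx_dim) question context
        h = torch.tanh(self.proj_v(V) + self.proj_c(c).unsqueeze(1))   # (batch, N, hidden)
        scores = self.score(h).squeeze(-1)                             # (batch, N)
        alpha = F.softmax(scores, dim=-1)        # attention distribution over N locations
        attended = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)         # (batch, feat_dim)
        return alpha, attended

# toy usage: 14x14 = 196 regions of 512-d features, 512-d question context
att = SoftVisualAttention(feat_dim=512, ctx_dim=512)
V = torch.randn(2, 196, 512)
c = torch.randn(2, 512)
alpha, v_hat = att(V, c)
print(alpha.shape, v_hat.shape)   # torch.Size([2, 196]) torch.Size([2, 512])
</pre>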
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with the most "attention" or importance are whitened; the machine learning model has learned to steer its attention based on the given question.]]<br />
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, the paper also addresses the<br />
problem of question attention. The main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. <br />
* A novel convolution-pooling strategy at the phrase level to adaptively select the phrase sizes whose representations are passed to the question level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explanation'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. The paper uses three different representations of the visual question, one for each level of the hierarchy: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
Each of $Q^w, Q^p, Q^s$ contains exactly $T$ embeddings (sequential data with a temporal dimension), regardless of its level in the hierarchy, i.e. word, phrase or question. <br />
|-<br />
|$V = {v_1,..,v_N}$<br />
|$V$ represents feature vectors from $N$ different locations within the given image; therefore, $v_n$ is the feature vector from the image at location $n$. Collectively, $V$ covers the entire spatial extent of the image. One can extract these location-sensitive features from a convolutional layer of a CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of the image and question at each level in the hierarchy, where $r \in \{w,p,s\}$. Basically, it is the sum of $Q$ or $V$ weighted by the attention $a^q$ or $a^v$ at each level of the hierarchy. <br />
For example, at the "word" level, $a^q_w$ and $a^v_w$ are probability distributions representing the importance of each word in the visual question and each location within the image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are the final feature vectors for the given question and image with the attention maps ($a^q_w$ and $a^v_w$) applied at the "word" level; similarly for the "phrase" and "question" levels.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents learnable weights; biases are omitted from the equations for simplicity (the reader should assume they exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularity in their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note that each level depends on the previous one: phrase level representations are extracted from the word level, and question level representations come from the phrase level, as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
The 1-hot encodings of the question's words $Q = \{q_1,..,q_T\}$ are transformed into a vector space (learned end-to-end), which gives the word level embeddings of the visual question, i.e. $Q^w = \{q^w_1,...,q^w_T\}$. The paper learns this transformation end-to-end instead of using a pretrained model such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase level embedding vectors are calculated by applying 1-D convolutions to the word level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three <br />
window sizes (unigram, bigram and trigram) is computed, as illustrated by Figure 4. For the ''t-th'' word, <br />
the output from the convolution with window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
where $W_c^s$ are the weight parameters. The features from the three n-grams are combined using a ''max-pooling'' operator to obtain the phrase-level embedding vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
<br />
=== Question Level ===<br />
For the question level representation, an LSTM is used to encode the sequence $q_t^p$ after max-pooling. The corresponding question-level feature $q_t^s$ at time ''t'' is the <br />
LSTM hidden vector $h_t$ at time ''t''.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
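<br />
The following is a rough PyTorch sketch of the three-level question encoding described above: word embeddings, 1-D convolutions with unigram/bigram/trigram windows, max-pooling over the n-grams, then an LSTM. The vocabulary size, embedding dimension and the exact padding/cropping scheme are assumptions made for illustration, not details taken from the paper.<br />
<pre>
import torch
import torch.nn as nn

class QuestionHierarchy(nn.Module):
    """Word -> phrase -> question level features; padding and cropping keep the
    sequence length T unchanged at every level."""
    def __init__(self, vocab_size=10000, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)              # word level
        # 1-D convolutions over the T dimension with window sizes 1, 2, 3
        self.conv1 = nn.Conv1d(d, d, kernel_size=1)
        self.conv2 = nn.Conv1d(d, d, kernel_size=2, padding=1)
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(d, d, batch_first=True)           # question level

    def forward(self, q_tokens):                  # q_tokens: (batch, T) word indices
        Qw = self.embed(q_tokens)                  # (batch, T, d)  word level
        x = Qw.transpose(1, 2)                     # (batch, d, T) for Conv1d
        g1 = torch.tanh(self.conv1(x))             # unigram: length T
        g2 = torch.tanh(self.conv2(x))[:, :, :-1]  # bigram: padding gives T+1, crop one
        g3 = torch.tanh(self.conv3(x))             # trigram: length T
        Qp = torch.max(torch.stack([g1, g2, g3]), dim=0).values   # max over the n-grams
        Qp = Qp.transpose(1, 2)                    # (batch, T, d)  phrase level
        Qs, _ = self.lstm(Qp)                      # (batch, T, d)  question level
        return Qw, Qp, Qs

enc = QuestionHierarchy()
tokens = torch.randint(0, 10000, (2, 8))           # batch of 2 questions, T = 8 words
Qw, Qp, Qs = enc(tokens)
print(Qw.shape, Qp.shape, Qs.shape)                # all (2, 8, 512)
</pre>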
<br />
== Co-Attention Mechanism ==<br />
The paper proposes two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy, yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ indexes the levels in the hierarchy, i.e. $r \in \{w,p,s\}$ (refer to the [[:Notations]] section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously as shown in Figure 5. In the paper, an "affinity matrix" is used to calculate the<br />
"attention" or affinity for every pair of image location and question part at each level in the hierarchy (word, phrase and question). Remember, there are $N$ image locations and $T$ <br />
question parts, thus the affinity matrix is in $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize out the affinity over the locations<br />
of the other modality, i.e. $a_v[n] = \underset{i}{max}(C_{i,n})$ and $a_q[t] = \underset{j}{max}(C_{t,j})$. Their notation here is not rigorous: $a_v[n]$ is actually row number $\underset{i}{argmax}(C_{i,n})$ of matrix $C$, and $a_q[t]$ is column number $\underset{j}{argmax}(C_{t,j})$ of that matrix. Instead of choosing the max activation, the paper considers the affinity matrix as a feature and learns to predict image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV)C^T)\\<br />
a_v = softmax(w_{hv}^T H_v), \quad a_q = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind the above equations is that the image/question attention maps should be functions of the question and image features jointly; therefore, the authors have<br />
developed two intermediate parametric representations $H_v$ and $H_q$ that take the affinity matrix $C$, the image features $V$ and the question features $Q$ as input. The affinity matrix $C$ <br />
transforms the question attention space into the image attention space (and vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sums of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The reason they use $tanh$ <br />
for $H_q$ and $H_v$ is not specified in the paper, but my assumption is that they want to allow negative contributions for certain unfavourable pairs of image location and question fragment. Unlike $ReLU$ or $sigmoid$, $tanh$ ranges over $[-1, 1]$ and is thus an appropriate choice.<br />
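<br />
A minimal PyTorch sketch of the parallel co-attention equations above is given below; the matrix shapes follow the notation in this section, while the dimensions $d$, $k$, $N$, $T$, the initialization and the omission of biases are illustrative choices rather than the paper's exact settings.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """C = tanh(Q^T W_b V), intermediate maps H_v / H_q, softmax attentions and
    the attention-weighted summaries (biases omitted, as in the summary)."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W_v = nn.Parameter(torch.randn(k, d) * 0.01)
        self.W_q = nn.Parameter(torch.randn(k, d) * 0.01)
        self.w_hv = nn.Parameter(torch.randn(k) * 0.01)
        self.w_hq = nn.Parameter(torch.randn(k) * 0.01)

    def forward(self, V, Q):
        # V: (batch, d, N) image features, Q: (batch, d, T) question features
        C = torch.tanh(Q.transpose(1, 2) @ self.W_b @ V)             # (batch, T, N)
        H_v = torch.tanh(self.W_v @ V + (self.W_q @ Q) @ C)          # (batch, k, N)
        H_q = torch.tanh(self.W_q @ Q + (self.W_v @ V) @ C.transpose(1, 2))  # (batch, k, T)
        a_v = F.softmax(self.w_hv @ H_v, dim=-1)                     # (batch, N)
        a_q = F.softmax(self.w_hq @ H_q, dim=-1)                     # (batch, T)
        v_hat = torch.bmm(V, a_v.unsqueeze(-1)).squeeze(-1)          # (batch, d)
        q_hat = torch.bmm(Q, a_q.unsqueeze(-1)).squeeze(-1)          # (batch, d)
        return v_hat, q_hat, a_v, a_q

coatt = ParallelCoAttention()
V = torch.randn(2, 512, 196)   # 196 image locations
Q = torch.randn(2, 512, 8)     # 8 question positions
v_hat, q_hat, a_v, a_q = coatt(V, Q)
print(v_hat.shape, q_hat.shape, a_v.shape, a_q.shape)
</pre>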
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, the authors sequentially alternate between generating image and question attention, as shown in Figure 6. <br />
Briefly, this consists of three steps<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, the paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from the question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)𝟙^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum_{i}{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $𝟙$ is a vector with all elements equal to 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Briefly,<br />
* At the first step of alternating co-attention, $X = Q$ and $g$ is $0$. <br />
* At the second step, $X = V$ where $V$ is the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step.<br />
* Finally, the attended image feature $\hat{v}$ is used as the guidance to attend to the question again, i.e., $X = Q$ and $g = \hat{v}$ (a sketch of these steps is given below). <br />
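<br />
Below is a rough sketch of the attention operation $\mathcal{A}(X, g)$ and the three alternating steps; it reuses a single set of parameters across steps for brevity, whereas the paper may use separate parameters per step, and all dimensions are illustrative.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOp(nn.Module):
    """x_hat = A(X, g): attend over the columns of X guided by g (biases omitted)."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_x = nn.Parameter(torch.randn(k, d) * 0.01)
        self.W_g = nn.Parameter(torch.randn(k, d) * 0.01)
        self.w_hx = nn.Parameter(torch.randn(k) * 0.01)

    def forward(self, X, g):
        # X: (batch, d, M) features, g: (batch, d) guidance (zeros on the first step)
        H = torch.tanh(self.W_x @ X + self.W_g @ g.unsqueeze(-1))   # broadcast of (W_g g) 1^T
        a = F.softmax(self.w_hx @ H, dim=-1)                        # (batch, M)
        return torch.bmm(X, a.unsqueeze(-1)).squeeze(-1)            # (batch, d)

# Alternating co-attention (illustrative: one shared operation for all three steps)
attend = AttentionOp()
Q = torch.randn(2, 512, 8)      # question features
V = torch.randn(2, 512, 196)    # image features
s_hat = attend(Q, torch.zeros(2, 512))   # step 1: summarize the question (g = 0)
v_hat = attend(V, s_hat)                 # step 2: attend to the image given the summary
q_hat = attend(Q, v_hat)                 # step 3: attend to the question given the image
print(v_hat.shape, q_hat.shape)
</pre>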
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicting answers (source: Figure 3 (b) of original paper on page #5)]]<br />
The paper treats predicting the final answer as a classification task. This was surprising because I had always thought the answer would be a sequence; however, since an MLP is used, it is apparent that the answer must be a single word. The co-attended image and question features from all three levels are combined together for the final prediction, see Figure 7. Basically, a multi-layer perceptron (MLP) is deployed to recursively encode the attention features as follows.<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh_s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
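<br />
A minimal sketch of this recursive answer encoding in PyTorch; the hidden size and the size of the answer vocabulary are assumptions made for illustration.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    """Recursively encodes the co-attended features from the three levels
    and classifies over a fixed answer vocabulary (size is illustrative)."""
    def __init__(self, d=512, hidden=512, num_answers=1000):
        super().__init__()
        self.W_w = nn.Linear(d, hidden)
        self.W_p = nn.Linear(d + hidden, hidden)
        self.W_s = nn.Linear(d + hidden, hidden)
        self.W_h = nn.Linear(hidden, num_answers)

    def forward(self, qw, vw, qp, vp, qs, vs):
        h_w = torch.tanh(self.W_w(qw + vw))                             # word level
        h_p = torch.tanh(self.W_p(torch.cat([qp + vp, h_w], dim=-1)))   # + phrase level
        h_s = torch.tanh(self.W_s(torch.cat([qs + vs, h_p], dim=-1)))   # + question level
        return F.log_softmax(self.W_h(h_s), dim=-1)                     # answer distribution

model = AnswerPredictor()
feats = [torch.randn(2, 512) for _ in range(6)]   # q_hat and v_hat at the three levels
log_p = model(*feats)
print(log_p.shape)   # torch.Size([2, 1000])
</pre>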
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach seems to outperform most of the state-of-the-art techniques, as shown in Tables 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the result is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the result is not available (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantify the importance of individual components in the architecture. The idea is to re-train the model with components ablated. The detailed settings are listed as follows.<br />
* Image Attention alone (to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question Attention alone<br />
* W/O Conv (replace convolution and pooling by stacking another word embedding layer on top of the word level outputs)<br />
* W/O W-Atten (replace the word level attention with a uniform distribution)<br />
* W/O P-Atten (no phrase level co-attention is performed, and the phrase level attention is set to be uniform. Word and question level co-attentions are still modeled)<br />
* W/O Q-Atten (no question level co-attention is performed, while word and phrase level co-attentions are still modeled)<br />
<br />
The results of such ablation experiments can be seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, the ablated models generally under-perform. However, it is interesting to see that in some settings the full model does not outperform the ablated model.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by their method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* The model attends mostly to the object regions in an image, and to object words in the question as well, e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** The different attention patterns are perhaps caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because their model performs co-attention at three levels, it often captures complementary information from<br />
each level, and then combines them to predict the answer. However, it is somewhat unintuitive to visualize the <br />
phrase and question level attention maps applied directly to the words of the question: since phrase <br />
and question level features are compound features built from multiple words, their attention contribution to the <br />
actual words of the question cannot be clearly understood. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Co-attention allows the model to attend to different regions of the image as well as different fragments of the question. <br />
* The question is hierarchically represented at three levels to capture information at different granularities. <br />
* Visualizations show the model co-attends to interpretable regions of images and questions for predicting the answer. <br />
* Though their model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is a very intuitively relevant idea that closely resembles the way human brains tackle VQA tasks, and it could be developed further to deliver sequence-based answers and sentence generation. The authors could have used a more powerful, more scalable word-encoding technique such as GloVe or bag-of-words, which result in lower-dimensional vectors, thereby opening doors for more learning techniques like sentence-answer generation. Since word encoding is treated as a separate task here, bag-of-words could work, but if we need a more temporal technique, we could use the Position Encoding mechanism [3], which accounts for the position of the word in the sequence itself. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question itself is too long, will such mutual attention be misleading? A further study could examine whether a long and complex question, compared to a simple one, influences the performance of the model.<br />
<br />
* The idea of the paper seems great, but 0.2% improvement over the state-of-the-art performance on VQA dataset isn't significant. It would have been good to show some incorrect samples to indicate why the error was still so high. In fact there is already a new paper [4] that won the 2017 VQA challenge and it significantly outperforms all the previous methods on VQA dataset giving an accuracy of 69%.<br />
<br />
= Reference =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=31543STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-27T19:51:05Z<p>Asriram: /* Gated block */</p>
<hr />
<div>=Introduction=<br />
This work is based on the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN; however, PixelCNN was faster to compute because the training process can be parallelized. In this work, Oord et al. [[#Reference|[2]]] introduce a Gated PixelCNN, a convolutional alternative to the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images with autoregressive connections, modelling images pixel by pixel by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN improves over the PixelCNN by removing the "blind spot" problem, and to yield better performance, the authors replaced the ReLU units with a gated combination of sigmoid and tanh activations. The proposed Gated PixelCNN combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the quicker computation of the PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduce a conditional variant (called Conditional PixelCNN) which can generate images based on class labels, tags, or latent embeddings to create new image density models. These embeddings capture high-level information of an image and can be used to generate a large variety of images with similar features; this provides insight into the invariances of the embeddings, for instance enabling the authors to generate different poses of the same person from a single image. Finally, the authors also present a PixelCNN auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative method wherein, given an image $x$ of $n \times n$ pixels, we iterate over the pixels, employ feedback and capture pixel densities from every previous pixel to predict the "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopt the joint distribution $p(x)$, wherein the probability of a given image is the product of conditional distributions. Hence, the authors employ autoregressive models, which means they just use the plain chain rule for the joint distribution, depicted in Equation 1. So the very first pixel is independent, the second depends on the first, the third depends on the first and second, and so on. Basically, the image is modelled as a sequence of points where each pixel depends on the previous ones. Equation 1 depicts the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel, which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability obtained from the chain rule - a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution, showing that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixel values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the most probable index from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
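<br />
The chain rule above translates directly into a sequential sampling loop: each pixel (in raster order) is drawn from the softmax over 256 intensities produced by the network given everything generated so far. The sketch below assumes a hypothetical <code>model</code> that returns per-pixel, per-channel logits; its output shape is an illustrative assumption, not the paper's interface.<br />
<pre>
import torch

def sample_image(model, height=32, width=32, channels=3, device="cpu"):
    """Raster-scan sampling: each pixel/channel is drawn from the softmax over
    256 intensities, conditioned on everything generated so far.
    `model(x)` is assumed to return logits of shape (1, 256, channels, H, W)."""
    x = torch.zeros(1, channels, height, width, device=device)
    with torch.no_grad():
        for i in range(height):
            for j in range(width):
                for c in range(channels):
                    logits = model(x)[0, :, c, i, j]            # (256,) for this position
                    probs = torch.softmax(logits, dim=0)
                    value = torch.multinomial(probs, 1).item()  # sample an intensity
                    x[0, c, i, j] = value / 255.0               # feed it back in
    return x
</pre>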
<br />
So, the PixelCNN maps a neighborhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; every conditional distribution is modelled by a convolutional neural network. For instance, suppose we have a $5\times5$ image (let's represent each pixel by a letter, with zero-padding) and a filter of dimension $3\times3$ that slides over the image, multiplying each element and summing to produce a single response. However, we cannot use this filter directly because pixel $a$ should not know the pixel intensities of $b,f,g$ (future pixel values). To counter this issue, the authors apply a mask on top of the filter to keep only prior pixels, zeroing out the future pixels to exclude them from the calculation - depicted in Figure 3 [[#Reference|[7]]]. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
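<br />
A minimal PyTorch sketch of such a masked convolution is shown below; it masks only the spatial positions (mask 'A' excludes the centre pixel, mask 'B' includes it) and, for brevity, omits the R/G/B channel ordering described above.<br />
<pre>
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at and after (mask 'A') or strictly after
    (mask 'B') the centre position, so a pixel never sees 'future' pixels.
    Channel-wise R/G/B ordering is omitted for brevity."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                       # (out, in, kH, kW)
        # zero everything right of the centre in the centre row (and the centre for 'A')
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0
        mask[:, :, kH // 2 + 1:, :] = 0                           # zero all rows below centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask       # re-apply the mask before each convolution
        return super().forward(x)

# usage: a first-layer mask 'A' convolution over an RGB image
conv = MaskedConv2d("A", in_channels=3, out_channels=64, kernel_size=7, padding=3)
out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
</pre>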
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to capture the whole receptive field, depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both stacks are combined per layer to form the output. Hence, every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple in Figure 7), the convolution operation conditions only on the current row, so it has access to the pixels on the left. In essence, we take a $1 \times (n//2+1)$ convolution with a shift (pad and crop) rather than a $1\times n$ masked convolution. So, we perform the convolution on the row with a kernel of width 2 pixels (instead of 3), and the output is padded and cropped such that the image shape stays the same. Hence, the image is convolved with a kernel of width 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel of the output (just before the ‘Crop here’ line) does not hold information from the last input sample (the dashed line).<br />
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all top pixels. The vertical stack uses a kernel of size $(n//2 + 1) \times n$, with the input image padded with extra rows at the top and bottom. Thereafter, we perform the convolution operation and crop the image to force the predicted pixel to depend on the upper pixels only (i.e. to preserve the spatial dimensions) [[#Reference|[3]]]. Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated, as no target pixel is touched. The computed pixel from the vertical stack carries information from the top pixels and sends that information to the horizontal stack (which eliminates the "blind spot" problem).<br />
<br />
[[File:vertical_mask.gif|300px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel-height zeros, then the convolution operation is performed, and the output is cropped so that the rows are shifted by one with respect to the input image. Hence, the first row of the output does not depend on the first (real, non-padded) input row. Also, the second row of the output only depends on the first input row - which is the desired behaviour.<br />
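<br />
The pad-and-crop trick for the two stacks can be sketched as follows; the kernel sizes, channel counts and the strictly causal horizontal shift are illustrative choices rather than the paper's exact implementation.<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

def vertical_stack_conv(x, weight, n=3):
    """Rows above only: pad k = n//2+1 zero rows on top, convolve with a (k x n)
    kernel, then crop so that output row i sees only input rows above i."""
    k = n // 2 + 1
    x = F.pad(x, (n // 2, n // 2, k, 0))      # (left, right, top, bottom) padding
    out = F.conv2d(x, weight)                 # kernel shape (C_out, C_in, k, n)
    return out[:, :, :-1, :]                  # drop the extra bottom row

def horizontal_stack_conv(x, weight, n=3):
    """Pixels strictly to the left: pad k = n//2+1 zero columns on the left,
    convolve with a (1 x k) kernel, then crop the extra column on the right.
    (Layers after the first would pad one column less to include the current pixel.)"""
    k = n // 2 + 1
    x = F.pad(x, (k, 0, 0, 0))
    out = F.conv2d(x, weight)                 # kernel shape (C_out, C_in, 1, k)
    return out[:, :, :, :-1]

x = torch.randn(1, 16, 8, 8)
w_v = torch.randn(32, 16, 2, 3)               # (n//2+1) x n vertical kernel
w_h = torch.randn(32, 16, 1, 2)               # 1 x (n//2+1) horizontal kernel
print(vertical_stack_conv(x, w_v).shape, horizontal_stack_conv(x, w_h).shape)
</pre>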
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighborhood of previous pixels, whereas the PixelCNN only takes into consideration a neighborhood whose size is bounded by the receptive field of its convolution layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions [[#Reference|[3]]]. To bring these benefits of the PixelRNN into the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
* $*$ is the convolutional operator.<br />
* $\odot$ is the element-wise product. <br />
* $tanh(W_{k,f} \ast x)$ is a classical convolution with tanh activation function.<br />
* $\sigma(W_{k,g} \ast x)$ are the gate values (0 = gate closed, 1 = gate open).<br />
* $W_{k,f}$ and $W_{k,g}$ are learned weights.<br />
* $f, g$ denote the filter and the gate feature maps, respectively<br />
<br />
This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
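<br />
A minimal sketch of this gated activation unit is given below; for brevity it uses a plain (unmasked) convolution producing $2p$ feature maps that are split into the filter and gate halves, whereas in the actual model the masked/stacked convolutions from the previous section would be used.<br />
<pre>
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ⊙ sigmoid(W_g * x), implemented as a single convolution
    producing 2p feature maps that are then split into filter and gate halves."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)   # unmasked here, for brevity

    def forward(self, x):
        f, g = self.conv(x).chunk(2, dim=1)    # split 2p maps into two groups of p
        return torch.tanh(f) * torch.sigmoid(g)

gate = GatedActivation(channels=16)
y = gate(torch.randn(1, 16, 32, 32))
print(y.shape)   # torch.Size([1, 16, 32, 32])
</pre>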
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; the vertical stack contributes to the horizontal stack through a $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are partly independent: the vertical stack should not access any information the horizontal stack has, otherwise it would have access to pixels it should not see. However, the vertical stack can feed into the horizontal stack, since the horizontal stack predicts the pixel following those covered by the vertical stack. In particular, the convolution operations are shown in green (which are masked), and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single operation (the masked convolution), shown in blue, to increase parallelization; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ convolutions followed by a shift in pixels (by padding and cropping) to recover the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are only used in the horizontal stack. Skip connections, on the other hand, allow us to incorporate features from all layers at the very end of the network. Most importantly, the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing the network with the class associated with each image in the MNIST/CIFAR datasets. During training you feed both the image and its class to the network, so that the network learns to incorporate that information as well. During inference you can specify which class your output image should belong to. You can pass any information you want with conditioning; we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$; the purpose of the latent vector is to condition the distribution $p(x|h)$ so that we get a probability of how well an image suits this description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, the conditional distribution now depends on the latent vector $h$, which is added to the activations prior to the non-linearities; the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside the tanh and sigmoid functions; the matrix $V$ has shape [number of classes, number of filters], $k$ is the layer number, and the classes are passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector $h$ is a one-hot encoding vector that provides the class label, this is equivalent to adding a class-dependent bias at every layer. This means that the conditioning is independent of the location of the pixel - the latent vector holds information about “what the image should contain” rather than where the contents are located in the image. For instance, we could specify that a certain animal or object should appear, in different positions, poses and backgrounds.<br />
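<br />
The class-conditional bias can be sketched as follows; the number of classes, channel sizes and the use of an unmasked convolution are illustrative assumptions rather than the paper's exact configuration.<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGatedActivation(nn.Module):
    """Gated activation with a class-conditional bias:
    y = tanh(W_f * x + V_f^T h) ⊙ sigmoid(W_g * x + V_g^T h)."""
    def __init__(self, channels, num_classes, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)             # unmasked, for brevity
        self.cond = nn.Linear(num_classes, 2 * channels, bias=False)  # V_f and V_g stacked

    def forward(self, x, h):
        # x: (batch, channels, H, W); h: (batch, num_classes) one-hot class vector
        bias = self.cond(h)[:, :, None, None]     # broadcast over all spatial positions
        f, g = (self.conv(x) + bias).chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

layer = ConditionalGatedActivation(channels=16, num_classes=10)
x = torch.randn(4, 16, 32, 32)
h = torch.eye(10)[torch.randint(0, 10, (4,))]      # one-hot labels for a batch of 4
print(layer(x, h).shape)                           # torch.Size([4, 16, 32, 32])
</pre>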
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on location (useful for applications where the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location-dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=== PixelCNN Auto-Encoders ===<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy to the image decoders used in autoencoders. Introduced by Hinton et al. in [[#Reference|[5]]], the autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder which maps the input image into a low-dimensional representation (i.e. the latent vector $h$), and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with the conditional PixelCNN, and the re-architected network is then trained on a dataset. The authors observe that the encoder can then extract better representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditional Gated PixelCNN model on the CIFAR-10 dataset. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed, where a lower score means that the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90), outperforming by 0.11 bits/dim the PixelCNN, which obtains 3.14 (3.08). Although the quantitative improvement is modest, visually the quality of the samples produced is much better for the Gated PixelCNN when compared to the PixelCNN. It is important to note that the Gated PixelCNN came close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), and the corresponding training performance is in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluated the Gated PixelCNN model by conditioning on the classes of the ImageNet images. Using the one-hot encoding $h_i$ for the $i$-th class, the distribution becomes $p(x|h_i)$; this conditioning provides the model with roughly $\log(1000) \approx 0.003$ bits/pixel of additional information for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, visually the quality of the generated images was much better when compared to the original PixelCNN. <br />
<br />
Figure 11 shows some samples from 8 different classes of ImageNet images from a single class-conditioned model. It is evident that the Gated PixelCNN can better distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings from the animal and object class, when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of a CNN trained on a large database of portraits automatically cropped from Flickr images using a face detector. This pre-trained network was trained with a triplet loss function, which encourages the latent embeddings of the same face to be similar across the entire dataset. The motivation of the triplet loss function (Schroff et al.) is to ensure that an image $x^a_i$ (anchor) of a specific person is closer to all other images $x^p_i$ (positive) of the same person than it is to any image $x^n_i$ (negative) of any other person. The triplet loss function is given by<br />
\[<br />
L = \sum_{i} \left[ \| h(x^a_i) - h(x^p_i) \|^2_2 - \| h(x^a_i) - h(x^n_i) \|^2_2 + \alpha \right]_+<br />
\]<br />
where $h$ is the embedding of the image $x$, and $\alpha$ is a margin that is enforced between positive and negative pairs.<br />
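<br />
A minimal NumPy sketch of this triplet loss (the margin value and the embedding shapes below are illustrative, not taken from the paper):<br />
<pre>
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: embedding batches of shape (batch, dim).
    # Penalizes triplets whose anchor-positive distance is not at least
    # `margin` smaller than their anchor-negative distance.
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    return np.sum(np.maximum(pos_dist - neg_dist + margin, 0.0))

rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(8, 128)) for _ in range(3))
print(triplet_loss(a, p, n))
</pre>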
<br />
In essence, the authors took the latent vectors from this supervised pre-trained network, forming (image=$x$, embedding=$h$) tuples, and trained the<br />
Conditional PixelCNN on these latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model can compute the latent embedding $h=f(x)$ and then generate new portraits of the same person. Figure 12 provides a pictorial example, where it is evident that the generative model can produce a variety of images, independent of pose and lighting conditions, by extracting the latent embedding from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is to the left, whereas the portraits to the right are generated from the high-level latent representation.]]<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. The authors start by training a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized using mean squared error. It is important to note that both models use either a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 provides reconstructions from both models. It is evident that the representations learned by the PixelCNN autoencoder are quite different from those of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people without directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from a PixelCNN used as the decoder of the autoencoder. It is important to note that both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, which is an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN is able to match, and in some cases outperform, PixelRNN. In order to deal with the "blind spots" in the receptive fields of the PixelCNN, the newly proposed Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors use a custom-made gated combination of tanh and sigmoid functions in place of the ReLU activations, because these multiplicative units help to model more complex interactions. The proposed network obtains performance similar to PixelRNN on CIFAR-10, and it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model is able to generate new images of the same person in different poses and lighting conditions given a single image. Finally, the authors also showed that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is similar to results in the literature, the samples generated by the PixelCNN autoencoder are of high visual quality, showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# PixelCNN AutoEncoders<br />
<br />
=Critique=<br />
# The paper is not very descriptive and does not explain well how the horizontal and vertical stacks solve the "blind spot" problem. In addition, the authors mention the "gated block" and how they designed it, but they do not explain the intuition behind it or why this approach is an improvement over the PixelCNN.<br />
# The authors do not provide a good pictorial representation of any of the aforementioned novelties.<br />
# The PixelCNN AutoEncoder section is not descriptive enough.<br />
# An alternative method of tackling the "blind spot" problem would be to increase the effective receptive field size itself [10]. This can be done in two ways: <br />
*Increasing the depth of the convolution filters<br />
*Adding subsampling layers<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Networks", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].<br />
# W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. arXiv preprint arXiv:1701.04128, 2017</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=31533STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-26T17:01:32Z<p>Asriram: /* Vertical Stack */</p>
<hr />
<div>=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From that previous work, the authors observed that PixelRNN performed better than PixelCNN, but PixelCNN was faster to train because the training process can be parallelized. In this work, Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional variant based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images, modelling them pixel by pixel with autoregressive connections by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN improves on the PixelCNN by removing the "blind spot" problem and, to yield better performance, the authors replace the ReLU units with a gated combination of sigmoid and tanh activations. The proposed Gated PixelCNN combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the quicker computation of the PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduce a conditional variant (called Conditional PixelCNN) which can generate images based on class labels, tags, or latent embeddings to create new image density models. These embeddings capture high-level information about an image and allow a large variety of images with similar features to be generated; for instance, by conditioning on the embedding of a single portrait, the authors can generate different poses of the same person, which provides insight into the invariances encoded by the embeddings. Finally, the authors also present a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative method: given an image with $n^2$ pixels $x_1, \ldots, x_{n^2}$, we iterate over the pixels and use the previously generated ones to predict the density of the next, "unknown", pixel $x_i$. To do this, the traditional PixelCNNs and PixelRNNs write the joint distribution $p(x)$ over the pixels of an image as a product of conditional distributions. In other words, these are autoregressive models: they simply apply the chain rule to the joint distribution, as shown in Equation 1. The very first pixel is unconditioned, the second depends on the first, the third depends on the first and second, and so on; the image is modelled as a sequence of points where each pixel depends on all previous ones. Equation 1 depicts the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of an image $x$, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel given the values of all previous pixels. It is important to note that $p(x_1, ..., x_{n^2})$ is the joint probability obtained by the chain rule - a product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_2, x_1)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution: pixels are computed pixel-by-pixel for every row, and each pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, a softmax layer at the end of the PixelCNN predicts the pixel intensity value (i.e. the most probable value from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
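<br />
To make the sequential generation procedure concrete, here is a minimal sampling loop (an illustrative sketch only; <code>predict_logits</code> is a placeholder standing in for the trained PixelCNN):<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)

def sample_image(predict_logits, height=8, width=8):
    # predict_logits(image, row, col) is assumed to return 256 unnormalized
    # scores for the intensity of pixel (row, col), given the pixels that
    # have already been sampled above it and to its left.
    image = np.zeros((height, width), dtype=np.int64)
    for row in range(height):
        for col in range(width):
            logits = predict_logits(image, row, col)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                       # softmax over 256 intensities
            image[row, col] = rng.choice(256, p=probs)
    return image

# Dummy "model" that ignores its context and predicts a uniform distribution.
generated = sample_image(lambda img, r, c: np.zeros(256))
</pre>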
<br />
The PixelCNN, then, maps a neighborhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$, and every conditional distribution is modelled by a convolutional neural network. For instance, consider a $5\times5$ image (label each pixel with a letter and zero-pad the borders) and a filter of dimension $3\times3$ that slides over the image, multiplying element-wise and summing to produce a single response. We cannot use this filter directly, because pixel $a$ should not have access to the intensities of $b,f,g$ (future pixel values). To counter this issue, the authors place a mask on top of the filter that keeps only prior pixels and zeroes out the weights on future pixels, removing them from the calculation - depicted in Figure 3 [[#Reference|[7]]]. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
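<br />
A small sketch of how such a mask can be built (mask type "A", used in the first layer, also hides the current pixel itself, while type "B", used in later layers, keeps it; the code is illustrative):<br />
<pre>
import numpy as np

def causal_mask(kernel_size, mask_type="A"):
    # Ones where the filter may look, zeros on "future" pixels.
    mask = np.ones((kernel_size, kernel_size), dtype=np.float32)
    centre = kernel_size // 2
    mask[centre, centre + 1:] = 0.0      # pixels to the right of the centre
    mask[centre + 1:, :] = 0.0           # all rows below the centre
    if mask_type == "A":
        mask[centre, centre] = 0.0       # the current pixel itself
    return mask

print(causal_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
</pre>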
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that pixel contents in that region are ignored. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to capture the whole receptive field, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer. Hence, every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple in Figure 7), the convolution operation conditions only on the current row, so it has access to the pixels to the left. In essence, we take a $1 \times (n//2+1)$ convolution with a shift (pad and crop) rather than a $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3), and the output is padded and cropped such that the image shape stays the same. Hence, the row is convolved with a kernel of width 2 and without any mask.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all pixels above. The vertical stack uses a kernel of size $(n//2 + 1) \times n$, with the input image padded with extra rows of zeros at the top. We then perform the convolution operation and crop the extra rows from the bottom of the output to preserve the spatial dimensions, which forces each predicted pixel to depend on the upper pixels only [[#Reference|[3]]]. Since the vertical filter does not touch any "future" pixel values, only upper pixel values, no masking is needed. The output of the vertical stack, which carries information from the pixels above, is then passed to the horizontal stack (which is what removes the "blind spot" problem).<br />
<br />
[[File:vertical_mask.gif|300px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel-height rows of zeros, the convolution is performed, and the output is then cropped so that the rows are shifted by one with respect to the input image. Hence, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
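<br />
A minimal NumPy sketch of this pad-then-crop trick for the vertical stack (the helper convolution, kernel size and toy input are illustrative):<br />
<pre>
import numpy as np

def conv2d_valid(image, kernel):
    # Plain "valid" 2-D cross-correlation, written with loops for clarity.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def vertical_stack(image, kernel):
    # The kernel covers only rows at and above the current row; padding the
    # top with kernel-height rows and cropping the output shifts everything
    # so that output row i depends only on input rows strictly above i.
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh, 0), (kw // 2, kw // 2)))
    out = conv2d_valid(padded, kernel)
    return out[:image.shape[0], :]

img = np.arange(25, dtype=float).reshape(5, 5)
out = vertical_stack(img, np.ones((2, 3)))
assert np.all(out[0] == 0.0)   # the first output row sees only zero padding
</pre>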
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighborhood of previous pixels, whereas the PixelCNN's accessible neighborhood grows only with the depth of the convolutional layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions [[#Reference|[3]]]. To bring these benefits into the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f$ and $g$ index the two sets of feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
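<br />
The gated unit itself amounts to only a few lines; a sketch (assuming a single masked convolution has already produced the $2p$ feature maps, as in the parallelized implementation described below):<br />
<pre>
import numpy as np

def gated_activation(features):
    # features: output of one masked convolution with 2p channels; the first
    # half plays the role of W_f * x and the second half of W_g * x.
    p = features.shape[-1] // 2
    f, g = features[..., :p], features[..., p:]
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))   # tanh gate times sigmoid gate

x = np.random.randn(4, 4, 8)   # e.g. a 4x4 spatial map with 2p = 8 channels
y = gated_activation(x)        # -> shape (4, 4, 4)
</pre>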
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture, wherein the vertical stack feeds into the horizontal stack through a $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the two stacks are kept largely independent: the vertical stack must not access any information the horizontal stack has, otherwise it would see pixels it should not see, whereas the vertical stack can safely feed the horizontal stack, since the horizontal stack predicts the pixel that follows those covered by the vertical stack. In the figure, the masked convolution operations are shown in green, and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) convolution, shown in blue, to increase parallelization; its $2p$ output feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(1 \times n)$ and $(n \times n)$ masked convolutions can also be implemented as $(1 \times (n//2+1))$ and $((n//2+1) \times n)$ convolutions followed by a shift in pixels, using padding and cropping to recover the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important note is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow us to incorporate features from all layers at the very end of the network. It is also worth noting that the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, providing an image to the network together with its associated class in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that it learns to incorporate that information; during inference, the class that the output image should belong to can be specified. Any kind of information can be passed through conditioning; here we start with just class labels.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$; the purpose of the latent vector is to model the conditional distribution $p(x|h)$, i.e. the probability that an image suits this description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, the conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside the tanh and sigmoid functions; each $V$ matrix has shape [number of classes, number of filters], $k$ is the layer number, and the class is passed as a one-hot vector $h$ during both training and inference.<br />
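<br />
A sketch of one class-conditional gated layer (all shapes and the random weights below are purely illustrative):<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_gate(conv_f, conv_g, h, V_f, V_g):
    # conv_f, conv_g: outputs of the masked convolutions W_{k,f}*x and
    #                 W_{k,g}*x, each of shape (height, width, p).
    # h:              one-hot class vector of shape (num_classes,).
    # V_f, V_g:       conditioning matrices of shape (num_classes, p).
    bias_f = h @ V_f               # V^T h, broadcast over all spatial positions
    bias_g = h @ V_g
    return np.tanh(conv_f + bias_f) * sigmoid(conv_g + bias_g)

num_classes, p = 10, 16
h = np.eye(num_classes)[3]         # condition on class label 3
y = conditional_gate(np.random.randn(8, 8, p), np.random.randn(8, 8, p),
                     h, np.random.randn(num_classes, p),
                     np.random.randn(num_classes, p))
</pre>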
<br />
Note that if the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. This means that the conditioning is independent of the location of the pixel, which is appropriate when the latent vector describes "what the image should contain" rather than where the contents should be located. For instance, we could specify that a certain animal or object should appear, in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=== PixelCNN Auto-Encoders ===<br />
Since conditional PixelCNNs can model images from the distribution $p(x|h)$, it is possible to apply this idea to the image decoders used in autoencoders. Introduced by Hinton et al. in [[#Reference|[5]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder which maps the input image into a low-dimensional representation (i.e. the latent vector $h$), and a decoder that decompresses the latent vector to reconstruct the original image. <br />
</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:vertical_mask.gif&diff=31532File:vertical mask.gif2017-11-26T17:00:00Z<p>Asriram: Asriram uploaded a new version of File:vertical mask.gif</p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Domain_Adaptation_with_Residual_Transfer_Networks&diff=31127Unsupervised Domain Adaptation with Residual Transfer Networks2017-11-23T02:34:13Z<p>Asriram: /* Conclusion */</p>
<hr />
<div>== Introduction ==<br />
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation] is a problem in machine learning which involves taking a model which has been trained on a source domain and applying it to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labelled, while the target data is (predominantly) unlabeled. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks this problem has become particularly salient due to the need for vast amounts of labeled training data in order to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in the absence of such data, networks are rarely performant. Attempts to circumvent this drought of data typically involve gathering "off-the-shelf" data sets which are tangentially related and contain labels, and then building models in those domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly.<br />
<br />
This paper proposes a method for unsupervised domain adaptation which relies on three key components: <br />
# A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data; <br />
# An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and <br />
# A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.<br />
<br />
This method outperforms state-of-the-art techniques on common benchmark datasets, and is flexible enough to be applied in most feed-forward neural networks.<br />
<br />
[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]] <br />
=== Working Example (Office-31) === <br />
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these data sets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.<br />
<br />
One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).<br />
<br />
== Related Work ==<br />
Broadly speaking, domain adaptation mitigates the need for manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.<br />
<br />
Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation, these methods all only address the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains. <br />
<br />
The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.<br />
<br />
== Residual Transfer Networks ==<br />
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values. <br />
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]<br />
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally we have that $p \neq q$, partially motivating the need for domain adaptation methods. <br />
<br />
We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.<br />
<br />
The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:<br />
<br />
<center><br />
<math display="block"><br />
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)<br />
</math> <br />
</center><br />
<br />
In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.<br />
<br />
=== Structural Overview ===<br />
The model proposed in this paper extends existing CNN's and alters the loss function that is optimized over. While each of these components is discussed in depth below, the overarching architecture involves four components:<br />
<br />
# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.<br />
# A bottleneck layer, used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.<br />
# An MMD block, with the expressed intention of feature adaptation.<br />
# A residual block, with the expressed intention of classifier adaptation. <br />
<br />
This structure is then optimized against a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.<br />
<br />
=== Feature Adaptation ===<br />
Feature adaptation refers to the process in which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution that the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, the cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that CNNs transition from general to specific features as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target will grow through a deeper convolutional net, so a technique for forcing these distributions to be similar is needed.<br />
<br />
In particular, the authors of this paper impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. The authors then employ feature adaptation by means of Maximum Mean Discrepancy, between the source and target domains, on these fusion features.<br />
<br />
[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]<br />
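<br />
A small sketch of this fusion step (the layer widths below are illustrative and not the paper's actual dimensions):<br />
<pre>
import numpy as np

def fusion_features(fcb_activations, fcc_activations):
    # Per-example tensor product z_i = x_i^{fc_b} (outer) x_i^{fc_c}, giving
    # one (d_b x d_c) fusion tensor per example for the MMD penalty below.
    return np.einsum('bi,bj->bij', fcb_activations, fcc_activations)

z = fusion_features(np.random.randn(4, 256), np.random.randn(4, 31))
print(z.shape)   # (4, 256, 31)
</pre>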
<br />
==== Maximum Mean Discrepancy ==== <br />
The Maximum Mean Discrepancy (MMD) is a kernel method that involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote the RKHS $\mathcal{H}_K$ with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f, \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$ to be the feature map associated with $K$, then we can define the MMD between two distributions $p$ and $q$ as follows:<br />
<br />
<center><br />
<math display="block"><br />
d_k(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x^s\sim p}[\phi(x^s)] - \mathbf{E}_{x^t\sim q}[\phi(x^t)]||_{\mathcal{H}_K}<br />
</math><br />
</center><br />
<br />
Effectively, the MMD will compute the self-similarity of $p$ and $q$, and subtract twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. From here we can infer that $p$ and $q$ are equivalent distributions if and only if the $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.<br />
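<br />
A small sketch of this (biased) empirical estimate with the Gaussian kernel (the bandwidth and sample sizes are illustrative):<br />
<pre>
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(z, z') = exp(-||z - z'||^2 / b) between every pair of rows of a and b.
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / bandwidth)

def mmd_squared(source, target, bandwidth=1.0):
    # mean(K_pp) + mean(K_qq) - 2 * mean(K_pq) over the two samples.
    k_ss = gaussian_kernel(source, source, bandwidth)
    k_tt = gaussian_kernel(target, target, bandwidth)
    k_st = gaussian_kernel(source, target, bandwidth)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

rng = np.random.default_rng(0)
same = mmd_squared(rng.normal(size=(100, 5)), rng.normal(size=(100, 5)))
shifted = mmd_squared(rng.normal(size=(100, 5)), rng.normal(loc=2.0, size=(100, 5)))
assert shifted > same   # distributions that differ give a larger MMD
</pre>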
<br />
Two important notes:<br />
# The RKHS, and as such MMD, depend on the choice of the kernel;<br />
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).<br />
<br />
==== MMD for Feature Adaptation in the RTN ====<br />
The authors wish to minimize the MMD between the fusion features, outlined above, derived from the source and target domains. Concretely, this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features of $fc_b$ and $fc_c$ pushes each of those layers towards similar distributions.<br />
<br />
Practically this involves an additional penalty function given by the following:<br />
<br />
<center><br />
<math display="block"><br />
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} - 2\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t} <br />
</math><br />
</center><br />
<br />
Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.<br />
<br />
=== Classifier Adaptation ===<br />
In traditional unsupervised domain adaptation there is a '''shared-classifier assumption''' which is made. In essence, if $f_s(x)$ represents the classifier on the source domain, and $f_t(x)$ represents the classifier on the target domain, then this assumption simply states that $f_s = f_t$. While this may seem reasonable at first glance, it is problematic largely because it is an assumption that is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors here relax this assumption slightly: they postulate that instead of being equivalent, the source and target classifiers differ by some perturbation function $\Delta f$. The general idea is to assume $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the source and target classifiers (pre-activation) and $\Delta f(x)$ is some residual function to be learned.<br />
<br />
The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.<br />
<br />
[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(X).]]<br />
==== Residual Networks Framework ==== <br />
A (Deep) Residual Network, as proposed initially in ResNet, employs residual blocks to assist in the learning process; these blocks were a key component of being able to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference, namely the inclusion of residual blocks - sets of layers which aim to estimate a residual function in place of estimating the function itself. <br />
<br />
That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block decomposes this as $h(x) = F(x) + x$. The layers of the block are used to learn $F(x)$, and the input $x$ is then recombined through element-wise addition to form $h(x) = F(x) + x$. This was initially proposed as a way to allow deeper networks to be trained effectively, but has since been used in novel contexts.<br />
<br />
==== Residual Blocks in the RTN ====<br />
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by an arbitrary perturbation function $\Delta f(x)$, the authors add an additional set of densely connected layers which the source data will flow through. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data, an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.<br />
<br />
It is necessary to note that in this case the output from $fc_c$ passes its non-activated values (i.e. pre-softmax) to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.<br />
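<br />
A structural sketch of this residual block (the hidden width and the ReLU non-linearity inside the block are assumptions made for illustration):<br />
<pre>
import numpy as np

def source_logits(target_logits, W1, W2):
    # target_logits: pre-softmax outputs of the target classifier fc_c, (batch, c).
    # W1, W2:        weights of the two dense layers (fc-1, fc-2) that learn
    #                the residual; their output Delta f is added back to f_T.
    hidden = np.maximum(target_logits @ W1, 0.0)
    delta_f = hidden @ W2
    return target_logits + delta_f            # f_S(x) = f_T(x) + Delta f(x)

batch, c, hidden_dim = 4, 31, 31
f_t = np.random.randn(batch, c)
f_s = source_logits(f_t, np.random.randn(c, hidden_dim),
                    np.random.randn(hidden_dim, c))
</pre>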
<br />
==== Entropy Minimization ====<br />
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve the low-density separation goal; this is adapted here to the case of unsupervised domain adaptation.<br />
<br />
In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:<br />
<br />
<center><br />
<math display="block"><br />
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))<br />
</math><br />
</center><br />
<br />
The combination of the residual learning and the entropy penalty, the authors hypothesize, will enable effective classifier adaptation.<br />
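<br />
The entropy penalty itself is straightforward; a sketch over a batch of target predictions (the probability values are illustrative):<br />
<pre>
import numpy as np

def entropy_penalty(target_probs, eps=1e-12):
    # target_probs: softmax outputs f_t(x_i^t) for the unlabeled target batch,
    # shape (n_t, c). Minimizing the mean entropy pushes predictions towards
    # confident, low-entropy outputs (low-density separation).
    h = -np.sum(target_probs * np.log(target_probs + eps), axis=1)
    return h.mean()

confident = np.array([[0.98, 0.01, 0.01]])
uncertain = np.array([[0.34, 0.33, 0.33]])
assert entropy_penalty(confident) < entropy_penalty(uncertain)
</pre>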
<br />
=== Residual Transfer Network ===<br />
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, the MMD penalty, and the entropy penalty:<br />
<br />
<center><br />
<math display="block"><br />
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}<br />
</math><br />
</center><br />
<br />
where $\gamma$ and $\lambda$ are tradeoff parameters for the entropy penalty and the MMD penalty. As the classifier adaptation proposed in this paper and the feature adaptation studied in [5, 6] are tailored to adapt different layers of deep networks, they are expected to complement each other and yield better performance.<br />
<br />
The full network, which is trained subject to the above optimization problem, thus takes on the following structure.<br />
<br />
[[File:rtn-full-paper-structure.png||center|alt=The Structure of the RTN]]<br />
<br />
== Experiments == <br />
<br />
=== Set-up ===<br />
The performance of the RTN was compared across two key data sets in the area of unsupervised domain adaptation: Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 is comprised of images of 31 different objects from 3 sources: Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''). Office-Caltech is derived by considering 10 classes common to both the Office-31 and the Caltech data sets, thus providing further adaptation possibilities. This provides 6 transfer tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 transfer tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).<br />
<br />
The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled training data and compute accuracy across all unlabeled domain data. The parameters of the model (i.e. $\gamma$, and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks. <br />
<br />
As for specification details, the authors use mini-batch SGD with momentum $0.9$, and with the learning rate adjusted according to $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the fraction of training completed (increasing linearly from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which was optimized for low error on the source. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.<br />
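For clarity, the annealing schedule can be written as a one-line helper (a sketch using the constants quoted above):<br />
<pre>
def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 / (1 + alpha * p)^beta,
    where p in [0, 1] is the fraction of training completed."""
    return eta0 / (1.0 + alpha * p) ** beta

print(learning_rate(0.0), learning_rate(0.5), learning_rate(1.0))
</pre>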
<br />
=== Results ===<br />
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]<br />
In aggregate, the network outperformed all comparison methods across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds additional considerations for classifier adaptation, performs better still.<br />
<br />
In addition, the ablation study found a number of interesting results:<br />
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;<br />
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);<br />
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen over the addition of the entropy penalty.<br />
<br />
Overall, the authors claim that RTN (mmd+ent+res) achieves state-of-the-art performance for unsupervised domain adaptation.<br />
<br />
=== Discussion ===<br />
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]] <br />
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]] <br />
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]<br />
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]<br />
<br />
==== Visualizing Predictions (Versus DAN) ====<br />
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. The images show that the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.<br />
<br />
==== Layer Responses and Classifier Shift ==== <br />
The authors further consider the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to consider the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This provides some merit to the idea of residual learning on the classifiers. <br />
<br />
In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption. <br />
<br />
==== Parameter Sensitivity ==== <br />
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.<br />
<br />
== Conclusion ==<br />
This paper presented a novel approach to unsupervised domain adaptation which relaxed assumptions made by previous models with regard to the shared nature of classifiers. The emphasis of the paper is on unsupervised domain adaptation and on the mismatch between the source and target classifiers, in addition to the difference between the marginal distributions of the source and target domains. The proposed deep residual network learns the perturbation function, defined as the difference between the source and target classifiers. The network also couples feature learning and feature adaptation to minimize the marginal distribution shift. <br />
<br />
Like previous models, this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier by implementing a new residual transfer module as the bridge. In particular, this approach allows for easy integration into existing networks, and can be implemented with any standard deep learning software.<br />
<br />
For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.<br />
<br />
== Critique ==<br />
While the paper presents a clear approach, which empirically attains strong results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from the MMD and entropy penalties alone; the residual block appears to add a marginal, perhaps insignificant, contribution to the outcome. Moreover, the use of the MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are indeed equally effective (and the diminishing returns stem from stacking the three methods), but as presented, I question the utility of the final structure versus a less complicated, less novel approach.<br />
<br />
==References==<br />
# https://en.wikipedia.org/wiki/Domain_adaptation<br />
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/<br />
# Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.<br />
# Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).<br />
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.<br />
# Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." International Conference on Machine Learning. 2015.<br />
# Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.<br />
# Yang, Jun, Rong Yan, and Alexander G. Hauptmann. "Cross-domain video concept detection using adaptive svms." Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007.<br />
# Duan, Lixin, et al. "Domain adaptation from multiple sources via auxiliary classifiers." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.<br />
# Duan, Lixin, et al. "Visual event recognition in videos by learning from web data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012): 1667-1680.<br />
# http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf<br />
# https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space<br />
# Grandvalet, Yves, and Yoshua Bengio. "Semi-supervised learning by entropy minimization." Advances in neural information processing systems. 2005.<br />
# More information on residual functions https://www.youtube.com/watch?v=urAp0DibYlY <br />
<br />
Expert review from the NIPS community can be found in https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/99.html.<br />
<br />
Implementation Example: https://github.com/thuml/Xlearn
<br />
<hr />
= Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks =
<br />
='''Introduction & Background'''=<br />
Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule (Schmidhuber, 1987; Bengio et al., 1992; Andrychowicz et al., 2016; Ravi & Larochelle, 2017), this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model (Santoro et al., 2016) or a Siamese network (Koch, 2015)), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.<br />
<br />
The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.<br />
<br />
==Key Idea==<br />
The key idea underlying this method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task. This can be viewed from a feature learning standpoint as building an internal representation that is broadly suitable for many tasks. If the internal representation is suitable to many tasks, simply fine-tuning the parameters slightly (e.g. by primarily modifying the top layer weights in a feedforward model) can produce good results.<br />
<br />
='''Model-Agnostic Meta Learning (MAML)'''=<br />
The goal of the proposed model is rapid adaptation. This setting is usually formalized as few-shot learning.<br />
<br />
=== Problem set-up ===<br />
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples. <br />
<br />
Let us consider a model denoted by $f$, that maps the observation $\mathbf{x}$ into the output variable $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks. <br />
<br />
Let us consider a generic notion of a task as below. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,..., \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and an episode length $H$. In i.i.d. supervised learning problems,<br />
the length is $H = 1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The loss $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem. <br />
<br />
A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ using only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on the sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples. Notice that the tasks used for meta-testing are held out during meta-training.<br />
<br />
=== MAML Algorithm ===<br />
[[File:model.png|200px|right|thumb|Figure 1: Diagram of the MAML algorithm]]<br />
The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferrable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, we will aim to learn a model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, we will aim to find model parameters that are sensitive to changes in the task, such that small changes in the parameters will produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$, see Fig 1.<br />
<br />
Note that there is no assumption about the form of the model. The only assumptions are that it is parameterized by a vector of parameters $\theta$, and that the loss is smooth enough in $\theta$ that the parameters can be learned using gradient-based techniques. Formally, let us assume that the model is denoted by $f_{\theta}$. When adapting<br />
to a new task $\mathcal{T}_i$, the model’s parameters $\theta$ become $\theta_i'$. In our method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i$. For example, when using one gradient update:<br />
<br />
$$<br />
\theta_i ' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}).<br />
$$<br />
<br />
Here $\alpha$ is the per-task learning rate and is treated as a hyperparameter. For the sake of simplicity, a single gradient update is considered for the rest of the paper. <br />
<br />
The model parameters are trained by optimizing for the performance<br />
of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows: <br />
<br />
$$<br />
\min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})})<br />
$$<br />
<br />
Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta'$. The model aims to optimize the parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task. <br />
<br />
Therefore the meta-learning across the tasks is performed via stochastic gradient descent (SGD), such that the model parameters $\theta$ are updated as:<br />
<br />
$$<br />
\theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})<br />
$$<br />
where $\beta$ is the meta step size. Outline of the algorithm is shown in Algorithm 1. <br />
[[File:ershad_alg1.png|500px|center|thumb]]<br />
<br />
The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through $f$ to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.<br />
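To make the two nested updates concrete, below is a minimal second-order MAML sketch for a regression loss (an illustrative PyTorch implementation under assumed hyperparameters, not the authors' code; the tiny network, <code>alpha</code>, and <code>beta</code> are placeholders):<br />
<pre>
import torch

# A tiny MLP whose forward pass takes the weights explicitly, so that the
# network can also be evaluated at the adapted parameters theta_i'.
def init_params():
    return [(torch.randn(40, 1) * 0.1).requires_grad_(True), torch.zeros(40, requires_grad=True),
            (torch.randn(1, 40) * 0.1).requires_grad_(True), torch.zeros(1, requires_grad=True)]

def forward(params, x):
    w1, b1, w2, b2 = params
    h = torch.relu(x @ w1.t() + b1)
    return h @ w2.t() + b2

def maml_meta_step(params, tasks, alpha=0.01, beta=0.001):
    """One MAML meta-update. `tasks` is a list of (x_train, y_train, x_test, y_test)
    tuples, one per sampled task T_i; MSE is used as the task loss."""
    meta_loss = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:
        inner_loss = ((forward(params, x_tr) - y_tr) ** 2).mean()
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)   # keep graph for second-order terms
        adapted = [p - alpha * g for p, g in zip(params, grads)]             # theta_i' = theta - alpha * grad
        meta_loss = meta_loss + ((forward(adapted, x_te) - y_te) ** 2).mean()
    meta_grads = torch.autograd.grad(meta_loss, params)                      # gradient through the gradient
    return [(p - beta * g).detach().requires_grad_(True) for p, g in zip(params, meta_grads)]
</pre>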
<br />
='''Different Types of MAML'''=<br />
In this section the MAML algorithm is discussed for different supervised learning and reinforcement learning tasks. The differences between each of these tasks are in their loss function and the way the data is generated. In general, this method does not require additional model parameters nor using any additional meta-learner to learn the update of parameters. Compared to other approaches that tend to “learn to compare new examples in a learned metric space using e.g. Siamese networks or recurrence with attention mechanisms”, the proposed method can be generalized to any other problems including classification, regression and reinforcement learning. <br />
<br />
=== Supervised Regression and Classification ===<br />
Few-shot learning is well-studied in this field. For these two types of tasks the horizon $H$ is equal to 1, since the data points are generated i.i.d. <br />
<br />
Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks. <br />
<br />
Regression : For regression we use the mean-square error (MSE):<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\parallel_2^2<br />
$$<br />
<br />
where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ are an input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression tasks, K input/output pairs are provided for learning each task. <br />
<br />
Classification: For classification we use the cross entropy loss:<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = - \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \Big[ \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log (1-f_{\theta}(\mathbf{x}^{(j)})) \Big]<br />
$$<br />
<br />
According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.<br />
<br />
Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.<br />
[[File:ershad_alg2.png|500px|center|thumb]]<br />
<br />
=== Reinforcement Learning ===<br />
In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.<br />
<br />
Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$ , and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon H, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H R_i(\mathbf{x}_t,a_t)\big ]<br />
$$<br />
<br />
<br />
In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.<br />
<br />
Since the expected reward is generally not differentiable due to unknown dynamics, we use policy gradient methods to estimate the gradient both for the model gradient update(s) and the meta-optimization. Since the policy gradient is an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. We detail the algorithm in Algorithm 3, which has the same structure as Algorithm 2 but also requires sampling trajectories from the environment corresponding to task $\mathcal{T}_i$ in steps 5 and 8.<br />
[[File:ershad_alg3.png|500px|center|thumb]]<br />
<br />
='''Experiments'''=<br />
<br />
=== Regression ===<br />
We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid are varied between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have a dimensionality of 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and the true value. The regressor is a neural network model with 2 hidden layers of size 40 with ReLU nonlinearities. When training with MAML, we use one gradient update with K = 10 examples and a fixed step size of 0.01, and use Adam as the meta-optimizer [2]. The baselines are likewise trained with Adam. To evaluate performance, we fine-tune a single meta-learned model on varying numbers of K examples, and compare performance to two baselines: (a) pre-training on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test time, fine-tuning with gradient descent on the K provided points, using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.<br />
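As an illustration of how such a task distribution might be generated (a sketch; the amplitude range $[0.1, 5.0]$, phase range $[0, \pi]$ and input range $[-5, 5]$ follow the original MAML paper's sinusoid setup):<br />
<pre>
import numpy as np

def sample_sine_task(rng):
    """Sample one regression task: a sinusoid with random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    return lambda x: amplitude * np.sin(x + phase)

def sample_k_shot(task, k, rng):
    """Draw K (input, output) pairs uniformly from [-5, 5] for a given task."""
    x = rng.uniform(-5.0, 5.0, size=(k, 1))
    return x, task(x)

rng = np.random.default_rng(0)
task = sample_sine_task(rng)
x_train, y_train = sample_k_shot(task, 10, rng)   # K = 10 adaptation points
</pre>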
<br />
We evaluate performance by fine-tuning the model learned by MAML and the pre-trained model on $K = \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig 2.<br />
<br />
<br />
[[File:ershad_results1.png|500px|center|thumb|Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pre-trained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.]]<br />
<br />
=== Classification ===<br />
<br />
For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets. <br />
<br />
The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.<br />
<br />
The model follows the same architecture as the embedding function, which has 4 modules with 3-by-3 convolutions and 64 filters, followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to also provide a fair comparison against memory-augmented neural networks [3] and to test the flexibility of MAML, results for a non-convolutional network are also provided. <br />
<br />
[[File:ershad_results2.png|500px|center|thumb|Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification, and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks. ]]<br />
<br />
=== Reinforcement Learning ===<br />
Several simulated continuous control environments are used for RL evaluation. In all of the domains, the MAML model is a neural network policy with two hidden layers of size 100, and ReLU activations. The gradient updates are computed using vanilla policy gradient, and trust-region policy optimization (TRPO) is used as the meta-optimizer.<br />
<br />
In order to avoid computing third derivatives, finite differences are used to compute the Hessian-vector products for TRPO. For both learning and meta-learning updates, we use the standard linear feature baseline proposed by [4], which is fitted separately at each iteration for each sampled task in the batch. <br />
<br />
Three baseline models for the comparison are: <br />
(a) pretraining one policy on all of the tasks and then fine-tuning<br />
(b) training a policy from randomly initialized weights<br />
(c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent. <br />
<br />
The baseline models of (a) and (b) are fine-tuned with gradient descent with a manually tuned step size.<br />
<br />
2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, randomly chosen for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to be in the range [-0.1, 0.1]. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within 0.01 of the goal or at the horizon of H = 100. The policy was trained with MAML <br />
to maximize performance after 1 policy gradient update using 20 trajectories. They compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. Results are shown in Fig. 3.<br />
<br />
[[File:ershad_results3.png|500px|center|thumb|Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pretrained network ]]<br />
<br />
Locomotion: To study how well MAML can scale to more complex deep RL problems, we also study adaptation on high-dimensional locomotion tasks with the MuJoCo simulator [5]. The tasks require two simulated robots – a planar cheetah and a 3D quadruped (the “ant”) – to run in a particular direction or at a particular velocity. In the goal velocity experiments, the reward is the negative absolute value of the difference between the current velocity of the agent and a goal, which is chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is H = 200, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which used 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even <br />
just a single gradient update, and continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.<br />
[[File:ershad_results4.png|500px|center|thumb|Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right. ]]<br />
<br />
A conceptual method to achieve fast adaptation in language modeling tasks (not experimented on by the authors) would be to explore methods of attaching an attention kernel, which results in a simple and differentiable loss. This has been implemented in one-shot language modeling, along with state-of-the-art improvements in one-shot learning on ImageNet and Omniglot [7].<br />
<br />
='''Conclusion'''=<br />
<br />
The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since the method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience. To conclude, it is evident that MAML is able to determine good model initializations for several tasks with a small number of gradient steps.<br />
<br />
='''Critique'''=<br />
In my opinion, Model-Agnostic Meta-Learning looks like a simplified form of curriculum learning. It treats all tasks the same over the whole training history, and does not consider the difficulty of the tasks or the adaptation of the neural network to each task. Curriculum learning could be a good idea to speed up the training.<br />
<br />
<br />
='''References'''=<br />
# Schmidhuber, J¨urgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.<br />
# Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.<br />
# Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.<br />
# Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.<br />
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.<br />
# Videos the learned policies can be found in https://sites.google.com/view/maml.<br />
# Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra. "Matching Networks for One Shot Learning". arXiv:1606.04080 [cs.LG]<br />
<br />
Implementation Example: https://github.com/cbfinn/maml
<br />
<hr />
= Deep Exploration via Bootstrapped DQN =
<br />
== Details ==<br />
<br />
'''Title''': Deep Exploration via Bootstrapped DQN<br />
<br />
'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}<br />
<br />
'''Organisations''':<br />
# Stanford University<br />
# Google Deepmind<br />
<br />
'''Conference''': NIPS 2016<br />
<br />
'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]<br />
<br />
'''Online code sources'''<br />
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]<br />
<br />
This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.<br />
<br />
== Intro to Reinforcement Learning ==<br />
<br />
In reinforcement learning, an agent interacts with an environment with the goal to maximize its long-term reward. A common application of reinforcement learning is to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn the strategy to get the maximum long term reward?<br />
<br />
There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.<br />
<br />
[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]<br />
<br />
There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. A lot of these approaches are provably efficient, but assume that the state space is not very large. For instance, the approach called Curiosity-Driven Exploration aims to take actions that lead to immediate additional information. This requires the model to search “every possible cell in the grid” which is not desirable if state space is very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases. The general idea to tackle large or continuous state spaces is by value function approximation. An empirically tested strategy is Value Function Approximation using Fourier Basis [16]. It has also proven to perform well compared to radial basis functions and the polynomial basis, which are the two most popular fixed bases for linear value function approximation. <br />
<br />
This paper presents a new strategy for exploring deep reinforcement learning with discrete actions. In particular, the presented approach uses bootstrapped networks to approximate the posterior distribution of the Q-function. The bootstrapped neural network is comprised of numerous networks that have a shared layer for feature learning, but separate output layers - hence, each network learns a slightly different dataset thereby learning different Q-functions. In addition, the authors also showed that Thompson sampling can work with bootstrapped DQN reinforcement learning algorithm. For validation, the authors tested the proposed algorithm on various Atari benchmark gaming suites. This paper tries to use a Thompson sampling like approach to make decisions.<br />
<br />
== Thompson Sampling<sup>[[#References|[1]]]</sup> ==<br />
<br />
In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (a 2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it is conjugate to Bernoulli rewards. Further, $\beta(1,1)$ is the uniform distribution on $[0,1]$.<br />
<br />
Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,<br />
<br />
$$<br />
\begin{align*}<br />
\mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\<br />
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \beta(x_t, y_t) \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\<br />
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+(1-R^{(t)})-1} \\<br />
&\propto \beta(x_t+R^{(t)}, y_t+1-R^{(t)})<br />
\end{align*}<br />
$$<br />
<br />
[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]<br />
<br />
This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non-Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and, given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$, it is possible to show that the update rule looks like<br />
<br />
$$<br />
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)<br />
$$<br />
<br />
=== How can we use this in reinforcement learning? ===<br />
<br />
We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.<br />
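A minimal sketch of this loop for the Bernoulli bandit described above (illustrative only; the true arm probabilities are made up for the example):<br />
<pre>
import numpy as np

def thompson_bernoulli(true_probs, n_steps=1000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.
    At each step, sample one p from each arm's posterior, play the arm with
    the largest sample, observe a 0/1 reward, and update that arm's Beta posterior."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    successes = np.ones(n_arms)   # Beta alpha parameters
    failures = np.ones(n_arms)    # Beta beta parameters
    total_reward = 0
    for _ in range(n_steps):
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = rng.binomial(1, true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward, successes, failures

print(thompson_bernoulli([0.2, 0.5, 0.8]))
</pre>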
<br />
== Bootstrapping <sup>[[#References|[2,3]]]</sup> ==<br />
<br />
This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a property $\theta$ of the population. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate the estimator of the property, $\hat{\theta}$, for these $n$ points, and make our inference. <br />
<br />
If we later wish to find some property related to the estimator $\hat{\theta}$ itself, e.g. we want a bound of $\hat{\theta}$ such that $\delta_1 \leq \hat{\theta} \leq \delta_2$ with a confidence of $c=0.95$, then we can use bootstrapping for this.<br />
<br />
Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficient number of times $k$, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot\frac{1-c}{2}$<sup>th</sup> and $100\cdot\frac{1+c}{2}$<sup>th</sup> percentiles of this distribution (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say<br />
<br />
$$\hat{\theta}_\alpha \leq \hat{\theta} \leq \hat{\theta}_\beta, \mbox{with confidence }c$$<br />
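A small numpy sketch of this percentile bootstrap (illustrative; the data, estimator, and $k$ are arbitrary):<br />
<pre>
import numpy as np

def bootstrap_ci(data, estimator=np.mean, k=10000, c=0.95, seed=0):
    """Percentile bootstrap: resample the data with replacement k times,
    compute the estimator on each resample, and take the (1-c)/2 and (1+c)/2
    percentiles of the resulting distribution."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([estimator(rng.choice(data, size=n, replace=True))
                          for _ in range(k)])
    lo, hi = np.percentile(estimates, [100 * (1 - c) / 2, 100 * (1 + c) / 2])
    return lo, hi

print(bootstrap_ci(np.array([1, 2, 3, 4, 5, 6, 7, 8])))
</pre>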
<br />
== Why choose bootstrap and not dropout? ==<br />
<br />
There is previous work<sup>[[#References|[4]]]</sup> that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.<br />
<br />
[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]<br />
<br />
According to the authors of this paper,<br />
# Even though [[#References|[4]]] says that dropout converges around the mean, their experiment actually behaves strangely around a reasonable point like $x=0.75$. They think that this happens because dropout only affects the region local to the original data.<br />
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).<br />
# The trained NN collapses to zero uncertainty at the data points from the original data.<br />
<br />
== Q Learning and Deep Q Networks <sup>[[#References|[5]]]</sup> ==<br />
<br />
At any point of time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}r_{t+i}<br />
$$<br />
<br />
But, this reward is unbounded. So technically it could tend to $\infty$ in a lot of the cases. This is why we use a '''discounted reward'''.<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}<br />
$$<br />
<br />
Here, we take $0\leq \gamma \lt 1$. If it is equal to one, the agent values future reward just as much as current reward. Conversely, a value of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions. So, what this means is that we value our current reward the most ($r_0$ has a coefficient of $1$), but we also consider the future possible rewards. So if we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and $+2$ after $3$ timesteps for $20$ timesteps, we choose the latter ($\gamma=0.9$). This is because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2)$.<br />
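The arithmetic behind this comparison can be checked directly (a quick sketch of the example above with $\gamma=0.9$):<br />
<pre>
gamma = 0.9
option_a = 4.0                                                  # +4 now, 0 at all other timesteps
option_b = -2.0 + sum(gamma ** t * 2.0 for t in range(3, 23))   # -2 now, then +2 at t = 3..22
print(option_a, option_b)  # option_b is larger, so the discounted agent prefers it
</pre>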
<br />
<br />
A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the reward from a given state $s$. So, a '''value function''' is defined from $s$ (which the agent is in, at timestep $t$) and following the policy $\pi$ as $V^\pi(s) = \mathbb{E}[R_t]$. The optimal value function is then simply<br />
<br />
$$<br />
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)<br />
$$<br />
<br />
For convenience however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly as $V$. It is the expected return after taking an action $a$ in the given state $s$. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is<br />
<br />
$$<br />
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)<br />
$$<br />
<br />
Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.<br />
<br />
$$<br />
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)<br />
$$<br />
<br />
So, our best expected reward from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$. This is known as the '''Bellman equation''' in optimal control problems (by the way, its continuous form is called the '''Hamilton-Jacobi-Bellman equation''', or HJB equation, which is a very important partial differential equation):<br />
<br />
$$<br />
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]<br />
$$<br />
<br />
In Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$, since the Bellman equation is a non-linear fixed-point equation (and its continuous form, the HJB equation, is a non-linear PDE) that is very difficult to solve exactly for large problems. The '''naive way''' to do this is to design a deep neural network that takes as input the state $s$ and action $a$, and produces an approximation to $Q^*$. <br />
<br />
* Suppose our neural net weights are $\theta_i$ at iteration $i$.<br />
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.<br />
* To find out what action is best from $s'$, i.e. $a'$, we have to simulate all actions from $s'$. We can do this after we complete this iteration, then run $s',a_x$ for all $a_x\in\mathbb{A}$. But, we don't know how to complete this iteration without knowing this $a'$. So, another way is to simulate all actions from $s'$ using last known set of weights $\theta_{i-1}$. We just simulate state $s'$, action $a_x$ for all $a_x\in\mathbb{A}$ from the previous state and get $Q^*(s',a_x;\theta_{i-1})$. ('''Note''' that some papers do not use the set of weights from the previous iteration $\theta_{i-1}$. Instead they fix the weights for finding the best action for every $\tau$ steps to $\theta^-$, and do $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ and use this for the target value.)<br />
* Now we can compute our loss function using the Bellman equation, and backpropagate.<br />
$$<br />
\mbox{loss}=\mbox{target}-\mbox{prediction}=(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1}))-Q^*(s,a;\theta_i)<br />
$$<br />
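A schematic sketch of this computation (here <code>q_net</code> and <code>q_net_prev</code> are hypothetical callables mapping a state to a vector of Q-values, one per action; in practice the squared error is minimized):<br />
<pre>
import numpy as np

def dqn_td_error(q_net, q_net_prev, s, a, r, s_next, gamma=0.99):
    """TD error for one transition under the scheme described above:
    the target uses the previous weights, the prediction uses the current ones."""
    target = r + gamma * np.max(q_net_prev(s_next))   # r + gamma * max_a' Q*(s', a'; theta_{i-1})
    prediction = q_net(s)[a]                          # Q*(s, a; theta_i)
    return target - prediction                        # squared and averaged to form the training loss
</pre>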
<br />
The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.<br />
<br />
Authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes as input the state $s$, and computes the values $Q^*(s,a_x)$ for $a_x\in\mathbb{A}$. So there are $|\mathbb{A}|$ outputs. This basically parallelizes the forward passes so that $r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be computed with just a single pass through the outputs. The following figure illustrates this fact:<br />
<br />
[[File:hamid.png|thumb|500px|Source: David Silver slides|center]]<br />
<br />
<br />
[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]<br />
<br />
'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$. So, we can for example, apply a CNN on the frames and get an idea of what the current state is. We pass this output to the input of the DNN (DNN is the fully connected layer for the CNN then).<br />
<br />
=== Experience Replay ===<br />
<br />
Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.<br />
<br />
<br />
Advantages of experience replay over simple online Q learning<sup>[[#References|[5]]]</sup>:<br />
* '''Better data efficiency''': It is better to use one transition many times to learn again and again, rather than just learn once from it.<br />
* Learning from consecutive samples is difficult because of correlated data. Experience replay breaks this correlation.<br />
* Online learning means the input is decided by the previous action. So, if the maximising action is to go left in some game, next inputs would be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#References|[7]]] points out.<br />
<br />
== Double Q Learning ==<br />
<br />
=== Problem with Q Learning<sup>[[#References|[8]]]</sup> ===<br />
<br />
For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:<br />
<br />
$$<br />
Q^*(s,a) \leftarrow Q^*(s,a) + \alpha(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x) - Q^*(s,a))<br />
$$<br />
<br />
Here $\alpha$ is the scalar learning rate. Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by<br />
<br />
$$<br />
\mathbb{Q}^* = Q^*+\epsilon<br />
$$<br />
<br />
Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$. This leads to a non zero mean subsequently. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning is more likely to choose overoptimistic values.<br />
<br />
=== How does Double Q Learning work? <sup>[[#References|[9]]]</sup> ===<br />
<br />
The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as<br />
<br />
$$<br />
\mbox{target} = r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+\gamma Q^*(s',a';\theta)<br />
$$<br />
<br />
Using double Q learning, we '''select''' the best action using current weights $\theta$ and '''evaluate''' the $Q^*$ value to decide the target value using $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+\gamma Q^*(s',a';\Theta)<br />
$$<br />
<br />
This makes the evaluation fairer.<br />
<br />
=== Double Deep Q Learning ===<br />
<br />
[[#References|[9]]] further talks about how to use this for deep learning without much additional overhead. The suggestion is to use $\theta^-$ as $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+\gamma Q^*(s',a';\theta^-)<br />
$$<br />
=== Final DQN used in this paper ===<br />
The authors combine the idea of double DQN discussed above with the loss function discussed in "Q Learning and Deep Q Networks" section. So here is the final update for parameters of action value function:<br />
<br />
$$<br />
\theta_{t+1} \leftarrow \theta_t + \alpha(y_t^Q -Q(s_t,a_t;\theta_t))\nabla_{\theta}Q(s_t,a_t;\theta_t)<br />
$$<br />
$$<br />
y_t^Q \leftarrow r_t + \gamma Q(s_{t+1}, \underset{a}{argmax} \ Q(s_{t+1},a;\theta_t);\theta^{-})<br />
$$<br />
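The difference between the two targets can be summarized in a short sketch (here <code>q_online</code> and <code>q_target</code> are hypothetical callables for the $\theta$ and $\theta^-$ networks, each returning a vector of Q-values):<br />
<pre>
import numpy as np

def double_dqn_target(q_online, q_target, r, s_next, gamma=0.99):
    """Double-DQN target: the online network (theta) selects the action,
    the target network (theta^-) evaluates it."""
    a_star = int(np.argmax(q_online(s_next)))      # action selection with theta
    return r + gamma * q_target(s_next)[a_star]    # action evaluation with theta^-

def vanilla_dqn_target(q_target, r, s_next, gamma=0.99):
    """Standard DQN target: the same network both selects and evaluates the action,
    which is what produces the overestimation bias discussed above."""
    return r + gamma * np.max(q_target(s_next))
</pre>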
<br />
== Bootstrapped DQN ==<br />
<br />
The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ data points, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ data points from the original experience buffer $E$ with replacement ('''bootstrap method''').<br />
<br />
<br />
Because each of the heads train on a different buffer, they model a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ helps us create the experience buffer for the episode. From any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$. (similar to '''Thompson Sampling''')<br />
<br />
$$<br />
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a_t)<br />
$$<br />
<br />
Also, along with $s_t,a_t,r_t,s_{t+1}$, they push a bootstrap mask $m_t$. This mask is basically a binary vector of size $K$, and it tells which $Q_k$ should be affected by this datapoint, if it is chosen as a training point. So, for example, if $K=5$ and there is an experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2,Q_3$ and $Q_5$.<br />
<br />
<br />
So, at each iteration, we just choose a few points from this buffer and train the respective $Q_{(\cdot)}$ heads based on the bootstrap masks.<br />
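A minimal PyTorch sketch of the shared-torso/$K$-heads idea and the masked per-head loss (illustrative only; the hyperparameters are placeholders and the paper's actual network is convolutional):<br />
<pre>
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared torso with K bootstrap heads; each head outputs |A| Q-values."""
    def __init__(self, state_dim, n_actions, k=10, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(k))

    def forward(self, s):
        z = self.torso(s)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (batch, K, |A|)

def masked_td_loss(q_all, a, target, mask):
    """Squared TD error per head, zeroed out for heads whose mask bit is 0,
    so each head only trains on 'its' bootstrapped share of each transition.
    a: LongTensor of actions, shape (batch,); mask: (batch, K)."""
    idx = a.view(-1, 1, 1).expand(-1, q_all.size(1), 1)
    q_sa = q_all.gather(2, idx).squeeze(2)                 # (batch, K)
    per_head = (q_sa - target.unsqueeze(1)) ** 2
    return (per_head * mask).sum() / mask.sum().clamp(min=1.0)
</pre>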
<br />
=== How to generate masks? ===<br />
<br />
Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:<br />
<br />
* If for each datapoint $D_i$ ($i=1$ to $n$), we sample the mask bit from $\mbox{Bernoulli}(0.5)$, this roughly allows each head to see half the points from the original buffer. To get back to size $n$, we duplicate these points by doubling the weight of each retained datapoint. This essentially gives us a '''double or nothing''' bootstrap<sup>[[#References|[10]]]</sup>.<br />
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.<br />
* $m_t \sim \mbox{Poi}(1)$ (Poisson distribution)<br />
* $m_t[k] \sim \mbox{Exp}(1)$ (exponential distribution)<br />
<br />
For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.<br />
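<br />
For illustration, the masking distributions listed above could be sampled as follows (a sketch; the function names are ours).<br />
<pre>
import numpy as np

def bernoulli_mask(K, p=0.5):
    # Each head sees the datapoint independently with probability p.
    return (np.random.rand(K) < p).astype(int)

def ensemble_mask(K):
    # All-ones mask: every head sees every datapoint (plain ensemble learning).
    return np.ones(K, dtype=int)

def poisson_mask(K, lam=1.0):
    # Integer weights m_t[k] ~ Poi(1); mimics sampling n points with replacement.
    return np.random.poisson(lam, size=K)

def exponential_mask(K):
    # Continuous weights m_t[k] ~ Exp(1), as in a Bayesian bootstrap.
    return np.random.exponential(1.0, size=K)
</pre>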
<br />
== Related Work ==<br />
<br />
The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:<br />
<br />
$$<br />
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)<br />
$$<br />
<br />
According to the authors, [11]'s DQN algorithm relies on a lot of hand tuning and is only suited to non-stochastic problems. The authors further compare their results to [11]'s results on Atari.<br />
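<br />
For intuition only, a count-based proxy for such a bonus is sketched below for discrete state-action pairs; note that [11] actually derives novelty from the prediction error of a learned dynamics model rather than from visit counts.<br />
<pre>
from collections import defaultdict

visit_counts = defaultdict(int)

def shaped_reward(r, s, a, beta=0.1):
    """Add an exploration bonus that decays as (s, a) is visited more often."""
    visit_counts[(s, a)] += 1
    novelty = 1.0 / visit_counts[(s, a)] ** 0.5
    return r + beta * novelty
</pre>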
<br />
<br />
The authors also mention an existing algorithm PSRL<sup>[[#References|[12,13]]]</sup>, or posterior sampling based RL. However, this algorithm requires a solved MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.<br />
<br />
<br />
Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI<sup>[[#References|[14]]]</sup>, which solves the problem for the linear case.<br />
<br />
== Deep Exploration: Why is Bootstrapped DQN so good at it? ==<br />
<br />
The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.<br />
<br />
[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
In this example, the agent starts at $s_2$. There are $N$ states, and $N+9$ timesteps per episode to generate the experience buffer. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps) for at least $100$ such episodes. The results they obtained:<br />
<br />
[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
The blue dots indicate when the agent learnt the optimal policy. If this took more than $2000$ episodes, it is indicated with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is the same as bootstrapped DQN except that the mask is always $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN achieves deep exploration better than these two methods, and better than DQN.<br />
<br />
=== But why is it better? ===<br />
<br />
The authors say that this is because bootstrapped DQN constructs different approximations to the posterior over $Q^*$ from the same initial data. This diversity of approximations comes from the random initialization of the weights of the $Q^*_k$ heads. The heads therefore start out trying random actions (because of the diverse random initial $Q^*_k$), but when some head finds a good state and generalises to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually the other heads will either find other good states, or end up learning the best states found by the other heads.<br />
<br />
<br />
So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.<br />
<br />
== Results ==<br />
<br />
The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.<br />
<br />
=== Scale: What values of $K$, $p$ are best? ===<br />
<br />
[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]<br />
<br />
Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors observe that performance is already close to its peak at around $K=10$ heads, so larger ensembles bring little additional benefit.<br />
<br />
<br />
$p$ also controls the amount of data sharing: the larger $p$ is, the more likely (under the Bernoulli distribution) a datapoint is to be included in a head's bootstrapped dataset $D_i$, so more heads end up training on the same datapoints. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:<br />
<br />
* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.<br />
* Atari is deterministic.<br />
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.<br />
<br />
$p=1$ is the value they finally use: since performance is insensitive to $p$, sharing all datapoints across all heads avoids the overhead of masking and reduces training time.<br />
<br />
=== Performance on Atari ===<br />
<br />
In general, the results show that bootstrapped DQN outperforms DQN.<br />
<br />
[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]<br />
<br />
The authors plot, per game, the improvement achieved with bootstrapped DQN. The '''improvement''' is defined to be $x$ if bootstrapped DQN reaches the performance of DQN using only $\frac{1}{x}$ of the frames.<br />
<br />
[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]<br />
<br />
<br />
The authors note that bootstrapped DQN does not perform well on all Atari games. There are some challenging games where exploration is key, and bootstrapped DQN does not do well enough there (though it still does better than DQN); examples are Frostbite and Montezuma's Revenge. They say that even better exploration may help, but also point out that there may be other problems, such as network instability, reward clipping and temporally extended rewards.<br />
<br />
=== Improvement: Highest Score Reached & how fast is this high score reached? ===<br />
<br />
The authors plot the improvement graphs after 20m and 200m frames.<br />
<br />
[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]<br />
<br />
=== Visualisation of Results ===<br />
<br />
A [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM YouTube playlist] by one of the authors can be found online.<br />
<br />
<br />
The authors also point out that using bootstrapped DQN purely as an exploitation strategy is already better than vanilla DQN by itself. This is due to the deep exploration capabilities of bootstrapped DQN: it can exploit the best states it knows while also planning to try out states it has no information about. In the videos, the heads can be seen to agree at all the crucial decisions, but stay diverse at other, less important steps.<br />
<br />
== Critique ==<br />
<br />
It would have been a very interesting addition to the experimental section if the authors had compared with the asynchronous methods for exploring the state space first introduced in [[#References|[15]]]. Unfortunately, the authors only compared their approach with the original DQN and not with the other variations in the literature, justifying this by saying that their idea is "orthogonal" to those improvements.<br />
<br />
=== Different way to do exploration-exploitation? ===<br />
<br />
Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities<br />
<br />
$$<br />
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{j=1}^{|\mathbb{A}|} Q^*_s(s_t,a_j)}<br />
$$<br />
<br />
In my view, this is closer to Thompson Sampling.<br />
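<br />
A sketch of this alternative selection rule follows. Since Q-values can be negative, the raw ratio above may not define a valid distribution, so the sketch uses a softmax (Boltzmann) weighting instead, which preserves the same spirit of favouring high-value actions.<br />
<pre>
import numpy as np

def sample_action(q_values, temperature=1.0):
    """Sample an action with probability increasing in its Q-value (Boltzmann exploration)."""
    z = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(z - z.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
</pre>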
<br />
=== Why use Bernoulli? ===<br />
<br />
The choice of a Bernoulli masking distribution ultimately doesn't help them at all, since the algorithm does well because of the initial diversity. Maybe another masking distribution could be used? However, since the bootstrapping procedure is distribution-independent, the choice of masking distribution should not affect the long-term performance of Bootstrapped DQN.<br />
<br />
=== Unanswered Questions & Miscellaneous ===<br />
* Thompson DQN is not preferred because other randomized value functions can implement settings similar to Thompson sampling without needing an intractable exact posterior update, and without its main computational issue of resampling at every time step. Perhaps the authors could also have explored temporal-difference learning, which combines dynamic programming and Monte Carlo methods.<br />
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.<br />
<br />
=== Improvement on Hierarchical Tasks ===<br />
It is interesting that such a bootstrap exploration principle actually improves performance on hierarchical tasks such as Montezuma's Revenge. It would be better if the authors could further illustrate the influence of exploration in a sparse-reward hierarchical task.<br />
<br />
== References ==<br />
<br />
# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]<br />
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]<br />
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]<br />
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]<br />
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]<br />
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.<br />
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.<br />
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.<br />
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]<br />
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]<br />
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]<br />
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.<br />
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.<br />
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]<br />
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.<br />
# George Konidaris, Sarah Osentoski, and Philip Thomas. 2011. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI'11). AAAI Press 380-385.<br />
<br />
<br />
Other helpful links (unsorted):<br />
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]<br />
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]<br />
<br />
== Appendix ==<br />
<br />
=== Algorithm for Bootstrapped DQN ===<br />
The appendix lists the following algorithm. Periodically, the replay buffer is played back to update the value-function network $Q$.<br />
<br />
[[File:alg1.PNG|thumb||left||700px|Source: this paper's appendix]]</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=30996Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-21T16:14:42Z<p>Asriram: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. Contexts contribute semantic clues for action recognition in videos (see the figure below [8]). Convolutional Neural Networks [1,2,3] and their 3D extensions, 3D CNNs [4,5,6], have been employed for action recognition, but they identify and aggregate contexts only at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors propose a strategy to identify contexts in videos as early as possible and to leverage their evolution for action recognition. These networks involve many layers, and the early layers, with their small receptive fields (RFs), output only local features. As we go deeper into the layers, the receptive fields expand and we start capturing contexts. The authors observe that simply increasing the number of layers adds a burden in terms of the number of parameters, and that contexts could be obtained even at earlier stages. The authors also cite papers [9,10] that relate CNNs to the visual systems of our brain, one remarkable difference being the abundant recurrent connections in the brain compared to the purely feed-forward connections in CNNs. In summary, this paper proposes a novel neural network for action recognition, called the deep alternative neural network (DANN). Its novel component is an "alternative layer", composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors propose a new approach to select the network input based on optical flow. DANN is validated on the HMDB51 and UCF101 datasets and achieves performance comparable to state-of-the-art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There already exists a closely related paper ([11]) in the literature, which proposed a similar alternation architecture. In particular, the similarity between the authors' work and the aforementioned paper is that both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment.<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The temporal size of the input video clip is generally determined empirically, and past approaches have used different numbers of frames. For instance, many previous papers suggested using short intervals of 1 to 16 frames. More recent work [9] recognized that human actions often "span tens or hundreds of frames" and that longer intervals, such as 60 frames, outperform shorter ones. However, there is still no systematic way of determining the number of frames for the network input, which motivates the authors' adaptive method. Past research shows that the motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation, as shown in Figure 1, and is particularly suitable for this temporal estimation because: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors propose an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity (optical flow was defined in the previous subsection; optical flow methods estimate the motion between two image frames taken at different times). The optical flow energy of an optical flow field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\sum_{(x,y)\in\mathbb{P}} \|(v_{x}(x,y),v_{y}(x,y))\|_{2}</math><br />
<br />
Here, $\mathbb{P}$ is the set of selected interest-point pixels. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$, and for each pair of consecutive landmarks they create a video fragment $s$ by extracting the frames between them, $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
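<br />
A rough sketch of this fragment-selection step, assuming the per-frame flow fields have already been computed (e.g. with TV-L1), is given below; the function and variable names are ours, not the authors'.<br />
<pre>
import numpy as np

def flow_energy(vx, vy, points):
    """e(I): sum of flow magnitudes over the selected interest points of one frame."""
    return sum(np.hypot(vx[y, x], vy[y, x]) for (x, y) in points)

def split_into_fragments(energies):
    """Cut the video at local minima/maxima of the flow-energy signal."""
    landmarks = [t for t in range(1, len(energies) - 1)
                 if (energies[t] - energies[t - 1]) * (energies[t + 1] - energies[t]) < 0]
    cuts = [0] + landmarks + [len(energies) - 1]
    # Each consecutive pair of landmarks bounds one variable-length fragment.
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
</pre>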
<br />
To deal with the varying lengths of the resulting video clips, the authors adopt the idea of spatial pyramid pooling (SPP) [12] and extend it to the temporal domain, developing a volumetric pyramid pooling (VPP) layer that transforms a video clip of arbitrary size into a fixed-length representation in the last alternative layer before the fully connected layers.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
* $u_{ij}^{xyz}(0)$: feed-forward output of the volumetric convolutional layer <br />
* $u_{ij}^{xyz}(t-1)$: recurrent input from the previous time step <br />
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward and recurrent kernels, respectively <br />
* $f$: ReLU function<br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in the AL provide two advantages. First, they enable every unit to incorporate context from an arbitrarily large region of the current layer. The drawback is that, without top-down connections, the states of units in the current layer cannot be influenced by the context seen by higher-level units. Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant through weight sharing, since an AL adds only the constant number of extra parameters of a recurrent kernel.<br />
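<br />
A simplified sketch of one AL, following the summary's equations above, is shown below; conv_ff and conv_rec stand in for the volumetric feed-forward and recurrent convolutions, whose implementations are omitted.<br />
<pre>
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def alternative_layer(x, conv_ff, conv_rec, bias, T=3):
    """One Alternative Layer: a volumetric convolution followed by T recurrent steps.

    conv_ff(x): feed-forward volumetric convolution of the previous layer's output x.
    conv_rec(u): recurrent convolution applied to this layer's own state u.
    """
    u0 = relu(conv_ff(x))       # u(0): output of the volumetric convolutional layer
    u = u0
    for _ in range(T):          # unrolled recurrence; weights are shared across steps
        u = u0 + relu(conv_rec(u)) + bias
    return u
</pre>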
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors replace the last pooling layer with a volumetric pyramid pooling layer (VPPL), since the fully connected layers require fixed-length vectors while the AL accepts video clips of arbitrary size and produces outputs of variable size. Figure 2 illustrates the structure of the VPPL. The authors use max pooling to pool the responses of each kernel in each volumetric bin. The outputs are $kM$-dimensional vectors, where:<br />
<br />
* $M$: number of bins <br />
* $k$: number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
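<br />
A NumPy sketch of such a layer with max pooling over a three-level pyramid is given below (bin counts and names are ours); it assumes every axis of the feature volume has at least as many elements as the largest pyramid level.<br />
<pre>
import numpy as np

def volumetric_pyramid_pool(feat, levels=(1, 2, 4)):
    """feat: array of shape (k, T, H, W); returns a fixed-length vector of size k * M."""
    k, T, H, W = feat.shape
    pooled = []
    for L in levels:
        # Split each volumetric axis into L roughly equal bins.
        t_bins = np.array_split(np.arange(T), L)
        h_bins = np.array_split(np.arange(H), L)
        w_bins = np.array_split(np.arange(W), L)
        for tb in t_bins:
            for hb in h_bins:
                for wb in w_bins:
                    bin_vol = feat[:, tb][:, :, hb][:, :, :, wb]
                    pooled.append(bin_vol.max(axis=(1, 2, 3)))   # max pool per kernel
    return np.concatenate(pooled)   # length k * (1**3 + 2**3 + 4**3) for the default levels
</pre>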
<br />
This is reminiscent of spatial pyramid pooling in deep convolutional networks: standard CNNs require inputs of fixed dimensions so that the classifier layers receive fixed-length features, and spatial pyramid pooling was introduced to remove this restriction.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors use the Torch toolbox for the implementations of volumetric convolutions, recurrent layers and optimization. For data augmentation they use a technique called random clipping, in which, after determining the temporal size $t$, they randomly crop a clip of fixed size 80x80x$t$ from the input video. This technique is preferred to the common alternative of pre-processing the data with a sliding window to obtain pre-segmented clips; the authors note that the sliding-window approach limits the amount of data when the windows do not overlap. For training, they use SGD applied to mini-batches of size 30 with a negative log-likelihood criterion; the cross-entropy loss is minimized with the backpropagation-through-time (BPTT) algorithm. During testing, they divide each video into 80x80x$t$ clips with a temporal stride of 4 frames and then test with 10 crops. The final score is the average of all clip-level and crop-level scores.<br />
Data augmentation techniques such as multi-scale cropping have been evaluated, motivated by the recent state-of-the-art performance of Very Deep Two-Stream ConvNets. Intuitively, the corner-cropping strategy could provide better results (depending on the degree of trade-off), since the receptive fields can then focus more on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors evaluated three types of input modality, namely sparse optical flow, RGB and TV-L1 optical flow, and found TV-L1 most suitable, since actions are easier to learn from motion information than from raw pixel values. The influence of data augmentation was also studied: with a sliding window with 75% overlap as the baseline, random clipping and multi-scale clipping both outperformed the baseline on UCF101 split 1. The adaptive temporal length gave a boost of 4.2% compared with architectures using a fixed temporal length. Experiments were also conducted to see whether what is learned on one dataset improves accuracy on another: fine-tuning HMDB51 from UCF101 boosted performance from 56.4% to 62.5%. The authors also observed that increasing the number of AL layers improves performance, as larger contexts are embedded into the DANN. Overall, DANN achieved accuracies of 65.9% and 91.6% on HMDB51 and UCF101, respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors discuss the quality of the predictions on video clips using two examples, a bowling scene and a haircut scene. In the bowling scene, the adaptive temporal choice used by DANN aggregates reasonable semantic structures, so reasonable video clips are leveraged as input. On the other hand, performance on the haircut clip was not as good: the rich contexts provided by DANN were not helpful in a setting with simple actions performed against a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets (HMDB51 and UCF101) under different scenarios and compared the performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the authors' approach are not readily apparent. In comparisons with the state of the art, the proposed model performs worse on HMDB, and while the authors claim the highest performance on UCF, the increase is merely 0.1 over the previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[36] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30954Modular Multitask Reinforcement Learning with Policy Sketches2017-11-21T04:18:57Z<p>Asriram: /* Conclusion & Critique */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
[[File:MRL_diagram.jpg|thumb|right|400px| Figure 1b: the diagram for policy sketches]]<br />
[[File:MRL_encode.jpg|thumb|right|600px| Figure 1c: All sub tasks are encoded without any semantic meanings]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them—specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task that describe its components, as shown in Figure 1. Symbols may be shared across different tasks (the predicate "get wood" appears in the sketches of both the tasks "make planks" and "make sticks"), but the learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards. As shown in Figure 1c, the tasks are divided into human-readable subtasks; in the actual setting, however, the learner only has access to encoded symbols.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments that can be easily recombined. The authors evaluate the approach under a variety of data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In an adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Related Work'''=<br />
The approach in this paper is a specific case of the options framework developed by Sutton et al., 1999. In that work, options are introduced as "closed-loop policies for taking action over the period of time". They show that options enable temporally abstract information to be included in reinforcement learning algorithms, though it was published before the large-scale popularity of neural networks for reinforcement.<br />
<br />
Other authors have recently explored techniques for learning policies which require less prior knowledge of the environment than the method presented in this paper. For example, in Vezhnevets et al. (2016), the authors propose an RNN architecture to build "implicit plans" purely through interaction with the environment, as in the classic reinforcement learning problem formulation.<br />
<br />
One closely related line of work is the Hierarchical Abstract Machines (HAM) framework introduced by Parr & Russell, 1998 [11]. Like the approach which the Modular Multitask Reinforcement Learning with Policy Sketches uses, HAMs begin with a representation of a high-level policy as an automaton (or a more general computer program; Andre & Russell,<br />
2001 [7]; Marthi et al., 2004 [12]) and use reinforcement learning to fill in low-level details.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented action space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that maps a representation of the current state to a distribution over $A^+$ will work with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2, \ldots)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index ''i'' (initially 0) and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to $\pi_{b_{i+1}}$. We may thus think of $\Pi_r$ as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to the projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted $\Pi := \bigcup_r \{ \Pi_r \}$. Let each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize the expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
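<br />
The execution semantics of a task policy $\Pi_r$ can be sketched directly: iterate over the sketch symbols and let each subpolicy act until it emits STOP (a sketch with a placeholder environment interface, not the authors' implementation).<br />
<pre>
STOP = "STOP"

def run_task_policy(env, sketch, subpolicies, max_steps=1000):
    """sketch: sequence of symbols b; subpolicies[b](state) samples an action in A+ (A together with STOP)."""
    state, total_reward, steps = env.reset(), 0.0, 0
    for b in sketch:                          # control passes to the next subpolicy on STOP
        while steps < max_steps:
            action = subpolicies[b](state)
            if action == STOP:
                break
            state, reward, done = env.step(action)
            total_reward += reward
            steps += 1
            if done:
                return total_reward
    return total_reward
</pre>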
<br />
==Policy Optimization==<br />
<br />
With a control policy parameterized by a vector $\theta$, the objective is $$\displaystyle \max_{\theta}\ \mathbb{E}\left[\sum_{t=0}^{H}R(s_{t})\,\Big|\,\pi_{\theta}\right]$$ where $\pi_{\theta}(u|s)$ is the probability of taking action $u$ in state $s$. Further details on policy optimization can be found here: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
To generalize to modular policies built by sequencing subpolicies, the authors suggest having one subpolicy per symbol but one critic per task. This is because a subpolicy $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$; individual subpolicies therefore cannot be tied to a single value function. The actor–critic method is extended to decouple policies from value functions by allowing the critic to vary per sample (per task and timestep), based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now, minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. The $c_t$ are allowed to be implemented with an arbitrary function approximator with parameters $\eta_t$, trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1}) - V_t(s_i)$, or any other member of the generalized advantage estimator family) can be substituted by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
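<br />
A stripped-down sketch of such a gradient step, with the policy and critic gradients supplied as black-box callables (placeholder names, not the authors' code), is shown below.<br />
<pre>
def decoupled_actor_critic_step(rollouts, logpi_grad, critic, critic_grad,
                                theta, eta, alpha=1e-3, beta=1e-3):
    """rollouts: tuples (task t, symbol b, state s, action a, empirical return q).

    logpi_grad(b, s, a): gradient of log pi_b(a | s) with respect to theta[b].
    critic(t, s) / critic_grad(t, s): per-task baseline c_t(s) and its gradient w.r.t. eta[t].
    """
    for (t, b, s, a, q) in rollouts:
        advantage = q - critic(t, s)                       # task-specific baseline
        theta[b] = theta[b] + alpha * advantage * logpi_grad(b, s, a)    # subpolicy update
        eta[t] = eta[t] + beta * advantage * critic_grad(t, s)           # critic regression step
    return theta, eta
</pre>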
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner's time (and computational resources) to focus on "easy" tasks, where many rollouts will result in high reward, from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that no longer exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially, the model is presented with tasks associated with short sketches. Once the average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized, with maximum achievable reward $0 < q_i < 1$. Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task $t$. Then, at each timestep, tasks are sampled in proportion to $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in a high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
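<br />
The curriculum itself is easy to sketch: sample among the currently eligible tasks in proportion to $1 - Er_t$, and grow the length limit once the average estimated reward clears $r_{good}$. The task objects, the er estimates and the training callback below are placeholders.<br />
<pre>
import random

def sample_task(tasks, er, l_max):
    """Sample among tasks whose sketch length is at most l_max, in proportion to 1 - Er_t."""
    eligible = [t for t in tasks if len(t.sketch) <= l_max]
    weights = [1.0 - er[t] for t in eligible]        # assumes rewards are normalized to [0, 1)
    return random.choices(eligible, weights=weights, k=1)[0]

def curriculum_train(tasks, er, train_step, r_good=0.8, max_len=5):
    """Outer loop in the spirit of Algorithm 2; train_step is assumed to refresh er in place."""
    l_max = 1
    while l_max <= max_len:
        eligible = [t for t in tasks if len(t.sketch) <= l_max]
        while sum(er[t] for t in eligible) / len(eligible) < r_good:
            train_step(sample_task(tasks, er, l_max))
        l_max += 1           # expand the curriculum to longer sketches
    return er
</pre>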
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL_maze.png|boarder|right|400px]]<br />
<br />
The maze environment is very similar to “light world” described by [4], which can be seen in Figure 3c. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically, they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task/subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for the baselines and the modular model are shown in Figure 4. In all environments, the modular approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. Figure 4c further shows that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
[[File:MRL10.png|border|right|400px]]<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalized advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
Two other experiments are also performed as Figure 5b: starting with short examples and moving to long ones, and sampling tasks in inverse proportion to their accumulated reward. It is shown that both components help; prioritization by both length and weight gives the best results.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|left|320px]]<br />
In the final experiments, the authors test the model's ability to generalize beyond the standard training condition. They consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and an adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction with the new task. They hold out two length-four tasks from the full inventory used in Section 4.3 and train on the remaining tasks. For the zero-shot experiments, the concatenated policy described by each held-out task's sketch is formed and executed repeatedly (without learning) to obtain an estimate of its effectiveness. For the adaptation experiments, they perform ordinary RL over the high-level actions $B$ rather than the low-level actions $A$, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural sub policy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable sub policies which can be employed for zero-shot generalization when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
Hierarchical reinforcement learning is a popular research topic at the moment, and it is interesting to compare this paper with [13] (which was presented in this class), where the architecture follows a manager-and-workers style. In that work, the subpolicy is decided by a manager network: to finish a hierarchical task, each worker focuses on its subtask and optimizes it. The difference is that in that work the subpolicies are implicit and are learned during training. A natural question for future work on this paper is whether these subpolicies could be learned automatically instead of being pre-defined by the sketch.<br />
<br />
There are four drawbacks to the presented work. First, the ideas used in this paper (for instance, symbolic specifications, actor–critic methods, shared representations) have all been explored in other works. Second, the approach relies heavily on curriculum learning, which is difficult to design and can become quite complicated depending on the task at hand. Third, there is no discussion of how a curriculum could be designed for larger problems. Finally, building a separate neural network for each subtask could lead to overly complicated models, which is not in the spirit of building an efficient structure.<br />
<br />
= '''Resources''' =<br />
You can find a talk on this paper [https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=video&cd=3&cad=rja&uact=8&ved=0ahUKEwjNzPuBqM7XAhVK6mMKHQICAdEQtwIILDAC&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DNRIcDEB64x8&usg=AOvVaw1NHi2XExGXwhzzeJn5AcnR here].<br />
<br />
<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for programmable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2002.<br />
<br />
[9] Author Jacob Andreas presenting the paper - https://www.youtube.com/watch?v=NRIcDEB64x8<br />
<br />
[10] Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., & Agapiou, J. (2016). Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems (pp. 3486-3494).<br />
<br />
[11] Parr, Ron and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.<br />
<br />
[12] Marthi, Bhaskara, Lantham, David, Guestrin, Carlos, and Russell, Stuart. Concurrent hierarchical reinforcement learning. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2004.<br />
<br />
[13] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.<br />
<br />
= Appendix =<br />
The authors provide a brief appendix that gives a complete list of tasks and sketches. Asterisk * indicates that the task was held out for generalization experiments in Section 4.5, but included in the multitask experiments of Sections 4.3 and 4.4.<br />
<br />
[[File: tasks.PNG]]</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30914STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-20T16:50:11Z<p>Asriram: </p>
<hr />
<div>=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN models introduced by Oord et al. in [[#Reference|[1]]]. In that earlier work, the authors observed that PixelRNN performed better than PixelCNN, but PixelCNN was faster to train because training can be parallelized. In this work, Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional variant of the PixelRNN model built on PixelCNN. In particular, the Gated PixelCNN is an explicit-density autoregressive model: it generates images pixel by pixel, decomposing the joint image distribution into a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN improves over the PixelCNN by removing the "blind spot" problem, and, to yield better performance, the authors replace the ReLU units with gated sigmoid and tanh activation functions. The proposed Gated PixelCNN combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the faster training of PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduce a conditional variant (called the Conditional PixelCNN) which can generate images conditioned on class labels, tags, or latent embeddings. These embeddings capture high-level information about an image and can be used to generate a large variety of images with similar features; for instance, by conditioning on the embedding of a portrait, the model can generate different poses of the same person from a single source image. Finally, the authors also present a PixelCNN Auto-Encoder variant, which replaces the deconvolutional decoder of a conventional autoencoder with the conditional PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative approach: given an image with $n^2$ pixels, the model predicts each "unknown" pixel value $x_i$ from the pixels generated before it. To do this, PixelCNNs and PixelRNNs model the joint distribution $p(x)$ over an image as a product of conditional distributions. In other words, these are autoregressive models that apply the plain chain rule to the joint distribution, as shown in Equation 1: the very first pixel is unconditioned, the second depends on the first, the third depends on the first and second, and so on, so the image is modelled as a sequence of pixels in which each pixel depends on all previous ones. Equation 1 gives the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel given the values of all previous pixels. Note that the joint probability $p(x_1, x_2, ..., x_{n^2})$ follows from the chain rule: it is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 provides a pictorial understanding of this factorization: pixels are generated row by row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, a softmax layer at the end of the PixelCNN predicts the pixel intensity value (i.e. the most probable intensity from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
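<br />
To make the sampling procedure implied by this factorization concrete, the following is a minimal sketch (not the authors' code) of pixel-by-pixel generation; it assumes a hypothetical "model" function that returns 256-way scores for every pixel position of a grayscale image:<br />
<pre>
import numpy as np

def sample_image(model, height=32, width=32):
    """Generate one grayscale image pixel-by-pixel.

    `model(image)` is assumed to return, for every position, unnormalized
    scores over the 256 possible intensities; only the scores at the current
    position are used, so already-sampled pixels condition each draw.
    """
    image = np.zeros((height, width), dtype=np.int64)
    for i in range(height):
        for j in range(width):
            logits = model(image)[i, j]          # shape: (256,)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                 # softmax over intensities
            image[i, j] = np.random.choice(256, p=probs)
    return image
</pre>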
<br />
The PixelCNN therefore maps a neighborhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$, and every conditional distribution is modelled by a convolutional neural network. For instance, consider a $5\times5$ image (with each pixel labelled by a letter, and zero-padded) and a $3\times3$ filter that slides over the image, multiplying each element and summing to produce a single response. We cannot use this filter directly, because pixel $a$ should not see the intensities of pixels $b, f, g$ (future pixel values). To counter this issue, the authors apply a mask on top of the filter that keeps only prior pixels and zeroes out the future pixels, excluding them from the calculation, as depicted in Figure 3 [[#Reference|[7]]]. Hence, to ensure the CNN can only use information about pixels above and to the left of the current pixel, the convolution filters are masked: the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
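<br />
As a rough illustration of how such a mask can be realized in code, the following PyTorch sketch (an illustrative assumption, not the authors' implementation) zeroes the kernel weights at and after the current pixel; only the single-channel case is shown, ignoring the R/G/B splitting of Figure 5:<br />
<pre>
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is zeroed at and after the centre position.

    mask_type 'A' also zeroes the centre weight (used for the first layer,
    so a pixel never sees its own value); 'B' keeps the centre (later layers).
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # centre row, right of centre
        mask[kh // 2 + 1:, :] = 0                          # all rows below centre
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask   # re-apply mask so future pixels stay hidden
        return super().forward(x)

# Usage: the first layer uses mask 'A', subsequent layers use mask 'B'.
layer = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=3, padding=1)
</pre>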
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that, as the receptive field of the masked kernel grows over the image, a significant portion of the image is disregarded. For instance, when using a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents in that region are ignored. To address the blind spot, the authors use two filter stacks (a horizontal and a vertical stack) in conjunction, which together capture the whole receptive field, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not require any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer. Hence, every layer in the horizontal stack takes as input both the output of the previous layer and the output of the vertical stack. Splitting the convolution into two separate operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (shown in purple in Figure 7), the convolution conditions only on the current row, so it has access to the pixels to the left. In essence, a $1 \times (n//2+1)$ convolution with a shift (pad and crop) is used rather than a $1\times n$ masked convolution. For $n = 3$, this means convolving over the row with a kernel of width 2 (instead of 3), then padding and cropping the output so that the image shape stays the same. Hence, the row is convolved with a kernel of width 2 and without any mask.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel of the output (just before the ‘Crop here’ line) does not hold information from the last input sample (the dashed line).<br />
<br />
=== Vertical Stack ===<br />
The vertical stack (shown in blue) has access to all pixels above the current row. It uses a kernel of size $(n//2 + 1) \times n$, with the input image padded with extra rows of zeros. The convolution is then performed and the output is cropped so that each predicted pixel depends only on the rows above it, while the spatial dimensions are preserved [[#Reference|[3]]]. Since the vertical filter covers only upper pixel values and never touches the current or any "future" pixel, no masking is needed. The output of the vertical stack, which carries information from the pixels above, is then fed into the horizontal stack, which is what eliminates the "blind spot" problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is first padded at the top with (kernel-height) rows of zeros, the convolution is performed, and the output is then cropped so that its rows are shifted by one with respect to the input image. As a result, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
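<br />
The pad-and-crop trick of Figures 8 and 9 can be sketched as follows (illustrative PyTorch code under the assumptions $n=3$, a single input channel and a strictly causal shift in both stacks; this is one reading of the figures, not the reference implementation):<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

n = 3                      # nominal (masked) kernel size
kh, kw = n // 2 + 1, n     # vertical-stack kernel: 2 x 3

v_conv = nn.Conv2d(1, 8, kernel_size=(kh, kw))
h_conv = nn.Conv2d(1, 8, kernel_size=(1, n // 2 + 1))

def vertical_stack(x):
    # Pad kh rows of zeros on top (and n//2 columns left/right), convolve,
    # then crop the extra bottom rows: output row i depends only on input
    # rows strictly above i, with no mask needed.
    H = x.shape[2]
    x = F.pad(x, (n // 2, n // 2, kh, 0))   # (left, right, top, bottom)
    return v_conv(x)[:, :, :H, :]

def horizontal_stack(x):
    # Pad zeros on the left, convolve along the row, crop on the right:
    # output pixel (i, j) depends only on pixels strictly to its left.
    W = x.shape[3]
    x = F.pad(x, (n // 2 + 1, 0, 0, 0))
    return h_conv(x)[:, :, :, :W]

x = torch.zeros(1, 1, 32, 32)
v, h = vertical_stack(x), horizontal_stack(x)   # both have shape (1, 8, 32, 32)
</pre>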
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighborhood of previous pixels, whereas the PixelCNN's effective neighborhood is limited by the filter size and the depth of the convolutional layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions [[#Reference|[3]]]. To bring these benefits of the PixelRNN into the newly proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, shown in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the layer index, $f, g$ index the two sets of feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution operator. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer of the Gated PixelCNN architecture. The vertical stack contributes to the horizontal stack through a $1\times1$ convolution; the connection cannot go the other way, since that would break the conditional distribution. In other words, the two stacks are kept separate in one direction: the vertical stack must not receive any information from the horizontal stack, otherwise it would gain access to pixels it should not see, whereas the vertical stack can safely feed the horizontal stack, because its output only contains information from the rows above the current pixel. In the figure, the (masked) convolution operations are shown in green, and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) convolution to increase parallelization, shown in blue; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $(\lceil n/2 \rceil \times 1)$ and $(\lceil n/2 \rceil \times n)$ convolutions followed by a shift of the pixels (padding and cropping) to recover the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be incorporated at the very end of the network. It is also worth noting that the skip and residual connections use different weights after the gated block.<br />
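<br />
A minimal sketch of one gated layer, showing the $\tanh \odot \sigma$ gate, the $1\times1$ vertical-to-horizontal connection and the residual connection on the horizontal stack (illustrative PyTorch; the plain convolutions below are stand-ins for the shifted/masked convolutions described above, and the skip connections are omitted):<br />
<pre>
import torch
import torch.nn as nn

def gated(x):
    # Split 2p feature maps into two groups of p and apply tanh * sigmoid.
    a, b = x.chunk(2, dim=1)
    return torch.tanh(a) * torch.sigmoid(b)

class GatedLayer(nn.Module):
    """One Gated PixelCNN layer: the vertical stack feeds the horizontal stack
    through a 1x1 convolution; the horizontal stack has a residual connection."""
    def __init__(self, p, n=3):
        super().__init__()
        # Stand-ins for the shifted/masked convolutions of the two stacks.
        self.v_conv = nn.Conv2d(p, 2 * p, kernel_size=n, padding=n // 2)
        self.h_conv = nn.Conv2d(p, 2 * p, kernel_size=(1, n), padding=(0, n // 2))
        self.v_to_h = nn.Conv2d(2 * p, 2 * p, kernel_size=1)
        self.h_res = nn.Conv2d(p, p, kernel_size=1)

    def forward(self, v, h):
        v_pre = self.v_conv(v)
        v_out = gated(v_pre)
        h_pre = self.h_conv(h) + self.v_to_h(v_pre)  # vertical -> horizontal only
        h_out = h + self.h_res(gated(h_pre))         # residual on the horizontal stack
        return v_out, h_out

layer = GatedLayer(p=16)
v = h = torch.zeros(1, 16, 32, 32)
v, h = layer(v, h)   # both outputs keep shape (1, 16, 32, 32)
</pre>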
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, providing an image to the network together with its associated class label in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that the network learns to incorporate that information as well; during inference, one can then specify which class the output image should belong to. Any information can be passed in this way; the discussion below starts with class labels.<br />
<br />
For a conditional PixelCNN, the provided high-level image description is represented as a latent vector $h$, and the model describes the conditional distribution $p(x|h)$ - the probability of an image given that description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, each conditional distribution now also depends on the latent vector $h$, which is added to the activations before the non-linearities; the activation function after adding the latent vector becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside both the tanh and sigmoid functions; each matrix $V$ has shape [number of classes, number of filters], $k$ is the layer index, and the class label is passed as a one-hot vector $h$ during both training and inference.<br />
<br />
If the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. This means the conditioning is independent of the location of the pixel - appropriate when the latent vector encodes “what the image should contain” rather than where the contents should appear. For instance, one could specify that a certain animal or object should appear, in varying positions, poses and backgrounds.<br />
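<br />
As an illustration of how this class-dependent bias can enter the gate, the following hypothetical PyTorch snippet projects a one-hot vector $h$ with a linear layer (playing the role of $V_{k,f}$ and $V_{k,g}$) and adds it before the non-linearities; this is a sketch, not the authors' code:<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    """Gated activation with a class-dependent bias: the conditioning vector h
    is projected and added before the tanh/sigmoid non-linearities, so it acts
    as the same bias at every spatial location."""
    def __init__(self, num_classes, p):
        super().__init__()
        self.proj = nn.Linear(num_classes, 2 * p, bias=False)  # plays the role of V_k

    def forward(self, pre_activation, h):
        # pre_activation: (B, 2p, H, W), the output of the masked convolution W_k * x
        bias = self.proj(h)[:, :, None, None]      # broadcast over H and W
        a, b = (pre_activation + bias).chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)

gate = ConditionalGate(num_classes=1000, p=16)
pre = torch.zeros(4, 32, 8, 8)
h = torch.zeros(4, 1000); h[:, 7] = 1.0            # one-hot class labels
out = gate(pre, h)                                  # shape (4, 16, 8, 8)
</pre>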
<br />
In addition, the authors also developed a variant in which the conditional distribution depends on location (useful when the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) using a deconvolutional neural network $m(\cdot)$; this provides a location-dependent bias as follows:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=== PixelCNN Auto-Encoders ===<br />
Since conditional PixelCNNs can model images from the distribution $p(x|h)$, the same idea can be applied to the image decoder of an autoencoder. Introduced by Hinton et al. in [[#Reference|[5]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the resulting network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: because much of the low-level pixel statistics is handled by the PixelCNN decoder, the encoder can omit low-level pixel statistics and focus on more abstract, high-level information.<br />
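<br />
A rough sketch of this wiring is shown below (illustrative PyTorch; the encoder architecture and the single placeholder convolution standing in for the full conditional Gated PixelCNN decoder are assumptions made for illustration, not details taken from the paper):<br />
<pre>
import torch
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    def __init__(self, bottleneck=10):
        super().__init__()
        # Encoder: compress the 32x32 image into a low-dimensional latent h.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, bottleneck),
        )
        # Decoder: a conditional PixelCNN p(x | h); one convolution plus a
        # 256-way output head stands in for the full (masked, gated) stack.
        self.cond = nn.Linear(bottleneck, 64)
        self.pixelcnn = nn.Conv2d(3, 64, 3, padding=1)   # placeholder for masked convs
        self.head = nn.Conv2d(64, 3 * 256, 1)

    def forward(self, x):
        h = self.encoder(x)                               # (B, bottleneck)
        feats = self.pixelcnn(x) + self.cond(h)[:, :, None, None]
        return self.head(torch.relu(feats))               # (B, 3*256, H, W) logits,
                                                           # trained with per-sub-pixel cross-entropy

model = PixelCNNAutoEncoder(bottleneck=10)
logits = model(torch.zeros(2, 3, 32, 32))                 # shape (2, 768, 32, 32)
</pre>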
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate an unconditional Gated PixelCNN model on the CIFAR-10 dataset. The validation scores of the Gated PixelCNN, PixelCNN, and PixelRNN are compared, where a lower score means the optimized model generalizes better. Using the negative log-likelihood (NLL) criterion, the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90), which outperforms the PixelCNN, at 3.14 (3.08), by 0.11 bits/dim. Although the numerical improvement is modest, the samples produced by the Gated PixelCNN are visually of much better quality than those of the PixelCNN. It is important to note that the Gated PixelCNN comes close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 reports the test performance of benchmark models on CIFAR-10 in bits/dim (lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
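<br />
For reference, the bits/dim figures quoted in these tables are obtained by dividing the negative log-likelihood in nats by the number of dimensions times $\ln 2$; a small illustration of this (assumed, but standard) convention:<br />
<pre>
import math

def bits_per_dim(nll_nats_per_image, height, width, channels=3):
    """Convert a per-image negative log-likelihood (in nats) to bits/dim."""
    num_dims = height * width * channels
    return nll_nats_per_image / (num_dims * math.log(2))

# e.g. a CIFAR-10 image (32x32x3) with an NLL of about 6452 nats -> ~3.03 bits/dim
print(round(bits_per_dim(6452, 32, 32), 2))
</pre>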
<br />
Another experiment is performed on the ImageNet data for image sizes $32 \times 32$ and $64 \times 64$. In particular, for $32 \times 32$ images, the Gated PixelCNN obtains an NLL Test (Train) of 3.83 (3.77), which outperforms PixelRNN at 3.86 (3.83); the authors observe that larger models achieve better performance here, and that the simpler PixelCNN scales better. For $64 \times 64$ images, the Gated PixelCNN obtains 3.57 (3.48), again outperforming PixelRNN at 3.63 (3.57). The authors mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, the Gated PixelCNN trains roughly twice as quickly, taking 60 hours on 32 GPUs. The Gated PixelCNN used here has 20 layers (Figure 2 of the paper), each with 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs in TensorFlow, using a total batch size of 128. Table 2 reports the performance of benchmark models on the ImageNet dataset in bits/dim (lower is better), with the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on ImageNet dataset for an unconditioned Gated PixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluate the Gated PixelCNN conditioned on the class labels of the ImageNet images. Conditioning on a one-hot encoding $h_i$ of the $i^{th}$ class, so that the distribution becomes $p(x|h_i)$, gives the model access to roughly $\log(1000) \approx 0.003$ bits per pixel of additional information for a $32 \times 32$ image (about 10 bits for the class label, spread over the $32 \times 32 \times 3$ colour-channel values). Although the log-likelihood did not improve significantly, the generated images were visually of much better quality than those of the original PixelCNN. <br />
<br />
Figure 11 shows samples from 8 different ImageNet classes, all generated by a single class-conditioned model. It is evident that the Gated PixelCNN can distinguish between objects, animals and backgrounds. The authors observe that the model generalizes and generates new renderings for the animal and object classes, even though it is trained on only approximately 1000 images per class.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors take the top layer of a convolutional network trained on a large database of portraits automatically cropped from Flickr images using a face detector. This network was pre-trained with a triplet loss function, which ensures that the latent embeddings of a particular face are similar across the entire dataset. <br />
<br />
In essence, the authors take the latent embeddings from this supervised pre-trained network to form (image = $x$, embedding = $h$) tuples and train the Conditional PixelCNN on these embeddings to model the distribution $p(x|h)$. Hence, if the network is given a face that is not in the training set, it can compute the latent embedding $h=f(x)$ and generate new portraits of the same person. Figure 12 gives a pictorial example: the generative model produces a variety of images, across different poses and lighting conditions, from the latent embedding extracted by the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is on the left, and the portraits on the right are generated from its high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. They train a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized with mean-squared error. It is important to note that both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 shows reconstructions from both models. It is evident that the representations learned by the PixelCNN autoencoder are quite different from those of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people without directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from an autoencoder that uses a PixelCNN as its decoder. Both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN can match, and in some cases outperform, the PixelRNN. To deal with the "blind spots" in the receptive fields of the original PixelCNN, the Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors use a gated tanh/sigmoid activation in place of the ReLU activation, because these multiplicative units help to model more complex interactions. The proposed network obtains performance similar to PixelRNN on CIFAR-10 and is state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is able to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model can generate new images of the same person in different poses and lighting conditions given a single source image. Finally, the authors show that the PixelCNN can be used as the image decoder of an autoencoder; although the log-likelihood is similar to values reported in the literature, the samples generated by the PixelCNN autoencoder are of high visual quality and show natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# Conditioning on class labels (one-hot)<br />
# Conditioned on portrait embedding<br />
# PixelCNN AutoEncoders<br />
<br />
=Critique=<br />
# The paper is not very descriptive, and does not explain well how the horizontal and vertical stacks solve the "blind spot" problem. In addition, the authors describe the "gated block" and how they designed it, but do not explain the intuition behind it or why this approach improves on the PixelCNN <br />
# The authors do not provide a good pictorial representation of any of the aforementioned novelties<br />
# The description of the PixelCNN AutoEncoder is not detailed enough <br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30908STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-20T16:39:06Z<p>Asriram: </p>
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallelize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embedding to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3 [[#Reference|[7]]]. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions)[[#Reference|[3]]]. Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighborhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions [[#Reference|[4]]]. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions [[#Reference|[3]]]. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important mention is that the residual connections are only for horizontal stacks. On the other side skip connections allow as to incorporate features from all layers at the very end of out network. Most important stuff to mention here is that skip and residual connection use different weights after gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape [number of classes, number of filters], $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=== PixelCNN Auto-Encoders ===<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[5]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the Gated PixelCNN unconditioned model on the CIFAR-10 dataset is adopted. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed, wherein the lower score means that the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90) which outperforms the PixelCNN by 0.11 bits/dim, which obtains 3.14 (3.08). Although the performance is a bit better, visually the quality of the samples that were produced is much better for the Gated PixelCNN when compared to PixelCNN. It is important to note that the Gated PixelCNN came close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), and the corresponding training performance is in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 1: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the the second set of experiments, the authors evaluated the Gated PixelCNN model by conditioning the classes of the ImageNet images. Using the one-hot encoding $(h_i)$, for which the $i^th$ class the distribution becomes $p(x|h_i)$, the model receives roughly log(1000) $\approx$ 0.003 bits/pixel for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, visually the quality of the images were generated much better when compared to the original PixelCNN. <br />
<br />
Figure 11 shows some samples from 8 different classes of ImageNet images from a single class-conditioned model. It is evident that the Gated PixelCNN can better distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings from the animal and object class, when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of the CNN trained on a large database of portraits that were automatically cropped from Flickr images using face detector. This pre-trained network was trained using triplet loss function which ensured a similar the latent embeddings for particular face across the entire dataset. <br />
<br />
In essence, the authors took the latent vector from this supervised pre-trained network which now has the architecture (image=$x$, embedding=$h$) tuples and trained the<br />
Conditional PixelCNN with the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model now has the capability to compute the latent embeddings $h=f(x)$ such that the output will generate new portraits of the same person. Figure 12 provides a pictorial example of the aforementioned manipulated network where it is evident that the generative model can produce a variety of images, independent from pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: Input image is to the lest, whereas the portraits to the right are generated from high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiment, the authors venture the possibility to train the a Gated PixelCNN by adopting the Autoencoder architecture. The authors start by training a PixelCNN auto-encoder using $32 \times 32$ ImageNet patches and compared its results to a convolutional autoencoder, optimized using mean-square error. It is important to note that both the models use a 10 or 100 dimensional bottleneck. <br />
<br />
Figure 13 provides a reconstruction using both the models. It is evident that the latent embedding produced when using PixelCNN autoencoder is much different when compared to convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder is able to generate similar looking indoor scenes with people without directly trying to "reconstruct" the input, as done by the convolutional autoencoder.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, conditional samples from a PixelCNN as the deconvolution to the autoencoder. It is important to note that both these autoencoders were trained end-to-end with 10 and 100-dimensional bottleneck values.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN which is an improvement over the original PixelCNN. In addition to the Gated PixelCNN being more computationally efficient, it now has the ability to match, and in some cases, outperform PixelRNN. In order to deal with the "blind spots" in the receptive fields presented in the PixelCNN, the newly proposed Gated PixelCNN use two CNN stacks (horizontal and vertical filters) to deal with this problem. Moreover, the authors now use a custom-made tank and sigmoid function over the ReLU activation functions because these multiplicative units helps to model more complex interactions. The proposed network obtains a similar performance to PixelRNN on CIFAR-10, however, it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is also explored on natural images using three different settings. When using class-conditional generation, the network showed that a single model is able to generate diverse and realistic looking images corresponding to different classes. When looking at generating human portraits, the model does have the ability to generate new images from the same person in different poses and lightning conditions given a single image. Finally, the authors also showed that the PixelCNN can be used as image decoder in an autoencoder. Although the log-likelihood is quite similar when comparing it to literature, the samples generated from the PixelCNN autoencoder model does provide a high visual quality images showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# PixelCNN AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30907STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-20T16:37:59Z<p>Asriram: /* Summary */</p>
<hr />
<div>=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN models introduced by van den Oord et al. in [[#Reference|[1]]]. In that earlier work, the authors observed that PixelRNN performed better than PixelCNN, but PixelCNN was faster to train because its computation can be parallelized. In this work, van den Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional variant of the PixelRNN model based on the PixelCNN. In particular, the Gated PixelCNN models explicit probability densities and generates new images with autoregressive connections, computing images pixel by pixel by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN improves over the PixelCNN by removing the "blind spot" problem and, to yield better performance, replaces the ReLU units with a gated combination of sigmoid and tanh activations. The proposed Gated PixelCNN combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the faster training of PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduce a conditional variant (called Conditional PixelCNN) which can generate images conditioned on class labels, tags, or latent embeddings, yielding new image density models. These embeddings capture high-level information about an image and allow a large variety of images with similar features to be generated; for instance, conditioning on a one-hot class encoding lets a single model generate diverse samples from each ImageNet class, while conditioning on a portrait embedding lets the model generate different poses of the same person from a single source image, giving insight into the invariances captured by the embeddings. Finally, the authors also present a PixelCNN auto-encoder variant, which essentially replaces the deconvolutional decoder with a conditional PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative approach: given an image with $n^2$ pixels $x_1, \ldots, x_{n^2}$, each "unknown" pixel $x_i$ is predicted from the pixels generated before it, and the prediction is fed back into the model for the next step. To do this, PixelCNNs and PixelRNNs model the joint distribution $p(x)$ of an image as a product of conditional distributions. In other words, they are autoregressive models that apply the plain chain rule to the joint distribution, as shown in Equation 1: the very first pixel is unconditioned, the second depends on the first, the third depends on the first and second, and so on. The image is thus modelled as a sequence of points where each pixel depends on all previous ones. Equation 1 gives the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel given the values of all previous pixels. By the chain rule, the joint probability $p(x_1, x_2, ..., x_{n^2})$ is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 provides a pictorial view of this factorization: pixels are generated row by row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, a softmax layer at the end of the PixelCNN predicts the distribution over pixel intensity values (indices 0 to 255), from which the most probable value can be taken or a value sampled. Figure 2 [[#Reference|[7]]] illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
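<br />
To make this concrete, the following is a minimal NumPy sketch of the sequential sampling loop described above: each pixel is drawn from a softmax over 256 intensities and then fed back into the model. Here <code>predict_logits</code> is a hypothetical stand-in for the PixelCNN forward pass (it must only look at pixels already generated); the names and shapes are illustrative assumptions, not the authors' code.<br />
<pre>
import numpy as np

def sample_image(predict_logits, height, width, rng=np.random.default_rng(0)):
    # Generate a grayscale image pixel by pixel, row by row.
    image = np.zeros((height, width), dtype=np.int64)
    for i in range(height):
        for j in range(width):
            logits = predict_logits(image, i, j)      # 256 logits for pixel (i, j)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                      # softmax over intensities 0..255
            image[i, j] = rng.choice(256, p=probs)    # sample, then feed the pixel back in
    return image

# Toy usage: a "model" that ignores its context and predicts uniform intensities.
uniform_model = lambda img, i, j: np.zeros(256)
tiny_sample = sample_image(uniform_model, 4, 4)
</pre>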
<br />
The PixelCNN therefore maps a neighbourhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model may only condition on the previously generated pixels $x_1, ..., x_{i-1}$, and every conditional distribution is modelled by the same convolutional neural network. For instance, consider a $5\times5$ image (with each pixel labelled by a letter and the border zero-padded) and a $3\times3$ filter that slides over the image, multiplying elementwise and summing to produce a single response. We cannot use this filter as-is, because pixel $a$ should not see the intensities of $b, f, g$ (future pixel values). To address this, the authors place a mask on top of the filter that keeps only prior pixels and zeroes out future pixels, excluding them from the computation, as depicted in Figure 3 [[#Reference|[7]]]. Masking the convolution filters ensures the CNN can only use information about pixels above and to the left of the current pixel; the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
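<br />
As an illustration, a binary mask of this kind can be built as follows. This is a single-channel NumPy sketch that ignores the R/G/B channel ordering discussed next; <code>causal_mask</code> is an assumed helper name, not from the paper.<br />
<pre>
import numpy as np

def causal_mask(kernel_size, include_centre=False):
    # Zero out weights on the centre pixel's "future": everything to its right
    # and every row below it. include_centre=False is the first-layer case,
    # where the current pixel must not see itself.
    mask = np.ones((kernel_size, kernel_size))
    centre = kernel_size // 2
    mask[centre, centre + (1 if include_centre else 0):] = 0.0  # right of centre (and centre itself when excluded)
    mask[centre + 1:, :] = 0.0                                  # all rows below the centre
    return mask

print(causal_mask(3))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
</pre>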
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
As the receptive field of the masked kernel grows over the image, a significant portion of the image is disregarded. For instance, with a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning pixel contents in that region are ignored. To address the blind spot, the authors use two filters (a horizontal and a vertical stack) in conjunction so that the whole receptive field is captured, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all rows above the current pixel. The vertical stack, which does not require masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer: each layer in the horizontal stack takes as input the output of the previous layer as well as the output of the vertical stack. Splitting the convolution into these two operations gives the model access to all pixels preceding the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple in Figure 7), the convolution conditions only on the current row, so it has access to pixels to the left. In essence, a $1 \times (\lfloor n/2 \rfloor + 1)$ convolution with a shift (pad and crop) is used instead of a $1\times n$ masked convolution. For $n = 3$, this means convolving each row with a kernel of width 2 (instead of 3), then padding and cropping the output so that the image shape stays the same. Hence, the row is convolved with a width-2 kernel and no mask is needed.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last output pixel (just before the 'Crop here' line) does not hold information from the last input sample (the dashed line).<br />
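<br />
A rough NumPy sketch of this pad-and-crop row convolution is shown below. It implements the strictly-causal (first-layer) behaviour, in which the current pixel does not see itself; the helper name and shapes are illustrative assumptions, not the authors' implementation.<br />
<pre>
import numpy as np

def causal_horizontal_conv(x, kernel_row):
    # Horizontal-stack sketch: output (i, j) depends only on pixels to the
    # left of column j in the same row, via left zero-padding and cropping.
    (kw,) = kernel_row.shape                # kw = n//2 + 1, e.g. 2 for n = 3
    H, W = x.shape
    padded = np.pad(x, ((0, 0), (kw, 0)))   # kw zero columns on the left
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i, j:j + kw] * kernel_row)
    return out
</pre>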
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all pixels above the current row. Its kernel size is $(\lfloor n/2 \rfloor + 1) \times n$, and the input image is padded with extra rows at the top and bottom. The convolution is then applied and the output is cropped so that each predicted pixel depends only on the rows above it (while preserving the spatial dimensions) [[#Reference|[3]]]. Since the vertical filter never touches "future" pixel values, only pixels in rows above, no masking is needed. The output of the vertical stack carries information from the pixels above and passes it to the horizontal stack, which is what eliminates the "blind spot" problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9, the image is first padded (left) with kernel-height rows of zeros, the convolution is performed, and the output is cropped so that its rows are shifted down by one with respect to the input image. As a result, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
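<br />
The same pad-and-crop idea for the vertical stack can be sketched as follows (single channel, single filter). The function name and shapes are assumptions for illustration, not the paper's code.<br />
<pre>
import numpy as np

def causal_vertical_conv(x, kernel):
    # Vertical-stack sketch: output (i, j) depends only on input rows strictly
    # above row i, obtained by top zero-padding and implicit cropping.
    kh, kw = kernel.shape              # kh = n//2 + 1, kw = n (assume odd kw)
    H, W = x.shape
    padded = np.pad(x, ((kh, 0), (kw // 2, kw // 2)))   # kh zero rows on top, centred in width
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out
</pre>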
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN at generating new images. This is because the spatial LSTM layers of the PixelRNN allow every layer in the network to access the entire neighbourhood of previous pixels, whereas the PixelCNN's effective neighbourhood grows only with the size and depth of its convolution layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of LSTM gates), which may help it model more complex interactions [[#Reference|[3]]]. To bring these benefits into the proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, given in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the layer index, $f$ and $g$ index the two sets of feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution operator. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
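<br />
A direct NumPy transcription of Equation 2 might look as follows, where <code>conv_f</code> and <code>conv_g</code> stand for the two pre-activation feature maps $W_{k,f} \ast x$ and $W_{k,g} \ast x$ (illustrative names, not the authors' code):<br />
<pre>
import numpy as np

def gated_activation(conv_f, conv_g):
    # y = tanh(W_f * x) elementwise-times sigmoid(W_g * x)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.tanh(conv_f) * sigmoid(conv_g)
</pre>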
<br />
Figure 10 illustrates a single layer of the Gated PixelCNN architecture. The vertical stack feeds into the horizontal stack through a $1\times1$ convolution; connecting them the other way would break the conditional distribution. In other words, the two stacks are kept largely independent: the vertical stack must not access any information held by the horizontal stack, otherwise it would see pixels it should not, whereas the horizontal stack may read from the vertical stack, since it predicts the pixel that follows those covered by the vertical stack. In the figure, the (masked) convolution operations are shown in green, and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) convolution, shown in blue, to increase parallelization; the resulting $2p$ feature maps are then split into two groups of $p$. The authors also use a residual connection in the horizontal stack. Finally, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $(\lceil n/2 \rceil \times 1)$ and $(\lceil n/2 \rceil \times n)$ convolutions followed by a shift in pixels, using padding and cropping to recover the original image dimensions.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
It is also worth noting that the residual connections are used only in the horizontal stack, while skip connections allow features from all layers to be incorporated at the very end of the network. Importantly, the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, providing an image together with its associated class label in the MNIST/CIFAR datasets. During training, both the image and the class are fed to the network so that it learns to incorporate that information; during inference, one can then specify which class the output image should belong to. In principle any information can be passed in as conditioning; the discussion starts with class labels.<br />
<br />
For a conditional PixelCNN, the provided high-level image description is represented as a latent vector $h$, and the goal is to model the conditional distribution $p(x|h)$, i.e. how likely an image is given that description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
The conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the activation function then becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside the tanh and sigmoid functions: each $V$ matrix has shape [number of classes, number of filters], $k$ is the layer index, and the classes are passed as a one-hot vector $h$ during both training and inference.<br />
<br />
If the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is then independent of the location of the pixel; this is appropriate when the latent vector describes "what the image should contain" rather than where its contents should appear. For instance, one could specify that a certain animal or object should appear, in varying positions, poses and backgrounds.<br />
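<br />
A small NumPy sketch of this class-conditional bias is given below: the one-hot vector picks out one row of each $V$ matrix, and the resulting bias is broadcast over all spatial locations. Names and shapes are illustrative assumptions.<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_gate(conv_f, conv_g, h, V_f, V_g):
    # conv_f, conv_g : (H, W, p) pre-activations from the masked convolutions
    # h              : (num_classes,) one-hot class vector
    # V_f, V_g       : (num_classes, p), so h @ V is a per-feature-map bias
    #                  that is identical at every spatial location
    bias_f = h @ V_f
    bias_g = h @ V_g
    return np.tanh(conv_f + bias_f) * sigmoid(conv_g + bias_g)
</pre>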
<br />
In addition, the authors also develop a variant in which the conditional distribution depends on location (useful when the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) using a deconvolutional neural network $m(\cdot)$; this yields a location-dependent bias as follows:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
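<br />
For this location-dependent variant, the $1\times1$ convolution of the spatial map $s$ is just a per-location matrix multiplication over the channel dimension, as in this sketch (illustrative shapes and names):<br />
<pre>
import numpy as np

def location_dependent_bias(s, V):
    # s: (H, W, d) spatial map s = m(h);  V: (d, p) weights of the unmasked 1x1 convolution.
    # The returned bias of shape (H, W, p) differs at every spatial location,
    # unlike the spatially constant class-conditional bias above.
    return s @ V

bias = location_dependent_bias(np.zeros((32, 32, 8)), np.zeros((8, 16)))  # (32, 32, 16)
</pre>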
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images given a conditioning vector via $p(x|h)$, the same idea can be applied to the image decoder of an autoencoder. Introduced by Hinton et al. in [[#Reference|[5]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the resulting network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: because much of the low-level pixel statistics is handled by the PixelCNN decoder, the encoder can omit those statistics and focus on higher-level abstract information.<br />
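<br />
The overall wiring can be sketched as follows: the encoder produces the bottleneck $h$, and the conditional PixelCNN decoder is trained by maximizing the likelihood of the original pixels given $h$ and the previous pixels. Both components below are hypothetical stand-ins (a random-projection encoder and an abstract <code>conditional_logits</code> function), meant only to show the training signal, not the authors' architecture.<br />
<pre>
import numpy as np

def encode(x, dim=10, rng=np.random.default_rng(0)):
    # Stand-in encoder: a real model would be a CNN; a fixed random projection
    # to a `dim`-dimensional bottleneck keeps the sketch runnable.
    proj = rng.standard_normal((x.size, dim))
    return x.astype(float).ravel() @ proj

def pixelcnn_autoencoder_nll(x, conditional_logits):
    # Negative log-likelihood of image x under a conditional PixelCNN decoder
    # that sees the bottleneck h plus all previously generated pixels.
    h = encode(x)
    H, W = x.shape
    nll = 0.0
    for i in range(H):
        for j in range(W):
            logits = conditional_logits(x, h, i, j)   # 256 logits for pixel (i, j)
            log_probs = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
            nll -= log_probs[x[i, j]]
    return nll / (H * W)

# Toy usage with a context-free stand-in decoder.
toy_decoder = lambda x, h, i, j: np.zeros(256)
loss = pixelcnn_autoencoder_nll(np.zeros((8, 8), dtype=np.int64), toy_decoder)
</pre>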
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditional Gated PixelCNN on the CIFAR-10 dataset. The validation scores of the Gated PixelCNN, PixelCNN, and PixelRNN are compared, where a lower score means the optimized model generalizes better. Using the negative log-likelihood (NLL) criterion, the Gated PixelCNN obtains an NLL test (train) score of 3.03 (2.90), outperforming the PixelCNN, which obtains 3.14 (3.08), by 0.11 bits/dim. Although the quantitative improvement is modest, the visual quality of the generated samples is much better for the Gated PixelCNN than for the PixelCNN. It is important to note that the Gated PixelCNN comes close to the performance of PixelRNN, which achieves 3.00 (2.93). Table 1 reports the test performance of benchmark models on CIFAR-10 in bits/dim (lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment is performed on ImageNet at image sizes $32 \times 32$ and $64 \times 64$. For $32 \times 32$ images, the Gated PixelCNN obtains an NLL test (train) of 3.83 (3.77), outperforming PixelRNN, which achieves 3.86 (3.83); the authors note that larger models perform better and that the simpler PixelCNN scales better. For $64 \times 64$ images, the Gated PixelCNN obtains 3.57 (3.48), again outperforming PixelRNN, which achieves 3.63 (3.57). The authors mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM) but trains roughly twice as fast, taking 60 hours on 32 GPUs. The Gated PixelCNN used here has 20 layers, each with 384 hidden units and a filter size of 5x5. Training consisted of 200K synchronous updates over 32 GPUs in TensorFlow with a total batch size of 128. Table 2 reports the performance of benchmark models on ImageNet in bits/dim (lower is better), with training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on the ImageNet dataset for an unconditioned Gated PixelCNN model.]]<br />
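<br />
For reference, the bits/dim figures in these tables are the per-image negative log-likelihood (in nats) converted to base 2 and divided by the number of colour-channel dimensions; a quick sanity-check computation (the NLL value here is made up to illustrate the conversion):<br />
<pre>
import numpy as np

def bits_per_dim(nll_nats_per_image, height, width, channels=3):
    # Convert a per-image negative log-likelihood (in nats) into bits per
    # colour-channel dimension, the unit used in Tables 1 and 2.
    num_dims = height * width * channels
    return nll_nats_per_image / (np.log(2.0) * num_dims)

# Example: an average NLL of about 6460 nats per 32x32 RGB image is ~3.03 bits/dim.
print(round(bits_per_dim(6460.0, 32, 32), 2))
</pre>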
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluate the Gated PixelCNN conditioned on the class labels of ImageNet images. Using a one-hot encoding $h_i$ for the $i$-th class, the distribution becomes $p(x|h_i)$; the additional information supplied by the label is at most $\log_2(1000) \approx 10$ bits per image, i.e. roughly 0.003 bits per dimension for a $32 \times 32$ image. Although the log-likelihood does not improve significantly, the visual quality of the generated images is much better than that of the original PixelCNN. <br />
<br />
Figure 11 shows samples from 8 different ImageNet classes, all generated by a single class-conditioned model. The Gated PixelCNN is evidently able to distinguish between objects, animals and backgrounds. The authors observe that the model generalizes and generates new renderings of the animal and object classes, even though it is trained on only approximately 1000 images per class.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors use the top layer of a CNN trained on a large database of portraits automatically cropped from Flickr images using a face detector. This network was pre-trained with a triplet loss function, which ensures that a particular face receives similar latent embeddings across the entire dataset. <br />
<br />
In essence, the authors take the latent embeddings produced by this supervised pre-trained network, form (image $x$, embedding $h$) tuples, and train the Conditional PixelCNN on them to model the distribution $p(x|h)$. If the network is then given a face that is not in the training set, it can compute the latent embedding $h=f(x)$ and generate new portraits of the same person. Figure 12 shows an example: the generative model produces a variety of images, varying in pose and lighting conditions, from the latent embedding extracted by the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is on the left; the portraits to the right are generated from its high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. They start by training a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized with mean squared error. It is important to note that both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 provides reconstructions from both models. It is evident that the latent embedding produced by the PixelCNN autoencoder is quite different from that of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people rather than directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from a PixelCNN used as the decoder of the autoencoder. Note that both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN is able to match, and in some cases outperform, PixelRNN. To deal with the "blind spot" in the receptive field of the original PixelCNN, the Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors replace the ReLU activations with a gated combination of tanh and sigmoid functions, because these multiplicative units help model more complex interactions. The proposed network obtains performance similar to PixelRNN on CIFAR-10, and it is state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model is able to generate new images of the same person in different poses and lighting conditions given a single source image. Finally, the authors also show that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is similar to that reported in the literature, the samples generated from the PixelCNN autoencoder are of high visual quality and show natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance to PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# PixelCNN AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30906STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-20T16:32:00Z<p>Asriram: </p>
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallelize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embedding to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3 [[#Reference|[7]]]. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions)[[#Reference|[3]]]. Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighborhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions [[#Reference|[4]]]. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions [[#Reference|[3]]]. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important mention is that the residual connections are only for horizontal stacks. On the other side skip connections allow as to incorporate features from all layers at the very end of out network. Most important stuff to mention here is that skip and residual connection use different weights after gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape [number of classes, number of filters], $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[5]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the Gated PixelCNN unconditioned model on the CIFAR-10 dataset is adopted. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed, wherein the lower score means that the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90) which outperforms the PixelCNN by 0.11 bits/dim, which obtains 3.14 (3.08). Although the performance is a bit better, visually the quality of the samples that were produced is much better for the Gated PixelCNN when compared to PixelCNN. It is important to note that the Gated PixelCNN came close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), and the corresponding training performance is in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 1: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the the second set of experiments, the authors evaluated the Gated PixelCNN model by conditioning the classes of the ImageNet images. Using the one-hot encoding $(h_i)$, for which the $i^th$ class the distribution becomes $p(x|h_i)$, the model receives roughly log(1000) $\approx$ 0.003 bits/pixel for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, visually the quality of the images were generated much better when compared to the original PixelCNN. <br />
<br />
Figure 11 shows some samples from 8 different classes of ImageNet images from a single class-conditioned model. It is evident that the Gated PixelCNN can better distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings from the animal and object class, when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of the CNN trained on a large database of portraits that were automatically cropped from Flickr images using face detector. This pre-trained network was trained using triplet loss function which ensured a similar the latent embeddings for particular face across the entire dataset. <br />
<br />
In essence, the authors took the latent vector from this supervised pre-trained network which now has the architecture (image=$x$, embedding=$h$) tuples and trained the<br />
Conditional PixelCNN with the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model now has the capability to compute the latent embeddings $h=f(x)$ such that the output will generate new portraits of the same person. Figure 12 provides a pictorial example of the aforementioned manipulated network where it is evident that the generative model can produce a variety of images, independent from pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: Input image is to the lest, whereas the portraits to the right are generated from high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiment, the authors venture the possibility to train the a Gated PixelCNN by adopting the Autoencoder architecture. The authors start by training a PixelCNN auto-encoder using $32 \times 32$ ImageNet patches and compared its results to a convolutional autoencoder, optimized using mean-square error. It is important to note that both the models use a 10 or 100 dimensional bottleneck. <br />
<br />
Figure 13 provides a reconstruction using both the models. It is evident that the latent embedding produced when using PixelCNN autoencoder is much different when compared to convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder is able to generate similar looking indoor scenes with people without directly trying to "reconstruct" the input, as done by the convolutional autoencoder.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, conditional samples from a PixelCNN as the deconvolution to the autoencoder. It is important to note that both these autoencoders were trained end-to-end with 10 and 100-dimensional bottleneck values.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN which is an improvement over the original PixelCNN. In addition to the Gated PixelCNN being more computationally efficient, it now has the ability to match, and in some cases, outperform PixelRNN. In order to deal with the "blind spots" in the receptive fields presented in the PixelCNN, the newly proposed Gated PixelCNN use two CNN stacks (horizontal and vertical filters) to deal with this problem. Moreover, the authors now use a custom-made tank and sigmoid function over the ReLU activation functions because these multiplicative units helps to model more complex interactions. The proposed network obtains a similar performance to PixelRNN on CIFAR-10, however, it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is also explored on natural images using three different settings. When using class-conditional generation, the network showed that a single model is able to generate diverse and realistic looking images corresponding to different classes. When looking at generating human portraits, the model does have the ability to generate new images from the same person in different poses and lightning conditions given a single image. Finally, the authors also showed that the PixelCNN can be used as image decoder in an autoencoder. Although the log-likelihood is quite similar when comparing it to literature, the samples generated from the PixelCNN autoencoder model does provide a high visual quality images showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30905STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-20T16:29:45Z<p>Asriram: </p>
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embedding to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative approach: given an image $x$ with $n^2$ pixels, the pixels are generated one at a time, and each "unknown" pixel $x_i$ is predicted from the pixels generated before it. To do this, the traditional PixelCNNs and PixelRNNs write the joint distribution $p(x)$ over the pixels of an image as a product of conditional distributions; that is, they are autoregressive models that apply the plain chain rule to the joint distribution, as depicted in Equation 1. The very first pixel is modelled unconditionally, the second is conditioned on the first, the third on the first and second, and so on. In other words, the image is treated as a sequence of pixels, each of which depends on all previous ones. Equation 1 depicts the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $n^2$ is the number of pixels and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$-th pixel given the values of all previous pixels. It is important to note that the joint probability $p(x_1, ..., x_{n^2})$ follows from the chain rule: it is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 provides a pictorial view of this factorization: the pixels are generated row by row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, a softmax layer at the end of the PixelCNN predicts the pixel intensity, i.e. the most probable value from 0 to 255. Figure 2 [[#Reference|[7]]] illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
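<br />
Since generation is sequential, sampling an image means running the network once per pixel and drawing each intensity from the predicted softmax. The following is a minimal sketch (in Python/PyTorch, for a single-channel image); the <code>model</code> argument is a stand-in for any masked autoregressive network that returns per-pixel logits over the 256 intensities.<br />
<pre>
import torch

@torch.no_grad()
def sample_image(model, height=28, width=28, device="cpu"):
    # Start from an empty canvas and fill it in pixel by pixel (raster order).
    x = torch.zeros(1, 1, height, width, device=device)
    for i in range(height):
        for j in range(width):
            logits = model(x)                                 # assumed shape: (1, 256, H, W)
            probs = torch.softmax(logits[0, :, i, j], dim=0)  # distribution over 256 intensities
            pixel = torch.multinomial(probs, 1).item()        # sample an intensity in {0, ..., 255}
            x[0, 0, i, j] = pixel / 255.0                     # feed the sampled pixel back in
    return x
</pre>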
<br />
The PixelCNN therefore maps a neighbourhood of pixels to a prediction for the next pixel: to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$, and every conditional distribution is modelled by a convolutional neural network. For instance, consider a $5\times5$ image (label each pixel with a letter and zero-pad the image) and a $3\times3$ filter that slides over the image, multiplying each element and summing to produce a single response. We cannot use such a filter directly, because pixel $a$ should not see the intensities of $b,f,g$ (future pixel values). To counter this issue, the authors apply a mask to the filter that keeps only prior pixels and zeroes out the weights on future pixels so they do not enter the computation, as depicted in Figure 3 [[#Reference|[7]]]. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the convolution filters are masked; that is, the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
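<br />
The masking itself is simple to implement: build a binary mask over the kernel and multiply it into the weights before every convolution. The sketch below (PyTorch) covers the single-channel case only; the per-channel R/G/B ordering of Figure 5 is omitted. Mask "A" (used in the first layer) also hides the centre pixel, while mask "B" keeps it.<br />
<pre>
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.weight.shape[-2:]
        mask = torch.ones(kH, kW)
        # Hide the centre pixel and everything to its right (mask A),
        # or only the pixels strictly to the right of the centre (mask B) ...
        mask[kH // 2, kW // 2 + int(mask_type == "B"):] = 0
        # ... and every row below the centre row.
        mask[kH // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask     # re-apply the mask before each use
        return super().forward(x)

# Example: a first-layer 7x7 mask-A convolution from 1 to 64 feature maps.
layer = MaskedConv2d("A", 1, 64, kernel_size=7, padding=3)
</pre>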
<br />
For each pixel there are three colour channels (R, G, B), which are modelled successively: B is conditioned on (R, G), and G is conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three groups and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that, as the receptive field of the masked kernel grows over the image, it disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that pixel contents in that region are ignored. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction so that the whole receptive field is captured, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer. Hence, every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (shown in purple in Figure 7), the convolution conditions only on the current row, so it has access to the pixels to the left. In essence, a $1 \times (n//2+1)$ convolution with a shift (pad and crop) is used rather than a $1\times n$ masked convolution. For a $3\times3$ filter, this means convolving each row with a kernel of width 2 (instead of 3) and then padding and cropping the output so that the image shape stays the same. Hence, the image is convolved with a kernel of width 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel of the output (just before the "Crop here" line) does not hold information from the last input sample (the dashed line).<br />
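<br />
A minimal sketch of this pad-and-shift trick (PyTorch) is shown below; this version includes the current pixel in the receptive field (the "mask B" behaviour), and the first layer would simply shift one more column to exclude it.<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class HorizontalStackConv(nn.Module):
    """1 x (n//2 + 1) convolution, shifted so each output pixel only sees
    pixels to its left (and itself) in the same row -- no mask tensor needed."""
    def __init__(self, channels, n=3):
        super().__init__()
        self.k = n // 2 + 1
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, self.k))

    def forward(self, x):
        x = F.pad(x, (self.k - 1, 0, 0, 0))   # pad columns on the left only
        return self.conv(x)                   # output width equals input width
</pre>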
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all pixels above the current one. It uses a kernel of size $(n//2 + 1) \times n$, with the input image padded with an extra row at the top and bottom. The convolution is then performed, and the output is cropped so that the predicted pixel depends on the upper rows only (while preserving the spatial dimensions) [[#Reference|[3]]]. Since the vertical filter sees only pixel values from the rows above, and never a "future" pixel, no masking is needed. The output of the vertical stack thus carries information from the pixels above, and this information is passed on to the horizontal stack, which is what eliminates the "blind spot" problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel-height rows of zeros, the convolution is performed, and the output is cropped so that the rows are shifted by one with respect to the input image. Hence, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
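<br />
The vertical stack can be sketched the same way: pad rows of zeros on top, convolve, and crop so that every output row depends only on the input rows strictly above it.<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class VerticalStackConv(nn.Module):
    """(n//2 + 1) x n convolution whose output is shifted down by one row."""
    def __init__(self, channels, n=3):
        super().__init__()
        self.kH, kW = n // 2 + 1, n
        self.conv = nn.Conv2d(channels, channels, kernel_size=(self.kH, kW),
                              padding=(0, kW // 2))

    def forward(self, x):
        height = x.shape[2]
        x = F.pad(x, (0, 0, self.kH, 0))      # pad kH rows of zeros on top
        out = self.conv(x)
        return out[:, :, :height, :]          # row i now only sees rows above i
</pre>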
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer of the network to access the entire neighbourhood of previous pixels, whereas the PixelCNN's effective context is limited by the filter size and the depth of its convolutional layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it model more complex interactions [[#Reference|[3]]]. To bring these benefits into the newly proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, depicted in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the layer index, $f$ and $g$ denote the filter and gate feature maps, $\odot$ is the element-wise product, and $\ast$ is convolution with the input. This gated unit is the key ingredient of the Gated PixelCNN model. <br />
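<br />
As a sketch, the gate reduces to two lines once a single (masked) convolution has produced $2p$ feature maps: split them into the tanh half and the sigmoid half and multiply element-wise.<br />
<pre>
import torch

def gated_activation(features):
    # features: output of one masked convolution with 2p channels,
    # i.e. the W_{k,f} and W_{k,g} parts stacked along the channel axis.
    x_f, x_g = features.chunk(2, dim=1)           # two groups of p feature maps
    return torch.tanh(x_f) * torch.sigmoid(x_g)   # Equation 2
</pre>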
<br />
Figure 10 illustrates a single layer of the Gated PixelCNN architecture. The vertical stack contributes to the horizontal stack through a $1\times1$ convolution; connecting them the other way would break the conditional distribution. In other words, the two stacks are kept largely independent: the vertical stack must not receive any information from the horizontal stack, otherwise it would gain access to pixels it should not see. The vertical stack can, however, feed into the horizontal stack, since the pixel predicted by the horizontal stack comes after the pixels covered by the vertical stack. In the figure, the (masked) convolution operations are shown in green, and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) convolution, shown in blue, to increase parallelization; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $(\lceil n/2 \rceil \times 1)$ and $(\lceil n/2 \rceil \times n)$ convolutions followed by a shift in pixels (padding and cropping) to restore the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be incorporated at the very end of the network. Note that the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, the class label associated with an image in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that the network learns to incorporate that information as well. During inference, one can then specify which class the output image should belong to. Any information can be passed through conditioning; class labels are the simplest case.<br />
<br />
For a conditional PixelCNN, a provided high-level image description is represented as a latent vector $h$, and this latent vector is used to model the conditional distribution $p(x|h)$, i.e. the probability that an image suits this description. The conditional PixelCNN therefore models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
The conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the gated activation then becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside the tanh and sigmoid functions; the $V$ matrices have shape [number of classes, number of filters], $k$ is the layer number, and the class is passed as a one-hot vector $h$ during both training and inference.<br />
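<br />
A minimal sketch of this conditional gate (PyTorch) is given below; <code>x_f</code> and <code>x_g</code> stand for the outputs of the $W_{k,f}$ and $W_{k,g}$ convolutions, and the one-hot vector $h$ is projected and added as a per-channel bias, constant over all spatial positions.<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    def __init__(self, p, num_classes):
        super().__init__()
        self.v_f = nn.Linear(num_classes, p, bias=False)   # V_{k,f}
        self.v_g = nn.Linear(num_classes, p, bias=False)   # V_{k,g}

    def forward(self, x_f, x_g, h):
        # x_f, x_g: (batch, p, H, W); h: (batch, num_classes) one-hot
        bias_f = self.v_f(h)[:, :, None, None]             # broadcast over H and W
        bias_g = self.v_g(h)[:, :, None, None]
        return torch.tanh(x_f + bias_f) * torch.sigmoid(x_g + bias_g)
</pre>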
<br />
If the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is therefore independent of pixel location, which is appropriate when the latent vector describes what the image should contain rather than where its contents should be located. For instance, one could specify that a certain animal or object should appear, in a variety of positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on location (useful for applications where the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location-dependent bias as follows:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
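<br />
In code, the only change from the sketch above is that the per-channel bias is replaced by an unmasked $1\times1$ convolution of the spatial map $s = m(h)$; the sizes below are purely illustrative.<br />
<pre>
import torch
import torch.nn as nn

p, s_channels = 128, 16                        # illustrative sizes
s = torch.randn(1, s_channels, 32, 32)         # stand-in for the output of the deconvolutional network m(h)
v_f = nn.Conv2d(s_channels, p, kernel_size=1)  # V_{k,f} * s, an unmasked 1x1 convolution
bias_f = v_f(s)                                # added inside tanh(...); one bias per channel and location
</pre>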
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images through the distribution $p(x|h)$, it is natural to use them as the image decoder in an autoencoder. Introduced by Hinton et al. in [[#Reference|[5]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image into a low-dimensional representation (i.e. the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To use the conditional PixelCNN in an autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN and the resulting network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: much of the low-level pixel statistics is handled by the PixelCNN decoder, so the encoder can omit low-level pixel statistics and focus on more abstract, high-level information.<br />
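<br />
Structurally, the change is small: keep a convolutional encoder that produces the bottleneck vector $h$, and let a conditional PixelCNN play the role of the decoder. The sketch below assumes a $32\times32$ RGB input and a <code>pixelcnn_decoder</code> module with the (hypothetical) signature <code>decoder(x, h)</code> that models $p(x|h)$.<br />
<pre>
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    def __init__(self, bottleneck=100, pixelcnn_decoder=None):
        super().__init__()
        self.encoder = nn.Sequential(              # 32x32x3 image -> h
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, bottleneck),
        )
        self.decoder = pixelcnn_decoder            # assumed conditional PixelCNN p(x | h)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(x, h)                  # trained end-to-end; the target is x itself
</pre>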
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditional Gated PixelCNN on the CIFAR-10 dataset. The validation scores of the Gated PixelCNN, PixelCNN, and PixelRNN are compared, where a lower score means the model generalizes better. Using the negative log-likelihood (NLL) criterion, the Gated PixelCNN obtains an NLL test (train) score of 3.03 (2.90) bits/dim, outperforming the PixelCNN, which obtains 3.14 (3.08), by 0.11 bits/dim. Beyond this numerical improvement, the samples produced by the Gated PixelCNN are visually of much higher quality than those of the PixelCNN. It is important to note that the Gated PixelCNN comes close to the performance of the PixelRNN, which achieves 3.00 (2.93). Table 1 reports the test performance of benchmark models on CIFAR-10 in bits/dim (lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
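<br />
The bits/dim numbers are simply the average negative log-likelihood of the 256-way softmax predictions, converted from nats to bits per colour-channel dimension; a sketch of the conversion is shown below.<br />
<pre>
import math
import torch.nn.functional as F

def bits_per_dim(logits, targets):
    # logits: (batch, 256, ...) softmax inputs; targets: same shape without the
    # 256 axis, holding integer intensities in {0, ..., 255}.
    nll_nats = F.cross_entropy(logits, targets, reduction="mean")  # mean NLL per dimension, in nats
    return nll_nats / math.log(2)                                  # convert to bits/dim (lower is better)
</pre>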
<br />
Another set of experiments is performed on ImageNet data at image sizes $32 \times 32$ and $64 \times 64$. In particular, for $32 \times 32$ images, the Gated PixelCNN obtains an NLL test (train) score of 3.83 (3.77), which outperforms the PixelRNN at 3.86 (3.83); the authors note that larger models perform better on this dataset and that the simpler PixelCNN scales better. For $64 \times 64$ images, the Gated PixelCNN obtains 3.57 (3.48), again outperforming the PixelRNN at 3.63 (3.57). The authors mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM) but trains roughly twice as fast, taking 60 hours on 32 GPUs. The Gated PixelCNN used here has 20 layers, each with 384 hidden units and a $5\times5$ filter size. For training, a total of 200K synchronous updates were made over 32 GPUs in TensorFlow with a total batch size of 128. Table 2 reports the performance of benchmark models on ImageNet in bits/dim (lower is better), with the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluate the Gated PixelCNN conditioned on the class labels of ImageNet images. Given the one-hot encoding $h_i$ of the $i^{th}$ class, the model now represents the distribution $p(x|h_i)$. The class label can account for at most $\log_2(1000) \approx 10$ bits per image, i.e. roughly 0.003 bits/dim for a $32 \times 32$ image, so a large log-likelihood improvement is not expected. Indeed, although the log-likelihood did not improve significantly, the visual quality of the generated images improved considerably compared to the original PixelCNN. <br />
<br />
Figure 11 shows samples from 8 different ImageNet classes, all generated by a single class-conditional model. It is evident that the Gated PixelCNN can distinguish between objects, animals and backgrounds. The authors observe that the model generalizes and generates new renderings of the animal and object classes, even though it is provided with only approximately 1000 images per class during training.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors take the top layer of a CNN trained on a large database of portraits automatically cropped from Flickr images using a face detector. This network was pre-trained with a triplet loss, which ensures that the latent embeddings of a particular face are similar across the entire dataset. <br />
<br />
In essence, the authors took the latent embeddings produced by this supervised pre-trained network, forming (image=$x$, embedding=$h$) tuples, and trained the Conditional PixelCNN on them to model the distribution $p(x|h)$. Hence, if the network is given a face that is not in the training set, it can compute the latent embedding $h=f(x)$ and generate new portraits of the same person. Figure 12 provides a pictorial example: the generative model produces a variety of images, independent of pose and lighting conditions, from the latent embeddings extracted by the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is on the left, whereas the portraits to the right are generated from its high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. They train a PixelCNN autoencoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized with mean squared error. It is important to note that both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 shows reconstructions from both models. It is evident that the representations learned by the PixelCNN autoencoder are quite different from those of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people without directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from an autoencoder that uses a PixelCNN as its decoder. Both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being computationally efficient, the Gated PixelCNN matches, and in some cases outperforms, the PixelRNN. To deal with the "blind spots" in the receptive field of the PixelCNN, the Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors replace the ReLU activation functions with a gated combination of tanh and sigmoid, because these multiplicative units help model more complex interactions. The proposed network obtains performance similar to the PixelRNN on CIFAR-10, and is state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images for different classes. For human portraits, the model can generate new images of the same person in different poses and lighting conditions given a single image. Finally, the authors also show that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is similar to that reported in the literature, the samples generated by the PixelCNN autoencoder are of high visual quality and show natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 20- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].</div>Asriram
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the masked filter slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the desired conditional dependency) - this is the "blind spot" problem, illustrated in Figure 6. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the masked kernel's receptive field over the image disregards a significant portion of the image. For instance, with a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents in that region are ignored. To address the blind spot, the authors use two filters (a horizontal stack and a vertical stack) in conjunction so that the whole receptive field is captured, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer: every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (shown in purple in Figure 7), the convolution operation conditions only on the current row, so it has access to the pixels to the left. In essence, we take a $1 \times (\lfloor n/2 \rfloor + 1)$ convolution with a shift (pad and crop) rather than a $1\times n$ masked convolution. For $n = 3$, this means convolving each row with a kernel of width 2 (instead of 3), and the output is padded and cropped so that the image shape stays the same. Hence, the image is convolved with a kernel of width 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel of the output (just before the ‘Crop here’ line) does not hold information from the last input sample (the dashed line).<br />
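A sketch of this pad-and-crop convolution in PyTorch (class name hypothetical; this version, matching Figure 8, excludes the current pixel - later layers may keep it by cropping one column less):<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class HorizontalStackConv(nn.Module):
    """1 x (n//2 + 1) convolution, padded on the left and cropped on the right so that
    each output position only sees pixels strictly to its left."""
    def __init__(self, in_ch, out_ch, n=3):
        super().__init__()
        self.k = n // 2 + 1
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(1, self.k))

    def forward(self, x):
        x = F.pad(x, (self.k, 0, 0, 0))       # pad order: (left, right, top, bottom)
        return self.conv(x)[:, :, :, :-1]     # crop the extra column on the right
</pre>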
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all the pixels above. It uses a kernel of size $(\lfloor n/2 \rfloor + 1) \times n$, with the input image padded with an extra row at the top and bottom. The convolution is then performed and the output is cropped so that the predicted pixel depends only on the rows above it (while preserving the spatial dimensions). Since the vertical filter never covers the current row, only the rows above, no masking is needed - the target pixel is never touched. The output of the vertical stack therefore carries information from the pixels above and is passed to the horizontal stack, which removes the blind spot problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left of the figure) with kernel-height rows of zeros, the convolution is performed, and the output is cropped so that its rows are shifted down by one with respect to the input image. Hence, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
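A matching sketch of the vertical stack (same caveats: hypothetical class name, simplified relative to the paper's implementation):<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class VerticalStackConv(nn.Module):
    """(n//2 + 1) x n convolution whose output at row i only sees input rows above i."""
    def __init__(self, in_ch, out_ch, n=3):
        super().__init__()
        self.kh, self.kw = n // 2 + 1, n
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(self.kh, self.kw))

    def forward(self, x):
        # pad kw//2 columns on the left/right and kh rows of zeros on top, none at the bottom
        x = F.pad(x, (self.kw // 2, self.kw // 2, self.kh, 0))
        return self.conv(x)[:, :, :-1, :]   # crop the bottom row: row i now depends only on rows above i
</pre>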
<br />
=== Gated block ===<br />
The PixelRNN is observed to perform better than the traditional PixelCNN at generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighbourhood of previous pixels, whereas the PixelCNN's effective neighbourhood grows only with the depth of the convolution stack. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of LSTM gates), which may help it model more complex interactions. To bring these benefits of the PixelRNN into the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following gated activation function, depicted in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the index of the layer, $f$ and $g$ denote the two halves of the feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution operator. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
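A one-line sketch of Equation 2 in PyTorch, assuming the convolution has already produced $2p$ feature maps that are split into the $f$ and $g$ halves:<br />
<pre>
import torch

def gated_activation(conv_out):
    """conv_out: (batch, 2p, H, W) feature maps; returns (batch, p, H, W)."""
    f, g = conv_out.chunk(2, dim=1)            # split into the tanh half and the sigmoid half
    return torch.tanh(f) * torch.sigmoid(g)    # element-wise product of Equation 2
</pre>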
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture. The vertical stack contributes to the horizontal stack through a $1\times1$ convolution; going the other way would break the conditional distribution. In other words, the two stacks are kept largely independent: the vertical stack must not access any information the horizontal stack has, otherwise it would see pixels it should not, whereas the vertical stack can feed into the horizontal stack because the horizontal stack predicts the pixel that follows those covered by the vertical stack. In the figure, the convolution operations are shown in green, and the element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) operation, shown in blue, to increase parallelization; its $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $(\lceil n/2 \rceil \times 1)$ and $(\lceil n/2 \rceil \times n)$ convolutions followed by a shift in pixels (padding and cropping) to recover the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
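Putting the pieces together, a simplified gated layer could look as follows (building on the HorizontalStackConv, VerticalStackConv and gated_activation sketches above; conditioning and skip connections are omitted, and this is only an approximation of the authors' implementation):<br />
<pre>
import torch.nn as nn

class GatedLayer(nn.Module):
    def __init__(self, ch, n=3):
        super().__init__()
        self.vert = VerticalStackConv(ch, 2 * ch, n)
        self.horiz = HorizontalStackConv(ch, 2 * ch, n)
        self.vert_to_horiz = nn.Conv2d(2 * ch, 2 * ch, kernel_size=1)  # 1x1 link, vertical -> horizontal
        self.horiz_out = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, v_in, h_in):
        v = self.vert(v_in)
        h = self.horiz(h_in) + self.vert_to_horiz(v)         # the horizontal stack reads the vertical one
        v_out = gated_activation(v)
        h_out = h_in + self.horiz_out(gated_activation(h))   # residual connection, horizontal stack only
        return v_out, h_out
</pre>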
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
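A sketch of this sequential sampling loop (the output layout (batch, 256, channels, N, N) and the [0, 1] normalization are assumptions for illustration; real implementations differ):<br />
<pre>
import torch

@torch.no_grad()
def sample(model, n=32, channels=3, batch=1):
    x = torch.zeros(batch, channels, n, n)
    for i in range(n):
        for j in range(n):
            for c in range(channels):
                logits = model(x)                                   # assumed shape: (batch, 256, channels, n, n)
                probs = torch.softmax(logits[:, :, c, i, j], dim=1)
                pixel = torch.multinomial(probs, num_samples=1).squeeze(1)
                x[:, c, i, j] = pixel.float() / 255.0               # feed the prediction back into the model
    return x
</pre>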
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be incorporated at the very end of the network. Importantly, the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning means feeding the network some high-level side information - for instance, the class label associated with each image in the MNIST/CIFAR datasets. During training, the image is fed to the network together with its class so that the network learns to incorporate that information; during inference, we can specify which class the output image should belong to. Any kind of information can be passed in this way; we start with class labels.<br />
<br />
For a conditional PixelCNN, a provided high-level image description is represented as a latent vector $h$; the purpose of the latent vector is to model the conditional distribution $p(x|h)$, i.e. the probability of an image given that description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, the conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the activation function after adding the latent vector becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside the tanh and sigmoid functions; each matrix $V_{k,\cdot}$ has shape [number of classes, number of filters], $k$ is the layer number, and the classes are passed as a one-hot vector $h$ during training and inference.<br />
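A sketch of this class-conditional bias (hypothetical class name; $V$ is implemented here as a bias-free linear map from the one-hot vector to the $2p$ feature maps):<br />
<pre>
import torch
import torch.nn as nn

class ClassConditionalGate(nn.Module):
    def __init__(self, p, num_classes):
        super().__init__()
        self.V = nn.Linear(num_classes, 2 * p, bias=False)   # weight of shape [2p, num_classes]

    def forward(self, conv_out, h_onehot):
        """conv_out: (batch, 2p, H, W); h_onehot: (batch, num_classes) one-hot class vector."""
        bias = self.V(h_onehot)[:, :, None, None]             # same bias at every spatial location
        f, g = (conv_out + bias).chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)
</pre>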
<br />
Note that if the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is then independent of the location of the pixel - appropriate when the latent vector describes what the image should contain rather than where its contents are located. For instance, we could specify that a certain animal or object should appear, and it may do so in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
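A sketch of the location-dependent variant (hypothetical name; here $s = m(h)$ is assumed to already have the image's spatial size, and the unmasked $1\times1$ convolutions play the role of $V_{k,f}$ and $V_{k,g}$):<br />
<pre>
import torch
import torch.nn as nn

class SpatialConditionalGate(nn.Module):
    def __init__(self, p, s_channels):
        super().__init__()
        self.Vf = nn.Conv2d(s_channels, p, kernel_size=1)   # unmasked 1x1 convolutions over s
        self.Vg = nn.Conv2d(s_channels, p, kernel_size=1)

    def forward(self, f, g, s):
        """f, g: (batch, p, H, W) halves of the masked convolution; s: (batch, s_channels, H, W)."""
        return torch.tanh(f + self.Vf(s)) * torch.sigmoid(g + self.Vg(s))
</pre>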
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, they can also serve as the image decoder of an autoencoder. Introduced by Hinton et al. in [[#Reference|[3]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the resulting network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: much of the low-level pixel statistics is handled by the PixelCNN, so the encoder can omit low-level detail and focus on higher-level, abstract information.<br />
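Structurally, the change amounts to the following (a schematic sketch only; encoder and conditional_pixelcnn stand for any encoder network and any conditional PixelCNN decoder):<br />
<pre>
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Autoencoder whose decoder is a conditional PixelCNN rather than a deconvolutional stack."""
    def __init__(self, encoder, conditional_pixelcnn):
        super().__init__()
        self.encoder = encoder                 # maps image x -> latent h (e.g. 10 or 100 dimensions)
        self.decoder = conditional_pixelcnn    # models p(x | h) pixel by pixel

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(x, h)              # teacher-forced logits used for the NLL loss
</pre>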
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditional Gated PixelCNN on the CIFAR-10 dataset. The validation scores of the Gated PixelCNN, PixelCNN, and PixelRNN are compared, where a lower score means the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90), outperforming the PixelCNN, which obtains 3.14 (3.08), by 0.11 bits/dim. Although the quantitative improvement is modest, the visual quality of the samples produced by the Gated PixelCNN is much better than that of the PixelCNN. It is important to note that the Gated PixelCNN comes close to the performance of the PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment is performed on ImageNet for image sizes $32 \times 32$ and $64 \times 64$. In particular, for $32 \times 32$ images, the Gated PixelCNN obtains an NLL Test (Train) of 3.83 (3.77), which outperforms the PixelRNN's 3.86 (3.83); from this the authors observe that larger models perform better, but the simpler PixelCNN scales better. For $64 \times 64$ images, the Gated PixelCNN obtains 3.57 (3.48), again outperforming the PixelRNN's 3.63 (3.57). The authors mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, the Gated PixelCNN trains roughly twice as fast, taking 60 hours on 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2 of the original paper), each with 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs in TensorFlow using a total batch size of 128. Table 2 reports the performance of benchmark models on the ImageNet dataset in bits/dim (where lower is better), with the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on ImageNet dataset for an unconditioned Gated PixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluated the Gated PixelCNN by conditioning on the classes of the ImageNet images. Using a one-hot encoding $h_i$ for the $i^{th}$ class, the distribution becomes $p(x|h_i)$; the class label supplies roughly $\log_2(1000) \approx 10$ bits of extra information, which works out to about 0.003 bits/pixel for a $32 \times 32$ image. Although the log-likelihood did not improve significantly, the visual quality of the generated images is much better than that of the original PixelCNN. <br />
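One way to reproduce the quoted figure (assuming the extra information is spread over all $32 \times 32 \times 3$ colour-channel values; this accounting is our reading, not spelled out in the summary):<br />
<pre>
import math

extra_bits = math.log2(1000)      # information in a 1000-way class label, ~9.97 bits
dims = 32 * 32 * 3                # number of colour-channel values in a 32x32 RGB image
print(extra_bits / dims)          # ~0.0032 extra bits per dimension
</pre>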
<br />
Figure 11 shows samples from 8 different ImageNet classes, all generated by a single class-conditioned model. It is evident that the Gated PixelCNN can distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings of an animal or object class even though it sees only approximately 1000 images per class during training.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of a CNN trained on a large database of portraits automatically cropped from Flickr images using a face detector. This network was pre-trained with a triplet loss function, which ensures similar latent embeddings for a particular face across the entire dataset. <br />
<br />
In essence, the authors took (image=$x$, embedding=$h$) tuples from this supervised pre-trained network and trained the Conditional PixelCNN on the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model can compute its latent embedding $h=f(x)$ and generate new portraits of the same person. Figure 12 provides a pictorial example, where it is evident that the generative model can produce a variety of images, independent of pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is to the left, whereas the portraits to the right are generated from its high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. They start by training a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized using mean-squared error. It is important to note that both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 shows reconstructions from both models. It is evident that the latent embedding produced by the PixelCNN autoencoder is quite different from that of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people without directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from a PixelCNN used as the decoder of the autoencoder. Both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN matches, and in some cases outperforms, the PixelRNN. To deal with the "blind spots" in the receptive fields of the PixelCNN, the newly proposed Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors replace the ReLU activations with a gated tanh-and-sigmoid activation, because these multiplicative units help the model capture more complex interactions. The proposed network obtains performance similar to the PixelRNN on CIFAR-10 and is state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model can generate new images of the same person in different poses and lighting conditions given a single image. Finally, the authors also show that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is similar to that reported in the literature, the samples generated by the PixelCNN autoencoder have high visual quality and show natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Networks", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important mention is that the residual connections are only for horizontal stacks. On the other side skip connections allow as to incorporate features from all layers at the very end of out network. Most important stuff to mention here is that skip and residual connection use different weights after gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape [number of classes, number of filters], $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[3]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the Gated PixelCNN unconditioned model on the CIFAR-10 dataset is adopted. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed, wherein the lower score means that the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90) which outperforms the PixelCNN by 0.11 bits/dim, which obtains 3.14 (3.08). Although the performance is a bit better, visually the quality of the samples that were produced is much better for the Gated PixelCNN when compared to PixelCNN. It is important to note that the Gated PixelCNN came close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), and the corresponding training performance is in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 1: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the the second set of experiments, the authors evaluated the Gated PixelCNN model by conditioning the classes of the ImageNet images. Using the one-hot encoding $(h_i)$, for which the $i^th$ class the distribution becomes $p(x|h_i)$, the model receives roughly log(1000) $\approx$ 0.003 bits/pixel for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, visually the quality of the images were generated much better when compared to the original PixelCNN. <br />
<br />
Figure 11 shows some samples from 8 different classes of ImageNet images from a single class-conditioned model. It is evident that the Gated PixelCNN can better distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings from the animal and object class, when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of the CNN trained on a large database of portraits that were automatically cropped from Flickr images using face detector. This pre-trained network was trained using triplet loss function which ensured a similar the latent embeddings for particular face across the entire dataset. <br />
<br />
In essence, the authors took the latent vector from this supervised pre-trained network which now has the architecture (image=$x$, embedding=$h$) tuples and trained the<br />
Conditional PixelCNN with the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model now has the capability to compute the latent embeddings $h=f(x)$ such that the output will generate new portraits of the same person. Figure 12 provides a pictorial example of the aforementioned manipulated network where it is evident that the generative model can produce a variety of images, independent from pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: Input image is to the lest, whereas the portraits to the right are generated from high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiment, the authors venture the possibility to train the a Gated PixelCNN by adopting the Autoencoder architecture. The authors start by training a PixelCNN auto-encoder using $32 \times 32$ ImageNet patches and compared its results to a convolutional autoencoder, optimized using mean-square error. It is important to note that both the models use a 10 or 100 dimensional bottleneck. <br />
<br />
Figure 13 provides a reconstruction using both the models. It is evident that the latent embedding produced when using PixelCNN autoencoder is much different when compared to convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder is able to generate similar looking indoor scenes with people without directly trying to "reconstruct" the input, as done by the convolutional autoencoder.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, conditional samples from a PixelCNN as the deconvolution to the autoencoder. It is important to note that both these autoencoders were trained end-to-end with 10 and 100-dimensional bottleneck values.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN which is an improvement over the original PixelCNN. In addition to the Gated PixelCNN being more computationally efficient, it now has the ability to match, and in some cases, outperform PixelRNN. In order to deal with the "blind spots" in the receptive fields presented in the PixelCNN, the newly proposed Gated PixelCNN use two CNN stacks (horizontal and vertical filters) to deal with this problem. Moreover, the authors now use a custom-made tank and sigmoid function over the ReLU activation functions because these multiplicative units helps to model more complex interactions. The proposed network obtains a similar performance to PixelRNN on CIFAR-10, however, it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is also explored on natural images using three different settings. When using class-conditional generation, the network showed that a single model is able to generate diverse and realistic looking images corresponding to different classes. When looking at generating human portraits, the model does have the ability to generate new images from the same person in different poses and lightning conditions given a single image. Finally, the authors also showed that the PixelCNN can be used as image decoder in an autoencoder. Although the log-likelihood is quite similar when comparing it to literature, the samples generated from the PixelCNN autoencoder model does provide a high visual quality images showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30730STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:55:16Z<p>Asriram: </p>
<hr />
<div>=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
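<br />
The sequential sampling loop can be sketched as follows (illustrative only: <code>model</code> is a placeholder for a trained network returning the $N \times N \times 3 \times 256$ logits described above, and the tensor layout is our own choice):<br />
<br />
<pre>
import torch

def sample(model, N=32):
    # model(img) is assumed to return logits of shape (1, 256, 3, N, N) -- a hypothetical layout
    img = torch.zeros(1, 3, N, N)
    for i in range(N):                    # rows
        for j in range(N):                # columns
            for c in range(3):            # R, then G given R, then B given R and G
                logits = model(img)
                probs = torch.softmax(logits[0, :, c, i, j], dim=0)
                # sample one of the 256 intensities and feed it back into the image
                img[0, c, i, j] = torch.multinomial(probs, 1).item() / 255.0
    return img
</pre>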
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be combined at the very end of the network. Note also that the skip and residual connections use different weights after the gated block.<br />
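<br />
Putting the pieces together, a single layer might look like the following simplified sketch (strictly causal shifts, no R/G/B channel masking and no skip connections; all class and parameter names are our own illustration rather than the authors' implementation):<br />
<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    # One simplified Gated PixelCNN layer in the spirit of Figure 10:
    # vertical stack -> 1x1 -> horizontal stack, gated activations, residual on the horizontal stack.
    def __init__(self, p, n=5):
        super().__init__()
        self.k = n // 2 + 1
        self.v_conv = nn.Conv2d(p, 2 * p, (self.k, n))    # vertical stack, 2p maps (f and g halves)
        self.v_to_h = nn.Conv2d(2 * p, 2 * p, 1)          # 1x1 link from vertical to horizontal
        self.h_conv = nn.Conv2d(p, 2 * p, (1, self.k))    # horizontal stack, current row only
        self.h_res = nn.Conv2d(p, p, 1)                   # 1x1 before the residual connection

    @staticmethod
    def gate(t):
        f, g = t.chunk(2, dim=1)                          # split 2p maps into two groups of p
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v_in, h_in):
        H, W = v_in.size(2), v_in.size(3)
        # vertical stack: pad top (and the sides), crop bottom -> row i sees only rows above i
        v = self.v_conv(F.pad(v_in, (self.k - 1, self.k - 1, self.k, 0)))[:, :, :H, :]
        # horizontal stack: pad left, crop right -> column j sees only columns to the left of j
        h = self.h_conv(F.pad(h_in, (self.k, 0, 0, 0)))[:, :, :, :W]
        h = h + self.v_to_h(v)                            # vertical information flows into horizontal
        v_out = self.gate(v)
        h_out = self.h_res(self.gate(h)) + h_in           # residual connection (horizontal stack only)
        return v_out, h_out

x = torch.randn(1, 16, 8, 8)
v_out, h_out = GatedBlock(p=16, n=5)(x, x)
print(v_out.shape, h_out.shape)                           # both torch.Size([1, 16, 8, 8])
</pre>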
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, providing an image together with its associated class label in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that the network learns to incorporate this information; during inference, the class that the output image should belong to can then be specified. In principle any information can be passed through conditioning; here we start with class labels.<br />
<br />
For a conditional PixelCNN, the provided high-level image description is represented as a latent vector $h$, and the model learns the conditional distribution $p(x|h)$ - the probability that an image suits this description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
The conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the gated activation function then becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside both the tanh and the sigmoid: the $V$ matrices have shape $[\text{number of classes}, \text{number of filters}]$, $k$ is the layer index, and the class is passed as a one-hot vector $h$ during both training and inference.<br />
<br />
Note that if the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is therefore independent of the location of the pixel: the latent vector encodes "what the image should contain" rather than where its contents are located. For instance, a certain animal or object could be specified and still appear in different positions, poses and backgrounds.<br />
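<br />
A minimal sketch of this class-dependent bias (framework, class and parameter names are our own, not the paper's code); the $V^{T}h$ term is broadcast so that the same bias is added at every spatial location:<br />
<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    # Gated activation with a class-dependent bias: y = tanh(Wf*x + Vf^T h) * sigma(Wg*x + Vg^T h).
    # The V matrices have shape [number of classes, number of filters];
    # nn.Linear(bias=False) plays the role of the V^T h product.
    def __init__(self, num_classes, p):
        super().__init__()
        self.V_f = nn.Linear(num_classes, p, bias=False)
        self.V_g = nn.Linear(num_classes, p, bias=False)

    def forward(self, conv_f, conv_g, h):
        # conv_f, conv_g: (B, p, H, W) outputs of the W_f and W_g convolutions
        # h: (B, num_classes) one-hot vector; the bias is identical at every pixel location
        b_f = self.V_f(h)[:, :, None, None]
        b_g = self.V_g(h)[:, :, None, None]
        return torch.tanh(conv_f + b_f) * torch.sigmoid(conv_g + b_g)

gate = ConditionalGate(num_classes=1000, p=16)
h = torch.zeros(1, 1000); h[0, 7] = 1.0                   # one-hot label for class 7
y = gate(torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8), h)
print(y.shape)                                            # torch.Size([1, 16, 8, 8])
</pre>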
<br />
In addition, the authors developed a variant that makes the conditional distribution dependent on location (useful when the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location-dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
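<br />
The location-dependent variant can be sketched in the same spirit (the toy one-layer deconvolution standing in for $m()$, and all shapes and names, are our own assumptions):<br />
<br />
<pre>
import torch
import torch.nn as nn

num_classes, p, H, W = 1000, 16, 8, 8
m = nn.ConvTranspose2d(num_classes, 4, kernel_size=H)     # toy stand-in for the deconvolutional m()
V_f = nn.Conv2d(4, p, kernel_size=1)                      # unmasked 1x1 convolutions
V_g = nn.Conv2d(4, p, kernel_size=1)

h = torch.zeros(1, num_classes, 1, 1); h[0, 7] = 1.0      # one-hot latent, viewed as a 1x1 map
s = m(h)                                                  # spatial representation s = m(h), shape (1, 4, H, W)
conv_f, conv_g = torch.randn(1, p, H, W), torch.randn(1, p, H, W)
y = torch.tanh(conv_f + V_f(s)) * torch.sigmoid(conv_g + V_g(s))
print(y.shape)                                            # torch.Size([1, 16, 8, 8])
</pre>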
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images through the distribution $p(x|h)$, they can also serve as the image decoder in an autoencoder. Introduced by Hinton et al. in [[#Reference|[3]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image into a low-dimensional representation (i.e. the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN and the resulting network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: because much of the low-level pixel statistics is now handled by the PixelCNN decoder, the encoder can ignore those statistics and focus on higher-level, more abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditioned Gated PixelCNN model on the CIFAR-10 dataset. The validation scores of the Gated PixelCNN, PixelCNN, and PixelRNN are compared, where a lower score means the optimized model generalizes better. Using the negative log-likelihood (NLL) criterion, the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90), outperforming the PixelCNN, which obtains 3.14 (3.08), by 0.11 bits/dim. Although the numerical improvement is modest, the samples produced by the Gated PixelCNN are visually of much higher quality than those of the PixelCNN. It is also important to note that the Gated PixelCNN comes close to the performance of the PixelRNN, which achieves 3.00 (2.93). Table 1 reports the test performance of benchmark models on CIFAR-10 in bits/dim (lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on the CIFAR-10 dataset for an unconditioned Gated PixelCNN model.]]<br />
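<br />
For reference, the "bits/dim" numbers quoted above are the average negative log-likelihood per colour sub-pixel, converted from nats to bits. A small sketch of this convention (the NLL total used here is invented purely to land near the magnitude of the numbers in Table 1):<br />
<br />
<pre>
import math

def bits_per_dim(total_nll_nats, num_images, height, width, channels=3):
    # average negative log-likelihood per colour sub-pixel, converted from nats to bits
    dims = num_images * height * width * channels
    return total_nll_nats / dims / math.log(2)

# e.g. a hypothetical batch of 100 CIFAR-10 images with a summed NLL of 645,000 nats:
print(round(bits_per_dim(645_000, 100, 32, 32), 2))       # ~3.03, the order of Table 1's numbers
</pre>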
<br />
Another experiment is performed on the ImageNet data for image sizes $32 \times 32$ and $64 \times 64$. In particular, for $32 \times 32$ images, the Gated PixelCNN obtains an NLL Test (Train) of 3.83 (3.77), outperforming the PixelRNN, which achieves 3.86 (3.83); from this the authors observe that larger models do give better performance, but the simpler PixelCNN scales better. For $64 \times 64$ images, the Gated PixelCNN obtains 3.57 (3.48), again outperforming the PixelRNN, which achieves 3.63 (3.57). The authors mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, the Gated PixelCNN trains roughly twice as fast, taking 60 hours on 32 GPUs. The Gated PixelCNN used here has 20 layers, each with 384 hidden units and a filter size of $5 \times 5$. For training, a total of 200K synchronous updates were made over 32 GPUs in TensorFlow with a total batch size of 128. Table 2 reports the performance of benchmark models on the ImageNet dataset in bits/dim (lower is better), with training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on the ImageNet dataset for an unconditioned Gated PixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluated the Gated PixelCNN model conditioned on the class labels of the ImageNet images. Using a one-hot encoding $h_i$ for the $i^{th}$ class, the distribution becomes $p(x|h_i)$; conditioning supplies at most $\log_2(1000) \approx 10$ bits of extra information per image, which is only about 0.003 bits per sub-pixel of a $32 \times 32$ image. Although the log-likelihood did not improve significantly, the generated images are visually of much better quality than those of the original PixelCNN. <br />
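<br />
A quick check of that arithmetic:<br />
<br />
<pre>
import math

extra_bits = math.log2(1000)                  # information carried by one of 1000 class labels
print(round(extra_bits, 2))                   # 9.97 bits per image
print(round(extra_bits / (32 * 32 * 3), 4))   # 0.0032 bits per colour sub-pixel of a 32x32 image
</pre>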
<br />
Figure 11 shows samples from 8 different ImageNet classes, all generated by a single class-conditioned model. It is evident that the Gated PixelCNN can distinguish well between objects, animals and backgrounds. The authors note that the model generalizes and produces new renderings of animals and objects even though it sees only roughly 1000 training images per class.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of a CNN trained on a large database of portraits that were automatically cropped from Flickr images using a face detector. This pre-trained network was trained with a triplet loss, which ensures that images of the same face map to similar latent embeddings across the entire dataset. <br />
<br />
In essence, the authors took the latent vectors from this supervised pre-trained network, formed (image=$x$, embedding=$h$) tuples, and trained the Conditional PixelCNN on these embeddings to model the distribution $p(x|h)$. Hence, given a face that is not in the training set, the latent embedding $h=f(x)$ can be computed and used to generate new portraits of the same person. Figure 12 gives a pictorial example: the generative model produces a variety of images of the same person, across different poses and lighting conditions, from the latent embedding extracted by the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: The input image is on the left, and the portraits to the right are generated from its high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore training a Gated PixelCNN within an autoencoder architecture. They start by training a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized with mean-squared error. Both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 shows reconstructions from both models. It is evident that the latent embedding learned by the PixelCNN autoencoder captures quite different information from that of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder generates similar-looking indoor scenes with people, rather than directly trying to "reconstruct" the input as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, and conditional samples from a PixelCNN auto-encoder (the PixelCNN replaces the deconvolutional decoder). Both autoencoders were trained end-to-end with 10- and 100-dimensional bottlenecks.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN can now match, and in some cases outperform, the PixelRNN. To deal with the "blind spots" in the receptive field of the original PixelCNN, the Gated PixelCNN uses two CNN stacks (horizontal and vertical). Moreover, the authors replace the ReLU activations with a custom-made gated combination of tanh and sigmoid, since these multiplicative units help model more complex interactions. The proposed network obtains performance similar to the PixelRNN on CIFAR-10 and is state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model is able to generate new images of the same person in different poses and lighting conditions given a single source image. Finally, the authors show that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is similar to values reported in the literature, the samples generated by the PixelCNN autoencoder are of high visual quality and show natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# Geoffrey Hinton and Ruslan Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks", Science 2006</div>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:pixelauto.png&diff=30727File:pixelauto.png2017-11-18T19:53:29Z<p>Asriram: </p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:cond_portrait.png&diff=30726File:cond portrait.png2017-11-18T19:52:42Z<p>Asriram: </p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ucond_imagenet.png&diff=30725File:ucond imagenet.png2017-11-18T19:48:35Z<p>Asriram: </p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ucond_cifar.png&diff=30723File:ucond cifar.png2017-11-18T19:47:08Z<p>Asriram: </p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30722STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:44:00Z<p>Asriram: </p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important mention is that the residual connections are only for horizontal stacks. On the other side skip connections allow as to incorporate features from all layers at the very end of out network. Most important stuff to mention here is that skip and residual connection use different weights after gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape $[number of classes, number of filters]$, $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[3]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the Gated PixelCNN unconditioned model on the CIFAR-10 dataset is adopted. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed, wherein the lower score means that the optimized model generalizes better. Using the negative log-likelihood criterion (NLL), the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90) which outperforms the PixelCNN by 0.11 bits/dim, which obtains 3.14 (3.08). Although the performance is a bit better, visually the quality of the samples that were produced is much better for the Gated PixelCNN when compared to PixelCNN. It is important to note that the Gated PixelCNN came close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (where lower is better), and the corresponding training performance is in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on CIFAR-10 dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 1: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the the second set of experiments, the authors evaluated the Gated PixelCNN model by conditioning the classes of the ImageNet images. Using the one-hot encoding $(h_i)$, for which the $i^th$ class the distribution becomes $p(x|h_i)$, the model receives roughly log(1000) $\approx$ 0.003 bits/pixel for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, visually the quality of the images were generated much better when compared to the original PixelCNN. <br />
<br />
Figure 11 shows some samples from 8 different classes of ImageNet images from a single class-conditioned model. It is evident that the Gated PixelCNN can better distinguish between objects, animals and backgrounds. The authors observe that the model can generalize and generate new renderings from the animal and object class, when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of the CNN trained on a large database of portraits that were automatically cropped from Flickr images using face detector. This pre-trained network was trained using triplet loss function which ensured a similar the latent embeddings for particular face across the entire dataset. <br />
<br />
In essence, the authors took the latent vector from this supervised pre-trained network which now has the architecture (image=$x$, embedding=$h$) tuples and trained the<br />
Conditional PixelCNN with the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model now has the capability to compute the latent embeddings $h=f(x)$ such that the output will generate new portraits of the same person. Figure 12 provides a pictorial example of the aforementioned manipulated network where it is evident that the generative model can produce a variety of images, independent from pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: Input image is to the lest, whereas the portraits to the right are generated from high-level latent representation.]]<br />
<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiment, the authors venture the possibility to train the a Gated PixelCNN by adopting the Autoencoder architecture. The authors start by training a PixelCNN auto-encoder using $32 \times 32$ ImageNet patches and compared its results to a convolutional autoencoder, optimized using mean-square error. It is important to note that both the models use a 10 or 100 dimensional bottleneck. <br />
<br />
Figure 13 provides a reconstruction using both the models. It is evident that the latent embedding produced when using PixelCNN autoencoder is much different when compared to convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder is able to generate similar looking indoor scenes with people without directly trying to "reconstruct" the input, as done by the convolutional autoencoder.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, conditional samples from a PixelCNN as the deconvolution to the autoencoder. It is important to note that both these autoencoders were trained end-to-end with 10 and 100-dimensional bottleneck values.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN which is an improvement over the original PixelCNN. In addition to the Gated PixelCNN being more computationally efficient, it now has the ability to match, and in some cases, outperform PixelRNN. In order to deal with the "blind spots" in the receptive fields presented in the PixelCNN, the newly proposed Gated PixelCNN use two CNN stacks (horizontal and vertical filters) to deal with this problem. Moreover, the authors now use a custom-made tank and sigmoid function over the ReLU activation functions because these multiplicative units helps to model more complex interactions. The proposed network obtains a similar performance to PixelRNN on CIFAR-10, however, it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is also explored on natural images using three different settings. When using class-conditional generation, the network showed that a single model is able to generate diverse and realistic looking images corresponding to different classes. When looking at generating human portraits, the model does have the ability to generate new images from the same person in different poses and lightning conditions given a single image. Finally, the authors also showed that the PixelCNN can be used as image decoder in an autoencoder. Although the log-likelihood is quite similar when comparing it to literature, the samples generated from the PixelCNN autoencoder model does provide a high visual quality images showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30721STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:07:03Z<p>Asriram: /* Conditional PixelCNN */</p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important mention is that the residual connections are only for horizontal stacks. On the other side skip connections allow as to incorporate features from all layers at the very end of out network. Most important stuff to mention here is that skip and residual connection use different weights after gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape $[number of classes, number of filters]$, $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[3]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30720STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:06:33Z<p>Asriram: </p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. In that previous work, the authors observed that PixelRNN performed better than PixelCNN; however, PixelCNN was faster to compute because its training can be parallelized. In this work, Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional variant of the PixelRNN model based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images, with autoregressive connections that model images pixel by pixel by decomposing the joint image distribution into a product of conditionals. The Gated PixelCNN improves over the PixelCNN by removing the "blind spot" problem, and, to yield better performance, the authors replaced the ReLU units with gated sigmoid and tanh activation functions. The proposed Gated PixelCNN thus combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the quicker computation of the PixelCNN. Moreover, the authors also introduce a conditional variant (called Conditional PixelCNN) which can generate images conditioned on class labels, tags, or latent embeddings, yielding new image density models. These embeddings capture high-level information of an image and can be used to generate a large variety of images with similar features; for instance, by conditioning on an embedding of a portrait, the authors can generate different poses of the same person based on a single image, which also provides insight into the invariances captured by the embeddings. Finally, the authors also present a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder of a conventional autoencoder with a PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative method: given an image $x$ with $n^2$ pixels, we iterate over the pixels, employing feedback and the densities of all previously generated pixels to predict the "unknown" pixel $x_i$. To do this, the traditional PixelCNNs and PixelRNNs model the joint distribution $p(x)$ of a given image as a product of conditional distributions. Hence, the authors employ autoregressive models, which simply apply the chain rule to the joint distribution, as depicted in Equation 1: the very first pixel is unconditional, the second depends on the first, the third depends on the first and second, and so on. In other words, the image is modelled as a sequence of points where each pixel depends on the previous ones. Equation 1 gives the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the image $x$, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$-th pixel given the values of all previous pixels. It is important to note that the joint probability $p(x_1, ..., x_{n^2})$ follows from the chain rule - it is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution: the pixels are generated pixel by pixel for every row, and each new pixel depends on the pixel values above it and to its left. <br />
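<br />
As a toy illustration of this factorization (with made-up conditional probabilities for a tiny 2$\times$2 image, purely an assumption for exposition), the joint probability is just the product of the per-pixel conditionals, and training maximizes its logarithm:<br />
<br />
<pre>
import numpy as np

# hypothetical values of p(x_1), p(x_2|x_1), p(x_3|x_1,x_2), p(x_4|x_1,...,x_3) for one 2x2 image
conditionals = [0.9, 0.6, 0.7, 0.8]

p_x = np.prod(conditionals)               # joint probability p(x) = 0.3024
log_p_x = np.sum(np.log(conditionals))    # log-likelihood the model maximizes, about -1.196
print(p_x, log_p_x)
</pre>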
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use a softmax layer at the end of the PixelCNN to predict the pixel intensity value, i.e. a distribution over the 256 possible intensities from 0 to 255, from which the value is chosen. Figure 2 illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN maps a neighborhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; every conditional distribution is modelled by a convolutional neural network. For instance, suppose we have a (zero-padded) $5\times5$ image whose pixels we label with letters, and a $3\times3$ filter that slides over the image, multiplying the elements and summing them to produce a single response. We cannot use this filter as-is, because pixel $a$ should not know the pixel intensities of $b, f, g$ (future pixel values). To counter this issue, the authors place a mask on top of the filter that keeps only prior pixels and zeroes out future pixels to exclude them from the calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
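<br />
The blind spot can also be checked numerically. Below is a small PyTorch sketch (not from the paper; the image size, number of layers, and the all-ones masked kernels are assumptions) that stacks a few masked convolutions - the first excluding the centre pixel, the later ones including it, often called masks 'A' and 'B' - and inspects which input pixels can influence a given output pixel. The zero entries above and to the right of that pixel form the blind spot of Figure 6.<br />
<br />
<pre>
import torch
import torch.nn.functional as F

def causal_mask(k, include_centre):
    """k x k mask keeping rows above the centre and pixels to the left in the centre row."""
    m = torch.zeros(k, k)
    m[: k // 2, :] = 1.0
    m[k // 2, : k // 2] = 1.0
    if include_centre:
        m[k // 2, k // 2] = 1.0
    return m

x = torch.zeros(1, 1, 9, 9, requires_grad=True)
w_a = causal_mask(3, include_centre=False).view(1, 1, 3, 3)   # first layer: centre excluded
w_b = causal_mask(3, include_centre=True).view(1, 1, 3, 3)    # later layers: centre included

h = F.conv2d(x, w_a, padding=1)
for _ in range(2):                                            # two more masked layers
    h = F.conv2d(h, w_b, padding=1)

h[0, 0, 6, 6].backward()                  # which input pixels influence output pixel (6, 6)?
receptive_field = (x.grad[0, 0] != 0).int()
print(receptive_field)                    # entries above-right of (6, 6), e.g. (5, 8), stay 0: the blind spot
</pre>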
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a $3\times3$ filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents in that region are ignored. In order to address the blind spot, the authors use two filters (a horizontal and a vertical stack) in conjunction to capture the whole receptive field, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer. Hence, every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple in Figure 7), the convolution operation conditions only on the current row, so it has access to the pixels to the left. In essence, we take a $1 \times (n//2+1)$ convolution with a shift (pad and crop) rather than a $1\times n$ masked convolution. For a $3\times3$ receptive field, this means we convolve the row with a kernel of width 2 (instead of 3), and the output is padded and cropped so that the image shape stays the same. Hence, the row is convolved with a kernel of width 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last output pixel (just before the ‘Crop here’ line) does not hold information from the last input sample (the dashed line).<br />
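<br />
A minimal PyTorch sketch of this pad-and-crop trick for the horizontal stack (kernel width 2 for a $3\times3$ receptive field); the tensor shapes and function names are assumptions made for illustration. Keeping the extra one-pixel shift, as below, reproduces the behaviour in Figure 8 where the output never sees the current pixel; dropping it (pad by one column, no crop) would let deeper layers also see the current pixel.<br />
<br />
<pre>
import torch
import torch.nn.functional as F

def horizontal_stack_conv(x, w):
    """x: (B, C, H, W) feature map; w: (C_out, C, 1, 2) row kernel of width 2.
    Pad two columns on the left, convolve, then crop the last column so that
    output pixel (i, j) only sees input pixels strictly to its left."""
    x = F.pad(x, (2, 0, 0, 0))             # (left, right, top, bottom) padding
    y = F.conv2d(x, w)                     # width grows to W + 1
    return y[:, :, :, :-1]                 # crop back to W

x = torch.arange(25.0).view(1, 1, 5, 5)
w = torch.ones(1, 1, 1, 2)
y = horizontal_stack_conv(x, w)            # shape (1, 1, 5, 5); column 0 is all zeros
</pre>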
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all the pixels above the current one. It uses a kernel of size $(n//2 + 1) \times n$, with the input image padded with additional rows at the top and bottom. Thereafter, we perform the convolution and crop the result to force the predicted pixel to depend on the upper pixels only (and to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only pixels from the rows above, no masking is needed, as the target pixel is never touched. Each pixel computed by the vertical stack thus aggregates information from the pixels above and passes it on to the horizontal stack, which eliminates the "blind spot" problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is first padded at the top (left panel) with kernel-height rows of zeros, the convolution is then performed, and the output is cropped so that the rows are shifted down by one with respect to the input image. As a result, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
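<br />
A matching PyTorch sketch of the vertical-stack shift the figure describes (pad rows of zeros on top, convolve, then crop at the bottom); the shapes and names are again illustrative assumptions.<br />
<br />
<pre>
import torch
import torch.nn.functional as F

def vertical_stack_conv(x, w):
    """x: (B, C, H, W); w: (C_out, C, 2, 3) kernel of height n//2 + 1 = 2 for a 3x3 receptive field.
    Pad 2 rows on top and 1 column on each side, convolve, then crop the extra bottom row so that
    output row i depends only on input rows strictly above i."""
    x = F.pad(x, (1, 1, 2, 0))             # (left, right, top, bottom) padding
    y = F.conv2d(x, w)
    return y[:, :, :-1, :]                 # shift rows down by one relative to the input

x = torch.arange(25.0).view(1, 1, 5, 5)
w = torch.ones(1, 1, 2, 3)
y = vertical_stack_conv(x, w)              # (1, 1, 5, 5); row 0 is all zeros, row 1 sees only input row 0
</pre>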
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN at generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighbourhood of previous pixels, whereas the PixelCNN's view of that neighbourhood grows only with the size and depth of its convolution layers. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it model more complex interactions. To bring these benefits of the PixelRNN into the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following gated activation function, shown in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
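<br />
As a minimal sketch of Equation 2 (shapes assumed, and using the convention, discussed with Figure 10 below, that a single masked convolution produces $2p$ feature maps which are then split into the filter and gate halves):<br />
<br />
<pre>
import torch

def gated_activation(features):
    """features: (B, 2p, H, W) output of the masked convolution;
    the first p maps play the role of W_{k,f}*x and the last p maps of W_{k,g}*x."""
    f, g = features.chunk(2, dim=1)
    return torch.tanh(f) * torch.sigmoid(g)

y = gated_activation(torch.randn(1, 16, 5, 5))   # shape (1, 8, 5, 5)
</pre>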
<br />
Figure 10 provides a pictorial illustration of a single layer of the Gated PixelCNN architecture, wherein the vertical stack feeds into the horizontal stack through a $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the two stacks are kept separate in one direction: the vertical stack must not access any information the horizontal stack has, otherwise it would have access to pixels it should not see; however, the vertical stack can feed the horizontal stack, since the horizontal stack predicts a pixel that comes after those covered by the vertical stack. In the figure, the (masked) convolution operations are shown in green, and element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single masked convolution, shown in blue, to increase parallelization; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ convolutions followed by a shift in pixels (padding and cropping) to recover the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
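<br />
The sequential sampling procedure can be sketched as follows; `predict_distribution` is a stand-in for the trained network's forward pass (here it just returns random softmax outputs), so everything about it is an assumption for illustration.<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)

def predict_distribution(image, i, j):
    """Stand-in for the PixelCNN: returns a 256-way distribution for pixel (i, j).
    A real model would compute this from `image`, using only pixels above and to the left of (i, j)."""
    logits = rng.normal(size=256)
    e = np.exp(logits - logits.max())
    return e / e.sum()

H = W = 8
image = np.zeros((H, W), dtype=np.int64)
for i in range(H):
    for j in range(W):
        p = predict_distribution(image, i, j)     # softmax over the 256 intensities
        image[i, j] = rng.choice(256, p=p)        # sample, then feed back in for the next pixel
</pre>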
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow us to incorporate features from all layers at the very end of the network. The key detail here is that the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some high-level information - for instance, providing an image to the network together with its associated class label in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that it learns to incorporate that information. During inference, we can then specify which class the generated image should belong to. Any kind of information can be passed in through conditioning; we start with class labels.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$; the purpose of the latent vector is to model the conditional distribution $p(x|h)$, i.e. the probability of an image given that it suits this description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, the conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the gated activation after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside both the tanh and sigmoid functions; each matrix $V$ has shape (number of classes, number of filters), $k$ is the layer number, and the classes are passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that when the latent vector $h$ is a one-hot encoding of the class label, this term is equivalent to adding a class-dependent bias at every layer. This means the conditioning is independent of the location of the pixel - appropriate when the latent vector describes “what the image should contain” rather than where the contents appear in the image. For instance, we could specify that a certain animal or object should appear, and it may do so in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on location (useful when the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which has the same spatial dimensions as the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location-dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this idea to the image decoders used in autoencoders. Introduced by Hinton et al. in [[#Reference|[3]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (i.e. the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the re-architected network is then trained on a data set. The authors observe that the encoder extracts better representations of the input data: much of the low-level pixel statistics is now handled by the PixelCNN, so the encoder can omit low-level pixel statistics and focus on higher-level, more abstract information.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:residual.png&diff=30719File:residual.png2017-11-18T19:04:36Z<p>Asriram: </p>
<hr />
<div></div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30718STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:04:18Z<p>Asriram: </p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow us to incorporate features from all layers at the very end of the network. The key detail here is that the skip and residual connections use different weights after the gated block.<br />
<br />
[[File:residual.png|500px|center|thumb|Figure 11: Residual connection.]]<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape $[number of classes, number of filters]$, $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[3]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30717STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:03:04Z<p>Asriram: /* Horizontal Stack */</p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel is a simple generative method wherein given an image of dimension of dimension $x_{n^2}$, we iterate, employ feedback and capture pixel densities from every pixel to predict our "unknown" pixel density $x_i$. To do this, the traditional PixelCNNs and PixelRNNs adopted the joint distribution p(x), wherein the pixels of a given image is the product of the conditional distributions. Hence, the authors employ autoregressive models which means they just use plain chain rule for joint distribution, depicted in Equation 1. So the very first pixel is independent, second depend on first, third depends on first and second and so on. Basically you just model your image as sequence of points where each pixel depends linearly on previous ones. Equation 1 depicts the joint distribution where x_i is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel which depends on the values of all previous pixels. It is important to note that $p(x_0, x_1, ..., x_{n^2})$ is the joint probability based on the chain rule - which is a product of all conditional distributions $p(x_0) \times p(x_1|x_0) \times p(x_2|x_1, x_0)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution which displays that the pixels are computed pixel-by-pixel for every row, and the forthcoming pixel depends on the pixels values above and to the left of the pixel in concern. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
So, the PixelCNN is supposedly to maps a neighborhood of pixels to prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1 , ..., x_{i−1}$; so every conditional distribution is modelled by a convolutional neural network. For instance, given a $5\times5$ image (let's represent each pixel as an alphabet and zero-padded), and we have a filter of dimension $3\times3$ that slides over the image which multiplies each element and sums them together to produce a single response. However, we cannot use this filter because pixel $a$ should not know the pixel intensities for $b,f,g$ (future pixel values). To counter this issue, the authors use a mask on top of the filter to only choose prior pixels and zeroing the future pixels to negate them from calculation - depicted in Figure 3. Hence, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked - that means the model cannot read pixels below (or strictly to the right) of the current pixel to make its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly quarter of the receptive field is covered by the "blind spot", meaning that the pixel contents are ignored in that region. In order to address the blind spot, the authors use two filters (horizontal and vertical stacks) in conjunction to allow for capturing the whole receptive field, depicted in Figure{vh_stack}. In particular, the horizontal stack conditions the current row, and the vertical stack conditions all the rows above the current pixel. It is observed that the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. Thereafter, the outputs of both the stacks, per-layer, is combined to form the output. Hence, every layer in the horizontal stack takes an input which is the output of the previous layer as well as that of the vertical stack. By spliting the convolution into two different operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (in purple for Figure 7), the convolution operation conditions only on the current row, so it has access to left pixels. In essence, we take a $1 \times n//2+1$ convolution with shift (pad and crop) rather than $1\times n$ masked convolution. So, we perform convolution on the row with a kernel of width 2 pixels (instead of 3) from which the output is padded and cropped such that the image shape stays the same. Hence, the image convolves with kernel width of 2 and without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
Vertical stack (blue) has access to all top pixels. The vertical stack is of kernel size $n//2 + 1 \times n$ with the input image being padded with another row in the top and bottom. Thereafter, we perform the convolution operation, and crop the image to force the predicted pixel to be dependent on the upper pixels only (i.e. to preserve the spatial dimensions). Since the vertical filter does not contain any "future" pixel values, only upper pixel values, no masking is incorporated as no target pixel is touched. However, the computed pixel from the vertical stack yields information from top pixels and sends that info to horizontal stack (which supposedly eliminates the "blindspot problem").<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
<br />
=== Gated block ===<br />
The PixelRNNs are observed to perform better than the traditional PixelCNN for generating new images. This is because the spatial LSTM layers in the PixelRNN allows for every layer in the network to access the entire neighbourhood of previous pixels. The PixelCNN, however, only takes into consideration the neighborhood region and the depth of the convolution layers to make its predictions. Another advantage for the PixelRNN is that this network contains multiplicative units (in the form of the LSTM gates), which may help it to model more complex interactions. To address the benefits of PixelRNN and append it onto the newly proposed Gated PixelCNN, the authors replaced the rectified linear units between the masked convolutions with the following custom-made gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture; wherein the vertical stack contributes to the horizontal stack with the $1\times1$ convolution - going the other way would break the conditional distribution. In other words, the horizontal and vertical stacks are sort of independent, wherein vertical stack should not access any information horizontal stack has - otherwise it will have access to pixels it shouldn’t see. However, vertical stack can be connected to vertical as it predicts pixel following those in vertical stack. In particular, the convolution operations are shown in green (which are masked), element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are not combined into a single operation (which is essentially the masked convolution) to increase parallelization shown in blue. The parallelization now splits the $2p$ features maps into two groups of $p$. Finally, the authors also use the residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ are the masked convolutions which can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ which are convolutions followed by a shift in pixels by padding and cropping to get the original dimension of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow us to incorporate features from all layers at the very end of the network. The key detail here is that the skip and residual connections use different weights after the gated block.<br />
<br />
[[File:residual.png|500px|center|thumb|Figure 11: Residual connection.]]<br />
<br />
=Conditional PixelCNN=<br />
Conditioning is a smart word for saying that we’re feeding the network some high-level information - for instance, providing an image to the network with the associated classes in MNIST/CIFAR datasets. During training you feed image as well as class to your network to make sure network would learn to incorporate that information as well. During inference you can specify what class your output image should belong to. You can pass any information you want with conditioning, we’ll start with just classes.<br />
<br />
For a conditional PixelCNN, we represent a provided high-level image description as a latent vector $h$, wherein the purpose of the latent vector is to model the conditional distribution $p(x|h)$ such that we get a probability as to if the images suites this description. The conditional PixelCNN models based on the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, now the conditional distribution is dependent on the latent vector h, which is now appended onto the activations prior to the non-linearities; hence the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note $h$ multiplied by matrix inside tanh and sigmoid functions, $V$ matrix has the shape $[number of classes, number of filters]$, $k$ is the layer number, and the classes were passed as a one-hot vector $h$ during training and inference.<br />
<br />
Note that if the latent vector h is a one-hot encoding vector that provides the class labels, which is equivalent to the adding a class dependent bias at every layer. So, this means that the conditioning is independent from the location of the pixel - this is only if the latent vector holds information about “what should the image contain” rather than the location of contents in the image. For instance, we could specify that a certain animal or object should appear in different positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this analogy into image decoders used in Autoencoders. Introduced by Hinton et. al in [[#Reference|[3]]], autoencoder is a dimensionality reduction neural network which is composed of two parts: an encoder which maps the input image into low-dimensional representation (i.e. the latent vector $h$) , and a decoder that decompresses the latent vector to reconstruct the original image. <br />
<br />
In order to apply the conditional PixelCNN onto the autoencoder, the deconvolutional decoders are replaced with the conditional PixelCNN - the re-architectured network of which is used for training a data set. The authors observe that the encoder can better extract representations of the provided input data - this is because much of the low-level pixel statistics is now handled by the PixelCNN; hence, the encoder omits low-level pixel statistics and focuses on more high-level abstract information.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30716STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T19:02:14Z<p>Asriram: /* Gated block */</p>
<hr />
<div>=NOT DONE YET!=<br />
=Introduction=<br />
This works is based of the widely used PixelCNN and PixelRNN, introduced by Oord et al. in [[#Reference|[1]]]. From the previous work, the authors observed that PixelRNN performed better than PixelCNN, however, PixelCNN was faster to compute as you can parallize the training process. In this work, Oord et al. [[#Reference|[2]]] introduced a Gated PixelCNN, which is a convolutional variant of the PixelRNN model, based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images using autoregressive connections to model images through pixel-by-pixel computation by decomposing the joint image distribution as a product of conditionals. The Gated PixelCNN is an improvement over the PixelCNN by removing the "blindspot" problem, and to yield a better performance, the authors replaced the ReLU units with sigmoid and tanh activation function. The proposed Gated PixelCNN combines the strength of both PixelRNN and PixelCNN - that is by matching the log-likelihood of PixelRNN on both CIFAR and ImageNet along with the quicker computational time presented by the PixelCNN. Moreover, the authors also introduced a conditional Gated PixelCNN variant (called Conditional PixelCNN) which has the ability to generate images based on class labels, tags, as well as latent embeddings to create new image density models. These embeddings capture high level information of an image to generate a large variety of images with similar features; for instance, the authors can generate different poses of a person based on a single image by conditioning on a one-hot encoding of the class. This approach provided insight into the invariances of the embeddings which enabled the authors to generate different poses of the same person based on a single image. Finally, the authors also presented a PixelCNN Auto-encoder variant which essentially replaces the deconvolutional decoder with the PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative approach: given an image $\mathbf{x}$ with $n^2$ pixels, the model sweeps over the image and predicts each "unknown" pixel $x_i$ from the pixels generated so far, feeding every prediction back in as context for the next one. To do this, PixelCNNs and PixelRNNs model the joint distribution $p(x)$ of an image as a product of conditional distributions. In other words, they are autoregressive models that apply the plain chain rule to the joint distribution, as shown in Equation 1: the first pixel is modelled unconditionally, the second is conditioned on the first, the third on the first two, and so on. The image is thus treated as a sequence of pixels in which each pixel depends on all previous ones. Equation 1 gives the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel given the values of all previous pixels. Note that the joint probability $p(x_1, ..., x_{n^2})$ follows from the chain rule: it is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 gives a pictorial view of this factorization: the pixels are generated row by row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, a softmax layer at the end of the PixelCNN outputs a distribution over the 256 possible intensity values (0 to 255), from which the pixel value is taken. Figure 2 illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
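<br />
The following is a minimal sketch of this pixel-by-pixel sampling loop in PyTorch. It assumes a single-channel image and a hypothetical <code>model</code> that returns a tensor of logits of shape (batch, 256, H, W); it is meant only to illustrate the autoregressive loop and softmax sampling, not the authors' implementation.<br />
<pre>
import torch

@torch.no_grad()
def sample_image(model, H=28, W=28, batch=1):
    # start from an all-zero canvas and fill it in one pixel at a time
    x = torch.zeros(batch, 1, H, W)
    for i in range(H):
        for j in range(W):
            logits = model(x)                               # assumed shape (batch, 256, H, W)
            probs = torch.softmax(logits[:, :, i, j], -1)   # p(x_ij | previously generated pixels)
            pixel = torch.multinomial(probs, 1).float() / 255.0
            x[:, 0, i, j] = pixel.squeeze(-1)               # feed the sampled value back in
    return x
</pre>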
<br />
The PixelCNN maps a neighbourhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model may only condition on the previously generated pixels $x_1, ..., x_{i-1}$, and every conditional distribution is modelled by the same convolutional neural network. For instance, consider a zero-padded $5\times5$ image (with each pixel labelled by a letter) and a $3\times3$ filter that slides over the image, multiplying each element and summing the results into a single response. This filter cannot be used directly, because pixel $a$ should not have access to the intensities of $b,f,g$ (future pixel values). To prevent this, the authors apply a mask on top of the filter that keeps only the prior pixels and zeroes out the future pixels, removing them from the computation, as depicted in Figure 3. Masking the convolution filters ensures that the CNN only uses information about pixels above and to the left of the current pixel - the model cannot read pixels below, or strictly to the right of, the current pixel when making its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
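<br />
As an illustration, a masked convolution of this kind can be sketched in PyTorch as below. The sketch applies only the spatial mask from Figure 3 and omits the R/G/B channel grouping described above; mask type 'A' (used in the first layer, where the centre pixel itself must be hidden) differs from type 'B' (used in deeper layers) only in whether the centre weight is kept. Class and variable names are illustrative.<br />
<pre>
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed below the centre row and to the right of
    the centre column, so each output position only sees 'past' pixels."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.weight.shape[-2:]
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2 + 1:, :] = 0                         # zero all rows below the centre
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # centre row: zero from the centre ('A') or just after it ('B')
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask   # re-apply the mask before every convolution
        return super().forward(x)

# usage: the first layer uses mask 'A', subsequent layers use mask 'B'
first = MaskedConv2d('A', 1, 64, kernel_size=3, padding=1)
later = MaskedConv2d('B', 64, 64, kernel_size=3, padding=1)
</pre>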
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 5, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
As the receptive field of the masked kernel grows over the image, a significant portion of the image is disregarded. For instance, with a $3\times3$ filter roughly a quarter of the receptive field falls inside the "blind spot", so the pixel contents in that region are ignored. To address this, the authors use two convolution stacks (a horizontal stack and a vertical stack) in conjunction, which together cover the whole receptive field, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, while the vertical stack conditions on all the rows above the current pixel. The vertical stack, which requires no masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer, so each layer of the horizontal stack takes as input both the output of the previous layer and the output of the vertical stack. Splitting the convolution into these two operations lets the model access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (purple in Figure 7), the convolution conditions only on the current row, so it has access to the pixels on the left. In essence, a $1 \times (\lfloor n/2 \rfloor + 1)$ convolution with a shift (pad and crop) is used instead of a $1 \times n$ masked convolution. Concretely, for $n = 3$ the row is convolved with a kernel of width 2 (instead of 3), and the input is padded and the output cropped so that the image shape stays the same. The row convolution therefore uses a width-2 kernel and needs no mask.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last output pixel (just before the ‘Crop here’ line) holds no information from the last input sample (the dashed line).<br />
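<br />
A rough PyTorch sketch of this pad-and-crop trick for $n = 3$ (a width-2 kernel, left padding, and a crop of the last column so the current pixel is excluded) might look like the following; the class name and layout are illustrative, not the authors' code.<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class HorizontalConv(nn.Module):
    """1 x (n//2 + 1) convolution over the current row, shifted so that the
    output at column j only sees input columns strictly to the left of j."""
    def __init__(self, channels, n=3):
        super().__init__()
        self.k = n // 2 + 1                                    # kernel width 2 when n = 3
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, self.k))

    def forward(self, x):
        x = F.pad(x, (self.k, 0, 0, 0))   # pad k zeros on the left of each row
        x = self.conv(x)                  # width grows by one
        return x[:, :, :, :-1]            # crop the last column: the image shape is preserved
</pre>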
<br />
=== Vertical Stack ===<br />
The vertical stack (blue) has access to all pixels above the current one. It uses a kernel of size $(\lfloor n/2 \rfloor + 1) \times n$, with extra rows of zeros padded onto the top of the input image. After the convolution, the output is cropped so that the spatial dimensions are preserved and the predicted pixel depends on the upper pixels only. Because the vertical filter never touches the target pixel or any "future" pixel values, only the rows above, no masking is needed. The vertical stack thus gathers information from the pixels above and passes it on to the horizontal stack, which is what eliminates the blind spot.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
Figure 9 shows that the image is first padded on top with as many rows of zeros as the kernel height (left of the figure), the convolution is applied, and the output is cropped so that the rows are shifted down by one with respect to the input image. As a result, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
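<br />
In the same spirit, a sketch of the vertical-stack convolution for $n = 3$ (a $2 \times 3$ kernel, top padding, and a crop of the last row so that each output row only sees the rows above it) could look like this; again, the names are illustrative.<br />
<pre>
import torch.nn as nn
import torch.nn.functional as F

class VerticalConv(nn.Module):
    """(n//2 + 1) x n convolution padded on top and cropped so that output
    row i only depends on input rows strictly above i."""
    def __init__(self, channels, n=3):
        super().__init__()
        kh, kw = n // 2 + 1, n                  # 2 x 3 kernel for n = 3
        self.kh = kh
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kh, kw),
                              padding=(0, kw // 2))

    def forward(self, x):
        x = F.pad(x, (0, 0, self.kh, 0))   # pad kh rows of zeros on top
        x = self.conv(x)                   # height grows by one
        return x[:, :, :-1, :]             # crop the last row: rows shift down by one
</pre>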
<br />
=== Gated block ===<br />
PixelRNNs are observed to outperform the traditional PixelCNN at generating new images. One reason is that the spatial LSTM layers of the PixelRNN allow every layer in the network to access the entire neighbourhood of previous pixels, whereas the region a PixelCNN can see is limited by its kernel size and the depth of its convolutional layers. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of LSTM gates), which may help it model more complex interactions. To bring these benefits into the proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, shown in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the layer index, $f$ and $g$ index the filter and gate feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
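<br />
As a sketch, the gated activation can be written directly from Equation 2, assuming <code>conv_f</code> and <code>conv_g</code> are two (masked) convolution modules with matching output shapes:<br />
<pre>
import torch

def gated_activation(x, conv_f, conv_g):
    # y = tanh(W_f * x) * sigmoid(W_g * x), applied element-wise
    return torch.tanh(conv_f(x)) * torch.sigmoid(conv_g(x))
</pre>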
<br />
Figure 10 illustrates a single layer of the Gated PixelCNN architecture. The vertical stack contributes to the horizontal stack through a $1\times1$ convolution; a connection in the other direction would break the conditional distribution. In other words, the two stacks are kept separate in one direction: the vertical stack must not access any information held by the horizontal stack, otherwise it would see pixels it should not. Feeding the vertical stack into the horizontal stack is allowed, because the horizontal stack predicts the pixel that follows the rows covered by the vertical stack. In the figure, the masked convolution operations are shown in green, and element-wise multiplications and additions are shown in red. To increase parallelization, the convolutions with $W_f$ and $W_g$ are computed as a single (masked) convolution producing $2p$ feature maps, which are then split into two groups of $p$ (shown in blue). The authors also use a residual connection in the horizontal stack. Finally, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ convolutions followed by a shift of the pixels (padding and cropping) that restores the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
It is also worth noting that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from every layer to be aggregated at the very end of the network. Importantly, the skip and residual connections use different weights after the gated block.<br />
<br />
[[File:residual.png|500px|center|thumb|Figure 11: Residual connection.]]<br />
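<br />
Putting the pieces together, one layer of the gated architecture might be sketched as below. The causal convolutions are stubbed in with ordinary <code>nn.Conv2d</code> layers for brevity (in practice they are the shifted/masked convolutions described above), so this is an illustrative layout under those assumptions rather than the authors' reference implementation.<br />
<pre>
import torch
import torch.nn as nn

def gate(h):
    a, b = h.chunk(2, dim=1)                 # split 2p feature maps into two groups of p
    return torch.tanh(a) * torch.sigmoid(b)

class GatedLayer(nn.Module):
    """Vertical stack feeds the horizontal stack through a 1x1 convolution;
    only the horizontal stack carries a residual connection."""
    def __init__(self, p, n=3):
        super().__init__()
        self.vert_conv = nn.Conv2d(p, 2 * p, kernel_size=n, padding=n // 2)              # stand-in for the (n//2+1) x n causal conv
        self.horiz_conv = nn.Conv2d(p, 2 * p, kernel_size=(1, n), padding=(0, n // 2))   # stand-in for the 1 x (n//2+1) causal conv
        self.vert_to_horiz = nn.Conv2d(2 * p, 2 * p, kernel_size=1)                      # vertical -> horizontal link
        self.horiz_out = nn.Conv2d(p, p, kernel_size=1)                                  # 1x1 before the residual add

    def forward(self, v_in, h_in):
        v_pre = self.vert_conv(v_in)
        v_out = gate(v_pre)
        h_pre = self.horiz_conv(h_in) + self.vert_to_horiz(v_pre)
        h_out = h_in + self.horiz_out(gate(h_pre))   # residual connection on the horizontal stack only
        return v_out, h_out
</pre>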
<br />
=Conditional PixelCNN=<br />
Conditioning simply means feeding the network some additional high-level information - for instance, providing an image together with its class label in the MNIST or CIFAR datasets. During training, both the image and its class are fed to the network so that it learns to incorporate that information; at inference time, the class of the generated image can then be specified. In principle any kind of information can be passed through conditioning; we start with class labels.<br />
<br />
For a conditional PixelCNN, a given high-level image description is represented as a latent vector $h$, and the model learns the conditional distribution $p(x|h)$, i.e. the probability of an image given that description. The conditional PixelCNN therefore models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
The conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the gated activation then becomes:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Here $h$ is multiplied by a matrix inside both the tanh and sigmoid terms; each $V$ matrix has shape (number of classes) $\times$ (number of filters), $k$ is the layer index, and the class is passed as a one-hot vector $h$ during both training and inference.<br />
<br />
If the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is then independent of the pixel location: the latent vector describes "what" the image should contain rather than "where" the contents should appear. For instance, a specified animal or object may appear in different positions, poses and backgrounds.<br />
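<br />
A minimal sketch of this class-conditional bias is shown below. It assumes a float one-hot class vector <code>h_onehot</code> and uses stand-in (unmasked) convolutions; the linear layers play the role of $V_{k,f}$ and $V_{k,g}$, and all names are illustrative.<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    """y = tanh(Wf*x + Vf h) * sigmoid(Wg*x + Vg h) with a location-independent,
    class-dependent bias added to every feature map."""
    def __init__(self, channels, num_classes, n=3):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, n, padding=n // 2)   # stand-in for the masked conv
        self.conv_g = nn.Conv2d(channels, channels, n, padding=n // 2)
        self.embed_f = nn.Linear(num_classes, channels, bias=False)      # V_{k,f}
        self.embed_g = nn.Linear(num_classes, channels, bias=False)      # V_{k,g}

    def forward(self, x, h_onehot):
        bf = self.embed_f(h_onehot)[:, :, None, None]   # broadcast the bias over spatial positions
        bg = self.embed_g(h_onehot)[:, :, None, None]
        return torch.tanh(self.conv_f(x) + bf) * torch.sigmoid(self.conv_g(x) + bg)
</pre>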
<br />
In addition, the authors develop a variant in which the conditional distribution also depends on location (useful when the position of an object matters). This is achieved by mapping the latent vector $h$ to a spatial representation $s = m(h)$, which has the same spatial dimensions as the image but may have an arbitrary number of feature maps, using a deconvolutional neural network $m(\cdot)$; this yields a location-dependent bias as follows:<br />
<br />
$$y = \tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g} \ast s$ is an unmasked $1\times1$ convolution.<br />
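<br />
This location-dependent variant can be sketched by replacing the class-embedding biases above with unmasked $1\times1$ convolutions applied to a spatial map $s = m(h)$ of the same height and width as the image; here <code>s_channels</code> is the (arbitrary) number of feature maps of $s$, and the layout is again only illustrative.<br />
<pre>
import torch
import torch.nn as nn

class SpatialConditionalGate(nn.Module):
    """y = tanh(Wf*x + Vf*s) * sigmoid(Wg*x + Vg*s), with 1x1 convolutions of
    the spatial map s providing a location-dependent bias."""
    def __init__(self, channels, s_channels, n=3):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, n, padding=n // 2)   # stand-in for the masked conv
        self.conv_g = nn.Conv2d(channels, channels, n, padding=n // 2)
        self.proj_f = nn.Conv2d(s_channels, channels, kernel_size=1)     # V_{k,f} * s
        self.proj_g = nn.Conv2d(s_channels, channels, kernel_size=1)     # V_{k,g} * s

    def forward(self, x, s):
        return torch.tanh(self.conv_f(x) + self.proj_f(s)) * \
               torch.sigmoid(self.conv_g(x) + self.proj_g(s))
</pre>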
<br />
=PixelCNN Auto-Encoders=<br />
Since conditional PixelCNNs can model images through the distribution $p(x|h)$, the same idea can be applied to the image decoders used in autoencoders. Introduced by Hinton et al. in [[#Reference|[3]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the re-architected network is then trained on a data set. The authors observe that the encoder extracts better representations of the input data: because much of the low-level pixel statistics is now handled by the PixelCNN decoder, the encoder can ignore those statistics and focus on higher-level, more abstract information.<br />
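<br />
A minimal sketch of such a PixelCNN auto-encoder is given below, assuming a single-channel input and a separately defined conditional PixelCNN module <code>pixelcnn_decoder</code> that models $p(x|h)$ and is called with the image (for teacher forcing) and the latent vector; the encoder layout and all names are illustrative.<br />
<pre>
import torch
import torch.nn as nn

class PixelCNNAutoencoder(nn.Module):
    """Convolutional encoder producing a latent vector h, with a conditional
    PixelCNN in place of a deterministic deconvolutional decoder."""
    def __init__(self, latent_dim, pixelcnn_decoder):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = pixelcnn_decoder            # conditional PixelCNN modelling p(x | h)

    def forward(self, x):
        h = self.encoder(x)                        # low-dimensional representation of the image
        logits = self.decoder(x, h)                # per-pixel intensity logits for the reconstruction
        return logits, h
</pre>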
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance to PixelRNN, while remaining quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixes the "blind spot" problem by introducing two stacks (horizontal and vertical)<br />
# Gated activation units that use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Networks", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=30710STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-18T18:27:10Z<p>Asriram: </p>
<hr />
<div>
=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN models introduced by Oord et al. in [[#Reference|[1]]]. In that earlier work, the authors observed that PixelRNN achieved better likelihoods than PixelCNN, but PixelCNN was faster to train because its computations parallelize more easily. In this work, Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional model that keeps the PixelCNN architecture while closing the performance gap with PixelRNN. The Gated PixelCNN is an explicit density model: it generates images pixel by pixel through autoregressive connections, decomposing the joint image distribution into a product of conditionals. It improves on the original PixelCNN by removing the "blind spot" in the receptive field and by replacing the ReLU units with a gated activation built from sigmoid and tanh. The result combines the strengths of both predecessors, matching the log-likelihood of PixelRNN on CIFAR-10 and ImageNet while retaining the faster computation of PixelCNN. Moreover, the authors introduce a conditional variant (called Conditional PixelCNN) that can generate images conditioned on class labels, tags, or latent embeddings, yielding new image density models. These embeddings capture high-level information about an image and can be used to generate a large variety of images with similar features; for instance, conditioning on the embedding of a single portrait allows the model to generate different poses of the same person, which gives insight into the invariances encoded by the embeddings. Finally, the authors also present a PixelCNN auto-encoder variant, in which the deconvolutional decoder is replaced with a conditional PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative approach: given an image $\mathbf{x}$ with $n^2$ pixels, the model sweeps over the image and predicts each "unknown" pixel $x_i$ from the pixels generated so far, feeding every prediction back in as context for the next one. To do this, PixelCNNs and PixelRNNs model the joint distribution $p(x)$ of an image as a product of conditional distributions. In other words, they are autoregressive models that apply the plain chain rule to the joint distribution, as shown in Equation 1: the first pixel is modelled unconditionally, the second is conditioned on the first, the third on the first two, and so on. The image is thus treated as a sequence of pixels in which each pixel depends on all previous ones. Equation 1 gives the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$th pixel given the values of all previous pixels. Note that the joint probability $p(x_1, ..., x_{n^2})$ follows from the chain rule: it is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 gives a pictorial view of this factorization: the pixels are generated row by row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, we use the softmax layer towards the end of the PixelCNN to predict the pixel intensity value (i.e. the highest probable index from 0 to 255). Figure 2 illustrates how to predict (generate) a single pixel value.<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
<br />
The PixelCNN maps a neighbourhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model may only condition on the previously generated pixels $x_1, ..., x_{i-1}$, and every conditional distribution is modelled by the same convolutional neural network. For instance, consider a zero-padded $5\times5$ image (with each pixel labelled by a letter) and a $3\times3$ filter that slides over the image, multiplying each element and summing the results into a single response. This filter cannot be used directly, because pixel $a$ should not have access to the intensities of $b,f,g$ (future pixel values). To prevent this, the authors apply a mask on top of the filter that keeps only the prior pixels and zeroes out the future pixels, removing them from the computation, as depicted in Figure 3. Masking the convolution filters ensures that the CNN only uses information about pixels above and to the left of the current pixel - the model cannot read pixels below, or strictly to the right of, the current pixel when making its predictions. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modelled successively, with B conditioned on (R, G), and G conditioned on R. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 4. The 256 possible values for each colour channel are then modelled using a softmax.<br />
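The masking itself can be implemented by zeroing part of the convolution kernel before every forward pass. The sketch below is an illustrative PyTorch implementation (the class name and interface are assumptions, and the per-channel R/G/B splitting is omitted for brevity); it shows the usual distinction between a mask of type 'A' (first layer, where the centre pixel itself must be hidden) and type 'B' (later layers, where the centre may be used):<br />
<pre>
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at and to the right/below the centre.

    mask_type 'A' also zeroes the centre (used for the first layer so a pixel
    cannot see itself); 'B' keeps the centre (used for subsequent layers).
    Channel-wise R/G/B masking is omitted here for brevity.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # right of centre
        mask[:, :, kh // 2 + 1:, :] = 0                         # rows below centre
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask     # keep only connections to "past" pixels
        return super().forward(x)

# Example: the first layer masks the centre, later layers do not.
layer_a = MaskedConv2d('A', in_channels=1, out_channels=64, kernel_size=7, padding=3)
layer_b = MaskedConv2d('B', in_channels=64, out_channels=64, kernel_size=3, padding=1)
</pre>
In the full model the mask is further subdivided per colour channel, so that, for instance, the filters predicting B may also see the R and G values of the current pixel, as described above.<br />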
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, as illustrated in Figure 6, notice that as the masked filter slides across the image, the prediction for pixel $f$ never takes pixels $c, d, e$ into consideration, even though they precede $f$ in the generation order - this is the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that as the receptive field of the masked kernel grows over the image, a significant portion of the allowed context is disregarded. For instance, with a 3x3 filter, roughly a quarter of the receptive field falls in the "blind spot", meaning that the pixel contents in that region are ignored. To address the blind spot, the authors use two filters (a horizontal stack and a vertical stack) in conjunction so that together they capture the whole receptive field, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not require any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at every layer, so each layer of the horizontal stack takes as input both the output of the previous horizontal layer and the output of the vertical stack. Splitting the convolution into these two operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (shown in purple in Figure 7), the convolution conditions only on the current row, so it has access to the pixels to the left of the current one. In essence, a $1\times(n//2+1)$ convolution with a shift (implemented by padding and cropping) is used instead of a masked $1\times n$ convolution. Concretely, for $n=3$ the row is convolved with a kernel of width 2 (instead of 3), and the output is padded and cropped so that the image shape stays the same. Hence, the row is convolved with a width-2 kernel and no mask is needed; a sketch of this pad-and-crop trick follows below.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 shows that the last pixel of the output (just before the ‘Crop here’ line) does not contain information from the last input sample (indicated by the dashed line).<br />
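One way to realize this causal shift is to pad the row on the left and use an unpadded convolution, so that no explicit crop is needed and the output width matches the input. The following is an illustrative PyTorch sketch (the module name and exact padding scheme are assumptions, not the authors' implementation):<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalConv(nn.Module):
    """1 x (n//2 + 1) convolution that only sees pixels to the left.

    Padding each row on the left by n//2 shifts the receptive field so the
    output at column j depends only on columns <= j (an extra one-pixel shift
    would be needed in the first layer to exclude column j itself).
    """
    def __init__(self, channels, n=3):
        super().__init__()
        self.pad = n // 2
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, n // 2 + 1))

    def forward(self, x):
        x = F.pad(x, (self.pad, 0, 0, 0))   # pad the left of each row only
        return self.conv(x)                  # output width matches input width
</pre>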
<br />
<br />
=== Vertical Stack ===<br />
The vertical stack (shown in blue) has access to all the pixels above the current row. Its kernel is of size $(n//2 + 1) \times n$, and the input image is padded with extra rows of zeros. The convolution is then performed, and the output is cropped so that each predicted pixel depends only on the rows above it (while the spatial dimensions are preserved). Since the vertical filter only covers pixel values in the rows above - never a "future" pixel or the target pixel itself - no masking is needed. The features computed by the vertical stack thus carry information from the pixels above, and passing this information to the horizontal stack is what eliminates the blind spot. A pad-and-crop sketch follows Figure 9.<br />
<br />
[[File:v_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is first padded with kernel-height rows of zeros (left panel), then the convolution is performed, and finally the output is cropped so that the rows are shifted by one with respect to the input image. As a result, the first row of the output does not depend on the first (real, non-padded) input row, and the second row of the output depends only on the first input row - which is the desired behaviour.<br />
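Analogously, the vertical stack can be written as a convolution preceded by top padding and followed by a crop. The sketch below is again illustrative PyTorch code under the same assumptions (module name and padding scheme chosen for clarity, not taken from the paper):<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalConv(nn.Module):
    """(n//2 + 1) x n convolution that only sees rows above the current pixel.

    Padding the top with kernel-height rows of zeros and cropping the output
    shifts the rows by one, so output row i depends only on input rows < i,
    removing the need for an explicit mask.
    """
    def __init__(self, channels, n=3):
        super().__init__()
        self.kh = n // 2 + 1
        self.conv = nn.Conv2d(channels, channels, kernel_size=(self.kh, n),
                              padding=(0, n // 2))

    def forward(self, x):
        h = x.size(2)
        x = F.pad(x, (0, 0, self.kh, 0))     # pad `kh` zero rows on top
        out = self.conv(x)
        return out[:, :, :h, :]              # crop so row i sees only rows < i
</pre>
With this scheme the first output row depends only on the zero padding and the second output row depends only on the first real input row, matching the behaviour described above.<br />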
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN at generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer of the network to access the entire neighbourhood of previously generated pixels, whereas the neighbourhood available to the PixelCNN grows only with the size of the convolution kernels and the depth of the network. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of LSTM gates), which may help it model more complex interactions. To bring these benefits of the PixelRNN into the newly proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, shown in Equation 2:<br />
<br />
$$y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the layer index, $f$ and $g$ denote the two sets of feature maps, $\odot$ is the element-wise product, and $\ast$ is the convolution with the input. This gated activation is the key ingredient of the Gated PixelCNN model. <br />
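As a small sketch, the gated activation can be applied to the output of a single convolution that produces $2p$ feature maps, with the first half playing the role of $W_{k,f} \ast x$ and the second half the role of $W_{k,g} \ast x$ (illustrative PyTorch code):<br />
<pre>
import torch

def gated_activation(features):
    """Gated activation y = tanh(f) * sigmoid(g).

    `features` is assumed to be the output of one convolution with 2p channels;
    the first p channels act as the tanh branch and the last p as the gate.
    """
    f, g = features.chunk(2, dim=1)          # split 2p feature maps into two halves
    return torch.tanh(f) * torch.sigmoid(g)  # element-wise gating
</pre>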
<br />
Figure 10 illustrates a single layer of the Gated PixelCNN architecture. The vertical stack feeds into the horizontal stack through a $1\times1$ convolution - connecting them the other way around would break the conditional distribution. In other words, the two stacks are kept largely independent: the vertical stack must not access any information held by the horizontal stack, otherwise it would see pixels it should not see, whereas the vertical stack may feed the horizontal stack, since the horizontal stack predicts the pixel that follows those covered by the vertical stack. In the figure, the (masked) convolution operations are shown in green, and the element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked) convolution, shown in blue, to increase parallelization; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. The $(n\times1)$ and $(n\times n)$ masked convolutions can also be implemented as $(\lceil n/2 \rceil\times1)$ and $(\lceil n/2 \rceil\times n)$ convolutions followed by a shift in pixels (padding and cropping) that restores the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, a PixelCNN typically consists of a stack of masked convolutional layers that takes an $N\times N\times3$ image as input and produces $N\times N\times3\times256$ predictions (a probability for each pixel intensity) as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Another important point is that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be incorporated at the very end of the network. Note that the skip and residual connections use different weights after the gated block; a simplified sketch of one gated layer is given after Figure 11.<br />
<br />
[[File:residual.png|500px|center|thumb|Figure 11: Residual connection.]]<br />
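Combining the vertical stack, the horizontal stack, the gated activation and the residual connection, one simplified layer of the Gated PixelCNN might be sketched as follows (illustrative only: conditioning, R/G/B channel masking and the skip-connection bookkeeping are omitted, and the module layout is an assumption rather than the authors' implementation):<br />
<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLayer(nn.Module):
    """One simplified gated block: the vertical stack feeds the horizontal stack,
    gated activations are applied to both, and the horizontal stack gets a
    residual connection."""
    def __init__(self, channels, n=3):
        super().__init__()
        self.kh = n // 2 + 1
        # vertical stack: (n//2+1) x n convolution, 2p output channels for the gate
        self.v_conv = nn.Conv2d(channels, 2 * channels, (self.kh, n), padding=(0, n // 2))
        # horizontal stack: 1 x (n//2+1) convolution, 2p output channels
        self.h_conv = nn.Conv2d(channels, 2 * channels, (1, self.kh))
        # 1x1 convolutions: vertical-to-horizontal link and output projection
        self.v_to_h = nn.Conv2d(2 * channels, 2 * channels, 1)
        self.h_out = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def _gate(t):
        f, g = t.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v, h):
        height = v.size(2)
        # vertical stack: pad top, convolve, crop so each row sees only rows above
        v_feat = self.v_conv(F.pad(v, (0, 0, self.kh, 0)))[:, :, :height, :]
        # horizontal stack: pad left, convolve, so each pixel sees only pixels to its left
        h_feat = self.h_conv(F.pad(h, (self.kh - 1, 0, 0, 0)))
        h_feat = h_feat + self.v_to_h(v_feat)        # inject vertical information
        v_out = self._gate(v_feat)
        h_out = h + self.h_out(self._gate(h_feat))   # residual connection (horizontal only)
        return v_out, h_out
</pre>
In the full architecture several such layers would be stacked, with both stacks initialized from the image by an initial type-'A' masked convolution so that a pixel never sees itself.<br />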
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation (a conditioning sketch follows this list)<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# Pixel AutoEncoders<br />
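For the conditional variant, the conditioning vector $h$ (for example a one-hot class label or a portrait embedding) enters through the gates: it is projected and added as a per-channel bias to both the tanh and the sigmoid parts of the gated activation. A minimal sketch (illustrative PyTorch code, with made-up module names) is given below:<br />
<pre>
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    """Gated activation with a conditioning vector h (e.g. a one-hot class label)
    added as a per-channel bias to both halves before the non-linearities."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.f_proj = nn.Linear(cond_dim, channels, bias=False)
        self.g_proj = nn.Linear(cond_dim, channels, bias=False)

    def forward(self, features, h):
        # features: (B, 2*channels, H, W); h: (B, cond_dim)
        f, g = features.chunk(2, dim=1)
        f = f + self.f_proj(h)[:, :, None, None]   # broadcast over spatial dimensions
        g = g + self.g_proj(h)[:, :, None, None]
        return torch.tanh(f) * torch.sigmoid(g)
</pre>
Because the conditioning term does not depend on pixel position, it biases what the model generates (for example, which class) without affecting where in the image information may flow.<br />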
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Network", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016</div>Asriramhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:gated_block.png&diff=30709File:gated block.png2017-11-18T18:17:20Z<p>Asriram: </p>